[prev in list] [next in list] [prev in thread] [next in thread] 

List:       linux-netdev
Subject:    Re: [PATCH v3] net/tg3: fix race condition in tg3_reset_task()
From:       Thinh Tran <thinhtr () linux ! vnet ! ibm ! com>
Date:       2023-11-30 22:29:34
Message-ID: edafdffb-c1d4-4461-abb6-595624dd7710 () linux ! vnet ! ibm ! com
[Download RAW message or body]


On 11/17/2023 12:31 PM, Michael Chan wrote:
> On Fri, Nov 17, 2023 at 8:19 AM Thinh Tran <thinhtr@linux.vnet.ibm.com> wrote:
>>
>>
>> On 11/16/2023 3:34 PM, Michael Chan wrote:
>>>
>>> I think it will be better to add these 2 checks in tg3_reset_task().
>>> We are already doing a similar check there for tp->pcierr_recovery so
>>> it is better to centralize all the related checks in the same place.
>>> I don't think tg3_dump_state() below will cause any problems.  We'll
>>> probably read 0xffffffff for all registers and it will actually
>>> confirm that the registers are not readable.
>>>
>>
>> I'm concerned that race conditions could still occur during the handling
>> of Partitionable Endpoint (PE) reset by the EEH driver. The issue lies
>> in the dependency on the lower-level FW reset procedure. When the driver
>> executes tg3_dump_state(), and then follows it with tg3_reset_task(),
>> the EEH driver possibility changes in the PCI devices' state, but the
>> MMIO and DMA reset procedures may not have completed yet. Leading to a
>> crash in tg3_reset_task().  This patch tries to prevent that scenario.
> 
> It seems fragile if you are relying on such timing.  TG3_TX_TIMEOUT is
> 5 seconds but the actual time tg3_tx_timeout() is called varies
> depending on when the TX queue is stopped.  So tg3_tx_timeout() will
> be called 5 seconds or more after EEH if there are uncompleted TX
> packets but we don't know precisely when.
> 
>>
>> While tg3_dump_state() is helpful, it also poses issues as it takes a
>> considerable amount of time, approximately 1 or 2 seconds per device.
>> Given our 4-port adapter, this could extend to more than 10 seconds to
>> write to the dmesg buffer. With the default size, the dmesg buffer may
>> be over-written and not retain all useful information.
>>
> 
> If tg3_dump_state() is not useful and fills the dmesg log with useless
> data, we can add the same check in tg3_dump_state() and skip it.
> tg3_dump_state() is also called by tg3_process_error() so we can avoid
> dumping useless data if we hit tg3_process_error() during EEH or AER.
> 
> Thanks.

I implemented the fix as you suggested and passed the tests.
I will send the next version of patch shortly.

Thanks.

[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic