
List:       gentoo-hardened
Subject:    Re: [gentoo-hardened] tg3 driver - transmit timed out, resetting
From:       David Sommerseth <gentoo.list () topphemmelig ! net>
Date:       2009-02-25 14:02:38
Message-ID: 49A54F7E.4090603 () topphemmelig ! net

atoth@atoth.sote.hu wrote:
> On Fri, December 12, 2008 19:09, David Sommerseth wrote:
>>
>> David Sommerseth wrote:
>>> atoth@atoth.sote.hu wrote:
>>>> PCI-X dual port Broadcom NetXtreme BCM5704 Gigabit Ethernet (rev 03)
>>>> adapter is working fine here driven by tg3, 2.6.27-hardened-r1. The
>>>> driver
>>>> doesn't seem to be borked with my card.
>>>>
>>>> Did you check out the "error" field of ifconfig's output for the
>>>> interface
>>>> of your card?
>>>>
>>>> Regards,
>>>> Dw.
>>> Hmmm ... No, I have not had that opportunity.  The server is located
>>> 2000km away from me, and I
>>> usually call a guy (who is not a technician)to go in and press
>>> CTRL-ALT-DEL on a keyboard.  That is
>>> the short-time "fix".  But I'm going to have a look physically on the
>>> server in a couple of weeks,
>>> so if I get positive feedbacks from others as well regarding 2.6.27
>>> kernel, I'm willing to try that
>>> upgrade.
>>>
>>> This interface is an on-board interface in an IBM eServer.  The first
>>> time it happened, it was no
>>> problems for about 28 days.  Now it was 13 days.  So I expect it to
>>> happen again, soon enough.
>>>
>>> I'll try to hack the shutdown scripts to dump the ifconfig info
>>> somewhere somehow.
>> Then it happened again ... and I have ifconfig stats for the interface:
>>
>> eth0      Link encap:Ethernet  HWaddr 00:14:5e:5d:3c:d0
>>            inet6 addr: fe80::214:5eff:fe5d:3cd0/64 Scope:Link
>>            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>            RX packets:10551633 errors:4294967239 dropped:767 overruns:0 frame:170
>>            TX packets:9371606 errors:4294967239 dropped:0 overruns:0 carrier:0
>>            collisions:4294967239 txqueuelen:1000
>>            RX bytes:28237000 (26.9 MiB)  TX bytes:163377979 (155.8 MiB)
>>            Interrupt:16
>>
>>  From the kernel log I see this:
>>
>> Dec 12 12:19:21 fw [74355.059369] tg3: tg3_abort_hw timed out for world, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
>> Dec 12 12:19:24 fw [74357.842979] tg3: world: No firmware running.
>> Dec 12 12:19:41 fw [74374.992867] tg3: world: Link is down.
>>
>> I'm surprised by the errors and collision numbers here, as I checked it
>> the
>> other day, and all of them was 0.  I also know that the TX and RX values
>> was above 3-4GB, but don't remember which was what.
>>
>> Could this be an overflow bug of some kind?
>>
>> I have also found out that IBM have released an updated firmware to this
>> network device, so I'll try to upgrade it during Christmas when I'm close
>> to the box again.  In the mean time I have a little ping-script, which
>> restarts network (incl. reloading of the tg3 module) when the network
>> dies.
>>   This restart gives me minimal downtime.
>>
>> But I do not understand why this box was so rock solid until I upgraded
>> from 2.6.22-hardened-r8 to 2.6.25-hardened-r8.  The new kernel driver
>> obviously does something it didn't do before.  Unfortunately I can't find
>> anything particular in the kernel git logs for the tg3.[ch] files which
>> could pin-point anything particular.
>>
>>
>> Does anyone have any experiences regarding firmware upgrades on these
>> cards?  The instructions seems pretty much forward, but if you know about
>> anything, whatever, I would appreciate that.
>>
>>
>> kind regards,
>>
>> David Sommerseth
>>
> 
> Rather strange. The collisions and the errors counter shows the same...
> It was a long time ago, when I last saw collisions.
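
A note on the identical counter values: 4294967239 is just 57 short of 2^32, so it looks like a small negative number stored in an unsigned 32-bit counter rather than a real count of errors or collisions. The snippet below is only a sketch of that arithmetic; the helper name as_signed32 is made up for illustration, and it says nothing certain about where in the driver or hardware the value comes from.

```python
def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit counter value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


# Value shown for RX errors, TX errors and collisions in the ifconfig output.
reported = 4294967239

print(as_signed32(reported))   # the same bits read as a signed number
print((1 << 32) - reported)    # distance from the 32-bit wrap-around point
```

Read as signed, the value is -57. That would fit a counter that ended up below zero, or statistics derived from bogus register reads (the log shows MAC_TX_MODE=ffffffff at the same time), but that is a guess and not something I have verified.
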
> 
> There are several possibilities regarding this symptom. It would be
> important to know if the card is connected to a hub, or a switch(ing-hub)?
> 1.) There can be a defective device on the subnet, which is connected to
> it from time-to-time, or it is present all the time, but doesn't hog the
> line constantly

I'm pretty confident this is not the case, as this interface is connected
straight to the ISP's router.

> 2.) The switch/hub can have a problem - try reconnecting the card to
> another port

Pretty confident this is also not the case.

> 3.) The network card can have a problem, which can be software related and
> might be solved by a firmware upgrade (unfortunately the card itself
> cannot be replaced being an on-board NIC)

Firmware updated now.  I found a firmware update for the Broadcom
interface in my IBM xSeries server and applied it.  I also upgraded
the kernel from 2.6.25-hardened-r8 to 2.6.25-hardened-r11.  Since then, the
server has survived 55 days without any issues, which is the longest since
I upgraded from 2.6.22-hardened-r8.  I strongly believe that it was the
firmware update which helped.

> 4.) It can even be caused by a driver bug - which we know is all the way
> possible since the e1000 issue

Yeah, and this part scares me more ...

> I hope it'll turn out soon. I would think about a hardware issue, but it's
> a disturbing fact, that these symptoms appeared after a kernel upgrade.

Exactly!


So my thesis is that between linux-2.6.22-hardened-r8 and
2.6.25-hardened-r8 the tg3 driver must have been changed somehow, so that
it now depends on some firmware features which obviously did not work
properly.  And if the tg3 driver did not change, I've simply been way too
lucky not to experience this for over 13 months with the 2.6.22 kernel.

The firmware I upgraded to can be found here:
http://www-947.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=MIGR-5070004&brandind=5000008

This update upgraded the network card firmware "bootcode" from 3.61 to 3.65 
and the "IPMI" from 6.20 to 6.25.
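
As a stop-gap until a firmware update is possible, the ping-script workaround mentioned in the quoted mail can be sketched roughly as follows. This is not the original script. The gateway address, the init script path and the exact recovery commands are assumptions for illustration and need adjusting to the actual setup. It also assumes iputils ping, which exits with status 0 when at least one reply arrives.

```python
import subprocess

GATEWAY = "192.0.2.1"   # hypothetical address of the ISP router
RESTART_CMDS = [        # hypothetical recovery steps; adjust to the init system
    ["/etc/init.d/net.eth0", "stop"],
    ["modprobe", "-r", "tg3"],
    ["modprobe", "tg3"],
    ["/etc/init.d/net.eth0", "start"],
]


def link_alive(host=GATEWAY, count=3, runner=subprocess.run):
    """Return True if ping reports at least one answered echo request."""
    result = runner(["ping", "-c", str(count), "-W", "2", host],
                    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def recover(runner=subprocess.run, cmds=RESTART_CMDS):
    """Run the recovery commands in order, ignoring individual failures."""
    for cmd in cmds:
        runner(cmd, check=False)
    return cmds


def watchdog_once(alive=link_alive, fix=recover):
    """One check: restart networking only when the link looks dead."""
    if alive():
        return False
    fix()
    return True
```

Calling watchdog_once() from cron every minute or so (as root) keeps the downtime short, at the price of hiding the underlying problem, so the restarts are worth logging.
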


kind regards,

David Sommerseth

