
List:       e1000-devel
Subject:    Re: [E1000-devel] Network stalls with e1000 driver and 82541
From:       Lars Ehrhardt <le-e1000 () ml ! h42 ! net>
Date:       2010-04-16 18:03:45
Message-ID: 4BC8A681.7030009 () ml ! h42 ! net

Hi,

Brandeburg, Jesse wrote:

> Hm, that's very interesting.  Depending on your kernel version they
> made the NETDEV_WATCHDOG message a WARN_ONCE only, but the driver's 
> tx_timeout counter will still increment if the OS resets us due to
> not completing transmits.

Yeah, true... the value in tx_timeout is incrementing on the interface:

NIC statistics:
     rx_packets: 4388583768
     tx_packets: 144829687
     rx_bytes: 11352706471
     tx_bytes: 182892477547
     rx_broadcast: 410657
     tx_broadcast: 24509
     rx_multicast: 842065
     tx_multicast: 370
     rx_errors: 4294967294
     tx_errors: 0
     tx_dropped: 0
     multicast: 842065
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 8589934590
     rx_frame_errors: 0
     rx_no_buffer_count: 16341
     rx_missed_errors: 28387
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 47
     tx_restart_queue: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 1
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 11352706471
     rx_csum_offload_good: 80766344
     rx_csum_offload_errors: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0
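
To catch the next stall as it happens, I'll keep an eye on that counter
with a small script along these lines (just a rough sketch; it assumes
the interface is eth0 and that ethtool is in the PATH, adjust as needed):

#!/usr/bin/env python
# Sketch: poll "ethtool -S <iface>" and report whenever tx_timeout_count
# changes.  The interface name below is an assumption.
import re
import subprocess
import time

IFACE = "eth0"  # adjust to the 82541 interface under test

def read_tx_timeouts(iface):
    out = subprocess.check_output(["ethtool", "-S", iface]).decode()
    m = re.search(r"tx_timeout_count:\s*(\d+)", out)
    return int(m.group(1)) if m else 0

last = read_tx_timeouts(IFACE)
while True:
    time.sleep(10)
    now = read_tx_timeouts(IFACE)
    if now != last:
        print("%s tx_timeout_count %d -> %d" %
              (time.strftime("%Y-%m-%d %H:%M:%S"), last, now))
        last = now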

>>> Can you make absolutely sure that ethtool -K ethX tso off is done
>>> on each 82541 interface?
>>> 
>>> The other thing that might be relevant is if you have >= 4GB ram.
>>> 
>> Nope, the machine has 2 GB RAM. Could it be that the problem is
>> related to hyperthreading? What I find odd is that the problem
>> occurs on 2 machines, while it does not occur on 3 other machines.
>> I cannot find a difference in the system settings though.
>> Unfortunately there are stickers on the network chips, so I can't say
>> if there are different revisions of the 82541 chips in those
>> machines.
> 
> It shouldn't be related to hyperthreading, at least I've never heard
> of such.  You can compare the dmidecode output of the machines, and
> maybe compare their /dev/nvram output too.  lspci -vvv should show
> enough information to see whether they are different parts (they will
> have different revisions).

The md5sum of /dev/nvram differs between the two machines. Is there a
tool that translates the binary output of /dev/nvram into something
human-readable?
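
In the meantime I can at least diff the two dumps byte by byte; a quick
sketch (the file names are only placeholders for dumps copied off each
machine, e.g. with "dd if=/dev/nvram of=nvram.good" as root):

#!/usr/bin/env python
# Sketch: byte-by-byte comparison of two /dev/nvram dumps.
# File names are placeholders for dumps taken on each machine.
good = bytearray(open("nvram.good", "rb").read())
bad = bytearray(open("nvram.bad", "rb").read())

for off in range(max(len(good), len(bad))):
    a = good[off] if off < len(good) else None
    b = bad[off] if off < len(bad) else None
    if a != b:
        print("offset 0x%02x: %s vs %s" % (
            off,
            "--" if a is None else "0x%02x" % a,
            "--" if b is None else "0x%02x" % b))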

Comparing the output of lspci -vvv mainly shows differences in memory
regions and IRQ assignments, but this could be due to an additional
PCI card that is installed in the failing machine. I can rerun the
test without the card, though.

Not sure if this is important, but PERR is + on the failing machine:

        Subsystem: Intel Corporation PRO/1000 MT Network Connection
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
-       Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR+ INTx-
+       Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 252 (63750ns min), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at fd8c0000 (32-bit, non-prefetchable) [size=128K]
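
As far as I know, the <PERR flag in that Status line corresponds to the
Detected Parity Error bit (bit 15) of the PCI status register at config
offset 0x06, so it can also be read straight out of sysfs. Rough sketch;
the device address below is only an example, look it up with lspci -D
first:

#!/usr/bin/env python
# Sketch: read the PCI status register (config offset 0x06) of the NIC
# via sysfs and report the Detected Parity Error bit that lspci shows
# as "<PERR".  The device address is an example, adjust for your system.
import struct

DEVICE = "0000:02:01.0"  # example address for the 82541 port

with open("/sys/bus/pci/devices/%s/config" % DEVICE, "rb") as f:
    f.seek(0x06)
    (status,) = struct.unpack("<H", f.read(2))

print("status register: 0x%04x" % status)
print("Detected Parity Error (<PERR): %s" % ("+" if status & 0x8000 else "-"))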

The only difference I could spot with dmidecode is in the CPU voltage.
But this might just be a rounding difference?

        Manufacturer: Intel
        ID: C2 06 01 00 FF FB E9 BF
        Version: Intel(R) Atom(TM)
-       Voltage: 0.9 V
+       Voltage: 1.0 V
        External Clock: 133 MHz
        Max Speed: 1600 MHz
        Current Speed: 1600 MHz

> Can you also compare the ethtool -e outputs on the machines?  What
> about the slot they are plugged into?  Could the two machines with
> issues have heat problems for some reason (different case maybe?)

The network chips are on-board. I don't think the problem is heat
related: the machines with problems are in a cooler environment than
the machines without problems. The temperature inside the failing
machine is around 35 degrees, and the fans work normally.

Lars

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired