List: e1000-devel
Subject: Re: [E1000-devel] Network stalls with e1000 driver and 82541
From: Lars Ehrhardt <le-e1000 () ml ! h42 ! net>
Date: 2010-04-16 18:03:45
Message-ID: 4BC8A681.7030009 () ml ! h42 ! net
Hi,
Brandeburg, Jesse wrote:
> Hm, that's very interesting. Depending on your kernel version they
> made the NETDEV_WATCHDOG message a WARN_ONCE only, but the driver's
> tx_timeout counter will still increment if the OS resets us due to
> not completing transmits.
Yeah, true... the tx_timeout_count value is incrementing on the interface:
NIC statistics:
rx_packets: 4388583768
tx_packets: 144829687
rx_bytes: 11352706471
tx_bytes: 182892477547
rx_broadcast: 410657
tx_broadcast: 24509
rx_multicast: 842065
tx_multicast: 370
rx_errors: 4294967294
tx_errors: 0
tx_dropped: 0
multicast: 842065
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 8589934590
rx_frame_errors: 0
rx_no_buffer_count: 16341
rx_missed_errors: 28387
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 47
tx_restart_queue: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 1
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 11352706471
rx_csum_offload_good: 80766344
rx_csum_offload_errors: 0
alloc_rx_buff_failed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
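
A side note on the arithmetic of two of the counters above: rx_crc_errors (8589934590) is exactly 2 * 0xFFFFFFFF, and rx_errors (4294967294) equals the low 32 bits of that value. The small Python sketch below only checks this arithmetic. The helper name all_ones_multiple is made up for illustration. Reading this as "two register reads returned all ones" is an assumption, not something the statistics prove.

```python
# Check whether a counter is a whole multiple of an all-ones 32-bit value.
# Hypothetical helper, for illustration only; values are copied from the
# ethtool -S output above.
MASK32 = 0xFFFFFFFF

def all_ones_multiple(value):
    """Return k if value == k * 0xFFFFFFFF for some k >= 1, else None."""
    if value > 0 and value % MASK32 == 0:
        return value // MASK32
    return None

rx_crc_errors = 8589934590
rx_errors = 4294967294

# How many all-ones reads would add up to rx_crc_errors?
print(all_ones_multiple(rx_crc_errors))
# Is rx_errors just rx_crc_errors truncated to 32 bits?
print((rx_crc_errors & MASK32) == rx_errors)
```

If both checks hold, the two huge error counters may not reflect real CRC errors on the wire at all, but that remains a guess until someone confirms how the driver accumulates them.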
>>> Can you make absolutely sure that ethtool -K ethX tso off is done
>>> on each 82541 interface?
>>>
>>> The other thing that might be relevant is if you have >= 4GB ram.
>>>
>> Nope, the machine has 2 GB RAM. Could it be that the problem is
>> related to hyperthreading? What I find odd is that the problem
>> occurs on 2 machines, while it does not occur on 3 other machines.
>> I cannot find a difference in the system settings though.
>> Unfortunately there are stickers on the network chips, so I can't say
>> if there are different revisions of the 82541 chips in those
>> machines.
>
> It shouldn't be related to hyperthreading, at least I've never heard
> of such. You can compare dmidecode output of the machines, maybe
> compare /dev/nvram output of them too. lspci -vvv should show enough
> information to see if they are different parts (they will have
> different revisions)
The md5sum of /dev/nvram differs between the machines. Is there a tool
that translates the binary output of /dev/nvram into something human
readable?
Comparing the output of lspci -vvv shows mainly differences in mem
regions and irq addressing, but this could be related to an additional
pci card which was installed in the failing machine. I can rerun the
test without the card, though.
Not sure if this is important, but PERR is + on the failing machine:
Subsystem: Intel Corporation PRO/1000 MT Network Connection
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
- Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR+ INTx-
+ Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 252 (63750ns min), Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 18
Region 0: Memory at fd8c0000 (32-bit, non-prefetchable) [size=128K]
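
Spotting a single flipped flag in two long lspci Status lines by eye is error prone, so here is a minimal Python sketch that parses the flags and reports only the ones that differ. The function names parse_flags and diff_flags are made up for illustration, and the regular expression assumes the flags look like the ones quoted above (a name followed directly by + or -).

```python
import re

# A flag is an optional < or >, a name, then + or -, followed by
# whitespace or end of line. "DEVSEL=medium" does not match and is skipped.
FLAG_RE = re.compile(r"([<>]?[A-Za-z0-9]+)([+-])(?=\s|$)")

def parse_flags(status_line):
    """Map each flag name to True (+) or False (-)."""
    return {name: sign == "+" for name, sign in FLAG_RE.findall(status_line)}

def diff_flags(line_a, line_b):
    """Return {flag: (value_in_a, value_in_b)} for flags that differ."""
    a, b = parse_flags(line_a), parse_flags(line_b)
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}

# The two Status lines from the lspci -vvv comparison above.
failing = ("Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- "
           "<TAbort- <MAbort- >SERR- <PERR+ INTx-")
working = ("Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- "
           "<TAbort- <MAbort- >SERR- <PERR- INTx-")

print(diff_flags(failing, working))
```

The same two functions can be pointed at the Control lines as well, since they use the same name+/name- notation.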
The only difference I could spot with dmidecode is in the CPU
voltages. But this might be due to rounding errors?
Manufacturer: Intel
ID: C2 06 01 00 FF FB E9 BF
Version: Intel(R) Atom(TM)
- Voltage: 0.9 V
+ Voltage: 1.0 V
External Clock: 133 MHz
Max Speed: 1600 MHz
Current Speed: 1600 MHz
> can you also compare the ethtool -e outputs on the machines? What
> about the slot they are plugged into? Could the two machines with
> issues have heat problems for some reason (different case maybe?)
The network chips are on board. I don't think the problem is heat
related. The machines with problems are in a cooler environment than
the machines without problems. The temperature is around 35 degrees
inside the failing machine, and the fans work normally.
Lars