
List:       e1000-devel
Subject:    Re: [E1000-devel] igb + bonding + netem packet corruption
From:       Matthew Kent <mkent@magoazul.com>
Date:       2013-12-20 22:43:11
Message-ID: 9185090E73544240BCB684F399BAA8EE@magoazul.com

On Friday, December 20, 2013 at 2:02 PM, Alexander Duyck wrote:
> On 12/20/2013 12:38 PM, Matthew Kent wrote:
> > Hello all,
> > 
> > We've been doing some testing here with netem and packet corruption
> > (http://www.linuxfoundation.org/collaborate/workgroups/networking/netem#Packet_corruption)
> > and I believe we may have stumbled on an igb or bonding bug that leads to a
> > nasty system lockup on RHEL/CentOS 6. We have a reproducible test case, but
> > I'm looking for pointers on what I can collect to help debug and hopefully
> > solve the issue.
> > 
> > Here's the setup:
> > 
> > Server A
> > --------
> > * Dell r710 server
> > * bnx2 2.1.11 (5.0.12 bc 5.0.11)
> > * Ubuntu precise
> > * kernel 3.2.0-49-generic
> > * 2 bonded interfaces in 802.3ad
> > * one connection to each top-of-rack Force10 switch
> > * untagged interface in vlan id 500
> > 
> > Server B
> > --------
> > * Dell c5220 server sled
> > * igb 5.0.5-k (firmware 3.29)
> > * CentOS 6.5
> > * kernel 2.6.32-431.1.2.0.1.el6.x86_64
> > * 2 bonded interfaces in 802.3ad (config sketched below)
> > * one connection to each top-of-rack Force10 switch
> > * tagged interface in vlan id 500
> > 
> > * Both servers plugged into the same Force10 switches
> > * Both servers are completely isolated on their own vlan unique to these switches
> > * tx flow control disabled on all switch ports
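> > 
> > For reference, Server B's 802.3ad bond is set up roughly as sketched below.
> > This is a minimal illustration, not a copy of our configs; the interface
> > names, option values, and addresses are assumptions:
> > 
> > # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise eth1)
> > DEVICE=eth0
> > ONBOOT=yes
> > BOOTPROTO=none
> > MASTER=bond0
> > SLAVE=yes
> > 
> > # /etc/sysconfig/network-scripts/ifcfg-bond0
> > DEVICE=bond0
> > ONBOOT=yes
> > BOOTPROTO=none
> > BONDING_OPTS="mode=802.3ad miimon=100"
> > 
> > # /etc/sysconfig/network-scripts/ifcfg-bond0.500 (the tagged vlan interface)
> > DEVICE=bond0.500
> > VLAN=yes
> > ONBOOT=yes
> > BOOTPROTO=static
> > IPADDR=192.0.2.10
> > NETMASK=255.255.255.0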
> > 
> > And the tests:
> > 
> > Test #1
> > -------
> > 
> > Server A runs:
> > 
> > tc qdisc add dev eth0 root netem corrupt 5%
> > 
> > introducing some corruption on one of the bonded interfaces. Server B does
> > nothing.
> > Result:
> > After 2-5 minutes, even at idle, Server B locks up *hard*. By that I mean
> > it's completely unresponsive, but the kernel doesn't actually panic.
> > Eventually, after a couple of minutes, it starts dumping stack traces from
> > the other stuck threads. Worse still, this somehow renders the onboard IPMI
> > management controller unreachable as well, requiring a physical reset of
> > the server sled. Server B will continue crashing until tc is disabled on
> > Server A.
> > 
> > Example call trace during the lockup:
> > https://gist.github.com/mdkent/a792cda348ca0048d3cc (though this was from
> > an earlier test on CentOS 6.3, the end result is the same).
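> > 
> > For anyone reproducing this: "tc is disabled" above just means deleting
> > the netem qdisc again with the standard tc commands, e.g.
> > 
> > tc qdisc show dev eth0       # confirm the netem qdisc is active
> > tc qdisc del dev eth0 root   # remove it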
> > 
> > Test #2
> > -------
> > 
> > This time we remove the bonded interface on Server B, on both the host and
> > the switch, giving us a single untagged port in vlan 500.
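> > 
> > Roughly speaking, on the host side that means something like the following
> > (the shape of the change, not our exact procedure):
> > 
> > ifdown bond0
> > # move the IP config from ifcfg-bond0 into ifcfg-eth0 and drop the
> > # MASTER=bond0/SLAVE=yes lines, then:
> > ifup eth0
> > 
> > plus pulling the port out of the LACP group on the switch side.
> > 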
> > We introduce the same corruption from Server A. We also send some traffic
> > from Server A -> Server B, since things are quieter with LACP disabled.
> > Result:
> > Everything is fine. Server B has no complaints.
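> > 
> > The test traffic was nothing special; anything that keeps packets moving
> > across the link works. As an illustration rather than a prescription, an
> > iperf pair along these lines is plenty:
> > 
> > iperf -s                        # on Server B
> > iperf -c <server-b-ip> -t 600   # on Server A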
> > 
> > 
> > The obvious solution is to stay away from netem :) but this is a scary
> > bug. It's already caused a major outage for us that was very difficult to
> > debug.
> > 
> > What can I gather and provide to help fix this?
> > 
> > Thanks for looking!
> 
> Hello Matthew,
> 
> Based on the traces provided the issue would appear to be a soft lockup
> related to a locking issue in the bonding driver. By any chance have you
> tried testing this issue with an interface other than igb in the same
> system? This would help to determine if igb actually plays a role in
> this or if it is just specific to the bonding driver receiving frames
> from netem.
> 
Unfortunately these Dell c5220 server sleds can't accommodate any add-in
cards. They ship with just the 2 onboard igb interfaces.

This might help, though: here's a complete review of what we saw during the
outage netem triggered:

* All servers with CentOS/RHEL 6 + igb locked up.
* All servers with CentOS/RHEL 6 + ixgbe locked up.
* Some servers with Ubuntu precise + igb started port flapping, but had no
  lockups; a reboot cured them.
* All servers with Ubuntu lucid + igb were fine.
* All servers with CentOS/RHEL 6 + bnx2 were fine.
* All servers with Ubuntu precise + bnx2 were fine.
  
Our servers are mostly a mix of older Dell r720s and newer Dell c5220s and
c6220s, all of them using bonding.
> 
> As far as the igb driver itself, could you send us the ethtool -i,
> ethtool -S, and lspci -vvv output for the interface? This would help us
> determine the hardware configuration you have.
> 
> Thanks,

Sure! Here you go: https://gist.github.com/mdkent/821026b70882c706142e
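
For reference, that output came from the usual commands (the PCI address
below is a placeholder; substitute the igb port's actual address from lspci):

ethtool -i eth0
ethtool -S eth0
lspci -vvv -s 01:00.0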

Let me know if you'd like me to open up a proper bug report.

Thanks again.

- Matt




_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired

