On Saturday 16 February 2008, Kok, Auke wrote:
> Bernd Schubert wrote:
> > Hello,
> >
> > I can't login to one of our servers and just got this in an ipmi sol
> > session:
> >
> > [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
> > [18169.209183]   Tx Queue             <0>
> > [18169.209184]   TDH                  <e3>
> > [18169.209185]   TDT                  <e3>
> > [18169.209186]   next_to_use          <e3>
> > [18169.209187]   next_to_clean        <bd>
> > [18169.209188] buffer_info[next_to_clean]
> > [18169.209189]   time_stamp           <10043e4d2>
> > [18169.209190]   next_to_watch        <be>
> > [18169.209191]   jiffies              <10043e6f6>
> > [18169.209192]   next_to_watch.status <1>
> > [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
> > [18169.256979]   Tx Queue             <0>
> > [18169.256980]   TDH                  <de>
> > [18169.256982]   TDT                  <de>
> > [18169.256983]   next_to_use          <de>
> > [18169.256984]   next_to_clean        <bc>
> > [18169.256985] buffer_info[next_to_clean]
> > [18169.256986]   time_stamp           <10043e511>
> > [18169.256987]   next_to_watch        <bd>
> > [18169.256988]   jiffies              <10043e701>
> > [18169.256989]   next_to_watch.status <1>
> >
> > This is with 2.6.22.18. Is there any chance to recover the system? For
> > some reasons I would prefer not to reboot now.
>
> if that's all you have then it was false alarm. there should be a 'netdev
> timeout - link reset' following those messages. can you send some more
> context on those messages?

All I presently know is that there are 20 servers and login doesn't work any
more. sysrq+t shows me that it hangs in fuse, which is accessing the
underlying nfs (we are using unionfs-fuse). While I was checking the sysrq-t
output, these e1000 messages suddenly appeared.
Thinking about it a bit, either 2.6.22.18 has an e1000 bug which
2.6.22.X didn't have (X=16, I think, but I'm not sure), or someone
mis-configured the switch/network environment today.
Hmm, now that I think about the last part, there had already been other
networking problems today, which were supposed to have been fixed several
hours ago. It seems they were not fixed properly.
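
For reading the two dumps above, here is a minimal Python sketch that
computes how many jiffies lie between the stalled descriptor's time_stamp
and the jiffies value at detection, and how many descriptors sit between
next_to_clean and next_to_use. The function names are made up for
illustration, the ring size of 256 is an assumption, and HZ is an
assumption too, since I have not checked this kernel's config.

```python
def elapsed_jiffies(time_stamp_hex, jiffies_hex):
    """Jiffies between the buffer's time_stamp and the jiffies value in the dump."""
    return int(jiffies_hex, 16) - int(time_stamp_hex, 16)


def pending_descriptors(next_to_clean_hex, next_to_use_hex, ring_size=256):
    """Descriptors queued but not yet cleaned; ring_size=256 is an assumption."""
    return (int(next_to_use_hex, 16) - int(next_to_clean_hex, 16)) % ring_size


def jiffies_to_seconds(delta, hz):
    """Convert a jiffies delta to seconds; hz must match the kernel's CONFIG_HZ."""
    return delta / hz


if __name__ == "__main__":
    # Values copied from the two "Detected Tx Unit Hang" dumps above.
    dumps = {
        "eth0": {"time_stamp": "10043e4d2", "jiffies": "10043e6f6",
                 "next_to_clean": "bd", "next_to_use": "e3"},
        "eth2": {"time_stamp": "10043e511", "jiffies": "10043e701",
                 "next_to_clean": "bc", "next_to_use": "de"},
    }
    for name, d in dumps.items():
        delta = elapsed_jiffies(d["time_stamp"], d["jiffies"])
        pending = pending_descriptors(d["next_to_clean"], d["next_to_use"])
        # HZ is unknown here, so show the result for a few common settings.
        secs = ", ".join(f"HZ={hz}: {jiffies_to_seconds(delta, hz):.2f}s"
                         for hz in (100, 250, 1000))
        print(f"{name}: {delta} jiffies stalled ({secs}), {pending} descriptors pending")
```

This gives 548 jiffies for eth0 and 496 for eth2. If HZ were 250, that
would be roughly 2 seconds on both interfaces; with HZ=1000 it would be
about half a second. Which one applies depends on the kernel config.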

>
> in real tx hang cases, the hardware is reset within 2 seconds, and
> everything continues as normal.

Thanks, this gives me hope that I don't need to reboot the servers (a reboot
would mean I would need to start 60 md-raid rebuilds...).

Thanks,
Bernd