List:       lustre-discuss
Subject:    [Lustre-discuss] odd mount behavior
From:       nheald () indiana ! edu (Heald, Nathan T ! )
Date:       2010-03-29 22:27:26
Message-ID: C7D6A18E.CFB4%nheald () indiana ! edu

While I've never used Lustre on IB, I have seen clients show very similar symptoms
even when the network appeared to be functioning properly. You should verify that
jumbo frames are passed properly between your networks (or between switch ports, if
both hosts are on the same network segment). If your MDS has its MTU set to 9000 but
the client is at the default of 1500, the MDS will send a jumbo frame at the very
last moment in the mount process, and it never shows up on the client. The symptom
is repeated timeout errors in syslog on both the client and the MDS during the
mount, and a hanging mount command.

Assuming you are using jumbo frames, send a packet larger than a 1500-byte MTU from
both hosts; if it fails, it's an MTU issue:

    ping -s 8000 10.4.200.6

If there's a router between your hosts, try sending a jumbo frame with
Don't-Fragment set; routers sometimes silently drop these:

    ping -s 8000 -M do 10.4.200.6
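
For the DF test it can help to size the payload so the packet exactly fills the MTU you expect to work. A rough sketch, assuming IPv4 with no IP options (20-byte IP header plus 8-byte ICMP header) and the Linux iputils ping; the function name is just something I made up for illustration:

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet
# of the given MTU. Assumes a 20-byte IPv4 header (no options) and an
# 8-byte ICMP header, hence the 28-byte overhead.
max_ping_payload() {
    echo $(( $1 - 28 ))
}

# Print (not run) the DF ping that exercises a full 9000-byte frame.
echo "ping -c 3 -M do -s $(max_ping_payload 9000) 10.4.200.6"
```

If that ping fails while a payload sized for MTU 1500 succeeds, something on the path is not passing jumbo frames.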

Again, if there's a router between your hosts, you can double-check with the
"tracepath" command, which sets the DF bit. If it stops returning output at some
point between your hosts, then Path MTU Discovery will not work, so the client on
that network segment will hit this issue while clients on other networks may work
fine. The next hop beyond the last IP that returns PMTU packets in the tracepath
output is the one silently dropping DF packets on you; use traceroute to compare
and locate the offender.
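
If you want to compare the path MTU across several clients, the tracepath output can be reduced to a single number. A minimal sketch, assuming the iputils tracepath output format in which hops and the final summary line carry a "pmtu N" field; the helper name is hypothetical:

```shell
#!/bin/sh
# Read tracepath output on stdin and print the last "pmtu N" value seen.
# Prints nothing if no pmtu field appears at all.
last_pmtu() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "pmtu") v = $(i + 1) }
         END { if (v != "") print v }'
}

# Usage (not run here): tracepath -n 10.4.200.6 | last_pmtu
```

A value below the MTU configured on the servers suggests a hop on that path is limiting frame size.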

If both hosts are on the same network segment (no router between them), then this
should not be an issue anyway, as it's assumed both hosts are configured for the
proper MTU of their local network. I haven't tested this on IB, but on Myricom
10GbE NICs we found that if one host has the wrong MTU set, you may see the error
count for the interface increment during the failing mount attempt. Since the
network stack on the client with the smaller MTU can't handle the jumbo frames, it
tosses them and increments the error counter (visible with "netstat -i"). Your
mileage may vary with IB.
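
To watch the counter around a mount attempt, you can also read it straight from sysfs. A small sketch, assuming a Linux host that exposes /sys/class/net/<iface>/mtu and the statistics files underneath it; the function name and the SYSFS_NET override are made up here (the override only makes the function easy to try against a fake directory):

```shell
#!/bin/sh
# Print the MTU and receive error/drop counters for one interface.
# SYSFS_NET defaults to the real sysfs location.
iface_stats() {
    dir="${SYSFS_NET:-/sys/class/net}/$1"
    [ -d "$dir" ] || { echo "no such interface: $1" >&2; return 1; }
    printf 'mtu=%s rx_errors=%s rx_dropped=%s\n' \
        "$(cat "$dir/mtu")" \
        "$(cat "$dir/statistics/rx_errors")" \
        "$(cat "$dir/statistics/rx_dropped")"
}

# Usage (not run here): iface_stats eth0
```

Run it before and after the failing mount on both hosts; a counter that jumps only on the host with the smaller MTU points at the mismatch.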

Hope this helps.
-Nathan


On 3/28/10 3:18 AM, "John White" <jwhite at lbl.gov> wrote:

Unfortunately for this case, there is a clear network path, no firewalls.  A quick
telnet test at least confirms something is listening on the port and closes the
connection pretty quickly.  If networking were the cause, wouldn't I still see
connection errors for the @tcp NID?
----------------
John White
High Performance Computing Services (HPCS)
(510) 486-7307
One Cyclotron Rd, MS: 50B-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720

On Mar 26, 2010, at 9:49 PM, Andreas Dilger wrote:

> On 2010-03-26, at 17:45, John White wrote:
> > We've got a new client we're trying to get to mount an existing file system.  The
> > host cluster is set up with 2 NIDs for the MDT (o2ib, tcp), same with the client.
> > When I try mounting via tcp (mount -t lustre -o flock n0006.lustre@tcp:/vulcan
> > /clusterfs/vulcan/pscratch), it just hangs there indefinitely with repeated
> > messages like:
> > Lustre: Request x264 sent from vulcan-MDT0000-mdc-ffff8101e39bb400 to NID
> > 10.4.200.6@o2ib 5s ago has timed out (limit 5s).
> > I'm confused why it's attempting the o2ib NID repeatedly and never tries the tcp
> > NID... Ideas?
> 
> 
> A common cause for newly-installed systems is hosts.deny or firewall rules that are
> preventing connections on port 988.
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
> 

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
