
List:       lustre-discuss
Subject:    [lustre-discuss] Frequent, silent OSS hangs on multi-homed system
From:       "Kirk, Benjamin (JSC-EG311)" <benjamin.kirk () nasa ! gov>
Date:       2019-08-14 19:20:13
Message-ID: 56FD166F-872C-46AF-9B0D-02CCE8909515 () nasa ! gov

Hi, I'd love some ideas for debugging what has become a frequent annoyance
for us.  At a high level, we're observing fairly frequent OSS hangs with
absolutely no console or logging activity.  Our BMC watchdogs then reboot
the OSS, and ~6 minutes later everything is back in line.  This has been an
infrequent occurrence on this system for a couple of years, but it has
become much more frequent in recent months.

I'd love any suggestions for either lustre/lnet or overall kernel tricks to
raise the logging level, if possible, so we can get some more useful output.
Right now we're blind.

More details below, and also what I'd characterize as uninformed speculation:

-) overall system is (2x) MDS, (12x) OSS, and (2x) monitoring nodes, all
with identical servers, network cards, etc.

-) the only difference is the JBOD type: the OSSes are connected to
Supermicro 90-bay SC946ED-R2KJBOD. All other server hardware is identical.

-) only the OSSes hang in this manner. Looking back, some seem more prone
than others, but it's not obviously limited to a few.

-) CentOS 7.6, lustre 2.10.8, ZFS 0.7.9

-) 2 active file systems: one is pure ZFS, and the other is ZFS on the OSSes with an ldiskfs MDT

-) Mellanox ConnectX3 FDR IB & 40GbE

-) LSI 9300-8e HBA

-) Lustre servers are triple-homed: they live on (2x) IB and (1x) 40GbE networks

-) previously, when we first moved to 2.10, we were bitten hard and
frequently by LU-10163 (which may or may not be relevant)

-) The hangs don't correlate with any discrete event, as best I can tell.
Importantly, we get no LBUGs or anything, which is different from the
previous signature.

-) We have definitely stepped up the traffic on the ethernet network this
year.  Whereas the primary I/O was previously just on the two IB networks,
we are now taxing the ethernet as well with some regularity.

Any thoughts are most welcome, and thanks!

-Ben




_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

