List: lustre-discuss
Subject: [lustre-discuss] Frequent, silent OSS hangs on multi-homed system
From: "Kirk, Benjamin (JSC-EG311)" <benjamin.kirk () nasa ! gov>
Date: 2019-08-14 19:20:13
Message-ID: 56FD166F-872C-46AF-9B0D-02CCE8909515 () nasa ! gov
Hi, I'd love some ideas to debug what has become a frequent annoyance for us. At a
high level, we're observing fairly frequent OSS hangs with absolutely no console or
logging activity. Our BMC watchdogs then reboot the OSS, and ~6 minutes later
everything is back in line. This had been an infrequent occurrence on this system for
a couple of years, but it has become much more frequent in recent months.
I'd love any suggestions for either lustre/lnet or overall kernel tricks to up the
logging level if possible to see if we can get some more useful output. Right now
we're blind.
More details below, and also what I'd characterize as uninformed speculation:
-) overall system is (2x) MDS, (12x) OSS, and (2x) monitoring nodes, all on identical
servers, network cards, etc.
-) the only difference is the JBOD type: the OSSes are connected to Supermicro 90-bay
SC946ED-R2KJBOD. All other server hardware is identical.
-) only the OSSes hang in this manner. Looking back, some seem more prone than
others, but it's not obviously limited to a few.
-) CentOS 7.6, lustre 2.10.8, ZFS 0.7.9
-) 2 active file systems: one is pure ZFS, the other is ZFS on the OSSes with an ldiskfs MDT
-) Mellanox ConnectX3 FDR IB & 40GbE
-) LSI 9300-8e HBA
-) Lustre servers are triple-homed, they live on (2x) IB and (1x) 40GbE networks
-) when we first moved to 2.10 we were bitten hard and frequently by LU-10163
(which may or may not be relevant)
-) The hangs don't correlate with any discrete event, as best I can tell. Importantly,
we get no LBUGs or anything, which is different from the previous signature.
-) We have definitely stepped up the traffic on the Ethernet network this year.
Whereas the primary I/O was previously just on the two IB networks, we are now taxing
the Ethernet as well with some regularity.
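To put numbers on "some seem more prone than others" and "more frequent in recent months", the watchdog reboots could be tallied per OSS and per month. The following is only a minimal sketch: the host names and timestamps are made-up placeholders, and it assumes the reboot history has already been collected by hand (for example from BMC event logs) into (host, ISO timestamp) pairs.

```python
from collections import Counter
from datetime import datetime

# Hypothetical reboot records: (OSS hostname, ISO 8601 timestamp).
# These values are illustrative placeholders, not real data from this system.
REBOOTS = [
    ("oss03", "2019-05-11T02:40:00"),
    ("oss07", "2019-06-20T14:05:00"),
    ("oss03", "2019-07-02T09:12:00"),
    ("oss11", "2019-07-19T23:30:00"),
    ("oss03", "2019-08-01T06:45:00"),
    ("oss07", "2019-08-09T17:20:00"),
]


def hangs_per_host(records):
    """Count watchdog reboots for each OSS."""
    return Counter(host for host, _ in records)


def hangs_per_month(records):
    """Count watchdog reboots per calendar month (keyed as YYYY-MM)."""
    return Counter(
        datetime.fromisoformat(ts).strftime("%Y-%m") for _, ts in records
    )


if __name__ == "__main__":
    # Hosts sorted from most to least reboots.
    for host, count in hangs_per_host(REBOOTS).most_common():
        print(f"{host}: {count}")
    # Months in chronological order.
    for month, count in sorted(hangs_per_month(REBOOTS).items()):
        print(f"{month}: {count}")
```

A skewed per-host count would point toward specific hardware, while a flat one would fit better with a load-related cause such as the added Ethernet traffic.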
Any thoughts are most welcome, and thanks!
-Ben
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org