List: lustre-discuss
Subject: [Lustre-discuss] kernel: excessive revalidate_it loops
From: David_Kewley@dell.com (Kewley, David)
Date: 2007-06-27 17:25:29
Message-ID: 200706271325.36374.David_Kewley@dell.com
For the past several months, we've been running Lustre 1.4.7 on our
production cluster. Periodically we get the Subject: error message in the
syslog.
This most often happens while a user MPI job is running, often on the node
that is running rank 0, often very early in the run. Several applications
can trigger this message.
I see that the kernel code that prints this message is contained in the
Lustre 1.4.7 patches, specifically the addition of
namei.c:revalidate_special().
I have only a shallow and very incomplete understanding of what
circumstances can cause the error message to be logged. Looking at the
code, it appears to happen when ten successive (rapid) attempts to
revalidate a dentry suffer a certain class of failure (see the sketch
after the list below). I do not know:
* what circumstances cause revalidate_special() to be called
* what types of failure cause the loop to be re-executed (up to ten times)
* what are typical circumstances in which the loop terminates with the
Subject: message
* what dentry validation is, really
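To make that concrete, here is a toy user-space illustration of the loop
shape as I read it out of the patch. This is my paraphrase, not the actual
kernel code; the real revalidate_special() re-does the lookup and
permission checks on each pass, and the revalidate() stub below is purely
hypothetical.

    /* Toy user-space illustration (NOT the Lustre code) of the retry
     * pattern I believe revalidate_special() implements: retry a
     * revalidation step that can fail transiently, and after ten
     * consecutive failures give up and log the message. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for ->d_revalidate(): returns 1 if the cached dentry
     * is still valid, 0 if it must be looked up again.  Here it just
     * fails at random about a quarter of the time. */
    static int revalidate(void)
    {
        return rand() % 4 != 0;
    }

    int main(void)
    {
        int counter = 0;

        /* Keep retrying while revalidation fails, up to ten times. */
        while (!revalidate()) {
            if (++counter >= 10) {
                fprintf(stderr,
                        "kernel: excessive revalidate_it loops\n");
                return 1;
            }
        }
        printf("revalidated after %d failed attempt(s)\n", counter);
        return 0;
    }

If that reading is right, the message just means the dentry kept coming
back invalid ten times in a row, and my questions above are about what can
make that happen.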
So I'm asking you: what might be causing the failures, how can we check
for them, and how can we avoid them in the future?
-----
Let me elaborate a little on why I care. I have two concerns.
First, we've been seeing these messages sporadically over the last several
months. I've never been able to find any commonality until today (see next
paragraph), and Google has not been very friendly. I want to figure out
whether these messages reflect a problem I need to solve for my users.
My second, more important concern is that recently a particular application
with particular input parameters has been causing nodes to die, and we
don't know why. This affects that user (jobs die) and other users (loss of
nodes). I just noticed today that the node deaths appear to be correlated
with the appearance of the Subject: error message.
The node "death" is simply that many processes get general protection errors
logged in syslog. The great majority of these log entries are for
processes that are not related to the job processes, except for the fact
that they run on the same node. I wonder whether there is some non-obvious
resource starvation, or a kernel bug, or ...
Once the general protection faults start getting logged on a node, the node
is unusable without a reboot. If you already have an interactive shell
open, you can do certain things but not others.
The Subject: message can be logged even when the node does not die and the
job keeps running. When the node *does* die in this way, though, signs of
the node death always start occurring within a few seconds of the Subject:
message.
I'd appreciate any suggestions.
Thanks,
David
--
David Kewley
Dell Services - Americas Technology Consulting
Consultant
Cell Phone: 602-460-7617
David_Kewley@Dell.com
I speak only for myself; my views do not necessarily reflect Dell's views.
Dell Services: http://www.dell.com/services/
How am I doing? Email my manager Dustin_Johnson@Dell.com with any feedback.