
List:       lustre-discuss
Subject:    [Lustre-discuss] kernel: excessive revalidate_it loops
From:       David_Kewley@dell.com (Kewley, David)
Date:       2007-06-27 17:25:29
Message-ID: 200706271325.36374.David_Kewley@dell.com

For the past several months, we've been running Lustre 1.4.7 on our 
production cluster.  Periodically we get the Subject: error message in the 
syslog.

This most often happens while a user MPI job is running, often on the node 
that is running rank 0, often very early in the run.  Several applications 
can trigger this message.

I see that the kernel code that prints this message is contained in the 
Lustre 1.4.7 patches, specifically the addition of 
namei.c:revalidate_special().
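
For reference, here is my rough paraphrase of that function, reconstructed 
from my reading of the patch against fs/namei.c.  This is simplified and 
almost certainly not verbatim: I've dropped the permission check and the 
intent handling, and I may be misremembering the exact error print and 
return value.

    static int revalidate_special(struct nameidata *nd)
    {
            struct dentry *dentry = nd->dentry;
            int count = 0;

            /* Nothing to revalidate for filesystems without the hook. */
            if (!dentry->d_op || !dentry->d_op->d_revalidate)
                    return 0;

            while (!dentry->d_op->d_revalidate(dentry, nd)) {
                    struct dentry *new;

                    /* The cached dentry failed revalidation: redo the
                     * lookup in the parent and swap in the fresh dentry. */
                    new = real_lookup(dentry->d_parent, &dentry->d_name, nd);
                    if (IS_ERR(new))
                            return PTR_ERR(new);
                    d_invalidate(dentry);
                    dput(dentry);
                    nd->dentry = dentry = new;

                    /* Ten fresh lookups in a row that still fail to
                     * revalidate produce the message I keep seeing. */
                    if (++count >= 10) {
                            printk(KERN_ERR "excessive revalidate_it loops\n");
                            return -ESTALE;
                    }
            }
            return 0;
    }

In other words, the message seems to mean that even a freshly looked-up 
dentry keeps failing d_revalidate() ten times in a row.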

I have only a shallow and very incomplete understanding of what 
circumstances can cause the error message to be logged.  Looking at the 
code, it appears to happen when ten successive (rapid) attempts to 
revalidate a dentry suffer a certain class of failure.  I do not know:

* what circumstances cause revalidate_special() to be called
* what types of failure cause the loop to be re-executed (up to ten times)
* what typical circumstances make the loop terminate with the
  Subject: message
* what dentry validation is, really (my partial reading of the VFS hook is
  sketched below)
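
On that last point, my partial understanding is that this is the standard 
VFS d_revalidate hook (2.6-era signature below), which a filesystem uses to 
vouch for a cached dentry before the kernel trusts it.  I believe Lustre's 
llite client implements it via ll_revalidate_it, which would explain the 
message text, but please correct me if that's wrong.

    /* From include/linux/dcache.h (2.6-era, abbreviated).  Returning 0
     * means "this cached dentry is stale, look it up again"; nonzero
     * means "still valid". */
    struct dentry_operations {
            int (*d_revalidate)(struct dentry *, struct nameidata *);
            /* ... other methods omitted ... */
    };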

So I'm asking you: what might be causing the failures, how can we check, and 
how can we avoid them in the future?

-----

Let me elaborate a little on why I care.  I have two concerns.

First, we've been seeing these messages sporadically over the last several 
months.  I've never been able to find any commonality until today (see next 
paragraph), and Google has not been very friendly.  I want to figure out 
whether these messages reflect a problem I need to solve for my users.

My second, more important concern is that recently a particular application 
with particular input parameters has been causing nodes to die, and we 
don't know why.  This affects that user (jobs die) and other users (loss of 
nodes).  I just noticed today that the node deaths appear to be correlated 
with appearance of the Subject: error message.

The node "death" is simply that many processes get general protection errors 
logged in syslog.  The great majority of these log entries are for 
processes that are not related to the job processes, except for the fact 
that they run on the same node.  I wonder whether there is some non-obvious 
resource starvation, or a kernel bug, or ...

Once the general protection faults start getting logged for a node, the node 
is unusable without a reboot.  If you already have an interactive shell 
open, you can do certain things but not others.

The Subject: message can be logged even when the node does not die and the 
job keeps running.  When the node *does* die in this way, though, signs of 
the node death always start occurring within a few seconds of the Subject: 
message.

I'd appreciate any suggestions.

Thanks,
David

-- 
David Kewley
Dell Services - Americas Technology Consulting
Consultant
Cell Phone: 602-460-7617
David_Kewley@Dell.com

I speak only for myself; my views do not necessarily reflect Dell's views.

Dell Services: http://www.dell.com/services/
How am I doing? Email my manager Dustin_Johnson@Dell.com with any feedback.
