List:       grid-engine-dev
Subject:    Problem with CPR checkpointing
From:       "Shannon V. Davidson" <svdavidson () swbell ! net>
Date:       2003-06-10 20:20:11
Message-ID: 3EE63D7B.8020700 () swbell ! net

Hello Ernst,

With SGE 5.3p3 CPR checkpointing (and possibly other checkpointing
types), the qmaster reschedules the job even after the restart_command
completes successfully, which results in an infinite loop:

Tue Jun 10 14:26:52 2003|qmaster|pogo|W|job 448301.1 failed on host pogo.hpc-mo.com  migrating because: <unknown reason>
Tue Jun 10 14:26:52 2003|qmaster|pogo|W|rescheduling job 448301.1

My investigation of the problem led me to the following comment and 
code, which indicate that the shepherd is supposed to remove the 
"checkpointed" file when the job completes successfully.

    "./daemons/execd/reaper_execd.c" [readonly] line 609

       /* Be careful: the checkpointing checking is done at the end. It will
        * often override other failure states.
        * If the job finishes, the shepherd must remove the "checkpointed" file
        */

       sprintf(fname, "%s/checkpointed", jobdir);
       ckpt_arena = 1;   /* 1 job will be restarted in case of failure *
                          * 2 job will be restarted from ckpt arena    */
       if (!SGE_STAT(fname, &statbuf)) {
          int dummy;

          failed = SSTATE_MIGRATE;
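
The check above amounts to a test for the existence of the file in the
job directory. A rough shell equivalent, handy for inspecting a job
spool directory by hand, might look like this. The function name
ckpt_marker_present is hypothetical and this is not code from SGE:

```shell
# Hypothetical helper, not part of SGE: succeeds when the "checkpointed"
# marker exists in the given job directory, i.e. the condition under
# which the excerpt above sets failed = SSTATE_MIGRATE.
ckpt_marker_present() {
    [ -e "$1/checkpointed" ]
}
```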


There are two places in the shepherd code where the "checkpointed" file 
is unlinked, but I noticed that you have disabled both with #if 0.

    "./daemons/shepherd/shepherd.c" [readonly] line 1072

    #if 0 /* EB: review with AS */
          if (ckpt_type && !signalled_ckpt_job) {
             unlink("checkpointed");
             shepherd_trace("%s exited due to signal but not due to checkpoint", childname);
             if (ckpt_type & CKPT_KERNEL) {
                shepherd_trace("starting ckpt clean command");
                start_clean_command(clean_command);
             }
          }
    #endif

    "./daemons/shepherd/shepherd.c" [readonly] line 1101

    #if 0 /* EB: review with AS */
          if (!strcmp("job", childname)) {
             /* remove indication of checkpoints */
             if (WEXITSTATUS(status) < 128) {
                if (!signalled_ckpt_job && ckpt_type) {
                   shepherd_trace("checkpointing job exited normally");
                   unlink("checkpointed");
                   if (ckpt_type & CKPT_KERNEL) {
                      shepherd_trace("starting ckpt clean command");
                      start_clean_command(clean_command);
                   }
                }
             }
          }
    #endif


As a test, I added "rm -f $SGE_JOB_SPOOL_DIR/checkpointed" to my 
restart_command script, and this seemed to fix the problem (the job 
completed normally).
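
A minimal sketch of that workaround, assuming the restart_command is a
POSIX shell script. The function name restart_and_clear_marker and the
convention of passing the real restart program as arguments are
hypothetical; only SGE_JOB_SPOOL_DIR and the "checkpointed" file name
come from my setup. The marker is removed only when the restarted job
exits with status 0, so a failed restart still leaves it in place:

```shell
#!/bin/sh
# Hypothetical wrapper for use inside a restart_command script.
# Runs the real restart program (given as arguments) and, if it exits
# successfully, removes the "checkpointed" marker from the job spool
# directory. The original exit status is always passed through.
restart_and_clear_marker() {
    "$@"
    status=$?
    # Skip the removal if the spool directory is unknown, so we never
    # try to delete "/checkpointed" by accident.
    if [ "$status" -eq 0 ] && [ -n "${SGE_JOB_SPOOL_DIR:-}" ]; then
        rm -f "$SGE_JOB_SPOOL_DIR/checkpointed"
    fi
    return "$status"
}
```

In a restart_command script this would be invoked with the actual
restart program and its arguments in place of the plain call.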

Was there a protocol change between the shepherd and the execd or should 
the unlink still be taking place?

Cheers,
Shannon

-- 
___________________________________________

Shannon V. Davidson <svdavidson@swbell.net>
Senior Software Engineer           Raytheon
636-479-7465 office        443-383-0331 fax
___________________________________________





