List: grid-engine-cvs
Subject: CVS update: MODIFIED: howto, scripting.html
From: chaubal () sunsource ! net
Date: 2003-08-14 20:31:45
Message-ID: 20030814203145.18796.qmail () s005 ! sfo ! collab ! net
User: chaubal
Date: 03/08/14 13:31:45
Modified: www/howto commonproblems.html commontasks.html qrsh_ssh.html
scripting.html
Log:
CC-2003-07-14-1: extended common problems list
Revision Changes Path
1.4 +360 -15 gridengine/www/howto/commonproblems.html
http://gridengine.sunsource.net/source/browse/gridengine/www/howto/commonproblems.html.diff?r1=1.3&r2=1.4
(In the diff below, changes in quantity of whitespace are not shown.)
Index: commonproblems.html
===================================================================
RCS file: /cvs/gridengine/www/howto/commonproblems.html,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -b -r1.3 -r1.4
--- commonproblems.html 2002/06/28 12:08:29 1.3
+++ commonproblems.html 2003/08/14 20:31:44 1.4
@@ -6,7 +6,7 @@
<META NAME="GENERATOR" CONTENT="StarOffice 6.0 (Solaris Sparc)">
<META NAME="AUTHOR" CONTENT=" ">
<META NAME="CREATED" CONTENT="20020111;13083600">
- <META NAME="CHANGED" CONTENT="20020419;13045300">
+ <META NAME="CHANGED" CONTENT="20030814;12383000">
<STYLE>
<!--
@page { size: 21.59cm 27.94cm }
@@ -16,14 +16,13 @@
<BODY LANG="en-US">
<H1><FONT COLOR="#336699"><FONT SIZE=4 STYLE="font-size: 16pt"><B>Common
problems using Grid Engine</B></FONT></FONT></H1>
+<P STYLE="margin-bottom: 0cm">Last updated: <SDFIELD TYPE=DATETIME SDNUM="1023;1033;MMM D, YYYY">Aug 14, 2003</SDFIELD></P>
<P STYLE="margin-bottom: 0cm">The present HOWTO goes over some commonly seen problems experienced when using Grid Engine, and appropriate solutions. The information is presented in a tabular
chart, using the following scheme:</P>
<P STYLE="margin-bottom: 0cm"><BR>
</P>
-<P STYLE="margin-bottom: 0cm"><BR>
-</P>
<TABLE WIDTH=288 BORDER=1 BORDERCOLOR="#000000" CELLPADDING=4 CELLSPACING=0 STYLE="page-break-inside: avoid">
<COL WIDTH=136>
<COL WIDTH=134>
@@ -50,24 +49,24 @@
</TR>
</TBODY>
</TABLE>
-<P STYLE="margin-bottom: 0cm"><BR>
-</P>
<P STYLE="margin-bottom: 0cm">For problems which are not explicitly
mentioned here, search for a symptom in the appropriate category
which matches your problem as closely as possible, and see if the
resolution fixes your particular case.</P>
-<P STYLE="margin-bottom: 0cm"><BR>
-</P>
<H3>Categories:</H3>
<UL>
<LI><P STYLE="margin-bottom: 0cm"><A HREF="#batch">Batch Submit</A></P>
<LI><P STYLE="margin-bottom: 0cm"><A HREF="#monitoring">Monitoring</A></P>
<LI><P STYLE="margin-bottom: 0cm"><A HREF="#miscerrmsg">Miscellaneous
Error Messages</A></P>
+ <LI><P STYLE="margin-bottom: 0cm"><A HREF="#performance">Performance</A></P>
+ <LI><P STYLE="margin-bottom: 0cm"><A HREF="#configuration">Configuration</A></P>
<LI><P STYLE="margin-bottom: 0cm"><A HREF="#interactive">Qrsh/Interactive
Jobs</A></P>
<LI><P STYLE="margin-bottom: 0cm"><A HREF="#qmake">Qmake</A></P>
<LI><P STYLE="margin-bottom: 0cm"><A HREF="#qmon">Qmon</A></P>
+ <LI><P STYLE="margin-bottom: 0cm"><A HREF="#pe-ckpt">Parallel/Checkpointing</A></P>
+ <LI><P STYLE="margin-bottom: 0cm"><A HREF="#shadow">Shadow Facility</A></P>
</UL>
<P STYLE="margin-bottom: 0cm"><BR>
</P>
@@ -156,6 +155,74 @@
</TD>
</TR>
<TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>qsub
+ of a job results in the error "can't set additional group id
+ for job" (seen in administrator or user mail, or shepherd
+ trace file) and puts queue into error state</FONT></FONT></FONT></P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Possible
+ reasons</FONT></FONT></FONT></P>
+ <OL>
+ <LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>This
+ error can occur if the user already has 16
+ existing group ids set. SGE tries to set one more group id and
+ fails because the limit is usually 16.</FONT></FONT></FONT></P>
+ <LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>If
+ you are not running Grid Engine as root, then the setgroups()
+ call will fail when it tries to set the unique group ID that is
+ used to track all the spawned processes of a job.</FONT></FONT></FONT></P>
+ </OL>
+ </TD>
+ <TD WIDTH=50%>
+ <OL>
+ <P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Corresponding
+ solutions</FONT></FONT></FONT></P>
+ <LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Please
+ check to see how many group ids are assigned to the user using
+ 'id -a'. If there are 16 or more, then you need to reduce this
+ number or increase the limit in the kernel (NGROUPS_MAX).</FONT></FONT></FONT></P>
+ <LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Be
+ sure to run the Grid Engine daemons as root.</FONT></FONT></FONT></P>
+ </OL>
+ </TD>
+ </TR>
+ <TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Jobs
+ work when run from command line but fail when run via qsub</FONT></FONT></FONT></P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Data
+ and executables may not be accessible where needed</FONT></FONT></FONT></P>
+ </TD>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
+ job script itself must be accessible from the submit host. All
+ data and other executables needed by the script must be
+ accessible on the execute host, usually by sharing them via NFS.</FONT></FONT></FONT></P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Unlimited
+ stack size set by default by SGE may cause some apps to crash on
+ some OS's.</FONT></FONT></FONT></P>
+ </TD>
+ <TD WIDTH=50%>
+ <P>In the job script, use “ulimit” to set stack size
+ limits before calling the executable that crashes.</P>
+ <P>Or modify the queue to set a smaller stack size:</P>
+ <PRE>qconf -mattr queue h_stack 8389486 <queue_name> (hard limit in bytes)
+qconf -mattr queue s_stack 8389486 <queue_name> (soft limit in bytes)</PRE>
+ </TD>
+ </TR>
+ <TR>
<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
<P><A NAME="monitoring"></A>Monitoring</P>
</TH>
@@ -163,7 +230,8 @@
<TR>
<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Exec
- hosts report a load of 99.99</FONT></FONT></FONT></P>
+ hosts report a load of 99.99; queue is in “alarm”
+ and/or “unknown” state</FONT></FONT></FONT></P>
</TD>
</TR>
<TR VALIGN=TOP>
@@ -193,13 +261,18 @@
up the execd as root on the host by running the
$SGE_ROOT/default/common/rcsge script </FONT></FONT></FONT>
</P>
- <LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Run
+ <LI><P STYLE="margin-bottom: 0cm"><FONT SIZE=3><FONT FACE="Thorndale, serif"><FONT COLOR="#000000">Run
'qconf -mconf' as the Sun Grid Engine administrator and change the default_domain to none. </FONT></FONT></FONT>
</P>
- <LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Please
- see the AppNote/HOWTO <A HREF="http://supportforum.sun.com/gridengine/appnote_loadinfo.html" TARGET="_child">loadinfo</A> for more
- information. </FONT></FONT></FONT>
+ <LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Set
+ <FONT SIZE=2>IGNORE_FQDN=TRUE </FONT>for qmaster_params in
+ cluster configuration.</FONT></FONT></FONT></P>
+ <LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>See
+ man page sge_h_aliases(5)</FONT></FONT></FONT></P>
+ <LI><P><FONT FACE="Thorndale, serif"><FONT COLOR="#000000">Please
+ see the AppNote/HOWTO <A HREF="http://supportforum.sun.com/gridengine/appnote_loadinfo.html" TARGET="_child">loadinfo</A>
+ for more information. </FONT></FONT>
</P>
</OL>
</TD>
@@ -303,6 +376,102 @@
</TD>
</TR>
<TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <PRE>“critical error: can't connect commd”
+“<FONT SIZE=2>critical error: setup failed starting cod_schedd”</FONT></PRE>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P>A bug on 32 bit systems: <FONT SIZE=2>rlim_fd_max > 1024 </FONT><FONT SIZE=3><FONT FACE="Thorndale, serif"><FONT COLOR="#000000">in
+ /etc/system</FONT></FONT></FONT></P>
+ </TD>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Set
+ rlim_fd_max to < 1024. Or update to SGE 5.3p2 or higher</FONT></FONT></FONT></P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P>The actual hostname <myhostname> of the machine is an
+ alias to localhost in /etc/hosts. It looks like this:</P>
+ <PRE>127.0.0.1 localhost myhostname</PRE>
+ </TD>
+ <TD WIDTH=50%>
+ <P>Remove <myhostname> as an alias to localhost and put
+ <myhostname> after the real IP address in /etc/hosts</P>
+ </TD>
+ </TR>
+ <TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <P>Multiple queues cascade into error state, rendering the grid
+ unusable.
+ </P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P>Errors in a user's .cshrc/.profile put all
+ queues into error state</P>
+ </TD>
+ <TD WIDTH=50%>
+ <OL>
+ <LI><P>Fix errors in users' .cshrc/.profile</P>
+ <LI><P>Use the -f option in the first line of the job script
+ (i.e. use “#!/bin/sh -f”) to bypass users' .cshrc or
+ .profile</P>
+ </OL>
+ </TD>
+ </TR>
+ <TR>
+ <TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
+ <P><A NAME="performance"></A>Performance</P>
+ </TH>
+ </TR>
+ <TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Memory
+ leak and huge memory consumption for schedd on large systems</FONT></FONT></FONT></P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Parameter
+ <CODE><FONT SIZE=2>sched_job_info=true</FONT></CODE></FONT></FONT></FONT></P>
+ </TD>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Set
+ <CODE><FONT SIZE=2>sched_job_info= false</FONT></CODE> or update
+ to release 5.3p3 or higher</FONT></FONT></FONT></P>
+ </TD>
+ </TR>
+ <TR>
+ <TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
+ <P><A NAME="configuration"></A>Configuration</P>
+ </TH>
+ </TR>
+ <TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>max_u_jobs
+ doesn't work as expected.</FONT></FONT></FONT></P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>max_u_jobs
+ does not work exactly the same way in all versions of the product,
+ and it affects scheduling differently depending on whether
+ the product is used in SGE or SGEEE mode. </FONT></FONT></FONT>
+ </P>
+ </TD>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Update
+ to SGE 5.3p2 (or higher) which contains the latest
+ implementation. </FONT></FONT></FONT>
+ </P>
+ </TD>
+ </TR>
+ <TR>
<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
<P><A NAME="interactive"></A>Qrsh/Interactive Jobs</P>
</TH>
@@ -347,7 +516,7 @@
<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>omit this check generally by overriding qrsh's default setting "-w e" explicitly by submitting it with "-w n" (can
- also be put into $SGE_ROOT/<cell>/common/cod_request)</FONT></FONT></FONT></P>
+ also be put into $SGE_ROOT/<cell>/common/sge_request)</FONT></FONT></FONT></P>
<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>if you intend managing 'mem_free' as a consumable resource specify the 'mem_free' capacity for your hosts in 'complex_values' of
@@ -388,6 +557,30 @@
</TR>
<TR>
<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>When
+ I do a qrsh, I get this error:</FONT></FONT></FONT></P>
+ <P STYLE="margin-bottom: 0cm"><BR>
+ </P>
+ <PRE>% qrsh
+error: 1: can't set additional group id for job</PRE>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
+ error message above can occur if the user already has 16
+ existing group ids set. SGE tries to set one more group id and
+ fails because the limit is usually 16.</FONT></FONT></FONT></P>
+ </TD>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Please
+ check to see how many group ids are assigned to the user using
+ 'id -a'. If there are 16 or more, then you need to reduce this
+ number or increase the limit in the kernel.</FONT></FONT></FONT></P>
+ </TD>
+ </TR>
+ <TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>qrsh
 -inherit -V does not work when used inside a parallel job:</FONT></FONT></FONT></P>
<P STYLE="margin-bottom: 0cm"><BR>
@@ -464,6 +657,22 @@
</TD>
</TR>
<TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Interactive
+ jobs fail when run via qsh, without error message.</FONT></FONT></FONT></P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>DISPLAY
+ variable may be set incorrectly</FONT></FONT></FONT></P>
+ </TD>
+ <TD WIDTH=50%>
+ <P>Set DISPLAY correctly. To get error messages for this
+ situation, upgrade to release 5.3p2 or higher.</P>
+ </TD>
+ </TR>
+ <TR>
<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
<P><A NAME="qmake"></A>Qmake</P>
</TH>
@@ -547,10 +756,146 @@
installation will fail</FONT></FONT></P>
</TD>
</TR>
+ <TR>
+ <TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
+ <P><A NAME="pe-ckpt"></A>Parallel/Checkpointing</P>
+ </TH>
+ </TR>
+ <TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <P STYLE="font-weight: medium">Parts of Sun HPC ClusterTools
+ parallel jobs (job script itself, child processes, etc) fail to
+ stop when terminated by user or by qmaster.</P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
+ user may not have supplied the necessary means (scripts) for SGE
+ to control the distributed jobs.</FONT></FONT></FONT></P>
+ </TD>
+ <TD WIDTH=50%>
+ <P>Follow the complete HOW-TO instructions:
+ <A HREF="http://supportforum.sun.com/gridengine/appnote_hpc.html">http://supportforum.sun.com/gridengine/appnote_hpc.html</A></P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Bugs
+ in early versions of loose integration package</FONT></FONT></FONT></P>
+ </TD>
+ <TD WIDTH=50%>
+ <P>Update to SGE 5.3p2 (or higher) which includes latest MPI
+ loose integration package</P>
+ </TD>
+ </TR>
+ <TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <P>Parallel jobs that run with the tight integration of SGE5.3.x
+ and HPC CT 5 are not terminated if one of the queues has wall
+ clock limit set.</P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>A
+ bug in SGE prevented correct signal delivery to all parallel
+ processes</FONT></FONT></FONT></P>
+ </TD>
+ <TD WIDTH=50%>
+ <P>SGE 5.3p4 contains the fix; for earlier 5.3.x versions, get
+ corresponding patches from <A HREF="http://sunsolve.Sun.COM/pub-cgi/show.pl?target=patches/patch-access">Sunsolve</A>:</P>
+ <P>SGE: 113136-04 (pkgadd Solaris 32-bit); 113137-04 (pkgadd
+ Solaris 64-bit); 113138-04 (pkgadd Solaris X86); 113663-02
+ (pkgadd common pkg); 113849-03 (tar.gz Solaris 32-bit); 113850-03
+ (tar.gz Solaris 64-bit); 113851-03 (tar.gz Solaris X86);
+ 113852-04 (tar.gz Linux); 113853-02 (tar.gz common package)</P>
+ <P>SGEEE: 113139-04 (pkgadd Solaris 32-bit); 113140-04 (pkgadd
+ Solaris 64-bit); 113636-03 (pkgadd common pkg); 113855-03 (tar.gz
+ Solaris 32-bit); 113856-03 (tar.gz Solaris 64-bit); 113900-02
+ (tar.gz Linux); 113857-02 (tar.gz common package)</P>
+ </TD>
+ </TR>
+ <TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <P>Parallel jobs that run with the tight integration of SGE5.3.x
+ and HPC CT 5 would not suspend and resume correctly.</P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Another
+ bug in SGE prevented STOP and CONT signals from being correctly
+ delivered to all processes. </FONT></FONT></FONT>
+ </P>
+ </TD>
+ <TD WIDTH=50%>
+ <P>Set the suspend/resume methods in the queues used for
+ the parallel jobs to the appropriate scripts. These scripts can
+ either be downloaded from the Grid Engine Project site at the
+ <A HREF="http://gridengine.sunsource.net/servlets/ProjectDownloadList">File
+ Exchange</A> or obtained from Sun support.</P>
+ <P>Releases beyond 5.3p4 will ship with these two scripts, a
+ README file and a parallel environment template.</P>
+ </TD>
+ </TR>
+ <TR>
+ <TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
+ <P><A NAME="shadow"></A>Shadow Facility</P>
+ </TH>
+ </TR>
+ <TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <P STYLE="font-weight: medium">After failover to shadow master,
+ the schedd daemon remains running on the original qmaster</P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>This
+ is a bug in earlier versions of SGE.</FONT></FONT></FONT></P>
+ </TD>
+ <TD WIDTH=50%>
+ <P>Update to 5.3p2 or higher</P>
+ </TD>
+ </TR>
+ <TR>
+ <TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
+ <P STYLE="font-weight: medium">Shadow host fails to own
+ mastership of SGE cluster</P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Lock
+ file exists.</FONT></FONT></FONT></P>
+ </TD>
+ <TD WIDTH=50%>
+ <P>Remove the $SGE_ROOT/<cell>/spool/qmaster/lock file if
+ the master host has crashed or can no longer function as
+ qmaster.<BR><B>NOTE:</B> to force the shadow host to take over
+ from another master, use the “migrate” option, i.e.
+ “rcsge -migrate”.</P>
+ </TD>
+ </TR>
+ <TR VALIGN=TOP>
+ <TD WIDTH=50%>
+ <P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Root
+ needs read/write access to the $SGE_ROOT directory and its sub-directories
+ from both the master and the shadow host.</FONT></FONT></FONT></P>
+ </TD>
+ <TD WIDTH=50%>
+ <P STYLE="margin-bottom: 0cm">Adjust permissions for root r/w
+ access to the $SGE_ROOT directory and its sub-directories from
+ shadow host.</P>
+ <P><B>NOTE: </B><SPAN STYLE="font-weight: medium">please s</SPAN>ee
+ the <A HREF="http://gridengine.sunsource.net/project/gridengine/howto/shadow.html">Shadow
+ Master HOWTO</A></P>
+ </TD>
+ </TR>
</TBODY>
</TABLE>
<P STYLE="margin-bottom: 0cm"><BR>
</P>
-<P STYLE="margin-bottom: 0cm">Last updated: <SDFIELD TYPE=DATETIME SDNUM="1033;1033;MMM D, YYYY">Apr 19, 2002</SDFIELD></P>
</BODY>
</HTML>
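
Note on the "can't set additional group id for job" entries above: the check can be sketched in portable shell roughly as follows. This is only an illustration and not part of the commit. It assumes the POSIX `id -G` and `getconf NGROUPS_MAX` utilities in place of the Solaris-specific `id -a` named in the HOWTO, and the helper names `count_groups` and `group_headroom` are hypothetical. The count from `id -G` also includes the primary group, so treat the result as an approximation.

```shell
#!/bin/sh
# Sketch: compare a user's group count with the system limit.
# Grid Engine adds one extra group id per job, so one slot must stay free.

# count_groups [USER]: print the number of group ids reported by id -G
# (defaults to the calling user). Helper name is hypothetical.
count_groups() {
    id -G ${1:+"$1"} | wc -w | tr -d ' '
}

# group_headroom COUNT LIMIT: succeed if COUNT + 1 still fits under LIMIT.
group_headroom() {
    [ "$(( $1 + 1 ))" -le "$2" ]
}

ngroups=$(count_groups)
limit=$(getconf NGROUPS_MAX)
if group_headroom "$ngroups" "$limit"; then
    echo "ok: $ngroups of $limit group ids in use"
else
    echo "warning: $ngroups of $limit group ids in use; no room for the job's additional group id"
fi
```

Pass a user name to `count_groups` to check an account other than your own.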
1.6 +89 -67 gridengine/www/howto/commontasks.html
http://gridengine.sunsource.net/source/browse/gridengine/www/howto/commontasks.html.diff?r1=1.5&r2=1.6
(In the diff below, changes in quantity of whitespace are not shown.)
Index: commontasks.html
===================================================================
RCS file: /cvs/gridengine/www/howto/commontasks.html,v
retrieving revision 1.5
retrieving revision 1.6
diff -u -b -r1.5 -r1.6
--- commontasks.html 2001/08/03 06:52:06 1.5
+++ commontasks.html 2003/08/14 20:31:44 1.6
@@ -1,68 +1,90 @@
-<table border="0" cellpadding="2" cellspacing="0" width="100%">
-<tr>
-<td><H2><font color="#336699" class="PageHeader">Common Administrative Tasks for Grid Engine</font></H2></td>
-</tr>
-</table>
-<table border="0" cellpadding="2" cellspacing="0" width="100%">
-<tr>
-<td>
-
-<br><br>
-Qconf is the command used for most administrative tasks. This
-HOWTO contains a selection of the most frequently used options. See
-qconf(1) for more details.
-</P>
-<P><B>Adding and removing administrative privileges from a host</B></P>
-<UL>
- <LI><P STYLE="margin-bottom: 0in">qconf -ah # gives host
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
+<HTML>
+<HEAD>
+ <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
+ <TITLE></TITLE>
+ <META NAME="GENERATOR" CONTENT="StarOffice 6.0 (Solaris Sparc)">
+ <META NAME="CREATED" CONTENT="20021028;15043100">
+ <META NAME="CHANGEDBY" CONTENT="Charu Chaubal">
+ <META NAME="CHANGED" CONTENT="20021028;15071800">
+</HEAD>
+<BODY LANG="en-US">
+<TABLE WIDTH=100% BORDER=0 CELLPADDING=2 CELLSPACING=0 STYLE="page-break-before: always">
+ <TR>
+ <TD>
+ <H2><FONT COLOR="#336699">Common Administrative Tasks for Grid
+ Engine</FONT></H2>
+ </TD>
+ </TR>
+</TABLE>
+<TABLE WIDTH=100% BORDER=0 CELLPADDING=2 CELLSPACING=0>
+ <TR>
+ <TD>
+ <P><BR><BR>Qconf is the command used for most administrative
+ tasks. This HOWTO contains a selection of the most frequently used
+ options. See qconf(1) for more details.
+ </P>
+ <P><B>Adding and removing administrative privileges from a host</B></P>
+ <UL>
+ <LI><P STYLE="margin-bottom: 0cm">qconf -ah # gives host
administrative privileges
</P>
<LI><P>qconf -dh # removes administrative privileges from host</P>
-</UL>
-<P><B>Adding an execution host</B></P>
-<UL>
+ </UL>
+ <P><B>Adding an execution host</B></P>
+ <UL>
<LI><P>Make the new host an administrative host<BR><BR>qconf -ah
- </P>
+ <hostname></P>
<LI><P>As root on this new host, run the following script from
$SGE_ROOT<BR><BR>install_execd</P>
-</UL>
-<P><B>Removing an execution host</B></P>
-<UL>
+ </UL>
+ <P><B>Removing an execution host</B></P>
+ <UL>
<LI><P>First, delete the queues associated with this host<BR><BR>qconf
- -dq
- </P>
- <LI><P>Delete the host<BR><BR>qconf -de
- </P>
-</UL>
-<P><B>Adding and removing submit hosts</B></P>
-<UL>
- <LI><P STYLE="margin-bottom: 0in">qconf -as # host is now a submit
- host
- </P>
- <LI><P>qconf -ds # jobs may not be submitted from host</P>
-</UL>
-<P><B>Displaying current administrative/submit/execution hosts</B></P>
-<UL>
- <LI><P STYLE="margin-bottom: 0in">qconf -sh # show current
+ -dq <queuenames...></P>
+ <LI><P>Delete the host<BR><BR>qconf -de <hostname></P>
+ <LI><P>Finally, delete the configuration for the host<BR><BR>qconf
+ -dconf <hostname></P>
+ </UL>
+ <P><BR><BR>
+ </P>
+ <P><B>Adding and removing submit hosts</B></P>
+ <UL>
+ <LI><P STYLE="margin-bottom: 0cm">qconf -as <hostname> #
+ host is now a submit host
+ </P>
+ <LI><P>qconf -ds <hostname> # jobs may not be submitted
+ from host</P>
+ </UL>
+ <P><B>Displaying current administrative/submit/execution hosts</B></P>
+ <UL>
+ <LI><P STYLE="margin-bottom: 0cm">qconf -sh # show current
administrative hosts
</P>
- <LI><P STYLE="margin-bottom: 0in">qconf -ss # show current submit
+ <LI><P STYLE="margin-bottom: 0cm">qconf -ss # show current submit
hosts
</P>
<LI><P>qconf -sel # show current execution host list
</P>
-</UL>
-<P><B>Administering queues</B></P>
-<UL>
- <LI><P STYLE="margin-bottom: 0in">qconf -aq # adding a queue</P>
- <LI><P STYLE="margin-bottom: 0in">qconf -dq # delete a queue
- </P>
- <LI><P>qconf -mq # modify a queue
- </P>
- <LI><P STYLE="margin-bottom: 0in">qconf -Aq # adding a queue from
- file</P>
- <LI><P STYLE="margin-bottom: 0in">qconf -mqattr # change single
- attributes of more than one queue</P>
-</UL>
-
-</table>
\ No newline at end of file
+ </UL>
+ <P><B>Administering queues</B></P>
+ <UL>
+ <LI><P STYLE="margin-bottom: 0cm">qconf -aq <queuename> #
+ adding a queue</P>
+ <LI><P STYLE="margin-bottom: 0cm">qconf -dq <queuename> #
+ delete a queue
+ </P>
+ <LI><P>qconf -mq <queuename> # modify a queue
+ </P>
+ <LI><P STYLE="margin-bottom: 0cm">qconf -Aq <filename> #
+ adding a queue from file</P>
+ <LI><P>qconf -mattr queue ... # change single attributes of more
+ than one queue</P>
+ </UL>
+ </TD>
+ </TR>
+</TABLE>
+<P><BR><BR>
+</P>
+</BODY>
+</HTML>
\ No newline at end of file
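
Note on "Removing an execution host" above: the three qconf steps could be wrapped in a small script along these lines. This is a hedged sketch, not part of the commit. The function name `remove_exec_host`, the `QCONF` override variable, and the host and queue names are all hypothetical. By default it only prints the commands, so it is safe to try without a cluster.

```shell
#!/bin/sh
# Sketch: print (or run) the qconf calls that retire an execution host.
# QCONF defaults to "echo qconf", so nothing is changed unless it is overridden.
QCONF=${QCONF:-"echo qconf"}

# remove_exec_host HOST QUEUE...
# Deletes the listed queues, then the host, then the host's configuration,
# in the order given in the HOWTO.
remove_exec_host() {
    host=$1
    shift
    for q in "$@"; do
        $QCONF -dq "$q" || return 1
    done
    $QCONF -de "$host" || return 1
    $QCONF -dconf "$host"
}

# Dry run with placeholder names.
remove_exec_host node01 node01.q
```

Setting `QCONF=qconf` in the environment would run the real commands; the dry-run default is a design choice to keep the example harmless.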
1.4 +97 -0 gridengine/www/howto/qrsh_ssh.html
http://gridengine.sunsource.net/source/browse/gridengine/www/howto/qrsh_ssh.html.diff?r1=1.3&r2=1.4
(In the diff below, changes in quantity of whitespace are not shown.)
Index: qrsh_ssh.html
===================================================================
RCS file: /cvs/gridengine/www/howto/qrsh_ssh.html,v
retrieving revision 1.3
retrieving revision 1.4
diff -u -b -r1.3 -r1.4
--- qrsh_ssh.html 2003/01/07 15:36:23 1.3
+++ qrsh_ssh.html 2003/08/14 20:31:44 1.4
@@ -1,3 +1,99 @@
+<<<<<<< qrsh_ssh.html
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
+<HTML>
+<HEAD>
+ <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
+ <TITLE></TITLE>
+ <META NAME="GENERATOR" CONTENT="StarOffice 6.0 (Solaris Sparc)">
+ <META NAME="CREATED" CONTENT="20020529;10474700">
+ <META NAME="CHANGEDBY" CONTENT="Charu Chaubal">
+ <META NAME="CHANGED" CONTENT="20020529;12534600">
+</HEAD>
+<BODY LANG="en-US">
+<P STYLE="margin-bottom: 0cm">
+</P>
+<TABLE WIDTH=100% BORDER=0 CELLPADDING=2 CELLSPACING=0>
+ <TR>
+ <TD>
+ <H2><FONT COLOR="#336699">Using ssh with qrsh</FONT></H2>
+ </TD>
+ </TR>
+</TABLE>
+<TABLE WIDTH=100% BORDER=0 CELLPADDING=2 CELLSPACING=0>
+ <COL WIDTH=256*>
+ <TR>
+ <TD WIDTH=100%>
+ <P>By default, the Grid Engine command <B>qrsh</B> will use
+ standard remote mechanisms (rsh/rlogin) to establish interactive
+ sessions.
+ </P>
+ <UL>
+ <LI><P><B>qrsh</B> by itself will use rlogin</P>
+ <LI><P><B>qrsh</B> with a command will establish an rsh
+ connection.
+ </P>
+ </UL>
+ <P>To enable the rsh/rlogin mechanism, special rsh and rlogin
+ binaries are provided with Grid Engine (found in
+ $SGE_ROOT/utilbin/$ARCH). In addition, to have full accounting and
+ process control for interactive jobs, an extended <B>rshd</B>
+ comes with Grid Engine.
+ </P>
+ <P>As an alternative, Grid Engine can be configured to use <B>ssh</B>
+ instead to start interactive jobs. <BR>
+ </P>
+ <H3>Advantages of using ssh:</H3>
+ <UL>
+ <LI><P STYLE="margin-bottom: 0cm">secure connection
+ </P>
+ <LI><P STYLE="margin-bottom: 0cm">no need to have suid root
+ programs installed (rsh and rlogin have to be suid root)</P>
+ <LI><P STYLE="margin-bottom: 0cm">much larger number of running
+ sessions per host (not limited by port number < 1024)</P>
+ <LI><P STYLE="margin-bottom: 0cm">compression (if lots of data
+ pushed through STDIN/STDOUT)</P>
+ <LI><P>possibility to attach a tty to remotely executed commands
+ (ssh option -t)</P>
+ </UL>
+ <H3 STYLE="margin-top: 0cm; margin-bottom: 0cm">Disadvantages:</H3>
+ <UL>
+ <LI><P STYLE="margin-bottom: 0cm">Lack of complete accounting
+ </P>
+ <LI><P>lack of process control (reprioritization)
+ </P>
+ </UL>
+ </TD>
+ </TR>
+</TABLE>
+<H3>How to set up ssh for qrsh:</H3>
+<P STYLE="margin-bottom: 0cm">First make sure ssh works and all required keys have been created.
+</P>
+<P STYLE="margin-bottom: 0cm">Set the parameters rsh_daemon and
+rlogin_daemon in your cluster configuration to ssh:
+</P>
+<UL>
+ <LI><P>rsh_daemon /usr/sbin/sshd -i</P>
+ <LI><P>rlogin_daemon /usr/sbin/sshd -i
+ </P>
+</UL>
+<P STYLE="margin-bottom: 0cm">If you have execution hosts with
+different architectures that have different paths to ssh, you will
+have to make these settings for each execution host individually
+(qconf -mconf host); otherwise you can change the global cluster
+configuration (qconf -mconf).</P>
+<P>Set the parameters rsh_command and rlogin_command in your cluster
+configuration to ssh:</P>
+<UL>
+ <LI><P>rsh_command /usr/bin/ssh</P>
+ <LI><P>rlogin_command /usr/bin/ssh
+ </P>
+</UL>
+<P>If you have submit hosts with different architectures that have
+different paths to ssh, you will have to make these settings for each
+submit host individually (qconf -mconf host); otherwise you can change the
+global cluster configuration (qconf -mconf). <BR> <BR> <BR> </P>
+</BODY>
+</HTML>=======
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
@@ -109,3 +205,4 @@
<br><tt><font color="#000000">exec /usr/sbin/sshd -i</font></tt>
</body>
</html>
+>>>>>>> 1.2
1.5 +239 -238 gridengine/www/howto/scripting.html
http://gridengine.sunsource.net/source/browse/gridengine/www/howto/scripting.html.diff?r1=1.4&r2=1.5
(In the diff below, changes in quantity of whitespace are not shown.)
Index: scripting.html
===================================================================
RCS file: /cvs/gridengine/www/howto/scripting.html,v
retrieving revision 1.4
retrieving revision 1.5
diff -u -b -r1.4 -r1.5
--- scripting.html 2002/03/15 18:11:47 1.4
+++ scripting.html 2003/08/14 20:31:44 1.5
@@ -1,12 +1,12 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
- <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=windows-1252">
+ <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
<TITLE></TITLE>
- <META NAME="GENERATOR" CONTENT="StarOffice 6.0 (Win32)">
+ <META NAME="GENERATOR" CONTENT="StarOffice 6.0 (Solaris Sparc)">
<META NAME="AUTHOR" CONTENT=" ">
<META NAME="CREATED" CONTENT="20020111;13083600">
- <META NAME="CHANGED" CONTENT="20020315;9310466">
+ <META NAME="CHANGED" CONTENT="20020319;12001100">
<STYLE>
<!--
H2 { font-family: "Sunsans Demi" }
@@ -15,7 +15,7 @@
-->
</STYLE>
</HEAD>
-<BODY>
+<BODY BGCOLOR="#ffffff">
<H1><FONT COLOR="#336699"><FONT SIZE=4 STYLE="font-size: 16pt"><B>Command
Line and Scripting of Administrative Tasks in Grid Engine</B></FONT></FONT></H1>
<P STYLE="margin-bottom: 0cm">The <B>qmon(1) </B>graphical user
@@ -61,20 +61,21 @@
with the "show" option of <B>qconf</B> (<B>qconf -s<obj></B>)
to take an existing object, modify it, and then update the existing
object or create a new one.</P>
-<H4>Example: Write a shell script to modify the <I>migration command
-</I><SPAN STYLE="font-style: normal">of an existing checkpoint
-environment</SPAN></H4>
+<H4>Example: Write a shell script to specify queues of a <SPAN STYLE="font-style: normal">checkpoint
+environment</SPAN> from a list in a file</H4>
<PRE>#!/bin/sh
-# ckptmod.sh: modify the migration command
-# of a checkpointing environment
-# Usage: ckptmod.sh <checkpoint-env-name> <full-path-to-command>
-TMPFILE=/tmp/ckptmod.$$
+# ckptq.sh: specify queues of a checkpoint from a list in a file
+# Usage: ckptq.sh <checkpoint-env-name> <filename>
+# <filename> contains a list of queues,
+# separated by commas and/or newlines
+TMPFILE=/tmp/ckptq.$$
CKPT=$1
-MIGMETHOD=$2
+QUEUELIST=$2
-qconf -sckpt $CKPT | grep -v '^migr_command' > $TMPFILE
-echo "migr_command $MIGMETHOD" >> $TMPFILE
+qconf -sckpt $CKPT | grep -v 'queue_list' > $TMPFILE
+echo queue_list `cat $QUEUELIST | \
+ tr "\012" " " | tr "," " "` >> $TMPFILE
qconf -Mckpt $TMPFILE
rm $TMPFILE</PRE>
<HR>
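
Note on ckptq.sh above: the list-flattening step can be tried on its own, without a running cluster. The sketch below is an illustration under stated assumptions, not the committed script. The function names `normalize_queue_list` and `queue_list_line` and the sample queue names are hypothetical, and it trims stray spaces, which the original does not do.

```shell
#!/bin/sh
# Sketch of the list-flattening step used by ckptq.sh, runnable without qconf.

# normalize_queue_list FILE: turn commas and newlines into single spaces
# and trim leading and trailing blanks.
normalize_queue_list() {
    tr ',\n' '  ' < "$1" | tr -s ' ' | sed -e 's/^ //' -e 's/ $//'
}

# queue_list_line FILE: build the line that ckptq.sh appends to the
# temporary checkpoint definition before calling qconf -Mckpt.
queue_list_line() {
    echo "queue_list $(normalize_queue_list "$1")"
}

# Demo with placeholder queue names.
list=$(mktemp)
printf 'q1.q,q2.q\nq3.q\n' > "$list"
queue_list_line "$list"   # prints: queue_list q1.q q2.q q3.q
rm -f "$list"
```

Using `mktemp` for the scratch file, rather than a name built from `$$`, avoids predictable file names in /tmp.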