List:       grid-engine-cvs
Subject:    CVS update: MODIFIED: howto, scripting.html
From:       chaubal () sunsource ! net
Date:       2003-08-14 20:31:45
Message-ID: 20030814203145.18796.qmail () s005 ! sfo ! collab ! net

  User: chaubal 
  Date: 03/08/14 13:31:45

  Modified:    www/howto commonproblems.html commontasks.html qrsh_ssh.html
                        scripting.html
  Log:
  CC-2003-07-14-1: extended common problems list
  
  Revision  Changes    Path
  1.4       +360 -15   gridengine/www/howto/commonproblems.html
  
  http://gridengine.sunsource.net/source/browse/gridengine/www/howto/commonproblems.html.diff?r1=1.3&r2=1.4
  
  (In the diff below, changes in quantity of whitespace are not shown.)
  
  Index: commonproblems.html
  ===================================================================
  RCS file: /cvs/gridengine/www/howto/commonproblems.html,v
  retrieving revision 1.3
  retrieving revision 1.4
  diff -u -b -r1.3 -r1.4
  --- commonproblems.html	2002/06/28 12:08:29	1.3
  +++ commonproblems.html	2003/08/14 20:31:44	1.4
  @@ -6,7 +6,7 @@
   	<META NAME="GENERATOR" CONTENT="StarOffice 6.0  (Solaris Sparc)">
   	<META NAME="AUTHOR" CONTENT=" ">
   	<META NAME="CREATED" CONTENT="20020111;13083600">
  -	<META NAME="CHANGED" CONTENT="20020419;13045300">
  +	<META NAME="CHANGED" CONTENT="20030814;12383000">
   	<STYLE>
   	<!--
   		@page { size: 21.59cm 27.94cm }
  @@ -16,14 +16,13 @@
   <BODY LANG="en-US">
   <H1><FONT COLOR="#336699"><FONT SIZE=4 STYLE="font-size: 16pt"><B>Common
   problems using Grid Engine</B></FONT></FONT></H1>
  +<P STYLE="margin-bottom: 0cm">Last updated: <SDFIELD TYPE=DATETIME SDNUM="1023;1033;MMM D, YYYY">Aug 14, 2003</SDFIELD></P>
   <P STYLE="margin-bottom: 0cm">The present HOWTO goes over some
   commonly seen problems experienced when using Grid Engine, and
   appropriate solutions. The information is presented in a tabular
   chart, using the following scheme:</P>
   <P STYLE="margin-bottom: 0cm"><BR>
   </P>
  -<P STYLE="margin-bottom: 0cm"><BR>
  -</P>
   <TABLE WIDTH=288 BORDER=1 BORDERCOLOR="#000000" CELLPADDING=4 CELLSPACING=0 STYLE="page-break-inside: avoid">
   	<COL WIDTH=136>
   	<COL WIDTH=134>
  @@ -50,24 +49,24 @@
   		</TR>
   	</TBODY>
   </TABLE>
  -<P STYLE="margin-bottom: 0cm"><BR>
  -</P>
   <P STYLE="margin-bottom: 0cm">For problems which are not explicitly
   mentioned here, search for a symptom in the appropriate category
   which matches your problem as closely as possible, and see if the
   resolution fixes your particular case.</P>
  -<P STYLE="margin-bottom: 0cm"><BR>
  -</P>
   <H3>Categories:</H3>
   <UL>
   	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#batch">Batch Submit</A></P>
   	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#monitoring">Monitoring</A></P>
   	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#miscerrmsg">Miscellaneous
   	Error Messages</A></P>
  +	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#performance">Performance</A></P>
  +	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#configuration">Configuration</A></P>
   	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#interactive">Qrsh/Interactive
   	Jobs</A></P>
   	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#qmake">Qmake</A></P>
   	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#qmon">Qmon</A></P>
  +	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#pe-ckpt">Parallel/Checkpointing</A></P>
  +	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#shadow">Shadow Facility</A></P>
   </UL>
   <P STYLE="margin-bottom: 0cm"><BR>
   </P>
  @@ -156,6 +155,74 @@
   			</TD>
   		</TR>
   		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>qsub
  +				of a job results in the error &quot;can't set additional group id
  +				for job&quot; (seen in administrator or user mail, or shepherd
  +				trace file) and puts queue into error state</FONT></FONT></FONT></P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Possible
  +				reasons</FONT></FONT></FONT></P>
  +				<OL>
  +					<LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
  +					error message above can occur if the user already has 16
  +					existing group ids set. SGE tries to set one more group id and
  +					fails because the limit is usually 16.</FONT></FONT></FONT></P>
  +					<LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>If
  +					you are not running Grid Engine as root, then the setgroups()
  +					call will fail when trying to set the unique group ID which is
  +					used to track all the spawned processes of a job.</FONT></FONT></FONT></P>
  +				</OL>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<OL>
  +					<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Corresponding
  +					solutions</FONT></FONT></FONT></P>
  +					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Please
  +					check to see how many group ids are assigned to the user using
  +					'id -a'. If it's more than 16, then you need to reduce this
  +					number or increase the limit in the kernel (NGROUPS_MAX).</FONT></FONT></FONT></P>
  +					<LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Be
  +					sure to run the Grid Engine daemons as root.</FONT></FONT></FONT></P>
  +				</OL>
  +			</TD>
  +		</TR>
  +		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Jobs
  +				work when run from command line but fail when run via qsub</FONT></FONT></FONT></P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Data
  +				and executables may not be accessible where needed</FONT></FONT></FONT></P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
  +				job script itself must be accessible from the submit host. All
  +				data and other executables needed by the script must be
  +				accessible on the execute host. These are usually shared via NFS.</FONT></FONT></FONT></P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Unlimited
  +				stack size set by default by SGE may cause some applications to
  +				crash on some operating systems.</FONT></FONT></FONT></P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P>In the job script, use &ldquo;ulimit&rdquo; to set stack size
  +				limits before calling the executable that crashes.</P>
  +				<P>Or modify the queue to set smaller stack size:</P>
  +				<PRE>qconf -mattr queue h_stack 8389486 &lt;queue_name&gt; (hard limit in bytes)
  +qconf -mattr queue s_stack 8389486 &lt;queue_name&gt; (soft limit in bytes)</PRE>
  +			</TD>
  +		</TR>
  +		<TR>
   			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
   				<P><A NAME="monitoring"></A>Monitoring</P>
   			</TH>
  @@ -163,7 +230,8 @@
   		<TR>
   			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
   				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Exec
  -				hosts report a load of 99.99</FONT></FONT></FONT></P>
  +				hosts report a load of 99.99; queue is in &ldquo;alarm&rdquo;
  +				and/or &ldquo;unknown&rdquo; state</FONT></FONT></FONT></P>
   			</TD>
   		</TR>
   		<TR VALIGN=TOP>
  @@ -193,13 +261,18 @@
   					up the execd as root on the host by running the
   					$SGE_ROOT/default/common/rcsge script </FONT></FONT></FONT>
   					</P>
  -					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Run
  +					<LI><P STYLE="margin-bottom: 0cm"><FONT SIZE=3><FONT FACE="Thorndale, serif"><FONT COLOR="#000000">Run
   					'qconf -mconf' as the Sun Grid Engine administrator and change
   					the default_domain to none. </FONT></FONT></FONT>
   					</P>
  -					<LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Please
  -					see the AppNote/HOWTO <A HREF="http://supportforum.sun.com/gridengine/appnote_loadinfo.html" TARGET="_child">loadinfo</A> for more
  -					information. </FONT></FONT></FONT>
  +					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Set
  +					<FONT SIZE=2>IGNORE_FQDN=TRUE </FONT>for qmaster_params in
  +					cluster configuration.</FONT></FONT></FONT></P>
  +					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>See
  +					man page sge_h_aliases(5)</FONT></FONT></FONT></P>
  +					<LI><P><FONT FACE="Thorndale, serif"><FONT COLOR="#000000">Please
  +					see the AppNote/HOWTO <A HREF="http://supportforum.sun.com/gridengine/appnote_loadinfo.html" TARGET="_child">loadinfo</A>
  +					for more information. </FONT></FONT>
   					</P>
   				</OL>
   			</TD>
  @@ -303,6 +376,102 @@
   			</TD>
   		</TR>
   		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<PRE>&ldquo;critical error: can't connect commd&rdquo;
  +&ldquo;<FONT SIZE=2>critical error: setup failed starting cod_schedd&rdquo;</FONT></PRE>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P>A bug on 32 bit systems: <FONT SIZE=2>rlim_fd_max &gt; 1024 </FONT><FONT SIZE=3><FONT FACE="Thorndale, serif"><FONT COLOR="#000000">in
  +				/etc/system</FONT></FONT></FONT></P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Set
  +				rlim_fd_max to &lt; 1024, or update to SGE 5.3p2 or higher</FONT></FONT></FONT></P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P>The actual hostname &lt;myhostname&gt; of the machine is an
  +				alias to localhost in /etc/hosts. It looks like this:</P>
  +				<PRE>127.0.0.1   localhost  myhostname</PRE>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P>Remove &lt;myhostname&gt; as an alias to localhost and put
  +				&lt;myhostname&gt; after the real IP address in /etc/hosts</P>
  +			</TD>
  +		</TR>
  +		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<P>Multiple queues cascade into error state, rendering the grid
  +				unusable. 
  +				</P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P>Errors in a user's .cshrc/.profile put all
  +				queues into error state</P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<OL>
  +					<LI><P>Fix errors in users' .cshrc/.profile</P>
  +					<LI><P>Use the -f option in the first line of the job script
  +					(e.g. use &ldquo;#!/bin/sh -f&rdquo;) to bypass users' .cshrc or
  +					.profile</P>
  +				</OL>
  +			</TD>
  +		</TR>
  +		<TR>
  +			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
  +				<P><A NAME="performance"></A>Performance</P>
  +			</TH>
  +		</TR>
  +		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Memory
  +				leak and huge memory consumption for schedd on large systems</FONT></FONT></FONT></P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Parameter
  +				<CODE><FONT SIZE=2>schedd_job_info=true</FONT></CODE></FONT></FONT></FONT></P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Set
  +				<CODE><FONT SIZE=2>schedd_job_info=false</FONT></CODE> or update
  +				to release 5.3p3 or higher</FONT></FONT></FONT></P>
  +			</TD>
  +		</TR>
  +		<TR>
  +			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
  +				<P><A NAME="configuration"></A>Configuration</P>
  +			</TH>
  +		</TR>
  +		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>max_u_jobs
  +				doesn't work as expected.</FONT></FONT></FONT></P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>It
  +				doesn't work exactly the same way in all versions of the product
  +				&ndash; and affects scheduling differently depending on whether
  +				the product is used in SGE or SGEEE mode. </FONT></FONT></FONT>
  +				</P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Update
  +				to SGE 5.3p2 (or higher) which contains the latest
  +				implementation. </FONT></FONT></FONT>
  +				</P>
  +			</TD>
  +		</TR>
  +		<TR>
   			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
   				<P><A NAME="interactive"></A>Qrsh/Interactive Jobs</P>
   			</TH>
  @@ -347,7 +516,7 @@
   					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>omit
   					this check generally by overriding qrsh's default setting &quot;-w
   					e&quot; explicitly by submitting it with &quot;-w n&quot; (can
  -					also be put into $SGE_ROOT/&lt;cell&gt;/common/cod_request)</FONT></FONT></FONT></P>
  +					also be put into $SGE_ROOT/&lt;cell&gt;/common/sge_request)</FONT></FONT></FONT></P>
   					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>if
   					you intend managing 'mem_free' as a consumable resource specify
   					the 'mem_free' capacity for your hosts in 'complex_values' of
  @@ -388,6 +557,30 @@
   		</TR>
   		<TR>
   			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>When
  +				I do a qrsh, I get this error:</FONT></FONT></FONT></P>
  +				<P STYLE="margin-bottom: 0cm"><BR>
  +				</P>
  +				<PRE>% qrsh
  +error: 1: can't set additional group id for job</PRE>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
  +				error message above can occur if the user already has 16
  +				existing group ids set. SGE tries to set one more group id and
  +				fails because the limit is usually 16.</FONT></FONT></FONT></P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Please
  +				check to see how many group ids are assigned to the user using
  +				'id -a'. If it's more than 16, then you need to reduce this
  +				number or increase the limit in the kernel.</FONT></FONT></FONT></P>
  +			</TD>
  +		</TR>
  +		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
   				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>qrsh
   				-inherit -V does not work when used inside a parallel job:</FONT></FONT></FONT></P>
   				<P STYLE="margin-bottom: 0cm"><BR>
  @@ -464,6 +657,22 @@
   			</TD>
   		</TR>
   		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Interactive
  +				jobs fail when run via qsh, without error message.</FONT></FONT></FONT></P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>DISPLAY
  +				variable may be set incorrectly</FONT></FONT></FONT></P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P>Set DISPLAY correctly. To get error messages for this
  +				situation, upgrade to release 5.3p2 or higher.</P>
  +			</TD>
  +		</TR>
  +		<TR>
   			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
   				<P><A NAME="qmake"></A>Qmake</P>
   			</TH>
  @@ -547,10 +756,146 @@
   				installation will fail</FONT></FONT></P>
   			</TD>
   		</TR>
  +		<TR>
  +			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
  +				<P><A NAME="pe-ckpt"></A>Parallel/Checkpointing</P>
  +			</TH>
  +		</TR>
  +		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<P STYLE="font-weight: medium">Parts of Sun HPC ClusterTools
  +				parallel jobs (job script itself, child processes, etc) fail to
  +				stop when terminated by user or by qmaster.</P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
  +				user may not have supplied the necessary means (scripts) for SGE
  +				to control the distributed jobs.</FONT></FONT></FONT></P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P>Follow the complete HOW-TO instructions:
  +				<A HREF="http://supportforum.sun.com/gridengine/appnote_hpc.html">http://supportforum.sun.com/gridengine/appnote_hpc.html</A></P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Bugs
  +				in early versions of loose integration package</FONT></FONT></FONT></P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P>Update to SGE 5.3p2 (or higher) which includes latest MPI
  +				loose integration package</P>
  +			</TD>
  +		</TR>
  +		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<P>Parallel jobs that run with the tight integration of SGE5.3.x
  +				and HPC CT 5 are not terminated if one of the queues has wall
  +				clock limit set.</P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>A
  +				bug in SGE  prevented correct signal delivery to all parallel
  +				processes</FONT></FONT></FONT></P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P>SGE 5.3p4 contains the fix; for earlier 5.3.x versions, get
  +				corresponding patches from <A HREF="http://sunsolve.Sun.COM/pub-cgi/show.pl?target=patches/patch-access">Sunsolve</A>:</P>
  +				<P>SGE: 113136-04 (pkgadd Solaris 32-bit); 113137-04 (pkgadd
  +				Solaris 64-bit); 113138-04 (pkgadd Solaris X86); 113663-02
  +				(pkgadd common pkg); 113849-03 (tar.gz Solaris 32-bit); 113850-03
  +				(tar.gz Solaris 64-bit); 113851-03 (tar.gz Solaris X86);
  +				113852-04 (tar.gz Linux); 113853-02 (tar.gz common package)</P>
  +				<P>SGEEE: 113139-04 (pkgadd Solaris 32-bit); 113140-04 (pkgadd
  +				Solaris 64-bit); 113636-03 (pkgadd common pkg); 113855-03 (tar.gz
  +				Solaris 32-bit); 113856-03 (tar.gz Solaris 64-bit); 113900-02
  +				(tar.gz Linux); 113857-02 (tar.gz common package)</P>
  +			</TD>
  +		</TR>
  +		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<P>Parallel jobs that run with the tight integration of SGE5.3.x
  +				and HPC CT 5 would not suspend and resume correctly.</P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Another
  +				bug in SGE prevented STOP and CONT signals from being correctly
  +				delivered to all processes. </FONT></FONT></FONT>
  +				</P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P>Set the suspend/resume methods in the queues used for
  +				the parallel jobs to the appropriate scripts. These scripts can
  +				either be downloaded from the Grid Engine Project site at the
  +				<A HREF="http://gridengine.sunsource.net/servlets/ProjectDownloadList">File
  +				Exchange</A> or obtained from Sun support.</P>
  +				<P>Releases beyond 5.3p4 will ship with these two scripts, a
  +				README file and a parallel environment template.</P>
  +			</TD>
  +		</TR>
  +		<TR>
  +			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
  +				<P><A NAME="shadow"></A>Shadow Facility</P>
  +			</TH>
  +		</TR>
  +		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<P STYLE="font-weight: medium">After failover to shadow master,
  +				the schedd daemon remains running on the original qmaster</P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>This
  +				is a bug in earlier versions of SGE.</FONT></FONT></FONT></P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P>Update to 5.3p2 or higher</P>
  +			</TD>
  +		</TR>
  +		<TR>
  +			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
  +				<P STYLE="font-weight: medium">Shadow host fails to own
  +				mastership of SGE cluster</P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Lock
  +				file exists.</FONT></FONT></FONT></P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P>Remove $SGE_ROOT/&lt;cell&gt;/spool/qmaster/lock file if
  +				master host has crashed or can no longer function as
  +				qmaster.<BR><B>NOTE:</B> to force the shadow host to take over
  +				from another master, use the &ldquo;migrate&rdquo; option, i.e.
  +				&ldquo;rcsge -migrate&rdquo;.</P>
  +			</TD>
  +		</TR>
  +		<TR VALIGN=TOP>
  +			<TD WIDTH=50%>
  +				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Root
  +				needs R/W access to the $SGE_ROOT directory and its sub-directories
  +				from both master and shadow.</FONT></FONT></FONT></P>
  +			</TD>
  +			<TD WIDTH=50%>
  +				<P STYLE="margin-bottom: 0cm">Adjust permissions for root r/w
  +				access to the $SGE_ROOT directory and its sub-directories from
  +				shadow host.</P>
  +				<P><B>NOTE: </B><SPAN STYLE="font-weight: medium">please s</SPAN>ee
  +				the <A HREF="http://gridengine.sunsource.net/project/gridengine/howto/shadow.html">Shadow
  +				Master  HOWTO</A></P>
  +			</TD>
  +		</TR>
   	</TBODY>
   </TABLE>
   <P STYLE="margin-bottom: 0cm"><BR>
   </P>
  -<P STYLE="margin-bottom: 0cm">Last updated: <SDFIELD TYPE=DATETIME SDNUM="1033;1033;MMM D, YYYY">Apr 19, 2002</SDFIELD></P>
   </BODY>
   </HTML>
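
  The group-id check recommended in the commonproblems.html changes above
  can be scripted. The following is a minimal sketch and not part of this
  commit. It assumes the limit of 16 quoted in the HOWTO text, and it uses
  the POSIX `id -G` rather than the `id -a` named there. The function names
  count_groups and check_groups are hypothetical.

```shell
#!/bin/sh
# Sketch: warn when a user is already at the supplementary-group limit.
# ASSUMPTION: the limit is 16, as stated in the HOWTO text; adjust LIMIT
# if NGROUPS_MAX differs on your system.
LIMIT=16

# count_groups: print how many whitespace-separated group ids are in $1
count_groups() {
    set -- $1
    echo $#
}

# check_groups: report whether there is still room for the one extra
# group id that SGE adds to each job
check_groups() {
    n=`count_groups "$1"`
    if [ "$n" -ge "$LIMIT" ]; then
        echo "too many groups: $n (limit $LIMIT)"
    else
        echo "ok: $n groups"
    fi
}

# `id -G` lists the numeric group ids of the current user
check_groups "`id -G`"
```

  The comparison is "greater than or equal" because, per the text above,
  the failure already occurs when the user holds 16 group ids and SGE
  tries to add one more.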
  
  
  
  1.6       +89 -67    gridengine/www/howto/commontasks.html
  
  http://gridengine.sunsource.net/source/browse/gridengine/www/howto/commontasks.html.diff?r1=1.5&r2=1.6
  
  (In the diff below, changes in quantity of whitespace are not shown.)
  
  Index: commontasks.html
  ===================================================================
  RCS file: /cvs/gridengine/www/howto/commontasks.html,v
  retrieving revision 1.5
  retrieving revision 1.6
  diff -u -b -r1.5 -r1.6
  --- commontasks.html	2001/08/03 06:52:06	1.5
  +++ commontasks.html	2003/08/14 20:31:44	1.6
  @@ -1,68 +1,90 @@
  -<table border="0" cellpadding="2" cellspacing="0" width="100%">
  -<tr>
  -<td><H2><font color="#336699" class="PageHeader">Common Administrative Tasks for Grid Engine</font></H2></td>
  -</tr>
  -</table>
  -<table border="0" cellpadding="2" cellspacing="0" width="100%">
  -<tr>
  -<td>
  -
  -<br><br>
  -Qconf is the command used for most administrative tasks. This
  -HOWTO contains a selection of the most frequently used options. See 
  -qconf(1) for more details. 
  -</P>
  -<P><B>Adding and removing administrative privileges from a host</B></P>
  -<UL>
  -	<LI><P STYLE="margin-bottom: 0in">qconf -ah # gives host
  +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
  +<HTML>
  +<HEAD>
  +	<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
  +	<TITLE></TITLE>
  +	<META NAME="GENERATOR" CONTENT="StarOffice 6.0  (Solaris Sparc)">
  +	<META NAME="CREATED" CONTENT="20021028;15043100">
  +	<META NAME="CHANGEDBY" CONTENT="Charu Chaubal">
  +	<META NAME="CHANGED" CONTENT="20021028;15071800">
  +</HEAD>
  +<BODY LANG="en-US">
  +<TABLE WIDTH=100% BORDER=0 CELLPADDING=2 CELLSPACING=0 STYLE="page-break-before: always">
  +	<TR>
  +		<TD>
  +			<H2><FONT COLOR="#336699">Common Administrative Tasks for Grid
  +			Engine</FONT></H2>
  +		</TD>
  +	</TR>
  +</TABLE>
  +<TABLE WIDTH=100% BORDER=0 CELLPADDING=2 CELLSPACING=0>
  +	<TR>
  +		<TD>
  +			<P><BR><BR>Qconf is the command used for most administrative
  +			tasks. This HOWTO contains a selection of the most frequently used
  +			options. See qconf(1) for more details. 
  +			</P>
  +			<P><B>Adding and removing administrative privileges from a host</B></P>
  +			<UL>
  +				<LI><P STYLE="margin-bottom: 0cm">qconf -ah # gives host
   	administrative privileges 
   	</P>
   	<LI><P>qconf -dh # removes administrative privileges from host</P>
  -</UL>
  -<P><B>Adding an execution host</B></P>
  -<UL>
  +			</UL>
  +			<P><B>Adding an execution host</B></P>
  +			<UL>
   	<LI><P>Make the new host an administrative host<BR><BR>qconf -ah 
  -	</P>
  +				&lt;hostname&gt;</P>
   	<LI><P>As root on this new host, run the following script from
   	$SGE_ROOT<BR><BR>install_execd</P>
  -</UL>
  -<P><B>Removing an execution host</B></P>
  -<UL>
  +			</UL>
  +			<P><B>Removing an execution host</B></P>
  +			<UL>
   	<LI><P>First, delete the queues associated with this host<BR><BR>qconf
  -	-dq 
  -	</P>
  -	<LI><P>Delete the host<BR><BR>qconf -de 
  -	</P>
  -</UL>
  -<P><B>Adding and removing submit hosts</B></P>
  -<UL>
  -	<LI><P STYLE="margin-bottom: 0in">qconf -as # host is now a submit
  -	host 
  -	</P>
  -	<LI><P>qconf -ds # jobs may not be submitted from host</P>
  -</UL>
  -<P><B>Displaying current administrative/submit/execution hosts</B></P>
  -<UL>
  -	<LI><P STYLE="margin-bottom: 0in">qconf -sh # show current
  +				-dq &lt;queuenames...&gt;</P>
  +				<LI><P>Delete the host<BR><BR>qconf -de &lt;hostname&gt;</P>
  +				<LI><P>Finally, delete the configuration for the host<BR><BR>qconf
  +				-dconf  &lt;hostname&gt;</P>
  +			</UL>
  +			<P><BR><BR>
  +			</P>
  +			<P><B>Adding and removing submit hosts</B></P>
  +			<UL>
  +				<LI><P STYLE="margin-bottom: 0cm">qconf -as &lt;hostname&gt; #
  +				host is now a submit host 
  +				</P>
  +				<LI><P>qconf -ds &lt;hostname&gt; # jobs may not be submitted
  +				from host</P>
  +			</UL>
  +			<P><B>Displaying current administrative/submit/execution hosts</B></P>
  +			<UL>
  +				<LI><P STYLE="margin-bottom: 0cm">qconf -sh # show current
   	administrative hosts 
   	</P>
  -	<LI><P STYLE="margin-bottom: 0in">qconf -ss # show current submit
  +				<LI><P STYLE="margin-bottom: 0cm">qconf -ss # show current submit
   	hosts 
   	</P>
   	<LI><P>qconf -sel # show current execution host list 
   	</P>
  -</UL>
  -<P><B>Administering queues</B></P>
  -<UL>
  -	<LI><P STYLE="margin-bottom: 0in">qconf -aq # adding a queue</P>
  -	<LI><P STYLE="margin-bottom: 0in">qconf -dq # delete a queue 
  -	</P>
  -	<LI><P>qconf -mq # modify a queue 
  -	</P>
  -	<LI><P STYLE="margin-bottom: 0in">qconf -Aq # adding a queue from
  -	file</P>
  -	<LI><P STYLE="margin-bottom: 0in">qconf -mqattr # change single
  -	attributes of more than one queue</P>
  -</UL>
  -
  -</table>
  \ No newline at end of file
  +			</UL>
  +			<P><B>Administering queues</B></P>
  +			<UL>
  +				<LI><P STYLE="margin-bottom: 0cm">qconf -aq &lt;queuename&gt; #
  +				adding a queue</P>
  +				<LI><P STYLE="margin-bottom: 0cm">qconf -dq &lt;queuename&gt; #
  +				delete a queue 
  +				</P>
  +				<LI><P>qconf -mq &lt;queuename&gt; # modify a queue 
  +				</P>
  +				<LI><P STYLE="margin-bottom: 0cm">qconf -Aq &lt;filename&gt; #
  +				adding a queue from file</P>
  +				<LI><P>qconf -mattr queue ... # change single attributes of more
  +				than one queue</P>
  +			</UL>
  +		</TD>
  +	</TR>
  +</TABLE>
  +<P><BR><BR>
  +</P>
  +</BODY>
  +</HTML>
  \ No newline at end of file
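
  The three-step removal of an execution host described in the
  commontasks.html changes above (queues, then host, then host
  configuration) can be sketched as a dry-run helper. This is an
  illustration only and not part of this commit: the function name
  remove_host_cmds and its argument order are hypothetical, and it prints
  the qconf commands instead of running them.

```shell
#!/bin/sh
# Sketch: print the qconf commands for removing an execution host, in the
# order the HOWTO lists them. Nothing is executed here.
# ASSUMPTION: remove_host_cmds <hostname> <queuename>... is a hypothetical
# helper, not an SGE command.
remove_host_cmds() {
    if [ $# -lt 1 ]; then
        echo "usage: remove_host_cmds <hostname> [<queuename>...]" >&2
        return 1
    fi
    host=$1
    shift
    # 1. delete the queues associated with the host
    for q in "$@"; do
        echo "qconf -dq $q"
    done
    # 2. delete the execution host itself
    echo "qconf -de $host"
    # 3. delete the configuration for the host
    echo "qconf -dconf $host"
}

remove_host_cmds node01 node01.q
```

  Printing the commands first makes it possible to review them before
  piping the output to sh on an administrative host.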
  
  
  
  1.4       +97 -0     gridengine/www/howto/qrsh_ssh.html
  
  http://gridengine.sunsource.net/source/browse/gridengine/www/howto/qrsh_ssh.html.diff?r1=1.3&r2=1.4
  
  (In the diff below, changes in quantity of whitespace are not shown.)
  
  Index: qrsh_ssh.html
  ===================================================================
  RCS file: /cvs/gridengine/www/howto/qrsh_ssh.html,v
  retrieving revision 1.3
  retrieving revision 1.4
  diff -u -b -r1.3 -r1.4
  --- qrsh_ssh.html	2003/01/07 15:36:23	1.3
  +++ qrsh_ssh.html	2003/08/14 20:31:44	1.4
  @@ -1,3 +1,99 @@
  +<<<<<<< qrsh_ssh.html
  +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
  +<HTML>
  +<HEAD>
  +	<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
  +	<TITLE></TITLE>
  +	<META NAME="GENERATOR" CONTENT="StarOffice 6.0  (Solaris Sparc)">
  +	<META NAME="CREATED" CONTENT="20020529;10474700">
  +	<META NAME="CHANGEDBY" CONTENT="Charu Chaubal">
  +	<META NAME="CHANGED" CONTENT="20020529;12534600">
  +</HEAD>
  +<BODY LANG="en-US">
  +<P STYLE="margin-bottom: 0cm">&nbsp; 
  +</P>
  +<TABLE WIDTH=100% BORDER=0 CELLPADDING=2 CELLSPACING=0>
  +	<TR>
  +		<TD>
  +			<H2><FONT COLOR="#336699">Using ssh with qrsh</FONT></H2>
  +		</TD>
  +	</TR>
  +</TABLE>
  +<TABLE WIDTH=100% BORDER=0 CELLPADDING=2 CELLSPACING=0>
  +	<COL WIDTH=256*>
  +	<TR>
  +		<TD WIDTH=100%>
  +			<P>By default, the Grid Engine command <B>qrsh</B> will use
  +			standard remote mechanisms (rsh/rlogin) to establish interactive
  +			sessions. 
  +			</P>
  +			<UL>
  +				<LI><P><B>qrsh</B> by itself will use rlogin</P>
  +				<LI><P><B>qrsh</B> with a command will establish a rsh
  +				connection. 
  +				</P>
  +			</UL>
  +			<P>To enable the rsh/rlogin mechanism, special rsh and rlogin
  +			binaries are provided with Grid Engine (found in
  +			$SGE_ROOT/utilbin/$ARCH). In addition, to have full accounting and
  +			process control for interactive jobs, an extended <B>rshd</B>
  +			comes with Grid&nbsp;Engine. 
  +			</P>
  +			<P>As an alternative, Grid Engine can be configured to use <B>ssh</B>
  +			instead to start interactive jobs. <BR>&nbsp; 
  +			</P>
  +			<H3>Advantages of using ssh:</H3>
  +			<UL>
  +				<LI><P STYLE="margin-bottom: 0cm">secure connection 
  +				</P>
  +				<LI><P STYLE="margin-bottom: 0cm">no need to have suid root
  +				programs installed (rsh and rlogin have to be suid root)</P>
  +				<LI><P STYLE="margin-bottom: 0cm">much larger number of running
  +				sessions per host (not limited by port number &lt; 1024)</P>
  +				<LI><P STYLE="margin-bottom: 0cm">compression (if lots of data
  +				pushed through STDIN/STDOUT)</P>
  +				<LI><P>possibility to attach a tty to remotely executed commands
  +				(ssh option -t)</P>
  +			</UL>
  +			<H3 STYLE="margin-top: 0cm; margin-bottom: 0cm">Disadvantages:</H3>
  +			<UL>
  +				<LI><P STYLE="margin-bottom: 0cm">Lack of complete accounting 
  +				</P>
  +				<LI><P>lack of process control (reprioritization) 
  +				</P>
  +			</UL>
  +		</TD>
  +	</TR>
  +</TABLE>
  +<H3>How to set up ssh for qrsh:</H3>
  +<P STYLE="margin-bottom: 0cm">Have ssh working, all keys created ... 
  +</P>
  +<P STYLE="margin-bottom: 0cm">Set the parameters rsh_daemon and
  +rlogin_daemon in your cluster configuration to ssh: 
  +</P>
  +<UL>
  +	<LI><P>rsh_daemon: /usr/sbin/sshd -i</P>
  +	<LI><P>rlogin_daemon: /usr/sbin/sshd -i 
  +	</P>
  +</UL>
  +<P STYLE="margin-bottom: 0cm">If you have execution hosts with
  +different architectures that have different paths to ssh, you will
  +have to make these settings for each execution host individually
  +(qconf -mconf host), else you can change the global cluster
  +configuration (qconf -mconf).</P>
  +<P>Set the parameters rsh_command and rlogin_command in your cluster
  +configuration to ssh:</P>
  +<UL>
  +	<LI><P>rsh_command&nbsp;&nbsp;&nbsp;&nbsp; /usr/bin/ssh</P>
  +	<LI><P>rlogin_command&nbsp; /usr/bin/ssh 
  +	</P>
  +</UL>
  +<P>If you have submit hosts with different architectures that have
  +different paths to ssh, you will have to make these settings for each
  +submit host individually (qconf -mconf host), else you can change the
  +global cluster configuration (qconf -mconf). <BR>&nbsp; <BR>&nbsp; <BR>&nbsp;</P>
  +</BODY>
  +</HTML>=======
   <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
   <html>
   <head>
  @@ -109,3 +205,4 @@
   <br><tt><font color="#000000">exec /usr/sbin/sshd -i</font></tt>
   </body>
   </html>
  +>>>>>>> 1.2
  
  
  
  1.5       +239 -238  gridengine/www/howto/scripting.html
  
  http://gridengine.sunsource.net/source/browse/gridengine/www/howto/scripting.html.diff?r1=1.4&r2=1.5
  
  (In the diff below, changes in quantity of whitespace are not shown.)
  
  Index: scripting.html
  ===================================================================
  RCS file: /cvs/gridengine/www/howto/scripting.html,v
  retrieving revision 1.4
  retrieving revision 1.5
  diff -u -b -r1.4 -r1.5
  --- scripting.html	2002/03/15 18:11:47	1.4
  +++ scripting.html	2003/08/14 20:31:44	1.5
  @@ -1,12 +1,12 @@
   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
   <HTML>
   <HEAD>
  -	<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=windows-1252">
  +	<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
   	<TITLE></TITLE>
  -	<META NAME="GENERATOR" CONTENT="StarOffice 6.0  (Win32)">
  +	<META NAME="GENERATOR" CONTENT="StarOffice 6.0  (Solaris Sparc)">
   	<META NAME="AUTHOR" CONTENT=" ">
   	<META NAME="CREATED" CONTENT="20020111;13083600">
  -	<META NAME="CHANGED" CONTENT="20020315;9310466">
  +	<META NAME="CHANGED" CONTENT="20020319;12001100">
   	<STYLE>
   	<!--
   		H2 { font-family: "Sunsans Demi" }
  @@ -15,7 +15,7 @@
   	-->
   	</STYLE>
   </HEAD>
  -<BODY>
  +<BODY BGCOLOR="#ffffff">
   <H1><FONT COLOR="#336699"><FONT SIZE=4 STYLE="font-size: 16pt"><B>Command
   Line and Scripting of Administrative Tasks in Grid Engine</B></FONT></FONT></H1>
   <P STYLE="margin-bottom: 0cm">The <B>qmon(1) </B>graphical user
  @@ -61,20 +61,21 @@
   with the &quot;show&quot; option of <B>qconf</B> (<B>qconf -s&lt;obj&gt;</B>)
   to take an existing object, modify it, and then update the existing
   object or create a new one.</P>
  -<H4>Example: Write a shell script to modify the <I>migration command
  -</I><SPAN STYLE="font-style: normal">of an existing checkpoint
  -environment</SPAN></H4>
  +<H4>Example: Write a shell script to specify queues of a <SPAN STYLE="font-style: normal">checkpoint
  +environment</SPAN> from a list in a file</H4>
   <PRE>#!/bin/sh
  -# ckptmod.sh: modify the migration command 
  -# of a checkpointing environment
  -# Usage: ckptmod.sh &lt;checkpoint-env-name&gt; &lt;full-path-to-command&gt;
  -TMPFILE=/tmp/ckptmod.$$
  +# ckptq.sh: specify queues of a checkpoint from a list in a file
  +# Usage: ckptq.sh &lt;checkpoint-env-name&gt; &lt;filename&gt;
  +# &lt;filename&gt; contains a list of queues,
  +#    separated by commas and/or newlines
   
  +TMPFILE=/tmp/ckptq.$$
   CKPT=$1
  -MIGMETHOD=$2
  +QUEUELIST=$2
   
  -qconf -sckpt $CKPT | grep -v '^migr_command' &gt; $TMPFILE
  -echo &quot;migr_command $MIGMETHOD&quot; &gt;&gt; $TMPFILE
  +qconf -sckpt $CKPT | grep -v 'queue_list' &gt; $TMPFILE
  +echo  queue_list `cat $QUEUELIST | \
  +    tr &quot;\012&quot; &quot; &quot; | tr &quot;,&quot; &quot; &quot;` &gt;&gt; $TMPFILE
   qconf -Mckpt $TMPFILE
   rm $TMPFILE</PRE>
   <HR>
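
  The list handling in the new ckptq.sh example can be tried in isolation.
  The sketch below is not part of this commit and the function name
  normalize_queues is hypothetical. It applies the same idea as the script:
  turn commas and newlines into blanks with tr, then let an unquoted
  command substitution collapse the runs of blanks.

```shell
#!/bin/sh
# normalize_queues: read a queue list on stdin, separated by commas
# and/or newlines, and print the names on one space-separated line.
# ASSUMPTION: queue names contain no blanks or shell glob characters,
# since the unquoted expansion below is subject to word splitting.
normalize_queues() {
    echo `tr ',\012' '  '`
}

# Example input mixing both separators
printf 'q1,q2\nq3\n' | normalize_queues
```

  The result has the form expected after the queue_list keyword in the
  file that ckptq.sh passes to qconf -Mckpt.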
  
  
  

