
List:       npaci-rocks-discussion
Subject:    [Rocks-Discuss] Re: Adding programs to compute installs
From:       Aaron Carr <aaronhcarr@gmail.com>
Date:       2015-04-22 17:48:12
Message-ID: CANyxMSp8Pv+gryL_4H362QjEGnyk9fauerfkLYhdN4NjcJH=cg@mail.gmail.com

When you add nodes to the cluster, you define what you're adding, so at
that point Rocks knows.

You can also define them to the scheduler via queues.

I also use pdsh with the genders module, which is useful when I only
want to interact with a specific node type.

For example, let's say I want to reinstall my GPU nodes so that I can
upgrade CUDA from 5.5 to 6.

I make my changes to the install scripts and test on one node.  When I'm
satisfied that it's ready, I'll do a for loop to set those nodes to
install, then do pdsh -g gpu reboot.

I'm just using GPU as an example.
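
The loop itself is nothing fancy; something like the following (a rough
sketch -- "gpu" is just whatever attribute you put in your genders file):

# mark every node in the "gpu" genders group for reinstall on next boot
for host in $(nodeattr -n gpu); do
    rocks set host boot $host action=install
done

# then kick them all over at once
pdsh -g gpu reboot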

I also break nodes up by rack, chassis, row, etc. in the pdsh genders
file.  It makes things easier.
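
To give you an idea, a genders file along these lines is what I mean
(hostnames and attributes below are only illustrative):

# /etc/genders -- one line per node (or hostrange), comma-separated attributes
compute-0-[0-15]   compute,rack0,row1
compute-1-[0-15]   compute,rack1,row1
gpu-0-[0-3]        compute,gpu,rack2,row1
bigmem-0-0         compute,bigmem,rack2,row1

Then "pdsh -g rack1 uptime" or "pdsh -g gpu reboot" only touches that
slice of the cluster.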

Aaron

On Wednesday, April 22, 2015, Stephan Henning <shenning@gmail.com> wrote:

> I think Aaron touched on my confusion here.
> 
> It sounds like you both have created hardware-specific appliance types that
> you are using to make it easier to manage what software gets applied to
> what hardware. I guess my real question is: how does Rocks handle the
> creation of user-configured/custom appliance types, and how does Rocks
> utilize this feature? Is it only something that is useful during the
> installation process?
> 
> -stephan
> 
> On Wed, Apr 22, 2015 at 12:20 PM, Gowtham <g@mtu.edu> wrote:
> 
> > 
> > Hello Stephan,
> > 
> > Yes, I take the same approach as Aaron described. A queue configuration
> > -- e.g., gpu.q, which includes the compute nodes that have GPUs in them --
> > can then be selected for jobs/simulations that require GPUs.
> > 
> > I use the Grid Engine that comes with the standard Rocks distribution, and
> > Grid Engine has a very active mailing list as well, should you need more
> > information about it.
> > 
> > Best regards,
> > g
> > 
> > --
> > Gowtham, PhD
> > Director of Research Computing, IT
> > Adj. Asst. Professor, Physics/ECE
> > Michigan Technological University
> > 
> > P: (906) 487-3593
> > F: (906) 487-2787
> > http://it.mtu.edu
> > http://hpc.mtu.edu
> > 
> > 
> > On Wed, 22 Apr 2015, Aaron Carr wrote:
> > 
> > > Standard option?  Not sure what you mean.  If you mean built-in, then no.
> > > 
> > > You can add appliances to Rocks.
> > > 
> > > It has zero to do with scheduling.
> > > 
> > > It has more to do with having large numbers of hosts of varying types.
> > > 
> > > Some compute, some big memory (2TB), some GPU, some Xeon Phi, etc.
> > > 
> > > That way you can have the various types install software or perform
> > > special configurations on themselves based on that appliance type.
> > > 
> > > My suggestion for scheduling would be to add the nodes to specialty
> > > queues by their attributes.
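> > > 
> > > If it helps, the moving pieces for a new appliance type are roughly a
> > > node XML file, possibly a graph edge under site-profiles so it inherits
> > > the compute configuration, and a "rocks add appliance" entry. A sketch
> > > from memory (double-check the exact arguments against the "Adding a New
> > > Appliance Type" section of the Rocks users guide):
> > > 
> > > cd /export/rocks/install/site-profiles/6.1/nodes/
> > > cp skeleton.xml gpu-compute.xml    # appliance-specific packages/post bits go here
> > > 
> > > # register the appliance with Rocks, then rebuild the distribution
> > > rocks add appliance gpu-compute membership='GPU Compute' node=gpu-compute
> > > cd /export/rocks/install
> > > rocks create distro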
> > > 
> > > On Wednesday, April 22, 2015, Stephan Henning <shenning@gmail.com> wrote:
> > > 
> > > > Aaron, Gowtham,
> > > > 
> > > > 
> > > > Both of you have mentioned having alternate appliance types of
> > > > 'gpu-compute'.
> > > > Is this a standard option for Rocks? How does Rocks utilize this to
> > > > determine if a different type of compute node is needed for a job?
> > > > 
> > > > 
> > > > -stephan
> > > > 
> > > > On Wed, Apr 22, 2015 at 9:07 AM, Aaron Carr <aaronhcarr@gmail.com> wrote:
> > > > 
> > > > > Here's an example of how we install CUDA.
> > > > > 
> > > > > First, the nodes are a different appliance type (gpu-compute).
> > > > > 
> > > > > So now we have the standard extend-compute.xml that gets processed,
> > > > > then it will process our gpu-compute.xml file, which has the following:
> > > > > 
> > > > > <post>
> > > > > 
> > > > > <!--Install NVidia CUDA-->
> > > > > <file name="/etc/rc.d/rocksconfig.d/post-80-nvidia-cuda" perms="0755">
> > > > > #!/bin/sh
> > > > > 
> > > > > # Make sure that networking is up and the NFS share is accessible.
> > > > > 
> > > > > while [ ! -f /share/apps/Nvidia/cuda/install_cuda.sh ];
> > > > > do
> > > > > sleep 3;
> > > > > done
> > > > > 
> > > > > /share/apps/Nvidia/cuda/install_cuda.sh
> > > > > 
> > > > > rm /etc/rc.d/rocksconfig.d/post-80-nvidia-cuda
> > > > > 
> > > > > </file>
> > > > > 
> > > > > </post>
> > > > > 
> > > > > The install_cuda.sh script creates an nvidia service that starts the
> > > > > driver, installs the driver from the run file, etc.
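> > > > > 
> > > > > The script itself is pretty site-specific, but the gist is something
> > > > > like this (the paths and the .run file name below are placeholders):
> > > > > 
> > > > > #!/bin/sh
> > > > > # rough sketch of what our install_cuda.sh does
> > > > > RUNFILE=/share/apps/Nvidia/cuda/NVIDIA-Linux-x86_64-XXX.YY.run
> > > > > 
> > > > > # install the driver non-interactively from the .run file
> > > > > sh $RUNFILE --silent
> > > > > 
> > > > > # drop in an init script that loads the driver at boot and enable it
> > > > > cp /share/apps/Nvidia/cuda/nvidia.init /etc/init.d/nvidia
> > > > > chmod 755 /etc/init.d/nvidia
> > > > > chkconfig --add nvidia
> > > > > chkconfig nvidia on
> > > > > service nvidia start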
> > > > > 
> > > > > 
> > > > > Just one example for you.
> > > > > 
> > > > > Aaron
> > > > > 
> > > > > On Wed, Apr 22, 2015 at 6:25 AM, Gowtham <g@mtu.edu> wrote:
> > > > > 
> > > > > > 
> > > > > > Hello Stephan,
> > > > > > 
> > > > > > You can go about adding programs to the compute nodes in your
> > > > > > cluster in a couple of different ways.
> > > > > > 
> > > > > > 1. Placing them (i.e., copying/installing/compiling as the
> > > > > > case may be) under
> > > > > > 
> > > > > > /share/apps/
> > > > > > 
> > > > > > on the front end. This location is seen 'as is' on every
> > > > > > compute node (and on login, tile, NAS, and development
> > > > > > nodes as well). This means you only need to do it
> > > > > > once, and the process is transparent to/unaffected by
> > > > > > compute (or other kind of) node (re)installation.
> > > > > > 
> > > > > > For example, I use the following naming convention for
> > > > > > software suite XYZ version 1.0:
> > > > > > 
> > > > > > /share/apps/XYZ/1.0/
> > > > > > 
> > > > > > So far, this approach has given me the flexibility of
> > > > > > maintaining different versions of a given software suite.
> > > > > > In cases where one research group requires XYZ 1.0 compiled
> > > > > > with GCC 4.4.6 and another research group requires XYZ 1.0
> > > > > > compiled with Intel 2013.0.028, the aforementioned scheme
> > > > > > can be easily expanded to something like
> > > > > > 
> > > > > > /share/apps/XYZ/1.0/gcc/4.4.6/
> > > > > > /share/apps/XYZ/1.0/intel/2013.0.028/
> > > > > > 
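> > > > > > As a concrete (hypothetical) example of #1, an autotools-based
> > > > > > package XYZ 1.0 built with GCC 4.4.6 would be installed once on
> > > > > > the front end with something like
> > > > > > 
> > > > > > cd /tmp/XYZ-1.0
> > > > > > ./configure --prefix=/share/apps/XYZ/1.0/gcc/4.4.6
> > > > > > make
> > > > > > make install
> > > > > > 
> > > > > > and every node then sees the result at that same path over NFS.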
> > > > > > 
> > > > > > 2. Updating the extend-NODE.xml file (where NODE is one of
> > > > > > compute/nas/login/viz) found on the front end under, for
> > > > > > Rocks 6.1,
> > > > > > 
> > > > > > /export/rocks/install/site-profiles/6.1/nodes/
> > > > > > 
> > > > > > with required changes/instructions, verifying the XML
> > > > > > file for syntax errors using the command
> > > > > > 
> > > > > > xmllint -noout extend-NODE.xml
> > > > > > 
> > > > > > and re-building the Rocks distribution using the
> > > > > > commands
> > > > > > 
> > > > > > cd /export/rocks/install
> > > > > > rocks create distro
> > > > > > 
> > > > > > Once successfully completed, the NODE will pick up the
> > > > > > relevant instructions in extend-NODE.xml during
> > > > > > (re)installation.
> > > > > > 
> > > > > > The verification of the XML file and rebuilding of the Rocks
> > > > > > distribution need to be done every time you make an edit
> > > > > > to any of the extend-NODE.xml files.
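> > > > > > 
> > > > > > For instance, a minimal extend-compute.xml under #2 that pulls in
> > > > > > one extra RPM and runs a command at install time might look like
> > > > > > this (the package and directory names are only placeholders):
> > > > > > 
> > > > > > <?xml version="1.0" standalone="no"?>
> > > > > > <kickstart>
> > > > > > 
> > > > > > <!-- extra RPM available in the Rocks distribution/contrib area -->
> > > > > > <package>environment-modules</package>
> > > > > > 
> > > > > > <post>
> > > > > > <!-- shell commands here run at the end of node (re)installation -->
> > > > > > mkdir -p /opt/site
> > > > > > </post>
> > > > > > 
> > > > > > </kickstart>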
> > > > > > 
> > > > > > 
> > > > > > 3. Suppose that you have four compute nodes that have NVIDIA
> > > > > > GPUs (say, compute-0-N, N = 101:1:104) and that you wish
> > > > > > to install a specific NVIDIA driver automatically (say,
> > > > > > NVIDIA-Linux-x86_64-331.38.run). You can follow the steps
> > > > > > below (derived from prior mailing list interactions):
> > > > > > 
> > > > > > A. Designate those four nodes as GPU nodes
> > > > > > 
> > > > > > rocks list bootaction
> > > > > > rocks add bootaction action="gpuinstall" \
> > > > > > args="[SAME_ARGS_AS_FOR_BOOTACTION_install] \
> > > > > > rdblacklist=nouveau nouveau.modeset=0"
> > > > > > 
> > > > > > for x in `seq 101 1 104`
> > > > > > do
> > > > > > rocks set host installaction compute-0-$x \
> > > > > > action="gpuinstall"
> > > > > > rocks add host attr compute-0-$x gpunode true
> > > > > > rocks sync config
> > > > > > ssh compute-0-$x '/boot/kickstart/cluster-kickstart-pxe'
> > > > > > done
> > > > > > 
> > > > > > 
> > > > > > B. Place NVIDIA-Linux-x86_64-331.38.run on the front end under
> > > > > > 
> > > > > > /export/rocks/install/contrib/6.1/x86_64/
> > > > > > 
> > > > > > 
> > > > > > C. Add the content below after the </post> section in
> > > > > > extend-compute.xml
> > > > > > 
> > > > > > <post cond="gpunode">
> > > > > > <!-- compute-0-101 through compute-0-104 only -->
> > > > > > mkdir /tmp/nvidia/
> > > > > > cd /tmp/nvidia/
> > > > > > rm -f NVIDIA-Linux-x86_64-331.38.run
> > > > > > wget http://127.0.0.1/install/contrib/6.1/x86_64/NVIDIA-Linux-x86_64-331.38.run
> > > > > > chmod 744 ./NVIDIA-Linux-x86_64-331.38.run
> > > > > > ./NVIDIA-Linux-x86_64-331.38.run --silent
> > > > > > </post>
> > > > > > 
> > > > > > D. Rebuild the Rocks distribution as detailed in #2
> > > > > > 
> > > > > > E. Reinstall the nodes, compute-0-N (N=101:1:104)
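> > > > > > 
> > > > > > (Once the nodes come back up, a quick sanity check from the front
> > > > > > end could be something like
> > > > > > 
> > > > > > for x in `seq 101 1 104`
> > > > > > do
> > > > > > ssh compute-0-$x 'nvidia-smi -L'
> > > > > > done
> > > > > > 
> > > > > > which should list the GPUs on each node if the driver installed
> > > > > > cleanly.)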
> > > > > > 
> > > > > > 
> > > > > > Hope this helps to get you started. You can further extend the above
> > > > > > concepts/procedures for other software suites as necessary.
> > > > > > 
> > > > > > Best regards,
> > > > > > g
> > > > > > 
> > > > > > --
> > > > > > Gowtham, PhD
> > > > > > Director of Research Computing, IT
> > > > > > Adj. Asst. Professor, Physics/ECE
> > > > > > Michigan Technological University
> > > > > > 
> > > > > > P: (906) 487-3593
> > > > > > F: (906) 487-2787
> > > > > > http://it.mtu.edu
> > > > > > http://hpc.mtu.edu
> > > > > > 
> > > > > > 
> > > > > > On Wed, 22 Apr 2015, John Hearns wrote:
> > > > > > 
> > > > > > > Stephan
> > > > > > > I would suggest looking for the SDSC Rocks Rolls on GitHub.
> > > > > > > The applications you mention are installed with those rolls --
> > > > > > > plus lots of others!