Ubuntu 11.04 cluster compute
As of 2011-07-18, no Ubuntu-based public AMI is available to use with StarCluster on cluster compute instances. This should change soon (see issue #31). Here are my notes on setting up such an AMI based on Ubuntu 11.04 for use with StarCluster 0.92rc2. I have not made any attempt at getting it working with the cluster GPU instances.
Use an Ubuntu 11.04 Natty HVM (cluster compute) AMI built by Canonical. In us-east-1 this is ami-1cad5275 (as of 2011-07-18), as suggested by http://alestic.com/. Launch it from the Alestic website. It shows up as:
(starcluster)user@localhost:~$ starcluster listinstances
StarCluster - (http://web.mit.edu/starcluster) (v. 0.92rc2)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

id: i-bd2fdedc
dns_name: ec2-50-17-72-101.compute-1.amazonaws.com
private_dns_name: ip-10-17-2-245.ec2.internal
state: running
public_ip: 50.17.72.101
private_ip: 10.17.2.245
zone: us-east-1b
ami: ami-1cad5275
type: cc1.4xlarge
groups: default
keypair: starcluster_1
uptime: 00:01:31
Log in as the ubuntu user, then switch to root:
user@localhost:~$ starcluster sshinstance -u ubuntu i-bd2fdedc
ubuntu@ip-...:~$ sudo -i
root@ip-...:~#
Update the system:
root@ip-...:~# apt-get update
root@ip-...:~# apt-get upgrade
The AMI was configured to disable root logins, but root login is needed for StarCluster:
- edit /etc/cloud/cloud.cfg and set disable_root: 0
- edit /root/.ssh/authorized_keys and remove the prefix commands from the pubkey entry
- edit /usr/bin/cloud-init, go to line 143 and change 'once-per-instance' to 'always'.
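For repeatability, the three edits can also be scripted. A minimal sketch, assuming the stock Natty cloud.cfg syntax (disable_root: 1), a single pubkey line in authorized_keys, and the 'once-per-instance' string on line 143 as noted above; double-check each file afterwards:
# allow root logins via cloud-init (assumes the stock 'disable_root: 1' entry)
sed -i 's/disable_root: 1/disable_root: 0/' /etc/cloud/cloud.cfg
# strip the command="..." prefix cloud-init prepends to the root pubkey
# (assumes one ssh-rsa key per line)
sed -i 's/^.*ssh-rsa/ssh-rsa/' /root/.ssh/authorized_keys
# re-run the root-key setup on every boot instead of once per instance
sed -i '143s/once-per-instance/always/' /usr/bin/cloud-init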
Install the following packages:
root@ip-...:~# apt-get install portmap nfs-common nfs-kernel-server rxvt
Then add a symbolic link (StarCluster expects an init script named nfs; it is present on StarCluster's Ubuntu 10.10 AMI but not on Ubuntu 11.04):
root@ip-...:~# ln -s /etc/init.d/nfs-kernel-server /etc/init.d/nfs
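A quick check that the link does what StarCluster needs (it should behave exactly like invoking /etc/init.d/nfs-kernel-server start directly):
root@ip-...:~# /etc/init.d/nfs start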
StarCluster 0.92rc2 uses Sun Grid Engine version 6.2u5 installed in /opt/sge6. Since Oracle bought Sun, the installation files are no longer easily available on the Internet. Looking into StarCluster's Ubuntu 10.04 64-bit AMI, /opt/ contains /opt/sge6-fresh but also 2 versions of the drmaa-python bindings (0.2 and 0.4b3) [not used by StarCluster anymore?].
After launching an instance of that AMI with the AWS Management Console (public DNS looking like ec2-....compute-1.amazonaws.com), grab the folder with:
user@localhost:~$ ssh -i ~/.ssh/myssh_certificate.pem root@ec2-....compute-1.amazonaws.com
root@domU...:/# tar -caf opt_starcluster.tar.gz ./opt
user@localhost:~$ scp -i ~/.ssh/myssh_certificate.pem root@ec2-....compute-1.amazonaws.com:/opt_starcluster.tar.gz ./
That instance can now be terminated. Copy the archive onto the Natty instance being prepared and unpack it:
user@localhost:~$ scp -i ~/.ssh/myssh_certificate.pem opt_starcluster.tar.gz root@ec2-....compute-1.amazonaws.com:.
root@ip-...:~# tar -xf opt_starcluster.tar.gz
root@ip-...:~# mv ./opt /
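A quick sanity check that the tree landed where StarCluster expects it (the sge6-fresh folder described above should be listed; any extra entries depend on what the 10.04 AMI shipped):
root@ip-...:~# ls /opt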
Create the following symbolic link (because it is not present in Natty):
root@ip-...:~# ln -s /lib64/x86_64-linux-gnu/libc-2.13.so /lib64/libc.so.6
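Because the patched arch script below executes libc and parses its first output line, it is worth checking now that the new link both resolves and runs; it should print a "GNU C Library ... version 2.13" banner (exact wording may vary on this AMI):
root@ip-...:~# /lib64/libc.so.6 | head -n 1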
/opt/sge6-fresh/util/arch needs to be patched as discussed in http://comments.gmane.org/gmane.comp.clustering.gridengine.users/21495.
Replace lines 64 to 66 with:
ossysname="`$UNAME -s`" 2>/dev/null || ossysname=unknown
osmachine="`$UNAME -m`" 2>/dev/null || osmachine=unknown
osrelease="`$UNAME -r`" 2>/dev/null || osrelease=unknown
Replace line 237 with:
libc=/lib64/libc.so.6
Replace line 240 with (optional):
libc=/lib/libc.so.6.1
Replace line 243 with (optional):
libc=/lib/libc.so.6
Insert a new line 247:
libc_string=`$libc | head -n 1`
Replace lines 253 and 254 with:
libc_version=`expr "$libc_string" : ".* version [0-9]*\\.\([0-9]*\)" 2>/dev/null`
if [ $? -ne 0 -o $libc_version -lt 2 ]; then
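After patching, a quick way to verify is to run the script; on this 64-bit host it should print an SGE architecture string rather than an error (lx24-amd64 is what I would expect here, but treat that exact value as an assumption):
root@ip-...:~# /opt/sge6-fresh/util/arch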
Install the build dependencies for the libopenmpi-dev package:
root@ip-...:~# apt-get build-dep libopenmpi-dev
Get the source for the libopenmpi-dev Debian package:
root@ip-...:~# cd /usr/local/src
root@ip-...:/usr/local/src# mkdir openmpi
root@ip-...:/usr/local/src# cd openmpi
root@ip-...:/usr/local/src/openmpi# apt-get source libopenmpi-dev
Change into the libopenmpi-dev package's debian folder:
root@ip-...:/usr/local/src/openmpi# cd openmpi-1.4.3/debian
Modify the rules file and add --with-sge to the configure arguments on line 61. Use tabs, not spaces! It should look something like this (the last 2 lines were modified):
COMMON_CONFIG_PARAMS = \
	$(CROSS) \
	$(CHKPT) \
	$(NUMA) \
	--prefix=/usr \
	--mandir=\$${prefix}/share/man \
	--infodir=\$${prefix}/share/info \
	--sysconfdir=/etc/openmpi \
	--libdir=\$${prefix}/lib/openmpi/lib \
	--includedir=\$${prefix}/lib/openmpi/include \
	--with-devel-headers \
	--enable-heterogeneous \
	$(TORQUE) \
	--with-sge
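A quick way to confirm the edit took (a plain grep, nothing package-specific):
root@ip-...:/usr/local/src/openmpi/openmpi-1.4.3/debian# grep -n -- '--with-sge' rules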
Rebuild the libopenmpi-dev package:
root@ip-...:/usr/local/src/openmpi/openmpi-1.4.3/debian# cd ..
root@ip-...:/usr/local/src/openmpi/openmpi-1.4.3# dpkg-buildpackage -rfakeroot -b
Install the newly rebuilt package:
root@ip-...:/usr/local/src/openmpi/openmpi-1.4.3# cd ..
root@ip-...:/usr/local/src/openmpi# dpkg -i *.deb
Verify Sun Grid Engine support:
root@ip-...:~# ompi_info | grep -i grid
It should return:
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4.3)
First we need to make an AMI of the current instance (i-bd2fdedc below):
(starcluster)user@localhost:~$ starcluster ebsimage i-bd2fdedc test-sge-install -d 'Temporary AMI to test the SGE install'
StarCluster - (http://web.mit.edu/starcluster) (v. 0.92rc2)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

>>> Removing private data...
>>> Creating EBS image...
>>> Waiting for AMI ami-af11d5c6 to become available...
>>> create_image took 6.982 mins
>>> Your new AMI id is: ami-af11d5c6
Now let's check whether it actually works. Add a new cluster configuration using the newly created AMI to ~/.starcluster/config, for example the following lines:
[cluster computecluster]
# change this to the name of one of the keypair sections defined above
KEYNAME = myssh_certificate
# number of ec2 instances to launch
CLUSTER_SIZE = 2
# create the following user on the cluster
CLUSTER_USER = sgeadmin
# optionally specify shell (defaults to bash)
CLUSTER_SHELL = bash
# AMI for cluster nodes.
NODE_IMAGE_ID = ami-af11d5c6
# instance type for all cluster nodes
NODE_INSTANCE_TYPE = cc1.4xlarge
# list of volumes to attach to the master node (OPTIONAL)
# these volumes, if any, will be NFS shared to the worker nodes
# VOLUMES = computeclusterhome
Then we can launch the cluster (the spothistory command just shows recent spot prices to help choose a bid):
user@localhost:~$ starcluster spothistory -d 50 cc1.4xlarge
user@localhost:~$ starcluster start --login-master --bid=1.6 --cluster-template=computecluster testcluster
Log into the cluster with:
user@localhost:~$ starcluster sshmaster -u sgeadmin testcluster
and verify the installation by following Sun's procedure.
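Independent of that procedure, a couple of standard SGE commands give a fast sanity check that both nodes registered (master and node001 are StarCluster's default host names):
sgeadmin@master:~$ qhost        # every node should appear with its load and memory
sgeadmin@master:~$ qconf -sel   # list of execution hosts known to SGE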
On the testcluster created above, we compile and run a "Hello World" OpenMPI program:
sgeadmin@master:~$ mkdir mpi_test
sgeadmin@master:~$ cd mpi_test
sgeadmin@master:~/mpi_test$ vim mpi_hello.c
Write the following program source code as mpi_hello.c:
/* The Parallel Hello World Program */
#include <stdio.h>  /* printf and BUFSIZ defined there */
#include <stdlib.h> /* exit defined there */
#include <mpi.h>    /* all MPI-2 functions defined there */

int main(int argc, char **argv)
{
    int rank, size, length;
    char name[BUFSIZ];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &length);

    printf("%s: Hello World from process %d of %d\n", name, rank, size);

    MPI_Finalize();
    exit(0);
}
Compile it:
sgeadmin@master:~/mpi_test$ mpicc ./mpi_hello.c -o ./mpi_hello
Run it first directly with mpirun, then through SGE:
sgeadmin@master:~/mpi_test$ mpirun -n 16 ./mpi_hello
sgeadmin@master:~/mpi_test$ qsub -b y -cwd -pe orte 24 mpirun ./mpi_hello
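The submitted job can be followed with standard SGE commands (nothing specific to this setup):
sgeadmin@master:~/mpi_test$ qstat        # pending/running jobs
sgeadmin@master:~/mpi_test$ qstat -g c   # per-queue summary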
Check the job's output and error files (mpirun.o<jobid> and mpirun.e<jobid>, written to the submission directory because of -cwd).
For ATLAS, do a custom build; see README.Debian (consult a version newer than the one in Natty to get the right information).
First get the build dependencies (devscripts was missing somehow):
root@ip-...:~# apt-get build-dep atlas
root@ip-...:~# apt-get install devscripts
The package version in Natty (atlas-3.8.3-29) fails when trying to do a custom build; it is a confirmed package bug. I chose to backport the Oneiric package 3.8.4-3 (Oneiric is not released as of 2011-07-18). It goes like this:
root@ip-...:~# cd /usr/local/src
root@ip-...:/usr/local/src# mkdir atlas
root@ip-...:/usr/local/src# cd atlas
root@ip-...:/usr/local/src/atlas# wget https://launchpad.net/ubuntu/oneiric/+source/atlas/3.8.4-3/+files/atlas_3.8.4.orig.tar.bz2
root@ip-...:/usr/local/src/atlas# wget https://launchpad.net/ubuntu/oneiric/+source/atlas/3.8.4-3/+files/atlas_3.8.4-3.debian.tar.gz
root@ip-...:/usr/local/src/atlas# wget https://launchpad.net/ubuntu/oneiric/+source/atlas/3.8.4-3/+files/atlas_3.8.4-3.dsc
root@ip-...:/usr/local/src/atlas# dpkg-source -x atlas_3.8.4-3.dsc
root@ip-...:/usr/local/src/atlas# cd atlas-3.8.4
root@ip-...:/usr/local/src/atlas/atlas-3.8.4# fakeroot debian/rules custom
root@ip-...:/usr/local/src/atlas/atlas-3.8.4# cd ..
root@ip-...:/usr/local/src/atlas# dpkg -i *.deb
LAPACK was installed along with the dependencies.
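To see which BLAS/LAPACK implementation ends up active, the alternatives system can be queried; the alternative names below are what I recall from Natty-era packaging and should be treated as an assumption:
root@ip-...:~# update-alternatives --display libblas.so.3gf
root@ip-...:~# update-alternatives --display liblapack.so.3gf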