Add CGCloud deploy doc
jpdna committed Nov 17, 2016
1 parent 8fef9a7 commit f887597
Showing 1 changed file with 117 additions and 38 deletions: docs/source/40_deploying_ADAM.md
# Deploying ADAM

## Running ADAM on AWS EC2 using CGCloud

CGCloud provides an automated means to create a cluster on EC2 for use with ADAM.

[CGCloud](https://github.com/BD2KGenomics/cgcloud) lets you automate the creation,
management, and provisioning of VMs and clusters of VMs in Amazon EC2.
The [CGCloud plugin for Spark](https://github.com/BD2KGenomics/cgcloud/blob/master/spark/README.rst)
lets you set up a fully configured Apache Spark cluster in EC2.

Prior to following these instructions, you need to have already set up your AWS
account and to know your AWS access keys. See https://aws.amazon.com/ for details.
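
Before running any `cgcloud` commands, your AWS credentials need to be
discoverable. A minimal sketch, assuming you pass them through environment
variables (CGCloud uses boto, which can also read them from a `~/.boto` or
`~/.aws/credentials` file):
```
# Placeholders: substitute your own keys.
export AWS_ACCESS_KEY_ID='?????'
export AWS_SECRET_ACCESS_KEY='?????'
```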

#### Configure CGCloud

Begin by reading the CGCloud [readme](https://github.com/BD2KGenomics/cgcloud).

Next, configure
[CGCloud core](https://github.com/BD2KGenomics/cgcloud/blob/master/core/README.rst)
and then install the
[CGCloud Spark plugin](https://github.com/BD2KGenomics/cgcloud/blob/master/spark/README.rst).

One modification to the CGCloud install instructions: replace the two pip calls
`pip install cgcloud-core` and `pip install cgcloud-spark` with the single command:
```
pip install cgcloud-spark==1.6.0
```
which will install the correct versions of both cgcloud-core and cgcloud-spark.
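
A sketch of one way to do the install, assuming Python 2.7 with `virtualenv`
available, which keeps CGCloud and its dependencies isolated from system
packages:
```
# Hypothetical virtualenv path; any location works.
virtualenv ~/cgcloud-venv
source ~/cgcloud-venv/bin/activate
pip install cgcloud-spark==1.6.0
cgcloud --help   # quick check that the CLI is on your PATH
```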


Note that the steps below to register your SSH key and create the template
boxes only need to be done once.
```
cgcloud register-key ~/.ssh/id_rsa.pub
cgcloud create generic-ubuntu-trusty-box
cgcloud create -IT spark-box
```

#### Launch a cluster

Spin up a Spark cluster named `cluster1` with one leader and two worker nodes
of instance type `m3.large` with the command:
```
cgcloud create-cluster spark -c cluster1 -s 2 -t m3.large
```
Once running, you can ssh to `spark-master` with the command:
```
cgcloud ssh -c cluster1 spark-master
```

Spark is already installed on the `spark-master` machine and on the worker
nodes. Test it by starting a `spark-shell`:
```
spark-shell
exit()
```
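
For a slightly more end-to-end check, a one-line job can be piped into the
shell non-interactively. This is only a sketch; it assumes the shell's default
configuration on `spark-master` already points at the cluster:
```
# Sums the integers 1..1000 on the cluster and prints 500500.0
echo 'println(sc.parallelize(1 to 1000).sum())' | spark-shell
```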

#### Install ADAM

To use the ADAM application on top of Spark, we need to download and install
ADAM on `spark-master`.
From the command line on `spark-master`, download a release from
https://github.com/bigdatagenomics/adam/releases

As of this writing, CGCloud supports Spark 1.6.2, not Spark 2.x, so download
the Spark 1.x, Scala 2.10 release:
```
wget https://repo1.maven.org/maven2/org/bdgenomics/adam/\
adam-distribution_2.10/0.20.0/adam-distribution_2.10-0.20.0-bin.tar.gz
tar -xvzf adam-distribution_2.10-0.20.0-bin.tar.gz
```

You can now run `./bin/adam-submit` and `./bin/adam-shell` using your EC2
cluster.
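
As a quick sanity check (a sketch; the directory name below is an assumption,
so use whatever the tarball actually extracted to), running `adam-submit` with
no arguments should print ADAM's usage and list of available commands,
confirming that it can find both Spark and the ADAM jars:
```
cd adam-distribution_2.10-0.20.0
./bin/adam-submit
```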

#### Input and Output data on HDFS and S3

Spark requires a file system, such as HDFS or a network file mount, that all
machines can access.
The CGCloud EC2 Spark cluster you just created is already running HDFS.

The typical flow of data to and from your ADAM application on EC2 will be (see
the sketch following this list):
- Upload data to AWS S3
- Transfer from S3 to the HDFS on your cluster
- Compute with ADAM, write output to HDFS
- Copy data you wish to persist for later use to S3
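
A sketch of the S3 steps, assuming the [AWS CLI](https://aws.amazon.com/cli/)
is installed on your local machine and that `my-bucket` is a placeholder for a
bucket you own:
```
# Run locally: upload input data to S3 before the analysis...
aws s3 cp sample1.bam s3://my-bucket/inputs/sample1.bam
# ...and afterwards pull back any results the cluster copied out to S3.
aws s3 cp s3://my-bucket/outputs/sample1.adam ./sample1.adam --recursive
```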

For small test files you may wish to skip S3 by uploading directly to
`spark-master` using `scp` and then copying to HDFS using
`hadoop fs -put sample1.bam /datadir/`.

From the ADAM shell, or as a parameter to ADAM submit, you would refer to HDFS
URLs such as:
```
adam-submit transform hdfs://spark-master/work_dir/sample1.bam \
hdfs://spark-master/work_dir/sample1.adam
```

#### Bulk Transfer between HDFS and S3
To transfer large amounts of data back and forth between S3 and HDFS, we suggest using
[Conductor](https://github.com/BD2KGenomics/conductor).
It is also possible to directly use AWS S3 as a distributed file system,
but with some loss of performance.
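
If Conductor is not an option, Hadoop's built-in `distcp` tool is a common
alternative. The following is a sketch; it assumes your cluster's Hadoop is
configured with S3 credentials and an S3 filesystem scheme (`s3n://` here, with
bucket and paths as placeholders):
```
# Run on spark-master: copy input from S3 into HDFS...
hadoop distcp s3n://my-bucket/inputs/sample1.bam hdfs:///work_dir/
# ...and copy results back out to S3 when the job is done.
hadoop distcp hdfs:///work_dir/sample1.adam s3n://my-bucket/outputs/sample1.adam
```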

#### Terminate Cluster
Shut down the cluster using:
```
cgcloud terminate-cluster -c cluster1 spark
```

#### CGCloud options and Spot Instances
View help docs for all options of the `cgcloud create-cluster` command:
```
cgcloud create-cluster -h
```

In particular, note the `--spot-bid` and related spot options to utilize AWS
spot instances in order to save on costs. Also, it's a good idea to double-check
in the AWS console that your instances have terminated, to avoid unintended costs.
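
For example, a spot-priced variant of the earlier launch command might look
like the following sketch; the bid of 0.10 USD/hour is an arbitrary
placeholder, and the exact spot flags accepted by your CGCloud version should
be confirmed with `cgcloud create-cluster -h`:
```
cgcloud create-cluster spark -c cluster1 -s 2 -t m3.large --spot-bid 0.10
```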

#### Access Spark GUI
In order to view the Spark server or application web UI pages on ports 4040 and
8080 on `spark-master`, go to Security Groups in the AWS console and open
inbound TCP access on those ports from your local IP address. Find the IP
address of `spark-master` (it appears as part of the Linux command prompt, and
can also be found in the AWS console), then on your local machine point your
web browser to http://ip_of_spark_master:4040/
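
If you prefer the command line to the console, the same ports can be opened
with the AWS CLI. This is a sketch: the security group name is whatever CGCloud
assigned to your cluster (check the instance details in the console), and the
CIDR is your own public IP:
```
# Placeholders: substitute the cluster's security group name and your public IP.
aws ec2 authorize-security-group-ingress --group-name <cluster-security-group> \
    --protocol tcp --port 4040 --cidr <your.public.ip>/32
aws ec2 authorize-security-group-ingress --group-name <cluster-security-group> \
    --protocol tcp --port 8080 --cidr <your.public.ip>/32
```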

## Running ADAM on CDH 5 and other YARN based Distros
