Run experiments in parallel on AWS #10

Closed · lewfish opened this issue Mar 8, 2017 · 6 comments

@lewfish (Contributor) commented Mar 8, 2017

Currently, we can run experiments in parallel by spinning up some instances and then manually SSHing into each one, and running a command for each experiment. This doesn't scale well, so we would like to find a way of automating this. Some ideas include:

OpenAI has a blog post about their infrastructure setup that we should mine for ideas: https://openai.com/blog/infrastructure-for-deep-learning/
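For reference, the manual workflow described above (SSH into each instance and run one command per experiment) could be scripted in a few lines. The sketch below is only illustrative; the hostnames and the per-experiment entry point are placeholders, not our actual setup.

#!/usr/bin/env python
# Sketch: launch one experiment per host over SSH, in parallel, and wait for all of them.
# HOSTS, EXPERIMENTS, and run_experiment.sh are placeholders.
import subprocess

HOSTS = ["ec2-host-1", "ec2-host-2"]
EXPERIMENTS = ["experiment_a.json", "experiment_b.json"]

procs = []
for host, experiment in zip(HOSTS, EXPERIMENTS):
    # Hypothetical per-experiment entry point on the remote instance.
    cmd = ["ssh", host, "./run_experiment.sh " + experiment]
    procs.append(subprocess.Popen(cmd))

# Wait for all experiments to finish and report exit codes.
for proc in procs:
    proc.wait()
    print(proc.args, "->", proc.returncode)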

@lewfish (Contributor, Author) commented Mar 8, 2017

We also need a way to define the set of experiments we want to run. I'm not convinced that we need to do anything fancy for this, but we might want to look at https://github.com/keplr-io/picard
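For concreteness, "nothing fancy" could be as simple as generating a JSON file that enumerates hyperparameter combinations for the runner to consume. A minimal sketch follows; the option names and values are hypothetical, not the project's actual config schema.

# Sketch: write experiments.json as the cross product of a small hyperparameter grid.
# The option names and values below are hypothetical.
import itertools
import json

grid = {
    "model": ["fcn", "unet"],
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [8, 16],
}

keys = sorted(grid)
experiments = [dict(zip(keys, values))
               for values in itertools.product(*(grid[k] for k in keys))]

with open("experiments.json", "w") as f:
    json.dump(experiments, f, indent=2)

print("wrote", len(experiments), "experiment definitions")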

@lossyrob (Contributor) commented Mar 30, 2017

I believe I got regular docker running with the GPU, which gets around the challenge you mentioned for running on AWS Batch.

For AWS Batch, because we'd need a custom AMI, we'd need to run in an unmanaged compute environment.

Steps I took to get the GPU running in an ECS-optimized AMI instance (p2.xlarge):

  • Launch the latest ECS-optimized AMI [amzn-ami-2016.09.g-amazon-ecs-optimized]
  • SSH in, run the following:
> sudo yum groupinstall -y "Development Tools"
> version=364.19
> arch=`uname -m`
> sudo yum install -y wget
> wget http://us.download.nvidia.com/XFree86/Linux-${arch}/${version}/NVIDIA-Linux-${arch}-${version}.run
> srcs=`ls /usr/src/kernels`
> sudo bash ./NVIDIA-Linux-${arch}-${version}.run -silent --kernel-source-path /usr/src/kernels/${srcs}
> sudo reboot

Log back in, clone this repository.

  • Run the following patch on this repo:
diff --git a/scripts/run b/scripts/run
index 08914c1..b4a8869 100755
--- a/scripts/run
+++ b/scripts/run
@@ -27,10 +27,10 @@ then
             keras-semantic-segmentation-cpu "${@:2}"
     elif [ "${1:-}" = "--gpu" ]
     then
-        sudo nvidia-docker run --rm -it \
+        sudo docker run --rm -it \
             -v ~/keras-semantic-segmentation/src:/opt/src \
             -v ~/data:/opt/data \
-            002496907356.dkr.ecr.us-east-1.amazonaws.com/keras-semantic-segmentation-gpu "${@:2}"
+            --privileged -v /usr:/hostusr -v /lib:/hostlib keras-semantic-segmentation-gpu "${@:2}"
     else
         usage
     fi
diff --git a/src/Dockerfile-gpu b/src/Dockerfile-gpu
index d0d4ca2..c022276 100644
--- a/src/Dockerfile-gpu
+++ b/src/Dockerfile-gpu
@@ -20,4 +20,7 @@ USER root
 RUN mkdir /opt/data
 RUN chown -R keras:root /opt/data
 
-CMD ["bash"]
+COPY startup.sh /usr/local/bin/
+
+CMD ["/usr/local/bin/startup.sh"]
+

startup.sh looks like:

#!/bin/bash

set -e

echo "Copy the NVidia drivers from the parent (because nvidia-docker-plugin doesn't work with ECS agent)"
find /hostusr -name "*nvidia*" -o -name "*cuda*" -o -name "*GL*" | while read path
do
  newpath="/usr${path#/hostusr}"
  mkdir -p `dirname $newpath` && \
    cp -a $path $newpath
done

cp -ar /hostlib/modules /lib

echo "/usr/lib64" > /etc/ld.so.conf.d/nvidia.conf
ldconfig

echo "Starting your essential task"
exec /bin/bash

Remember to chmod a+x src/startup.sh before building the container.

  • From inside the container, run a python shell with the following:
>>> from tensorflow.python.client import device_lib as _device_lib
>>> _device_lib.list_local_devices()
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 9470809447589491728
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11386087015
incarnation: 568561566696502057
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0"
]
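A slightly stronger check than listing devices is to pin an op to the GPU and run it, with soft placement disabled so it fails loudly if the device isn't actually usable. This is a generic TensorFlow 1.x-era snippet (matching the API in this container), not something specific to this repo:

# Sketch: force a small matmul onto /gpu:0 and log where ops are placed.
# allow_soft_placement=False makes this raise if the GPU can't actually be used.
import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 0.0], [0.0, 1.0]])
    c = tf.matmul(a, b)

config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=False)
with tf.Session(config=config) as sess:
    print(sess.run(c))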

@lossyrob (Contributor) commented

Removed the operations tag because this issue isn't on the Azavea Ops team's plate, but tagging @azavea/operations in case there is interest in tracking.

@lossyrob (Contributor) commented Apr 3, 2017

I've updated the comment above to reflect the current process; I was able to run this successfully with the reboot as the last step. The next step is to pull the setup into a cloud-config user data for the spot request, and attempt to run a container that sees the GPU directly from instance startup.

@lossyrob (Contributor) commented Apr 3, 2017

With the following UserData:

#cloud-config

runcmd:
  - sudo yum groupinstall -y "Development Tools"
  - sudo yum install -y wget
  - curl -o driver-install.run http://us.download.nvidia.com/XFree86/Linux-`uname -m`/364.19/NVIDIA-Linux-`uname -m`-364.19.run
  - sudo bash ./driver-install.run -silent --kernel-source-path /usr/src/kernels/`ls /usr/src/kernels`
  - sudo reboot

I can pull a container, run python, and get:

root@f9b54b3edfc1:/opt/src# python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Jun 15 2016, 15:32:45) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib as _device_lib
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> _device_lib.list_local_devices()
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.25GiB
Free memory: 11.16GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 10438605519399576444
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11387098727
incarnation: 11180310221239512191
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0"
]
>>> 

Just to check, a p2.xlarge instance without the cloud-init file gives:

root@2d2f8ad8e131:/opt/src# python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Jun 15 2016, 15:32:45) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib as _device_lib
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: 2d2f8ad8e131
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1077] LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1078] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> _device_lib.list_local_devices()
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:140] kernel driver does not appear to be running on this host (2d2f8ad8e131): /proc/driver/nvidia/version does not exist
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 4842678244661302863
]
>>> 

with the same container.
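For the spot request itself, the cloud-config above could be attached as UserData along these lines. This is only a sketch using boto3; the AMI id, bid price, key name, and security group are placeholders, and note that request_spot_instances expects UserData to be base64-encoded.

# Sketch: submit a p2.xlarge spot request with the cloud-config above as UserData.
# ImageId, SpotPrice, KeyName, and SecurityGroupIds are placeholders.
import base64
import boto3

with open("user-data.yml") as f:  # the #cloud-config shown above
    user_data = f.read()

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.request_spot_instances(
    SpotPrice="0.90",
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-xxxxxxxx",        # ECS-optimized AMI (placeholder)
        "InstanceType": "p2.xlarge",
        "KeyName": "my-key",
        "SecurityGroupIds": ["sg-xxxxxxxx"],
        # LaunchSpecification.UserData must be base64-encoded.
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)
print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])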

@whatnick commented

I have been testing the Neptune experiment platform. It seems like a good fit for batching multiple experiments.
