Run experiments in parallel on AWS #10

Closed · lewfish opened this issue Mar 8, 2017 · 6 comments

@lewfish (Contributor) commented Mar 8, 2017

Currently, we can run experiments in parallel by spinning up some instances and then manually SSHing into each one, and running a command for each experiment. This doesn't scale well, so we would like to find a way of automating this. Some ideas include:

OpenAI has a blog post about their infrastructure setup that we should mine for ideas: https://openai.com/blog/infrastructure-for-deep-learning/
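For reference, the manual workflow described above (SSH into each instance and run one command per experiment) could be scripted in a few lines. The sketch below is only illustrative; the hostnames and the per-experiment entry point are placeholders, not our actual setup.

#!/usr/bin/env python
# Sketch: launch one experiment per host over SSH, in parallel, and wait for all of them.
# HOSTS, EXPERIMENTS, and run_experiment.sh are placeholders.
import subprocess

HOSTS = ["ec2-host-1", "ec2-host-2"]
EXPERIMENTS = ["experiment_a.json", "experiment_b.json"]

procs = []
for host, experiment in zip(HOSTS, EXPERIMENTS):
    # Hypothetical per-experiment entry point on the remote instance.
    cmd = ["ssh", host, "./run_experiment.sh " + experiment]
    procs.append(subprocess.Popen(cmd))

# Wait for all experiments to finish and report exit codes.
for proc in procs:
    proc.wait()
    print(proc.args, "->", proc.returncode)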

@lewfish (Contributor, Author) commented Mar 8, 2017

We also need a way to define the set of experiments we want to run. I'm not convinced that we need to do anything fancy for this, but we might want to look at https://github.com/keplr-io/picard
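For concreteness, "nothing fancy" could be as simple as generating a JSON file that enumerates hyperparameter combinations for the runner to consume. A minimal sketch follows; the option names and values are hypothetical, not the project's actual config schema.

# Sketch: write experiments.json as the cross product of a small hyperparameter grid.
# The option names and values below are hypothetical.
import itertools
import json

grid = {
    "model": ["fcn", "unet"],
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [8, 16],
}

keys = sorted(grid)
experiments = [dict(zip(keys, values))
               for values in itertools.product(*(grid[k] for k in keys))]

with open("experiments.json", "w") as f:
    json.dump(experiments, f, indent=2)

print("wrote", len(experiments), "experiment definitions")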

@lossyrob (Contributor) commented Mar 30, 2017

I believe I got regular docker running with the GPU, which gets around the challenge you mentioned for running on AWS Batch.

For AWS Batch, because we'd need a custom AMI, we'd need to run in an unmanaged compute environment.

Steps I took to get the GPU running in an ECS-optimized AMI instance (p2.xlarge):

  • Launch the latest ECS-optimized AMI [amzn-ami-2016.09.g-amazon-ecs-optimized]
  • SSH in, run the following:
> sudo yum groupinstall -y "Development Tools"
> version=364.19
> arch=`uname -m`
> sudo yum install -y wget
> wget http://us.download.nvidia.com/XFree86/Linux-${arch}/${version}/NVIDIA-Linux-${arch}-${version}.run
> srcs=`ls /usr/src/kernels`
> sudo bash ./NVIDIA-Linux-${arch}-${version}.run -silent --kernel-source-path /usr/src/kernels/${srcs}
> sudo reboot

Log back in, clone this repository.

  • Run the following patch on this repo:
diff --git a/scripts/run b/scripts/run
index 08914c1..b4a8869 100755
--- a/scripts/run
+++ b/scripts/run
@@ -27,10 +27,10 @@ then
             keras-semantic-segmentation-cpu "${@:2}"
     elif [ "${1:-}" = "--gpu" ]
     then
-        sudo nvidia-docker run --rm -it \
+        sudo docker run --rm -it \
             -v ~/keras-semantic-segmentation/src:/opt/src \
             -v ~/data:/opt/data \
-            002496907356.dkr.ecr.us-east-1.amazonaws.com/keras-semantic-segmentation-gpu "${@:2}"
+            --privileged -v /usr:/hostusr -v /lib:/hostlib keras-semantic-segmentation-gpu "${@:2}"
     else
         usage
     fi
diff --git a/src/Dockerfile-gpu b/src/Dockerfile-gpu
index d0d4ca2..c022276 100644
--- a/src/Dockerfile-gpu
+++ b/src/Dockerfile-gpu
@@ -20,4 +20,7 @@ USER root
 RUN mkdir /opt/data
 RUN chown -R keras:root /opt/data
 
-CMD ["bash"]
+COPY startup.sh /usr/local/bin/
+
+CMD ["/usr/local/bin/startup.sh"]
+

startup.sh looks like:

#!/bin/bash

set -e

echo "Copy the NVidia drivers from the parent (because nvidia-docker-plugin doesn't work with ECS agent)"
find /hostusr -name "*nvidia*" -o -name "*cuda*" -o -name "*GL*" | while read path
do
  newpath="/usr${path#/hostusr}"
  mkdir -p `dirname $newpath` && \
    cp -a $path $newpath
done

cp -ar /hostlib/modules /lib

echo "/usr/lib64" > /etc/ld.so.conf.d/nvidia.conf
ldconfig

echo "Starting your essential task"
exec /bin/bash

Remember to chmod a+x src/startup.sh before building the container.

  • From inside the container, run a python shell with the following:
>>> from tensorflow.python.client import device_lib as _device_lib
>>> _device_lib.list_local_devices()
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 9470809447589491728
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11386087015
incarnation: 568561566696502057
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0"
]
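A slightly stronger check than listing devices is to pin an op to the GPU and run it, with soft placement disabled so it fails loudly if the device isn't actually usable. This is a generic TensorFlow 1.x-era snippet (matching the API in this container), not something specific to this repo:

# Sketch: force a small matmul onto /gpu:0 and log where ops are placed.
# allow_soft_placement=False makes this raise if the GPU can't actually be used.
import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 0.0], [0.0, 1.0]])
    c = tf.matmul(a, b)

config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=False)
with tf.Session(config=config) as sess:
    print(sess.run(c))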

@lossyrob (Contributor) commented

Removed the operations tag because this issue isn't on the Azavea Ops team's plate, but tagging @azavea/operations in case there is interest in tracking.

@lossyrob (Contributor) commented Apr 3, 2017

I've updated the comment above to reflect the current process; I was able to run this successfully with the reboot as the last step. The next step is to pull the setup into a cloud-config user data for the spot request, and attempt to run a container that sees the GPU directly from instance startup.

@lossyrob (Contributor) commented Apr 3, 2017

With the following UserData:

#cloud-config

runcmd:
  - sudo yum groupinstall -y "Development Tools"
  - sudo yum install -y wget
  - curl -o driver-install.run http://us.download.nvidia.com/XFree86/Linux-`uname -m`/364.19/NVIDIA-Linux-`uname -m`-364.19.run
  - sudo bash ./driver-install.run -silent --kernel-source-path /usr/src/kernels/`ls /usr/src/kernels`
  - sudo reboot

I can pull a container, run python, and get:

root@f9b54b3edfc1:/opt/src# python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Jun 15 2016, 15:32:45) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib as _device_lib
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> _device_lib.list_local_devices()
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.25GiB
Free memory: 11.16GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 10438605519399576444
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11387098727
incarnation: 11180310221239512191
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0"
]
>>> 

Just to check, a p2.xlarge instance without the cloud-init file gives:

root@2d2f8ad8e131:/opt/src# python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Jun 15 2016, 15:32:45) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib as _device_lib
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: 2d2f8ad8e131
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1077] LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1078] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> _device_lib.list_local_devices()
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:140] kernel driver does not appear to be running on this host (2d2f8ad8e131): /proc/driver/nvidia/version does not exist
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 4842678244661302863
]
>>> 

with the same container.
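For the spot request itself, the cloud-config above could be attached as UserData along these lines. This is only a sketch using boto3; the AMI id, bid price, key name, and security group are placeholders, and note that request_spot_instances expects UserData to be base64-encoded.

# Sketch: submit a p2.xlarge spot request with the cloud-config above as UserData.
# ImageId, SpotPrice, KeyName, and SecurityGroupIds are placeholders.
import base64
import boto3

with open("user-data.yml") as f:  # the #cloud-config shown above
    user_data = f.read()

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.request_spot_instances(
    SpotPrice="0.90",
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-xxxxxxxx",        # ECS-optimized AMI (placeholder)
        "InstanceType": "p2.xlarge",
        "KeyName": "my-key",
        "SecurityGroupIds": ["sg-xxxxxxxx"],
        # LaunchSpecification.UserData must be base64-encoded.
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)
print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])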

@whatnick commented

I have been testing the Neptune experiment platform. It seems like a good fit for batching multiple experiments.
