Run experiments in parallel on AWS #10
We also need a way to define the set of experiments we want to run. I'm not convinced that we need to do anything fancy for this, but we might want to look at https://github.com/keplr-io/picard
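If we keep it that simple, the definition could just be a checked-in list of experiments with the options that vary between them, plus a tiny driver that loops over it. A minimal sketch (the keys and the `run_experiment` hook are hypothetical, not anything that exists in the repo today):

```python
import json
import subprocess

# Hypothetical experiment set: one entry per run, holding only the options
# that differ between experiments.
EXPERIMENTS = [
    {"name": "baseline", "batch_size": 8, "epochs": 50},
    {"name": "small-batch", "batch_size": 2, "epochs": 50},
]


def run_experiment(experiment):
    # Placeholder command: in practice this would call the real training
    # entry point with the experiment's options.
    subprocess.check_call(["echo", "would run", json.dumps(experiment)])


if __name__ == "__main__":
    for experiment in EXPERIMENTS:
        run_experiment(experiment)
```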
I believe I got regular docker running with the GPU, which gets around the challenge you mentioned for running on AWS Batch. (For AWS Batch, because we'd need a custom AMI, we'd have to run in an unmanaged compute environment.) Steps I took to get the GPU running on an ECS-optimized AMI instance (p2.xlarge):
```bash
sudo yum groupinstall -y "Development Tools"
version=364.19
arch=`uname -m`
sudo yum install -y wget
wget http://us.download.nvidia.com/XFree86/Linux-${arch}/${version}/NVIDIA-Linux-${arch}-${version}.run
srcs=`ls /usr/src/kernels`
sudo bash ./NVIDIA-Linux-${arch}-${version}.run -silent --kernel-source-path /usr/src/kernels/${srcs}
sudo reboot
```
Log back in and clone this repository.
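Before continuing, it may be worth confirming that the kernel driver actually loaded after the reboot. A minimal check (it just reads `/proc/driver/nvidia/version`, the same path TensorFlow complains about in the logs further down when the driver is missing):

```python
import os

DRIVER_VERSION_PATH = "/proc/driver/nvidia/version"

if os.path.exists(DRIVER_VERSION_PATH):
    # This file only exists once the NVIDIA kernel module is loaded.
    with open(DRIVER_VERSION_PATH) as f:
        print(f.read())
else:
    print("NVIDIA kernel driver not loaded; the install or reboot did not take.")
```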
```diff
diff --git a/scripts/run b/scripts/run
index 08914c1..b4a8869 100755
--- a/scripts/run
+++ b/scripts/run
@@ -27,10 +27,10 @@ then
keras-semantic-segmentation-cpu "${@:2}"
elif [ "${1:-}" = "--gpu" ]
then
- sudo nvidia-docker run --rm -it \
+ sudo docker run --rm -it \
-v ~/keras-semantic-segmentation/src:/opt/src \
-v ~/data:/opt/data \
- 002496907356.dkr.ecr.us-east-1.amazonaws.com/keras-semantic-segmentation-gpu "${@:2}"
+ --privileged -v /usr:/hostusr -v /lib:/hostlib keras-semantic-segmentation-gpu "${@:2}"
else
usage
fi
diff --git a/src/Dockerfile-gpu b/src/Dockerfile-gpu
index d0d4ca2..c022276 100644
--- a/src/Dockerfile-gpu
+++ b/src/Dockerfile-gpu
@@ -20,4 +20,7 @@ USER root
RUN mkdir /opt/data
RUN chown -R keras:root /opt/data
-CMD ["bash"]
+COPY startup.sh /usr/local/bin/
+
+CMD ["/usr/local/bin/startup.sh"]
+
```
startup.sh looks like:
```bash
#!/bin/bash
set -e
echo "Copy the NVidia drivers from the parent (because nvidia-docker-plugin doesn't work with ECS agent)"
find /hostusr -name "*nvidia*" -o -name "*cuda*" -o -name "*GL*" | while read path
do
newpath="/usr${path#/hostusr}"
mkdir -p `dirname $newpath` && \
cp -a $path $newpath
done
cp -ar /hostlib/modules /lib
echo "/usr/lib64" > /etc/ld.so.conf.d/nvidia.conf
ldconfig
echo "Starting your essential task"
exec /bin/bash
```
Remember to check that the GPU is visible from inside the container:
```
>>> from tensorflow.python.client import device_lib as _device_lib
>>> _device_lib.list_local_devices()
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 9470809447589491728
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11386087015
incarnation: 568561566696502057
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0"
]
```
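If we end up starting containers unattended from instance boot, it might be worth failing fast when the copied drivers don't work. A small sketch built on the same `device_lib` call shown above (the `assert_gpu_available` helper is just illustrative):

```python
from tensorflow.python.client import device_lib


def assert_gpu_available():
    # list_local_devices() returns DeviceAttributes entries; a working setup
    # should include at least one with device_type == "GPU".
    devices = device_lib.list_local_devices()
    gpus = [d for d in devices if d.device_type == "GPU"]
    if not gpus:
        raise RuntimeError("No GPU visible to TensorFlow; devices: %s" % devices)
    return gpus


if __name__ == "__main__":
    for gpu in assert_gpu_available():
        print(gpu.name, gpu.physical_device_desc)
```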
Removed the
I've updated the comment above to reflect the current process; I was able to run this successfully with the reboot as the last step. The next step is to pull the setup into cloud-config user data for the spot request, and attempt to run a container that sees the GPU successfully directly from instance startup.
With UserData:
```yaml
#cloud-config
runcmd:
- sudo yum groupinstall -y "Development Tools"
- sudo yum install -y wget
- curl -o driver-install.run http://us.download.nvidia.com/XFree86/Linux-`uname -m`/364.19/NVIDIA-Linux-`uname -m`-364.19.run
- sudo bash ./driver-install.run -silent --kernel-source-path /usr/src/kernels/`ls /usr/src/kernels`
- sudo reboot
```
I can pull a container, run python, and get:
```
root@f9b54b3edfc1:/opt/src# python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Jun 15 2016, 15:32:45)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib as _device_lib
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> _device_lib.list_local_devices()
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.25GiB
Free memory: 11.16GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 10438605519399576444
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11387098727
incarnation: 11180310221239512191
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0"
]
>>>
```
Just to check, a p2.xlarge instance without the cloud-init file gives:
```
root@2d2f8ad8e131:/opt/src# python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Jun 15 2016, 15:32:45)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib as _device_lib
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:102] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: 2d2f8ad8e131
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1077] LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1078] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> _device_lib.list_local_devices()
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:140] kernel driver does not appear to be running on this host (2d2f8ad8e131): /proc/driver/nvidia/version does not exist
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
bus_adjacency: BUS_ANY
incarnation: 4842678244661302863
]
>>>
```
This is with the same container.
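For the spot-request side, something along these lines should be enough to launch an instance with that user data via boto3 (a sketch only: the AMI ID, key name, and bid price are placeholders, and security groups/IAM are left out):

```python
import base64

import boto3

# The cloud-config from the comment above, passed verbatim as user data.
USER_DATA = """#cloud-config
runcmd:
- sudo yum groupinstall -y "Development Tools"
- sudo yum install -y wget
- curl -o driver-install.run http://us.download.nvidia.com/XFree86/Linux-`uname -m`/364.19/NVIDIA-Linux-`uname -m`-364.19.run
- sudo bash ./driver-install.run -silent --kernel-source-path /usr/src/kernels/`ls /usr/src/kernels`
- sudo reboot
"""

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    SpotPrice="0.90",  # placeholder bid
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-xxxxxxxx",  # placeholder: the ECS-optimized AMI
        "InstanceType": "p2.xlarge",
        "KeyName": "my-key-pair",   # placeholder key pair
        # LaunchSpecification expects base64-encoded user data.
        "UserData": base64.b64encode(USER_DATA.encode("utf-8")).decode("ascii"),
    },
)
print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])
```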
I have been testing the Neptune experimentation platform. It seems like a good fit for batching multiple experiments.
Currently, we can run experiments in parallel by spinning up some instances, manually SSHing into each one, and running a command for each experiment. This doesn't scale well, so we would like to find a way of automating this (see the sketch after this list). Some ideas include:

- We use `nvidia-docker` and ECS uses `docker`. There's a workaround to make GPUs available in `docker`, but it might be tricky to get it to work: https://blog.cloudsight.ai/deep-learning-image-recognition-using-gpus-in-amazon-ecs-docker-containers-5bdb1956f30e#.mau60bfvo
- OpenAI has a blog post about their infrastructure setup which we should mine for ideas: https://openai.com/blog/infrastructure-for-deep-learning/
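For reference, the simplest automation of the current workflow is probably just replacing the manual SSH sessions with a loop; a rough sketch (the hosts and per-experiment commands are placeholders):

```python
import subprocess

# Placeholder hosts and per-experiment commands; each pair is what we
# currently type by hand in a separate SSH session.
JOBS = [
    ("ec2-user@instance-1.example.com", "./scripts/run --gpu python train.py"),
    ("ec2-user@instance-2.example.com", "./scripts/run --gpu python train.py"),
]


def dispatch(host, command):
    # Start the experiment on the remote host without blocking, so all
    # experiments run in parallel; output is captured in run.log remotely.
    remote = "cd keras-semantic-segmentation && nohup %s > run.log 2>&1 &" % command
    return subprocess.Popen(["ssh", host, remote])


if __name__ == "__main__":
    processes = [dispatch(host, command) for host, command in JOBS]
    for process in processes:
        process.wait()
```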