Infrastructure improvements #1
I don't think we need to bake our own AMI with nvidia-docker on it. We can just use an existing AMI that has it. But if it only takes a few seconds to install it, then it doesn't matter.
Tags make more sense, but Terraform doesn't let us associate tags with instances created using a spot fleet request. See hashicorp/terraform#3263
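One possible workaround until hashicorp/terraform#3263 is resolved (a sketch, not something we've settled on): tag the fleet's instances after launch with the AWS CLI. The fleet request ID below is a placeholder:

```bash
# Workaround sketch: Terraform can't tag spot fleet instances at
# creation time, so tag them after launch instead.
SFR_ID="sfr-00000000-0000-0000-0000-000000000000"  # placeholder

# List the instance IDs belonging to the Spot Fleet request...
INSTANCE_IDS=$(aws ec2 describe-spot-fleet-instances \
  --spot-fleet-request-id "$SFR_ID" \
  --query "ActiveInstances[].InstanceId" --output text)

# ...and tag them.
aws ec2 create-tags --resources $INSTANCE_IDS \
  --tags Key=Project,Value=experiments
```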
True, but each worker needs to know its index (i.e. a number between 0 and n-1 for n workers) to figure out which batch of jobs to run. I don't think we can turn the instance ID into a worker index.
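For what it's worth, one hypothetical way to derive an index from instance IDs (not something we've agreed on): rank this instance's ID within the sorted list of the fleet's instance IDs. The fleet request ID is a placeholder:

```bash
# Hypothetical: derive a worker index in 0..n-1 by ranking this
# instance's ID within the sorted list of fleet instance IDs.
SFR_ID="sfr-00000000-0000-0000-0000-000000000000"  # placeholder

# This instance's ID, from the EC2 metadata service.
MY_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# All instance IDs in the fleet, sorted for a stable ordering.
ALL_IDS=$(aws ec2 describe-spot-fleet-instances \
  --spot-fleet-request-id "$SFR_ID" \
  --query "ActiveInstances[].InstanceId" --output text | tr '\t' '\n' | sort)

# The worker index is this instance's zero-based position in that list.
WORKER_INDEX=$(($(echo "$ALL_IDS" | grep -n "^$MY_ID$" | cut -d: -f1) - 1))
echo "worker index: $WORKER_INDEX"
```

The caveat is that the mapping changes if spot instances get replaced, so it's only stable for a fixed fleet.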
Good point about #3263. Regarding an index for workers, I guess my high-level concern is that we try to make use of what's already there (if it applies) via instance metadata vs. supplying and managing our own identifiers.
I'm thinking about using AWS Batch to run lots of experiments in parallel. Does that sound ok? One issue is that Batch uses ECS, and ECS doesn't know about nvidia-docker. There's a workaround to be able to use the GPU even when running with regular docker in ECS: https://blog.cloudsight.ai/deep-learning-image-recognition-using-gpus-in-amazon-ecs-docker-containers-5bdb1956f30e#.mau60bfvo
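As I understand it, the general shape of that workaround (details in the linked post; this is a sketch, not something tested here): `nvidia-docker-plugin` runs on the host and can emit the extra arguments that `nvidia-docker` would normally inject, which can then be passed to plain `docker run`:

```bash
# Ask the nvidia-docker plugin (listening on localhost:3476) for the
# driver volume and /dev/nvidia* device arguments it would inject.
DOCKER_ARGS=$(curl -s http://localhost:3476/docker/cli)

# Run a CUDA container with plain docker using those arguments.
docker run $DOCKER_ARGS --rm nvidia/cuda nvidia-smi
```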
I'm moving the conversation about parallelizing experiments to #10.
@lewfish and I discussed some potential areas for infrastructure improvements:
Reducing EC2 Boot times
- Replace installing `nvidia-docker` at boot with our own AMI, based on `ami-50b4f047`, that has `nvidia-docker` installed (a sketch of baking such an AMI is below).
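A minimal sketch of baking such an AMI with the AWS CLI (the same could be done with Packer). It assumes an instance has already been launched from `ami-50b4f047` and `nvidia-docker` installed on it; the instance ID and image name are placeholders:

```bash
# Snapshot a prepared instance (ami-50b4f047 + nvidia-docker installed)
# into a reusable AMI; the instance ID here is a placeholder.
aws ec2 create-image \
  --instance-id i-0123456789abcdef0 \
  --name "gpu-worker-nvidia-docker" \
  --description "ami-50b4f047 with nvidia-docker preinstalled"
```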
Optimizations for multi-user collaborations
- Identify a user's EC2 instance via key-pair name: `scripts/run` uses `aws ec2 wait` to determine when Spot Fleet requests are complete. However, if multiple users are running the script at the same time, the script may wait for the wrong Spot Fleet request to finish. One way to avoid this is to let each user use their own (named) key-pair, and add `key-name` as an additional filter to `aws ec2 wait instance-running` (see the sketch after this list).
- Use the AWS CLI to terminate instances once the jobs have finished. Ideally we'd be able to run this from inside the container, but that would require either stored credentials in the container (a security risk) or access to the EC2 metadata service (a sketch of the metadata-based approach also follows).
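A sketch of the `key-name` filter idea; the key-pair name here is a placeholder:

```bash
# Wait only for instances launched with this user's named key-pair,
# so concurrent users don't wait on each other's Spot Fleet instances.
aws ec2 wait instance-running \
  --filters "Name=key-name,Values=$USER-keypair"
```

And a sketch of self-termination via the EC2 metadata service. This is exactly the access the second item mentions, and the `terminate-instances` call still needs credentials from somewhere, e.g. an instance profile rather than keys baked into the image:

```bash
# Look up this instance's ID from the EC2 metadata service...
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# ...then terminate it once the job is done. Requires EC2 permissions,
# ideally granted through an instance profile.
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
```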
Concurrent processing across instances
We want to be able to run the same command with different parameters, simultaneously, across all available workers. We settled on the following:
- Supply `$INSTANCE_ID` to each worker as an environment variable via cloud-config.
- Have each worker read its parameters from `$INSTANCE_ID-command-options.json` (a minimal sketch follows).
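A minimal sketch of the worker side, assuming cloud-config has exported `INSTANCE_ID`; the S3 bucket and runner script names are placeholders:

```bash
# Each worker picks up its own options file, named after the
# INSTANCE_ID exported via cloud-config, then runs the shared command.
OPTIONS_FILE="$INSTANCE_ID-command-options.json"

# Fetch this worker's options file; the bucket name is a placeholder.
aws s3 cp "s3://experiment-configs/$OPTIONS_FILE" "/tmp/$OPTIONS_FILE"

# Run the experiment with this worker's parameters; the runner
# script is a placeholder.
./run_experiment --options "/tmp/$OPTIONS_FILE"
```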