
Infrastructure improvements #1

Closed · tnation14 opened this issue Feb 14, 2017 · 5 comments

tnation14 commented Feb 14, 2017

@lewfish and I discussed some potential areas for infrastructure improvements:

Reducing EC2 boot times

  • Build Docker images locally and push them to quay.io/ECR, rather than copying the entire local workspace up to EC2 for builds (see the sketch after this list).
  • Get the latest source code onto the EC2 instance by cloning this repository using cloud-init or a command run over SSH.
  • Replace the cloud-config installation of nvidia-docker with our own AMI, based on ami-50b4f047, that has nvidia-docker preinstalled.
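A minimal sketch of the build-and-push flow from the first bullet, assuming the ECR repository already exists; the account ID, region, and repository name are placeholders:

```bash
# Build locally and push to ECR so the EC2 instance only has to pull the image.
# Account ID, region, and repository name are illustrative.
ACCOUNT_ID=123456789012
REGION=us-east-1
REPO=${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/example-image

docker build -t example-image:latest .
docker tag example-image:latest "${REPO}:latest"

# Authenticate Docker to ECR (2017-era CLI; newer CLIs use `aws ecr get-login-password`).
eval "$(aws ecr get-login --region ${REGION})"
docker push "${REPO}:latest"
```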

Optimizations for multi-user collaborations

  • Identify a user's EC2 instance via key-pair name: scripts/run uses aws ec2 wait to determine when Spot Fleet requests are complete. However, if multiple users run the script at the same time, the script may wait on the wrong Spot Fleet request. One way to avoid this is to let each user supply their own (named) key-pair and add key-name as an additional filter to aws ec2 wait instance-running (see the sketch after this list).

  • Use the AWS CLI to terminate instances once their jobs have finished. Ideally we'd run this from inside the container, but that would require either storing credentials in the container (a security risk) or access to the EC2 metadata service.
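A minimal sketch of both ideas, assuming each user launches with a uniquely named key pair; the key name is illustrative:

```bash
# Wait only for instances launched with this user's key pair.
KEY_NAME=my-user-keypair

aws ec2 wait instance-running \
  --filters "Name=key-name,Values=${KEY_NAME}"

# Later, terminate only that user's instances once jobs finish.
INSTANCE_IDS=$(aws ec2 describe-instances \
  --filters "Name=key-name,Values=${KEY_NAME}" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" --output text)

aws ec2 terminate-instances --instance-ids ${INSTANCE_IDS}
```

Depending on how the Spot Fleet request is configured, cancelling the request itself may also be necessary (see the discussion below).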

Concurrent processing across instances

We want to be able to run the same command with different parameters, simultaneously, across all available workers. We settled on the following:

  • Add instance name/index $INSTANCE_ID as an environment variable via cloud-config
  • Store the command parameters in files namespaced by instance ID. Each instance would read a file like $INSTANCE_ID-command-options.json (see the sketch below).
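A minimal sketch of what a worker would do at startup, assuming $INSTANCE_ID is exported by cloud-config; the S3 bucket and key layout are illustrative:

```bash
# INSTANCE_ID is assumed to be set by cloud-config; bucket and key layout are illustrative.
: "${INSTANCE_ID:?INSTANCE_ID must be set by cloud-config}"

OPTIONS_FILE="${INSTANCE_ID}-command-options.json"
aws s3 cp "s3://example-experiment-bucket/options/${OPTIONS_FILE}" "/tmp/${OPTIONS_FILE}"

# The job command would then read its parameters from /tmp/${OPTIONS_FILE}.
```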
@hectcastro

  • I'm not sure introducing all of the overhead around baking an AMI is going to be worth it just to bake in the nvidia-docker dependency. That archive is ~2MB and decompresses to a single binary. Downloading and decompressing that on an EC2 instance should take a few seconds end-to-end.

  • I think it would be good to make the aws ec2 wait step use tags vs. key-pair names. That looks like an option supported by --filters.

  • Terminating the instances from within an EC2 instance could be tricky, because I think we'd actually have to terminate the Spot Fleet request (and doing that also indirectly messes with Terraform's state of the world).

  • Not entirely following the last solution for processing work in parallel, but as a related note, the instance ID is already available via instance metadata (see the sketch after this list).
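For reference, a minimal sketch of reading the instance ID from the instance metadata service:

```bash
# Query the EC2 instance metadata endpoint for this instance's ID.
curl -s http://169.254.169.254/latest/meta-data/instance-id
```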


lewfish commented Feb 14, 2017

> I'm not sure introducing all of the overhead around baking an AMI is going to be worth it just to bake in the nvidia-docker dependency. That archive is ~2MB and decompresses to a single binary. Downloading and decompressing that on an EC2 instance should take a few seconds end-to-end.

I don't think we need to bake our own AMI with nvidia-docker on it. We can just use an existing AMI that has it. But if it only takes a few seconds to install it, then it doesn't matter.

> I think it would be good to make the aws ec2 wait step use tags vs. key-pair names. That looks like an option supported by --filters.

Tags make more sense, but Terraform doesn't let us associate tags with instances created using a spot fleet request. See hashicorp/terraform#3263

> Not entirely following the last solution for processing work in parallel, but as a related note, the instance ID is already available via instance metadata.

True, but each worker needs to know its index (i.e. a number between 0 and n-1 for n workers) to figure out which batch of jobs to run. I don't think we can turn the instance ID into a worker index.

@hectcastro

Good point about #3263.

Regarding an index for workers, there is also ami-launch-index via instance metadata. Not entirely sure what value that returns when multiple instances get launched from a Spot Fleet though.
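A minimal sketch of reading that value from the metadata service (how it behaves for Spot Fleet launches is the open question above):

```bash
# ami-launch-index is a 0-based index within the launch request.
curl -s http://169.254.169.254/latest/meta-data/ami-launch-index
```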

I guess my high-level concern is that we should try to make use of what's already there (if it applies) via instance metadata vs. supplying and managing our own identifiers.


lewfish commented Mar 1, 2017

I'm thinking about using AWS Batch to run lots of experiments in parallel. Does that sound OK? One issue is that Batch uses ECS, and ECS doesn't know about nvidia-docker. There's a workaround that makes the GPU usable even when running with regular Docker in ECS (see the sketch below): https://blog.cloudsight.ai/deep-learning-image-recognition-using-gpus-in-amazon-ecs-docker-containers-5bdb1956f30e#.mau60bfvo
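I haven't pinned down the exact steps from that post, but the general shape of the workaround is to expose the NVIDIA devices and driver volume to a plain docker run; the device names and driver volume name vary by host and driver version, so treat these as illustrative:

```bash
# Run a CUDA container without nvidia-docker by mapping the GPU devices and driver volume by hand.
# /dev/nvidia0 and the nvidia_driver_latest volume name are illustrative; they depend on the host.
docker run --rm \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia0 \
  -v nvidia_driver_latest:/usr/local/nvidia:ro \
  nvidia/cuda nvidia-smi
```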


lewfish commented Mar 8, 2017

I'm moving the conversation about parallelizing experiments to #10.

lewfish closed this as completed Mar 8, 2017