
Infrastructure improvements #1

Closed · tnation14 opened this issue Feb 14, 2017 · 5 comments

tnation14 commented Feb 14, 2017

@lewfish and I discussed some potential areas for infrastructure improvements:

Reducing EC2 boot times

  • Build Docker images locally and push them to quay.io/ECR, rather than copying the entire local workspace up to EC2 for builds (see the sketch after this list).
  • Get the latest source code onto the EC2 instance by cloning this repository using cloud-init or a command run over SSH.
  • Replace the cloud-config installation of nvidia-docker with our own AMI, based on ami-50b4f047, that has nvidia-docker preinstalled.
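A minimal sketch of the build-and-push flow from the first bullet, assuming the ECR repository already exists; the account ID, region, and repository name are placeholders:

```bash
# Build locally and push to ECR so the EC2 instance only has to pull the image.
# Account ID, region, and repository name are illustrative.
ACCOUNT_ID=123456789012
REGION=us-east-1
REPO=${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/example-image

docker build -t example-image:latest .
docker tag example-image:latest "${REPO}:latest"

# Authenticate Docker to ECR (2017-era CLI; newer CLIs use `aws ecr get-login-password`).
eval "$(aws ecr get-login --region ${REGION})"
docker push "${REPO}:latest"
```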

Optimizations for multi-user collaborations

  • Identify a user's EC2 instance via key-pair name: scripts/run uses aws ec2 wait to determine when Spot Fleet requests are complete. However, if multiple users run the script at the same time, the script may wait on the wrong Spot Fleet request. One way to avoid this is to let each user supply their own (named) key-pair and add key-name as an additional filter to aws ec2 wait instance-running (see the sketch after this list).

  • Use the AWS CLI to terminate instances once their jobs have finished. Ideally we'd run this from inside the container, but that would require either storing credentials in the container (a security risk) or access to the EC2 metadata service.
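A minimal sketch of both ideas, assuming each user launches with a uniquely named key pair; the key name is illustrative:

```bash
# Wait only for instances launched with this user's key pair.
KEY_NAME=my-user-keypair

aws ec2 wait instance-running \
  --filters "Name=key-name,Values=${KEY_NAME}"

# Later, terminate only that user's instances once jobs finish.
INSTANCE_IDS=$(aws ec2 describe-instances \
  --filters "Name=key-name,Values=${KEY_NAME}" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" --output text)

aws ec2 terminate-instances --instance-ids ${INSTANCE_IDS}
```

Depending on how the Spot Fleet request is configured, cancelling the request itself may also be necessary (see the discussion below).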

Concurrent processing across instances

We want to be able to run the same command with different parameters, simultaneously, across all available workers. We settled on the following:

  • Add instance name/index $INSTANCE_ID as an environment variable via cloud-config
  • Store the command parameters in files namespaced by instance ID. Each instance would read a file like $INSTANCE_ID-command-options.json (see the sketch below).
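A minimal sketch of what a worker would do at startup, assuming $INSTANCE_ID is exported by cloud-config; the S3 bucket and key layout are illustrative:

```bash
# INSTANCE_ID is assumed to be set by cloud-config; bucket and key layout are illustrative.
: "${INSTANCE_ID:?INSTANCE_ID must be set by cloud-config}"

OPTIONS_FILE="${INSTANCE_ID}-command-options.json"
aws s3 cp "s3://example-experiment-bucket/options/${OPTIONS_FILE}" "/tmp/${OPTIONS_FILE}"

# The job command would then read its parameters from /tmp/${OPTIONS_FILE}.
```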
@hectcastro

  • I'm not sure introducing all of the overhead around baking an AMI is going to be worth it just to bake in the nvidia-docker dependency. That archive is ~2MB and decompresses to a single binary. Downloading and decompressing that on an EC2 instance should take a few seconds end-to-end.

  • I think it would be good to make the aws ec2 wait step use tags vs. key-pair names. That looks like an option supported by --filters.

  • Terminating the instances from within an EC2 instance could be tricky, because I think we'd actually have to terminate the Spot Fleet request (and doing that also indirectly messes with Terraform's state of the world).

  • Not entirely following the last solution for processing work in parallel, but as a related note, the instance ID is already available via instance metadata (see the sketch after this list).
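For reference, a minimal sketch of reading the instance ID from the instance metadata service:

```bash
# Query the EC2 instance metadata endpoint for this instance's ID.
curl -s http://169.254.169.254/latest/meta-data/instance-id
```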


lewfish commented Feb 14, 2017

> I'm not sure introducing all of the overhead around baking an AMI is going to be worth it just to bake in the nvidia-docker dependency. That archive is ~2MB and decompresses to a single binary. Downloading and decompressing that on an EC2 instance should take a few seconds end-to-end.

I don't think we need to bake our own AMI with nvidia-docker on it. We can just use an existing AMI that has it. But if it only takes a few seconds to install it, then it doesn't matter.

> I think it would be good to make the aws ec2 wait step use tags vs. key-pair names. That looks like an option supported by --filters.

Tags make more sense, but Terraform doesn't let us associate tags with instances created using a spot fleet request. See hashicorp/terraform#3263

> Not entirely following the last solution for processing work in parallel, but as a related note, the instance ID is already available via instance metadata.

True, but each worker needs to know its index (i.e. a number between 0 and n-1 for n workers) to figure out which batch of jobs to run. I don't think we can turn the instance ID into a worker index.

@hectcastro

Good point about #3263.

Regarding an index for workers, there is also ami-launch-index via instance metadata. Not entirely sure what value that returns when multiple instances get launched from a Spot Fleet though.
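A minimal sketch of reading that value from the metadata service (how it behaves for Spot Fleet launches is the open question above):

```bash
# ami-launch-index is a 0-based index within the launch request.
curl -s http://169.254.169.254/latest/meta-data/ami-launch-index
```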

I guess my high-level concern is that we should try to make use of what's already there (if it applies) via instance metadata vs. supplying and managing our own identifiers.


lewfish commented Mar 1, 2017

I'm thinking about using AWS Batch to run lots of experiments in parallel. Does that sound OK? One issue is that Batch uses ECS, and ECS doesn't know about nvidia-docker. There's a workaround that makes the GPU usable even when running with regular Docker in ECS (see the sketch below): https://blog.cloudsight.ai/deep-learning-image-recognition-using-gpus-in-amazon-ecs-docker-containers-5bdb1956f30e#.mau60bfvo
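I haven't pinned down the exact steps from that post, but the general shape of the workaround is to expose the NVIDIA devices and driver volume to a plain docker run; the device names and driver volume name vary by host and driver version, so treat these as illustrative:

```bash
# Run a CUDA container without nvidia-docker by mapping the GPU devices and driver volume by hand.
# /dev/nvidia0 and the nvidia_driver_latest volume name are illustrative; they depend on the host.
docker run --rm \
  --device /dev/nvidiactl \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia0 \
  -v nvidia_driver_latest:/usr/local/nvidia:ro \
  nvidia/cuda nvidia-smi
```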


lewfish commented Mar 8, 2017

I'm moving the conversation about parallelizing experiments to #10.

lewfish closed this as completed Mar 8, 2017