
Fixes #205 - Docker install not finding GPUs #235

Closed · wants to merge 11 commits

Conversation

@dhanainme (Collaborator) commented Apr 21, 2020

Fixes #205 - Docker install not finding GPUs

  • Invokes docker with --runtime=nvidia for GPU support
  • Adds an option to specify GPU usage and specific GPU IDs in start.sh (see the sketch below)
  • Fixes the documentation for the start.sh script
  • Fixes the JDK version in Dockerfile.gpu
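
A minimal sketch of how this wiring could look in start.sh (option handling and variable names here are illustrative assumptions, not the exact patch; device selection via NVIDIA_VISIBLE_DEVICES is also an assumption):

    #!/bin/bash
    # Sketch only: illustrative GPU option handling for start.sh
    DOCKER_RUNTIME=""
    GPU_DEVICES=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -g|--gpu)
                DOCKER_RUNTIME="--runtime=nvidia"
                shift
                ;;
            -d|--gpu_devices)
                DOCKER_RUNTIME="--runtime=nvidia"           # specifying devices implies the GPU runtime
                GPU_DEVICES="-e NVIDIA_VISIBLE_DEVICES=$2"  # e.g. 1,2,3 (assumed mechanism)
                shift 2
                ;;
            *)
                shift
                ;;
        esac
    done

    # Launch TorchServe in the background with the selected runtime and devices
    docker run --rm -d $DOCKER_RUNTIME $GPU_DEVICES \
        -p 8080:8080 -p 8081:8081 pytorch/torchserve:latest-gpu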

Testing: This was tested on a DL AMI 27 (Ubuntu 18) p3.8xlarge instance, which has 4 CUDA-compatible GPUs. I built the Docker image with the --gpu directive.

Terminal output for all three cases:

ubuntu@ip-172-31-32-255:~/serve$ ./start.sh
Starting pytorch/torchserve:latest-gpu docker image
Successfully started torchserve in docker
Registering resnet-18 model
.....
ubuntu@ip-172-31-32-255:~/serve$ docker exec -it 4a162a13f695 head logs/ts_log.log
....
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 32

ubuntu@ip-172-31-32-255:~/serve$ ./start.sh -g
Starting pytorch/torchserve:latest-gpu docker image
Successfully started torchserve in docker
Registering resnet-18 model
....
ubuntu@ip-172-31-32-255:~/serve$ docker exec -it 281fc5744248 head logs/ts_log.log
....
Temp directory: /home/model-server/tmp
Number of GPUs: 4
Number of CPUs: 32

ubuntu@ip-172-31-32-255:~/serve$ ./start.sh --gpu --gpu_devices 1,2,3
Starting pytorch/torchserve:latest-gpu docker image
Successfully started torchserve in docker
Registering resnet-18 model
successfully registered resnet-18 model with torchserve
...

ubuntu@ip-172-31-32-255:~/serve$ docker exec -it e05ff19681f9 head logs/ts_log.log
...
Number of GPUs: 3
Number of CPUs: 32
...
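
For reference, the three invocations above correspond roughly to the following docker run commands (a sketch under the same assumption that start.sh maps --gpu_devices to NVIDIA_VISIBLE_DEVICES):

    # ./start.sh                              (CPU only, no runtime flag)
    docker run --rm -d -p 8080:8080 -p 8081:8081 pytorch/torchserve:latest-gpu

    # ./start.sh -g                           (all GPUs via the NVIDIA runtime)
    docker run --rm -d --runtime=nvidia -p 8080:8080 -p 8081:8081 pytorch/torchserve:latest-gpu

    # ./start.sh --gpu --gpu_devices 1,2,3    (three specific GPUs)
    docker run --rm -d --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1,2,3 \
        -p 8080:8080 -p 8081:8081 pytorch/torchserve:latest-gpu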

@dhanainme (Collaborator, Author) commented:

Fixed Docker tags

@sagemaker-neo-ci-bot (Collaborator): AWS CodeBuild CI report: project torch-serve-build-cpu, commit d4f79d9, result SUCCEEDED (build logs available for 30 days).

@sagemaker-neo-ci-bot (Collaborator): AWS CodeBuild CI report: project torch-serve-build-gpu, commit d4f79d9, result SUCCEEDED (build logs available for 30 days).

@sagemaker-neo-ci-bot (Collaborator): AWS CodeBuild CI report: project torch-serve-build-cpu, commit c6fe3db, result SUCCEEDED (build logs available for 30 days).

@sagemaker-neo-ci-bot (Collaborator): AWS CodeBuild CI report: project torch-serve-build-gpu, commit c6fe3db, result SUCCEEDED (build logs available for 30 days).

        exit 0
        ;;
    -g|--gpu)
        DOCKER_RUNTIME="--runtime=nvidia"
(Contributor) commented:

@dhanainme The new way of running docker with GPUs is with the --gpus flag. Please change to that, as described in #191.

@dhanainme (Collaborator, Author) commented:

I was trying to use this originally, but the new format uses very awkward quoting for multiple GPUs, which does not play well with bash. I would prefer leaving it this way:

docker run --rm -it --gpus '"device=0,1,2"' -p 8080:8080 -p 8081:8081 torchserve:v0.1-gpu-latest
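
For comparison, a sketch of the --runtime=nvidia form this PR uses instead, which selects devices through an environment variable rather than nested quoting (NVIDIA_VISIBLE_DEVICES is assumed here as the selection mechanism; it is honoured by the NVIDIA container runtime):

    docker run --rm -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0,1,2 \
        -p 8080:8080 -p 8081:8081 torchserve:v0.1-gpu-latest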

@chauhang (Contributor) left a comment:
Please see comments inline

@fbbradheintz (Contributor) left a comment:

Good news first: Testing on multi-GPU machines, the updated scripts seem functionally correct.

There are two more items to fix, besides my inline comments:

  • serve/docker/README.md is missing any mention of the docker command-line flags for specifying GPU runtime and specific devices - this needs to be corrected.
  • serve/docker/README.md has no mention of the build_image.sh and start.sh scripts. Was this intentional? If not, please update the Docker README.

(Resolved review thread on README.md; outdated)
        IMAGE_NAME="pytorch/torchserve:latest-gpu"
        shift
        ;;
    -d|--gpu_devices)
(Contributor) commented:

Shouldn't -d imply -g? Right now, it's possible to do something like:

./start.sh -d <devices>

... with the net effect that GPU devices are specified, but not the Nvidia runtime.

(Contributor) commented:

Have you tested all combinations of options?

(Contributor) commented:

Also: Should there be an example command line that shows what a device specifier looks like for this command line? Is it 0? cuda:0?

@dhanainme (Collaborator, Author) commented:

> Shouldn't -d imply -g?

Yes, but I presumed this was understood implicitly. This has been fixed.

> Have you tested all combinations of options?

Yes. I tested CPU only, GPU without the devices flag, and GPU with the devices flag (with different sets of devices to confirm that it works).

> Also: Should there be an example command line that shows what a device specifier looks like for this command line? Is it 0? cuda:0?

The usage is clearly specified in the README.md.
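
A brief sketch of the resulting behaviour after the fix (the exact wiring is assumed, not quoted from the patch):

    # Specifying devices now also enables the NVIDIA runtime:
    ./start.sh --gpu_devices 1,2,3      # behaves like ./start.sh --gpu --gpu_devices 1,2,3

    # Device specifiers are plain comma-separated GPU indices, as in the test
    # output above, not cuda:0-style names:
    ./start.sh --gpu --gpu_devices 0
    ./start.sh --gpu --gpu_devices 1,2,3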

@dhanainme (Collaborator, Author) commented:

> serve/docker/README.md is missing any mention of the docker command-line flags for specifying GPU runtime and specific devices - this needs to be corrected.

I have added this.

> serve/docker/README.md has no mention of the build_image.sh and start.sh scripts. Was this intentional? If not, please update the Docker README.

I think it is intentional. start.sh and build_image.sh are more of a quick-start guide for folks trying to get things up and running quickly without even using docker commands.

In fact, build_image.sh is just a wrapper around docker build and start.sh is a wrapper around docker run, both of which are explained with the right options in serve/docker/README.md.
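
To illustrate the wrapper relationship, roughly (the underlying commands here are assumptions based on this PR's description, not quoted from the scripts):

    # build_image.sh --gpu is essentially a wrapper around something like:
    docker build -f Dockerfile.gpu -t pytorch/torchserve:latest-gpu .

    # start.sh -g is essentially a wrapper around something like:
    docker run --rm -d --runtime=nvidia -p 8080:8080 -p 8081:8081 pytorch/torchserve:latest-gpu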

(Contributor) commented:

Also: should there be an example command line that shows what a device ID for a GPU looks like?

@sagemaker-neo-ci-bot (Collaborator): AWS CodeBuild CI report: project torch-serve-build-cpu, commit b1a8517, result SUCCEEDED (build logs available for 30 days).

@sagemaker-neo-ci-bot (Collaborator): AWS CodeBuild CI report: project torch-serve-build-gpu, commit b1a8517, result SUCCEEDED (build logs available for 30 days).

@sagemaker-neo-ci-bot (Collaborator): AWS CodeBuild CI report: project torch-serve-build-cpu, commit 21c761e, result SUCCEEDED (build logs available for 30 days).

@sagemaker-neo-ci-bot (Collaborator): AWS CodeBuild CI report: project torch-serve-build-gpu, commit 21c761e, result SUCCEEDED (build logs available for 30 days).

Manoj Rao and others added 7 commits on April 23, 2020:

  • update tags in docker readme
  • update conda binary names and pip repo names to say "latest" instead of hardcoded version
  • Updated links in pypi rst for model-archiver
  • Invokes docker with --runtime=nvidia for GPU
  • Added the option to specify the GPU / specific GPU ids in start.sh
  • Fixed the documentation for start.sh script
  • Fixed the JDK version in Dockerfile.gpu
@dhanainme dhanainme self-assigned this Apr 23, 2020
@dhanainme dhanainme closed this Apr 23, 2020
@dhanainme dhanainme deleted the issue_205 branch April 23, 2020 22:23
@sagemaker-neo-ci-bot (Collaborator): AWS CodeBuild CI report: project torch-serve-build-cpu, commit a6a767f, result SUCCEEDED (build logs available for 30 days).

@sagemaker-neo-ci-bot (Collaborator): AWS CodeBuild CI report: project torch-serve-build-gpu, commit a6a767f, result SUCCEEDED (build logs available for 30 days).

@dhanainme (Collaborator, Author) commented:

Too many merge conflicts / updates. Deleting this branch for sanity.

Tracked in #262. All the feedback from this PR has been addressed.
