
Update AWS and GCP GPU images to have cuda 12.1 #49

Merged (3 commits) on Nov 15, 2023


Michaelvll (Collaborator) commented on Nov 14, 2023

The latest PyTorch release, 2.1.0, is built against CUDA 12.1 by default. This PR updates the GPU images for both AWS and GCP so that they ship a driver that supports CUDA 12.1.
skypilot-org/skypilot#2788 should be merged at the same time.

Tested:

  • sky launch serve-openai-api.yaml -c vllm-dbg3 --env HF_TOKEN -i30 --down, without rolling back the vLLM commit
  • sky launch serve-openai-api.yaml -c vllm-dbg-aws2 --gpus A10g --env HF_TOKEN -i30 --down, without rolling back the vLLM commit
  • sky launch --cloud gcp --gpus L4 pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116: the older cu116 wheels still work, so CUDA has good backward compatibility
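The last test relies on CUDA driver backward compatibility: a driver new enough for CUDA 12.1 should still run wheels built for CUDA 11.6. A minimal sketch of that check follows, assuming the simplified rule "wheel CUDA version <= driver-supported CUDA version" (it ignores minor-version-compatibility details and GPU architecture support). The helper names `wheel_cuda_version` and `driver_can_run` are hypothetical and are not part of SkyPilot or PyTorch.

```python
import re
from typing import Optional, Tuple

CudaVersion = Tuple[int, int]


def wheel_cuda_version(wheel_version: str) -> Optional[CudaVersion]:
    """Return (major, minor) from a PyTorch local version tag such as '1.12.0+cu116'."""
    # 'cu116' -> major '11', minor '6': the greedy \d+ backtracks to leave one digit for the minor.
    match = re.search(r"\+cu(\d+)(\d)$", wheel_version)
    if match is None:
        return None  # CPU-only wheel or no local CUDA tag, e.g. plain '2.1.0'
    return int(match.group(1)), int(match.group(2))


def driver_can_run(driver_cuda: CudaVersion, wheel_cuda: CudaVersion) -> bool:
    """Simplified backward-compatibility rule (an assumption): a driver that supports
    CUDA X.Y can run binaries built with any CUDA toolkit up to and including X.Y."""
    return wheel_cuda <= driver_cuda


if __name__ == "__main__":
    # Hypothetical driver-supported CUDA version, e.g. the "CUDA Version" nvidia-smi reports.
    new_driver: CudaVersion = (12, 1)
    for wheel in ["1.12.0+cu116", "2.1.0+cu121"]:
        wheel_cuda = wheel_cuda_version(wheel)
        assert wheel_cuda is not None
        print(wheel, wheel_cuda, driver_can_run(new_driver, wheel_cuda))
```

Both wheels report `True` against a 12.1 driver; the reverse case (a driver that only supports 11.6 running a cu121 wheel) returns `False`, which is why the images themselves had to be updated.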

skypilot:gpu-ubuntu-2004,us-east-2,ubuntu,20.04,ami-0692f9ae92252aab9,20230103
skypilot:gpu-ubuntu-2004,us-west-1,ubuntu,20.04,ami-0b61d2979f583d63d,20230103
skypilot:gpu-ubuntu-2004,us-west-2,ubuntu,20.04,ami-06b81ce928c07a34f,20230103
skypilot:gpu-ubuntu-2004,af-south-1,ubuntu,20.04,ami-0abc73eadd231f5b8,20231103
Member

Maybe we should add instructions to https://github.com/skypilot-org/skypilot-catalog/blob/281a6b5febf770d7726dde1a6dc05d7b99180d2a/README.md on how to update the various images.csv files (e.g., the command used to get all these new AMI IDs)?

Collaborator Author

The AWS images.csv should be updated by fetch_aws.py, so this manual change is only nice to have; fetch_aws.py will always regenerate the file. Added a description in README.md.
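To make the row layout of the AWS images.csv concrete, here is a minimal sketch that reads rows like the ones in the diff above and keeps one image per (tag, region). The column names and the policy "keep the newest creation date" are assumptions for illustration; this is not the logic of fetch_aws.py, and the helper `latest_images` is hypothetical.

```python
import csv
import io
from typing import Dict, Tuple

# Assumed column names, inferred from the rows in this PR; the real images.csv
# header (if any) may differ. Rows are parsed here without a header line.
COLUMNS = ["Tag", "Region", "OS", "OSVersion", "ImageId", "CreationDate"]


def latest_images(csv_text: str) -> Dict[Tuple[str, str], str]:
    """Map (tag, region) to the image ID with the newest creation date."""
    best: Dict[Tuple[str, str], Tuple[str, str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS):
        key = (row["Tag"], row["Region"])
        date = row["CreationDate"]
        # YYYYMMDD strings sort lexicographically in chronological order.
        if key not in best or date > best[key][0]:
            best[key] = (date, row["ImageId"])
    return {key: image_id for key, (_, image_id) in best.items()}


# Rows copied from the diff above, plus one hypothetical older af-south-1 entry
# (placeholder ID) to show that the newest creation date wins.
SAMPLE = """\
skypilot:gpu-ubuntu-2004,us-east-2,ubuntu,20.04,ami-0692f9ae92252aab9,20230103
skypilot:gpu-ubuntu-2004,af-south-1,ubuntu,20.04,ami-PLACEHOLDER-OLDER,20230103
skypilot:gpu-ubuntu-2004,af-south-1,ubuntu,20.04,ami-0abc73eadd231f5b8,20231103
"""

if __name__ == "__main__":
    for (tag, region), image_id in sorted(latest_images(SAMPLE).items()):
        print(tag, region, image_id)
```

Regenerating the file with a script, rather than editing AMI IDs by hand, is what keeps every region consistent when the base image changes.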

@@ -3,5 +3,6 @@ skypilot:cpu-debian-10,,debian,10,projects/deeplearning-platform-release/global/
skypilot:k80-debian-10,,debian,10,projects/deeplearning-platform-release/global/images/common-cu113-v20220701,20220701
skypilot:gpu-debian-10,,debian,10,projects/deeplearning-platform-release/global/images/common-cu113-v20221215,20221215
skypilot:cuda118-debian-11,,debian,11,projects/deeplearning-platform-release/global/images/common-gpu-v20230615-debian-11-py310,20230615
skypilot:cpu-debian-11,,debian,11,projects/deeplearning-platform-release/global/images/common-cpu-v20230615-debian-11-py310,20230615
skypilot:gpu-debian-11,,debian,11,projects/deeplearning-platform-release/global/images/common-gpu-v20230615-debian-11-py310,20230615
skypilot:cuda121-debian-11,,debian,11,projects/deeplearning-platform-release/global/images/common-cu121-v20231105-debian-11-py310,20231105
Member

Q: When do we add a new row such as skypilot:cuda121-debian-11? Is it always when we upgrade the CUDA version of the default image? Worth adding to the README too.
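The question points at a naming convention visible in the GCP hunk above: the new tag skypilot:cuda121-debian-11 mirrors the cu121 and debian-11 parts of its image name. A minimal sketch of deriving such a tag and checking whether the catalog already has it follows. The regex, the helper names, and the rule "add a row when the derived tag is new" are assumptions for illustration, not the policy the README documents. The pattern also does not hold for every row: skypilot:cuda118-debian-11 points at common-gpu-v20230615-debian-11-py310, whose name carries no CUDA version, so the helper returns None for it.

```python
import re
from typing import Optional, Set

# Assumed naming pattern, inferred from the image added in this PR
# ('common-cu121-v20231105-debian-11-py310' -> 'skypilot:cuda121-debian-11').
_IMAGE_RE = re.compile(r"common-cu(\d+)-v\d{8}-(debian-\d+)")


def cuda_tag_for_gcp_image(image_path: str) -> Optional[str]:
    """Derive a 'skypilot:cudaXXX-<os>' tag from a Deep Learning VM image path, if possible."""
    name = image_path.rsplit("/", 1)[-1]
    match = _IMAGE_RE.match(name)
    if match is None:
        return None  # e.g. 'common-gpu-...' images do not encode the CUDA version
    cuda, os_name = match.groups()
    return f"skypilot:cuda{cuda}-{os_name}"


def needs_new_row(image_path: str, existing_tags: Set[str]) -> bool:
    """True if the image implies a CUDA tag that the catalog does not have yet."""
    tag = cuda_tag_for_gcp_image(image_path)
    return tag is not None and tag not in existing_tags


if __name__ == "__main__":
    new_image = ("projects/deeplearning-platform-release/global/images/"
                 "common-cu121-v20231105-debian-11-py310")
    existing = {"skypilot:cuda118-debian-11", "skypilot:gpu-debian-11"}
    print(cuda_tag_for_gcp_image(new_image))  # skypilot:cuda121-debian-11
    print(needs_new_row(new_image, existing))  # True
```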

Collaborator Author

Added it to the README. PTAL. :)

Member

LGTM
