Update AWS and GCP GPU images to have cuda 12.1 #49
skypilot:gpu-ubuntu-2004,us-east-2,ubuntu,20.04,ami-0692f9ae92252aab9,20230103
skypilot:gpu-ubuntu-2004,us-west-1,ubuntu,20.04,ami-0b61d2979f583d63d,20230103
skypilot:gpu-ubuntu-2004,us-west-2,ubuntu,20.04,ami-06b81ce928c07a34f,20230103
skypilot:gpu-ubuntu-2004,af-south-1,ubuntu,20.04,ami-0abc73eadd231f5b8,20231103
Maybe we should add a note to https://github.com/skypilot-org/skypilot-catalog/blob/281a6b5febf770d7726dde1a6dc05d7b99180d2a/README.md on how to update the various images.csv files (e.g., the command used to get all these new AMI IDs)?
The images.csv for AWS should be updated by fetch_aws.py, i.e., this change is nice to have, and the file will always be updated by fetch_aws.py. Added a description in README.md.
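For readers unfamiliar with the catalog layout, here is a minimal sketch of how rows like the ones in this diff could be read to look up an image ID by tag and region. The column names and the `find_image_id` helper are assumptions made for illustration; they are not taken from fetch_aws.py or from the real images.csv header.

```python
import csv
import io

# Column names are assumed for illustration; the real images.csv header may differ.
COLUMNS = ["tag", "region", "os", "os_version", "image_id", "creation_date"]

# Rows copied from the diff in this PR.
SAMPLE = """\
skypilot:gpu-ubuntu-2004,us-east-2,ubuntu,20.04,ami-0692f9ae92252aab9,20230103
skypilot:gpu-ubuntu-2004,us-west-1,ubuntu,20.04,ami-0b61d2979f583d63d,20230103
skypilot:gpu-ubuntu-2004,af-south-1,ubuntu,20.04,ami-0abc73eadd231f5b8,20231103
"""


def find_image_id(csv_text, tag, region):
    """Return the image ID for (tag, region), or None if no row matches."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS)
    for row in reader:
        if row["tag"] == tag and row["region"] == region:
            return row["image_id"]
    return None


if __name__ == "__main__":
    print(find_image_id(SAMPLE, "skypilot:gpu-ubuntu-2004", "us-west-1"))
```

GCP rows in this catalog leave the region field empty, so a lookup for them would pass an empty string as the region under the same assumptions.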
@@ -3,5 +3,6 @@ skypilot:cpu-debian-10,,debian,10,projects/deeplearning-platform-release/global/
skypilot:k80-debian-10,,debian,10,projects/deeplearning-platform-release/global/images/common-cu113-v20220701,20220701
skypilot:gpu-debian-10,,debian,10,projects/deeplearning-platform-release/global/images/common-cu113-v20221215,20221215
skypilot:cuda118-debian-11,,debian,11,projects/deeplearning-platform-release/global/images/common-gpu-v20230615-debian-11-py310,20230615
skypilot:cpu-debian-11,,debian,11,projects/deeplearning-platform-release/global/images/common-cpu-v20230615-debian-11-py310,20230615
skypilot:gpu-debian-11,,debian,11,projects/deeplearning-platform-release/global/images/common-gpu-v20230615-debian-11-py310,20230615
skypilot:cuda121-debian-11,,debian,11,projects/deeplearning-platform-release/global/images/common-cu121-v20231105-debian-11-py310,20231105
Q: when do we add a new row, skypilot:cuda121-debian-11? Is it always when we're upgrading the default image's CUDA version? Worth adding to README too.
Added it to the README. PTAL. :)
LGTM
The latest PyTorch (2.1.0) requires CUDA 12.1 by default. This PR updates the images for both AWS and GCP to support the latest CUDA driver.
We should get skypilot-org/skypilot#2788 in at the same time.
Tested:
- sky launch serve-openai-api.yaml -c vllm-dbg3 --env HF_TOKEN -i30 --down (without rolling back the vllm commit)
- sky launch serve-openai-api.yaml -c vllm-dbg-aws2 --gpus A10g --env HF_TOKEN -i30 --down (without rolling back the vllm commit)
- sky launch --cloud gcp --gpus L4 pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116 (CUDA has good backward compatibility)
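The backward-compatibility point in the last test can be illustrated with a rough sketch: a wheel built for an older CUDA version (the `+cu116` suffix) is expected to run on an image whose driver supports a newer one (12.1 here). The helper names below are hypothetical, the parsing assumes a single-digit minor version in the `cuXYZ` suffix, and the comparison is a simplification that ignores finer points such as minor-version compatibility.

```python
import re


def wheel_cuda_version(requirement):
    """Extract (major, minor) from a '+cuXYZ' suffix, e.g. '+cu116' -> (11, 6).

    Assumes the minor version is a single digit; returns None if no suffix.
    """
    match = re.search(r"\+cu(\d+)(\d)$", requirement)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def runs_on_image(requirement, image_cuda):
    """Simplified rule: a wheel runs if its CUDA version <= the image's."""
    wheel = wheel_cuda_version(requirement)
    return wheel is not None and wheel <= image_cuda


if __name__ == "__main__":
    # Image CUDA version taken from this PR (12.1).
    print(runs_on_image("torch==1.12.0+cu116", (12, 1)))
```

Under this simplified rule, the reverse case (a cu121 wheel on an image that only supports CUDA 11.8) is reported as incompatible, which matches the motivation for updating the images.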