Fixing torch compile benchmark #3179

Merged (3 commits) on Jun 7, 2024

udaij12 (Collaborator) commented Jun 6, 2024

Description

The torch compile nightly tests are running for 22+ hours and are then being terminated due to extended use. The root cause is that the torchtext installation causes the CPU build of torch to be installed instead of the cu121 build, so the tests run on the CPU rather than the GPU.

This can be verified through EC2 stats showing 99% CPU usage during the failed tests, and through nvidia-smi showing no running processes during the tests.

In addition, the torch version in the torch compile nightly job is torch-2.4.0.dev20240605+cpu: https://github.com/pytorch/serve/actions/runs/9391396278/job/25867236574
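
A check along these lines could catch a CPU-only wheel at the start of the job instead of after a 22+ hour run. This is only a sketch and not part of this PR: the helper names `local_build_tag`, `is_cpu_build`, and `assert_gpu_torch` are hypothetical, and it assumes the nightly wheels keep the `+cpu` / `+cu121` suffix seen in the version string above.

```python
def local_build_tag(version: str) -> str:
    """Return the part after '+', e.g. 'cpu' or 'cu121'; '' if there is none."""
    _, _, tag = version.partition("+")
    return tag


def is_cpu_build(version: str) -> bool:
    """True when the version string carries the '+cpu' local label."""
    return local_build_tag(version) == "cpu"


def assert_gpu_torch() -> None:
    """Raise if the installed torch is a CPU build or cannot see a GPU."""
    import torch  # imported lazily so the helpers above work without torch

    if is_cpu_build(torch.__version__) or not torch.cuda.is_available():
        raise RuntimeError(
            f"Expected a CUDA build of torch with a visible GPU, "
            f"got torch {torch.__version__}"
        )
```

Calling `assert_gpu_torch()` before the benchmark starts would fail the job immediately if the wrong build were installed.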

The solution is to go back to installing torchtext together with torch rather than in a separate step.
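
For illustration, installing everything in one pip invocation against a single nightly index might look like the sketch below. The function name `nightly_install_cmd` and the package list are assumptions, not the exact change in this PR; `--pre` and `--index-url` are standard pip options, and the index URL follows the nightly URL pattern used elsewhere in this thread.

```python
import sys

# Nightly wheel index; {tag} is a build tag such as "cu121" or "cpu".
NIGHTLY_INDEX = "https://download.pytorch.org/whl/nightly/{tag}"


def nightly_install_cmd(tag: str = "cu121", packages=("torch", "torchtext")) -> list:
    """Build one pip command that installs all packages from the same index."""
    return [
        sys.executable, "-m", "pip", "install", "--pre",
        *packages,
        "--index-url", NIGHTLY_INDEX.format(tag=tag),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`. Resolving torch and torchtext in one command means a later, separate torchtext install has no opportunity to replace the torch build that was already chosen.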

Logs

https://github.com/pytorch/serve/actions/runs/9406402297/job/25909766586
The torch version can be verified in this run, which also reports:

GPU models and configuration: 
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G

udaij12 marked this pull request as ready for review June 6, 2024 18:22
udaij12 requested a review from namannandan June 6, 2024 18:22
agunapal (Collaborator) left a comment

This is the prescribed way to install torchtext; see #3011.

As a workaround, you can skip installing the torchtext nightlies.

We will be removing this when 2.4 is released.

udaij12 (Collaborator, Author) commented Jun 6, 2024

Regarding #3011: there is no longer any reason to do this, since the torchtext in https://download.pytorch.org/whl/nightly/cpu and the torchtext in https://download.pytorch.org/whl/nightly/cu121 link to the same thing.
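
One way to check this claim is to compare the links listed on the two index pages. The sketch below is an assumption-laden illustration, not something run in this PR: `extract_hrefs`, `same_links`, and `fetch` are hypothetical helpers, and the comparison only looks at the `href` values of anchor tags in whatever HTML is passed in.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.add(value)


def extract_hrefs(html: str) -> set:
    """Return the set of link targets found in an HTML page."""
    parser = _HrefCollector()
    parser.feed(html)
    return parser.hrefs


def same_links(html_a: str, html_b: str) -> bool:
    """True when both pages list exactly the same link targets."""
    return extract_hrefs(html_a) == extract_hrefs(html_b)


def fetch(url: str) -> str:
    """Download a page as text (requires network access)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Assuming each index exposes a per-package page at `<index>/torchtext/`, calling `same_links(fetch(cpu_page), fetch(cu121_page))` on the two pages would confirm whether they list identical files.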

agunapal (Collaborator) left a comment

LGTM

agunapal enabled auto-merge June 7, 2024 17:07
agunapal added this pull request to the merge queue Jun 7, 2024
Merged via the queue into master with commit 36049cb Jun 7, 2024
12 checks passed
udaij12 deleted the torch_fix branch June 10, 2024 18:26