-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use AWS Neuron SDK 2.16 packages #398
Conversation
39c2785
to
cf0fcc6
Compare
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
b2e15f9
to
d5815af
Compare
This will allow early failures during non-regressions.
6605d12
to
9353a27
Compare
9353a27
to
3e6415f
Compare
@@ -33,7 +35,23 @@ jobs: | |||
python -m pip install -U pip | |||
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com | |||
python -m pip install .[neuronx,tests] | |||
- name: Run tests | |||
- name: Run CLI tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it only CLI tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I split the tests step into multiple steps per directory, because the inf2 tests take two hours.
@@ -213,7 +214,7 @@ def __init__( | |||
self.use_venv = use_venv | |||
self.should_install_requirements = install_requirements | |||
self.venv_dir = TemporaryDirectory() | |||
self.python_name = "python" | |||
self.python_name = sys.executable | |||
self.pip_name = "pip" | |||
self.torchrun_name = "torchrun" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case still works considering that you needed to change the name for self.python_name
and we either use self.python_name
or self.torchrun_name
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for fixing the CIs!
if random_pick is not None: | ||
return sorted(random.choices(models_to_test, k=random_pick)) | ||
return sorted(random.choices(models_to_test, k=int(random_pick))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the CIs running all tests now? Did you set MAX_EXPORT_TEST_COMBINATIONS
? I wonder if it would take too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used it while debugging the tests, but eventually I ran all of them because if I set it to just 1 it runs just ... one test.
We would need to figure out a compromise for these tests. Maybe have a few quick steps with just one test to catch early failures, then the whole list that takes tens of minutes.
This bumps the version of all AWS neuron packages referenced in setup.py and text-generation-inference Dockerfile.
The latest AWS Neuron SDK pip packages are not compatible with the old drivers in the legacy CI AMI.
The runners need to be updated to use the latest DLAMI from AWS that corresponds to the AWS Neuron SDK 2.16: ami-0fbea04d7389bcd4e.
Runners status:
All tests are OK, except for the
trainium
common tests that are failing;I would suggest to merge that pull-request so that @michaelbenayoun can further investigate the failing tests.