Use AWS Neuron SDK 2.16 packages #398

Merged
merged 7 commits into from
Jan 10, 2024

Conversation

Collaborator

@dacorvo dacorvo commented Jan 8, 2024

This bumps the version of all AWS Neuron packages referenced in setup.py and in the text-generation-inference Dockerfile.

The latest AWS Neuron SDK pip packages are not compatible with the old drivers in the legacy CI AMI.

The runners need to be updated to use the latest DLAMI from AWS that corresponds to the AWS Neuron SDK 2.16: ami-0fbea04d7389bcd4e.

Runners status:

  • inf2: DONE
  • inf1: DONE
  • trn1: DONE

All tests pass, except for the following Trainium common tests, which are failing:

  • three test_runner tests (summarization, translation, image_classification),
  • one staging test on optimum cache.

I suggest merging this pull request so that @michaelbenayoun can further investigate the failing tests.

@dacorvo dacorvo force-pushed the aws_neuron_2.16 branch 5 times, most recently from 39c2785 to cf0fcc6 Compare January 9, 2024 08:37
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dacorvo dacorvo force-pushed the aws_neuron_2.16 branch 2 times, most recently from b2e15f9 to d5815af Compare January 9, 2024 10:40
@dacorvo dacorvo force-pushed the aws_neuron_2.16 branch 3 times, most recently from 6605d12 to 9353a27 Compare January 9, 2024 13:50
@@ -33,7 +35,23 @@ jobs:
           python -m pip install -U pip
           python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
           python -m pip install .[neuronx,tests]
-      - name: Run tests
+      - name: Run CLI tests
Member

Is it only CLI tests?

Collaborator Author

Yes, I split the tests step into multiple steps per directory, because the inf2 tests take two hours.

@@ -213,7 +214,7 @@ def __init__(
         self.use_venv = use_venv
         self.should_install_requirements = install_requirements
         self.venv_dir = TemporaryDirectory()
-        self.python_name = "python"
+        self.python_name = sys.executable
         self.pip_name = "pip"
         self.torchrun_name = "torchrun"
Member

Does this case still work, given that you needed to change self.python_name and that we use either self.python_name or self.torchrun_name?
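For context on the change above, here is a minimal sketch (not the project's test runner) of why `sys.executable` can be a safer choice than the bare name `"python"` when spawning a subprocess: it is the absolute path of the interpreter currently running, so the child process does not depend on what `python` resolves to on `PATH`. The helper name `run_with_current_interpreter` is hypothetical.

```python
import subprocess
import sys


def run_with_current_interpreter(code: str) -> str:
    """Run a Python snippet in a child process using the same interpreter.

    Hypothetical helper for illustration only. Using sys.executable avoids
    relying on a `python` entry on PATH, which may be missing or may point
    to a different environment than the one running the tests.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Print the interpreter path seen by the child process.
    print(run_with_current_interpreter("import sys; print(sys.executable)"))
```

Note that this only affects commands launched through the interpreter name; commands launched through a separate entry point such as `torchrun` are still resolved through `PATH`.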

Collaborator

@JingyaHuang JingyaHuang left a comment

LGTM! Thanks for fixing the CIs!

     if random_pick is not None:
-        return sorted(random.choices(models_to_test, k=random_pick))
+        return sorted(random.choices(models_to_test, k=int(random_pick)))
Collaborator

Are the CIs running all tests now? Did you set MAX_EXPORT_TEST_COMBINATIONS? I wonder if it would take too long.

Collaborator Author

I used it while debugging the tests, but eventually I ran all of them because if I set it to just 1 it runs just ... one test.
We would need to figure out a compromise for these tests. Maybe have a few quick steps with just one test to catch early failures, then the whole list that takes tens of minutes.
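The `int(random_pick)` cast discussed in this thread can be illustrated with a small sketch. It assumes that `random_pick` is read from the `MAX_EXPORT_TEST_COMBINATIONS` environment variable, which the thread suggests but the visible diff does not show; the function name `pick_models` and the fallback branch are hypothetical. Environment variables are always strings, and `random.choices` needs an integer `k`, so the cast is required.

```python
import os
import random

# Variable name taken from the thread; reading it this way is an assumption.
ENV_VAR = "MAX_EXPORT_TEST_COMBINATIONS"


def pick_models(models_to_test):
    """Return the models to test, optionally limited to a random subset.

    Hypothetical helper for illustration. os.environ values are strings,
    so the value must be converted to int before being used as `k`.
    """
    random_pick = os.environ.get(ENV_VAR)
    if random_pick is not None:
        # random.choices samples WITH replacement, so duplicates are possible.
        return sorted(random.choices(models_to_test, k=int(random_pick)))
    # Assumed fallback: run every model when the variable is unset.
    return sorted(models_to_test)
```

Because `random.choices` samples with replacement, a value of 3 may select fewer than three distinct models; `random.sample` would be one alternative if distinct picks were wanted.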

@dacorvo dacorvo merged commit 3b3afa4 into main Jan 10, 2024
7 of 8 checks passed
@dacorvo dacorvo deleted the aws_neuron_2.16 branch January 10, 2024 09:34
4 participants