Use AWS Neuron SDK 2.16 packages #398

dacorvo · 2024-01-08T09:24:42Z

This bumps the version of all AWS Neuron packages referenced in setup.py and in the text-generation-inference Dockerfile.

The latest AWS Neuron SDK pip packages are not compatible with the old drivers in the legacy CI AMI.

The runners need to be updated to use the latest DLAMI from AWS that corresponds to the AWS Neuron SDK 2.16: ami-0fbea04d7389bcd4e.

Runners status:

inf2: DONE
inf1: DONE
trn1: DONE

All tests pass except for the following Trainium common tests:

three test_runner tests (summarization, translation, image_classification),
one staging test on the optimum cache.

I suggest merging this pull request so that @michaelbenayoun can further investigate the failing tests.

HuggingFaceDocBuilderDev · 2024-01-09T08:40:32Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

This will allow early failures during non-regressions.

michaelbenayoun · 2024-01-10T09:27:13Z

.github/workflows/test_inf2.yml

@@ -33,7 +35,23 @@ jobs:
          python -m pip install -U pip
          python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
          python -m pip install .[neuronx,tests]
-      - name: Run tests
+      - name: Run CLI tests


Is it only CLI tests?

Yes, I split the tests step into multiple steps per directory, because the inf2 tests take two hours.

michaelbenayoun · 2024-01-10T09:29:01Z

optimum/neuron/utils/runner.py

@@ -213,7 +214,7 @@ def __init__(
        self.use_venv = use_venv
        self.should_install_requirements = install_requirements
        self.venv_dir = TemporaryDirectory()
-        self.python_name = "python"
+        self.python_name = sys.executable
        self.pip_name = "pip"
        self.torchrun_name = "torchrun"


Does this case still work, given that you needed to change self.python_name and we use either self.python_name or self.torchrun_name?
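For context on the change above: a bare "python" is resolved through PATH in the child process, so it may point to a different interpreter than the one running the tests (or to none at all), whereas sys.executable is the absolute path of the current interpreter. The sketch below illustrates this independently of the actual runner code; run_python_snippet is a hypothetical helper, not part of optimum-neuron.

```python
import subprocess
import sys


def run_python_snippet(code: str) -> str:
    """Run `code` in a child process using the interpreter running this script.

    Hypothetical helper for illustration only.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],  # absolute path, no PATH lookup
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Prints the prefix of the interpreter that was actually used.
    print(run_python_snippet("import sys; print(sys.prefix)"))
```

Note that this only covers commands launched through the Python interpreter; a command launched by name, such as torchrun, is still resolved through PATH.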

JingyaHuang

LGTM! Thanks for fixing the CIs!

JingyaHuang · 2024-01-10T09:30:06Z

tests/exporters/test_export.py

    if random_pick is not None:
-        return sorted(random.choices(models_to_test, k=random_pick))
+        return sorted(random.choices(models_to_test, k=int(random_pick)))


Are the CIs running all tests now? Did you set MAX_EXPORT_TEST_COMBINATIONS? I wonder if it would take too long.

I used it while debugging the tests, but eventually I ran all of them, because setting it to 1 runs just ... one test.
We need to find a compromise for these tests: maybe a few quick steps with a single test each to catch early failures, followed by the whole list, which takes tens of minutes.
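The int() cast in the diff above matters when the value comes from the environment, since environment variables are always strings and random.choices requires an integer k. A minimal sketch, assuming random_pick is read from MAX_EXPORT_TEST_COMBINATIONS; pick_models is a hypothetical name and the fallback to the full list is an assumption, not the actual test code.

```python
import os
import random


def pick_models(models_to_test, env_var="MAX_EXPORT_TEST_COMBINATIONS"):
    """Return all models, or a random subset if the env variable is set.

    Hypothetical helper: the real test module may read the variable elsewhere.
    """
    random_pick = os.environ.get(env_var)  # a str such as "2", or None
    if random_pick is None:
        # Assumed fallback: run every combination.
        return sorted(models_to_test)
    # random.choices samples with replacement, so duplicates are possible.
    return sorted(random.choices(models_to_test, k=int(random_pick)))
```

Because random.choices samples with replacement, random.sample would be the alternative if distinct models were required, at the cost of raising an error when k exceeds the list length.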

dacorvo force-pushed the aws_neuron_2.16 branch 5 times, most recently from 39c2785 to cf0fcc6 Compare January 9, 2024 08:37

dacorvo added 2 commits January 9, 2024 09:48

ci: add steps to display the neuron system packages

d310b78

chore: use AWS Neuron SDK 2.16 packages

26b4947

dacorvo force-pushed the aws_neuron_2.16 branch 2 times, most recently from b2e15f9 to d5815af Compare January 9, 2024 10:40

dacorvo added 3 commits January 9, 2024 12:16

test(generation): update hub models

49d74de

fix(runner): use sys executable instead of hardcoded name for python

b572d2e

ci(inferentia): split tests in several steps

cf8389b

This will allow early failures during non-regressions.

dacorvo force-pushed the aws_neuron_2.16 branch 3 times, most recently from 6605d12 to 9353a27 Compare January 9, 2024 13:50

test(export): allow reducing tests using env variable

3e6415f

dacorvo force-pushed the aws_neuron_2.16 branch from 9353a27 to 3e6415f Compare January 9, 2024 17:54

dacorvo marked this pull request as ready for review January 9, 2024 20:06

dacorvo requested review from philschmid, michaelbenayoun and JingyaHuang January 9, 2024 20:06

test(training): temporarily disable failing tests

069dd05

michaelbenayoun approved these changes Jan 10, 2024


JingyaHuang approved these changes Jan 10, 2024


dacorvo merged commit 3b3afa4 into main Jan 10, 2024
7 of 8 checks passed

dacorvo deleted the aws_neuron_2.16 branch January 10, 2024 09:34
