
{ai}[foss/2023b] PyTorch v2.3.0 #20489

Status: Open · wants to merge 3 commits into develop
Conversation

@akesandgren (Contributor)

(created using eb --new-pr)

…2.3.0_disable_DataType_dependent_test_if_tensorboard_is_not_available.patch, PyTorch-2.3.0_disable_test_linear_package_if_no_half_types_are_available.patch
@akesandgren (Contributor Author)

Tests that are failing for me are:

inductor/test_torchinductor 1/1 failed!   test_multilayer_var_lowp
inductor/test_torchinductor_dynamic_shapes 1/1 failed!   test_multilayer_var_lowp
test_cpp_extensions_open_device_registration 1/1 failed!   test_open_device_registration (Not implemented yet ?)
inductor/test_cpu_repro 1/1 failed!    test_scatter_using_atomic_add
test_decomp 1/1 failed!   test_sdpa (_nn_functional_scaled_dot_product_attention_cpu_bfloat16)
inductor/test_torchinductor_opinfo 1/1 failed!
 inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_fft_ihfft2_cpu_int32 FAILED
    C++ error

@akesandgren (Contributor Author)

@Flamefire
The first two are the same issue: a precision problem, on AMD zen3 at least.
For cpp_extensions_open_device_registration I don't have a clue yet.
scatter_using_atomic_add looks like it's not compiling to the code it expects; not sure why.
test_sdpa is also precision related.
I haven't tackled the C++ error yet.

@akesandgren (Contributor Author)

@boegelbot Please test @ jsc-zen3

@boegelbot (Collaborator)

@akesandgren: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20489 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20489 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4085

Test results coming soon (I hope)...

- notification for comment with ID 2098172875 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@akesandgren (Contributor Author)

Test report by @akesandgren
FAILED
Build succeeded for 0 out of 1 (3 easyconfigs in total)
b-an02.hpc2n.umu.se - Linux Ubuntu 20.04, x86_64, Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz, Python 3.8.10
See https://gist.github.com/akesandgren/ef17ea2435926ca06bbe5cbbe6058158 for a full test report.

@boegelbot (Collaborator)

Test report by @boegelbot
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/1ce9b70e5410ebd3a1d8dbbce992b8c7 for a full test report.

@akesandgren (Contributor Author)

Test report by @akesandgren
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
b-cn1607.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz, 3 x NVIDIA NVIDIA A100 80GB PCIe, 545.29.06, Python 3.10.12
See https://gist.github.com/akesandgren/9a9b6ec51769af98d6d4689b4e1ba93a for a full test report.

@akesandgren (Contributor Author)

Interesting...
If I run the tests standalone there are fewer failing tests than when run during a build...

@Flamefire (Contributor)

Interesting... If I run the tests standalone there are fewer failing tests than when run during a build...

Not unusual for PyTorch ;-)
I just got bitten again by $XDG_CACHE_HOME: PyTorch uses it to store JIT-compiled files, so rerunning the same test with the same value for that variable behaves differently, because PyTorch loads the file from that directory instead of JIT compiling it.
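As an illustration of the workaround (the commands are a sketch of the general idea, not the actual build setup), pointing XDG_CACHE_HOME at a fresh temporary directory before each run keeps stale JIT-compiled artifacts from being picked up:

```shell
# Hypothetical sketch: give each test run its own empty cache directory
# so PyTorch cannot reuse JIT-compiled files from an earlier run.
export XDG_CACHE_HOME="$(mktemp -d)"
echo "using fresh cache: $XDG_CACHE_HOME"
# ... run the PyTorch test suite here ...
rm -rf "$XDG_CACHE_HOME"
```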

@akesandgren (Contributor Author)

akesandgren commented May 14, 2024

These fail because SANDCASTLE=1 is set when the tests run as part of the build:

export/test_lift_unlift
export/test_serialize
export/test_torchbind
export/test_unflatten
higher_order_ops/test_with_effects
test_weak

Those are the diff between my standalone test run (which was without SANDCASTLE) and the test-while-building run.

@akesandgren (Contributor Author)

@Flamefire Do you know why we set SANDCASTLE=1 in the easyblock?
As far as I can see, Sandcastle is a specific machine that they run their tests on...

@Flamefire (Contributor)

@Flamefire Do you know why we set SANDCASTLE=1 in the easyblock? As far as I can see, Sandcastle is a specific machine that they run their tests on...

Yes, there are a lot of checks like @unittest.skipIf(IS_SANDCASTLE, "NYI: fuser CPU support for Sandcastle") in the tests, and the idea was: if they don't even run/work on their own machine, we shouldn't even try to run them on ours.

So we might need to patch those failing ones. TestWithEffects loads a different library; it's similar in test_weak.py, and likely for the export tests, although I couldn't find the exact ones you mentioned.

@akesandgren (Contributor Author)

I'm doing a test run without SANDCASTLE set and with test_hub disabled; that's one of only two tests I found that do external downloads, the other being one test in test_nn.
By the looks of some of the comments around SANDCASTLE, it doesn't feel like a normal x86_64-based machine...

@akesandgren (Contributor Author)

And I have manually run the full test suite without SANDCASTLE set on a previous build and saw only 3 failed tests.
So I don't think we need SANDCASTLE set.

@Flamefire (Contributor)

By the looks of some of the comments around SANDCASTLE it doesn't feel like a normal x86_64 based machine...

Might be. I used it because it disables a LOT of tests, especially those downloading stuff, IIRC. See https://github.com/search?q=repo%3Apytorch%2Fpytorch%20IS_SANDCASTLE&type=code

Two such instances seem to skip whole classes of tests at once: https://github.com/pytorch/pytorch/blob/20aa7cc6788ff10dee2d927057b10a81af638a32/test/jit/test_backends.py#L69-L73 and https://github.com/pytorch/pytorch/blob/2e4d0111953e6db7e4ce5cf041e6a78770092495/test/jit/test_torchbind.py#L37-L38

And I have manually run the full test suite without SANDCASTLE set on a previous build and saw only 3 failed tests.

If it is indeed the case that NOT setting it now causes fewer failures, then we should stop setting it. Best to condition that on 2.3+ so we don't introduce regressions.

I'll try to push a change upstream to use something like @skip_if_sandcastle, which would give us an easy way to skip all those tests by patching that one function, without changing any other behavior controlled by that env variable.
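A minimal sketch of that decorator idea (the name skip_if_sandcastle and the env-variable check are assumptions about how such a helper could look, not PyTorch's actual implementation):

```python
import os
import unittest

# Assumption: mirror a PyTorch-style IS_SANDCASTLE flag via the environment.
IS_SANDCASTLE = os.environ.get("SANDCASTLE") == "1"

def skip_if_sandcastle(reason="not supported on Sandcastle"):
    """Hypothetical central decorator: patching this one function would
    let a build system skip (or un-skip) every Sandcastle-only test at
    once, without touching any other behavior tied to the variable."""
    return unittest.skipIf(IS_SANDCASTLE, reason)

class ExampleTest(unittest.TestCase):
    @skip_if_sandcastle("NYI: fuser CPU support for Sandcastle")
    def test_fuser(self):
        self.assertTrue(True)
```

Scattered @unittest.skipIf(IS_SANDCASTLE, ...) checks would keep working unchanged; only tests routed through the helper become patchable in one place.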

@Flamefire (Contributor)

We have another issue: pytest-rerunfailures interferes with our test output parsing. We want some output like

    # ===================== 2 failed, 128 passed, 2 skipped, 2 warnings in 3.43s =====================
    # test_quantization failed!

But now we get:

Running test_cpp_extensions_open_device_registration 1/1 ... [2024-05-13 16:48:56.717884]
Executing ['.../python', '-bb', 'test_cpp_extensions_open_device_registration.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2'] ... [2024-05-13 16:48:56.718522]
===================== test session starts =====================
[...]
('RERUN', {'yellow': True}) [1.1713s]                                                    [100%]
test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration ('RERUN', {'yellow': True}) [0.0036s]                                                    [100%]
test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration FAILED [0.0033s]                                                                         [100%]

===================== RERUNS =====================
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
===================== FAILURES =====================
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
===================== short test summary info =====================
FAILED [0.0033s] test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration - AssertionError: RuntimeError not raised
!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!
===================== 1 failed, 2 rerun in 39.35s =====================
Got exit code 1
Retrying...
===================== test session starts =====================
[...]
('RERUN', {'yellow': True}) [1.9584s]                                                    [100%]
test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration [W Module.cpp:160] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

('RERUN', {'yellow': True}) [0.0036s]                                                    [100%]
test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration [W Module.cpp:160] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

FAILED [0.0023s]                                                                         [100%]

===================== RERUNS =====================
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
===================== FAILURES =====================
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
===================== short test summary info =====================
FAILED [0.0023s] test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration - AssertionError: RuntimeError not raised
!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!
===================== 1 failed, 2 rerun in 40.27s =====================
Got exit code 1
Retrying...
===================== test session starts =====================
[...]
('RERUN', {'yellow': True}) [1.8911s]                                                    [100%]
test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration [W Module.cpp:160] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

('RERUN', {'yellow': True}) [0.0032s]                                                    [100%]
test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration [W Module.cpp:160] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

FAILED [0.0021s]                                                                         [100%]

===================== RERUNS =====================
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
===================== FAILURES =====================
_____________________ TestCppExtensionOpenRgistration.test_open_device_registration _____________________
[...]
===================== short test summary info =====================
FAILED [0.0021s] test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration - AssertionError: RuntimeError not raised
!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!
===================== 1 failed, 2 rerun in 40.00s =====================
Got exit code 1
Retrying...
===================== test session starts =====================
[...]
===================== 1 deselected in 0.02s =====================
The following tests failed consistently: ['test/test_cpp_extensions_open_device_registration.py::TestCppExtensionOpenRgistration::test_open_device_registration']
test_cpp_extensions_open_device_registration 1/1 failed!
Running test_cuda 1/1 ... [2024-05-13 16:51:10.730579]
  1. I don't see how we could reasonably parse this.
  2. It exits after the first failed test, which means even "1 failed, 2 rerun in 40.00s" only says: "1 test out of an unknown number of tests failed".
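For reference, a rough sketch of the kind of summary-line parsing at stake (the regex is an assumption modeled on the pytest footers quoted above, not EasyBuild's actual parser):

```python
import re

# Matches pytest footers like
# "===== 2 failed, 128 passed, 2 skipped, 2 warnings in 3.43s ====="
SUMMARY_RE = re.compile(
    r"=+ (?:(?P<failed>\d+) failed)?(?:, )?(?:(?P<passed>\d+) passed)?"
    r".* in [\d.]+s =+"
)

def parse_summary(line):
    """Return failed/passed counts from a pytest summary line, or None.
    With pytest-rerunfailures and -x active, a "1 failed, 2 rerun"
    footer still parses, but says nothing about how many tests never ran."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    return {k: int(v) if v else 0 for k, v in m.groupdict().items()}
```

This illustrates the second point above: the numbers are syntactically recoverable, yet semantically incomplete once the run stops at the first failure.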

@akesandgren (Contributor Author)

@boegelbot Please test @ jsc-zen3
EB_ARGS="--include-easyblocks-from-pr 3330"

@boegelbot (Collaborator)

@akesandgren: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20489 EB_ARGS="--include-easyblocks-from-pr 3330" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20489 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4128

Test results coming soon (I hope)...

- notification for comment with ID 2112638268 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

Contributor (review comment):

optree requires typing-extensions/4.10.0-GCCcore-13.2.0

Contributor (review comment):

Why is that? I installed it just fine:

... python -m pip check completed successfully

Contributor (review comment):

Where are you getting typing-extensions from? It is not part of Python-3.11.5-GCCcore-13.2.0.eb, and the optree build fails without typing-extensions.

== installing...
== ... (took 29 secs)
== taking care of extensions...
== restore after iterating...
== postprocessing...
== sanity checking...
== ... (took 3 secs)
== FAILED: Installation ended unsuccessfully (build directory: /build/optree/0.11.0/GCCcore-13.2.0): build failed (first 300 chars): `/app/software/Python/3.11.5-GCCcore-13.2.0/bin/python -m pip check` failed:
optree 0.11.0 requires typing-extensions, which is not installed.

Contributor (review comment):

Seems like you need to reinstall Python. The current develop version and release 4.9.1 contain it:

('typing_extensions', '4.8.0', {
'checksums': ['df8e4339e9cb77357558cbdbceca33c303714cf861d1eef15e1070055ae8b7ef'],
}),

However, it was added between 4.8.2 and 4.9.x by #19777.

From the looks of that PR, the change was made because too many other easyconfigs depended on it. And IMO it makes sense to include it in Python by default.

Contributor (review comment):

Thanks; --rebuild --skip added four packages. This will fix many things for me.

== installing extension tomli 2.0.1 (1/4)...
==      configuring...
==      building...
==      testing...
==      installing...
==      ... (took 11 secs)
== installing extension packaging 23.2 (2/4)...
==      configuring...
==      building...
==      testing...
==      installing...
==      ... (took 2 secs)
== installing extension typing_extensions 4.8.0 (3/4)...
==      configuring...
==      building...
==      testing...
==      installing...
==      ... (took 2 secs)
== installing extension setuptools-scm 8.0.4 (4/4)...

@boegelbot (Collaborator)

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3330
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/50db8a2d24c3a8139108dc99a9001182 for a full test report.

@boegel added this to the 4.x milestone on May 22, 2024
@akesandgren (Contributor Author)

@Flamefire Any ideas on how to deal with the error output parsing problem?

@Flamefire (Contributor)

@Flamefire Any ideas on how to deal with the error output parsing problem?

Not many. I still have an open issue for that: pytorch/pytorch#126523

No luck so far getting machine-readable output from PyTorch directly, i.e. I wanted them to get the --save-xml option working correctly, but nothing yet after pytorch/pytorch#126690 failed.

We could try to get that option working by patching the test files to make sure --junit-xml-reruns and --save-xml are always set/passed. Then we can check whether the XML files are any good for us.

Another option would be to revert their change to the rerun feature (a custom implementation) that broke our detection: pytorch/pytorch@3b7d60b

That might get difficult to maintain going forward, but I don't see any current alternatives.
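If the --save-xml route works out, the report could be consumed along these lines (a sketch assuming standard JUnit-style XML as commonly emitted by pytest; the attribute names follow that common schema, not a verified PyTorch dump):

```python
import xml.etree.ElementTree as ET

def summarize_junit(xml_path):
    """Count tests and failures from a JUnit-style XML report.
    Handles both a bare <testsuite> root and a <testsuites> wrapper."""
    root = ET.parse(xml_path).getroot()
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    total = sum(int(s.get("tests", 0)) for s in suites)
    failed = sum(int(s.get("failures", 0)) + int(s.get("errors", 0))
                 for s in suites)
    return total, failed
```

Compared with scraping console output, the counts here are explicit attributes, so a rerun plugin cannot silently change their meaning.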

@jpecar (Contributor)

jpecar commented Nov 12, 2024

optree in this PR is missing git as a build dependency.

@boegel (Member)

boegel commented Dec 2, 2024

@Flamefire Any updates on this? I would really like to see this merged...

@boegel mentioned this pull request on Dec 2, 2024
@Flamefire (Contributor)

@Flamefire Any updates on this? I would really like to see this merged...

Currently working again on the test report generation/detection from PyTorch 2.2+

I finished 2.1.2 though for 2022a/b

6 participants