ND hang of SD unit tests on N300 device #7560

Closed
mtatsumiTT opened this issue Apr 17, 2024 · 47 comments
Labels: bug, ci-bug, didt_confirmed, P2, Stable Diffusion

Comments

@mtatsumiTT
Contributor

mtatsumiTT commented Apr 17, 2024

Running the SD unit tests with WH_ARCH_YAML set on N300 devices hangs non-deterministically (ND).

To repro the issue, check out the main branch and run the following on an N300 device:

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest tests/ttnn/integration_tests/stable_diffusion

EDIT:
Running the same test with watcher enabled in the fast-dispatch CI raises the std::runtime_error below in tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py (full log):

terminate called after throwing an instance of 'std::runtime_error'
  what():  Read 0xffffffff from ARC scratch[6]: auto-reset succeeded.
Fatal Python error: Aborted
Thread 0x00007f3744ff9700 (most recent call first):
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 306 in wait
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 558 in wait
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 890 in _bootstrap
Thread 0x00007f38db2c1740 (most recent call first):
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 410 in call_wrapper
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 616 in call_wrapper
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 693 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 306 in time_sharded_attention
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 471 in get_attention_scores_opt
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 706 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_basic_transformer_block.py", line 90 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_transformer_2d.py", line 298 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attn_upblock.py", line 153 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py", line 321 in test_cross_attn_up_block_2d_512x512
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 1789 in runtest
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 260 in <lambda>
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 339 in from_call
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 220 in call_and_report
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 324 in _main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 167 in main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 190 in console_main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/bin/pytest", line 8 in <module>

fyi @AleksKnezevic @vtangTT @TT-billteng

@TT-billteng
Collaborator

TT-billteng commented Apr 18, 2024

hanging on N150 in main post-commit

https://github.com/tenstorrent/tt-metal/actions/runs/8728371259/job/23948244160
https://github.com/tenstorrent/tt-metal/actions/runs/8739120961/job/23980015228

seems to be specifically this test

tests/ttnn/unit_tests/test_sd_e2e.py::test_unet_2d_condition_model_512x512[batch_size=2-in_channels=4-input_height=64-input_width=64]

@jliangTT

some discussions are happening over here - https://tenstorrent.slack.com/archives/C055REZR6Q3/p1713992657339109

@mtatsumiTT
Contributor Author

Quick update: some of the unit tests were raising OOM and allocator errors, but the e2e test was passing. I'll skip the tests with OOM errors and launch a pipeline after rebasing to the latest main to double-check that all SD unit tests pass.

@jliangTT

jliangTT commented May 1, 2024

Next step: Please try to repro it on the latest FD2/main branch.

@jliangTT

jliangTT commented May 1, 2024

Next step:

  • debugging/repro with watcher

@AleksKnezevic
Contributor

I have been able to reproduce the hang three times running watcher without NOC sanitization: twice on the same op, once on a different one. I also ran ~5 times without a hang, so it is still ND. The different op is in the same submodule (the one @mtatsumiTT identified as problematic), so it is probably related.

@AleksKnezevic
Contributor

AleksKnezevic commented May 2, 2024

To repro, run WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml TT_METAL_WATCHER=1 TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1 pytest --count=100 -svv tests/ttnn/integration_tests/stable_diffusion/test_unet_2d_condition_model.py -k 512 on aknezevic/hang_debug
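
Because the hang is non-deterministic, it can help to wrap a repro command like the one above in a loop with a per-run timeout, so that a hung run is recorded rather than blocking the session. The following is a minimal sketch, not part of tt-metal: the function name `run_with_hang_detection`, the helper `build_env`, and the timeout value are assumptions made for illustration, and a run is counted as a hang purely because it exceeded the timeout.

```python
import os
import subprocess


def build_env(**overrides):
    """Return a copy of the current environment with extra variables set."""
    env = dict(os.environ)
    env.update(overrides)
    return env


def run_with_hang_detection(cmd, iterations, timeout_s, env=None):
    """Run `cmd` repeatedly and classify each run as pass, fail, or hang.

    A run counts as a "hang" only in the sense that it exceeded `timeout_s`.
    """
    counts = {"pass": 0, "fail": 0, "hang": 0}
    for _ in range(iterations):
        try:
            result = subprocess.run(cmd, env=env, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before re-raising.
            counts["hang"] += 1
            continue
        counts["pass" if result.returncode == 0 else "fail"] += 1
    return counts


# Hypothetical usage with the variables and test path quoted in this thread
# (the 1800 s timeout is an arbitrary placeholder):
# env = build_env(
#     WH_ARCH_YAML="wormhole_b0_80_arch_eth_dispatch.yaml",
#     TT_METAL_WATCHER="1",
#     TT_METAL_WATCHER_DISABLE_NOC_SANITIZE="1",
# )
# cmd = ["pytest", "-svv",
#        "tests/ttnn/integration_tests/stable_diffusion/test_unet_2d_condition_model.py",
#        "-k", "512"]
# print(run_with_hang_detection(cmd, iterations=20, timeout_s=1800, env=env))
```

A device that has hung may need a reset before the next iteration produces a meaningful result; that step is outside this sketch.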

@AleksKnezevic
Contributor

hang_debug.txt

@jvasilje
Collaborator

jvasilje commented May 2, 2024

Sounds like we should try to repro on the submodule as the next step. Then we will have a smaller test, with fewer ops, to debug in detail.

@s-jovic
Contributor

s-jovic commented Jul 22, 2024

Is the previous di/dt issue exposed more by host constraints?

It shouldn't be, but it does depend on the chip: N300 or N150 chips aren't identical, and some expose hangs more often than others. I was using a chip that has been shown to reproduce many of the hangs we already know about, but that doesn't mean it can expose every di/dt hang that can happen.

And how many iterations did you try?

I ran 15 iterations, the demo takes a bit longer.

@mywoodstock
Contributor

Might be good to test the full demo with 50 iters?
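
How much a clean run of 15 or 50 iterations says about an ND hang depends on the per-iteration hang rate, which is not known here. As a rough sketch, assume each iteration hangs independently with the same probability p (a simplifying assumption; real hangs may depend on chip, temperature, and firmware). The probability of seeing no hang in n iterations is then (1 - p)^n. The function names and the example rate of 5% below are illustrative only, not measurements from this issue.

```python
import math


def prob_no_hang(p, n):
    """Probability that n independent iterations all pass, given hang rate p."""
    return (1.0 - p) ** n


def runs_needed(p, confidence=0.95):
    """Smallest n for which at least one hang appears with the given confidence."""
    if not 0.0 < p < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


# Illustrative hang rate of 5% per iteration (an assumed number):
for n in (15, 50):
    print(n, round(prob_no_hang(0.05, n), 3))
print(runs_needed(0.05))
```

Under that assumed 5% rate, 15 clean iterations would still happen by chance a little under half the time, which is one way to see why a longer run gives a more convincing result.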

@tt-aho
Contributor

tt-aho commented Jul 23, 2024

I've isolated the specific change that is causing the hang, though still unknown why it's causing it.

This commit changed which FD core is used for what. Previously, eth core (0, 4) was used for the dispatcher and (0, 5) for the prefetcher; after the change, (0, 4) is the prefetcher and (0, 5) is the dispatcher. This should make no real difference in functionality, and other tests/models still work, so something weird is happening as a result (potentially a timing/race issue).
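
The role swap described above can be written out as data to make the point explicit: the same two roles are still assigned to the same two cores, only exchanged. This is an illustrative sketch; the dictionaries and function names are hypothetical and do not mirror tt-metal's actual data structures.

```python
# Eth core coordinates and roles as described in the comment above.
# The dict layout is an assumption made for illustration.
BEFORE = {(0, 4): "dispatcher", (0, 5): "prefetcher"}
AFTER = {(0, 4): "prefetcher", (0, 5): "dispatcher"}


def changed_roles(before, after):
    """Map each core whose role changed to its (old role, new role) pair."""
    return {
        core: (before[core], after.get(core))
        for core in sorted(before)
        if before[core] != after.get(core)
    }


def same_role_set(before, after):
    """True if both assignments use exactly the same roles, ignoring placement."""
    return sorted(before.values()) == sorted(after.values())


print(changed_roles(BEFORE, AFTER))
print(same_role_set(BEFORE, AFTER))
```

Since the set of roles is unchanged and only their placement differs, a purely functional difference would be unexpected, which is consistent with the suspicion of a timing or race issue.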

@tt-aho
Contributor

tt-aho commented Jul 30, 2024

I have a fix for this in PR #10911. I didn't enable the test in CI, though. Will you enable it after doing di/dt testing on latest main, @s-jovic?

@s-jovic
Contributor

s-jovic commented Jul 31, 2024

After the fix is merged, I guess it should be checked whether the ND hangs still exist with the latest fw. However, we won't be doing these checks and removing workarounds for all models; model owners will need to check whether they can remove the workarounds from their models. We will let everybody know once the software fix and fw testing are done, so that model owners can address this.

@tt-aho
Contributor

tt-aho commented Aug 2, 2024

The fix is on main, so the SD tests can now be retested and re-enabled.

@tt-rkim
Collaborator

tt-rkim commented Aug 6, 2024

Should we re-enable the unstable tests?

@esmalTT
Contributor

esmalTT commented Aug 7, 2024

Assigning myself to this since I'm taking ownership of SD for now. I'll test out @tt-aho's fix to see if we can re-enable these tests on CI.

@esmalTT
Contributor

esmalTT commented Aug 12, 2024

Tests were re-enabled in 9492740 and no longer hang on N300 due to @tt-aho's fix. From my testing, we still require SLOW_MATMULS=1 to avoid hanging.

Should we close this or keep it open until the di/dt issues are completely resolved?

There is another spurious failure (see here) but I will track that in a separate issue.

@tt-rkim
Collaborator

tt-rkim commented Aug 13, 2024

Unless others have objections, I think we can close this specific issue.

Should we make a follow up to eventually get rid of SLOW_MATMULS from the stack once we no longer support WH?

@esmalTT
Contributor

esmalTT commented Aug 13, 2024

Unless others have objections, I think we can close this specific issue.

Should we make a follow up to eventually get rid of SLOW_MATMULS from the stack once we no longer support WH?

Yes, I'll create an issue to track it 👍 Unless someone says otherwise, I will close this once I create the 2 follow-up issues.
