ND hang of SD unit tests on N300 device #7560
Comments
hanging on N150 in main post-commit https://github.com/tenstorrent/tt-metal/actions/runs/8728371259/job/23948244160 seems to be specifically this test
|
some discussions are happening over here - https://tenstorrent.slack.com/archives/C055REZR6Q3/p1713992657339109 |
Quick update: some of the unit tests were raising OOM and allocator errors, but the e2e test was passing. I'll skip the tests with OOM errors, then launch a pipeline after rebasing onto the latest main to double-check that all SD unit tests pass. |
Next step: Please try to repro it on the latest FD2/main branch. |
Next step:
|
I have been able to successfully reproduce the hang running watcher without NOC sanitization three times. Twice on the same op, once on a different one. I also ran without hangs ~5 times, so still ND. The different one is in the same submodule (the one @mtatsumiTT identified as problematic), so probably related. |
To repro, run |
Sounds like we should try to repro on the submodule as next step. Then we will have a smaller test to debug in detail. Less ops. |
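For an ND hang like this one, a repro loop with a per-run timeout makes it easier to count hangs separately from ordinary failures. The snippet below is only a rough sketch: `repeat_run`, its parameters, and the timeout values are hypothetical names chosen for illustration, not part of tt-metal, and the actual test command and any watcher-related environment variables are left for the caller to supply.

```python
import os
import subprocess
import sys
from collections import Counter


def repeat_run(cmd, iterations, timeout_s, extra_env=None):
    """Run `cmd` repeatedly and classify every run as pass, fail, or hang.

    A run counts as a hang when it exceeds `timeout_s` seconds.
    `extra_env` is merged into the current environment (hypothetical hook
    for whatever debug settings the caller wants enabled).
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    outcomes = Counter()
    for _ in range(iterations):
        try:
            proc = subprocess.run(
                cmd,
                env=env,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            outcomes["pass" if proc.returncode == 0 else "fail"] += 1
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising, so the next
            # iteration starts from a fresh process.
            outcomes["hang"] += 1
    return outcomes


# Example call (assumes pytest is installed and the path exists locally):
# repeat_run(
#     [sys.executable, "-m", "pytest",
#      "tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py"],
#     iterations=15,
#     timeout_s=600,
# )
```

Note that the timeout only kills the direct child process. After a real device hang, the board may still need a reset before the next iteration gives a meaningful result.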
Shouldn't be, but it does depend on the chips: N300 or N150 chips aren't identical, and some expose hangs more often than others. I was using a chip that has proven to expose many of the hang repros we already have, but that doesn't necessarily mean it can expose every di/dt hang that can happen.
I ran 15 iterations, the demo takes a bit longer. |
Might be good to test the full demo with 50 iters? |
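To get a feel for how many clean iterations are worth running, here is a small back-of-the-envelope sketch. It assumes each run hangs independently with the same probability, which is a simplification (the hang rate clearly varies per chip), and the function names are made up for illustration.

```python
import math


def prob_no_hang(p_hang, runs):
    """Probability that `runs` independent runs all complete without a hang."""
    return (1.0 - p_hang) ** runs


def runs_needed(p_hang, confidence):
    """Smallest number of runs that shows at least one hang with the given
    probability, assuming a fixed per-run hang probability `p_hang`."""
    if not (0.0 < p_hang < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p_hang and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_hang))
```

Under these assumptions, if the true per-run hang rate were as low as 5%, 50 clean iterations would still leave roughly an 8% chance that the hang is present but simply did not show up, so a clean run count is evidence rather than proof.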
I've isolated the specific change that is causing the hang, though it is still unknown why it causes it. This commit changed which FD core is used for what: eth core (0, 4) was previously used for the dispatcher and (0, 5) for the prefetcher, and now (0, 4) is the prefetcher and (0, 5) is the dispatcher. This should make no real difference in functionality, and other tests/models still work, so something weird is happening as a result of the swap (potentially a timing/race issue). |
After the fix is merged, I guess it should be checked whether the ND hangs still exist with the latest fw. However, we won't be doing these checks and removing workarounds for all models; model owners will need to check whether they can remove the workarounds from their models. We will let everybody know once the software fix and fw testing are done, so that model owners can address this. |
The fix is on main. So retesting/re-enabling of SD tests can be done |
Should we re-enable the unstable tests? |
Assigning myself to this since I'm taking ownership of SD for now. I'll test out @tt-aho's fix to see if we can re-enable these tests on CI. |
Tests were re-enabled in 9492740 and no longer hang on N300 due to @tt-aho's fix. From my testing, we still require Should we close this or keep it open until the di/dt issues are completely resolved? There is another spurious failure (see here) but I will track that in a separate issue. |
Unless others have objections, I think we can close this specific issue. Should we make a follow up to eventually get rid of |
Yes, I'll create an issue to track it 👍 Unless someone says otherwise, I will close this once I create the 2 follow-on issues. |
Running SD unit tests with WH_ARCH_YAML on N300 devices non-deterministically hangs.
To repro the issue, switch to the main branch and run the following on an N300 device:
EDIT:
Running the same test with watcher enabled in the fast-dispatch CI raises the std::runtime_error below on tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py (full log):
fyi @AleksKnezevic @vtangTT @TT-billteng