Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ND behavior in test_matmul.py::test_sd_matmul #7126

Closed
TT-billteng opened this issue Apr 4, 2024 · 13 comments
Closed

ND behavior in test_matmul.py::test_sd_matmul #7126

TT-billteng opened this issue Apr 4, 2024 · 13 comments
Assignees
Labels
bug Something isn't working ci-bug bugs found in CI op_cat: mm P1

Comments

@TT-billteng
Copy link
Collaborator

TT-billteng commented Apr 4, 2024

pytest tests/ttnn/unit_tests/operations/test_matmul.py::test_sd_matmul

This appears only on N150, but I haven't been able to repro locally yet though

This is not failing on a specific VM or BM.

Some failing runs:

https://github.com/tenstorrent-metal/tt-metal/actions/runs/8558424886/job/23453740581
https://github.com/tenstorrent-metal/tt-metal/actions/runs/8558853752/job/23454745853
https://github.com/tenstorrent-metal/tt-metal/actions/runs/8545275089/job/23414372279
https://github.com/tenstorrent-metal/tt-metal/actions/runs/8542369244/job/23403999132
https://github.com/tenstorrent-metal/tt-metal/actions/runs/8559093722/job/23455622019

@TT-billteng
Copy link
Collaborator Author

TT-billteng commented Apr 4, 2024

not test_sd_matmul specifically, but this is a deterministic failure and may be related:

pip install pytest-repeat
pytest --count=2 tests/ttnn/unit_tests/operations/test_matmul.py -xv

PASSED tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_with_matched_width_height_from_1D[k_size=4-n_size=4-2-2]
PASSED tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_with_matched_width_height_4D[n_size=1-c=1-h=2-w=4-1-2]
SKIPPED [4] tests/ttnn/unit_tests/operations/test_matmul.py:70: ttnn.reshape doesn't support reshaping the input tensors used in this test
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_with_matched_width_height_4D[n_size=1-c=1-h=2-w=4-2-2] - AssertionError: 0.9996211303338057

@cfjchu
Copy link
Collaborator

cfjchu commented Apr 4, 2024

not test_sd_matmul specifically, but this is a deterministic failure and may be related:

pip install pytest-repeat pytest --count=2 tests/ttnn/unit_tests/operations/test_matmul.py -xv

PASSED tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_with_matched_width_height_from_1D[k_size=4-n_size=4-2-2]
PASSED tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_with_matched_width_height_4D[n_size=1-c=1-h=2-w=4-1-2]
SKIPPED [4] tests/ttnn/unit_tests/operations/test_matmul.py:70: ttnn.reshape doesn't support reshaping the input tensors used in this test
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_with_matched_width_height_4D[n_size=1-c=1-h=2-w=4-2-2] - AssertionError: 0.9996211303338057

@TT-billteng that just looks like a PCC thresholding assertion maybe due to different seeds across runs?

@TT-billteng
Copy link
Collaborator Author

not test_sd_matmul specifically, but this is a deterministic failure and may be related:
pip install pytest-repeat pytest --count=2 tests/ttnn/unit_tests/operations/test_matmul.py -xv

PASSED tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_with_matched_width_height_from_1D[k_size=4-n_size=4-2-2]
PASSED tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_with_matched_width_height_4D[n_size=1-c=1-h=2-w=4-1-2]
SKIPPED [4] tests/ttnn/unit_tests/operations/test_matmul.py:70: ttnn.reshape doesn't support reshaping the input tensors used in this test
FAILED tests/ttnn/unit_tests/operations/test_matmul.py::test_matmul_with_matched_width_height_4D[n_size=1-c=1-h=2-w=4-2-2] - AssertionError: 0.9996211303338057

@TT-billteng that just looks like a PCC thresholding assertion maybe due to different seeds across runs?

Ah yes that's true, running again with lower threshold

I feel like we should use a "standard" globally-agreed upon PCC threshold to use for testing, or is this too much of an ask?

@cfjchu
Copy link
Collaborator

cfjchu commented Apr 4, 2024

I don't think that's feasible because there could be variability based on:

  • operation type
  • distribution of floating point values based on seed
  • stacking operations
  • data formats relative to torch.float32/torch.bfloat16

@TT-billteng
Copy link
Collaborator Author

can we disable? I still see this failing on main

@eyonland
Copy link
Contributor

eyonland commented Apr 8, 2024

I would rather not disable this test. @TT-BrianLiu , do we need to lower the PCC for this matmul test when running on WH or is this something bigger?

@TT-BrianLiu
Copy link
Contributor

Was it not failing before? Otherwise, 0.999 is pretty reasonable pcc for a matmul

@bbradelTT
Copy link
Contributor

@TT-billteng

  1. Re: I feel like we should use a "standard" globally-agreed upon PCC threshold to use for testing, or is this too much of an ask?
    Too big of an ask. Model owners use PCC thresholds as close as possible to existing values to notice any changes.

Having said that, the PCCs in test_matmul.py should be updated so that anything above .999 or with too many digits would be changed.

  1. test_matmul.py has many tests with skip_for_wormhole_b0. If we enable the tests again, I'm worried about All commit runtime increasing. When updating the PCCs, would you prefer to
  • just enable test_sd_matmul
  • enable all the tests
  • enable some subset of tests, in which case, what would be the criteria?

@prajaramanTT
Copy link

@TT-billteng @bbradelTT Can we close this issue ?

@bbradelTT
Copy link
Contributor

@prajaramanTT We can't.

@prajaramanTT
Copy link

@bbradelTT Do we have any updates on this ?

@bbradelTT
Copy link
Contributor

I just looked into this.

On WH N150 and BH the tests pass. They are skipped on N300 since the grid is too small.

I'll create a PR to re-enable the tests.

bbradelTT added a commit that referenced this issue Jan 15, 2025
### Ticket
Link to Github Issue #7126

### Problem description
A test was failing and was skipped

### What's changed
After various issues were fixed over time the test now passes. Therefore
enable it again.

### Checklist
- [x] Post commit CI passes
https://github.com/tenstorrent/tt-metal/actions/runs/12776412082
- [ ] Blackhole Post commit (if applicable) Too many issues, but test
passed locally.
- [ ] Model regression CI testing passes (if applicable) N/A
- [ ] Device performance regression CI testing passes (if applicable)
N/A
- [ ] **(For models and ops writers)** Full [new
models](https://github.com/tenstorrent/tt-metal/actions/workflows/full-new-models-suite.yaml)
tests passes
- [ ] New/Existing tests provide coverage for changes
@bbradelTT
Copy link
Contributor

All post commit passed in main after the merge. Checked subsequent runs as well, and there are o failures in all post commit related to GS.

Closing.

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in External Requests and Reports Jan 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ci-bug bugs found in CI op_cat: mm P1
Projects
None yet
Development

No branches or pull requests

9 participants