Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with low accuracy exp and expm1 ops in tt-mlir #1199

Open
ajakovljevicTT opened this issue Nov 8, 2024 · 4 comments
Open

Issues with low accuracy exp and expm1 ops in tt-mlir #1199

ajakovljevicTT opened this issue Nov 8, 2024 · 4 comments

Comments

@ajakovljevicTT
Copy link
Contributor

During the testing of the expm1 op, I noticed major discrepancies between the results of the exp and expm1 ops ran on CPU and Tenstorrent silicon. This was the cause for widening the atol tolerance in the test_exp_op test in the tt-xla repo (https://github.com/tenstorrent/tt-xla/blob/main/tests/TTIR/test_basic_ops.py).

Running some examples for exp op, I get:
e^1: real value - 2.7182; tt-mlir: 2.6864; tt-metal(ttnn): 2.6719
e^2: real value - 7.3890; tt-mlir: 7.2165; tt-metal(ttnn): 7.1875

Small differences between tt-mlir and tt-metal could be attributed to the differences in environments (as I ran ttnn with python, and tt-mlir with ttrt). However, they both vary significantly from the real value. This leads to some of our tests needing relatively large margins of error, and could pose future problems as we onboards models to MLIR.

@sdjordjevicTT
Copy link
Contributor

sdjordjevicTT commented Nov 8, 2024

What is the intention of this issue, to track it on our side? Should we open an issue on the tt-metal side for investigation?

@ajakovljevicTT
Copy link
Contributor Author

ajakovljevicTT commented Nov 8, 2024

What is the intention of this issue, to track it on our side? Should we open an issue on the tt-metal side for investigation?

The idea is just to track the issue on our side for now, and if accuracy becomes a problem in any model bringup, reference this issue and open one on the tt-metal side for investigation. It also helps folks with e2e tests on our side, as for example, I had tests failing for accuracy, which prompted me to do this little investigation of accuracy.

It is important to note that I did have a quick chat with tt-metal folks, and they are aware that they utilise a low-precision exp kernel. However, the original more precise one was slower, so they settled on this compromise. As we develop tt-mlir more, maybe this compromise should be revisited.

@sdjordjevicTT
Copy link
Contributor

What is the intention of this issue, to track it on our side? Should we open an issue on the tt-metal side for investigation?

The idea is just to track the issue on our side for now, and if accuracy becomes a problem in any model bringup, reference this issue and open one on the tt-metal side for investigation. It also helps folks with e2e tests on our side, as for example, I had tests failing for accuracy, which prompted me to do this little investigation of accuracy.

It is important to note that I did have a quick chat with tt-metal folks, and they are aware that they utilise a low-precision exp kernel. However, the original more precise one was slower, so they settled on this compromise. As we develop tt-mlir more, maybe this compromise should be revisited.

Thanks for clarifying this. As I understand well this is currently not an actionable issue on our side, hence I am not sure in what component to put this issue right now. Any suggestions?

@ajakovljevicTT
Copy link
Contributor Author

What is the intention of this issue, to track it on our side? Should we open an issue on the tt-metal side for investigation?

The idea is just to track the issue on our side for now, and if accuracy becomes a problem in any model bringup, reference this issue and open one on the tt-metal side for investigation. It also helps folks with e2e tests on our side, as for example, I had tests failing for accuracy, which prompted me to do this little investigation of accuracy.
It is important to note that I did have a quick chat with tt-metal folks, and they are aware that they utilise a low-precision exp kernel. However, the original more precise one was slower, so they settled on this compromise. As we develop tt-mlir more, maybe this compromise should be revisited.

Thanks for clarifying this. As I understand well this is currently not an actionable issue on our side, hence I am not sure in what component to put this issue right now. Any suggestions?

Yes, you are correct that this is not an actionable issue currently, not sure about where to put it. Maybe @mrakitaTT has some ideas about the place of this issue in the overall effort?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants