[AutoScheduler] Bug fix for layout rewrite CI error in i386 #6830

Merged
merged 13 commits into from
Nov 4, 2020

jcf94
Contributor

@jcf94 jcf94 commented Nov 3, 2020

No description provided.

@jcf94
Contributor Author

jcf94 commented Nov 4, 2020

Problem

The computation produces wrong results after layout rewrite, and this only occurs in the i386 CI.

Current Status

After trying many different tests, I think I have finally found the reason: only the i386 CI uses llvm-4 to build TVM.

This test sets the random seed to a fixed number so that AutoScheduler always generates the same schedule.
i386 CI with llvm-4: https://ci.tlcpack.ai/blue/rest/organizations/jenkins/pipelines/tvm/branches/PR-6830/runs/4/nodes/253/steps/331/log/?start=0

Same schedule in i386 CI with llvm-8: https://ci.tlcpack.ai/blue/rest/organizations/jenkins/pipelines/tvm/branches/PR-6830/runs/7/nodes/253/steps/331/log/?start=0

The lowered TVM result is exactly the same in both runs, so I think the only cause may be some bug in the LLVM codegen of llvm-4.

To fully confirm it, we may need to compare the LLVM IR from both versions. I'm trying llvm-4 in my local environment to see if this bug can be reproduced.

cc @merrymercy @comaniac @tqchen @masahi

@jcf94
Contributor Author

jcf94 commented Nov 4, 2020

This problem can be reproduced in my local environment with the ci-i386 docker image.

It seems that floating-point operations in a 32-bit environment tend to be less accurate than in a 64-bit one?

I've run more tests with different LLVM versions. Code generated with higher LLVM versions can still hit the accuracy problem, though with lower probability. In the x86_64 environment, all LLVM versions worked well even with atol and rtol set to 1e-7.

For now, the better fix may still be to set larger atol and rtol values.
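To illustrate the effect of the atol and rtol settings discussed above, here is a minimal sketch using NumPy's `assert_allclose`. It is not the test from this PR: the helper `results_match`, the simulated drift of 5e-6, and the relaxed tolerance of 1e-4 are all assumptions chosen for illustration, since the values finally used are not stated in this conversation.

```python
import numpy as np


def results_match(actual, expected, rtol, atol):
    """Hypothetical helper: True if actual is within tolerance of expected."""
    try:
        # Passes when |actual - expected| <= atol + rtol * |expected| elementwise
        np.testing.assert_allclose(actual, expected, rtol=rtol, atol=atol)
        return True
    except AssertionError:
        return False


# Assumed reference output, plus a copy with a small simulated relative drift
# standing in for the rounding differences seen on i386.
expected = np.arange(1, 9, dtype="float32")
actual = expected * np.float32(1 + 5e-6)

# The strict tolerance (1e-7) rejects the drifted result ...
strict = results_match(actual, expected, rtol=1e-7, atol=1e-7)
# ... while a relaxed tolerance (1e-4, an illustrative value) accepts it.
relaxed = results_match(actual, expected, rtol=1e-4, atol=1e-4)

print("strict check passed:", strict)
print("relaxed check passed:", relaxed)
```

A larger tolerance trades some sensitivity for stability across platforms, so it is probably worth keeping it only as large as the observed drift requires.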

@jcf94 jcf94 changed the title [WIP][AutoScheduler] Bug fix for layout rewrite CI error in i386 [AutoScheduler] Bug fix for layout rewrite CI error in i386 Nov 4, 2020
@jcf94 jcf94 marked this pull request as ready for review November 4, 2020 08:19
@tqchen tqchen merged commit b8761ed into apache:main Nov 4, 2020
@tqchen
Member

tqchen commented Nov 4, 2020

Thanks @jcf94 for the timely fix and in-depth analysis

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020
Successfully merging this pull request may close these issues.

[TEST][FLAKY] auto_scheduler layout rewrite tests
2 participants