[TEST][FLAKY] test_arm_compute_lib #8417

Closed
tqchen opened this issue Jul 7, 2021 · 12 comments · Fixed by #8573

tqchen (Member) commented Jul 7, 2021

https://ci.tlcpack.ai/job/tvm/job/main/1262/execution/node/352/log/

tqchen (Member, Author) commented Jul 7, 2021

cc @leandron @masahi @areusch, perhaps we should consider temporarily disabling the test.

tqchen (Member, Author) commented Jul 7, 2021

#8400

u99127 (Contributor) commented Jul 7, 2021

Perhaps this is a flaky CI machine causing the occasional failures, not too dissimilar to how certain other tests fail in ci-i386?

tqchen (Member, Author) commented Jul 7, 2021

I am not too sure about the cause, but in this case the failures seem to be more frequent than the i386 ones.

u99127 (Contributor) commented Jul 7, 2021

Note that the CI for AArch64 is using images built from source. There is a docker image update for ci_arm that switches to pre-built binaries; it has been due for many weeks and we have been waiting on it. I haven't seen any of these failures in the many times I've run this with the latest Dockerfile.ci_arm image that I baked locally. Perhaps we can move that update ahead, see how it runs in CI, and check whether the flakiness persists?

tqchen (Member, Author) commented Jul 7, 2021

Yeah, let us do that. However, the CI Jenkinsfile only takes effect on main, so a good first step is to clean up the compute_lib.sh script to unblock others, and then upgrade the image. @leandron should be able to push to the ci-docker-staging branch to test out the new image.
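
For illustration, the cleanup could be as small as short-circuiting the CI task script so the stage becomes a no-op (a sketch only, assuming the script meant here is tests/scripts/task_python_arm_compute_library.sh; the real script's contents, and what #8400 actually changed, may differ):

#!/usr/bin/env bash
# tests/scripts/task_python_arm_compute_library.sh (illustrative contents)
set -euxo pipefail

# Temporarily skip the flaky Arm Compute Library tests while the root cause
# is investigated; see https://github.com/apache/tvm/issues/8417.
echo "test_arm_compute_lib is temporarily disabled, see issue #8417"
exit 0

# Hypothetical original invocation, kept commented out for easy re-enabling:
# python3 -m pytest tests/python/contrib/test_arm_compute_lib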

u99127 (Contributor) commented Jul 7, 2021

I think that's already been done by @areusch and @mbrookhart as part of their work updating the CI images in #8177.

u99127 (Contributor) commented Jul 7, 2021

I've updated #8400 as suggested until we figure this out. Is there a way to access the CI machine to debug interactively what is going on? I am unable to reproduce the issue at all on my end, having tried it on quite a few AArch64 Linux boxes I control.

areusch (Contributor) commented Jul 7, 2021

@u99127 do you have either a) a Packer build flow for the CI ARM machine, or b) a suggested AMI or recipe for building the CI machine? My understanding is that the ARM machines use an image we built in-house, and it would be great to just document the build process.

u99127 (Contributor) commented Jul 7, 2021

> @u99127 do you have either a) a Packer build flow for the CI ARM machine, or b) a suggested AMI or recipe for building the CI machine? My understanding is that the ARM machines use an image we built in-house, and it would be great to just document the build process.

On the machines I have access to, I'm using bog-standard Ubuntu 18.04 plus the docker image baked from the Docker scripts.
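
For concreteness, the local reproduction looks roughly like this (a sketch assuming TVM's in-tree docker helper scripts; the exact image tag may differ from what docker/build.sh prints):

# Build the ci_arm image locally from docker/Dockerfile.ci_arm
./docker/build.sh ci_arm

# Run the Arm Compute Library test script inside the freshly built image
# (the tvm.ci_arm tag is an assumption; use whatever tag build.sh reports)
./docker/bash.sh tvm.ci_arm ./tests/scripts/task_python_arm_compute_library.sh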

On your query about accessing the machine, possibly @zhiics might be able to help?

Ramana

u99127 (Contributor) commented Jul 27, 2021

OK, I've tried this quite a few times this evening after acquiring access to an m6g.4xlarge instance (not sure if this is the same as the CI machine):

  • Running with image version v0.05, I could make the test trigger and fall over at least 2-3 times.
  • Running with image version v0.06, at least 25 runs in a loop across 2 different build areas gave no failures, using:

for i in {1..25} ; do ./tests/scripts/task_python_arm_compute_library.sh ; done

(obviously I had re-enabled the local testing in my tree)
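
For anyone repeating this, a small variant of the loop that stops on the first failure makes any flakiness easier to spot (same script, just with an explicit exit check):

for i in {1..25}; do
  # Stop as soon as one iteration fails and report which one broke
  if ! ./tests/scripts/task_python_arm_compute_library.sh; then
    echo "failed on iteration $i"
    exit 1
  fi
done
echo "all 25 iterations passed"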

Now that the ci_arm image has been updated, I think we should try to re-enable this testing and see how it goes.

Thoughts, @areusch?

Ramana

areusch (Contributor) commented Jul 27, 2021

I agree, let us re-enable the tests and see how they do in CI with v0.06.

u99127 pushed a commit to u99127/tvm that referenced this issue Jul 28, 2021
ci image v0.06 does not appear to have the flakiness shown in ci image v0.05.
What changed between the two remains a mystery and needs further debugging,
but for now re-enable this test to see how it fares in CI.

Fixes apache#8417
jcf94 pushed a commit that referenced this issue Jul 29, 2021 (same commit message; Fixes #8417)
ylc pushed a commit to ylc/tvm that referenced this issue Sep 29, 2021 (same commit message)
ylc pushed a commit to ylc/tvm that referenced this issue Jan 13, 2022 (same commit message)