Flaky tests on Linux-aarch64 build #368

Closed
jacobperron opened this issue Dec 21, 2018 · 18 comments
Labels
bug Something isn't working

Comments

@jacobperron
Member

E.g. https://ci.ros2.org/job/ci_linux-aarch64/2467/

Seems that repeating tests is causing something to crash and not report test results.

Looks like this issue was introduced by #297.

jacobperron added the bug label Dec 21, 2018
@wjwwood
Member

wjwwood commented Dec 21, 2018

Dang, I guess the CI was old, but I figured that nothing else had really changed on rviz... I'll revert it and look into the issue.

@wjwwood
Member

wjwwood commented Dec 21, 2018

Reverted in #369

@jacobperron
Member Author

jacobperron commented Dec 21, 2018

> Dang, I guess the CI was old, but I figured that nothing else had really changed on rviz... I'll revert it and look into the issue.

@wjwwood To clarify, I believe CI passes with the build flag --retest-until-pass 10. But our nightly jobs run --retest-until-fail 10, which is where the failures were originally noticed. So our regular CI wouldn't have caught this.
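For context, a rough sketch of how the two repetition modes differ when run locally, assuming colcon is the test runner behind these CI options and using rviz_rendering only as an example package:

# Retest-until-pass: a flaky failure is retried up to 10 times and can still end up green.
colcon test --packages-select rviz_rendering --retest-until-pass 10

# Retest-until-fail (what the nightly repeated job uses): a passing test is re-run up to
# 10 times, so an intermittent crash is much more likely to surface.
colcon test --packages-select rviz_rendering --retest-until-fail 10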

@cottsay
Member

cottsay commented Dec 27, 2018

Is there anything else to be done here, or did #369 solve the issue?

@wjwwood
Member

wjwwood commented Jan 2, 2019

I was leaving this open until the revert could be undone, but we could close it if desired, since this issue only covers fixing the CI, not also fixing the originating change and getting it re-merged.

@Martin-Idel
Contributor

The revert will definitely not have changed anything. The commit only introduced changes in rviz_default_plugins, while the test failures start with tests in rviz_rendering (e.g. point_cloud_renderable_test_target), which is built before rviz_default_plugins.

I recall that we had the same test failures in the nightlies when we enabled the display tests on Linux systems back in July, and I don't remember them ever having been resolved. I tried to debug the issue once but couldn't reproduce it; it seemed to be Ogre not coming up successfully...

@jacobperron
Member Author

There has been a slew of test failures in the nightlies over the past few days for rviz_default_plugins and rviz_rendering: https://ci.ros2.org/view/nightly/job/nightly_linux-aarch64_repeated/655/
The symptom is missing test results.
@Martin-Idel Does this sound like the same issue you experienced in the past?

@wjwwood
Member

wjwwood commented Jan 7, 2019

I was working under the assumption that @jacobperron saw these as new flaky tests, which would mean this was true:

> Looks like this issue was introduced by #297.

If the flakiness was preexisting, then we could undo the revert and address the flaky tests separately. I didn't investigate deeply; I was just trying to keep CI clean leading up to the release.

@andreasholzner

I am sure that the flaky aarch64 tests are not related to #297, as we have also seen them in the past. We had no direct access to aarch64 machines and thus could not debug them, since these problems do not show up on amd64.

@clalancette
Contributor

@andreasholzner If we can get you access to an aarch64 machine (probably in AWS), would you or @Martin-Idel have time to look into this?

@wjwwood
Member

wjwwood commented Jan 14, 2019

So it seems like the changes in #297 were not the cause of this? Does it just make it more flaky? I'm about to do the release for Crystal patch 1 and need to resolve this...

@jacobperron
Member Author

> So it seems like the changes in #297 were not the cause of this? Does it just make it more flaky? I'm about to do the release for Crystal patch 1 and need to resolve this

It does seem that they are not the cause. I wasn't aware of the previous failures when I originally reported this. I can't say that it makes the tests more flaky. I think I opened this issue after seeing the failures for two or three days. I'd say it would be okay to add the change back.

@jacobperron
Member Author

Here's the (original?) issue I missed: ros2/build_farmer#144

@andreasholzner

I could (almost) reproduce the test failures; however, the console output is a little different.

Output seen while trying to reproduce

Running main() from gmock_main.cc
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from PointCloudRenderableTestFixture
[rviz_rendering:debug] Available Renderers(1): OpenGL Rendering Subsystem, at /home/ubuntu/rviz_ws/src/rviz/rviz_rendering/src/rviz_rendering/render_system.cpp:273

The line starting with [rviz_rendering:debug] is not present in the failing CI builds. In both cases the tests fail with a segfault.

I could trace the segfault to the call to Ogre::Root::createRenderWindow.

window = ogre_root_->createRenderWindow(name, width, height, false, params);

The call stack is not helpful.

(gdb) bt
#0 0x0000ffffbc77e670 in ?? () from /usr/lib/aarch64-linux-gnu/libGLX.so.0
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I am out of ideas. Maybe an Ogre upgrade could help, but that would require some work. In a quick attempt to use Ogre 1.11.5, the rviz_ogre_vendor package did not compile.
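For anyone retracing these steps, a hedged sketch of one way to reproduce the crash under gdb; the build layout and the test binary path below are assumptions based on the target name mentioned earlier, not taken from the CI configuration:

# Build with debug info so the backtrace shows more than "?? ()" inside libGLX.
colcon build --packages-up-to rviz_rendering --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo

# Run the rendering test binary under gdb (exact path assumed) and capture a backtrace.
gdb --args ./build/rviz_rendering/point_cloud_renderable_test_target
# (gdb) run
# (gdb) bt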

@dirk-thomas
Member

Is there any chance that these test failures will be addressed in the near future? If not, I would propose excluding the affected packages (rviz_common, rviz_default_plugins, rviz_rendering, rviz_rendering_tests) from the nightly aarch64 repeated job.
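As one illustration only (the actual nightly job configuration is not shown in this thread), excluding those packages from a repeated colcon test run could look like this:

# Hypothetical exclusion of the affected packages from a retest-until-fail run;
# the real nightly job may pass an equivalent option through its own job parameters.
colcon test --retest-until-fail 10 --packages-skip rviz_common rviz_default_plugins rviz_rendering rviz_rendering_tests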

@andreasholzner

I don't see how, so I am in favor of excluding the packages from the nightly job.
After an Ogre upgrade they could be enabled again.

@Martin-Idel
Contributor

This should have been solved by #394 (at least that's what my testing there showed). Are the tests re-enabled? Could somebody try it again?

@clalancette
Contributor

I just ran a repeated job with these tests re-enabled: https://ci.ros2.org/job/ci_linux-aarch64/9301/. While there are test failures, none of them are in rviz-related components, so this is probably fixed. I'm going to close this out and open a PR to re-enable these tests on the nightlies.
