Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[5.0] refactor threading of snapshot_scheduler_test #1821

Merged
merged 1 commit into from
Oct 25, 2023

Conversation

spoonincode
Copy link
Member

In the process of investigating #1794 I got my build in to a state (via various compile options & compilers I guess) where snapshot_scheduler_test would consistently fail via

test_snapshot_scheduler.cpp(115): fatal error: in "producer_snapshot_scheduler_tests/snapshot_scheduler_test": critical check validate_snapshot_request(4, 12, 10, true) has failed

Upon closer look, some aspects of threading look invalid and racy in this test. After this refactor my troublesome build always passes. I'd like to consider this resolving #1794 until we know otherwise (we should know shortly 🙂)

In particular,
There is a race between the app->exec() and the chain_plug->chain().block_start.connect() -- blocks could be created before that connection is made and thus missed by the connection. I resolved this by moving the connection to occur before app->exec(). I believe this is where my local failure stumbled.

pp->schedule_snapshot() and pp->get_snapshot_requests() touch variables that are not guarded by any mutex. These should only be called by the "main thread" (in this case the app_thread) so wrap them in a .post() to the app.

I also replaced the effectively sleep(10) for a wait on block 20 (however long it takes).

Copy link
Member

@linh2931 linh2931 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great fix!

The failed to read response from monitor process failure now appears in another test: https://github.com/AntelopeIO/leap/actions/runs/6634407607/job/18023991204. I added some some debug logging in compile_monitor.cpp in oc_compile_monitor_debug branch, run CICD manually hoping to see some logs, but the logs I added did not show up. I thought the monitor process might have already exited before get_connection_to_compile_monitor was called.

I could not reproduce it locally.

@spoonincode
Copy link
Member Author

Not sure... I've been working under the premise that the error message was actually indicative of a crash since that's what we were thinking with #1736/#1766. But these tests we're seeing the error message on are of the few that reuse appbase in this manner... so maybe it's something with the way plugins are getting destroyed/recreated that's messing something up in OC's external process glue.

fwiw I've been manually hitting retry on this PR's jobs some this evening (slow going), and have a handful of runs that passed (no errors so far),
https://github.com/AntelopeIO/leap/actions/runs/6634069566/job/18023135214
https://github.com/AntelopeIO/leap/actions/runs/6634069566/job/18023135923
https://github.com/AntelopeIO/leap/actions/runs/6634069566/job/18023136456
https://github.com/AntelopeIO/leap/actions/runs/6634069566/job/18023136593
https://github.com/AntelopeIO/leap/actions/runs/6634069566/job/18023637759
https://github.com/AntelopeIO/leap/actions/runs/6634069566/job/18023755275
https://github.com/AntelopeIO/leap/actions/runs/6634069566/job/18023846713
https://github.com/AntelopeIO/leap/actions/runs/6634069566/job/18024167831
https://github.com/AntelopeIO/leap/actions/runs/6634069566/job/18024243576
https://github.com/AntelopeIO/leap/actions/runs/6634069566/job/18024819879

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test Failure: failed to read response from monitor process in snapshot_scheduler_test
3 participants