
Distributed test suite: add an additional use of the poll_while function #42758

Closed

Conversation

@DilumAluthge (Member) commented on Oct 22, 2021:

Follow-up to #42499


This pull request is intended to fix the following test failure, which I am seeing very rarely in the tester_linux64_mt Buildkite job:

Distributed                         (1) |        started at 2021-10-22T10:10:08.433
Test Failed at /cache/build/amdci7-2/julialang/julia-master/julia-609a4a0a1e/share/julia/stdlib/v1.8/Distributed/test/distributed_exec.jl:270
  Expression: remotecall_fetch((k->begin
                haskey(Distributed.PGRP.refs, k)
            end), wid1, rrid) == true
   Evaluated: false == true
ERROR: LoadError: There was an error during testing
in expression starting at /cache/build/amdci7-2/julialang/julia-master/julia-609a4a0a1e/share/julia/stdlib/v1.8/Distributed/test/distributed_exec.jl:263
Distributed                         (1) |         failed at 2021-10-22T10:10:33.207

Full log: https://buildkite.com/julialang/julia-master/builds/5028#dfdf3a45-5510-402b-99d4-9daa860f0f09
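For context, the proposed fix replaces the one-shot assertion with a polling loop. The sketch below is a hypothetical re-implementation of the test suite's `poll_while` helper (introduced in #42499; the real signature in distributed_exec.jl may differ): it keeps calling `f` while it returns `true`, giving up after a timeout.

```julia
# Hypothetical sketch of a poll_while helper, assuming the semantics
# "poll while the condition holds, fail on timeout":
function poll_while(f::Function; timeout_seconds = 120)
    start = time()
    while f()
        sleep(0.1)
        # Give up once the timeout elapses while the condition still holds.
        (time() - start) > timeout_seconds && return false
    end
    return true
end

# The flaky one-shot assertion
#     @test remotecall_fetch(k -> haskey(Distributed.PGRP.refs, k), wid1, rrid)
# could then be rewritten to poll until the ref becomes visible,
# tolerating scheduling delays on the remote worker:
#     @test poll_while(() -> !remotecall_fetch(k -> haskey(Distributed.PGRP.refs, k), wid1, rrid))
```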

@DilumAluthge added the labels parallelism (Parallel or distributed computation), test (This change adds or pertains to unit tests), backport 1.6 (Change should be backported to release-1.6), and backport 1.7 on Oct 22, 2021
@DilumAluthge (Member, Author) commented:
I'm not sure if this change actually makes any sense.

@tkf (Member) left a comment:

If it's from the multi-threaded tests, I don't think this is unexpected. I think it'd be better to move tester_linux64_mt (or only Distributed's tests) to experimental, or to skip this test when nthreads() > 1.

I think moving the multithreaded tests for Distributed to experimental would be better than skipping them, since we could then do some post-hoc analysis on how flaky they have been.

@DilumAluthge (Member, Author) commented:

Hmmm. I definitely don't want to mark the entire tester_linux64_mt job as experimental, since all of the other test sets pass.

I guess we could make a separate job that specifically runs the Distributed job with multithreading enabled, and mark that job as experimental.

@DilumAluthge (Member, Author) commented:

> I guess we could make a separate job that specifically runs the Distributed job with multithreading enabled, and mark that job as experimental.

Hmmm. This doesn't fix the case where someone runs the tests locally with e.g. JULIA_NUM_THREADS=16 ./julia test/runtests.jl all.

If this test is known/expected to sometimes fail when Julia was started with multiple threads, I think I would prefer to go with your suggestion of skipping those specific tests if Threads.nthreads() > 1.
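A sketch of what such a guard might look like in the test file (the actual form eventually used may differ; the testset name here is illustrative):

```julia
using Test

# Hypothetical guard: run the race-prone checks only when Julia was
# started single-threaded; otherwise log that they were skipped.
if Threads.nthreads() == 1
    @testset "Distributed ref-counting (single-threaded only)" begin
        # ... the remotecall_fetch / haskey assertions would go here ...
        @test true  # placeholder for the real assertions
    end
else
    @info "Skipping Distributed ref-counting tests; flaky when multithreaded" Threads.nthreads()
end
```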

@DilumAluthge (Member, Author) commented:

Actually, we can do both. We can skip it by default, but have a Buildkite job that doesn't skip it.
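One way to get both behaviors is an opt-in environment variable that the experimental Buildkite job sets. This is only an illustrative sketch; the variable name is hypothetical and not from any actual PR.

```julia
# Hypothetical opt-in switch: skip the flaky tests on multithreaded Julia
# by default, but let a dedicated CI job force them on via the environment.
# (JULIA_TEST_FORCE_DISTRIBUTED_MT is an illustrative name.)
force_mt = get(ENV, "JULIA_TEST_FORCE_DISTRIBUTED_MT", "") == "1"
run_flaky_tests = Threads.nthreads() == 1 || force_mt
```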

@DilumAluthge (Member, Author) commented:

I'll do this in two parts:

The first part (skip these tests when Threads.nthreads() > 1, whether we are on CI or running locally) is here: #42764

I'll make the PR for the second part (experimental Buildkite job that runs the Distributed test suite with Threads.nthreads() > 1) once #42764 is merged.

@DilumAluthge deleted the dpa/distributed-test-poll-while branch on October 25, 2021
@KristofferC removed the backport 1.6 (Change should be backported to release-1.6) and backport 1.7 labels on Oct 28, 2021