
Distributed test suite: add an additional use of the poll_while function #42758

Closed

Conversation

@DilumAluthge (Member) commented on Oct 22, 2021:

Follow-up to #42499


This pull request is intended to fix the following test failure, which I am seeing very rarely in the tester_linux64_mt Buildkite job:

Distributed                         (1) |        started at 2021-10-22T10:10:08.433
Test Failed at /cache/build/amdci7-2/julialang/julia-master/julia-609a4a0a1e/share/julia/stdlib/v1.8/Distributed/test/distributed_exec.jl:270
  Expression: remotecall_fetch((k->begin
                haskey(Distributed.PGRP.refs, k)
            end), wid1, rrid) == true
   Evaluated: false == true
ERROR: LoadError: There was an error during testing
in expression starting at /cache/build/amdci7-2/julialang/julia-master/julia-609a4a0a1e/share/julia/stdlib/v1.8/Distributed/test/distributed_exec.jl:263
Distributed                         (1) |         failed at 2021-10-22T10:10:33.207

Full log: https://buildkite.com/julialang/julia-master/builds/5028#dfdf3a45-5510-402b-99d4-9daa860f0f09
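For context, the proposed fix replaces the one-shot assertion with a polling loop. The sketch below is a hypothetical re-implementation of the test suite's `poll_while` helper (introduced in #42499; the real signature in distributed_exec.jl may differ): it keeps calling `f` while it returns `true`, giving up after a timeout.

```julia
# Hypothetical sketch of a poll_while helper, assuming the semantics
# "poll while the condition holds, fail on timeout":
function poll_while(f::Function; timeout_seconds = 120)
    start = time()
    while f()
        sleep(0.1)
        # Give up once the timeout elapses while the condition still holds.
        (time() - start) > timeout_seconds && return false
    end
    return true
end

# The flaky one-shot assertion
#     @test remotecall_fetch(k -> haskey(Distributed.PGRP.refs, k), wid1, rrid)
# could then be rewritten to poll until the ref becomes visible,
# tolerating scheduling delays on the remote worker:
#     @test poll_while(() -> !remotecall_fetch(k -> haskey(Distributed.PGRP.refs, k), wid1, rrid))
```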

@DilumAluthge added the labels parallelism (Parallel or distributed computation), test (This change adds or pertains to unit tests), backport 1.6 (Change should be backported to release-1.6), and backport 1.7 on Oct 22, 2021
@DilumAluthge (Member, Author) commented:
I'm not sure if this change actually makes any sense.

@tkf (Member) left a comment:

If it's from the multi-threaded tests, I don't think this is unexpected. I think it'd be better to move tester_linux64_mt (or only Distributed's tests) to experimental, or to skip this test when nthreads() > 1.

I think moving the multithreaded tests for Distributed to experimental would be better than skipping them, since we could then do some post-hoc analysis on how flaky they have been.

@DilumAluthge (Member, Author) commented:

Hmmm. I definitely don't want to mark the entire tester_linux64_mt job as experimental, since all of the other test sets pass.

I guess we could make a separate job that specifically runs the Distributed job with multithreading enabled, and mark that job as experimental.

@DilumAluthge (Member, Author) commented:

> I guess we could make a separate job that specifically runs the Distributed job with multithreading enabled, and mark that job as experimental.

Hmmm. This doesn't fix the case where someone runs the tests locally with e.g. JULIA_NUM_THREADS=16 ./julia test/runtests.jl all.

If this test is known/expected to sometimes fail when Julia was started with multiple threads, I think I would prefer to go with your suggestion of skipping those specific tests if Threads.nthreads() > 1.
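A sketch of what such a guard might look like in the test file (the actual form eventually used may differ; the testset name here is illustrative):

```julia
using Test

# Hypothetical guard: run the race-prone checks only when Julia was
# started single-threaded; otherwise log that they were skipped.
if Threads.nthreads() == 1
    @testset "Distributed ref-counting (single-threaded only)" begin
        # ... the remotecall_fetch / haskey assertions would go here ...
        @test true  # placeholder for the real assertions
    end
else
    @info "Skipping Distributed ref-counting tests; flaky when multithreaded" Threads.nthreads()
end
```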

@DilumAluthge (Member, Author) commented:

Actually, we can do both. We can skip it by default, but have a Buildkite job that doesn't skip it.
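One way to get both behaviors is an opt-in environment variable that the experimental Buildkite job sets. This is only an illustrative sketch; the variable name is hypothetical and not from any actual PR.

```julia
# Hypothetical opt-in switch: skip the flaky tests on multithreaded Julia
# by default, but let a dedicated CI job force them on via the environment.
# (JULIA_TEST_FORCE_DISTRIBUTED_MT is an illustrative name.)
force_mt = get(ENV, "JULIA_TEST_FORCE_DISTRIBUTED_MT", "") == "1"
run_flaky_tests = Threads.nthreads() == 1 || force_mt
```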

@DilumAluthge (Member, Author) commented:

I'll do this in two parts:

The first part (skip these tests when Threads.nthreads() > 1, whether we are on CI or running locally) is here: #42764

I'll make the PR for the second part (experimental Buildkite job that runs the Distributed test suite with Threads.nthreads() > 1) once #42764 is merged.

@DilumAluthge deleted the dpa/distributed-test-poll-while branch on October 25, 2021
@KristofferC removed the backport 1.6 (Change should be backported to release-1.6) and backport 1.7 labels on Oct 28, 2021