fix "Sporadic test failures in BPD tests #3309" #3330
Conversation
The bpd tests bind a socket in order to test the protocol implementation. When running concurrently, this often resulted in an attempt to bind an already occupied port. By using the port number `0` we instead let the OS choose a free port. We then have to extract it from the socket (which is handled by `bluelet`) via `mock.patch`ing.
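For illustration, here's a minimal sketch of the port-`0` trick in plain Python (hypothetical, not the actual test code, which goes through `bluelet` and `mock.patch`):

```python
import socket

# Bind to port 0 so the OS assigns any currently free port.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 0))

# Read the port the OS actually assigned back out of the socket.
host, port = sock.getsockname()
print("the OS picked port", port)

sock.close()
```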
While testing I noticed a different error than the one in #3309. I'm not sure if and how I have caused it. In 64 test runs, each on Python 3.7 and 2.7 and each running all 64 test_player tests (i.e.
This timeout occurs when reading from the socket and is thus unlike #3309, which happens while connecting. The failed tests were (number of test run, Python version, failed test):
Interestingly, AppVeyor fails differently. It claims not to be able to pickle a local object, but both Travis and my laptop can. I guess I'll refactor
Under some circumstances (maybe under MS Windows?) local objects can't be pickled. When `start_server` is a local function this causes a crash: https://ci.appveyor.com/project/beetbox/beets/builds/25996163/job/rbp3frnkwsvbuwx6#L541 Make `start_server` a freestanding function to mitigate this.
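To illustrate the constraint (the names below are made up, not the test code): on platforms that spawn rather than fork, `multiprocessing` has to pickle the target callable, and locally defined functions can't be pickled.

```python
import multiprocessing

def start_server(port):
    # A module-level ("freestanding") function is picklable, so it
    # works as a Process target even when children are spawned.
    print("would start a server on port", port)

def make_local_target():
    def start_server_local(port):
        print("would start a server on port", port)
    # Passing this to Process under the spawn start method fails with
    # "Can't pickle local object ..." -- the crash seen on AppVeyor.
    return start_server_local

if __name__ == "__main__":
    proc = multiprocessing.Process(target=start_server, args=(0,))
    proc.start()
    proc.join()
```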
The pickling issue seems to be resolved but the
This general approach looks promising. I ran into the pickling issue on Windows when I wrote these tests. Since on Windows there's no `fork`, anything handed to a child process has to be pickled.

These tests are much more delicate than I'd wanted, but as Adrian probably remembers, I tried many different approaches to concurrency and this was the only way I could get it working. The BPD server (which handles connections using bluelet) needs to be running at the same time as the clients launched by the tests. Multiple types of concurrency plus sockets make this challenging to get right in a portable way.

It seems that there are 3 distinct failures in the current CI run. On Linux there's the
and
I wonder if there's a scheduling issue related to how the tests are being run (I'm not sure how nose implements the parallelism). Or perhaps there's some sort of race when setting up and tearing down the different components. I tried to make all that as explicit as possible, and I thought I'd avoided races, but I could well be wrong. Locally I use pytest, but even with nose I could never reproduce this flakiness outside the CI.

The tests are very useful when developing the BPD plugin, but unless BPD is being changed there's probably not much benefit in running the whole suite. I wonder if at this point we should just turn (most of) it off by default to avoid inflicting it on unrelated PRs. We can ask people changing BPD to make sure they run the tests locally. I know that's not ideal, but maybe it's the simplest answer.
Wow; this is pretty tricky. Thanks for looking into it, both of you.

The error on Windows about the SQLite database file is something we've run into before, when initially setting up the Windows tests: on that platform, you're not allowed to open the same database file concurrently from different processes. It's usually not too hard to avoid this restriction by having child processes/threads create a new database file or, if possible, just use SQLite's special in-memory database.

The racy port & connection issues, however, are pretty much a total mystery to me. It really seems like this should work, and I don't see any obvious place where there should be a race. 😱 It will take some real debugging wizardry to pin that down…
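For reference, a tiny sketch of the in-memory option using plain `sqlite3` (beets' actual library layer is more involved):

```python
import sqlite3

# An in-memory database is private to the connection that creates it,
# so it can't collide with another process holding a file database open.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO items (title) VALUES (?)", ("smoke test",))
print(conn.execute("SELECT title FROM items").fetchall())
conn.close()
```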
I do not think the problem is accessing the database from different processes; the traceback mentions
Though we do open the library from two different processes without closing it in between. This does not seem to cause a problem, but we could try closing it anyway.
Close sockets in `finally` clauses only after they have actually been created.
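The pattern, sketched on a bare socket (the address here is just an example, not the test setup):

```python
import socket

sock = None
try:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    # ... use the socket ...
finally:
    # Only close the socket if it was actually created; if socket()
    # itself raised, sock is still None and there is nothing to close.
    if sock is not None:
        sock.close()
```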
Is it possible to cancel AppVeyor jobs? These take way too long.
Are there multiple
Use a `multiprocessing.Queue` instead of a `multiprocessing.Value` to avoid the manual polling/timeout handling.

TODO: Strangely, `Listener` seems to be constructed twice. Only the second one is used. Fix that and then remove the code working around it.
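A sketch of the difference with made-up names: with a `Queue`, the parent simply blocks in `get()` with a timeout instead of polling a shared `Value` in a loop.

```python
import multiprocessing

def serve(queue):
    port = 12345  # stand-in for the port the OS assigned to the server
    queue.put(port)

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=serve, args=(queue,))
    proc.start()
    # Blocks until the child sends the port; raises queue.Empty after
    # 10 seconds. No manual polling/timeout loop needed.
    port = queue.get(timeout=10)
    print("server reported port", port)
    proc.join()
```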
And the second entry in the queue is always the correct port. I have hacked up some code that always waits for the second value. Seems quite stable on my system; let's try it on the CI! But we should definitely search for the root cause (i.e. the duplicated `Listener`).
Well, now, that is a new error:
This module should be supplied by `futures`.
When setting up bpd tests, two servers are started: first a control server, then bpd. Both send their assigned ports down a queue. The recipient only needs bpd's port and thus skips the first queue entry.
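Sketched with hypothetical stand-ins (the real setup launches actual servers that report OS-assigned ports; here the startups are serialized for simplicity):

```python
import multiprocessing

def start_control_server(queue):
    queue.put(10001)  # stand-in for the control server's assigned port

def start_bpd(queue):
    queue.put(10002)  # stand-in for bpd's assigned port

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    multiprocessing.Process(target=start_control_server, args=(queue,)).start()
    queue.get(timeout=10)  # first entry: the control server's port, unused
    multiprocessing.Process(target=start_bpd, args=(queue,)).start()
    bpd_port = queue.get(timeout=10)  # second entry: the one the tests need
    print("clients should connect to port", bpd_port)
```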
…tbox#3330 Add a changelog entry asking plugin developers to report any further occurrences of this failure.
Wow! This is truly heroic! Thank you for digging deep into this problem; this was not easy to find, obviously, but this looks to me like exactly the right fix. I've tried running the tests locally several times, including in parallel (using
Please note that I'm not really familiar with any of the relevant technologies (the `bpd` plugin, Python's `multiprocessing`, `bluelet`, sockets, and `unittest.mock.patch`). Please review thoroughly.