
fix "Sporadic test failures in BPD tests #3309" #3330

Merged · 7 commits · Jul 19, 2019

Conversation

zsinskri
Contributor

The bpd tests bind a socket in order to test the protocol implementation. When
run concurrently, this often resulted in an attempt to bind an already
occupied port.

By using the port number `0` we instead let the OS choose a free port. We then
have to extract it from the socket (which is handled by `bluelet`) via
`mock.patch`ing.

Please note that I'm not really familiar with any of the relevant technologies (the bpd plugin, Python's multiprocessing, bluelet, sockets, and unittest.mock.patch). Please review thoroughly.
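
For illustration, here's a minimal sketch of the ephemeral-port trick using plain `socket` calls, nothing bpd-specific; in the actual tests the binding happens inside `bluelet`, which is why `mock.patch` is needed to reach the socket:

```python
import socket

# Bind to port 0 so the OS assigns a free ephemeral port instead of a
# fixed one that a concurrent test run may already occupy.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(('127.0.0.1', 0))
sock.listen(1)

# The kernel-assigned port can then be read back from the bound socket.
host, port = sock.getsockname()
print('listening on port', port)
```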

@zsinskri
Contributor Author

While testing I noticed a different error than the one in #3309. I'm not sure whether, or how, I caused it.

In 64 test runs under each of Python 3.7 and 2.7, with each run executing all 64 test_player tests (i.e. 64 × 64 × 2 = 8192 tests), 26 tests failed, 24 of them under Python 2.7.

Traceback (most recent call last):
  File "<repo>/test/test_player.py", line <...>, in test_<...>
    with self.run_bpd() as client:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "<repo>/test/test_player.py", line 332, in run_bpd
    yield MPCClient(sock, do_hello)
  File "<repo>/test/test_player.py", line 153, in __init__
    hello = self.get_response()
  File "<repo>/test/test_player.py", line 166, in get_response
    line = self.readline()
  File "<repo>/test/test_player.py", line 224, in readline
    data = self.sock.recv(bufsize)
timeout: timed out

This timeout occurs when reading from the socket, unlike #3309, which happened while connecting.
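
To make the distinction concrete, here is a minimal sketch (hypothetical address and port, not the test code itself) of where the two timeouts arise:

```python
import socket

# A connect timeout (the #3309 failure mode) would fire here, before any
# data is exchanged:
sock = socket.create_connection(('127.0.0.1', 6600), timeout=5)

# A recv timeout (the failure above) fires here: the connection succeeded,
# but the server never wrote its hello line.
try:
    data = sock.recv(1024)
except socket.timeout:
    print('connected, but the server sent nothing')
```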

The failed tests were (number of test run, python version, failed test):

#07 py27 test_cmd_listplaylist (test.test_player.BPDPlaylistsTest)
#11 py27 test_cmd_playlistmove (test.test_player.BPDPlaylistsTest)
#12 py27 test_cmd_mixrampdb (test.test_player.BPDPlaybackTest)
#15 py27 test_cmd_listplaylists (test.test_player.BPDPlaylistsTest)
#17 py27 test_cmd_mixrampdb (test.test_player.BPDPlaybackTest)
#18 py27 test_cmd_playlistdelete (test.test_player.BPDPlaylistsTest)
#18 py27 test_cmd_save (test.test_player.BPDPlaylistsTest)
#20 py27 test_cmd_playlistinfo (test.test_player.BPDQueueTest)
#21 py27 test_cmd_consume_in_reverse (test.test_player.BPDPlaybackTest)
#21 py27 test_cmd_listplaylistinfo (test.test_player.BPDPlaylistsTest)
#31 py27 test_cmd_listplaylistinfo (test.test_player.BPDPlaylistsTest)
#37 py27 test_cmd_decoders (test.test_player.BPDReflectionTest)
#37 py27 test_empty_request (test.test_player.BPDTest)
#37 py37 test_unexpected_argument (test.test_player.BPDTest)
#38 py27 test_cmd_playlistdelete (test.test_player.BPDPlaylistsTest)
#42 py27 test_cmd_currentsong_tagtypes (test.test_player.BPDQueryTest)
#46 py27 test_cmd_ping (test.test_player.BPDConnectionTest)
#48 py27 test_missing_argument (test.test_player.BPDTest)
#49 py27 test_cmd_list (test.test_player.BPDDatabaseTest)
#51 py27 test_cmd_playlistinfo_tagtypes (test.test_player.BPDQueueTest)
#52 py27 test_cmd_mixrampdb (test.test_player.BPDPlaybackTest)
#54 py37 test_cmd_lsinfo (test.test_player.BPDDatabaseTest)
#57 py27 test_cmd_list (test.test_player.BPDDatabaseTest)
#57 py27 test_cmd_idle_with_pending (test.test_player.BPDQueryTest)
#62 py27 test_cmd_replay_gain (test.test_player.BPDPlaybackTest)
#62 py27 test_cmd_rename (test.test_player.BPDPlaylistsTest)

@zsinskri
Contributor Author

Interestingly, AppVeyor fails differently. It claims not to be able to pickle a local object, but both Travis and my laptop can. I guess I'll refactor start_server to be a freestanding function again.

Under some circumstances (maybe under MS Windows?) local objects can't be
pickled. When `start_server` is a local function this causes a crash:
https://ci.appveyor.com/project/beetbox/beets/builds/25996163/job/rbp3frnkwsvbuwx6#L541

Make `start_server` a freestanding function to mitigate this.
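
For reference, a small reproduction of that constraint, assuming the `spawn` start method that Windows uses (Python 3 syntax; on Linux you can force `spawn` to see the same failure):

```python
import multiprocessing as mp


def module_level_target():
    # Importable by name from the child process, and therefore picklable.
    print('child ran')


def main():
    def local_target():
        print('child ran')

    ctx = mp.get_context('spawn')  # Windows has no fork; spawn pickles the target

    try:
        # Fails with "Can't pickle local object 'main.<locals>.local_target'":
        # the child cannot locate a nested function by name when unpickling.
        ctx.Process(target=local_target).start()
    except Exception as exc:
        print('local function failed:', exc)

    # Moving the target to module scope fixes it.
    p = ctx.Process(target=module_level_target)
    p.start()
    p.join()


if __name__ == '__main__':
    main()
```
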
@zsinskri
Contributor Author

The pickling issue seems to be resolved, but the sock.recv timeout persists.

@arcresu
Member

arcresu commented Jul 16, 2019

This general approach looks promising. I ran into the pickling issue on Windows when I wrote these tests. Since Windows has no os.fork(), Python works around it using pickle-based IPC. One of the objects that needs to be passed includes the database connection in its object hierarchy, which can't be pickled.

These tests are much more delicate than I'd wanted, but as Adrian probably remembers, I tried many different approaches to concurrency and this was the only way I could get it working. The BPD server (which handles connections using bluelet) needs to be running at the same time as the clients launched by the tests. Multiple kinds of concurrency plus sockets make this challenging to get right in a portable way.

It seems that there are three distinct failures in the current CI run. On Linux there's the sock.recv timeout; the Windows CI jobs show the same failure and also:

======================================================================
ERROR: test_cmd_listplaylistinfo (test.test_player.BPDPlaylistsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\beets\test\test_player.py", line 277, in tearDown
    self.teardown_beets()
  File "C:\projects\beets\test\helper.py", line 211, in teardown_beets
    self.remove_temp_dir()
  File "C:\projects\beets\test\helper.py", line 480, in remove_temp_dir
    shutil.rmtree(self.temp_dir)
  File "c:\python27\Lib\shutil.py", line 275, in rmtree
    onerror(os.remove, fullname, sys.exc_info())
  File "c:\python27\Lib\shutil.py", line 273, in rmtree
    os.remove(fullname)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\appveyor\\appdata\\local\\temp\\1\\tmp5lipem\\library.db'

and

======================================================================
ERROR: test_cmd_listplaylistinfo (test.test_player.BPDPlaylistsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\beets\test\test_player.py", line 837, in test_cmd_listplaylistinfo
    with self.run_bpd() as client:
  File "c:\python27\Lib\contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "C:\projects\beets\test\test_player.py", line 321, in run_bpd
    raise RuntimeError('Timed out waiting for the BPD server')
RuntimeError: Timed out waiting for the BPD server

I wonder if there's a scheduling issue related to how the tests are being run (I'm not sure how nose implements the parallelism), or perhaps some sort of race when setting up and tearing down the different components. I tried to make all of that as explicit as possible and thought I'd avoided races, but I could well be wrong. Locally I use pytest, but even with nose I could never reproduce this flakiness outside the CI.

The tests are very useful when developing the BPD plugin, but unless BPD is being changed there's probably not much benefit in running the whole suite. I wonder if at this point we should just turn (most of) it off by default to avoid inflicting it on unrelated PRs. We can ask people changing BPD to make sure they run the tests locally. I know that's not ideal but maybe it's the simplest answer.

@sampsyo
Member

sampsyo commented Jul 16, 2019

Wow; this is pretty tricky. Thanks for looking into it, both of you.

The error on Windows about the SQLite database file is something we've run into before, when initially setting up the Windows tests—on that platform, you're not allowed to open the same database file concurrently from different processes. It's usually not too hard to avoid this restriction by making child processes/threads make a new database file or, if possible, just use SQLite's special :memory: database to avoid using a file at all.
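
For reference, a minimal sketch of the `:memory:` workaround using plain `sqlite3` (whether beets' test helpers can be pointed at it is a separate question):

```python
import sqlite3

# An in-memory database never touches the filesystem, so there is no file
# for Windows to hold a lock on during tearDown.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)')
conn.execute("INSERT INTO items (title) VALUES ('test')")
print(conn.execute('SELECT title FROM items').fetchone())
conn.close()
```

(One caveat: an in-memory database is private to its connection, so a child server process would still need its own copy of the data or a shared temporary file.)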

The racy port & connection issues, however, are pretty much a total mystery to me. It really seems like this should work, and I don't see any obvious place where there should be a race. 😱 It will take some real debugging wizardry to pin that down…

@zsinskri
Contributor Author

I don't think the problem is accessing the database from different processes; the traceback mentions unlink. I think this is a test failing to shut down the server: tearDown then tries to remove the temp dir but fails because the library file is still open in the server process.

Though we do open the library from two different processes without closing it in between:

  • we open it in the main process
  • add 2 items
  • do NOT close it
  • open it from the server

This does not seem to cause a problem, but we could try closing it anyway.

Close sockets in `finally` clauses only after they have actually been created.
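
The pattern that commit applies, sketched with a hypothetical port; the guard matters because `create_connection` itself can fail, leaving nothing to close:

```python
import socket

PORT = 6600  # hypothetical; the real tests use the OS-assigned port

sock = None
try:
    sock = socket.create_connection(('127.0.0.1', PORT), timeout=5)
    # ... exercise the server ...
finally:
    # If create_connection raised, `sock` is still None and calling
    # close() would mask the original error with a second one.
    if sock is not None:
        sock.close()
```
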
@zsinskri
Contributor Author

Is it possible to cancel AppVeyor jobs? These take way too long.

@zsinskri
Contributor Author

Are multiple bluelet.Listener instances being created? I tried replacing the mp.Value with an mp.Queue, and I can consistently read exactly two different ports from that queue per run_bpd call, even though every (mock.patched) bluelet.Listener.__init__ binds once to port 0 and then sends the assigned port down the queue.

Use a `multiprocessing.Queue` instead of a `multiprocessing.Value` to avoid the
manual polling/timeout handling.

TODO: Strangely Listener seems to be constructed twice. Only the second one is
used. Fix that and then remove the code working around it.
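
Roughly what that change amounts to, sketched without the bpd machinery (the names here are illustrative, not the patched `bluelet.Listener` itself); `Queue.get` blocks until a value arrives, so no polling loop is needed:

```python
import multiprocessing as mp
import socket


def server(port_queue):
    # Stand-in for the patched Listener.__init__: bind to port 0, then
    # report the kernel-assigned port back to the test process.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(('127.0.0.1', 0))
    sock.listen(1)
    port_queue.put(sock.getsockname()[1])
    # ... serve connections ...


if __name__ == '__main__':
    queue = mp.Queue()
    mp.Process(target=server, args=(queue,)).start()
    # Unlike polling an mp.Value, get() simply blocks until the server
    # has actually bound its socket (with a timeout as a safety net).
    port = queue.get(timeout=10)
    print('server bound to port', port)
```
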
@zsinskri
Contributor Author

And the second entry in the queue is always the correct port.
That means the code polling the mp.Value was racing the construction of the second bluelet.Listener, reading the wrong port when it won the race. All reads on that port then time out, since the port is bound but no one is listening.

I've hacked up some code that always waits for the second value. It seems quite stable on my system, so let's try it on CI! But we should definitely look for the root cause (i.e. the duplicated bluelet.Listener construction).

@zsinskri
Contributor Author

zsinskri commented Jul 17, 2019

Well, now, that is a new error:

ERROR: Failure: ImportError (No module named html.entities)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\beets\.tox\py27-test\lib\site-packages\nose\loader.py", line 418, in loadTestsFromName
    addr.filename, addr.module)
  File "C:\projects\beets\.tox\py27-test\lib\site-packages\nose\importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "C:\projects\beets\.tox\py27-test\lib\site-packages\nose\importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "C:\projects\beets\test\test_lastgenre.py", line 24, in <module>
    from beetsplug import lastgenre
  File "C:\projects\beets\beetsplug\lastgenre\__init__.py", line 28, in <module>
    import pylast
  File "C:\projects\beets\.tox\py27-test\lib\site-packages\pylast\__init__.py", line 25, in <module>
    import html.entities
ImportError: No module named html.entities

This module should be supplied by the `future` package.
AppVeyor's py27 build has this failure three times; the other AppVeyor builds have not yet finished.
But it's promising that there are no more errors in test_player! At least not yet…

When setting up bpd tests, two servers are started: first a control server, then
bpd. Both send their assigned ports down a queue. The recipient only needs bpd's
port and thus skips the first queue entry.
…tbox#3330
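
In sketch form, with hypothetical port numbers simulating the two startup reports:

```python
import multiprocessing as mp

queue = mp.Queue()
queue.put(41235)  # the control server reports its port first
queue.put(41236)  # then bpd reports the port the client actually needs

queue.get(timeout=10)             # skip the control server's port
bpd_port = queue.get(timeout=10)  # bpd's port
print('connect to bpd on port', bpd_port)
```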

Add a changelog entry asking plugin developers to report any further occurrences
of this failure.
@zsinskri marked this pull request as ready for review on July 17, 2019, 17:30
@zsinskri
Contributor Author

I had 64 × (py27 + py37) successful local test runs, and in addition to the currently running CI tests, Travis and AppVeyor have each already completed successfully once.

Someone with the appropriate rights could maybe retrigger the CI tests a few times, just to be sure.

@sampsyo
Member

sampsyo commented Jul 19, 2019

Wow! This is truly heroic! Thank you for digging deep into this problem—this was not easy to find, obviously, but this looks to me like exactly the right fix. I've tried running the tests locally several times, including in parallel (using nosetests --processes=4) without incident. Awesome!!

@sampsyo merged commit 5adb3ec into beetbox:master on Jul 19, 2019
@sampsyo
Member

sampsyo commented Jul 19, 2019

In general, thank you for all your recent contributions, @zsinskri! I've sent you an invitation to the @beetbox org so you can push directly if need be. But don't let that stop you from asking for code reviews with PRs, of course. 😃
