
fix "Sporadic test failures in BPD tests #3309" #3330

Merged · 7 commits · Jul 19, 2019

Conversation

zsinskri
Contributor

The bpd tests bind a socket in order to test the protocol implementation. When
run concurrently, this often resulted in an attempt to bind an already
occupied port.

By using the port number `0` we instead let the OS choose a free port. We then
have to extract it from the socket (which is handled by `bluelet`) via
`mock.patch`ing.

Please note that I'm not really familiar with any of the relevant technologies (the bpd plugin, Python's multiprocessing, bluelet, sockets, and unittest.mock.patch). Please review thoroughly.
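
For illustration, here's a minimal sketch of the ephemeral-port trick using plain `socket` calls, nothing bpd-specific; in the actual tests the binding happens inside `bluelet`, which is why `mock.patch` is needed to reach the socket:

```python
import socket

# Bind to port 0 so the OS assigns a free ephemeral port instead of a
# fixed one that a concurrent test run may already occupy.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(('127.0.0.1', 0))
sock.listen(1)

# The kernel-assigned port can then be read back from the bound socket.
host, port = sock.getsockname()
print('listening on port', port)
```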

@zsinskri
Contributor Author

While testing I noticed a different error than the one in #3309. I'm not sure whether, or how, I caused it.

In 64 test runs under each of Python 3.7 and 2.7, with each run executing all 64 test_player tests (i.e. 64 × 64 × 2 = 8192 tests), 26 tests failed, 24 of them under Python 2.7.

Traceback (most recent call last):
  File "<repo>/test/test_player.py", line <...>, in test_<...>
    with self.run_bpd() as client:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "<repo>/test/test_player.py", line 332, in run_bpd
    yield MPCClient(sock, do_hello)
  File "<repo>/test/test_player.py", line 153, in __init__
    hello = self.get_response()
  File "<repo>/test/test_player.py", line 166, in get_response
    line = self.readline()
  File "<repo>/test/test_player.py", line 224, in readline
    data = self.sock.recv(bufsize)
timeout: timed out

This timeout occurs when reading from the socket, unlike #3309, which happened while connecting.
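
To make the distinction concrete, here is a minimal sketch (hypothetical address and port, not the test code itself) of where the two timeouts arise:

```python
import socket

# A connect timeout (the #3309 failure mode) would fire here, before any
# data is exchanged:
sock = socket.create_connection(('127.0.0.1', 6600), timeout=5)

# A recv timeout (the failure above) fires here: the connection succeeded,
# but the server never wrote its hello line.
try:
    data = sock.recv(1024)
except socket.timeout:
    print('connected, but the server sent nothing')
```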

The failed tests were (number of test run, python version, failed test):

#07 py27 test_cmd_listplaylist (test.test_player.BPDPlaylistsTest)
#11 py27 test_cmd_playlistmove (test.test_player.BPDPlaylistsTest)
#12 py27 test_cmd_mixrampdb (test.test_player.BPDPlaybackTest)
#15 py27 test_cmd_listplaylists (test.test_player.BPDPlaylistsTest)
#17 py27 test_cmd_mixrampdb (test.test_player.BPDPlaybackTest)
#18 py27 test_cmd_playlistdelete (test.test_player.BPDPlaylistsTest)
#18 py27 test_cmd_save (test.test_player.BPDPlaylistsTest)
#20 py27 test_cmd_playlistinfo (test.test_player.BPDQueueTest)
#21 py27 test_cmd_consume_in_reverse (test.test_player.BPDPlaybackTest)
#21 py27 test_cmd_listplaylistinfo (test.test_player.BPDPlaylistsTest)
#31 py27 test_cmd_listplaylistinfo (test.test_player.BPDPlaylistsTest)
#37 py27 test_cmd_decoders (test.test_player.BPDReflectionTest)
#37 py27 test_empty_request (test.test_player.BPDTest)
#37 py37 test_unexpected_argument (test.test_player.BPDTest)
#38 py27 test_cmd_playlistdelete (test.test_player.BPDPlaylistsTest)
#42 py27 test_cmd_currentsong_tagtypes (test.test_player.BPDQueryTest)
#46 py27 test_cmd_ping (test.test_player.BPDConnectionTest)
#48 py27 test_missing_argument (test.test_player.BPDTest)
#49 py27 test_cmd_list (test.test_player.BPDDatabaseTest)
#51 py27 test_cmd_playlistinfo_tagtypes (test.test_player.BPDQueueTest)
#52 py27 test_cmd_mixrampdb (test.test_player.BPDPlaybackTest)
#54 py37 test_cmd_lsinfo (test.test_player.BPDDatabaseTest)
#57 py27 test_cmd_list (test.test_player.BPDDatabaseTest)
#57 py27 test_cmd_idle_with_pending (test.test_player.BPDQueryTest)
#62 py27 test_cmd_replay_gain (test.test_player.BPDPlaybackTest)
#62 py27 test_cmd_rename (test.test_player.BPDPlaylistsTest)

@zsinskri
Contributor Author

Interestingly, AppVeyor fails differently. It claims not to be able to pickle a local object, but both Travis and my laptop can. I guess I'll refactor start_server to be a freestanding function again.

Under some circumstances (maybe under MS Windows?) local objects can't be
pickled. When `start_server` is a local function this causes a crash:
https://ci.appveyor.com/project/beetbox/beets/builds/25996163/job/rbp3frnkwsvbuwx6#L541

Make `start_server` a freestanding function to mitigate this.
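
For reference, a small reproduction of that constraint, assuming the `spawn` start method that Windows uses (Python 3 syntax; on Linux you can force `spawn` to see the same failure):

```python
import multiprocessing as mp


def module_level_target():
    # Importable by name from the child process, and therefore picklable.
    print('child ran')


def main():
    def local_target():
        print('child ran')

    ctx = mp.get_context('spawn')  # Windows has no fork; spawn pickles the target

    try:
        # Fails with "Can't pickle local object 'main.<locals>.local_target'":
        # the child cannot locate a nested function by name when unpickling.
        ctx.Process(target=local_target).start()
    except Exception as exc:
        print('local function failed:', exc)

    # Moving the target to module scope fixes it.
    p = ctx.Process(target=module_level_target)
    p.start()
    p.join()


if __name__ == '__main__':
    main()
```
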
@zsinskri
Contributor Author

The pickling issue seems to be resolved, but the sock.recv timeout persists.

@arcresu
Member

arcresu commented Jul 16, 2019

This general approach looks promising. I ran into the pickling issue on Windows when I wrote these tests. Since Windows has no os.fork(), Python works around it using pickle-based IPC. One of the objects that needs to be passed includes the database connection in its object hierarchy, which can't be pickled.

These tests are much more delicate than I'd wanted, but as Adrian probably remembers, I tried many different approaches to concurrency and this was the only way I could get it working. The BPD server (which handles connections using bluelet) needs to be running at the same time as the clients launched by the tests. Multiple kinds of concurrency plus sockets make this challenging to get right in a portable way.

It seems that there are three distinct failures in the current CI run. On Linux there's the sock.recv timeout; the Windows CI jobs show the same failure and also:

======================================================================
ERROR: test_cmd_listplaylistinfo (test.test_player.BPDPlaylistsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\beets\test\test_player.py", line 277, in tearDown
    self.teardown_beets()
  File "C:\projects\beets\test\helper.py", line 211, in teardown_beets
    self.remove_temp_dir()
  File "C:\projects\beets\test\helper.py", line 480, in remove_temp_dir
    shutil.rmtree(self.temp_dir)
  File "c:\python27\Lib\shutil.py", line 275, in rmtree
    onerror(os.remove, fullname, sys.exc_info())
  File "c:\python27\Lib\shutil.py", line 273, in rmtree
    os.remove(fullname)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\appveyor\\appdata\\local\\temp\\1\\tmp5lipem\\library.db'

and

======================================================================
ERROR: test_cmd_listplaylistinfo (test.test_player.BPDPlaylistsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\beets\test\test_player.py", line 837, in test_cmd_listplaylistinfo
    with self.run_bpd() as client:
  File "c:\python27\Lib\contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "C:\projects\beets\test\test_player.py", line 321, in run_bpd
    raise RuntimeError('Timed out waiting for the BPD server')
RuntimeError: Timed out waiting for the BPD server

I wonder if there's a scheduling issue related to how the tests are being run (I'm not sure how nose implements the parallelism), or perhaps some sort of race when setting up and tearing down the different components. I tried to make all of that as explicit as possible and thought I'd avoided races, but I could well be wrong. Locally I use pytest, but even with nose I could never reproduce this flakiness outside the CI.

The tests are very useful when developing the BPD plugin, but unless BPD is being changed there's probably not much benefit in running the whole suite. I wonder if at this point we should just turn (most of) it off by default to avoid inflicting it on unrelated PRs. We can ask people changing BPD to make sure they run the tests locally. I know that's not ideal but maybe it's the simplest answer.

@sampsyo
Member

sampsyo commented Jul 16, 2019

Wow; this is pretty tricky. Thanks for looking into it, both of you.

The error on Windows about the SQLite database file is something we've run into before, when initially setting up the Windows tests—on that platform, you're not allowed to open the same database file concurrently from different processes. It's usually not too hard to avoid this restriction by making child processes/threads make a new database file or, if possible, just use SQLite's special :memory: database to avoid using a file at all.
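
For reference, a minimal sketch of the `:memory:` workaround using plain `sqlite3` (whether beets' test helpers can be pointed at it is a separate question):

```python
import sqlite3

# An in-memory database never touches the filesystem, so there is no file
# for Windows to hold a lock on during tearDown.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)')
conn.execute("INSERT INTO items (title) VALUES ('test')")
print(conn.execute('SELECT title FROM items').fetchone())
conn.close()
```

(One caveat: an in-memory database is private to its connection, so a child server process would still need its own copy of the data or a shared temporary file.)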

The racy port & connection issues, however, are pretty much a total mystery to me. It really seems like this should work, and I don't see any obvious place where there should be a race. 😱 It will take some real debugging wizardry to pin that down…

@zsinskri
Contributor Author

I don't think the problem is accessing the database from different processes; the traceback mentions unlink. I think this is a test failing to shut down the server: tearDown then tries to remove the temp dir but fails because the library file is still open in the server process.

Though we do open the library from two different processes without closing it in between:

  • we open it in the main process
  • add 2 items
  • do NOT close it
  • open it from the server

This does not seem to cause a problem, but we could try closing it anyway.

Close sockets in `finally` clauses only after they have actually been created.
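
The pattern that commit applies, sketched with a hypothetical port; the guard matters because `create_connection` itself can fail, leaving nothing to close:

```python
import socket

PORT = 6600  # hypothetical; the real tests use the OS-assigned port

sock = None
try:
    sock = socket.create_connection(('127.0.0.1', PORT), timeout=5)
    # ... exercise the server ...
finally:
    # If create_connection raised, `sock` is still None and calling
    # close() would mask the original error with a second one.
    if sock is not None:
        sock.close()
```
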
@zsinskri
Contributor Author

Is it possible to cancel AppVeyor jobs? These take way too long.

@zsinskri
Contributor Author

Are multiple bluelet.Listener instances being created? I tried replacing the mp.Value with an mp.Queue, and I can consistently read exactly two different ports from that queue per run_bpd call, even though every (mock.patched) bluelet.Listener.__init__ binds once to port 0 and then sends the assigned port down the queue.

Use a `multiprocessing.Queue` instead of a `multiprocessing.Value` to avoid the
manual polling/timeout handling.

TODO: Strangely Listener seems to be constructed twice. Only the second one is
used. Fix that and then remove the code working around it.
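
Roughly what that change amounts to, sketched without the bpd machinery (the names here are illustrative, not the patched `bluelet.Listener` itself); `Queue.get` blocks until a value arrives, so no polling loop is needed:

```python
import multiprocessing as mp
import socket


def server(port_queue):
    # Stand-in for the patched Listener.__init__: bind to port 0, then
    # report the kernel-assigned port back to the test process.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(('127.0.0.1', 0))
    sock.listen(1)
    port_queue.put(sock.getsockname()[1])
    # ... serve connections ...


if __name__ == '__main__':
    queue = mp.Queue()
    mp.Process(target=server, args=(queue,)).start()
    # Unlike polling an mp.Value, get() simply blocks until the server
    # has actually bound its socket (with a timeout as a safety net).
    port = queue.get(timeout=10)
    print('server bound to port', port)
```
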
@zsinskri
Contributor Author

And the second entry in the queue is always the correct port.
That means the code polling the mp.Value was racing the construction of the second bluelet.Listener, reading the wrong port when it won the race. All reads on that port then time out, since the port is bound but no one is listening.

I've hacked up some code that always waits for the second value. It seems quite stable on my system, so let's try it on CI! But we should definitely look for the root cause (i.e. the duplicated bluelet.Listener construction).

@zsinskri
Contributor Author

zsinskri commented Jul 17, 2019

Well, now, that is a new error:

ERROR: Failure: ImportError (No module named html.entities)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\beets\.tox\py27-test\lib\site-packages\nose\loader.py", line 418, in loadTestsFromName
    addr.filename, addr.module)
  File "C:\projects\beets\.tox\py27-test\lib\site-packages\nose\importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "C:\projects\beets\.tox\py27-test\lib\site-packages\nose\importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "C:\projects\beets\test\test_lastgenre.py", line 24, in <module>
    from beetsplug import lastgenre
  File "C:\projects\beets\beetsplug\lastgenre\__init__.py", line 28, in <module>
    import pylast
  File "C:\projects\beets\.tox\py27-test\lib\site-packages\pylast\__init__.py", line 25, in <module>
    import html.entities
ImportError: No module named html.entities

This module should be supplied by the `future` package.
AppVeyor's py27 build has this failure three times; the other AppVeyor builds have not yet finished.
But it's promising that there are no more errors in test_player! At least not yet…

When setting up bpd tests, two servers are started: first a control server, then
bpd. Both send their assigned ports down a queue. The recipient only needs bpd's
port and thus skips the first queue entry.
…tbox#3330
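
In sketch form, with hypothetical port numbers simulating the two startup reports:

```python
import multiprocessing as mp

queue = mp.Queue()
queue.put(41235)  # the control server reports its port first
queue.put(41236)  # then bpd reports the port the client actually needs

queue.get(timeout=10)             # skip the control server's port
bpd_port = queue.get(timeout=10)  # bpd's port
print('connect to bpd on port', bpd_port)
```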

Add a changelog entry asking plugin developers to report any further occurrences
of this failure.
@zsinskri marked this pull request as ready for review on July 17, 2019, 17:30
@zsinskri
Contributor Author

I had 64 × (py27 + py37) successful local test runs, and in addition to the currently running CI tests, Travis and AppVeyor have each already completed successfully once.

Someone with the appropriate rights could maybe retrigger the CI tests a few times, just to be sure.

@sampsyo
Member

sampsyo commented Jul 19, 2019

Wow! This is truly heroic! Thank you for digging deep into this problem—this was not easy to find, obviously, but this looks to me like exactly the right fix. I've tried running the tests locally several times, including in parallel (using nosetests --processes=4) without incident. Awesome!!

@sampsyo merged commit 5adb3ec into beetbox:master on Jul 19, 2019
@sampsyo
Member

sampsyo commented Jul 19, 2019

In general, thank you for all your recent contributions, @zsinskri! I've sent you an invitation to the @beetbox org so you can push directly if need be. But don't let that stop you from asking for code reviews with PRs, of course. 😃
