Python ClientWrapperTest fails occasionally #1714
Well investigated. However, it might be related to something else. On my machine (Gentoo Linux, protobuf-3.14.0 installed via system package manager), the test passes:
Protobuf (and its Python bindings) are already "stable" on Gentoo Linux (at least for amd64): https://packages.gentoo.org/packages/dev-libs/protobuf and https://packages.gentoo.org/packages/dev-python/protobuf-python. I do have
But basically no package (including protobuf) installs bindings for Python 2 anymore. Configuration of ola-0.10.8 goes like this:
Gentoo Linux is a bit exotic for people not used to it, but I can provide a Dockerfile so you can reproduce and/or continue investigating with it if you'd like. |
I also just tested with the latest master (commit a95a6f7) and the tests all PASS there too. However, my system runs fewer tests than yours:
|
That matches my experience on Fedora with protobuf 3.14, so it appears the issue is confined to Ubuntu. Going to start digging through changelogs. |
I tweaked the failing test a bit to get better info:

diff --git a/python/ola/ClientWrapperTest.py b/python/ola/ClientWrapperTest.py
index cd1e5cf2d..a24041d15 100644
--- a/python/ola/ClientWrapperTest.py
+++ b/python/ola/ClientWrapperTest.py
@@ -186,13 +186,13 @@ class ClientWrapperTest(unittest.TestCase):
self.assertIsNotNone(results.c_called)
self.assertIsNotNone(results.d_called)
- self.assertTrue(results.a_called - self.start <
+ self.assertLess(results.a_called - self.start,
datetime.timedelta(milliseconds=5))
- self.assertTrue(results.b_called - self.start >=
+ self.assertGreaterEqual(results.b_called - self.start,
datetime.timedelta(milliseconds=5))
- self.assertTrue(results.c_called - self.start >=
+ self.assertGreaterEqual(results.c_called - self.start,
datetime.timedelta(milliseconds=10))
- self.assertTrue(results.d_called - self.start >=
+ self.assertGreaterEqual(results.d_called - self.start,
datetime.timedelta(milliseconds=15))
sockets[0].close()

With the new test that says what the difference is, it fails with:

AssertionError: datetime.timedelta(microseconds=14612) not greater than or equal to datetime.timedelta(microseconds=15000)
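For a rough feel for how much timing variance a given machine shows, here is a small standalone sketch (my own illustration, not OLA code) that asks select() to wait 15 ms the way an event loop would and prints how far the measured delay lands from the request:

import datetime
import select

# Standalone sketch (Linux): wait 15 ms with select(), the way a
# select()-based event loop would, and print how far the measured
# delay lands from the requested one.
for _ in range(5):
    start = datetime.datetime.now()
    select.select([], [], [], 0.015)
    elapsed = datetime.datetime.now() - start
    print(elapsed - datetime.timedelta(milliseconds=15))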
However, I ran the same test a couple more times, without changing anything, and it passed. Are we just dealing with system scheduler variances? |
Glad you're investigating. Though I don't know what protobuf could have to
do with either of those failures.
On Thu, Feb 18, 2021 at 1:09 PM Dan Keenan ***@***.***> wrote:
FAIL: python/ola/ClientWrapperTest.sh
=====================================
...F/usr/lib/python3.8/unittest/case.py:704: ResourceWarning: unclosed <socket.socket fd=4, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
outcome.errors.clear()
ResourceWarning: Enable tracemalloc to get the object allocation traceback
..
Glancing at the tests, the last test (testSend) doesn't have the
sockets[0].close()
sockets[1].close()
at the end. That looks like my mistake.
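As an aside, a minimal sketch of the kind of cleanup that avoids that ResourceWarning (illustrative names only, not the actual testSend body):

import socket
import unittest


class SocketCleanupExample(unittest.TestCase):
  def test_sockets_are_closed(self):
    # Close both ends via addCleanup so the sockets are released even
    # if an assertion fails partway through the test.
    sockets = socket.socketpair()
    self.addCleanup(sockets[0].close)
    self.addCleanup(sockets[1].close)
    self.assertIsNotNone(sockets[0])


if __name__ == '__main__':
  unittest.main()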
======================================================================
FAIL: testEventLoop (__main__.ClientWrapperTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "./python/ola/ClientWrapperTest.py", line 195, in testEventLoop
self.assertTrue(results.d_called - self.start >=
AssertionError: False is not true
Only way I see that happening would be a strange rounding error, but I'd
be really surprised it's repeatable. Curious what d_called and start are.
I considered putting in checks that were a bit less than the requested
delays (so 14ms here) but never saw it fail.
Bruce
|
(messages overlapped...)
I don't see how, as the check for when to call the task is explicit in the
code (it's not just a sleep):
return self._run_at < now
but I wonder if there's some issue with timer precision underlying the
python implementation giving different results in different places.
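Roughly, the pattern I mean looks like this (a simplified sketch with illustrative names, not the actual OLA event-loop code):

import datetime


class ScheduledTask(object):
  """Simplified sketch of a deadline-based task, not the actual OLA code."""

  def __init__(self, delay, callback):
    self._run_at = datetime.datetime.now() + delay
    self._callback = callback

  def has_expired(self, now):
    # The callback only becomes runnable once the wall clock has passed
    # the deadline; there is no bare sleep involved.
    return self._run_at < now

  def run(self):
    self._callback()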
I would just subtract 1 from each of the comparisons and leave it at that.
Bruce
|
I've got a VM rebuilding the project on a single core to see what happens (it was difficult to get it to fail locally on my 12-core i7). Will work on a PR that relaxes the timing slightly. |
Oops, I thought I'd raised this elsewhere and tagged @brucelowekamp for some input, but clearly not. I hit the same thing on the Arm builds on Travis, but confusingly, at some point later the commits to get them merged worked fine, so I guess a system change or package upgrade was involved in the meantime. I get this (from https://travis-ci.org/github/OpenLightingProject/ola/jobs/755992758, which is a build of #1687):
|
Relax timing on ClientWrapperTest
As @peternewman noted on #1716 (comment), tests are still failing occasionally (at least on ARM) even after #1716 was merged. Given what DMX timings can be in popular products, it's probably safe to raise the allowed timing delta from 500 µs to 1000 µs (1 ms). I suppose the real question is what resolution we want to guarantee callbacks will run with; I'm trying to avoid changing tests when perhaps those tests are pointing at a real problem.
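For illustration, the relaxed check could look something like the following sketch (made-up variable names standing in for the test's self.start and results.d_called; the real change would live in ClientWrapperTest.py):

import datetime

# Stand-ins for self.start and results.d_called from the test; the
# 14612 µs delta is the one observed in the failure above.
start = datetime.datetime.now()
d_called = start + datetime.timedelta(microseconds=14612)

# Hypothetical slack of 1 ms (1000 µs) instead of the current 500 µs.
slack = datetime.timedelta(microseconds=1000)
expected_delay = datetime.timedelta(milliseconds=15)

# With the extra slack, the observed early firing no longer trips the check.
assert d_called - start >= expected_delay - slack
|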
This has apparently been a known issue for a while, but I've narrowed down the circumstances it can happen in while working on OpenLightingProject/website#14. Warning - it's a bit involved.
When protobuf is installed from pip (i.e. pip3 install protobuf), ClientWrapperTest fails. When protobuf is installed from the system package manager (i.e. apt-get install python3-protobuf), the test passes. Checked with the latest commit in master (a95a6f7 currently) on Ubuntu 20.10 and Ubuntu 20.04. I haven't properly checked Fedora yet (only a cursory test while writing docs), but I suspect it will have the same issue for reasons detailed below.

Some speculation:
My first thought was an issue between Python 2 packages and Python 3 packages. These test systems only have Python 3, so that's not it.
Further research reveals that GNOME distros (or at least Ubuntu and Fedora) come with gnome-online-accounts, which itself depends on python3-protobuf. This means that GNOME users will likely already have protobuf for Python installed on their systems. If they install deps from pip (e.g. with pip3 install protobuf numpy), pip won't install another protobuf because it sees the system version already. This also explains why the issue occurs on headless systems (i.e. Travis): those systems won't have the preinstalled Python protobuf and will install it from pip if you ask.

But what's the difference between the two sources? My instinct is that something changed in newer protobuf releases (I know this is a regular problem here). The latest release on pip is 3.14.0. Ubuntu 20.10 will get you 3.12.3. Ubuntu 20.04 provides 3.6.1. Fedora 33 has 3.12.4. Fedora 32 has 3.11.2.
To confirm this theory, I installed protobuf 3.12.4 from pip and ran the tests, which all passed. protobuf 3.13.0 also passes. This leads me to conclude that something about protobuf 3.14 is causing problems. Caveat: I haven't tested any of this with Python 2.
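One quick way to check which protobuf copy the interpreter actually picks up (pip's vs. the distro package) is to print its version and import path:

import google.protobuf

# The version string distinguishes pip's 3.14.0 from e.g. Ubuntu 20.04's
# 3.6.1, and the module path shows whether it was loaded from pip's
# site-packages or the distro's dist-packages.
print(google.protobuf.__version__)
print(google.protobuf.__file__)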
Test log: