Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

random test failures #3736

Open
1 task done
brianmay opened this issue Mar 12, 2024 · 20 comments · Fixed by #3738
Open
1 task done

random test failures #3736

brianmay opened this issue Mar 12, 2024 · 20 comments · Fixed by #3738
Labels
github_actions Pull requests that update Github_actions code note:needs investigation The issue must be investigated first

Comments

@brianmay
Copy link
Collaborator

brianmay commented Mar 12, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Noticed lots of test failures when running tests locally. The found I have to start tests with clean database. Oops.

But even then, was getting errors - up to 7. Then I found if I ran test files individually, was more like to work.

At a hunch I made the following change:

diff --git a/test/teslamate/vehicles/vehicle/charging_test.exs b/test/teslamate/vehicles/vehicle/charging_test.exs
index 106d64df..26fb4f65 100644
--- a/test/teslamate/vehicles/vehicle/charging_test.exs
+++ b/test/teslamate/vehicles/vehicle/charging_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.ChargingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary
   alias TeslaMate.Log.ChargingProcess
diff --git a/test/teslamate/vehicles/vehicle/streaming_test.exs b/test/teslamate/vehicles/vehicle/streaming_test.exs
index 20f54723..c43db381 100644
--- a/test/teslamate/vehicles/vehicle/streaming_test.exs
+++ b/test/teslamate/vehicles/vehicle/streaming_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.StreamingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   import ExUnit.CaptureLog

diff --git a/test/teslamate/vehicles/vehicle/suspend_logging_test.exs b/test/teslamate/vehicles/vehicle/suspend_logging_test.exs
index ae3d6677..f282debe 100644
--- a/test/teslamate/vehicles/vehicle/suspend_logging_test.exs
+++ b/test/teslamate/vehicles/vehicle/suspend_logging_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.SuspendLoggingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary
   alias TeslaMate.Vehicles.Vehicle
diff --git a/test/teslamate/vehicles/vehicle/suspend_test.exs b/test/teslamate/vehicles/vehicle/suspend_test.exs
index b89ee958..d8561f03 100644
--- a/test/teslamate/vehicles/vehicle/suspend_test.exs
+++ b/test/teslamate/vehicles/vehicle/suspend_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.SuspendTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary

Now all the tests pass for me. I suspect we have race conditions with multiple tests interfering with the database at the same time. Yuck.

Not sure I like this, but I think as our tests access the database they are more like integration tests not unit tests, and as a result they really do need to be run sequentially.

Expected Behavior

Tests pass without failures.

Steps To Reproduce

MIX_ENV=test mix test

Relevant log output

(after fix applied)
$ MIX_ENV=test mix test
....................................................................................................................................................................................................................................................................................................................................
Finished in 23.3 seconds (4.4s async, 18.8s sync)
324 tests, 0 failures

Randomized with seed 392195

Screenshots

No response

Additional data

No response

Type of installation

Docker

Version

master

@brianmay
Copy link
Collaborator Author

This still doesn't fix all the intermittent failures, but I now only see 1, not up to 7:

  1) test resumes logging when starting a drive (TeslaMate.Vehicles.Vehicle.StreamingTest)
     test/teslamate/vehicles/vehicle/streaming_test.exs:452
     Assertion failed, no matching message after 300ms
     The following variables were pinned:
       car = %TeslaMate.Log.Car{__meta__: #Ecto.Schema.Metadata<:built, "cars">, id: 55174, name: nil, efficiency: nil, model: "3", trim_badging: nil, marketing_name: nil, exterior_color: nil, wheel_type: nil, spoiler_type: nil, eid: 0, vid: 1000, vin: "1000", settings_id: nil, settings: %TeslaMate.Settings.CarSettings{__meta__: #Ecto.Schema.Metadata<:built, "car_settings">, id: nil, suspend_min: 999999999, suspend_after_idle_min: 15, req_not_unlocked: true, free_supercharging: false, use_streaming_api: true, car: #Ecto.Association.NotLoaded<association :car is not loaded>}, charging_processes: #Ecto.Association.NotLoaded<association :charging_processes is not loaded>, positions: #Ecto.Association.NotLoaded<association :positions is not loaded>, drives: #Ecto.Association.NotLoaded<association :drives is not loaded>, inserted_at: nil, updated_at: nil}
     The process mailbox is empty.
     code: assert_receive {:start_drive, ^car}
     stacktrace:
       test/teslamate/vehicles/vehicle/streaming_test.exs:478: (test)

@brianmay
Copy link
Collaborator Author

Just saw 2 more test failures.

I think the async flag needs to be turned off for all the tests in this directory.

diff --git a/test/teslamate/vehicles/vehicle/charging_test.exs b/test/teslamate/vehicles/vehicle/charging_test.exs
index 106d64df..26fb4f65 100644
--- a/test/teslamate/vehicles/vehicle/charging_test.exs
+++ b/test/teslamate/vehicles/vehicle/charging_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.ChargingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary
   alias TeslaMate.Log.ChargingProcess
diff --git a/test/teslamate/vehicles/vehicle/driving_test.exs b/test/teslamate/vehicles/vehicle/driving_test.exs
index 11ac512e..38078ccc 100644
--- a/test/teslamate/vehicles/vehicle/driving_test.exs
+++ b/test/teslamate/vehicles/vehicle/driving_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.DrivingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary
   alias TeslaMate.Log.Drive
diff --git a/test/teslamate/vehicles/vehicle/streaming_test.exs b/test/teslamate/vehicles/vehicle/streaming_test.exs
index 20f54723..c43db381 100644
--- a/test/teslamate/vehicles/vehicle/streaming_test.exs
+++ b/test/teslamate/vehicles/vehicle/streaming_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.StreamingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   import ExUnit.CaptureLog

diff --git a/test/teslamate/vehicles/vehicle/suspend_logging_test.exs b/test/teslamate/vehicles/vehicle/suspend_logging_test.exs
index ae3d6677..f282debe 100644
--- a/test/teslamate/vehicles/vehicle/suspend_logging_test.exs
+++ b/test/teslamate/vehicles/vehicle/suspend_logging_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.SuspendLoggingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary
   alias TeslaMate.Vehicles.Vehicle
diff --git a/test/teslamate/vehicles/vehicle/suspend_test.exs b/test/teslamate/vehicles/vehicle/suspend_test.exs
index b89ee958..d8561f03 100644
--- a/test/teslamate/vehicles/vehicle/suspend_test.exs
+++ b/test/teslamate/vehicles/vehicle/suspend_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.SuspendTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary

diff --git a/test/teslamate/vehicles/vehicle/updating_test.exs b/test/teslamate/vehicles/vehicle/updating_test.exs
index d26cefb5..fc1a8fd6 100644
--- a/test/teslamate/vehicles/vehicle/updating_test.exs
+++ b/test/teslamate/vehicles/vehicle/updating_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.UpdatingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary

@brianmay
Copy link
Collaborator Author

Still getting the above error, but only rarely, but appears to be much better now.

Sometimes tests come up with the following message:

14:27:33.383 [error] Postgrex.Protocol (#PID<0.411.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.19679.0> exited

But this doesn't seem to matter.

@JakobLichterfeld
Copy link
Collaborator

Nice catch. Unreliable tests are a nightmare. I do see the asynchronous run approach to match reality therefore vote to turn asnyc off.

@JakobLichterfeld
Copy link
Collaborator

I pinned our CI to elixir-1.16.1-otp-26 as well and the tests run perfectly again, see #3733

@JakobLichterfeld JakobLichterfeld added note:needs investigation The issue must be investigated first github_actions Pull requests that update Github_actions code labels Mar 12, 2024
@JakobLichterfeld
Copy link
Collaborator

But came back when running on master, so unreliable with async test run.

@brianmay
Copy link
Collaborator Author

brianmay commented Mar 12, 2024

Yes, I upgraded my box to elixir 1.16, and I still get lots of test failures.

For some reason CI is more likely to have passing tests. Tests on my box never work without turning async off. Wonder if this is because the multi-tasking performance of my box is superior.

@brianmay
Copy link
Collaborator Author

brianmay commented Mar 12, 2024

Except this one test, I sometimes get failures here still:

  1) test resumes logging when starting a drive (TeslaMate.Vehicles.Vehicle.StreamingTest)
     test/teslamate/vehicles/vehicle/streaming_test.exs:452
     Assertion failed, no matching message after 300ms
     The following variables were pinned:
       car = %TeslaMate.Log.Car{__meta__: #Ecto.Schema.Metadata<:built, "cars">, id: 11376, name: nil, efficiency: nil, model: "3", trim_badging: nil, marketing_name: nil, exterior_color: nil, wheel_type: nil, spoiler_type: nil, eid: 0, vid: 1000, vin: "1000", settings_id: nil, settings: %TeslaMate.Settings.CarSettings{__meta__: #Ecto.Schema.Metadata<:built, "car_settings">, id: nil, suspend_min: 999999999, suspend_after_idle_min: 15, req_not_unlocked: true, free_supercharging: false, use_streaming_api: true, car: #Ecto.Association.NotLoaded<association :car is not loaded>}, charging_processes: #Ecto.Association.NotLoaded<association :charging_processes is not loaded>, positions: #Ecto.Association.NotLoaded<association :positions is not loaded>, drives: #Ecto.Association.NotLoaded<association :drives is not loaded>, inserted_at: nil, updated_at: nil}
     The process mailbox is empty.
     code: assert_receive {:start_drive, ^car}
     stacktrace:
       test/teslamate/vehicles/vehicle/streaming_test.exs:478: (test)

But it is always this one test, so wondering if that test is somehow broken or racey.

@brianmay
Copy link
Collaborator Author

Wonder if that timeout is just too short. Would hope that 300ms is plenty. But will try to reproduce with a timeout of 1000ms.

Might be hard to verify, a successful run doesn't mean anything.

@brianmay
Copy link
Collaborator Author

brianmay commented Mar 12, 2024

New theory: maybe tests are actually failing due to assert_receive timing out. This might explain "Assertion failed, no matching message after 300ms" message, but not the one "Unexpectedly received message" message.

But don't particularly want to update each and every assert_receive and refute_receive line by hand. If only we could set this globally. Oh, wait, we can.

ELIXIR_ASSERT_TIMEOUT=2000 mix test

But failures are persisting. Got 3-4 failures. And messages are showing new timeouts.

Just to be sure, I made the following change:

diff --git a/test/test_helper.exs b/test/test_helper.exs
index 58b9a493..3e6227c3 100644
--- a/test/test_helper.exs
+++ b/test/test_helper.exs
@@ -13,4 +13,4 @@ TeslaMate.Repo.start_link()
 Phoenix.PubSub.Supervisor.start_link(name: TeslaMate.PubSub)

 assert_timeout = String.to_integer(System.get_env("ELIXIR_ASSERT_TIMEOUT") || "300")
-ExUnit.start(assert_receive_timeout: assert_timeout)
+ExUnit.start(assert_receive_timeout: assert_timeout, refute_receive_timeout: assert_timeout)

As it bothered me that we set one, but not the other.

ELIXIR_ASSERT_TIMEOUT=2000 mix test

Now got 51+ test failures. Oops. I guess that was a bad change to make. But I think it does show that our tests are very timing sensitive, and perhaps this is not a good thing.

I don't think any of this helps, but documenting in case it gives anyone ideas.

@brianmay
Copy link
Collaborator Author

I rewrote the above text when I realised that increasing refute_receive_timeout increases the number of test failures.

@JakobLichterfeld
Copy link
Collaborator

But I think it does show that our tests are very timing sensitive, and perhaps this is not a good thing.

Total agree. Before looking deeper into it, I always thought our tests were great.

@brianmay
Copy link
Collaborator Author

To be fair, writing effective and efficient tests when you have lots of processes interacting like this is... hard.

Not really what is considered good practise in fact. I just know how to criticise bad practise :-)

brianmay added a commit that referenced this issue Mar 14, 2024
It appears to fix most of the random test errors.

Fixes: #3736
brianmay added a commit that referenced this issue Mar 14, 2024
It appears to fix most of the random test errors.

Fixes: #3736
@brianmay brianmay changed the title lots of test failures - suspect race conditions running tests in parallel lots of test failures Mar 14, 2024
@brianmay
Copy link
Collaborator Author

Still getting some errors. I think a lot less though then before.

This one I just saw is weird: https://github.com/teslamate-org/teslamate/actions/runs/8288206183/job/22682287560?pr=3485

** (UndefinedFunctionError) function System.convert_time_unit/3 is undefined (module System is not available)
    (elixir 1.16.1) System.convert_time_unit(-576460[71](https://github.com/teslamate-org/teslamate/actions/runs/8288206183/job/22682287560?pr=3485#step:10:72)9104225362, :native, :millisecond)
    (db_connection 2.6.0) lib/db_connection/connection_pool.ex:357: DBConnection.ConnectionPool.start_idle/2
** (EXIT from #PID<0.99.0>) shutdown

    (db_connection 2.6.0) lib/db_connection/connection_pool.ex:205: DBConnection.ConnectionPool.drop_idle/6
    (stdlib 5.2.1) gen_server.erl:1095: :gen_server.try_handle_info/3
    (stdlib 5.2.1) gen_server.erl:1183: :gen_server.handle_msg/6
    (stdlib 5.2.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Last message: {:timeout, #Reference<0.821106270.461373444.45001>, -576460[72](https://github.com/teslamate-org/teslamate/actions/runs/8288206183/job/22682287560?pr=3485#step:10:73)0104900933}

Either the function exists or it doesn't exist. It can't just not exist occasionally...

@brianmay brianmay reopened this Mar 14, 2024
@brianmay
Copy link
Collaborator Author

As expected, rebuild solved that.

@brianmay brianmay changed the title lots of test failures random test failures Mar 14, 2024
@brianmay
Copy link
Collaborator Author

In case it is not clear, I am still seeing one test failure. Occasionally. Not easy to reproduce on demand however.

@JakobLichterfeld
Copy link
Collaborator

On CI I do not see any failures

@brianmay
Copy link
Collaborator Author

I think I saw this failure recently on CI. But can't find it right now.

@brianmay
Copy link
Collaborator Author

brianmay commented Apr 1, 2024

This is the random failure:

  1) test resumes logging when starting a drive (TeslaMate.Vehicles.Vehicle.StreamingTest)
     test/teslamate/vehicles/vehicle/streaming_test.exs:452
     Assertion failed, no matching message after 300ms
     The following variables were pinned:
       car = %TeslaMate.Log.Car{__meta__: #Ecto.Schema.Metadata<:built, "cars">, id: 49656, name: nil, efficiency: nil, model: "3", trim_badging: nil, marketing_name: nil, exterior_color: nil, wheel_type: nil, spoiler_type: nil, eid: 0, vid: 1000, vin: "1000", settings_id: nil, settings: %TeslaMate.Settings.CarSettings{__meta__: #Ecto.Schema.Metadata<:built, "car_settings">, id: nil, suspend_min: 999999999, suspend_after_idle_min: 15, req_not_unlocked: true, free_supercharging: false, use_streaming_api: true, car: #Ecto.Association.NotLoaded<association :car is not loaded>}, charging_processes: #Ecto.Association.NotLoaded<association :charging_processes is not loaded>, positions: #Ecto.Association.NotLoaded<association :positions is not loaded>, drives: #Ecto.Association.NotLoaded<association :drives is not loaded>, inserted_at: nil, updated_at: nil}
     The process mailbox is empty.
     code: assert_receive {:start_drive, ^car}
     stacktrace:
       test/teslamate/vehicles/vehicle/streaming_test.exs:478: (test)

@JakobLichterfeld
Copy link
Collaborator

JakobLichterfeld commented Apr 2, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
github_actions Pull requests that update Github_actions code note:needs investigation The issue must be investigated first
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants