Fix and improve the reporting of errors from timeouts and crashes #82

nickrobinson251 · 2023-06-14T14:07:04Z

Fix the recording of timeouts and worker crashes, so that every attempt is recorded
- previously, we would only record an error if on timeout/crash if it was the last attempt, and so that testitem had failed overall. this means that for a testitem that timedout/crashed then passed on retry, there would be no record of the timeout/crash in the generated XML report
- now all attempts, including timeout/crash, are recorded
- To fix this, i refactored the code for handling retries so that it was separated from the code that handles the recording of an error, to reduce the paths through the code and ensure recording only needs to be handled in one place, independent of if we're retrying.
- The reporting is tested using reference tests, so please review the new additions in the committed test/reference/ JUnit XML reports
- hopefully this fixes https://relationalai.atlassian.net/browse/RAI-13829
Improve the error message reported for a timeout or worker crash
- Previously, we would log (to stderr) a timeout or worker crash, but the test error we recorded (and which would appear in the XML report) just said "test item \"foo\" didn't succeed"
- now it says either "Worker aborted evaluating test item \"foo\"" or "Timed out after XmYs evaluating test item \"foo\""
- fix https://relationalai.atlassian.net/browse/RAI-13675

nickrobinson251 · 2023-06-21T14:55:05Z

writing tests for timeouts is pretty tricky

i added tests to test/testfiles/_retry_tests.jl, but have to rely on writing out files to disk for tracking how many times we've run

nickrobinson251 · 2023-06-21T15:11:27Z

test/references/retry_tests_report_0worker.xml

+		<property name="dd_tags[perf.eval_time]" value="30.008791499"></property>
+		</properties>
+	</testcase>
+	<testcase name="Timeout first, pass after" timestamp="2023-06-21T12:12:48.263" time="30.029" tests="1" skipped="0" failures="0" errors="0">


since timeouts are not supported with nworkers=0 (#87), the "timeout tests" both run once and pass once (hence we have the addition here of an entry each for testcase name="Timeout always" and testcase name="Timeout first, pass after")

nickrobinson251 · 2023-06-21T15:12:47Z

test/references/retry_tests_report_1worker.xml

+  Timed out after 5s evaluating test item "Timeout always" (run=2)
+		</error>
+	</testcase>
+	<testcase name="Timeout always" timestamp="2023-06-21T13:40:26.634" time="0.0" tests="1" skipped="0" failures="0" errors="1">


All 3 runs of Timeout always should be here, i.e. there should be 3 <testcase name="Timeout always" entries

nickrobinson251 · 2023-06-21T15:13:05Z

test/references/retry_tests_report_1worker.xml

+  Timed out after 5s evaluating test item "Timeout first, pass after" (run=1)
+		</error>
+	</testcase>
+	<testcase name="Timeout first, pass after" timestamp="2023-06-21T13:40:33.616" time="0.062" tests="1" skipped="0" failures="0" errors="0">


there should be 2 <testcase name="Timeout first, pass after" (one timeout, then one pass)

quinnj · 2023-06-21T15:13:42Z

src/ReTestItems.jl

+                    Recording test error."
+                record_worker_terminated!(testitem, run_number)
+            elseif e isa TestSetFailure
+                # We already printed the error and recorded the testset.


where do we print the error? It seems we previously had a block that was removed above that did println(DEFAULT_STDOUT[]). were we double printing?

we had

if !(e isa TestSetFailure) println(DEFAULT_STDOUT[]) # Explicitly show captured logs or say there weren't any in case we're about # to terminte the worker _print_captured_logs(DEFAULT_STDOUT[], testitem, nretries + 1) end

but that moved into the if e isa TimeoutException block (since it doesn't actually apply in the worker crash case i don't think? but maybe it should be in that branch too?)

if e isa TimeoutException @debugv 1 "Test item $(repr(testitem.name)) timed out. Terminating worker $worker" # Explicitly show captured logs or say there weren't any before terminating worker println(DEFAULT_STDOUT[]) _print_captured_logs(DEFAULT_STDOUT[], testitem, run_number) terminate!(worker) wait(worker) ...

For the e isa TestSetFailure case, we only go to this whole catch block when we call throw_if_failed, and we have already called print_errors_and_captured_logs before that
Does this addition to the comment help?

Suggested change

# We already printed the error and recorded the testset.

# We already printed the error and recorded the testset before calling `throw_if_failed`.

I think we still want to try and print in the worker crash case, right @Drvi? like, ideally we get the segfault backtrace printed

ah yeah, makes sense -- i was misled by the in case we're about to terminate the worker comment, which i thought meant these logs wouldn't exist if the worker had already terminated, but that's not the case (idk what that comment was meaning to say)

okay, did a little refactor again so that everything related to the simple test failures is handled above, rather than purposefully throwing and then handling some stuff down here

in case we're about to terminate the worker

I don't remember, but I think that at some point we just wanted to be extra sure we didn't lose any logs from dead workers, so we explicitly print "No Captured Logs" to signal that the reporting process didn't fail.

LGTM; merge away!

quinnj

Overall great cleanup. I left one question that I wasn't sure I followed.

…or crash

test/integrationtests.jl

Test we retry on timeouts

cc36975

nickrobinson251 force-pushed the npr-better-worker-killed-test-errors branch from 18e5a5b to 7c8a050 Compare June 20, 2023 14:01

nickrobinson251 added 4 commits June 20, 2023 18:33

Refactor test-error handling to always record the error

4c6081a

Record more informative error: include timeout/abort info

93e9a9f

Update reference tests

0c07b6e

Test case where we timeout then pass on retry

cf72317

nickrobinson251 force-pushed the npr-better-worker-killed-test-errors branch from bd02082 to cf72317 Compare June 21, 2023 13:45

nickrobinson251 added 3 commits June 21, 2023 15:05

fixup! Test case where we timeout then pass on retry

b88fe42

fixup! Test case where we timeout then pass on retry

b4da580

Reduce time of test so worker=0 case isn't so slow

d939a56

nickrobinson251 changed the title ~~WIP: Better test errors from timeouts and crashes~~ Fix and improve the reporting of errors from timeouts and crashes Jun 21, 2023

nickrobinson251 requested review from quinnj, Drvi and NHDaly June 21, 2023 14:58

nickrobinson251 marked this pull request as ready for review June 21, 2023 14:59

nickrobinson251 commented Jun 21, 2023

View reviewed changes

quinnj reviewed Jun 21, 2023

View reviewed changes

quinnj approved these changes Jun 21, 2023

View reviewed changes

nickrobinson251 added 4 commits June 21, 2023 16:36

Handle test failure case differently to worker timeout or crash

1052948

Bump version

870b363

fixup! Handle test failure case differently to worker timeout or crash

1babb10

fixup! fixup! Handle test failure case differently to worker timeout …

f66565d

…or crash

nickrobinson251 commented Jun 21, 2023

View reviewed changes

test/integrationtests.jl Outdated Show resolved Hide resolved

nickrobinson251 added 2 commits June 21, 2023 16:49

Update test/integrationtests.jl - typo

71c483f

Don't use tempdir()

0e6253f

nickrobinson251 enabled auto-merge (squash) June 21, 2023 16:19

fixup! Don't use tempdir()

e8d40d2

nickrobinson251 merged commit 0859990 into main Jun 21, 2023

nickrobinson251 deleted the npr-better-worker-killed-test-errors branch June 21, 2023 16:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix and improve the reporting of errors from timeouts and crashes #82

Fix and improve the reporting of errors from timeouts and crashes #82

nickrobinson251 commented Jun 14, 2023 •

edited

Loading

nickrobinson251 commented Jun 21, 2023

nickrobinson251 Jun 21, 2023

nickrobinson251 Jun 21, 2023

nickrobinson251 Jun 21, 2023

quinnj Jun 21, 2023

nickrobinson251 Jun 21, 2023

quinnj Jun 21, 2023

nickrobinson251 Jun 21, 2023

nickrobinson251 Jun 21, 2023

Drvi Jun 21, 2023

quinnj Jun 21, 2023

quinnj left a comment

	# We already printed the error and recorded the testset.
	# We already printed the error and recorded the testset before calling `throw_if_failed`.

Fix and improve the reporting of errors from timeouts and crashes #82

Fix and improve the reporting of errors from timeouts and crashes #82

Conversation

nickrobinson251 commented Jun 14, 2023 • edited Loading

nickrobinson251 commented Jun 21, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

quinnj left a comment

Choose a reason for hiding this comment

nickrobinson251 commented Jun 14, 2023 •

edited

Loading