Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix and improve the reporting of errors from timeouts and crashes #82

Merged
merged 15 commits into from
Jun 21, 2023

Conversation

nickrobinson251
Copy link
Collaborator

@nickrobinson251 nickrobinson251 commented Jun 14, 2023

  1. Fix the recording of timeouts and worker crashes, so that every attempt is recorded
    • previously, we would only record an error if on timeout/crash if it was the last attempt, and so that testitem had failed overall. this means that for a testitem that timedout/crashed then passed on retry, there would be no record of the timeout/crash in the generated XML report
    • now all attempts, including timeout/crash, are recorded
    • To fix this, i refactored the code for handling retries so that it was separated from the code that handles the recording of an error, to reduce the paths through the code and ensure recording only needs to be handled in one place, independent of if we're retrying.
    • The reporting is tested using reference tests, so please review the new additions in the committed test/reference/ JUnit XML reports
    • hopefully this fixes https://relationalai.atlassian.net/browse/RAI-13829
  2. Improve the error message reported for a timeout or worker crash
    • Previously, we would log (to stderr) a timeout or worker crash, but the test error we recorded (and which would appear in the XML report) just said "test item \"foo\" didn't succeed"
    • now it says either "Worker aborted evaluating test item \"foo\"" or "Timed out after XmYs evaluating test item \"foo\""
    • fix https://relationalai.atlassian.net/browse/RAI-13675

@nickrobinson251 nickrobinson251 force-pushed the npr-better-worker-killed-test-errors branch from 18e5a5b to 7c8a050 Compare June 20, 2023 14:01
@nickrobinson251 nickrobinson251 force-pushed the npr-better-worker-killed-test-errors branch from bd02082 to cf72317 Compare June 21, 2023 13:45
@nickrobinson251 nickrobinson251 changed the title WIP: Better test errors from timeouts and crashes Fix and improve the reporting of errors from timeouts and crashes Jun 21, 2023
@nickrobinson251
Copy link
Collaborator Author

writing tests for timeouts is pretty tricky

i added tests to test/testfiles/_retry_tests.jl, but have to rely on writing out files to disk for tracking how many times we've run

@nickrobinson251 nickrobinson251 marked this pull request as ready for review June 21, 2023 14:59
<property name="dd_tags[perf.eval_time]" value="30.008791499"></property>
</properties>
</testcase>
<testcase name="Timeout first, pass after" timestamp="2023-06-21T12:12:48.263" time="30.029" tests="1" skipped="0" failures="0" errors="0">
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since timeouts are not supported with nworkers=0 (#87), the "timeout tests" both run once and pass once (hence we have the addition here of an entry each for testcase name="Timeout always" and testcase name="Timeout first, pass after")

Timed out after 5s evaluating test item "Timeout always" (run=2)
</error>
</testcase>
<testcase name="Timeout always" timestamp="2023-06-21T13:40:26.634" time="0.0" tests="1" skipped="0" failures="0" errors="1">
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 3 runs of Timeout always should be here, i.e. there should be 3 <testcase name="Timeout always" entries

Timed out after 5s evaluating test item "Timeout first, pass after" (run=1)
</error>
</testcase>
<testcase name="Timeout first, pass after" timestamp="2023-06-21T13:40:33.616" time="0.062" tests="1" skipped="0" failures="0" errors="0">
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be 2 <testcase name="Timeout first, pass after" (one timeout, then one pass)

Recording test error."
record_worker_terminated!(testitem, run_number)
elseif e isa TestSetFailure
# We already printed the error and recorded the testset.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where do we print the error? It seems we previously had a block that was removed above that did println(DEFAULT_STDOUT[]). were we double printing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we had

            if !(e isa TestSetFailure)
                println(DEFAULT_STDOUT[])
                # Explicitly show captured logs or say there weren't any in case we're about
                # to terminte the worker
                _print_captured_logs(DEFAULT_STDOUT[], testitem, nretries + 1)
            end

but that moved into the if e isa TimeoutException block (since it doesn't actually apply in the worker crash case i don't think? but maybe it should be in that branch too?)

            if e isa TimeoutException
                @debugv 1 "Test item $(repr(testitem.name)) timed out. Terminating worker $worker"
                # Explicitly show captured logs or say there weren't any before terminating worker
                println(DEFAULT_STDOUT[])
                _print_captured_logs(DEFAULT_STDOUT[], testitem, run_number)
                terminate!(worker)
                wait(worker)
                ...

For the e isa TestSetFailure case, we only go to this whole catch block when we call throw_if_failed, and we have already called print_errors_and_captured_logs before that
Does this addition to the comment help?

Suggested change
# We already printed the error and recorded the testset.
# We already printed the error and recorded the testset before calling `throw_if_failed`.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still want to try and print in the worker crash case, right @Drvi? like, ideally we get the segfault backtrace printed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah, makes sense -- i was misled by the in case we're about to terminate the worker comment, which i thought meant these logs wouldn't exist if the worker had already terminated, but that's not the case (idk what that comment was meaning to say)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, did a little refactor again so that everything related to the simple test failures is handled above, rather than purposefully throwing and then handling some stuff down here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in case we're about to terminate the worker

I don't remember, but I think that at some point we just wanted to be extra sure we didn't lose any logs from dead workers, so we explicitly print "No Captured Logs" to signal that the reporting process didn't fail.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; merge away!

Copy link
Collaborator

@quinnj quinnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall great cleanup. I left one question that I wasn't sure I followed.

@nickrobinson251 nickrobinson251 enabled auto-merge (squash) June 21, 2023 16:19
@nickrobinson251 nickrobinson251 merged commit 0859990 into main Jun 21, 2023
@nickrobinson251 nickrobinson251 deleted the npr-better-worker-killed-test-errors branch June 21, 2023 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants