
Spurious error telemetry when stopping Windows generator tasks #1454

Closed
ranweiler opened this issue Nov 16, 2021 · 6 comments · Fixed by #1505
Labels
bug Something isn't working

Comments

@ranweiler
Member

Information

  • Onefuzz version: 3.2.0 (latest)
  • OS: Windows

Provide detailed reproduction steps (if any)

  1. Create a Windows generator task, such as via the radamsa basic template.
  2. Wait until the job's generic_generator fuzz task starts (i.e., enters the running state).
  3. Stop the job with onefuzz template stop.

Expected result

The job is shut down, without any error telemetry.

Actual result

We send error-level telemetry, such as:

error running task: generate inputs failed

Caused by:
    0: generator failed to start: C:\onefuzz\ee26eeba-21b1-4da7-804c-233bd831d25b\task_tools_0\radamsa.exe
    1: The media is write protected. (os error 19)

This causes spurious integration test failures.

@ranweiler ranweiler added the bug Something isn't working label Nov 16, 2021
@ghost ghost added the Needs: triage label Nov 16, 2021
@ranweiler
Member Author

Using the above steps, I've been able to repro this several times in a row.

@ranweiler
Member Author

Other errors elicited by the same repro procedure (all from a single job):

generic_analysis

error running task: poller failed

Caused by:
    0: QueueClient.pop
    1: storage queue pop failed
    2: request attempt 6 failed
    3: HTTP status client error (404 Not Found) for url (<REDACTED>)

generic_crash_report

error running task: poller failed

Caused by:
    0: QueueClient.pop
    1: storage queue pop failed
    2: request attempt 6 failed
    3: HTTP status client error (404 Not Found) for url (<REDACTED>)

generic_generator

error running task: OS Error 19 (FormatMessageW() returned error 19) (os error 19) at path "C:\\Windows\\TEMP\\.tmpr3WyJK"

@ranweiler
Member Author

The template stop CLI command is used internally in the integration tests. It is invoked indirectly here, when we stop test jobs whose checks have completed. This is set to True in the call to Run.test(), here. The inner call to the stop template command is here.

The stop template command is itself defined here. It is totally non-blocking and unsynchronized. Resources are effectively all torn down at the same time. This provides the fastest user experience, but is probably the root of the spurious errors.
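
To make the shape of the problem concrete, here is a minimal sketch of unsynchronized teardown (hypothetical names, not the actual OneFuzz code): every resource is torn down at once, and nothing waits for the running task to observe the stop request.

class Task:
    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.state = "running"

    def mark_stopping(self) -> None:
        # Records intent only; the agent discovers it later by polling.
        self.state = "stopping"


class Queue:
    def __init__(self, name: str) -> None:
        self.name = name
        self.deleted = False

    def delete(self) -> None:
        # After this, any in-flight pop from the agent returns a 404.
        self.deleted = True


def stop_job(tasks: list[Task], queues: list[Queue]) -> None:
    # Non-blocking and unsynchronized: all teardown is requested together,
    # and the call returns before any task has actually stopped.
    for task in tasks:
        task.mark_stopping()
    for queue in queues:
        queue.delete()

A still-running agent that pops its (now deleted) queue between mark_stopping() and its next command poll would see exactly the 404s reported above.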

@ranweiler
Member Author

It turns out that invoking onefuzz tasks delete <task_id> alone is sufficient to repro this bug when the task is running:

error running task: generate inputs failed

Caused by:
    0: generator failed to start: C:\onefuzz\99da0cf6-3adf-4f04-85ce-2f0e6038d1e4\task_tools_0\radamsa.exe
    1: The media is write protected. (os error 19)

@ranweiler
Member Author

Using the remaining tasks from the example above, I was also able to repro the earlier QueueClient.pop errors via onefuzz jobs delete <job_id> alone.

So it looks like we have noisy task teardown both when stopping tasks directly and when stopping them indirectly by stopping the job. The template stop command is not needed to repro.

@ranweiler
Member Author

When tasks are stopped, we grab the Node that the task is running on, here. Then, we invoke Node.stop_task().

The comment on that method is outdated. We really do "send" the StopTaskNode command immediately, to support task colocation (multiple tasks sharing a VMSS node). Here, "sending" just means saving the message to a table, which eventually gets queried by the node.
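
As a rough illustration of that pattern (hypothetical names and storage, not the real service code), "sending" a command amounts to a table write that the node only observes on its next poll:

from dataclasses import dataclass, field


@dataclass
class CommandTable:
    rows: dict[str, list[str]] = field(default_factory=dict)

    def send(self, machine_id: str, command: str) -> None:
        # "Sending" is just a write; delivery happens whenever the node asks.
        self.rows.setdefault(machine_id, []).append(command)

    def poll(self, machine_id: str) -> list[str]:
        # The node supervisor calls this periodically to pick up commands.
        return self.rows.pop(machine_id, [])


table = CommandTable()
table.send("node-1", "StopTask")  # persisted immediately...
print(table.poll("node-1"))       # ...but only delivered when the node polls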

But the problem is that the rest of task teardown doesn't seem to be synchronized with the receipt of this command, or, more importantly, with the actual task state. In particular, when we call Node.stop() in Node.stop_if_complete(), we immediately call to_reimage(). However, the node supervisor only checks for commands every 10 seconds, and apparently reimaging and Azure-level teardown are often fast enough to win the race, breaking the OS-local assumptions of the not-yet-stopped task.
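
For illustration, the race arithmetic (the 10-second poll interval is from the comment above; everything else is a hypothetical sketch):

POLL_INTERVAL = 10.0  # seconds between the node supervisor's command polls


def reimage_wins_race(reimage_delay: float, time_since_last_poll: float) -> bool:
    # The StopTask command is only seen at the next poll; reimage starts
    # immediately. If reimage completes first, the still-running task loses
    # its OS-local resources out from under it.
    time_until_next_poll = POLL_INTERVAL - time_since_last_poll
    return reimage_delay < time_until_next_poll


# Worst case: the command lands just after a poll, so the node won't look
# again for ~10 seconds, while reimage may complete in just a few:
print(reimage_wins_race(reimage_delay=4.0, time_since_last_poll=0.0))  # True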

@chkeita chkeita linked a pull request Dec 2, 2021 that will close this issue
@ghost ghost locked as resolved and limited conversation to collaborators Jan 2, 2022