
Spurious error telemetry when stopping Windows generator tasks #1454

Closed
ranweiler opened this issue Nov 16, 2021 · 6 comments · Fixed by #1505
Labels
bug Something isn't working

Comments

@ranweiler
Member

Information

  • Onefuzz version: 3.2.0 (latest)
  • OS: Windows

Provide detailed reproduction steps (if any)

  1. Create a Windows generator task, such as via the radamsa basic template.
  2. Wait until the job's generic_generator fuzz task starts (i.e., enters the running state).
  3. Stop the job with onefuzz template stop.

Expected result

The job is shut down, without any error telemetry.

Actual result

We send error-level telemetry, such as:

error running task: generate inputs failed

Caused by:
    0: generator failed to start: C:\onefuzz\ee26eeba-21b1-4da7-804c-233bd831d25b\task_tools_0\radamsa.exe
    1: The media is write protected. (os error 19)

This causes spurious integration test failures.

@ranweiler ranweiler added the bug Something isn't working label Nov 16, 2021
@ghost ghost added the Needs: triage label Nov 16, 2021
@ranweiler
Member Author

Using the above steps, I've been able to repro this several times in a row.

@ranweiler
Member Author

Other errors elicited by the same repro procedure (all from a single job):

generic_analysis

error running task: poller failed

Caused by:
    0: QueueClient.pop
    1: storage queue pop failed
    2: request attempt 6 failed
    3: HTTP status client error (404 Not Found) for url (<REDACTED>)

generic_crash_report

error running task: poller failed

Caused by:
    0: QueueClient.pop
    1: storage queue pop failed
    2: request attempt 6 failed
    3: HTTP status client error (404 Not Found) for url (<REDACTED>)

generic_generator

error running task: OS Error 19 (FormatMessageW() returned error 19) (os error 19) at path "C:\\Windows\\TEMP\\.tmpr3WyJK"

@ranweiler
Member Author

The template stop CLI command is used internally in the integration tests. It is invoked indirectly here, when we stop test jobs whose checks have completed. This is set to True in the call to Run.test(), here. The inner call to the stop template command is here.

The stop template command is itself defined here. It is totally non-blocking and unsynchronized. Resources are effectively all torn down at the same time. This provides the fastest user experience, but is probably the root of the spurious errors.
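
To make the shape of the problem concrete, here is a minimal sketch of unsynchronized teardown (hypothetical names, not the actual OneFuzz code): every resource is torn down at once, and nothing waits for the running task to observe the stop request.

class Task:
    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.state = "running"

    def mark_stopping(self) -> None:
        # Records intent only; the agent discovers it later by polling.
        self.state = "stopping"


class Queue:
    def __init__(self, name: str) -> None:
        self.name = name
        self.deleted = False

    def delete(self) -> None:
        # After this, any in-flight pop from the agent returns a 404.
        self.deleted = True


def stop_job(tasks: list[Task], queues: list[Queue]) -> None:
    # Non-blocking and unsynchronized: all teardown is requested together,
    # and the call returns before any task has actually stopped.
    for task in tasks:
        task.mark_stopping()
    for queue in queues:
        queue.delete()

A still-running agent that pops its (now deleted) queue between mark_stopping() and its next command poll would see exactly the 404s reported above.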

@ranweiler
Member Author

It turns out that invoking onefuzz tasks delete <task_id> alone is sufficient to repro this bug when the task is running:

error running task: generate inputs failed

Caused by:
    0: generator failed to start: C:\onefuzz\99da0cf6-3adf-4f04-85ce-2f0e6038d1e4\task_tools_0\radamsa.exe
    1: The media is write protected. (os error 19)

@ranweiler
Member Author

Using the remaining tasks from the example above, I was also able to repro the earlier QueueClient.pop errors via onefuzz jobs delete <job_id> alone.

So it looks like we have noisy task teardown both when stopping tasks directly and when stopping them indirectly by stopping the job. The template stop command is not needed to repro.

@ranweiler
Member Author

When tasks are stopped, we grab the Node that the task is running on, here. Then, we invoke Node.stop_task().

The comment on that method is outdated. We really do "send" the StopTaskNode command immediately, to support task colocation (multiple tasks sharing a VMSS node). Here, "sending" just means saving the message to a table, which eventually gets queried by the node.
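
As a rough illustration of that pattern (hypothetical names and storage, not the real service code), "sending" a command amounts to a table write that the node only observes on its next poll:

from dataclasses import dataclass, field


@dataclass
class CommandTable:
    rows: dict[str, list[str]] = field(default_factory=dict)

    def send(self, machine_id: str, command: str) -> None:
        # "Sending" is just a write; delivery happens whenever the node asks.
        self.rows.setdefault(machine_id, []).append(command)

    def poll(self, machine_id: str) -> list[str]:
        # The node supervisor calls this periodically to pick up commands.
        return self.rows.pop(machine_id, [])


table = CommandTable()
table.send("node-1", "StopTask")  # persisted immediately...
print(table.poll("node-1"))       # ...but only delivered when the node polls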

But the problem is that the rest of task teardown doesn't seem to be synchronized with the receipt of this command, or, more importantly, with the actual task state. In particular, when we call Node.stop() in Node.stop_if_complete(), we immediately call to_reimage(). However, the node supervisor only checks for commands every 10 seconds, and apparently reimaging and Azure-level teardown are often fast enough to win the race, breaking the OS-local assumptions of the not-yet-stopped task.
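
For illustration, the race arithmetic (the 10-second poll interval is from the comment above; everything else is a hypothetical sketch):

POLL_INTERVAL = 10.0  # seconds between the node supervisor's command polls


def reimage_wins_race(reimage_delay: float, time_since_last_poll: float) -> bool:
    # The StopTask command is only seen at the next poll; reimage starts
    # immediately. If reimage completes first, the still-running task loses
    # its OS-local resources out from under it.
    time_until_next_poll = POLL_INTERVAL - time_since_last_poll
    return reimage_delay < time_until_next_poll


# Worst case: the command lands just after a poll, so the node won't look
# again for ~10 seconds, while reimage may complete in just a few:
print(reimage_wins_race(reimage_delay=4.0, time_since_last_poll=0.0))  # True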

@chkeita chkeita linked a pull request Dec 2, 2021 that will close this issue
@ghost ghost locked as resolved and limited conversation to collaborators Jan 2, 2022