This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

Fail fast if managed task workers are near-OOM #1657

Merged
merged 21 commits into from
Mar 1, 2022

Conversation

ranweiler
Member

@ranweiler ranweiler commented Feb 15, 2022

Summary

  • Add onefuzz::memory::available_bytes() to enable checking system-wide memory usage
  • In managed task worker runs, heuristically check for imminent OOM conditions and try to exit early
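As an illustration of the second point, a near-OOM check can be sketched as a threshold comparison on sampled available memory. This is a minimal sketch, not the code merged in this PR: the names `check_near_oom` and `watch_memory`, the threshold parameter, and the bounded polling loop are all hypothetical, and the sampler is passed in as a closure so that no particular signature is assumed for `onefuzz::memory::available_bytes()`.

```rust
/// Hypothetical helper: returns an error if `available_bytes` has dropped
/// below `min_bytes`, so the caller can fail the task early with a clear reason.
fn check_near_oom(available_bytes: u64, min_bytes: u64) -> Result<(), String> {
    if available_bytes < min_bytes {
        Err(format!(
            "near-OOM: {} bytes available, below minimum of {} bytes",
            available_bytes, min_bytes
        ))
    } else {
        Ok(())
    }
}

/// Hypothetical watchdog loop: calls `sample` up to `max_polls` times and
/// stops at the first reading that falls below `min_bytes`.
/// A real worker would sleep between polls and run alongside the task.
fn watch_memory<F>(mut sample: F, min_bytes: u64, max_polls: usize) -> Result<(), String>
where
    F: FnMut() -> u64,
{
    for _ in 0..max_polls {
        check_near_oom(sample(), min_bytes)?;
    }
    Ok(())
}
```

Taking the sampler as a closure keeps the heuristic testable with canned readings, independent of the platform-specific memory query.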

Testing

  • Added unit tests for limited parsing of /proc/meminfo on Linux
  • Memory querying functionally tested locally on both OSes using the memory example binary in the onefuzz crate
  • Near-OOM check in onefuzz-agent functionally tested by forcing external on-VM OOM
    • Linux
    • Windows
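For context on the `/proc/meminfo` parsing mentioned above, here is a rough sketch of what a limited parser might look like. It assumes the `MemAvailable` field is the one of interest, which this description does not state, and `parse_mem_available_bytes` is a hypothetical name rather than the function added in this PR. The kernel reports these values in units labeled `kB`, which are 1024-byte units.

```rust
/// Hypothetical parser: extracts `MemAvailable` from the text of
/// `/proc/meminfo` and converts it from kB (1024-byte units) to bytes.
/// Returns `None` if the field is missing or malformed.
fn parse_mem_available_bytes(meminfo: &str) -> Option<u64> {
    for line in meminfo.lines() {
        if let Some(rest) = line.strip_prefix("MemAvailable:") {
            let mut parts = rest.split_whitespace();
            // First token is the numeric value, second is the unit.
            let value: u64 = parts.next()?.parse().ok()?;
            if parts.next()? != "kB" {
                return None;
            }
            // Guard against overflow when converting to bytes.
            return value.checked_mul(1024);
        }
    }
    None
}
```

Operating on a `&str` rather than reading the file directly makes the parser easy to unit test with fixed input, and callers on Linux can pass it the output of `std::fs::read_to_string("/proc/meminfo")`.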

Tested on Windows by using notmyfault from Sysinternals to leak from the paged memory pool, on a node where a healthy task was in the running state. On Linux, disabled the OOM killer, set overcommit to always-allow, and ran a small custom program that rapidly leaks heap memory. Checked for task failure with a reason matching the new error message, and checked telemetry.

Closes #1633.

@ranweiler ranweiler changed the title from "Memory watchdog" to "Fail fast if managed task workers are near-OOM" Feb 15, 2022
@ranweiler ranweiler requested a review from chkeita February 15, 2022 20:22
@ranweiler ranweiler merged commit 1b01981 into microsoft:main Mar 1, 2022
@ranweiler ranweiler deleted the memory-watchdog branch March 1, 2022 05:36
@ghost ghost locked as resolved and limited conversation to collaborators Mar 31, 2022
Development

Successfully merging this pull request may close these issues.

Excessive fuzzer memory usage can cause confusing task failures