
Backport of snapshot restore-from-archive streaming and filtering into release/1.3.x #14243

hc-github-team-nomad-core

Backport

This PR is auto-generated from #13658 to be assessed for backporting due to the inclusion of the label backport/1.3.x.

The below text is copied from the body of the original PR.


This changeset implements two improvements to restoring FSM snapshots from archives:

  • The existing implementation decompresses the archive to a temporary file before reading it into the FSM. For large snapshots this performs a lot of disk IO. Instead, stream-decompress the snapshot as we read it, without first writing to a temporary file. This also moves some of the work to a second core.
  • Add bexpr filters to the RestoreFromArchive helper. The operator can pass these as -filter arguments to nomad operator snapshot state (and other commands in the future) to include only the desired data when reading the snapshot.
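
The first bullet can be illustrated with a minimal sketch. It is not the code from this changeset: it assumes the archive is gzip-compressed, and the names decompressStream and roundTrip are hypothetical helpers made up for this example. The decompression runs in its own goroutine and feeds the consumer through an io.Pipe, so no temporary file is written and the two halves of the work can run on separate cores.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// decompressStream (hypothetical helper) returns a reader over the
// decompressed bytes of src. Decompression happens in a separate goroutine,
// so it overlaps with whatever the consumer does with the data. Nothing is
// written to disk. Closing the returned reader stops the goroutine.
func decompressStream(src io.Reader) *io.PipeReader {
	pr, pw := io.Pipe()
	go func() {
		gz, err := gzip.NewReader(src)
		if err != nil {
			pw.CloseWithError(err)
			return
		}
		defer gz.Close()
		_, err = io.Copy(pw, gz)
		// A nil error makes the reader see a plain io.EOF.
		pw.CloseWithError(err)
	}()
	return pr
}

// roundTrip (hypothetical helper) compresses s in memory and reads it back
// through decompressStream, standing in for a snapshot file on disk.
func roundTrip(s string) (string, error) {
	var compressed bytes.Buffer
	zw := gzip.NewWriter(&compressed)
	if _, err := zw.Write([]byte(s)); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}

	rc := decompressStream(&compressed)
	defer rc.Close()
	out, err := io.ReadAll(rc)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	out, err := roundTrip("snapshot bytes")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

In a real restore path the consumer would be the snapshot decoder rather than io.ReadAll, so memory use stays bounded by what the decoder keeps, not by the uncompressed size.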

Deferred for this PR: the nomad operator snapshot state command still has to load everything that's been filtered into the FSM before writing it out to a large JSON blob. We should provide a tool that streams the decoded objects directly to an encoder without loading into the FSM, so that we can emit NDJSON, write out to a sqlite DB, etc.
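
The deferred idea might look roughly like the sketch below. It is an illustration under stated assumptions, not a proposed design: the record type, streamNDJSON, and filterByJob are hypothetical names, and the input is a stream of JSON values only to keep the example within the standard library (the real snapshot uses a different encoding). Each object is decoded, tested against a predicate, and written out as one NDJSON line, so only one object is held in memory at a time.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// record is a hypothetical stand-in for a decoded snapshot object.
type record struct {
	ID    string
	JobID string `json:",omitempty"`
}

// streamNDJSON decodes a stream of JSON values from r, keeps those for which
// keep returns true, and writes each kept value to w as a single line.
// It returns the number of records written.
func streamNDJSON(r io.Reader, w io.Writer, keep func(record) bool) (int, error) {
	dec := json.NewDecoder(r)
	enc := json.NewEncoder(w) // Encode appends a newline after each value.
	n := 0
	for {
		var rec record
		err := dec.Decode(&rec)
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, fmt.Errorf("decoding record: %w", err)
		}
		if !keep(rec) {
			continue
		}
		if err := enc.Encode(rec); err != nil {
			return n, fmt.Errorf("encoding record: %w", err)
		}
		n++
	}
}

// filterByJob (hypothetical helper) runs streamNDJSON over an in-memory input
// and keeps only the records whose JobID matches jobID.
func filterByJob(input, jobID string) (string, int, error) {
	var out strings.Builder
	n, err := streamNDJSON(strings.NewReader(input), &out, func(rec record) bool {
		return rec.JobID == jobID
	})
	return out.String(), n, err
}

func main() {
	input := `{"ID":"a1","JobID":"job1"} {"ID":"a2","JobID":"job2"} {"ID":"a3","JobID":"job1"}`
	out, n, err := filterByJob(input, "job1")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d records\n%s", n, out)
}
```

Because the writer is just an io.Writer, the same loop could target a file, a pipe, or a database loader without ever holding the full filtered set in memory.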


Example:

Starting with a 439MB snapshot (~13GiB uncompressed), I want to filter for all objects associated with 3 different jobs and 3 different nodes:

time nomad operator snapshot state -filter '
    JobID == "job1" or
    JobID == "job2" or
    JobID == "job3" or
    NodeID == "3b3471d7-c519-8e3c-d7fd-dc692ca44744" or
    NodeID == "455775de-b4b4-0cb6-75eb-6c534618a005" or
    NodeID == "0d8e2a62-2712-cb4c-fb15-9831fdac57fe" or
    ID == "job1" or
    ID == "job2" or
    ID == "job3" or
    ID == "3b3471d7-c519-8e3c-d7fd-dc692ca44744" or
    ID == "455775de-b4b4-0cb6-75eb-6c534618a005" or
    ID == "0d8e2a62-2712-cb4c-fb15-9831fdac57fe"
' \
      ./nomad_operator_snapshot_save_2022_05_12_1543-0700.snap \
      > filtered-state.json

real    24m15.805s
user    29m55.036s
sys     7m50.618s

$ cat filtered-state.json | jq '.Allocs | length'
5490
$ cat filtered-state.json | jq '.Evals | length'
667

Previously this would write ~13GiB to disk, read 14GiB from disk, and saturate 1 core for over an hour before running out of memory on my machine (16GiB) and crashing.

With this change, the command reads ~450MiB from disk, only writes the 197MiB JSON blob to disk, and uses about 150% CPU, maxing out memory usage around 330MB.

@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/snapshot-restore-filter/indirectly-oriented-hamster branch from e853704 to 22a8f7e Compare August 23, 2022 18:30
@hc-github-team-nomad-core hc-github-team-nomad-core merged commit 52879c4 into release/1.3.x Aug 23, 2022
@hc-github-team-nomad-core hc-github-team-nomad-core deleted the backport/snapshot-restore-filter/indirectly-oriented-hamster branch August 23, 2022 18:30
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 24, 2022