
snapshot restore-from-archive streaming and filtering #13658

Merged · 4 commits · Jul 11, 2022

Conversation

@tgross (Member) commented Jul 8, 2022

This changeset implements two improvements to restoring FSM snapshots from archives:

  • The existing implementation decompresses the archive to a temporary file before reading it into the FSM. For large snapshots this performs a lot of disk IO. This change streams decompression as the snapshot is read, without first writing to a temporary file, which also moves some of the work to a second core (see the sketch after this list).
  • Add bexpr filters to the RestoreFromArchive helper. The operator can pass these as -filter arguments to nomad operator snapshot state (and other commands in the future) to include only the desired data when reading the snapshot (a filtering sketch follows below).
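
For illustration, here is a minimal sketch of the streaming approach, assuming a gzip-compressed payload and using an `io.Pipe` so decompression runs in its own goroutine; the function name and archive layout are simplifications, not Nomad's actual `RestoreFromArchive`:

```go
package main

import (
	"compress/gzip"
	"io"
	"os"
)

// streamDecompress returns a reader that yields decompressed bytes as the
// archive is read, instead of inflating to a temporary file first. Because
// decompression runs in its own goroutine, it can occupy a second core
// while the consumer (e.g. the FSM restore) parses concurrently.
func streamDecompress(archive io.Reader) io.ReadCloser {
	pr, pw := io.Pipe()
	go func() {
		gz, err := gzip.NewReader(archive)
		if err != nil {
			pw.CloseWithError(err)
			return
		}
		defer gz.Close()
		// Copy decompressed bytes into the pipe; a nil error here
		// closes the pipe cleanly so the reader sees io.EOF.
		_, err = io.Copy(pw, gz)
		pw.CloseWithError(err)
	}()
	return pr
}

func main() {
	f, err := os.Open("snapshot.snap") // hypothetical archive path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	r := streamDecompress(f)
	defer r.Close()
	// Hand r to the snapshot decoder / FSM restore; here we just drain it.
	if _, err := io.Copy(io.Discard, r); err != nil {
		panic(err)
	}
}
```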

Deferred for this PR: the nomad operator snapshot state command still has to load all the filtered data into the FSM before writing it out as one large JSON blob. We should provide a tool that streams the decoded objects directly to an encoder without loading them into the FSM, so that we can emit NDJSON, write out to a sqlite DB, etc. (a combined filter-and-stream sketch follows below).
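
As a rough sketch of how the `-filter` expressions can be applied per object, and of the deferred stream-to-encoder idea, here is an example using `github.com/hashicorp/go-bexpr` with `encoding/json` emitting NDJSON; the `Alloc` struct and the pre-decoded slice are hypothetical stand-ins for the snapshot's real object types and decode loop:

```go
package main

import (
	"encoding/json"
	"os"

	"github.com/hashicorp/go-bexpr"
)

// Alloc is a hypothetical stand-in for one of the object types decoded
// from the snapshot stream.
type Alloc struct {
	ID     string
	JobID  string
	NodeID string
}

func main() {
	// The same boolean-expression syntax the -filter flag accepts.
	eval, err := bexpr.CreateEvaluator(
		`JobID == "job1" or NodeID == "3b3471d7-c519-8e3c-d7fd-dc692ca44744"`)
	if err != nil {
		panic(err)
	}

	decoded := []Alloc{ // stand-in for objects decoded from the stream
		{ID: "a1", JobID: "job1", NodeID: "n1"},
		{ID: "a2", JobID: "other", NodeID: "n2"},
	}

	// json.Encoder writes one object per line, i.e. NDJSON. In the
	// deferred design, matching objects would stream here directly
	// instead of being loaded into the FSM first.
	enc := json.NewEncoder(os.Stdout)
	for _, a := range decoded {
		match, err := eval.Evaluate(a)
		if err != nil {
			panic(err)
		}
		if match {
			if err := enc.Encode(a); err != nil {
				panic(err)
			}
		}
	}
}
```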


Example:

Starting with a 439MB snapshot (~13GiB uncompressed), I want to filter for all objects associated with 3 different jobs and 3 different nodes:

$ time nomad operator snapshot state -filter '
    JobID == "job1" or
    JobID == "job2" or
    JobID == "job3" or
    NodeID == "3b3471d7-c519-8e3c-d7fd-dc692ca44744" or
    NodeID == "455775de-b4b4-0cb6-75eb-6c534618a005" or
    NodeID == "0d8e2a62-2712-cb4c-fb15-9831fdac57fe" or
    ID == "job1" or
    ID == "job2" or
    ID == "job3" or
    ID == "3b3471d7-c519-8e3c-d7fd-dc692ca44744" or
    ID == "455775de-b4b4-0cb6-75eb-6c534618a005" or
    ID == "0d8e2a62-2712-cb4c-fb15-9831fdac57fe"
' \
      ./nomad_operator_snapshot_save_2022_05_12_1543-0700.snap \
      > filtered-state.json

real    24m15.805s
user    29m55.036s
sys     7m50.618s

$ cat filtered-state.json | jq '.Allocs | length'
5490
$ cat filtered-state.json | jq '.Evals | length'
667

Previously this would write ~13GiB to disk, read 14GiB from disk, and saturate 1 core for over an hour before running out of memory on my machine (16GiB) and crashing.

With this change, the command reads ~450MiB from disk, only writes the 197MiB JSON blob to disk, and uses about 150% CPU, maxing out memory usage around 330MB.

Commit messages:

The `RestoreFromArchive` helper decompresses the snapshot archive to a temporary file before reading it into the FSM. For large snapshots this performs a lot of disk IO. Stream decompress the snapshot as we read it, without first writing to a temporary file.

The operator can pass bexpr filters as `-filter` arguments to `nomad operator snapshot state` (and other commands in the future) to include only desired data when reading the snapshot.
@shoenig (Member) left a comment

LGTM!

Review threads on .changelog/13658.txt, helper/raftutil/snapshot.go, and nomad/fsm.go were marked outdated and resolved.
@github-actions (bot) commented

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

The github-actions bot locked this pull request as resolved and limited conversation to collaborators on Dec 24, 2022.