39 - implement ancestry chunking memory optimization #93

Merged
merged 30 commits into main from ancestry-chunking-memory-optimization
Jan 16, 2024

Conversation

@alexdunnjpl (Contributor) commented Jan 12, 2024

🗒️ Summary

Implements a new chunked-processing paradigm for the ancestry sweeper to reduce the provisioned memory requirement, at the cost of some runtime speed and the need to provision enough disk space to perform swapping. The disk dump avoids the need to hold the full ancestry history in memory to generate AncestryRecords.

Performance tuning is available via the env vars ANCESTRY_NONAGGREGATE_QUERY_PAGE_SIZE (default 2000) and ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD (expressed as a percentage, default 80). Further details are available in ancestry.runtimeconstants.py.
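
For orientation, a minimal sketch of how these env vars might be read as runtime constants (the actual definitions live in ancestry.runtimeconstants.py and may be structured differently):

```python
# Hypothetical sketch only - the real constants are defined in
# ancestry.runtimeconstants.py and may differ in structure and naming.
import os

# Page size used when querying non-aggregate products from the registry.
NONAGGREGATE_QUERY_PAGE_SIZE = int(
    os.environ.get("ANCESTRY_NONAGGREGATE_QUERY_PAGE_SIZE", "2000")
)

# Percentage of memory use at which accumulated ancestry state is dumped to disk.
DISK_DUMP_MEMORY_THRESHOLD_PERCENT = int(
    os.environ.get("ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD", "80")
)
```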

Performance testing on en-prod showed that the task could drop from 1 vCPU and 4GB memory to 0.5 vCPU and 1GB memory (a cost saving of 42%), at the cost of a 31% increase in runtime. The proportional benefit is expected to be significantly higher for nodes with greater quantities of data, since higher memory allocations require more vCPUs, all but one of which sit idle, though this will require confirmation. Critically, this also removes the previous inability to scale to large quantities of products, which was imposed by the finite ability to increase provisioned RAM.

⚙️ Test Data and/or Report

Regression tests augmented, updated, and passing. Functional tests implemented for the new dump/merge behaviour to ensure production of correct update content. Manually tested against en-prod with a dry-run and shown to produce an update count equal to the pre-PR version. Performance-tested against en-prod.

N.B. Docs were failing due to a sphinx issue (locally and in GitHub Actions), but this is no longer the case.

♻️ Related Issues

fixes #39

Ops notes

I switched the ECS task definitions to use the tag stable rather than latest, thinking that latest would automatically reference the last-pushed image (and we don't want a dev pushing a development-related tag and causing all tasks to use it by default), but this appears not to be the case. Is latest a manual tag rather than something auto-generated? If so, the task definitions should be switched back to using latest. @sjoshi-jpl @jordanpadams ?

@sjoshi-jpl once this is deployed to ECR, I'd recommend the following approach to tuning provisioned memory/vCPUs:

  • provision 200GB of ephemeral ECS storage
  • cut vCPUs to 1, and memory to the lesser of current-value/2 and 8GB (the max permitted for 1 vCPU). For very small nodes, 0.25-0.5 vCPUs may be warranted
  • run the sweeper once - if it crashes, set env var ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD=60 and try again (you should not have to cut any further)
  • when complete, look for a log line like 22:43:38,074::pds.registrysweepers.ancestry.generation::INFO::On-disk swap comprised of 3 files totalling 0.3GB
  • choose the provisioned ephemeral ECS storage value according to the total size logged (maybe current*2?)

If you're able to benchmark provisioned memory/vCPUs before/after and runtime before/after, awesome, but I understand that's annoying work.

  • now covers case where a collection with no shared member refs exists
  • only applied to some queries, thus far
  • initial test suggests it may take up to 20% longer, but this may not be the case where pagination is deep
  • missing stubs failure persists despite installation of psutil library stubs
  • this may be causing OOM condition where it should not occur
  • this is due to OOM exit on en-prod - either the estimation is off or the task containers are subject to a <100% kill threshold
@nutjob4life (Member) left a comment


Hi folks. This looks all right and I've approved this PR.

However, I'm always a little suspicious when a process is asking about its own virtual memory usage and is writing things to disk itself. This comes from the whole Squid vs Varnish days, when memory-mapping a giant file into memory turned out to be the smarter solution (the Varnish solution), because the operating system itself will always know how to manage a process' memory better than the process ever could (since its own logic would either be waking up pages that were already swapped out and/or would duplicate OS logic).

However, I do recognize that AWS changes constraints a bit. I sort of wish I understood the problem a bit better.

In any case, you got my 👍

Resolved (outdated) review comments on: src/pds/registrysweepers/provenance/__init__.py, src/pds/registrysweepers/repairkit/__init__.py, src/pds/registrysweepers/repairkit/allarrays.py
@alexdunnjpl (Contributor, Author)

@nutjob4life firm agreement in principle.

I sort of wish I understood the problem a bit better.

If you're happy to sanity-check: the initial problem is that we need to take a large set of collection-level documents, each containing references to all non-aggregate products present in that collection, and create non-aggregate-level records containing

  • all parent collections of that non-agg
  • all parent bundles of those collections

This (naively) requires keeping the set of all non-agg records in memory from the start of the collection-level document iteration, to ensure that a non-agg's history is complete before its update is sent off to the db (since reading, appending, then updating would require an infeasible quantity of db calls).
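
Concretely, each non-aggregate product ends up with a record shaped roughly like the following (a hypothetical illustration; the sweeper's actual AncestryRecord class may differ):

```python
# Hypothetical illustration of the record shape described above; not the
# sweeper's actual AncestryRecord implementation.
from dataclasses import dataclass, field
from typing import Set


@dataclass
class NonAggregateAncestry:
    lidvid: str  # the non-aggregate product this record describes
    parent_collection_lidvids: Set[str] = field(default_factory=set)  # all parent collections
    parent_bundle_lidvids: Set[str] = field(default_factory=set)  # all parent bundles of those collections
```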

So my solution was to say "okay, let's write the current page of non-agg history to disk when it gets too large, then merge the histories together later in a way which doesn't require holding it all in memory." There is now additional motivation to make full use of the provisioned memory, since it's a bottleneck for execution time, in addition to costing more to provision if memory use is unnecessarily peaky.
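
A minimal sketch of that threshold-triggered dump, assuming psutil (mentioned elsewhere in this PR) is used for the memory check; the file format and layout here are illustrative only:

```python
import json
import os
import tempfile

import psutil  # assumed here for the memory check; not necessarily the sweeper's exact approach

DISK_DUMP_MEMORY_THRESHOLD_PERCENT = int(
    os.environ.get("ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD", "80")
)


def maybe_dump_to_disk(history: dict, dump_dir: str) -> dict:
    """If memory use exceeds the configured threshold, write the accumulated
    non-agg ancestry history to a file and return a fresh, empty dict."""
    if psutil.virtual_memory().percent < DISK_DUMP_MEMORY_THRESHOLD_PERCENT:
        return history

    fd, path = tempfile.mkstemp(suffix=".json", dir=dump_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({lidvid: sorted(parents) for lidvid, parents in history.items()}, f)
    return {}  # start accumulating a fresh chunk in memory
```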

The problem, then, is that a large proportion of the memory demand (let's say 100%, for simplicity's sake) is due to the need to take dict-like chunks of size S and then:

  1. designate a chunk "active"
  2. for each other chunk, remove its values for all keys which are present in the active chunk and merge them into the active chunk
  3. now that the active chunk is known to contain complete values for all of its keys, send it off to the db as part of a _bulk write
  4. rinse and repeat for the remaining chunks

Ideally, the amount of memory required should be ~2S, since two pages are loaded simultaneously.

What ends up happening is that additional data is accumulated into the active chunk, resulting in (2S + merged_data) memory use... my solution was to split the chunks in half when they got larger than the largest non-merged chunk. Kinda brainless, but it works.

Finally, the manual garbage collection. At this point, the memory usage spikes to ~3S for a split second right as a new inactive chunk is loaded from disk for merging. I think what is happening is that the active and previous-inactive chunks are loaded (2S), then the third chunk starts loading before the GC releases the now-unneeded previous-inactive chunk. Manually calling del then gc.collect() successfully prevents this by triggering the release as a blocking call prior to loading the next chunk from disk.
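
Putting the pieces together, a rough sketch of the merge pass with the explicit release (hypothetical names and serialization; not the sweeper's actual code, and the chunk-splitting step is omitted):

```python
import gc
import json
from typing import Callable, Dict, List, Set


def load_chunk(path: str) -> Dict[str, Set[str]]:
    with open(path) as f:
        return {key: set(values) for key, values in json.load(f).items()}


def merge_and_flush(chunk_paths: List[str], bulk_write: Callable[[Dict[str, Set[str]]], None]) -> None:
    for i, active_path in enumerate(chunk_paths):
        active = load_chunk(active_path)  # the "active" chunk
        for other_path in chunk_paths[i + 1:]:
            other = load_chunk(other_path)
            # Pull values for keys already present in the active chunk, so the
            # active chunk ends up holding the complete history for its keys.
            for key in active.keys() & other.keys():
                active[key] |= other.pop(key)
            # Persist the shrunken remainder back to disk for later passes.
            with open(other_path, "w") as f:
                json.dump({k: sorted(v) for k, v in other.items()}, f)
            # Explicitly release the inactive chunk before loading the next one,
            # keeping peak usage near two chunks rather than briefly three.
            del other
            gc.collect()
        bulk_write(active)  # active chunk is now complete - send as part of a _bulk update
        del active
        gc.collect()
```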

Possibly my understanding of what's going on is flawed. Almost certainly there's a better solution for this; I'm just not aware of one.

@alexdunnjpl alexdunnjpl merged commit f8ba507 into main Jan 16, 2024
2 checks passed
@alexdunnjpl alexdunnjpl deleted the ancestry-chunking-memory-optimization branch January 16, 2024 23:28