39 - implement ancestry chunking memory optimization #93
Conversation
now covers case where a collection with no shared member refs exists
only applied to some queries, thus far. initial test suggests it may take up to 20% longer, but this may not be the case where pagination is deep
also improves logging
…n approach with minimal use of objects
…_history_chunks()
missing stubs failure persists despite installation of psutil library stubs
this may be causing an OOM condition where it should not occur
this is due to OOM exit on en-prod - either the estimation is off or the task containers are subject to a <100% kill threshold
Hi folks. This looks all right and I've approved this PR.
However, I'm always a little suspicious when a process is asking about its own virtual memory usage and writing things to disk itself. This comes from the whole Squid vs Varnish days, when memory-mapping a giant file turned out to be the smarter solution (the Varnish solution), because the operating system will always know how to manage a process's memory better than the process ever could (the process's own logic would either be waking up pages that were already swapped out and/or duplicating OS logic).
However, I do recognize that AWS changes constraints a bit. I sort of wish I understood the problem a bit better.
In any case, you got my 👍
@nutjob4life firm agreement in principle.
If you're happy to sanity-check, the initial problem is that we need to take a large set of collection-level documents, each containing references to all non-aggregate products present in that collection, and create non-aggregate-level records containing the full ancestry history for each of those products.
This requires (naively) keeping the set of all non-agg records in memory from the start of the collection-level document iteration, to ensure that a non-agg's history is complete before an update is sent off to the db (since reading, appending, then updating would require an infeasible quantity of db calls). So my solution was to say "okay, let's write the current page of non-agg history to disk when it gets too large, then merge the histories together later in a way which doesn't require holding it all in memory." There is now additional motivation to make full use of whatever memory is available, since memory is now a bottleneck for execution time, in addition to costing more to provision if usage is unnecessarily peaky. The problem, then, is that a large proportion of the memory demand (let's say 100%, for simplicity's sake) is due to the need to take dict-like chunks of size S and then merge them together.

Ideally, the amount of memory required should be ~2S, since two pages are loaded simultaneously. What ends up happening is that additional data is accumulated into the active chunk, resulting in (2S + merged_data) memory use... my solution was to split the chunks in half when they got larger than the largest non-merged chunk. Kinda brainless, but it works.

Finally, the manual garbage collection. At this point, memory usage spikes to ~3S for a split second right as a new inactive chunk is loaded from disk for merging. I think what is happening is that the active and previous-inactive chunks are loaded (2S), then the third chunk starts loading before the GC releases the now-unneeded previous-inactive chunk. Manually calling the GC releases that chunk before the next one is loaded. Possibly my understanding of what's going on is flawed. Almost-certainly there's a better solution for this, I'm just not aware of such.
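
For concreteness, here's a minimal sketch of the dump/merge-with-manual-GC idea described above. It is not the actual registry-sweepers implementation: the helper names (`dump_chunk`, `load_chunk`, `yield_complete_histories`), the JSON on-disk format, and the psutil-based threshold check are assumptions made for illustration only.

```python
# Illustrative sketch only -- hypothetical names and on-disk format, not the real implementation.
import gc
import json
import os
import tempfile
from typing import Dict, Iterator, List, Set, Tuple

import psutil

# Dump threshold as a percentage of system memory (default 80, per the PR description).
DISK_DUMP_MEMORY_THRESHOLD = float(os.environ.get("ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD", "80"))

Chunk = Dict[str, Set[str]]  # non-aggregate product id -> set of ancestor ids seen so far


def should_dump_to_disk() -> bool:
    """True once this process' share of system memory crosses the threshold."""
    return psutil.Process().memory_percent() >= DISK_DUMP_MEMORY_THRESHOLD


def dump_chunk(chunk: Chunk, dump_dir: str) -> str:
    """Persist the active page of partial ancestry history to disk, returning its path."""
    fd, path = tempfile.mkstemp(dir=dump_dir, suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump({k: sorted(v) for k, v in chunk.items()}, f)
    return path


def write_chunk(chunk: Chunk, path: str) -> None:
    with open(path, "w") as f:
        json.dump({k: sorted(v) for k, v in chunk.items()}, f)


def load_chunk(path: str) -> Chunk:
    with open(path) as f:
        return {k: set(v) for k, v in json.load(f).items()}


def yield_complete_histories(chunk_paths: List[str]) -> Iterator[Tuple[str, Set[str]]]:
    """Treat each dumped chunk as the 'active' page in turn and fold matching keys from the
    remaining (inactive) chunks into it, so roughly two pages are resident at any moment."""
    for i, active_path in enumerate(chunk_paths):
        active = load_chunk(active_path)
        for other_path in chunk_paths[i + 1:]:
            other = load_chunk(other_path)
            for key in active.keys() & other.keys():
                active[key].update(other.pop(key))  # merge this key's history into the active page
            write_chunk(other, other_path)  # re-persist the inactive page minus the merged-out keys
            del other
            gc.collect()  # release the inactive page before the next one loads (avoids the ~3S spike)
        # Keys shared with earlier chunks were already merged out of this one, and keys shared with
        # later chunks have just been folded in, so the active page's histories are now complete.
        yield from active.items()
        del active
        gc.collect()
```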
🗒️ Summary
Implements a new chunked processing paradigm for the ancestry sweeper to reduce the provisioned-memory requirement, at the cost of some runtime speed and the need to provision enough disk space to perform swapping. The disk dump avoids the need to hold the full ancestry history in memory to generate AncestryRecords.
Performance tuning is available via env vars `ANCESTRY_NONAGGREGATE_QUERY_PAGE_SIZE` (default `2000`) and `ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD` (expressed in percent, default `80`). Further details are available in `ancestry.runtimeconstants.py`.

A performance test on en-prod showed it was possible to drop from 1 vCPU and 4GB memory to 0.5 vCPU and 1GB memory (a cost saving of 42%), at the cost of a 31% increase in runtime. The proportional benefit is expected to be significantly higher for nodes with greater quantities of data, since higher memory allocations require more vCPUs, all but one of which are wasted, though this will require confirmation. Critically, this also removes the previous inability to scale to large quantities of products due to the finite ability to increase provisioned RAM.
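
Returning to the two tuning env vars above, here's a hypothetical sketch of how they might be read with their documented defaults; the actual contents of `ancestry.runtimeconstants.py` may differ.

```python
# Hypothetical sketch -- the real ancestry.runtimeconstants.py may be structured differently.
import os

# Page size used when querying non-aggregate products (default 2000, per the PR description).
ANCESTRY_NONAGGREGATE_QUERY_PAGE_SIZE = int(
    os.environ.get("ANCESTRY_NONAGGREGATE_QUERY_PAGE_SIZE", "2000")
)

# Memory-use threshold, in percent, above which ancestry history is dumped to disk
# (default 80, per the PR description).
ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD = float(
    os.environ.get("ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD", "80")
)
```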
⚙️ Test Data and/or Report
Regression tests augmented/updated/passed. Functional tests implemented for the new dump/merge behaviour to ensure production of correct update content. Manually tested against `en-prod` with a dry-run and shown to produce an update count equal to the pre-PR version. Performance-tested against `en-prod`.

~~N.B. Docs are failing due to sphinx issue (locally and in github actions)~~ not anymore

♻️ Related Issues
fixes #39
Ops notes
I switched ECS task definitions to use tag `stable` rather than `latest`, thinking that `latest` would automatically reference the last-pushed image (and we don't want a dev pushing a development-related tag and causing all tasks to use that by default), but this appears not to be the case. Is `latest` a manual tag rather than something auto-generated? If so, task definitions should be switched back to using `latest`. @sjoshi-jpl @jordanpadams ?

@sjoshi-jpl once this is deployed to ECR, I'd recommend the following approach to tune provisioned memory/vCPUs:
- Set provisioned memory somewhere between `current-value/2` and `8GB` (the max permitted for 1 vCPU). For very-small nodes, 0.25-0.5 vCPUs may be warranted.
- If the run OOMs, set `ANCESTRY_DISK_DUMP_MEMORY_THRESHOLD=60` and try again (you should not have to cut any further).
- Check the on-disk swap size reported in the logs, e.g. `22:43:38,074::pds.registrysweepers.ancestry.generation::INFO::On-disk swap comprised of 3 files totalling 0.3GB`, and make sure sufficient disk is provisioned (`current*2`?).
- If you're able to benchmark provisioned memory/vCPUs before/after and runtime before/after, awesome, but I understand that's annoying work.