heroku[checksum-worker.1]: Error R14 (Memory quota exceeded) #1095
Hitting heavily! We had some log-dumping issues, so there could be misses and/or duplicates (across files), but ATM
someone really needs to memray that script etc.
I think I found the underlying cause of the memory errors. I connected my dev environment to the staging DB/S3 bucket and ran zarr ingest on the largest zarr dandiset in staging (~500 GB), and saw no abnormal memory usage. I also didn't see any of the new Zarr logging I added recently (#1128) in the production papertrail logs during the last couple of times the worker crashed. So it would seem the zarr ingestion isn't causing these memory errors.

The other task that worker is responsible for is SHA-256 checksum calculation, so I tested that out locally with the staging DB/S3 as well. I had previously tested it on some local data and saw no memory issues. However, when I ran it on https://gui-staging.dandiarchive.org/dandiset/101392 from staging, the memory usage of my celery worker started increasing gradually over time. I tracked the memory usage over 15 mins and graphed it.

The main issue appears to be this loop: https://github.com/dandi/dandi-archive/blob/master/dandiapi/api/tasks/__init__.py#L44-L47. That loop iterates over every Asset associated with the AssetBlob being checksummed. Recall that Assets are immutable, and whenever a user modifies an Asset, a new one is minted and the old one remains in place. So an AssetBlob could have a lot of Assets depending on how many times it is modified (which is the case with the staging dandiset I ran this on, but not my local data, which is why I missed this at first), and the checksum task currently iterates over all of them and validates each of their metadata, which would definitely cause high memory usage if there are too many.

I think we can do two things off the bat to mitigate this (it's possible these might even fix the problem altogether).
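For readers without the repo handy, here is a minimal sketch of the pattern being described; it is not the actual code from dandiapi/api/tasks/__init__.py, and the model access and helper names are illustrative placeholders.

# Illustrative sketch only -- not the actual dandi-archive task code.
# After checksumming an AssetBlob, the task walks every Asset that
# references that blob and validates its metadata inline.
from celery import shared_task

@shared_task
def calculate_sha256(blob_id):
    asset_blob = AssetBlob.objects.get(blob_id=blob_id)   # placeholder model access
    asset_blob.sha256 = compute_sha256(asset_blob)        # hypothetical helper
    asset_blob.save()

    # The problematic loop: one blob can be referenced by many immutable
    # Assets, and default QuerySet iteration also caches every instance,
    # so a heavily-modified asset means many metadata documents held in
    # memory within a single task run.
    for asset in asset_blob.assets.all():
        validate_asset_metadata(asset)                    # hypothetical helper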
I'm also unsure if every
Nice digging!!! I am not certain, though, why validation in that loop would incrementally increase memory consumption. Wouldn't gc pick up/clean assets from the previous iteration? Can we explicitly
Or is it that, as you hint in 1., it is a list of instances instead of an iterator?! (Now also, since we seem to need just the id of an asset, could we get a list of those instead?)
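To make that question concrete, a small sketch (placeholder names, assuming a Django-style related manager as in the sketch above):

# Default iteration caches every Asset instance in the QuerySet's result
# cache, so earlier iterations can't be garbage collected until the
# queryset itself goes away.  .iterator() streams rows instead:
for asset in asset_blob.assets.all().iterator():
    validate_asset_metadata(asset)        # hypothetical helper

# And if only the ids are needed, model instances can be skipped entirely:
asset_ids = list(asset_blob.assets.values_list("id", flat=True))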
@mvandenburgh, it seems that you could test the
More generally, if you can contrive a dandiset with as many assets as 101392, perhaps you could simply run the full test locally?
I ran it through memray, and most of the memory is being used by the JSON decoder from the Python standard library. I assume this is related to the pydantic model serialization. Calling
Here are the flamegraphs from memray if anyone is interested.
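For anyone wanting to reproduce this kind of capture, here is a minimal sketch using memray's Python tracking API; the workload function is a placeholder, and the flamegraphs in this thread were presumably produced by profiling the celery worker itself.

# Minimal memray capture sketch; render the result afterwards with:
#   memray flamegraph checksum_task.bin
import memray

def run_workload():
    # Placeholder for the real work, e.g. triggering the checksum /
    # validation path against a dev database.
    ...

with memray.Tracker("checksum_task.bin"):
    run_workload()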
Yes, currently it is just a list of
Garbage collection hasn't been deployed yet, no.
I tested it locally with my dev server hooked up to the staging DB/S3. .iterator() slightly decreases the memory usage (it would have a larger impact as the number of assets grows), but I found that delaying the asset validation tasks fixed the larger memory consumption (which is consistent with the memory profiling I mentioned in my previous comment).
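A sketch of the kind of change being described, with placeholder names (not the actual diff):

# Instead of validating each Asset's metadata inline inside the checksum
# task, enqueue one small validation task per asset id so the memory for
# each validation is released when its own task finishes.
for asset_id in asset_blob.assets.values_list("id", flat=True).iterator():
    validate_asset_metadata.delay(asset_id)   # hypothetical celery task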
So if I read them right - it seems it is about 50MB which comes from reading in (likely) the dandi json schema, since there is
Is that where the 500MB comes from - validation done in parallel, with each thread loading its own copy of the schema (which would also be wasting traffic, I guess)?
Yes. Each celery worker runs up to 4 task threads at once, and those two flamegraphs are from 2 tasks that were running concurrently. There's also some memory overhead to running celery itself; on my local system it consumes ~100MB while idling with no tasks in the queue.
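For context, that concurrency is a celery worker setting; a sketch of the relevant knob (illustrative app name and value, not the project's actual configuration):

# Illustrative only -- not dandi-archive's actual celery config.
from celery import Celery

app = Celery("dandiapi")          # app name is a placeholder
# Up to 4 tasks run at once per worker, so whatever schemas/metadata a
# task loads is multiplied by this on top of the ~100MB idle baseline.
app.conf.worker_concurrency = 4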
@mvandenburgh Could you post a graph of memory usage over time (like #1095 (comment)) after the fix from #1139? It'd be good to have a direct comparison.
🚀 Issue was released in |
I am still a bit "surprised" with 50MB usage for
204kB of schema in there, and I did not spot any major memory utilization while trying to load them straight from drive or from github as we do:

$> python -c 'import json,resource,requests; from glob import glob; print(resource.getrusage(resource.RUSAGE_SELF)); j=[requests.get(f"https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.3/{f}").json() for f in glob("*.json")]; print(f"Loaded {len(j)}"); print(resource.getrusage(resource.RUSAGE_SELF))'
resource.struct_rusage(ru_utime=0.10937, ru_stime=0.012152, ru_maxrss=25396, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=5092, ru_majflt=0, ru_nswap=0, ru_inblock=32, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=0, ru_nivcsw=18)
Loaded 5
resource.struct_rusage(ru_utime=0.20646899999999999, ru_stime=0.028339, ru_maxrss=26276, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=5370, ru_majflt=0, ru_nswap=0, ru_inblock=552, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=35, ru_nivcsw=19)

and also when I memrayed this script version of the above:

import json
import resource
from glob import glob

import requests

# resource usage (including ru_maxrss) before fetching anything
print(resource.getrusage(resource.RUSAGE_SELF))

j = []
for f in glob("*.json"):
    j.append(
        requests.get(f"https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.3/{f}").json()
    )
print(f"Loaded {len(j)}")

# resource usage after loading all schema files
print(resource.getrusage(resource.RUSAGE_SELF))

The memray flamegraph just showed that 10M was due to the import of requests, nothing about the actual requests.get calls, so maybe the location memray suggested was a red herring... but regardless -- to validate each asset we fetch the schema files from github over and over again. I wonder if we could/should somehow cache that to gain a speed up and reduce the amount of network chatter.
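A minimal sketch of the kind of caching being suggested, assuming a hypothetical fetch helper (this is not how dandischema actually loads schemas; the function name, URL layout, and in-process cache choice are illustrative):

# Cache schema fetches in-process so repeated validations don't
# re-download the same files from github.
from functools import lru_cache

import requests

SCHEMA_BASE = "https://raw.githubusercontent.com/dandi/schema/master/releases"  # assumed layout

@lru_cache(maxsize=None)
def get_schema(version, name):
    # The first call per (version, name) hits the network; later calls in
    # the same process return the cached, already-parsed JSON.
    response = requests.get(f"{SCHEMA_BASE}/{version}/{name}", timeout=10)
    response.raise_for_status()
    return response.json()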
Didn't analyze whether it relates to prior reports (https://github.com/dandi/dandi-archive/search?q=Memory+quota+exceeded&type=issues), but ATM we have had, in the past 12 hours (if I got the timing right), over a hundred such messages:
and apparently it is something relatively new (logging archival on drogon was "defunct" for a while, but we did get it back):
someone needs to analyze if that had an actual impact ... There are 2802 invalid assets listed on https://api.dandiarchive.org/dashboard/, and those few which are shown are Pending and have no checksum, so they might be the ones affected?