heroku[checksum-worker.1]: Error R14 (Memory quota exceeded) #1095

Closed
yarikoptic opened this issue May 11, 2022 · 14 comments · Fixed by #1139

Labels: released (This issue/pull request has been released.)

@yarikoptic (Member)

Didn't analyze whether this relates to prior reports (https://github.com/dandi/dandi-archive/search?q=Memory+quota+exceeded&type=issues), but ATM we have, over the past 12 hours (if I got the timing right), over a hundred such messages:

(base) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -h 'Memory quota' 20220510-2201.log | nl | tail -n 2
   162	2022-05-11T03:00:19.341006+00:00 heroku[checksum-worker.1]: Error R14 (Memory quota exceeded)
   163	2022-05-11T03:00:41.016410+00:00 heroku[checksum-worker.1]: Error R14 (Memory quota exceeded)

and apparently it is something relatively new (log archival on drogon was "defunct" for a while, but we did get it back):

(base) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -c 'Memory quota' 202205*
20220506-0924.log:0
20220506-0934.log:0
20220506-1001.log:0
20220506-1401.log:0
20220507-1701.log:0
20220508-1901.log:0
20220509-1801.log:208
20220509-2001.log:180
20220510-2201.log:163

Someone needs to analyze whether that had an actual impact... There are 2802 invalid assets listed on https://api.dandiarchive.org/dashboard/, and the few that are shown are Pending and have no checksum, so they might be the ones affected?

@yarikoptic (Member, Author)

Hitting heavily! We had some log-dumping issues, so there could be misses and/or duplicates (across files), but ATM:

...
20220525-1601.log:7
20220525-1701.log:11
20220525-1801.log:327
20220525-1901.log:62
20220525-2001.log:217
20220525-2101.log:3260
20220525-2201.log:3077
20220525-2301.log:2904
20220526-0001.log:2719
20220526-0101.log:2542
20220526-0201.log:2363
20220526-0301.log:2186
20220526-0401.log:2006
20220526-0501.log:1871
20220526-0601.log:1871
20220526-0701.log:1633
20220526-0801.log:1280
20220526-0901.log:863
20220526-1001.log:415
20220526-1052.log:13
20220526-1055.log:267

Someone really needs to run memray on that script, etc.

@mvandenburgh (Member) commented Jun 28, 2022

I think I found the underlying cause of the memory errors. I connected my dev environment to the staging DB/S3 bucket and ran zarr ingest on the largest zarr dandiset in staging (~500 GB), and saw no abnormal memory usage. I also didn't see any of the new Zarr logging that I added recently (#1128) in the production papertrail logs during the last couple of times the worker crashed. So, it would seem the zarr ingestion isn't causing these memory errors.

The other task that worker is responsible for is SHA-256 checksum calculation, so I tested that out locally with the staging DB/S3 as well. I had previously tested it on some local data and saw no memory issues. However, when I ran it on https://gui-staging.dandiarchive.org/dandiset/101392 from staging, the memory usage of my celery worker started increasing gradually over time. I tracked the memory usage over 15 mins and graphed it:
[Figure 1: Celery worker memory usage over the 15-minute run, climbing steadily]

The main issue appears to be this loop: https://github.com/dandi/dandi-archive/blob/master/dandiapi/api/tasks/__init__.py#L44-L47. That loop iterates over every Asset associated with the AssetBlob being checksummed. Recall that Assets are immutable: whenever a user modifies an Asset, a new one is minted and the old one remains in place. So an AssetBlob can have many Assets, depending on how many times it has been modified (which is the case with the staging dandiset I ran this on, but not with my local data, which is why I missed this at first), and the checksum task currently iterates over all of them and validates each one's metadata, which would definitely cause high memory usage if there are too many.

I think we can do two things off the bat to mitigate this (it's possible these might even fix the problem altogether):

  1. Use .iterator() on that loop's queryset
  2. Delay those asset validation tasks instead of running them in the current worker thread
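
A minimal sketch of what those two changes could look like; the names asset_blob and validate_asset_metadata here follow the description above and are not copied from the linked file:

# Hedged sketch only, not the actual dandi-archive code.
# 1. .iterator() streams rows from the database instead of materializing the
#    whole queryset in memory at once.
# 2. .delay() hands each validation off to the Celery queue instead of running
#    it inline in the checksum worker.
for asset in asset_blob.assets.iterator():
    validate_asset_metadata.delay(asset.id)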

I'm also unsure if every Asset linked to the AssetBlob being checksummed needs to be validated... it seems to me that we would only need to validate the most recent one (i.e. the one that is currently linked to a Version), but I could be wrong on that.

@mvandenburgh mvandenburgh self-assigned this Jun 28, 2022
@yarikoptic (Member, Author)

Nice digging!!! I am not certain, though, why validation in that loop would incrementally increase memory consumption. Wouldn't gc pick up / clean the assets from previous iterations? Could we explicitly del the asset at the end of the loop body? Is there a way to run it under the memray profiler to see exactly which objects occupy that memory?
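
For reference, one way to capture that with memray (a sketch; calculate_sha256 below is a hypothetical stand-in for whatever entry point is being profiled):

from memray import Tracker

# Wrap the suspect code path; memray writes its allocation capture to the file.
with Tracker("checksum-task.bin"):
    calculate_sha256(blob_id)  # stand-in for the real task body

# Then render it with:  memray flamegraph checksum-task.bin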

@yarikoptic (Member, Author)

Or is it, as you hint in 1., that it is a list of instances instead of an iterator?! (Also, now that we seem to need just the id of each asset, could we get a list of those instead?)

@waxlamp (Member) commented Jun 29, 2022

@mvandenburgh, it seems you could test the .iterator() fix fairly easily by deploying a small PR to staging and then running your local test again. Is that right? If it doesn't solve the problem, we can revert the PR.

More generally, if you can contrive a dandiset with as many assets as 101392, perhaps you could simply run the full test locally?

@yarikoptic (Member, Author)

I'm also unsure if every Asset linked to the AssetBlob being checksummed needs to be validated... it seems to me that we would only need to validate the most recent one (i.e. the one that is currently linked to a Version), but I could be wrong on that.

  • indeed, we should gc all those old assets no longer associated with any version -- they just add to the processing for no good value! Where do we stand on that (sorry -- I forgot whether gc was ever deployed)?
  • so we need to loop only through assets associated with some version (see the sketch after this list)
  • (re)validation is needed since, with the addition of the sha256 checksum (which should not change/mint a new asset -- one of the "special cases"), an asset might become valid. Related: oddly enough, I think one asset might become valid while another for the same blob does not, if its metadata record differs for some reason (different version, bad client, etc.)
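
A rough sketch of that narrower loop; the "versions" relation name on Asset is an assumption here, not taken from the codebase:

# Hypothetical: only validate assets still attached to at least one version.
assets_to_validate = asset_blob.assets.filter(versions__isnull=False).distinct()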

@mvandenburgh (Member)

Nice digging!!! I am not certain, though, why validation in that loop would incrementally increase memory consumption. Wouldn't gc pick up / clean the assets from previous iterations? Could we explicitly del the asset at the end of the loop body? Is there a way to run it under the memray profiler to see exactly which objects occupy that memory?

I ran it through memray, and most of the memory is being used by the JSON decoder from the Python standard library. I assume this is related to the pydantic model serialization. Calling .delay() on those validation tasks to dispatch them to another worker seems to solve the problem.

Here are the flamegraphs from memray if anyone is interested.
memray.zip

Or is it, as you hint in 1., that it is a list of instances instead of an iterator?! (Also, now that we seem to need just the id of each asset, could we get a list of those instead?)

Yes, currently it is just a list of Asset instances. Based on the memray profile, the asset validation appears to be the main culprit here, but to make things even more memory-efficient I'll turn it into an iterator that yields only the asset ids instead of entire Asset objects.
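
A sketch of that refinement, with the same assumed names as in the earlier sketch:

# Stream only primary keys from the database rather than full Asset instances.
for asset_id in asset_blob.assets.values_list("id", flat=True).iterator():
    validate_asset_metadata.delay(asset_id)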

I'm also unsure if every Asset linked to the AssetBlob being checksummed needs to be validated... it seems to me that we would only need to validate the most recent one (i.e. the one that is currently linked to a Version), but I could be wrong on that.

  • indeed, we should gc all those old assets no longer associated with any version -- they just add to the processing for no good value! Where do we stand on that (sorry -- I forgot whether gc was ever deployed)?

Garbage collection hasn't been deployed yet, no.

@mvandenburgh (Member) commented Jun 29, 2022

@mvandenburgh, it seems you could test the .iterator() fix fairly easily by deploying a small PR to staging and then running your local test again. Is that right? If it doesn't solve the problem, we can revert the PR.

More generally, if you can contrive a dandiset with as many assets as 101392, perhaps you could simply run the full test locally?

I tested it locally with my dev server hooked up to the staging DB/S3. .iterator() slightly decreases the memory usage (it would have a larger impact as the number of assets grows), but I found that delaying the asset validation tasks fixed the bulk of the memory consumption (which is consistent with the memory profiling I mentioned in my previous comment).

@yarikoptic (Member, Author)

Here are the flamegraphs from memray if anyone is interested.
memray.zip

So if I read them right, it seems about 50MB comes from reading in (likely) the dandi JSON schema, since there is that requests.get() at https://github.com/dandi/dandi-schema/blob/master/dandischema/metadata.py#L184?
[flamegraph screenshot highlighting the requests.get allocation]

Where does the 500MB come from? Validation running in parallel, with each thread loading its own copy of the schema (which would also waste traffic, I guess)?

@mvandenburgh (Member)

Here are the flamegraphs from memray if anyone is interested.
memray.zip

So if I read them right, it seems about 50MB comes from reading in (likely) the dandi JSON schema, since there is that requests.get() at https://github.com/dandi/dandi-schema/blob/master/dandischema/metadata.py#L184?

Where does the 500MB come from? Validation running in parallel, with each thread loading its own copy of the schema (which would also waste traffic, I guess)?

Yes. Each celery worker runs up to 4 task threads at once, and those two flamegraphs are from 2 tasks that were running concurrently. There's also some memory overhead to running celery itself; on my local system it consumes ~100MB while idling with no tasks in the queue.
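
For context, that per-worker concurrency is a Celery setting; a minimal sketch of where the number 4 would live (the app name here is illustrative, not the project's actual config):

from celery import Celery

app = Celery("dandiapi")
# Up to four tasks per worker process, matching the figure mentioned above.
app.conf.worker_concurrency = 4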

@jjnesbitt (Member)

@mvandenburgh Could you post a graph of memory usage over time (like #1095 (comment)) after the fix from #1139? It'd be good to have a direct comparison.

@mvandenburgh (Member)

@AlmightyYakob -- I ran this over the same amount of time (15 mins) on the same dandiset:

[Figure 2: Celery worker memory usage after the fix from #1139; usage levels off around 370MB]

There's still ~370MB being used, but at least it stops climbing after a while.

@dandibot (Member)

🚀 Issue was released in v0.2.27 🚀

@dandibot dandibot added the released This issue/pull request has been released. label Jun 29, 2022
@yarikoptic (Member, Author)

I am still a bit "surprised" by 50MB of usage for requests.get (per process), if that is what consumes so much that validation needs 370MB, while likely also doing all those requests.get calls to GitHub all the time... Just FTR, ATM we have only about

(git)lena:~/proj/dandi/schema[master]releases/0.6.3
$> du -sck *
56	asset.json
12	context.json
36	dandiset.json
60	published-asset.json
40	published-dandiset.json
204	total

204 kB of schema in there, and I did not spot any major memory utilization while loading them straight from disk:

(git)lena:~/proj/dandi/schema[master]releases/0.6.3
$> python -c 'import json,resource,requests; from glob import glob; print(resource.getrusage(resource.RUSAGE_SELF)); j=[json.load(open(f)) for f in glob("*.json")]; print(f"Loaded {len(j)}"); print(resource.getrusage(resource.RUSAGE_SELF))'
resource.struct_rusage(ru_utime=0.116075, ru_stime=0.024014999999999998, ru_maxrss=25168, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=4998, ru_majflt=3, ru_nswap=0, ru_inblock=2680, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=138, ru_nivcsw=1)
Loaded 5
resource.struct_rusage(ru_utime=0.117529, ru_stime=0.024014999999999998, ru_maxrss=25168, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=5166, ru_majflt=3, ru_nswap=0, ru_inblock=2680, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=138, ru_nivcsw=1)

or from GitHub, as we do:

$> python -c 'import json,resource,requests; from glob import glob; print(resource.getrusage(resource.RUSAGE_SELF)); j=[requests.get(f"https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.3/{f}").json() for f in glob("*.json")]; print(f"Loaded {len(j)}"); print(resource.getrusage(resource.RUSAGE_SELF))'    
resource.struct_rusage(ru_utime=0.10937, ru_stime=0.012152, ru_maxrss=25396, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=5092, ru_majflt=0, ru_nswap=0, ru_inblock=32, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=0, ru_nivcsw=18)
Loaded 5
resource.struct_rusage(ru_utime=0.20646899999999999, ru_stime=0.028339, ru_maxrss=26276, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=5370, ru_majflt=0, ru_nswap=0, ru_inblock=552, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=35, ru_nivcsw=19)

and also when I ran memray on this script version of the above:
import resource
from glob import glob

import requests

print(resource.getrusage(resource.RUSAGE_SELF))

j = []
for f in glob("*.json"):
    j.append(
        requests.get(f"https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.3/{f}").json()
    )

print(f"Loaded {len(j)}")

print(resource.getrusage(resource.RUSAGE_SELF))

the memray flamegraph just showed ~10M due to the import of requests, and nothing about those actual requests.get calls.

So maybe the location memray suggested was a red herring... but regardless -- to validate each asset we fetch the schema file from GitHub over and over again... I wonder if we could/should somehow cache that, to gain speed and reduce the amount of network chatter.
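
One possible shape for such a cache, sketched as a hypothetical helper (the real fetch lives in dandischema at the line linked earlier and is not reproduced here):

from functools import lru_cache

import requests


@lru_cache(maxsize=None)
def fetch_schema(url: str) -> dict:
    # Fetch a released schema file once per process and reuse the parsed JSON
    # on subsequent validations instead of re-downloading it from GitHub.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

Assuming released schema files do not change after publication, a per-process cache like this (or bundling the schemas with the package) should be safe and would cut both memory churn and network traffic.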
