heroku[checksum-worker.1]: Error R14 (Memory quota exceeded) #1095

Closed
yarikoptic opened this issue May 11, 2022 · 14 comments · Fixed by #1139

Labels: released (This issue/pull request has been released.)

@yarikoptic (Member)

Didn't analyze whether this relates to prior reports (https://github.com/dandi/dandi-archive/search?q=Memory+quota+exceeded&type=issues), but ATM we have, over the past 12 hours (if I got the timing right), over a hundred such messages:

(base) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -h 'Memory quota' 20220510-2201.log | nl | tail -n 2
   162	2022-05-11T03:00:19.341006+00:00 heroku[checksum-worker.1]: Error R14 (Memory quota exceeded)
   163	2022-05-11T03:00:41.016410+00:00 heroku[checksum-worker.1]: Error R14 (Memory quota exceeded)

and apparently it is something relatively new (log archival on drogon was "defunct" for a while, but we did get it back):

(base) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -c 'Memory quota' 202205*
20220506-0924.log:0
20220506-0934.log:0
20220506-1001.log:0
20220506-1401.log:0
20220507-1701.log:0
20220508-1901.log:0
20220509-1801.log:208
20220509-2001.log:180
20220510-2201.log:163

Someone needs to analyze whether that had an actual impact... There are 2802 invalid assets listed on https://api.dandiarchive.org/dashboard/, and the few that are shown are Pending and have no checksum, so they might be the ones affected?

@yarikoptic (Member, Author)

Hitting heavily! We had some log-dumping issues, so there could be misses and/or duplicates (across files), but ATM:

...
20220525-1601.log:7
20220525-1701.log:11
20220525-1801.log:327
20220525-1901.log:62
20220525-2001.log:217
20220525-2101.log:3260
20220525-2201.log:3077
20220525-2301.log:2904
20220526-0001.log:2719
20220526-0101.log:2542
20220526-0201.log:2363
20220526-0301.log:2186
20220526-0401.log:2006
20220526-0501.log:1871
20220526-0601.log:1871
20220526-0701.log:1633
20220526-0801.log:1280
20220526-0901.log:863
20220526-1001.log:415
20220526-1052.log:13
20220526-1055.log:267

Someone really needs to run memray on that script, etc.

@mvandenburgh (Member) commented Jun 28, 2022

I think I found the underlying cause of the memory errors. I connected my dev environment to the staging DB/S3 bucket and ran zarr ingest on the largest zarr dandiset in staging (~500 GB), and saw no abnormal memory usage. I also didn't see any of the new Zarr logging that I added recently (#1128) in the production papertrail logs during the last couple of times the worker crashed. So, it would seem the zarr ingestion isn't causing these memory errors.

The other task that worker is responsible for is SHA-256 checksum calculation, so I tested that out locally with the staging DB/S3 as well. I had previously tested it on some local data and saw no memory issues. However, when I ran it on https://gui-staging.dandiarchive.org/dandiset/101392 from staging, the memory usage of my celery worker started increasing gradually over time. I tracked the memory usage over 15 mins and graphed it:
[Figure 1: Celery worker memory usage over the 15-minute run, climbing steadily]

The main issue appears to be this loop: https://github.com/dandi/dandi-archive/blob/master/dandiapi/api/tasks/__init__.py#L44-L47. That loop iterates over every Asset associated with the AssetBlob being checksummed. Recall that Assets are immutable: whenever a user modifies an Asset, a new one is minted and the old one remains in place. So an AssetBlob can have many Assets, depending on how many times it has been modified (which is the case with the staging dandiset I ran this on, but not with my local data, which is why I missed this at first), and the checksum task currently iterates over all of them and validates each one's metadata, which would definitely cause high memory usage if there are too many.

I think we can do two things off the bat to mitigate this (it's possible these might even fix the problem altogether):

  1. Use .iterator() on that loop's queryset
  2. Delay those asset validation tasks instead of running them in the current worker thread
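
A minimal sketch of what those two changes could look like; the names asset_blob and validate_asset_metadata here follow the description above and are not copied from the linked file:

# Hedged sketch only, not the actual dandi-archive code.
# 1. .iterator() streams rows from the database instead of materializing the
#    whole queryset in memory at once.
# 2. .delay() hands each validation off to the Celery queue instead of running
#    it inline in the checksum worker.
for asset in asset_blob.assets.iterator():
    validate_asset_metadata.delay(asset.id)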

I'm also unsure if every Asset linked to the AssetBlob being checksummed needs to be validated... it seems to me that we would only need to validate the most recent one (i.e. the one that is currently linked to a Version), but I could be wrong on that.

@mvandenburgh mvandenburgh self-assigned this Jun 28, 2022
@yarikoptic (Member, Author)

Nice digging!!! I am not certain, though, why validation in that loop would incrementally increase memory consumption. Wouldn't gc pick up / clean the assets from previous iterations? Could we explicitly del the asset at the end of the loop body? Is there a way to run it under the memray profiler to see exactly which objects occupy that memory?
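
For reference, one way to capture that with memray (a sketch; calculate_sha256 below is a hypothetical stand-in for whatever entry point is being profiled):

from memray import Tracker

# Wrap the suspect code path; memray writes its allocation capture to the file.
with Tracker("checksum-task.bin"):
    calculate_sha256(blob_id)  # stand-in for the real task body

# Then render it with:  memray flamegraph checksum-task.bin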

@yarikoptic (Member, Author)

Or is it, as you hint in 1., that it is a list of instances instead of an iterator?! (Also, now that we seem to need just the id of each asset, could we get a list of those instead?)

@waxlamp (Member) commented Jun 29, 2022

@mvandenburgh, it seems you could test the .iterator() fix fairly easily by deploying a small PR to staging and then running your local test again. Is that right? If it doesn't solve the problem, we can revert the PR.

More generally, if you can contrive a dandiset with as many assets as 101392, perhaps you could simply run the full test locally?

@yarikoptic (Member, Author)

I'm also unsure if every Asset linked to the AssetBlob being checksummed needs to be validated... it seems to me that we would only need to validate the most recent one (i.e. the one that is currently linked to a Version), but I could be wrong on that.

  • indeed, we should gc all those old assets no longer associated with any version -- they just add to the processing for no good value! Where do we stand on that (sorry -- I forgot whether gc was ever deployed)?
  • so we need to loop only through assets associated with some version (see the sketch after this list)
  • (re)validation is needed since, with the addition of the sha256 checksum (which should not change/mint a new asset -- one of the "special cases"), an asset might become valid. Related: oddly enough, I think one asset might become valid while another for the same blob does not, if its metadata record differs for some reason (different version, bad client, etc.)
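
A rough sketch of that narrower loop; the "versions" relation name on Asset is an assumption here, not taken from the codebase:

# Hypothetical: only validate assets still attached to at least one version.
assets_to_validate = asset_blob.assets.filter(versions__isnull=False).distinct()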

@mvandenburgh (Member)

Nice digging!!! I am not certain, though, why validation in that loop would incrementally increase memory consumption. Wouldn't gc pick up / clean the assets from previous iterations? Could we explicitly del the asset at the end of the loop body? Is there a way to run it under the memray profiler to see exactly which objects occupy that memory?

I ran it through memray, and most of the memory is being used by the JSON decoder from the Python standard library. I assume this is related to the pydantic model serialization. Calling .delay() on those validation tasks to dispatch them to another worker seems to solve the problem.

Here are the flamegraphs from memray if anyone is interested.
memray.zip

Or is it, as you hint in 1., that it is a list of instances instead of an iterator?! (Also, now that we seem to need just the id of each asset, could we get a list of those instead?)

Yes, currently it is just a list of Asset instances. Based on the memray profile, the asset validation appears to be the main culprit here, but to make things even more memory-efficient I'll turn it into an iterator that yields only the asset ids instead of entire Asset objects.
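
A sketch of that refinement, with the same assumed names as in the earlier sketch:

# Stream only primary keys from the database rather than full Asset instances.
for asset_id in asset_blob.assets.values_list("id", flat=True).iterator():
    validate_asset_metadata.delay(asset_id)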

I'm also unsure if every Asset linked to the AssetBlob being checksummed needs to be validated... it seems to me that we would only need to validate the most recent one (i.e. the one that is currently linked to a Version), but I could be wrong on that.

  • indeed, we should gc all those old assets no longer associated with any version -- they just add to the processing for no good value! Where do we stand on that (sorry -- I forgot whether gc was ever deployed)?

Garbage collection hasn't been deployed yet, no.

@mvandenburgh (Member) commented Jun 29, 2022

@mvandenburgh, it seems you could test the .iterator() fix fairly easily by deploying a small PR to staging and then running your local test again. Is that right? If it doesn't solve the problem, we can revert the PR.

More generally, if you can contrive a dandiset with as many assets as 101392, perhaps you could simply run the full test locally?

I tested it locally with my dev server hooked up to the staging DB/S3. .iterator() slightly decreases the memory usage (it would have a larger impact as the number of assets grows), but I found that delaying the asset validation tasks fixed the bulk of the memory consumption (which is consistent with the memory profiling I mentioned in my previous comment).

@yarikoptic (Member, Author)

Here are the flamegraphs from memray if anyone is interested.
memray.zip

So if I read them right, it seems about 50MB comes from reading in (likely) the dandi JSON schema, since there is that requests.get() at https://github.com/dandi/dandi-schema/blob/master/dandischema/metadata.py#L184?
[flamegraph screenshot highlighting the requests.get allocation]

Where does the 500MB come from? Validation running in parallel, with each thread loading its own copy of the schema (which would also waste traffic, I guess)?

@mvandenburgh (Member)

Here are the flamegraphs from memray if anyone is interested.
memray.zip

So if I read them right, it seems about 50MB comes from reading in (likely) the dandi JSON schema, since there is that requests.get() at https://github.com/dandi/dandi-schema/blob/master/dandischema/metadata.py#L184?

Where does the 500MB come from? Validation running in parallel, with each thread loading its own copy of the schema (which would also waste traffic, I guess)?

Yes. Each celery worker runs up to 4 task threads at once, and those two flamegraphs are from 2 tasks that were running concurrently. There's also some memory overhead to running celery itself; on my local system it consumes ~100MB while idling with no tasks in the queue.
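
For context, that per-worker concurrency is a Celery setting; a minimal sketch of where the number 4 would live (the app name here is illustrative, not the project's actual config):

from celery import Celery

app = Celery("dandiapi")
# Up to four tasks per worker process, matching the figure mentioned above.
app.conf.worker_concurrency = 4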

@jjnesbitt (Member)

@mvandenburgh Could you post a graph of memory usage over time (like #1095 (comment)) after the fix from #1139? It'd be good to have a direct comparison.

@mvandenburgh (Member)

@AlmightyYakob -- I ran this over the same amount of time (15 mins) on the same dandiset:

[Figure 2: Celery worker memory usage after the fix from #1139; usage levels off around 370MB]

There's still ~370MB being used, but at least it stops climbing after a while.

@dandibot (Member)

🚀 Issue was released in v0.2.27 🚀

@dandibot dandibot added the released This issue/pull request has been released. label Jun 29, 2022
@yarikoptic (Member, Author)

I am still a bit "surprised" by 50MB of usage for requests.get (per process), if that is what consumes so much that validation needs 370MB, while likely also doing all those requests.get calls to GitHub all the time... Just FTR, ATM we have only about

(git)lena:~/proj/dandi/schema[master]releases/0.6.3
$> du -sck *
56	asset.json
12	context.json
36	dandiset.json
60	published-asset.json
40	published-dandiset.json
204	total

204 kB of schema in there, and I did not spot any major memory utilization while loading them straight from disk:

(git)lena:~/proj/dandi/schema[master]releases/0.6.3
$> python -c 'import json,resource,requests; from glob import glob; print(resource.getrusage(resource.RUSAGE_SELF)); j=[json.load(open(f)) for f in glob("*.json")]; print(f"Loaded {len(j)}"); print(resource.getrusage(resource.RUSAGE_SELF))'
resource.struct_rusage(ru_utime=0.116075, ru_stime=0.024014999999999998, ru_maxrss=25168, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=4998, ru_majflt=3, ru_nswap=0, ru_inblock=2680, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=138, ru_nivcsw=1)
Loaded 5
resource.struct_rusage(ru_utime=0.117529, ru_stime=0.024014999999999998, ru_maxrss=25168, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=5166, ru_majflt=3, ru_nswap=0, ru_inblock=2680, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=138, ru_nivcsw=1)

or from GitHub, as we do:

$> python -c 'import json,resource,requests; from glob import glob; print(resource.getrusage(resource.RUSAGE_SELF)); j=[requests.get(f"https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.3/{f}").json() for f in glob("*.json")]; print(f"Loaded {len(j)}"); print(resource.getrusage(resource.RUSAGE_SELF))'    
resource.struct_rusage(ru_utime=0.10937, ru_stime=0.012152, ru_maxrss=25396, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=5092, ru_majflt=0, ru_nswap=0, ru_inblock=32, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=0, ru_nivcsw=18)
Loaded 5
resource.struct_rusage(ru_utime=0.20646899999999999, ru_stime=0.028339, ru_maxrss=26276, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=5370, ru_majflt=0, ru_nswap=0, ru_inblock=552, ru_oublock=0, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=35, ru_nivcsw=19)

and also when I ran memray on this script version of the above:
import resource
from glob import glob

import requests

print(resource.getrusage(resource.RUSAGE_SELF))

j = []
for f in glob("*.json"):
    j.append(
        requests.get(f"https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.3/{f}").json()
    )

print(f"Loaded {len(j)}")

print(resource.getrusage(resource.RUSAGE_SELF))

the memray flamegraph just showed ~10M due to the import of requests, and nothing about those actual requests.get calls.

So maybe the location memray suggested was a red herring... but regardless -- to validate each asset we fetch the schema file from GitHub over and over again... I wonder if we could/should somehow cache that, to gain speed and reduce the amount of network chatter.
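
One possible shape for such a cache, sketched as a hypothetical helper (the real fetch lives in dandischema at the line linked earlier and is not reproduced here):

from functools import lru_cache

import requests


@lru_cache(maxsize=None)
def fetch_schema(url: str) -> dict:
    # Fetch a released schema file once per process and reuse the parsed JSON
    # on subsequent validations instead of re-downloading it from GitHub.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

Assuming released schema files do not change after publication, a per-process cache like this (or bundling the schemas with the package) should be safe and would cut both memory churn and network traffic.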
