Memory problem in parallel processing while using uproot 4, it wasn't the case in uproot 3 #277
Comments
Same clarification as on StackOverflow: I said that, for all I know, it could be a small difference, like 10%. I didn't say that it is or should be 10%. The statement was about my lack of knowledge.
I guess a GitHub discussion would have been a better choice than StackOverflow, since this is closer to an actual discussion than to a problem with a definite answer 😉

I had a quick look and tried the code above (with 8 executors). The minimum (uncompressed) data size of the resulting `branches` DataFrame is about 4.9 GB:

```python
>>> sum(dt.itemsize for dt in branches.dtypes) * branches.shape[0] / 1024**3
4.912072330713272
```

You should in principle be able to do the decompression and conversion while measuring the peak memory usage like this:

```python
import sys
import resource

import uproot

def peak_memory_usage():
    """Return peak memory usage in MB."""
    mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    factor_mb = 1 / 1024
    if sys.platform == "darwin":
        factor_mb = 1 / (1024 * 1024)
    return mem * factor_mb

f = uproot.open("10k_events_PFSimplePlainTree.root")
f["PlainTree"].arrays()
print(peak_memory_usage())
```

Also please note that Python is garbage collected and sometimes a bit lazy about cleaning things up, so your mileage may vary and peak memory usage might give weird results. Do you have an actual comparison of the peak memory against your uproot 3 version? I don't know much about the internals of the library.

Anyway, if memory is an issue, I'd definitely cite Jim, who already pointed out that parallel I/O usually trades memory for speed. This could explain the higher memory footprint (however, it is still a bit of a mystery why the footprint is the same for 1 or 8 executors). Regarding the speed: you might not gain much benefit from parallel processes due to a GIL fight, which might even be responsible for a slowdown in some cases; see for example the wonderful work of David Beazley (http://www.dabeaz.com/GIL/), which is from 2010 but still valid.

My current conclusion is: don't use parallel executors at all and simply go with the defaults. I am still curious about the footprint of the uproot3 parallel approach.
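The GIL point can be illustrated with a small sketch (not from this thread; `busy` is a stand-in for any pure-Python, CPU-bound workload):

```python
# Pure-Python CPU-bound work gains nothing from threads, because only one
# thread can hold the GIL at a time.
import time
from concurrent.futures import ThreadPoolExecutor

def busy(n):
    s = 0
    for i in range(n):
        s += i * i
    return s

N = 2_000_000

t0 = time.perf_counter()
busy(N)
busy(N)
serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(2) as ex:
    results = list(ex.map(busy, [N, N]))
threaded = time.perf_counter() - t0

# `threaded` is typically no faster than `serial` here; C extensions that
# release the GIL (like uproot's decompression routines) are a different story.
```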
I am sorry for misquoting you; it is due to my English. It is my third language, and sometimes I confuse "may" and "should". I edited my posts again.
There's another doubling because all of the little arrays from the TBaskets have to be concatenated into a big array representing the whole TBranch. That particular doubling would not scale with the number of executors, since each task is responsible for a disjoint subset of the TBaskets and the single output array is common to all. So that's 1× 5GB for the individual TBaskets, 1× 5GB for the resulting array, and 1× 5GB for the Pandas DataFrame = 15GB. That's pretty close to 18GB.
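Spelling out the arithmetic of that accounting (all figures in GB; the ~4.9 GB uncompressed size was computed earlier in the thread):

```python
uncompressed = 4.912            # GB, minimum uncompressed data size from earlier
tbasket_arrays = uncompressed   # 1x: little arrays decompressed from the TBaskets
concatenated = uncompressed     # 1x: one big array per TBranch after concatenation
dataframe = uncompressed        # 1x: the Pandas DataFrame copy
total = tbasket_arrays + concatenated + dataframe
print(total)  # about 14.7 GB, in the same ballpark as the observed 18 GB
```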
If you have the memory available, the garbage collector won't bother cleaning it up, so you could get different results on a machine with a lower ceiling. It's not smart enough to delay tasks until previous ones are done and their garbage gets collected.
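As a generic illustration of that laziness (plain Python, nothing uproot-specific): dropping the last reference frees refcounted memory immediately in CPython, while reference cycles wait for the collector unless you force it:

```python
import gc

big = [bytearray(1024) for _ in range(10_000)]  # roughly 10 MB of buffers
del big            # last reference gone: CPython frees this memory right away
unreachable = gc.collect()  # force a sweep of any lingering reference cycles
print(unreachable)          # number of unreachable objects found
```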
I took a quick look at how the DataFrame gets built from those Series. While writing this comment, I got sucked into it and investigated. (There's a few hours between this sentence and the previous.)

There were a few places where TBaskets (with associated raw data), arrays from TBaskets, and arrays before computing expressions could be deleted, which lowers the overall memory use before approaching Pandas; these are the data that need to be collected by the garbage collector.

None of this had anything to do with parallelizing execution. The parallel processing was completely done and we were back on a single thread before any unnecessary data could be deleted. Parallelizing the decompression does make it considerably faster, so this file is in the regime of spending most of its time in the GIL-released decompression routines.
The PR that trims memory usage is #281, so if that works for you, this issue can be closed.
OK that's a nice wrap-up!
Exactly, thanks for the better explanation. I had issues with this in the past quite often, where users were stuck debugging memory leaks; instead, you could easily limit the (V)RAM of the process and everything was "fine", so it was a non-problem. It is definitely a (let's call it) Python feature which can confuse and mislead people. You made my day with that one. Now it's time for @shahidzk1 to try #281. Unfortunately I cannot redo the test on the same machine currently, because I am running a processing chain and need every possible memory address...
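The "limit the (V)RAM of the process" approach can be done from inside Python with `resource.setrlimit` (a sketch for Unix systems; the 4 GB cap is an arbitrary example, not a value from this thread):

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_AS)
cap = 4 * 1024**3  # 4 GB address-space cap: allocations beyond it raise MemoryError
if hard != resource.RLIM_INFINITY:
    cap = min(cap, hard)  # the soft limit may not exceed the hard limit
resource.setrlimit(resource.RLIMIT_AS, (cap, hard))
```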
I have this ROOT file, available on Google Drive at this link. When I used to convert it to arrays in uproot 3 using parallel processing, it took less time and memory. The code I was using was something like:
```python
import pandas as pd
import uproot
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(8)
branches = pd.DataFrame.from_dict(
    uproot.open(file_with_path)[tree_name].arrays(namedecode="utf-8", executor=executor)
)
```
But now, in uproot 4, it consumes all my memory; maybe I am not doing it properly. Could you please have a look at it? It is also not as fast as it used to be.
```python
import uproot
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(8)
input_tree = uproot.open(
    "/path/10k_events_PFSimplePlainTree.root:PlainTree",
    decompression_executor=executor,
)
branches = input_tree.arrays(library="pd", decompression_executor=executor)
```
@jpivarski and I discussed this in the issue at this link, and he suggested that it may be just 10% more memory, but it is more than 10% for me, maybe 60–80% more.