fix: pandas performance on files with many branches #1086
The example file is small: it has 1000 branches, but they're not filled with much data. Unfortunately, we can't keep large test files in our CI, so this can only be tested manually. I found a CMS NanoAOD file (which I can share with you privately). By selecting [...]

import uproot, time
tree = uproot.open("Run2018D-DoubleMuon-Nano25Oct2019"
                   "_ver2-v1-974F28EE-0FCE-4940-92B5-870859F880B1.root:Events")
tick = time.perf_counter()
for _ in tree.iterate(filter_typename="bool", step_size=100000, library="np"):
    tock = time.perf_counter()
    print(tock - tick)
    tick = tock

prints times on average 3.4 ± 0.1 sec, the time needed to read the arrays at all (using NumPy). Swapping [...] In [...]

I found another file, from issue #288, which has a lot of large, simple-typed branches that can be selected with

filter_typename=["double", "/double\[[0-9]+\]/", "bool", "/bool\[[0-9]+\]/"]

It takes [...] The [...]

--- a/src/uproot/interpretation/library.py
+++ b/src/uproot/interpretation/library.py
@@ -856,7 +856,9 @@ class Pandas(Library):
         elif isinstance(how, str) or how is None:
             arrays, names = _pandas_only_series(pandas, arrays, expression_context)
-            return pandas.DataFrame(data=arrays, columns=names)
+            out = pandas.concat(arrays, axis=1, ignore_index=True)
+            out.columns = names
+            return out
         else:
             raise TypeError(

it now takes 2.6 ± 0.1 sec for [...] Going back to the NanoAOD file, it's 3.1 ± 0.2 sec with [...] So the bottom line is that using the [...]
After adjusting the code so that it passes all tests (54cab87), the time with the file from issue #288 is still 2.6 ± 0.1 sec with ... until I manually reverted the code to pandas.DataFrame(data=arrays, columns=names), just to be sure that the difference is persistent and that I'm not imagining things. It is definitely the case that Pandas is doing something bad when we run its constructor. (Surely that was the first thing that I tried, way back when, and encountered some bad behavior that made me write [...]

Oh, if I run [...]
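For anyone who wants to poke at the constructor-versus-concat difference without a large ROOT file, here is a minimal sketch. It assumes the branches are already pandas Series held in a dict keyed by column name. The helpers make_series, via_constructor, via_concat, and timed are hypothetical names, not part of Uproot, and the random data is only a stand-in for real branches. It does not claim which form is faster on a given pandas version; it only builds the same DataFrame both ways and times each.

```python
import time

import numpy as np
import pandas as pd


def make_series(n_columns=1000, n_rows=1000):
    """Build a dict of equal-length Series as a stand-in for TTree branches."""
    rng = np.random.default_rng(0)
    names = [f"branch_{i}" for i in range(n_columns)]
    arrays = {name: pd.Series(rng.random(n_rows)) for name in names}
    return arrays, names


def via_constructor(arrays, names):
    # Pass everything to the DataFrame constructor in one call.
    return pd.DataFrame(data=arrays, columns=names)


def via_concat(arrays, names):
    # Concatenate the Series side by side, then assign the column labels.
    # A list is passed here (an assumption of this sketch) so that the
    # column order is explicitly the order of `names`.
    out = pd.concat([arrays[name] for name in names], axis=1, ignore_index=True)
    out.columns = names
    return out


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call of func."""
    tick = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - tick


if __name__ == "__main__":
    arrays, names = make_series()
    frame_a, seconds_a = timed(via_constructor, arrays, names)
    frame_b, seconds_b = timed(via_concat, arrays, names)
    print(f"constructor: {seconds_a:.3f} sec")
    print(f"concat:      {seconds_b:.3f} sec")
    print("identical:", frame_a.equals(frame_b))
```

Synthetic float columns like these may not reproduce the slowdown seen with the real files, so the numbers are only meaningful when compared on the same machine and pandas version, ideally with n_columns and n_rows chosen to resemble the file in question.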
I think this is ready to merge. I'll let you merge it, in case you want to do any more counter-edits or tests.
No description provided.