Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: supply a pre-calculated base form to avoid file opening #1077

Merged
merged 1 commit into from
Jan 3, 2024

Conversation

lgray
Copy link
Contributor

@lgray lgray commented Jan 2, 2024

Add known_base_form option (awkward.forms.Form | None) to uproot.dask so that opening root files at the client can be avoided for mature analyses where the base forms of all files are well known. This is pertinent for root tree with large numbers of keys like NanoAOD.

Right now it is mutually exclusive with open_files=True, since it brings no advantage in that case.

An initial implementation is included.

I tested the speedup with the following code:

import uproot
import json
import gzip
import awkward

import pyinstrument

if __name__ == "__main__":

    prof = pyinstrument.Profiler()
    prof.start()

    with gzip.open("nano_form.json.gz", "r") as in_form:
	known_form = awkward.forms.from_json(in_form.read())

    for _ in range(20):
	events = uproot.dask({"tests/samples/nano_dy.root": {"object_path": "Events", "steps": [[0,40]]}}, open_files=False, known_base_form=known_form)

    prof.stop()
    print(prof.output_text(unicode=True, color=True, show_all=True))

    prof = pyinstrument.Profiler()
    prof.start()

    for _ in range(20):
        events = uproot.dask({"tests/samples/nano_dy.root": {"object_path": "Events", "steps": [[0,40]]}}, open_files=False)

    prof.stop()
    print(prof.output_text(unicode=True, color=True, show_all=True))

Which on a arm64 macbook pro with SSD gives about a 6x speedup. The gain will be enormous when using xrootd.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. At the last minute, I dithered on this, thinking that we could solve the underlying problem by optionally incorporating Numba into the TBranch and TLeaf-deserialization process, since these are the two classes that are especially numerous for TTree metadata. In the end, though, I saw that it would have to go through read_object_any, which builds a generic object graph by reference. It would be a bigger project than the timescale on which you need this.

I'll merge this now and release 5.2.1.

@jpivarski jpivarski merged commit 9b475eb into scikit-hep:main Jan 3, 2024
21 checks passed
lobis added a commit that referenced this pull request Jan 22, 2024
* origin/main: (41 commits)
  chore: hatch-vcs generates 'version.py'; add it to .gitignore
  docs: add bnavigator as a contributor for test (#1087)
  feat: support for writing hist derived profiles (#1000)
  feat: add the ability to read RNTuple alias columns (#1004)
  chore: update pre-commit hooks (#1082)
  docs: fix ReadTheDocs documentation (#1084)
  chore: update pre-commit hooks (#1073)
  chore(deps): bump actions/upload-artifact from 3 to 4 (#1071)
  chore(deps): bump actions/download-artifact from 3 to 4 (#1072)
  build: change build to autogen version info (#1062)
  chore: remove src/uproot/version.py (for #1062).
  The next release will be 5.2.1.
  add known_base_form option so that opening root files can be avoided for mature analyses (#1077)
  test: fsspec cache (#1075)
  test: xrootd server fixture (#1076)
  The next release will be 5.2.0.
  fix: correct typo in fsspec globbing (#1067)
  chore(deps): bump actions/setup-python from 4 to 5 (#1059)
  The next release will be 5.2.0rc5.
  fix: dask distributed fsspec issue (#1065)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants