Add support for current RNTuple files #928

Closed
eguiraud opened this issue Aug 9, 2023 · 2 comments · Fixed by #1004

eguiraud commented Aug 9, 2023

Due to some recent changes in RNTuple's format, uproot is not able to read RNTuples correctly anymore.

A reproducer:

python -c 'import uproot; print(uproot.__version__); uproot.open("https://xrootd-local.unl.edu:1094//store/user/AGC/nanoaod-rntuple/zstd/TT_TuneCUETP8M1_13TeV-powheg-pythia8/cmsopendata2015_ttbar_19980_PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext3-v1_00000_0000.root")["Events"].arrays(["nTau"])'

results in

5.0.10
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/blue/Repos/analysis-grand-challenge-iris-hep/analyses/cms-open-data-ttbar/venv-uproot5/lib/python3.11/site-packages/uproot/models/RNTuple.py", line 394, in arrays
    entry_stop = entry_stop or self._length
                               ^^^^^^^^^^^^
  File "/home/blue/Repos/analysis-grand-challenge-iris-hep/analyses/cms-open-data-ttbar/venv-uproot5/lib/python3.11/site-packages/uproot/models/RNTuple.py", line 180, in _length
    return sum(x.num_entries for x in self.cluster_summaries)
                                      ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/blue/Repos/analysis-grand-challenge-iris-hep/analyses/cms-open-data-ttbar/venv-uproot5/lib/python3.11/site-packages/uproot/models/RNTuple.py", line 175, in cluster_summaries
    return self.footer.cluster_summaries
           ^^^^^^^^^^^
  File "/home/blue/Repos/analysis-grand-challenge-iris-hep/analyses/cms-open-data-ttbar/venv-uproot5/lib/python3.11/site-packages/uproot/models/RNTuple.py", line 164, in footer
    f = FooterReader().read(self._footer_chunk, cursor, context)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/blue/Repos/analysis-grand-challenge-iris-hep/analyses/cms-open-data-ttbar/venv-uproot5/lib/python3.11/site-packages/uproot/models/RNTuple.py", line 697, in read
    out.extension_links = self.extension_header_links.read(chunk, cursor, context)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/blue/Repos/analysis-grand-challenge-iris-hep/analyses/cms-open-data-ttbar/venv-uproot5/lib/python3.11/site-packages/uproot/models/RNTuple.py", line 556, in read
    assert num_bytes < 0, f"num_bytes={num_bytes}"
           ^^^^^^^^^^^^^
AssertionError: num_bytes=36

IIUC, the Julia implementation already includes the necessary changes: https://github.com/JuliaHEP/UnROOT.jl/pull/264/files

EDIT:
maybe I should have marked it as a bug?

@eguiraud eguiraud added the feature New feature or request label Aug 9, 2023
@ioanaif ioanaif self-assigned this Aug 9, 2023
jpivarski (Member) commented

@ioanaif has started to look into this, and found that some of the changes are headers/metadata, while another is that variable-length integers and zig-zag encoding are now included in RNTuple.

Conversion to and from a variable-length integer format is not NumPy-vectorizable: it is necessary to write for loops to do this conversion. Here's what that looks like in Python:

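import numpy as np

def to_varint(data):
    # Encode each uint64 as a little-endian base-128 varint:
    # 7 payload bits per byte, high bit set on every byte except the last.
    assert issubclass(data.dtype.type, np.uint64)

    output = []
    for value in data:
        mask = np.uint64(0x7f)
        more = np.uint64(np.iinfo(np.uint64).max)
        # A uint64 needs at most 10 groups of 7 bits (ceil(64 / 7) == 10).
        for shift in np.arange(0, 7 * 10, 7, dtype=np.uint64):
            byte = ((value & mask) >> shift).astype(np.uint8)
            mask <<= np.uint64(7)
            more <<= np.uint64(7)

            # "more" now covers every bit above the group just extracted.
            if not (value & more):
                output.append(byte)
                break
            else:
                output.append(byte | np.uint8(0x80))

    return b"".join(output)

def from_varint(buffer):
    # Decode a byte string of concatenated varints into a uint64 array.
    data = []
    pos = 0
    while pos < len(buffer):
        shift = np.uint64(0)
        result = np.uint64(0)
        while True:
            byte = np.uint64(buffer[pos])
            pos += 1

            # An 11th byte cannot belong to a 64-bit value.
            if shift == 7 * 10:
                raise Exception("number is too big for uint64")

            result |= (byte & np.uint64(0x7f)) << shift
            shift += np.uint64(7)

            # A clear high bit marks the last byte of this integer.
            if not (byte & np.uint64(0x80)):
                break

        data.append(result)

    return np.array(data, dtype=np.uint64)

>>> data = np.array([0, 1, 2, 127, 128, 129, 130, 16383, 16384, 16385], np.uint64)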
>>> data.tolist()
[0, 1, 2, 127, 128, 129, 130, 16383, 16384, 16385]

>>> buffer = to_varint(data)
>>> buffer
b'\x00\x01\x02\x7f\x80\x01\x81\x01\x82\x01\xff\x7f\x80\x80\x01\x81\x80\x01'

>>> from_varint(buffer).tolist()
[0, 1, 2, 127, 128, 129, 130, 16383, 16384, 16385]
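
To make the byte string above easier to read, here is a minimal pure-Python sketch that encodes one integer at a time. It assumes the same layout as to_varint (seven payload bits per byte, least-significant group first, high bit set on every byte except the last). The name encode_varint is hypothetical and is not part of uproot or Awkward Array.

```python
def encode_varint(value):
    """Encode one non-negative integer as a little-endian base-128 varint."""
    if value < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while True:
        group = value & 0x7F   # lowest seven bits
        value >>= 7
        if value:              # more groups follow: set the continuation bit
            out.append(group | 0x80)
        else:                  # last group: continuation bit stays clear
            out.append(group)
            return bytes(out)


for n in (127, 128, 16383, 16384):
    print(n, encode_varint(n).hex(" "))
# 127 7f
# 128 80 01
# 16383 ff 7f
# 16384 80 80 01
```

Values up to 127 fit in one byte, values up to 16383 fit in two, and so on, which is why the ten-element example encodes to only 18 bytes.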

But we want to avoid Python for loops. AwkwardForth provides a way to do that when converting from a variable-length encoding into integers. Here's how it can be done:

>>> from awkward.forth import ForthMachine64
>>> vm = ForthMachine64("""
... input buffer
... output data uint64
... 
... begin
...     buffer varint-> data
... again
... """)

>>> buffer = to_varint(np.array([0, 1, 2, 127, 128, 129, 130, 16383, 16384, 16385], np.uint64))

>>> vm.run({"buffer": buffer}, raise_read_beyond=False)
'read beyond'
>>> vm.outputs["data"]
array([    0,     1,     2,   127,   128,   129,   130, 16383, 16384,
       16385], dtype=uint64)

What the above does:

  • declares input and output buffers (inputs are given; outputs are created and grow as needed)
  • uses the varint-> word to decode one variable-length integer from the input buffer to the output data
  • uses the begin .. again construct to do an infinite loop (like while True)
  • uses raise_read_beyond=False to catch a "read beyond length of input buffer" exception and return it as a string instead of raising a Python exception.

But since AwkwardForth is compiled code, it's a lot faster than pure Python:

%%timeit
from_varint(buffer);
# 5.11 s ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
vm.run({"buffer": buffer}, raise_read_beyond=False);
# 21.3 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

(240× faster, in this case).

The value of this encoding is that integers close to zero use the fewest bytes, yet any non-negative integer can be encoded (including integers too large for uint64, though uint64 is all we care about when NumPy is involved).

You could do the same thing with signed integers, but small negative integers like -1 would be encoded with the most bytes, because -1 is 0xffffffffffffffff when its int64 bits are reinterpreted as uint64.

>>> hex(np.int64(-1).view(np.uint64))
'0xffffffffffffffff'
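
As a rough illustration of that cost, the sketch below counts how many varint bytes a value needs, assuming seven payload bits per byte as in the encoder earlier in this comment. The helper name varint_size is hypothetical.

```python
def varint_size(value):
    """Number of bytes a non-negative integer occupies as a base-128 varint."""
    # ceil(bit_length / 7), with a minimum of one byte for zero
    return max(1, -(-value.bit_length() // 7))


minus_one_as_uint64 = -1 & 0xFFFFFFFFFFFFFFFF  # two's-complement reinterpretation

print(varint_size(1))                    # 1
print(varint_size(minus_one_as_uint64))  # 10
```

So +1 takes a single byte, while -1 reinterpreted as uint64 takes the maximum of ten.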

So, on top of the variable-length encoding, signed integers are zig-zag encoded, mapping e.g. [0, -1, 1, -2, ...] to [0, 1, 2, 3, ...]. That step can be vectorized:

https://github.com/JuliaHEP/UnROOT.jl/blob/d50081090d95b44138098e25b4102e0d01f270a6/src/RNTuple/fieldcolumn_reading.jl#L87-L88

or

from_zigzag = lambda n: (n >> 1) ^ -(n & 1)
to_zigzag = lambda n: (n << 1) ^ (n >> 63)

in Python. However, AwkwardForth has a built-in word for it, zigzag-> (which does both the zig-zag and the variable-length decoding), so you can just use that.

>>> vm = ForthMachine64("""
... input buffer
... output data int64
... 
... begin
...     buffer zigzag-> data
... again
... """)

>>> buffer = to_varint(
...     to_zigzag(np.array([0, -1, 1, -2, 2, 100, -100, 1000, -1000], np.int64)).astype(np.uint64)
... )

>>> vm.run({"buffer": buffer}, raise_read_beyond=False)
'read beyond'
>>> vm.outputs["data"]
array([    0,    -1,     1,    -2,     2,   100,  -100,  1000, -1000])
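
For checking the mapping without AwkwardForth or NumPy, here is a minimal sketch of the same zig-zag formulas on plain Python integers. Because Python integers are unbounded, the encoder masks the result to 64 bits; the names MASK64, to_zigzag64 and from_zigzag64 are hypothetical.

```python
MASK64 = (1 << 64) - 1


def to_zigzag64(n):
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def from_zigzag64(z):
    """Inverse of to_zigzag64: the lowest bit of z carries the sign."""
    return (z >> 1) ^ -(z & 1)


values = [0, -1, 1, -2, 2, 100, -100, 1000, -1000]
encoded = [to_zigzag64(v) for v in values]
print(encoded)  # [0, 1, 2, 3, 4, 200, 199, 2000, 1999]
print([from_zigzag64(z) for z in encoded] == values)  # True
```

Small magnitudes of either sign land on small unsigned values, which is what makes the subsequent variable-length step compact.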


jblomer commented Sep 5, 2023

> @ioanaif has started to look into this, and found that some of the changes are headers/metadata, while another is that variable-length integers and zig-zag encoding are now included in RNTuple.

Zig-zag encoding is now part of RNTuple but varints aren't.
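
If only zig-zag is involved, decoding reduces to a fixed-width read followed by the bit trick, with no byte-by-byte loop over variable-length integers. The sketch below assumes a plain little-endian layout of 64-bit values purely for illustration; the actual RNTuple column layout may apply further transformations that are not covered here, and the name decode_zigzag_column is hypothetical.

```python
import struct


def decode_zigzag_column(raw):
    """Decode fixed-width 64-bit zig-zag values (layout assumed little-endian)."""
    count = len(raw) // 8
    unsigned = struct.unpack(f"<{count}Q", raw[: count * 8])
    # The lowest bit carries the sign; the remaining bits carry the magnitude.
    return [(z >> 1) ^ -(z & 1) for z in unsigned]


raw = struct.pack("<4Q", 0, 1, 2, 3)
print(decode_zigzag_column(raw))  # [0, -1, 1, -2]
```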
