Add support for current RNTuple files #928

eguiraud · 2023-08-09T15:20:17Z

Due to some recent changes in RNTuple's format, uproot is not able to read RNTuples correctly anymore.

A reproducer:

python -c 'import uproot; print(uproot.__version__); uproot.open("https://xrootd-local.unl.edu:1094//store/user/AGC/nanoaod-rntuple/zstd/TT_TuneCUETP8M1_13TeV-powheg-pythia8/cmsopendata2015_ttbar_19980_PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext3-v1_00000_0000.root")["Events"].arrays(["nTau"])'

results in

5.0.10
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/blue/Repos/analysis-grand-challenge-iris-hep/analyses/cms-open-data-ttbar/venv-uproot5/lib/python3.11/site-packages/uproot/models/RNTuple.py", line 394, in arrays
    entry_stop = entry_stop or self._length
                               ^^^^^^^^^^^^
  File "/home/blue/Repos/analysis-grand-challenge-iris-hep/analyses/cms-open-data-ttbar/venv-uproot5/lib/python3.11/site-packages/uproot/models/RNTuple.py", line 180, in _length
    return sum(x.num_entries for x in self.cluster_summaries)
                                      ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/blue/Repos/analysis-grand-challenge-iris-hep/analyses/cms-open-data-ttbar/venv-uproot5/lib/python3.11/site-packages/uproot/models/RNTuple.py", line 175, in cluster_summaries
    return self.footer.cluster_summaries
           ^^^^^^^^^^^
  File "/home/blue/Repos/analysis-grand-challenge-iris-hep/analyses/cms-open-data-ttbar/venv-uproot5/lib/python3.11/site-packages/uproot/models/RNTuple.py", line 164, in footer
    f = FooterReader().read(self._footer_chunk, cursor, context)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/blue/Repos/analysis-grand-challenge-iris-hep/analyses/cms-open-data-ttbar/venv-uproot5/lib/python3.11/site-packages/uproot/models/RNTuple.py", line 697, in read
    out.extension_links = self.extension_header_links.read(chunk, cursor, context)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/blue/Repos/analysis-grand-challenge-iris-hep/analyses/cms-open-data-ttbar/venv-uproot5/lib/python3.11/site-packages/uproot/models/RNTuple.py", line 556, in read
    assert num_bytes < 0, f"num_bytes={num_bytes}"
           ^^^^^^^^^^^^^
AssertionError: num_bytes=36

IIUC the Julia implementation already implemented the necessary changes at https://github.com/JuliaHEP/UnROOT.jl/pull/264/files

EDIT:
maybe I should have marked it as a bug?


jpivarski · 2023-08-24T16:15:09Z

@ioanaif has started to look into this, and found that some of the changes are headers/metadata, while another is that variable-length integers and zig-zag encoding are now included in RNTuple.

Conversion to and from a variable-length integer format is not NumPy-vectorizable: it is necessary to write for loops to do this conversion. Here's what that looks like in Python:

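import numpy as np

def to_varint(data):
    # Encode uint64 values as varints: 7 payload bits per byte, least-significant
    # group first, with the high bit set on every byte except the last.
    # Note: the loop emits at most 9 bytes per value, so it covers values below 2**63.
    assert issubclass(data.dtype.type, np.uint64)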

    output = []
    for value in data:
        mask = np.uint64(0x7f)
        more = np.uint64(np.iinfo(np.uint64).max)
        for shift in np.arange(0, 7 * 9, 7, dtype=np.uint64):
            byte = ((value & mask) >> shift).astype(np.uint8)
            mask <<= np.uint64(7)
            more <<= np.uint64(7)

            if not (value & more):
                output.append(byte)
                break
            else:
                output.append(byte | np.uint8(0x80))

    return b"".join(output)

def from_varint(buffer):
    data = []
    pos = 0
    while pos < len(buffer):
        shift = np.uint64(0)
        result = np.uint64(0)
        while True:
            byte = np.uint64(buffer[pos])
            pos += 1

            if shift == 7 * 9:
                raise ValueError("number is too big for uint64")

            result |= (byte & np.uint64(0x7f)) << shift
            shift += np.uint64(7)

            if not (byte & np.uint64(0x80)):
                break

        data.append(result)

    return np.array(data, dtype=np.uint64)

>>> data = np.array([0, 1, 2, 127, 128, 129, 130, 16383, 16384, 16385], np.uint64)
>>> data.tolist()
[0, 1, 2, 127, 128, 129, 130, 16383, 16384, 16385]

>>> buffer = to_varint(data)
>>> buffer
b'\x00\x01\x02\x7f\x80\x01\x81\x01\x82\x01\xff\x7f\x80\x80\x01\x81\x80\x01'

>>> from_varint(buffer).tolist()
[0, 1, 2, 127, 128, 129, 130, 16383, 16384, 16385]
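To see where those bytes come from, here is a minimal sketch that encodes one value at a time with plain Python integers. The helper name varint_bytes is hypothetical (it is not part of uproot or Awkward Array); it only mirrors the 7-bit grouping done by to_varint above.

```python
def varint_bytes(value):
    """Encode one non-negative int as a varint (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while True:
        group = value & 0x7F      # lowest 7 bits
        value >>= 7
        if value:                 # more groups follow: set the continuation bit
            out.append(group | 0x80)
        else:                     # last group: continuation bit stays clear
            out.append(group)
            return bytes(out)

for n in (127, 128, 16383, 16384):
    print(n, varint_bytes(n).hex(" "))
# 127 7f
# 128 80 01
# 16383 ff 7f
# 16384 80 80 01
```

Values up to 127 fit in one byte; each additional 7 bits of magnitude costs one more byte.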

But we want to avoid Python for loops. AwkwardForth provides a way to do that when converting from a variable-length encoding into integers. Here's how it can be done:

>>> from awkward.forth import ForthMachine64
>>> vm = ForthMachine64("""
... input buffer
... output data uint64
... 
... begin
...     buffer varint-> data
... again
... """)

>>> buffer = to_varint(np.array([0, 1, 2, 127, 128, 129, 130, 16383, 16384, 16385], np.uint64))

>>> vm.run({"buffer": buffer}, raise_read_beyond=False)
'read beyond'
>>> vm.outputs["data"]
array([    0,     1,     2,   127,   128,   129,   130, 16383, 16384,
       16385], dtype=uint64)

What the above does:

- declares input and output buffers (inputs are given; outputs are created and grow as needed)
- uses the varint-> word to decode one variable-length integer from the input buffer to the output data
- uses the begin .. again construct to do an infinite loop (like while True)
- uses raise_read_beyond=False to catch a "read beyond length of input buffer" exception and return it as a string instead of raising a Python exception.

Because AwkwardForth runs as compiled code, it's a lot faster than pure Python:

%%timeit
from_varint(buffer);
# 5.11 s ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
vm.run({"buffer": buffer}, raise_read_beyond=False);
# 21.3 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

(240× faster, in this case).

The value of this encoding is that integers close to zero use the fewest bytes, but any integer can be encoded (including integers too large for uint64, though uint64 is all we care about when NumPy is involved).

You could do the same thing with signed integers, but small negative integers like -1 would be encoded with the most bytes, because -1 is 0xffffffffffffffff in int64.

>>> hex(np.int64(-1).view(np.uint64))
'0xffffffffffffffff'
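The byte cost can be checked without encoding anything, since a varint needs one byte per 7 significant bits. A small sketch (the helper varint_length is hypothetical, named here only for illustration) shows that values near zero take one byte, while -1 reinterpreted as uint64 takes ten:

```python
def varint_length(value):
    """Number of bytes a non-negative int occupies as a varint."""
    if value < 0:
        raise ValueError("reinterpret negative values as unsigned first")
    # One byte per 7 significant bits; zero still needs one byte.
    return max(1, (value.bit_length() + 6) // 7)

minus_one_as_uint64 = (-1) & 0xFFFFFFFFFFFFFFFF   # two's-complement view of -1

print(varint_length(1))                    # 1
print(varint_length(16384))                # 3
print(varint_length(minus_one_as_uint64))  # 10
```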

So, on top of the variable-length encoding, signed integers are zig-zag encoded, mapping e.g. [0, -1, 1, -2, ...] to [0, 1, 2, 3, ...]. That step can be vectorized:

https://github.com/JuliaHEP/UnROOT.jl/blob/d50081090d95b44138098e25b4102e0d01f270a6/src/RNTuple/fieldcolumn_reading.jl#L87-L88

or

from_zigzag = lambda n: (n >> 1) ^ -(n & 1)
to_zigzag = lambda n: (n << 1) ^ (n >> 63)

in Python. However, AwkwardForth has a built-in word for it, zigzag-> (which does both the zig-zag and the variable-length decoding), so you can just use that.

>>> vm = ForthMachine64("""
... input buffer
... output data int64
... 
... begin
...     buffer zigzag-> data
... again
... """)

>>> buffer = to_varint(
...     to_zigzag(np.array([0, -1, 1, -2, 2, 100, -100, 1000, -1000], np.int64)).astype(np.uint64)
... )

>>> vm.run({"buffer": buffer}, raise_read_beyond=False)
'read beyond'
>>> vm.outputs["data"]
array([    0,    -1,     1,    -2,     2,   100,  -100,  1000, -1000])
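The zig-zag step on its own, without the varint layer, can also be written in NumPy with explicit dtypes, so the result does not depend on how signed and unsigned inputs mix. This is only a sketch: the function names are hypothetical, and it assumes 64-bit values that are already available as an array.

```python
import numpy as np

def zigzag_encode(signed):
    """Map int64 [0, -1, 1, -2, ...] to uint64 [0, 1, 2, 3, ...]."""
    signed = np.asarray(signed, dtype=np.int64)
    # The arithmetic shift by 63 gives 0 for non-negative values and -1 (all ones) for negative ones.
    return ((signed << 1) ^ (signed >> 63)).view(np.uint64)

def zigzag_decode(unsigned):
    """Inverse of zigzag_encode: uint64 back to int64."""
    unsigned = np.asarray(unsigned, dtype=np.uint64)
    magnitude = (unsigned >> np.uint64(1)).view(np.int64)
    sign_mask = -(unsigned & np.uint64(1)).view(np.int64)   # 0 or -1
    return magnitude ^ sign_mask

values = np.array([0, -1, 1, -2, 2, 100, -100, 1000, -1000], dtype=np.int64)
encoded = zigzag_encode(values)
print(encoded.tolist())                  # [0, 1, 2, 3, 4, 200, 199, 2000, 1999]
print(zigzag_decode(encoded).tolist())   # [0, -1, 1, -2, 2, 100, -100, 1000, -1000]
```

Both functions are single vectorized expressions, so no Python-level loop is involved.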

jblomer · 2023-09-05T14:08:57Z

> @ioanaif has started to look into this, and found that some of the changes are headers/metadata, while another is that variable-length integers and zig-zag encoding are now included in RNTuple.

Zig-zag encoding is now part of RNTuple but varints aren't.

eguiraud added the "feature" (New feature or request) label Aug 9, 2023

ioanaif self-assigned this Aug 9, 2023

jpivarski added a commit that referenced this issue Oct 3, 2023

test: skip RNTuple test until #928 is fixed

43108da

jpivarski mentioned this issue Oct 3, 2023

test: skip RNTuple test until #928 is fixed #969

Merged

jpivarski added a commit that referenced this issue Oct 3, 2023

test: skip RNTuple test until #928 is fixed (#969)

9074577

nikoladze mentioned this issue Oct 5, 2023

Reading the RNtuple PHYSLITE prototype #975

Closed

ioanaif mentioned this issue Oct 20, 2023

feat: add the ability to read RNTuple alias columns #1004

Merged

jpivarski closed this as completed in #1004 Jan 18, 2024
