Add support for current RNTuple files #928
Comments
@ioanaif has started to look into this, and found that some of the changes are to headers/metadata, while another is that variable-length integers and zig-zag encoding are now included in RNTuple. Conversion to and from a variable-length integer format is not NumPy-vectorizable: it is necessary to write for loops to do this conversion. Here's what that looks like in Python:

import numpy as np

def to_varint(data):
    assert issubclass(data.dtype.type, np.uint64)
    output = []
    for value in data:
        mask = np.uint64(0x7f)
        more = np.uint64(np.iinfo(np.uint64).max)
        for shift in np.arange(0, 7 * 9, 7, dtype=np.uint64):
            byte = ((value & mask) >> shift).astype(np.uint8)
            mask <<= np.uint64(7)
            more <<= np.uint64(7)
            if not (value & more):
                output.append(byte)
                break
            else:
                output.append(byte | np.uint8(0x80))
    return b"".join(output)
def from_varint(buffer):
    data = []
    pos = 0
    while pos < len(buffer):
        shift = np.uint64(0)
        result = np.uint64(0)
        while True:
            byte = np.uint64(buffer[pos])
            pos += 1
            if shift == 7 * 9:
                raise Exception("number is too big for uint64")
            result |= (byte & np.uint64(0x7f)) << shift
            shift += np.uint64(7)
            if not (byte & np.uint64(0x80)):
                break
        data.append(result)
    return np.array(data)

>>> data = np.array([0, 1, 2, 127, 128, 129, 130, 16383, 16384, 16385], np.uint64)
>>> data.tolist()
[0, 1, 2, 127, 128, 129, 130, 16383, 16384, 16385]
>>> buffer = to_varint(data)
>>> buffer
b'\x00\x01\x02\x7f\x80\x01\x81\x01\x82\x01\xff\x7f\x80\x80\x01\x81\x80\x01'
>>> from_varint(buffer).tolist()
[0, 1, 2, 127, 128, 129, 130, 16383, 16384, 16385]

But we want to avoid Python for loops. AwkwardForth provides a way to do that when converting from a variable-length encoding into integers. Here's how it can be done:

>>> from awkward.forth import ForthMachine64
>>> vm = ForthMachine64("""
... input buffer
... output data uint64
...
... begin
... buffer varint-> data
... again
... """)
>>> buffer = to_varint(np.array([0, 1, 2, 127, 128, 129, 130, 16383, 16384, 16385], np.uint64))
>>> vm.run({"buffer": buffer}, raise_read_beyond=False)
'read beyond'
>>> vm.outputs["data"]
array([ 0, 1, 2, 127, 128, 129, 130, 16383, 16384,
16385], dtype=uint64)

What the above does:

But since AwkwardForth is compiled code, it's a lot faster than pure Python:

%%timeit
from_varint(buffer);
# 5.11 s ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
vm.run({"buffer": buffer}, raise_read_beyond=False);
# 21.3 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

(240× faster, in this case). The value of this encoding is that integers that are close to zero use the fewest bytes, but any integer can be encoded (including integers larger than

You could do the same thing with signed integers, but small signed integers like -1 are not small when reinterpreted as unsigned:

>>> hex(np.int64(-1).view(np.uint64))
'0xffffffffffffffff'

So, on top of the variable-length encoding, signed integers are zig-zag encoded, mapping e.g. 0 → 0, -1 → 1, 1 → 2, -2 → 3, 2 → 4, or

from_zigzag = lambda n: (n >> 1) ^ -(n & 1)
to_zigzag = lambda n: (n << 1) ^ (n >> 63)

in Python. However, AwkwardForth has a built-in word for it, zigzag->:

>>> vm = ForthMachine64("""
... input buffer
... output data int64
...
... begin
... buffer zigzag-> data
... again
... """)
>>> buffer = to_varint(
... to_zigzag(np.array([0, -1, 1, -2, 2, 100, -100, 1000, -1000], np.int64)).astype(np.uint64)
... )
>>> vm.run({"buffer": buffer}, raise_read_beyond=False)
'read beyond'
>>> vm.outputs["data"]
array([ 0, -1, 1, -2, 2, 100, -100, 1000, -1000])
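
To make the size argument concrete, here is a minimal pure-Python sketch (no NumPy or AwkwardForth needed) that encodes single integers and counts the resulting bytes. The names `encode_varint`, `decode_varints`, `zigzag_encode`, and `zigzag_decode` are hypothetical helpers for illustration, not part of Uproot or Awkward Array. The sketch assumes the same layout as `to_varint` above (7 payload bits per byte, least-significant group first, high bit set when more bytes follow) and allows up to 10 bytes per value so that all 64 bits fit.

```python
UINT64_MAX = 2**64 - 1


def encode_varint(value):
    """Encode one unsigned integer (0 <= value <= UINT64_MAX) as varint bytes."""
    if not 0 <= value <= UINT64_MAX:
        raise ValueError("value does not fit in uint64")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)  # last byte of this value
            return bytes(out)


def decode_varints(buffer):
    """Decode a buffer of back-to-back varints into a list of unsigned ints."""
    values = []
    result = 0
    shift = 0
    for byte in buffer:
        if shift > 63:
            raise ValueError("varint is longer than 10 bytes")
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            values.append(result)
            result, shift = 0, 0
    if shift:
        raise ValueError("buffer ends in the middle of a varint")
    return values


def zigzag_encode(n):
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    if not -(2**63) <= n < 2**63:
        raise ValueError("value does not fit in int64")
    return (n << 1) ^ (n >> 63)


def zigzag_decode(u):
    """Inverse of zigzag_encode."""
    return (u >> 1) ^ -(u & 1)


# -1 reinterpreted as an unsigned 64-bit integer needs the maximum of 10 bytes...
assert len(encode_varint(-1 & UINT64_MAX)) == 10
# ...but its zig-zag encoding is 1, which fits in a single byte.
assert encode_varint(zigzag_encode(-1)) == b"\x01"

# Round trip through zig-zag + varint for the signed values used above.
values = [0, -1, 1, -2, 2, 100, -100, 1000, -1000]
buffer = b"".join(encode_varint(zigzag_encode(v)) for v in values)
assert [zigzag_decode(u) for u in decode_varints(buffer)] == values
```

Because every value goes through Python-level loops, this is only suitable for cross-checking small inputs; the AwkwardForth machines above remain the practical route for real data.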
Zig-zag encoding is now part of RNTuple but varints aren't.
Due to some recent changes in RNTuple's format, uproot is not able to read RNTuples correctly anymore.
A reproducer:
results in
IIUC the Julia implementation already implemented the necessary changes at https://github.com/JuliaHEP/UnROOT.jl/pull/264/files
EDIT:
maybe I should have marked it as a bug?