Support chunk cache in versioned-hdf5 (PyInf#13103) #357

Open
ArvidJB opened this issue Jul 23, 2024 · 1 comment

ArvidJB commented Jul 23, 2024

This is probably more of an h5py or libhdf5 issue, but it mainly impacts versioned-hdf5, so I'm opening it here.

HDF5 maintains a chunk cache so that repeated accesses to a chunk can be satisfied without going back to the file; see the HDF5 documentation.
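
For reference, h5py exposes these chunk-cache parameters as the rdcc_* keyword arguments when opening a file, and the values actually in effect can be read back through the low-level file access property list. A minimal sketch (the rdcc_w0 value here is just an illustrative choice, not something versioned-hdf5 sets):

>>> import h5py
>>> # the rdcc_* keywords configure the raw data chunk cache used by every
>>> # dataset opened through this file handle
>>> f = h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=10 * 2**20,
...               rdcc_nslots=10007, rdcc_w0=0.75)
>>> # read back the settings in effect from the file access property list;
>>> # returns (mdc_nelmts, rdcc_nslots, rdcc_nbytes, rdcc_w0)
>>> cache_settings = f.id.get_access_plist().get_cache()
>>> f.close()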

If I understand it correctly, chunk caching never applies to virtual datasets, so it is not used in versioned-hdf5. Is that right? Here are some benchmarks that are much slower in versioned-hdf5 than without versioning:

>>> import h5py
>>> import numpy as np
>>> from versioned_hdf5 import VersionedHDF5File


>>> # pick chunks and h5py cache sizes
>>> chunks = (1000,)
>>> rdcc_nbytes = 10 * (2 ** 20)  # 10 MiB, enough to hold all 1000 chunks (~8 kB each)
>>> rdcc_nslots = 10007  # prime number, roughly 10x the number of chunks


>>> # slice to read
>>> slc = slice(None, None, 10)


>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
...     f.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))


>>> # disable chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=0) as f:
...     print('first access:')
...     %time f['value'][slc]
...     print()
...     print('subsequent accesses:')
...     %timeit f['value'][slc]

first access:
CPU times: user 11.6 ms, sys: 26.8 ms, total: 38.3 ms
Wall time: 38.4 ms

subsequent accesses:
35.8 ms ± 835 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
...     f.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))


>>> # open file with 10 MiB chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=rdcc_nbytes, rdcc_nslots=rdcc_nslots) as f:
...     print('first access actually reads from file:')
...     %time f['value'][slc]
...     print()
...     print('subsequent accesses will read from cache:')
...     %timeit f['value'][slc]

first access actually reads from file:
CPU times: user 5.87 ms, sys: 1.89 ms, total: 7.76 ms
Wall time: 7.72 ms

subsequent accesses will read from cache:
5.09 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
...     vf = VersionedHDF5File(f)
...     with vf.stage_version('r0') as sv:
...         sv.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))


>>> # open file with 10 MiB chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=rdcc_nbytes, rdcc_nslots=rdcc_nslots) as f:
...     vf = VersionedHDF5File(f)
...     cv = vf[vf.current_version]
...     print('first access actually reads from file:')
...     %time cv['value'][slc]
...     print()
...     print('for versioned files no accesses will ever read from chunk cache:')
...     %timeit cv['value'][slc]

first access actually reads from file:
CPU times: user 374 ms, sys: 16.7 ms, total: 390 ms
Wall time: 391 ms

for versioned files no accesses will ever read from chunk cache:
492 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is this slowness due to the (missing) chunk cache? If so, how can we add support for chunk caching for virtual datasets?
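
One possible way to investigate (a sketch only, not a fix; the paths `_version_data/versions/r0/value` and `_version_data/value/raw_data` are assumptions about versioned-hdf5's internal layout, and the per-dataset cache settings here are not something versioned-hdf5 currently applies):

>>> with h5py.File('/var/tmp/data.h5', 'r') as f:
...     # check that the committed version is exposed as an HDF5 virtual dataset
...     print(f['_version_data/versions/r0/value'].is_virtual)
...     # the chunks themselves live in a plain chunked dataset; a per-dataset
...     # chunk cache can be requested for it through a dataset access property
...     # list in the low-level API
...     dapl = h5py.h5p.create(h5py.h5p.DATASET_ACCESS)
...     dapl.set_chunk_cache(10007, 10 * 2**20, 0.75)
...     raw = h5py.Dataset(h5py.h5d.open(f.id, b'/_version_data/value/raw_data', dapl=dapl))
...     # confirm the (nslots, nbytes, w0) settings attached to this dataset
...     print(raw.id.get_access_plist().get_chunk_cache())

If the slowdown really is the chunk cache being bypassed for reads through the virtual dataset, then timing reads against the raw_data dataset opened this way (versus through the VDS) might help confirm it, and per-dataset dapl settings when versioned-hdf5 opens raw_data could be one direction to explore.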

ArvidJB changed the title from "Support chunk cache in versioned-hdf5" to "Support chunk cache in versioned-hdf5 (PyInf#13103)" on Jul 23, 2024
peytondmurray commented:

This will be a good question for Guido. If you're okay with it, let's consult him once he becomes available.
