Support chunk cache in versioned-hdf5 (PyInf#13103) #357

Open
ArvidJB opened this issue Jul 23, 2024 · 1 comment

ArvidJB commented Jul 23, 2024

This is probably more of an h5py or libhdf5 issue, but it mainly impacts versioned-hdf5, so I'm opening it here.

HDF5 maintains a chunk cache so that repeated accesses to a chunk can be satisfied without going back to the file; see the HDF5 documentation.
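
For reference, h5py exposes these chunk-cache parameters as the rdcc_* keyword arguments when opening a file, and the values actually in effect can be read back through the low-level file access property list. A minimal sketch (the rdcc_w0 value here is just an illustrative choice, not something versioned-hdf5 sets):

>>> import h5py
>>> # the rdcc_* keywords configure the raw data chunk cache used by every
>>> # dataset opened through this file handle
>>> f = h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=10 * 2**20,
...               rdcc_nslots=10007, rdcc_w0=0.75)
>>> # read back the settings in effect from the file access property list;
>>> # returns (mdc_nelmts, rdcc_nslots, rdcc_nbytes, rdcc_w0)
>>> cache_settings = f.id.get_access_plist().get_cache()
>>> f.close()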

If I understand it correctly, chunk caching never applies to virtual datasets, so it is not used in versioned-hdf5. Is that right? Here are some benchmarks that are much slower in versioned-hdf5 than without versioning:

>>> import h5py
>>> import numpy as np
>>> from versioned_hdf5 import VersionedHDF5File


>>> # pick chunks and h5py cache sizes
>>> chunks = (1000,)
>>> rdcc_nbytes = 10 * (2 ** 20)  # 10 MiB, enough to hold all 1000 chunks (~8 kB each)
>>> rdcc_nslots = 10007  # prime number, roughly 10x the number of chunks


>>> # slice to read
>>> slc = slice(None, None, 10)


>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
...     f.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))


>>> # disable chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=0) as f:
...     print('first access:')
...     %time f['value'][slc]
...     print()
...     print('subsequent accesses:')
...     %timeit f['value'][slc]

first access:
CPU times: user 11.6 ms, sys: 26.8 ms, total: 38.3 ms
Wall time: 38.4 ms

subsequent accesses:
35.8 ms ± 835 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
...     f.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))


>>> # open file with 10 MiB chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=rdcc_nbytes, rdcc_nslots=rdcc_nslots) as f:
...     print('first access actually reads from file:')
...     %time f['value'][slc]
...     print()
...     print('subsequent accesses will read from cache:')
...     %timeit f['value'][slc]

first access actually reads from file:
CPU times: user 5.87 ms, sys: 1.89 ms, total: 7.76 ms
Wall time: 7.72 ms

subsequent accesses will read from cache:
5.09 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
...     vf = VersionedHDF5File(f)
...     with vf.stage_version('r0') as sv:
...         sv.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))


>>> # open file with 10 MiB chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=rdcc_nbytes, rdcc_nslots=rdcc_nslots) as f:
...     vf = VersionedHDF5File(f)
...     cv = vf[vf.current_version]
...     print('first access actually reads from file:')
...     %time cv['value'][slc]
...     print()
...     print('for versioned files no accesses will ever read from chunk cache:')
...     %timeit cv['value'][slc]

first access actually reads from file:
CPU times: user 374 ms, sys: 16.7 ms, total: 390 ms
Wall time: 391 ms

for versioned files no accesses will ever read from chunk cache:
492 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is this slowness due to the (missing) chunk cache? If so, how can we add support for chunk caching for virtual datasets?
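
One possible way to investigate (a sketch only, not a fix; the paths `_version_data/versions/r0/value` and `_version_data/value/raw_data` are assumptions about versioned-hdf5's internal layout, and the per-dataset cache settings here are not something versioned-hdf5 currently applies):

>>> with h5py.File('/var/tmp/data.h5', 'r') as f:
...     # check that the committed version is exposed as an HDF5 virtual dataset
...     print(f['_version_data/versions/r0/value'].is_virtual)
...     # the chunks themselves live in a plain chunked dataset; a per-dataset
...     # chunk cache can be requested for it through a dataset access property
...     # list in the low-level API
...     dapl = h5py.h5p.create(h5py.h5p.DATASET_ACCESS)
...     dapl.set_chunk_cache(10007, 10 * 2**20, 0.75)
...     raw = h5py.Dataset(h5py.h5d.open(f.id, b'/_version_data/value/raw_data', dapl=dapl))
...     # confirm the (nslots, nbytes, w0) settings attached to this dataset
...     print(raw.id.get_access_plist().get_chunk_cache())

If the slowdown really is the chunk cache being bypassed for reads through the virtual dataset, then timing reads against the raw_data dataset opened this way (versus through the VDS) might help confirm it, and per-dataset dapl settings when versioned-hdf5 opens raw_data could be one direction to explore.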

ArvidJB changed the title from "Support chunk cache in versioned-hdf5" to "Support chunk cache in versioned-hdf5 (PyInf#13103)" on Jul 23, 2024
peytondmurray commented:

This will be a good question for Guido. If you're okay with it, let's consult him once he becomes available.
