This is probably more of an h5py or libhdf5 issue, but it mainly impacts versioned-hdf5, so I'm opening it here.
HDF5 maintains a chunk cache so that repeated accesses to a chunk can be satisfied without going back to the file; see the hdf5 docs.
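For reference, the per-file chunk-cache parameters can be passed directly to `h5py.File` and read back through the low-level API; this is a minimal sketch (the file path is arbitrary):

```python
import os
import tempfile

import h5py
import numpy as np

# Open a file with a 10 MiB / 10007-slot raw-data chunk cache.
path = os.path.join(tempfile.mkdtemp(), "cache_demo.h5")
with h5py.File(path, "w", rdcc_nbytes=10 * 2**20, rdcc_nslots=10007) as f:
    f.create_dataset("value", data=np.arange(1_000_000), chunks=(1000,))
    # get_cache() returns (mdc_nelmts, rdcc_nslots, rdcc_nbytes, rdcc_w0).
    _, nslots, nbytes, w0 = f.id.get_access_plist().get_cache()
    print(nslots, nbytes)  # 10007 10485760
```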
If I understand it correctly, chunk caching never applies to virtual datasets, so it is not used in versioned-hdf5. Here are some benchmarks that are much slower with versioned-hdf5 than without versioning:
>>> import h5py
>>> import numpy as np
>>> from versioned_hdf5 import VersionedHDF5File
>>> # pick chunks and h5py cache sizes
>>> chunks = (1000,)
>>> rdcc_nbytes = 10 * (2 ** 20) # 10 MiB, enough to hold all 1000 chunks of 8000 bytes each
>>> rdcc_nslots = 10007 # prime number large enough to fit roughly 10x number of chunks
>>> # slice to read
>>> slc = slice(None, None, 10)
>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
... f.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))
>>> # disable chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=0) as f:
... print('first access:')
... %time f['value'][slc]
... print()
... print('subsequent accesses:')
... %timeit f['value'][slc]
first access:
CPU times: user 11.6 ms, sys: 26.8 ms, total: 38.3 ms
Wall time: 38.4 ms
subsequent accesses:
35.8 ms ± 835 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
... f.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))
>>> # open file with 10 MiB chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=rdcc_nbytes, rdcc_nslots=rdcc_nslots) as f:
... print('first access actually reads from file:')
... %time f['value'][slc]
... print()
... print('subsequent accesses will read from cache:')
... %timeit f['value'][slc]
first access actually reads from file:
CPU times: user 5.87 ms, sys: 1.89 ms, total: 7.76 ms
Wall time: 7.72 ms
subsequent accesses will read from cache:
5.09 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> with h5py.File('/var/tmp/data.h5', 'w') as f:
... vf = VersionedHDF5File(f)
... with vf.stage_version('r0') as sv:
... sv.create_dataset('value', data=np.arange(1_000_000), chunks=chunks, maxshape=(None,))
>>> # open file with 10 MiB chunk cache
>>> with h5py.File('/var/tmp/data.h5', 'r', rdcc_nbytes=rdcc_nbytes, rdcc_nslots=rdcc_nslots) as f:
... vf = VersionedHDF5File(f)
... cv = vf[vf.current_version]
... print('first access actually reads from file:')
... %time cv['value'][slc]
... print()
... print('for versioned files no accesses will ever read from chunk cache:')
... %timeit cv['value'][slc]
first access actually reads from file:
CPU times: user 374 ms, sys: 16.7 ms, total: 390 ms
Wall time: 391 ms
for versioned files no accesses will ever read from chunk cache:
492 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is this slowness due to the (missing) chunk cache? If so, how could we add chunk-cache support for virtual datasets?
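In the meantime, one userspace workaround is to cache decoded chunks in Python with `functools.lru_cache`. This is only a sketch under the assumption of a 1-D dataset; `ChunkCachedDataset` and `maxchunks` are made-up names, not versioned-hdf5 API, and the element-by-element gather is illustrative rather than fast:

```python
from functools import lru_cache

import numpy as np


class ChunkCachedDataset:
    """Minimal Python-level chunk cache for a 1-D h5py-like dataset.

    Hypothetical workaround: since libhdf5's chunk cache is bypassed
    for virtual datasets, keep up to `maxchunks` decoded chunks in an
    LRU cache on the Python side instead.
    """

    def __init__(self, dset, maxchunks=128):
        self._dset = dset
        self._chunklen = dset.chunks[0]
        # Per-instance LRU cache over whole-chunk reads.
        self._read = lru_cache(maxsize=maxchunks)(self._read_chunk)

    def _read_chunk(self, i):
        # Read chunk i in one contiguous dataset access.
        start = i * self._chunklen
        return self._dset[start:start + self._chunklen]

    def __getitem__(self, slc):
        # Only plain slices are handled in this sketch.
        start, stop, step = slc.indices(self._dset.shape[0])
        idx = np.arange(start, stop, step)
        out = np.empty(len(idx), dtype=self._dset.dtype)
        for j, k in enumerate(idx):
            chunk = self._read(k // self._chunklen)
            out[j] = chunk[k % self._chunklen]
        return out
```

Wrapping `cv['value']` in something like this would at least amortize repeated strided reads of the same chunks, at the cost of an extra copy of the cached data in Python memory.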
ArvidJB changed the title from "Support chunk cache in versioned-hdf5" to "Support chunk cache in versioned-hdf5 (PyInf#13103)" on Jul 23, 2024.