[InMemoryDataset redesign] Read many slices at once with the HDF5 C API #378

Open · crusaderky wants to merge 2 commits into master from h5multiread
Conversation

crusaderky (Collaborator) commented:

This is an alternative to #370, after it was found that virtual datasets perform very poorly in libhdf5.

This PR adds a function to quickly read potentially thousands of slices from HDF5 into a numpy array, or between numpy arrays. The new function remains dormant for now; it will be used in a later PR by a variant of the StagedChangesArray from #370.
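For context, here is a minimal usage sketch of such a bulk-read call next to the pure-h5py loop it replaces. The function name `read_many_slices` comes from the PR title; the coordinate-array signature shown is an illustrative assumption, not the PR's actual API:

```python
import h5py
import numpy as np

# Build a sample dataset to read from.
with h5py.File("example.h5", "w") as f:
    f.create_dataset("values", data=np.arange(1_000_000))

with h5py.File("example.h5", "r") as f:
    src = f["values"]
    dst = np.empty(30_000, dtype=src.dtype)

    # Three source slices and their destinations, one row per slice.
    src_start = np.array([0, 50_000, 90_000], dtype=np.uint64)
    dst_start = np.array([0, 10_000, 20_000], dtype=np.uint64)
    count = np.array([10_000, 10_000, 10_000], dtype=np.uint64)

    # Hypothetical bulk call: a single trip into the HDF5 C API.
    # read_many_slices(src, dst, src_start, dst_start, count)

    # Equivalent pure-h5py loop: one __getitem__ call per slice, which
    # is what makes reading thousands of slices slow today.
    for s, d, n in zip(src_start, dst_start, count):
        dst[d : d + n] = src[s : s + n]
```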

crusaderky force-pushed the h5multiread branch 4 times, most recently from 624332d to bb6291f, on October 3, 2024 at 00:38.
crusaderky self-assigned this on Oct 3, 2024.
crusaderky changed the title from "[WIP] Read many slices at once with the HDF5 C API" to "[InMemoryDataset redesign] Read many slices at once with the HDF5 C API" on Oct 3, 2024.
peytondmurray (Collaborator) left a review:

Thanks for this - the tests seem comprehensive, and I think there's only one place where we might need a lock. Otherwise this looks great! 🚀

@@ -11,6 +11,16 @@ py.install_sources(
subdir: 'versioned_hdf5',
)

# Adapted from https://numpy.org/doc/2.1/reference/random/examples/cython/meson.build.html
peytondmurray (Collaborator) commented on the meson.build change:
👍

"""Implements read_many_slices data transfer when fast transfer cannot be performed.

This happens when:
1. src is a h5py.Dataset but h5py.Dataset._fast_read_ok returns False.
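The docstring above is truncated in this view. As a hedged sketch of the dispatch it describes, assuming a hypothetical `_read_many_slices_fallback` helper (`h5py.Dataset._fast_read_ok` is a private attribute of recent h5py versions):

```python
import h5py

def _read_many_slices_fallback(src, dst, pairs):
    """Slow path: one plain __getitem__/__setitem__ per slice pair."""
    for src_idx, dst_idx in pairs:
        dst[dst_idx] = src[src_idx]

def read_many_slices(src, dst, pairs):
    # The fast C-level transfer applies only when src is a plain numpy
    # array or a dataset that h5py considers safe for raw reads.
    if isinstance(src, h5py.Dataset) and not src._fast_read_ok:
        _read_many_slices_fallback(src, dst, pairs)
    else:
        ...  # fast C-level bulk transfer (the subject of this PR)
```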
peytondmurray (Collaborator) commented:

I think you need a with phil lock here in the case where src is an h5py.Dataset.
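For reference, this presumably refers to h5py's global `phil` lock, which h5py itself holds around every libhdf5 call; importing it from `h5py._objects` is private-API territory, so treat this as a sketch:

```python
from h5py._objects import phil  # private h5py API: the global HDF5 lock

def _read_many_slices_fallback(src, dst, pairs):
    with phil:  # libhdf5 is not thread-safe; serialize access to it
        for src_idx, dst_idx in pairs:
            dst[dst_idx] = src[src_idx]
```

The lock is reentrant, so wrapping the whole loop is safe even though h5py re-acquires it inside each `__getitem__` call.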



def test_read_many_slices_not_fast_read_ok(h5file):
    """src is a h5py dataset that doesn't support fast read"""
peytondmurray (Collaborator) commented:

It might be good to outline somewhere in the docs when a dataset supports fast reading, since there is a significant difference in performance.
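For what it's worth, h5py 3.x decides eligibility roughly as below (paraphrased from `h5py/_hl/dataset.py`; verify against your h5py version). The dataspace must be SIMPLE and the on-disk type a plain integer or float, so scalar and null dataspaces, strings, compounds, and anything requiring type conversion all take the slow path:

```python
from h5py import h5s, h5t

def fast_read_ok(dset):
    """Approximation of h5py.Dataset._fast_read_ok (h5py 3.x)."""
    return (
        dset.id.get_space().get_simple_extent_type() == h5s.SIMPLE
        and isinstance(dset.id.get_type(), (h5t.TypeIntegerID, h5t.TypeFloatID))
    )
```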
