Lazily load the data in the frame-cache format #710

psavery · 2024-08-18T02:15:58Z

Previously, just instantiating the imageseries would cause the data to be loaded. For some of the larger datasets from eiger, this could use a lot of RAM and time.

However, in some cases, we want to instantiate an imageseries just as a convenient (and somewhat standardized) way to look at things such as metadata (such as omegas) and other information without actually needing to load the data. This is actually something we do in the GUI to check the omegas (we do it for all image series types, in fact, not just frame-cache).

This commit changes the class to wait to load the data until it is actually requested. This makes loading it in the Simple Image Series loader in the GUI (which is done to check metadata) significantly faster.

This class still loads in all the frames the first time any frame is requested. Another potentially useful change would be to just load in whatever frame was requested and cache that frame in a dict (to be returned next time). However, maybe that could be done in the future if we encounter a more relevant use case for it.

Previously, just instantiating the imageseries would cause the data to be loaded. For some of the larger datasets from eiger, this could use a lot of RAM and time. However, in some cases, we want to instantiate an imageseries just as a convenient (and somewhat standardized) way to look at things such as metadata (such as omegas) and other information without actually needing to load the data. This is actually something we do in the GUI to check the omegas (we do it for all image series types, in fact, not just frame-cache). This commit changes the class to wait to load the data until it is actually requested. This makes loading it in the Simple Image Series loader in the GUI (which is done to check metadata) *significantly* faster. It still loads in all the data the first time any data is requested. Another potentially useful change would be to just load in whatever frame was requested and cache that frame in a dict (to be returned next time). However, maybe that could be done in the future if we encounter a more relevant use case for it. Signed-off-by: Patrick Avery <patrick.avery@kitware.com>

psavery · 2024-08-18T02:16:44Z

@donald-e-boyce, can you review this one?

psavery · 2024-08-18T02:17:00Z

hexrd/imageseries/load/framecache.py

+        arrs = np.load(self._fname)
+        # HACK: while the loaded npz file has a getitem method
+        # that mimicks a dict, it doesn't have a "pop" method.
+        # must make an empty dict to pop after assignment of
+        # class attributes so we can get to the metadata
+        keysd = dict.fromkeys(list(arrs.keys()))
+        self._nframes = int(arrs['nframes'])
+        self._shape = tuple(arrs['shape'])
+        # Check the type so we can read files written
+        # using Python 2.7
+        array_dtype = arrs['dtype'].dtype
+        # Python 3
+        if array_dtype.type == np.str_:
+            dtype_str = str(arrs['dtype'])
+        # Python 2.7
+        else:
+            dtype_str = arrs['dtype'].tobytes().decode()
+        self._dtype = np.dtype(dtype_str)
+
+        keysd.pop('nframes')
+        keysd.pop('shape')
+        keysd.pop('dtype')
+        for i in range(self._nframes):
+            keysd.pop(f"{i}_row")
+            keysd.pop(f"{i}_col")
+            keysd.pop(f"{i}_data")
+
+        # all rmaining keys should be metadata
+        for key in keysd:
+            keysd[key] = arrs[key]
+        self._meta = keysd


Most of this was just copied over from the previous _load_cache() function.

I ran into a problem. In this simple script, I tried copying the original imageseries to another, intending to add some metadata and check that the metadata is right. But saving the imageseries failed. Here is the script.

"""Test frame cache meta data""" import time from hexrd import imageseries FCFORMAT = "frame-cache" fcfile = "lshr_r6_s0_box_ff_000031_ge1.npz" def save_fc(): ims = imageseries.open(fcfile, FCFORMAT) imageseries.save.write(ims, "fc.npz", FCFORMAT, threshold=0) save_fc()

I got this error:

Traceback (most recent call last): File "/Users/deboyce/Source/hexrd/dev/frame-cache-meta/init.py", line 30, in <module> save_fc() File "/Users/deboyce/Source/hexrd/dev/frame-cache-meta/init.py", line 27, in save_fc imageseries.save.write(ims, "fc.npz", FCFORMAT, threshold=0) File "/Users/deboyce/Source/hexrd/pkg/hexrd/imageseries/save.py", line 40, in write w.write() File "/Users/deboyce/Source/hexrd/pkg/hexrd/imageseries/save.py", line 339, in write self._write_frames() File "/Users/deboyce/Source/hexrd/pkg/hexrd/imageseries/save.py", line 326, in _write_frames list(executor.map(extract_data, range(len(self._ims)))) File "/Users/deboyce/miniconda3/envs/hexrd-dev/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator yield fs.pop().result() File "/Users/deboyce/miniconda3/envs/hexrd-dev/lib/python3.9/concurrent/futures/_base.py", line 439, in result return self.__get_result() File "/Users/deboyce/miniconda3/envs/hexrd-dev/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result raise self._exception File "/Users/deboyce/miniconda3/envs/hexrd-dev/lib/python3.9/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) File "/Users/deboyce/Source/hexrd/pkg/hexrd/imageseries/save.py", line 300, in extract_data count = extract_ijv(self._ims[i], self._thresh, File "/Users/deboyce/Source/hexrd/pkg/hexrd/imageseries/baseclass.py", line 23, in __getitem__ return self._adapter[key] File "/Users/deboyce/Source/hexrd/pkg/hexrd/imageseries/load/framecache.py", line 144, in __getitem__ return self._framelist[key].toarray() IndexError: list index out of range

I figured out what was happening. The write happens simultaneously by a bunch of threads. So all of them hit the __get_item__() method quickly in succession. The first one is trying to load all the images. After it gets at least one image in the framelist, the others think the framelist is loaded and try to access their frame. I added a print to the __get_item__() method.

def __getitem__(self, key): if not self._framelist_was_loaded: # Load the framelist now self._load_framelist() else: if key >= (nframes := len(self._framelist)): msg = f"key ({key})is out of range ({nframes})" # raise RuntimeError(msg) print(msg) return self._framelist[key].toarray()

Here is some output:

key (12)is out of range (1) key (13)is out of range (1) key (14)is out of range (1) key (15)is out of range (4) key (16)is out of range (8) key (17)is out of range (9) key (18)is out of range (9) key (19)is out of range (9) ...

Notice I also had it raise an exception too. In that case, the code finishes without the Exception being raised, although I don't know if the saved file is correct.

codecov · 2024-08-18T02:21:08Z

Codecov Report

Attention: Patch coverage is 94.44444% with 2 lines in your changes missing coverage. Please review.

Project coverage is 33.53%. Comparing base (433a0d5) to head (e9017ee).

Files	Patch %	Lines
hexrd/imageseries/load/framecache.py	94.44%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #710      +/-   ##
==========================================
+ Coverage   33.49%   33.53%   +0.04%     
==========================================
  Files         129      129              
  Lines       21237    21241       +4     
==========================================
+ Hits         7113     7123      +10     
+ Misses      14124    14118       -6

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

This is so that `__getitem__` is thread-safe again. Only one thread should load the framelist, and the other threads wait for it to finish. Signed-off-by: Patrick Avery <patrick.avery@kitware.com>

donald-e-boyce · 2024-09-05T14:49:15Z

Everything looks good, and my test problems work fine.

psavery · 2024-09-05T14:57:16Z

CI already passed except for the Mac test which was fixed in #713

psavery requested a review from donald-e-boyce August 18, 2024 02:16

psavery commented Aug 18, 2024

View reviewed changes

Make framelist loading thread-safe

86ad798

This is so that `__getitem__` is thread-safe again. Only one thread should load the framelist, and the other threads wait for it to finish. Signed-off-by: Patrick Avery <patrick.avery@kitware.com>

donald-e-boyce approved these changes Sep 5, 2024

View reviewed changes

donald-e-boyce closed this Sep 5, 2024

donald-e-boyce reopened this Sep 5, 2024

psavery merged commit c0971c7 into master Sep 5, 2024
6 of 12 checks passed

psavery deleted the frame-cache-lazily-load-frames branch September 5, 2024 14:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lazily load the data in the frame-cache format #710

Lazily load the data in the frame-cache format #710

psavery commented Aug 18, 2024

psavery commented Aug 18, 2024

psavery Aug 18, 2024

donald-e-boyce Sep 4, 2024

codecov bot commented Aug 18, 2024

donald-e-boyce commented Sep 5, 2024

psavery commented Sep 5, 2024

Lazily load the data in the frame-cache format #710

Lazily load the data in the frame-cache format #710

Conversation

psavery commented Aug 18, 2024

psavery commented Aug 18, 2024

psavery Aug 18, 2024

Choose a reason for hiding this comment

donald-e-boyce Sep 4, 2024

Choose a reason for hiding this comment

codecov bot commented Aug 18, 2024

Codecov Report

donald-e-boyce commented Sep 5, 2024

psavery commented Sep 5, 2024