Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lazily load the data in the frame-cache format #710

Merged
merged 2 commits into from
Sep 5, 2024

Conversation

psavery
Copy link
Collaborator

@psavery psavery commented Aug 18, 2024

Previously, just instantiating the imageseries would cause the data to be loaded. For some of the larger datasets from eiger, this could use a lot of RAM and time.

However, in some cases, we want to instantiate an imageseries just as a convenient (and somewhat standardized) way to look at things such as metadata (such as omegas) and other information without actually needing to load the data. This is actually something we do in the GUI to check the omegas (we do it for all image series types, in fact, not just frame-cache).

This commit changes the class to wait to load the data until it is actually requested. This makes loading it in the Simple Image Series loader in the GUI (which is done to check metadata) significantly faster.

This class still loads in all the frames the first time any frame is requested. Another potentially useful change would be to just load in whatever frame was requested and cache that frame in a dict (to be returned next time). However, maybe that could be done in the future if we encounter a more relevant use case for it.

Previously, just instantiating the imageseries would cause the data to be
loaded. For some of the larger datasets from eiger, this could use a lot of
RAM and time.

However, in some cases, we want to instantiate an imageseries just as a
convenient (and somewhat standardized) way to look at things such as
metadata (such as omegas) and other information without actually needing
to load the data. This is actually something we do in the GUI to check the
omegas (we do it for all image series types, in fact, not just frame-cache).

This commit changes the class to wait to load the data until it is actually
requested. This makes loading it in the Simple Image Series loader in the
GUI (which is done to check metadata) *significantly* faster.

It still loads in all the data the first time any data is requested. Another
potentially useful change would be to just load in whatever frame was
requested and cache that frame in a dict (to be returned next time).
However, maybe that could be done in the future if we encounter a more
relevant use case for it.

Signed-off-by: Patrick Avery <patrick.avery@kitware.com>
@psavery
Copy link
Collaborator Author

psavery commented Aug 18, 2024

@donald-e-boyce, can you review this one?

Comment on lines +44 to +74
arrs = np.load(self._fname)
# HACK: while the loaded npz file has a getitem method
# that mimicks a dict, it doesn't have a "pop" method.
# must make an empty dict to pop after assignment of
# class attributes so we can get to the metadata
keysd = dict.fromkeys(list(arrs.keys()))
self._nframes = int(arrs['nframes'])
self._shape = tuple(arrs['shape'])
# Check the type so we can read files written
# using Python 2.7
array_dtype = arrs['dtype'].dtype
# Python 3
if array_dtype.type == np.str_:
dtype_str = str(arrs['dtype'])
# Python 2.7
else:
dtype_str = arrs['dtype'].tobytes().decode()
self._dtype = np.dtype(dtype_str)

keysd.pop('nframes')
keysd.pop('shape')
keysd.pop('dtype')
for i in range(self._nframes):
keysd.pop(f"{i}_row")
keysd.pop(f"{i}_col")
keysd.pop(f"{i}_data")

# all rmaining keys should be metadata
for key in keysd:
keysd[key] = arrs[key]
self._meta = keysd
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this was just copied over from the previous _load_cache() function.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into a problem. In this simple script, I tried copying the original imageseries to another, intending to add some metadata and check that the metadata is right. But saving the imageseries failed. Here is the script.

"""Test frame cache meta data"""
import time

from hexrd import imageseries

FCFORMAT = "frame-cache"
fcfile = "lshr_r6_s0_box_ff_000031_ge1.npz"


def save_fc():
    ims = imageseries.open(fcfile, FCFORMAT)
    imageseries.save.write(ims, "fc.npz", FCFORMAT, threshold=0)

save_fc()

I got this error:

Traceback (most recent call last):
  File "/Users/deboyce/Source/hexrd/dev/frame-cache-meta/init.py", line 30, in <module>
    save_fc()
  File "/Users/deboyce/Source/hexrd/dev/frame-cache-meta/init.py", line 27, in save_fc
    imageseries.save.write(ims, "fc.npz", FCFORMAT, threshold=0)
  File "/Users/deboyce/Source/hexrd/pkg/hexrd/imageseries/save.py", line 40, in write
    w.write()
  File "/Users/deboyce/Source/hexrd/pkg/hexrd/imageseries/save.py", line 339, in write
    self._write_frames()
  File "/Users/deboyce/Source/hexrd/pkg/hexrd/imageseries/save.py", line 326, in _write_frames
    list(executor.map(extract_data, range(len(self._ims))))
  File "/Users/deboyce/miniconda3/envs/hexrd-dev/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/Users/deboyce/miniconda3/envs/hexrd-dev/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/Users/deboyce/miniconda3/envs/hexrd-dev/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/Users/deboyce/miniconda3/envs/hexrd-dev/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/deboyce/Source/hexrd/pkg/hexrd/imageseries/save.py", line 300, in extract_data
    count = extract_ijv(self._ims[i], self._thresh,
  File "/Users/deboyce/Source/hexrd/pkg/hexrd/imageseries/baseclass.py", line 23, in __getitem__
    return self._adapter[key]
  File "/Users/deboyce/Source/hexrd/pkg/hexrd/imageseries/load/framecache.py", line 144, in __getitem__
    return self._framelist[key].toarray()
IndexError: list index out of range

I figured out what was happening. The write happens simultaneously by a bunch of threads. So all of them hit the __get_item__() method quickly in succession. The first one is trying to load all the images. After it gets at least one image in the framelist, the others think the framelist is loaded and try to access their frame. I added a print to the __get_item__() method.

    def __getitem__(self, key):
        if not self._framelist_was_loaded:
            # Load the framelist now
            self._load_framelist()
        else:
            if key >= (nframes := len(self._framelist)):
                msg = f"key ({key})is out of range ({nframes})"
                # raise RuntimeError(msg)
                print(msg)
        return self._framelist[key].toarray()

Here is some output:

key (12)is out of range (1)
key (13)is out of range (1)
key (14)is out of range (1)
key (15)is out of range (4)
key (16)is out of range (8)
key (17)is out of range (9)
key (18)is out of range (9)
key (19)is out of range (9)
...

Notice I also had it raise an exception too. In that case, the code finishes without the Exception being raised, although I don't know if the saved file is correct.

Copy link

codecov bot commented Aug 18, 2024

Codecov Report

Attention: Patch coverage is 94.44444% with 2 lines in your changes missing coverage. Please review.

Project coverage is 33.53%. Comparing base (433a0d5) to head (e9017ee).

Files Patch % Lines
hexrd/imageseries/load/framecache.py 94.44% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #710      +/-   ##
==========================================
+ Coverage   33.49%   33.53%   +0.04%     
==========================================
  Files         129      129              
  Lines       21237    21241       +4     
==========================================
+ Hits         7113     7123      +10     
+ Misses      14124    14118       -6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

This is so that `__getitem__` is thread-safe again. Only one thread
should load the framelist, and the other threads wait for it to finish.

Signed-off-by: Patrick Avery <patrick.avery@kitware.com>
@donald-e-boyce
Copy link
Collaborator

Everything looks good, and my test problems work fine.

@psavery
Copy link
Collaborator Author

psavery commented Sep 5, 2024

CI already passed except for the Mac test which was fixed in #713

@psavery psavery merged commit c0971c7 into master Sep 5, 2024
6 of 12 checks passed
@psavery psavery deleted the frame-cache-lazily-load-frames branch September 5, 2024 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants