-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable "lazy_tree" for all Datamodels #358
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #358 +/- ##
==========================================
+ Coverage 97.56% 97.64% +0.08%
==========================================
Files 30 36 +6
Lines 2788 3316 +528
==========================================
+ Hits 2720 3238 +518
- Misses 68 78 +10 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, I'm surprised that no changes were necessary to romancal proper.
Just out of curiosity, a "block" is an entire data extension, so for files with two blocks, lazy_tree doesn't do very much, even if they're long blocks? In my imagination a big win for lazy_tree is for operations that only need to read in the metadata (operations like [fits.getheader(x)[field] for x in list_of_filenames]), and I'm trying to understand if that's still the case with files with <= 2 data extensions.
There are some follow-up changes that are worth considering related to removing
Thanks for the question. I glossed over some details when referring to which blocks were "loaded". There is still a benefit for files with <= 2 data extensions as the above used To try and add some more (but hopefully not too much) detail. Each ASDF block has a binary header and data. When a file is opened, asdf reads the headers for the first and last blocks but typically does not load the data (when lazy_load=True). However, when the data is accessed (or the converter forces the loading in the case of >>> m = rdm.open("r0000101001001001001_01101_0001_WFI01_cal.asdf")
>>> sum([b.loaded for b in m._asdf._blocks._blocks])
2 Load the first and last block headers. However the data portion of the blocks are not yet read (this uses more private asdf API as the public API for interfacing with the blocks is within the extension code and would complicate the example here). >>> m._asdf._blocks._blocks[0]._cached_data
None For this particular file, the 0th block corresponds to coefficients of a polynomial transform in the gwcs object. So if we access the wcs it will trigger loading the block data: >>> m.meta.wcs
>>> m._asdf._blocks._blocks[0]._cached_data
array([...]) Finally, to circle back to the relationship with >>> m = rdm.open("r0000101001001001001_01101_0001_WFI01_cal.asdf", lazy_load=False)
>>> sum([b.loaded for b in m._asdf._blocks._blocks])
41
>>> m._asdf._blocks._blocks[0]._data # returns an array for every block in the file
array([...]) So for the most part using |
Ah, right, you had mentioned to me before that we had largely disabled lazy_load in the pipeline anyway; I agree that we can now try to do that better. And yes, thank you for the clear description of what it means to load the first and last blocks. If I understand correctly, there's likely a small penalty related to doing the seek to get the last block header, but it's small and there's no associated memory usage penalty; that sounds like a reasonable trade. |
Regtests all pass:
https://github.com/spacetelescope/RegressionTests/actions/runs/10163978364
This PR enables the asdf lazy_tree feature for roman_datamodels. By default asdf set's
lazy_tree=False
, this PR sets the default toTrue
for the calls toroman_datamodels.datamodels.open
.This PR also changes a few
isinstance
checks to include the "lazy" nodes returned by asdf (for example, instead of adict
asdf will return aAsdfDictNode
whenlazy_tree=True
).Finally, this PR adds
lazy=True
to the roman datamodels asdf converters to signify that these converters can handle "lazy" objects.Enabling "lazy_tree" allows asdf to defer conversion of custom objects (like
astropy.unit.Quantity
) until they are accessed (from the containing object). This allows asdf to also defer loading the blocks containing the array data forQuantity
(this is otherwise impossible due to the handling of the input forQuantity.__init__
). To provide an example, on roman_datamodels main:shows that 36 blocks were loaded from disk when the file was opened (the file contains a total of 41 blocks).
With this PR:
Only 2 blocks were loaded (asdf does this as a sanity check of the file, loading the first and last block to confirm that the blocks are ordered appropriately).
Furthermore, with this PR, the above example takes 142.8M of RAM, whereas with main the example takes 486.5M.
Checklist
CHANGES.rst
under the corresponding subsection