
[WIP] Attempt to continue LRU cache for decoded chunks #1214

Closed
wants to merge 58 commits

Conversation

@croth1 croth1 commented Oct 23, 2022

Continuation attempt of #306. Do not merge yet: the behaviour is very likely still incorrect when chunks are deleted from the store, which #738 introduced.

I still have very limited knowledge of the code base; I will need to dig a bit deeper to gain a better understanding.
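For context, the eviction and invalidation behaviour under discussion (including invalidating entries when a chunk is deleted from the store, cf. #738) can be illustrated with a minimal hypothetical sketch. This is not the PR's actual implementation; the class and method names are invented:

```python
from collections import OrderedDict

class SimpleLRUChunkCache:
    """Hypothetical sketch of an LRU cache keyed by chunk coordinates.

    Not the implementation in this PR; it only illustrates the eviction
    behaviour being discussed, plus explicit invalidation when a chunk
    is deleted from the underlying store (cf. #738).
    """

    def __init__(self, max_size):
        self._max_size = max_size  # maximum number of cached chunks
        self._cache = OrderedDict()

    def get(self, key, decode):
        # On a hit, move the chunk to the most-recently-used position.
        if key in self._cache:
            self._cache.move_to_end(key)
            return self._cache[key]
        # On a miss, decode the chunk and cache it, evicting the
        # least-recently-used entry if the cache is full.
        value = decode(key)
        self._cache[key] = value
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)
        return value

    def invalidate(self, key):
        # Must be called when a chunk is deleted from the store,
        # otherwise stale decoded data would be served.
        self._cache.pop(key, None)
```

Without the `invalidate` hook, a deletion in the store would leave stale decoded bytes in the cache, which is exactly the interaction with #738 flagged above.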

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

shikharsg and others added 30 commits August 29, 2018 13:17
… decode round trip for object arrays (with tests)
factoring out mapping code from LRUStoreCache and LRUChunkCache
croth1 commented Oct 23, 2022

Hmm, OK: the diff of the merge commit still looks a bit off. Some code from other PRs seems to have leaked in or been duplicated, or I am just bad at interpreting the diff. I will investigate tomorrow what happened there.

class StoreTests(MutableMappingStoreTests):
    """Abstract store tests."""

@croth1 (Author):

IIRC, I needed to move this one because LRUChunkCache doesn't implement the context manager interface; maybe it should implement it instead?
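If LRUChunkCache were to grow a context manager interface, a minimal hypothetical sketch might look like the following (class name and `close()` behaviour are invented for illustration, not taken from this PR):

```python
class LRUChunkCacheSketch:
    """Hypothetical sketch of adding the context manager protocol to a
    chunk cache so it can be used in ``with`` blocks like other stores.
    The real class in this PR may need different cleanup in close()."""

    def __init__(self):
        self._cache = {}
        self.closed = False

    def close(self):
        # Discard all cached chunks and mark the cache as closed.
        self._cache.clear()
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
```

With `__enter__`/`__exit__` in place, the test helper above would no longer need special-casing for this class.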

@@ -87,6 +87,15 @@ class Array:
read and decompressed when possible.

.. versionadded:: 2.7
@croth1 (Author):

This will need to be updated.

@@ -478,6 +489,16 @@ def open_array(
non-fill-value data are stored, at the expense of overhead associated
with checking the data of each chunk.

.. versionadded:: 2.7
@croth1 (Author):

This needs to be updated.

@joshmoore (Member) commented:

Kicked off the GHA workflows.


codecov bot commented Oct 24, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.99%. Comparing base (f361631) to head (16793f5).
Report is 298 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #1214    +/-   ##
========================================
  Coverage   99.99%   99.99%            
========================================
  Files          35       35            
  Lines       14136    14366   +230     
========================================
+ Hits        14135    14365   +230     
  Misses          1        1            
Files with missing lines    Coverage            Δ
zarr/__init__.py            100.00% <ø>         (ø)
zarr/core.py                100.00% <100.00%>   (ø)
zarr/creation.py            100.00% <ø>         (ø)
zarr/hierarchy.py           99.78% <100.00%>    (+<0.01%) ⬆️
zarr/storage.py             100.00% <100.00%>   (ø)
zarr/tests/test_core.py     100.00% <100.00%>   (ø)
zarr/tests/test_storage.py  100.00% <100.00%>   (ø)

@joshmoore (Member) commented:

Green. Are there any other commits expected from your side, @croth1?

@croth1 (Author) commented Nov 1, 2022

I am still not entirely sure whether this is the final form, or whether I would rather re-implement it as a caching store on top of other stores, just like LRUStoreCache; that feels a bit more intuitive to me. I also still want to check whether I can do write caching. If I remember correctly, this implementation is write-through, and I am not sure whether cached writes can be achieved in combination with LRUStoreCache. I need to do more research, but I am a bit busy right now.
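The write-through behaviour mentioned here can be sketched as follows; this is an illustrative, hypothetical snippet (class and names invented), not the PR's code. A write-back variant would instead buffer writes and flush them to the store later:

```python
class WriteThroughCache:
    """Hedged sketch of write-through caching: every write goes to the
    underlying store immediately, while the value is also kept in the
    cache, so cache and store never diverge. Names are illustrative,
    not taken from this PR."""

    def __init__(self, store):
        self._store = store  # any dict-like backing store
        self._cache = {}

    def __setitem__(self, key, value):
        self._store[key] = value  # write-through: store updated now
        self._cache[key] = value  # cache kept coherent with the store

    def __getitem__(self, key):
        # Fill the cache lazily on first read of a key.
        if key not in self._cache:
            self._cache[key] = self._store[key]
        return self._cache[key]
```

The trade-off flagged above: write-through keeps the store consistent at all times but gains nothing on write latency, which is why combining it with a separate write cache such as LRUStoreCache is an open question here.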

FarzanT commented May 20, 2023

Hi, any news on when this feature will be merged into main? It seems like a very useful feature!
I want to reduce the overhead of repeatedly sampling from the same chunk, which should already be in RAM. If I'm not mistaken, every time you pass an index to the zarr array, it currently has to decompress the chunk containing that index. I would like an option to keep the entire chunk decompressed in the cache until it is discarded. Specifically, I'm using a large zarr array in my PyTorch dataset module and need to randomly iterate through samples in the same chunk before randomly picking the next chunk. Avoiding the decompression overhead would be quite useful, and unfortunately I can't leave the data fully decompressed on disk, as it would be too large.
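As a stop-gap for this use case, the per-chunk decompression can be memoized in user code. A hypothetical sketch using `functools.lru_cache`; the `compressed_chunks` dict and the zlib codec are stand-ins for a real zarr store and its compressor:

```python
import functools
import zlib

# Stand-in for a chunk store holding compressed chunks (hypothetical).
compressed_chunks = {0: zlib.compress(bytes(range(256)))}

@functools.lru_cache(maxsize=4)  # keep up to 4 decoded chunks in RAM
def decoded_chunk(chunk_id):
    # Decompress a chunk; cached so repeat access skips this step.
    return zlib.decompress(compressed_chunks[chunk_id])

def sample(chunk_id, offset):
    # Repeated samples from the same chunk reuse the cached bytes
    # instead of decompressing the chunk again.
    return decoded_chunk(chunk_id)[offset]
```

This mirrors the PyTorch sampling pattern described above: many reads from one chunk hit the cache, and only switching to a new chunk pays the decompression cost.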

@croth1 (Author) commented Jun 2, 2023

@FarzanT @joshmoore, I have in the meantime changed my approach and no longer need this myself, hence the long silence. If it is useful for other people, we can try finishing it up as is. It would need some rebasing and performance testing, though.

IIRC, last time I checked, all tests were green, although this would need a very thorough review, because I was frequently not 100% sure what I was doing during the quite substantial rebase.

@joshmoore (Member) commented:

I'll leave @FarzanT and others to say how pressing their need is. Happy to help however I can, @croth1, if you'd like to pursue this.

FarzanT commented Jun 6, 2023

Thank you @croth1 and @joshmoore, my primary use case has also been addressed by #278 (comment), so it's not a pressing issue for me at the moment. But I'd say it would be much nicer to just flip a switch and have zarr handle this internally. If this pull request can address this, then it shouldn't be abandoned IMO!

@jhamman (Member) commented Oct 11, 2024

I'm going to close this as stale. Folks should feel free to reopen if there is interest in continuing this work.

@jhamman jhamman closed this Oct 11, 2024