feat: in-memory caching #1207

zilto · 2024-10-25T15:49:41Z

This feature was requested by user. The goal is to enable caching without persisting results to disk. This is useful when using interactive Python sessions, notebooks, or even webservices, and don't want to use disk space.

NOTE. Need to add tests for Parallelizabl/Collect

In-memory caching requires implementing both a metadata and a result store in memory. It was also the opportunity to introduce .load_from() and .persist_to() which allows to load and save stores via other implementations.

sidenote: This feature could exist on any implementation actually (e.g., S3ResultStore.load_from(FileResultStore(...))). In the mean time, a similar pattern is currently possible through the in-memory store:

s3_result_store = S3ResultStore(...)
file_result_store = FileResultStore(...)

# this populates the S3 store from the local file-based store
temp_store = InMemoryResultStore.load_from(file_result_store, data_versions=[...])
temp_store.persist_to(s3_result_store)

# the Driver will use the S3 store, now populated with data that was local
dr = driver.Builder().with_modules(...).with_cache(result_store=s3_result_store).build()

ellipsis-dev

👍 Looks good to me! Reviewed everything up to 1dcf954 in 50 seconds

More details

Looked at 1126 lines of code in 12 files
Skipped 1 files when reviewing.
Skipped posting 6 drafted comments based on config settings.

1. hamilton/caching/stores/memory.py:52

Draft comment:
Consider handling the case where cache_key does not exist to avoid KeyError. You can use self._data_versions.pop(cache_key, None) to safely delete without raising an error if the key is not found.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The comment is suggesting a change to prevent potential runtime errors. This is a valid concern as using del without checking if the key exists can lead to a KeyError. The suggestion to use pop is a practical solution to handle this case safely.
The current implementation might be intentional if the author wants to ensure that the key exists before deletion. However, the comment is still valid as it suggests a safer alternative.
Even if the current implementation is intentional, the comment provides a safer alternative that could prevent potential errors.
The comment is useful as it suggests a safer way to handle key deletion, which could prevent runtime errors. It should be kept.

2. hamilton/caching/stores/memory.py:189

Draft comment:
Consider handling the case where data_version does not exist to avoid KeyError. You can use self._results.pop(data_version, None) to safely delete without raising an error if the key is not found.
Reason this comment was not posted:
Marked as duplicate.

3. hamilton/caching/stores/file.py:5

Draft comment:
The pattern for handling the ImportError for the override decorator is repeated in multiple files. Consider refactoring this into a utility function or a common module to adhere to the DRY principle.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The comment is about a change made in the diff, specifically the handling of the ImportError for the 'override' decorator. The suggestion to refactor this pattern into a utility function aligns with the DRY principle, which is a valid code quality improvement.
The comment assumes that this pattern is repeated in multiple files, which is not verifiable from the current diff. Without evidence of repetition, the suggestion might not be necessary.
Even if the pattern is not repeated in this diff, the suggestion to refactor for potential reuse is a proactive code quality improvement that aligns with best practices.
Keep the comment as it suggests a valid code quality improvement by adhering to the DRY principle, even if the repetition is not evident in this diff alone.

4. hamilton/caching/stores/memory.py:4

Draft comment:
The pattern for handling the ImportError for the override decorator is repeated in multiple files. Consider refactoring this into a utility function or a common module to adhere to the DRY principle. This comment also applies to hamilton/caching/stores/file.py.
Reason this comment was not posted:
Marked as duplicate.

5. hamilton/caching/stores/memory.py:5

Draft comment:
The pattern for handling the ImportError for the override decorator is repeated in multiple files. Consider refactoring this into a utility function or a common module to adhere to the DRY principle. This comment also applies to hamilton/caching/stores/file.py.
Reason this comment was not posted:
Marked as duplicate.

6. hamilton/caching/stores/sqlite.py:27

Draft comment:
The pattern for handling the ImportError for the override decorator is repeated in multiple files. Consider refactoring this into a utility function or a common module to adhere to the DRY principle. This comment also applies to hamilton/caching/stores/file.py and hamilton/caching/stores/memory.py.
Reason this comment was not posted:
Marked as duplicate.

Workflow ID: wflow_SiaYX41SEhZlQiVm

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev

👍 Looks good to me! Incremental review on 7305b81 in 5 seconds

More details

Looked at 11 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 2 drafted comments based on config settings.

1. hamilton/caching/stores/memory.py:1

Draft comment:
The import of Sequence from collections.abc is replaced by importing it directly from typing. This is a valid change as Sequence is available in typing.
Reason this comment was not posted:
Confidence changes required: 0%
The import of Sequence from collections.abc is replaced by importing it directly from typing. This is a valid change as Sequence is available in typing.

2. hamilton/caching/stores/memory.py:1

Draft comment:
Consider adding this change to the docs/ to note the import source change for Sequence.
Reason this comment was not posted:
Confidence changes required: 33%
The import statement for Sequence was moved from collections.abc to typing, which is a minor change but should be noted for documentation purposes.

Workflow ID: wflow_E7c8x7XeehMSZSUW

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

elijahbenizzy

Looks good! A few nits.

hamilton/caching/stores/file.py

hamilton/caching/stores/memory.py

hamilton/caching/stores/sqlite.py

tests/caching/metadata_store/test_base.py

tests/caching/metadata_store/test_memory.py

ellipsis-dev

👍 Looks good to me! Incremental review on e1a7b00 in 29 seconds

More details

Looked at 274 lines of code in 6 files
Skipped 0 files when reviewing.
Skipped posting 4 drafted comments based on config settings.

1. hamilton/caching/stores/sqlite.py:33

Draft comment:
self._directory should be converted to a string for serialization.

        state["path"] = str(self._directory)

Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The comment seems to address a potential issue with serialization of pathlib.Path objects. Since __getstate__ is meant for serialization, and pathlib.Path might not serialize correctly in all contexts, converting it to a string could be a valid suggestion. However, I need to ensure that this is indeed a necessary change and not just a preference.
I might be overestimating the importance of converting pathlib.Path to a string. If the current implementation works correctly in all intended contexts, the suggestion might be unnecessary.
Given that __getstate__ is used for serialization, and pathlib.Path objects might not serialize as expected, converting to a string could prevent potential issues. It's better to ensure compatibility across different serialization contexts.
The comment is likely valid as it addresses a potential serialization issue with pathlib.Path. Converting to a string ensures compatibility in serialization contexts.

2. tests/caching/metadata_store/test_base.py:54

Draft comment:
Consider re-adding metadata_store.delete_all() to ensure a clean test environment after each test.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
The removal of metadata_store.delete_all() could lead to tests not having a clean state, which is important for test reliability. The comment is suggesting a change that could prevent potential test contamination. This seems like a valid concern, as ensuring a clean test environment is crucial for accurate test results.
The comment assumes that the removal of metadata_store.delete_all() will lead to issues, but without more context on how the tests are isolated, it's speculative. The tests might be designed to handle state in another way.
Even if the tests are designed to handle state differently, the removal of a cleanup step is a significant change that could impact test reliability. It's better to err on the side of caution and ensure tests are isolated.
The comment should be kept as it addresses a potential issue with test isolation due to the removal of a cleanup step in the fixture.

3. tests/caching/test_integration.py:1

Draft comment:
The code in this file is complex and would benefit from comments explaining the purpose and flow of the tests.
Reason this comment was not posted:
Comment was on unchanged code.

4. tests/caching/result_store/test_base.py:28

Draft comment:
This file is missing tests for shared properties across result store implementations. Consider adding tests to ensure compliance with the Single Responsibility Principle.
Reason this comment was not posted:
Comment did not seem useful.

Workflow ID: wflow_iSWCNrxoyZZRJBPG

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev

❌ Changes requested. Incremental review on 29dbe1c in 28 seconds

More details

Looked at 17 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. hamilton/caching/stores/file.py:23

Draft comment:
The set method is handling both path creation and result saving, which violates the Single Responsibility Principle. Consider refactoring to separate these responsibilities.
Reason this comment was not posted:
Comment was on unchanged code.

Workflow ID: wflow_YTmC6Qa2eF4VEKkg

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev · 2024-10-28T19:14:57Z

hamilton/caching/stores/file.py

@@ -14,6 +20,12 @@ def __init__(self, path: str, create_dir: bool = True) -> None:
        if self.create_dir:
            self.path.mkdir(exist_ok=True, parents=True)

+    def __getstate__(self) -> dict:


The __getstate__ method only serializes the path attribute. If create_dir is also important for the object's state, consider including it in the serialization.

elijahbenizzy

Looks good, a few more small nits

hamilton/caching/stores/memory.py

tests/caching/metadata_store/test_memory.py

ellipsis-dev

👍 Looks good to me! Incremental review on 13a0ec3 in 14 seconds

More details

Looked at 87 lines of code in 3 files
Skipped 0 files when reviewing.
Skipped posting 3 drafted comments based on config settings.

1. tests/caching/metadata_store/test_memory.py:37

Draft comment:
InMemoryMetadataStore does not have a size attribute. Use len(in_memory_metadata_store) and len(metadata_store) to compare sizes instead.
Reason this comment was not posted:
Comment was on unchanged code.

2. hamilton/caching/stores/memory.py:276

Draft comment:
Consider handling the case where result is None to avoid ambiguity when a data_version is not found in result_store.
Reason this comment was not posted:
Comment was on unchanged code.

3. hamilton/caching/stores/memory.py:276

Draft comment:
Consider adding comments to the load_from method to explain the logic, especially around how data is retrieved from result_store and set in in_memory_result_store. This will aid in understanding the data flow and the purpose of each step.
Reason this comment was not posted:
Confidence changes required: 50%
The load_from method in InMemoryResultStore is complex and involves multiple steps to retrieve and set data. Adding comments would help clarify the process.

Workflow ID: wflow_d2C4hfGZ5MXVLHcU

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Added in-memory caching. Stores exist in the Python process, but are not persisted to disk after each node execution. The entire stored can be persisted or loaded via new methods. Slight refactor to the `HamiltonCacheAdapter` means that implementations supporting multiprocessing need to implement a `__getstate__` method. Note that for `Python >= 3.11` all objects have a default implementation of `__getstate__` but it's not the case for earlier versions. --------- Co-authored-by: zilto <tjean@DESKTOP-V6JDCS2>

zilto added 2 commits October 24, 2024 16:32

added docstrings

7e9e9f0

added in-memory caching

1dcf954

zilto added enhancement New feature or request caching labels Oct 25, 2024

ellipsis-dev bot reviewed Oct 25, 2024

View reviewed changes

fix 3.8 typing

7305b81

ellipsis-dev bot reviewed Oct 25, 2024

View reviewed changes

elijahbenizzy reviewed Oct 25, 2024

View reviewed changes

zilto added 3 commits October 28, 2024 10:23

added override decorator

04c60f6

added override decorators; added constant for test parameterization

4273aba

refactored to support Parallel/Collect

e1a7b00

ellipsis-dev bot reviewed Oct 28, 2024

View reviewed changes

added __getstate__

29dbe1c

ellipsis-dev bot reviewed Oct 28, 2024

View reviewed changes

elijahbenizzy approved these changes Oct 28, 2024

View reviewed changes

hamilton/caching/stores/memory.py Outdated Show resolved Hide resolved

tests/caching/metadata_store/test_memory.py Outdated Show resolved Hide resolved

removed comment; extracted variable PERSISTENT_IMPLEMENTATIONS

13a0ec3

ellipsis-dev bot reviewed Oct 28, 2024

View reviewed changes

zilto merged commit 6f3799a into main Oct 28, 2024
24 checks passed

zilto deleted the feat/in-memory-store branch October 28, 2024 22:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: in-memory caching #1207

feat: in-memory caching #1207

zilto commented Oct 25, 2024 •

edited

Loading

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

elijahbenizzy left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot left a comment

ellipsis-dev bot Oct 28, 2024

elijahbenizzy left a comment

ellipsis-dev bot left a comment

feat: in-memory caching #1207

feat: in-memory caching #1207

Conversation

zilto commented Oct 25, 2024 • edited Loading

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

elijahbenizzy left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ellipsis-dev bot Oct 28, 2024

Choose a reason for hiding this comment

elijahbenizzy left a comment

Choose a reason for hiding this comment

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

zilto commented Oct 25, 2024 •

edited

Loading