Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: in-memory caching #1207

Merged
merged 8 commits into from
Oct 28, 2024
Merged

feat: in-memory caching #1207

merged 8 commits into from
Oct 28, 2024

Conversation

zilto
Copy link
Collaborator

@zilto zilto commented Oct 25, 2024

This feature was requested by user. The goal is to enable caching without persisting results to disk. This is useful when using interactive Python sessions, notebooks, or even webservices, and don't want to use disk space.

NOTE. Need to add tests for Parallelizabl/Collect

In-memory caching requires implementing both a metadata and a result store in memory. It was also the opportunity to introduce .load_from() and .persist_to() which allows to load and save stores via other implementations.

sidenote: This feature could exist on any implementation actually (e.g., S3ResultStore.load_from(FileResultStore(...))). In the mean time, a similar pattern is currently possible through the in-memory store:

s3_result_store = S3ResultStore(...)
file_result_store = FileResultStore(...)

# this populates the S3 store from the local file-based store
temp_store = InMemoryResultStore.load_from(file_result_store, data_versions=[...])
temp_store.persist_to(s3_result_store)

# the Driver will use the S3 store, now populated with data that was local
dr = driver.Builder().with_modules(...).with_cache(result_store=s3_result_store).build()

@zilto zilto added enhancement New feature or request caching labels Oct 25, 2024
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to 1dcf954 in 50 seconds

More details
  • Looked at 1126 lines of code in 12 files
  • Skipped 1 files when reviewing.
  • Skipped posting 6 drafted comments based on config settings.
1. hamilton/caching/stores/memory.py:52
  • Draft comment:
    Consider handling the case where cache_key does not exist to avoid KeyError. You can use self._data_versions.pop(cache_key, None) to safely delete without raising an error if the key is not found.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The comment is suggesting a change to prevent potential runtime errors. This is a valid concern as using del without checking if the key exists can lead to a KeyError. The suggestion to use pop is a practical solution to handle this case safely.
    The current implementation might be intentional if the author wants to ensure that the key exists before deletion. However, the comment is still valid as it suggests a safer alternative.
    Even if the current implementation is intentional, the comment provides a safer alternative that could prevent potential errors.
    The comment is useful as it suggests a safer way to handle key deletion, which could prevent runtime errors. It should be kept.
2. hamilton/caching/stores/memory.py:189
  • Draft comment:
    Consider handling the case where data_version does not exist to avoid KeyError. You can use self._results.pop(data_version, None) to safely delete without raising an error if the key is not found.
  • Reason this comment was not posted:
    Marked as duplicate.
3. hamilton/caching/stores/file.py:5
  • Draft comment:
    The pattern for handling the ImportError for the override decorator is repeated in multiple files. Consider refactoring this into a utility function or a common module to adhere to the DRY principle.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The comment is about a change made in the diff, specifically the handling of the ImportError for the 'override' decorator. The suggestion to refactor this pattern into a utility function aligns with the DRY principle, which is a valid code quality improvement.
    The comment assumes that this pattern is repeated in multiple files, which is not verifiable from the current diff. Without evidence of repetition, the suggestion might not be necessary.
    Even if the pattern is not repeated in this diff, the suggestion to refactor for potential reuse is a proactive code quality improvement that aligns with best practices.
    Keep the comment as it suggests a valid code quality improvement by adhering to the DRY principle, even if the repetition is not evident in this diff alone.
4. hamilton/caching/stores/memory.py:4
  • Draft comment:
    The pattern for handling the ImportError for the override decorator is repeated in multiple files. Consider refactoring this into a utility function or a common module to adhere to the DRY principle. This comment also applies to hamilton/caching/stores/file.py.
  • Reason this comment was not posted:
    Marked as duplicate.
5. hamilton/caching/stores/memory.py:5
  • Draft comment:
    The pattern for handling the ImportError for the override decorator is repeated in multiple files. Consider refactoring this into a utility function or a common module to adhere to the DRY principle. This comment also applies to hamilton/caching/stores/file.py.
  • Reason this comment was not posted:
    Marked as duplicate.
6. hamilton/caching/stores/sqlite.py:27
  • Draft comment:
    The pattern for handling the ImportError for the override decorator is repeated in multiple files. Consider refactoring this into a utility function or a common module to adhere to the DRY principle. This comment also applies to hamilton/caching/stores/file.py and hamilton/caching/stores/memory.py.
  • Reason this comment was not posted:
    Marked as duplicate.

Workflow ID: wflow_SiaYX41SEhZlQiVm


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 7305b81 in 5 seconds

More details
  • Looked at 11 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 drafted comments based on config settings.
1. hamilton/caching/stores/memory.py:1
  • Draft comment:
    The import of Sequence from collections.abc is replaced by importing it directly from typing. This is a valid change as Sequence is available in typing.
  • Reason this comment was not posted:
    Confidence changes required: 0%
    The import of Sequence from collections.abc is replaced by importing it directly from typing. This is a valid change as Sequence is available in typing.
2. hamilton/caching/stores/memory.py:1
  • Draft comment:
    Consider adding this change to the docs/ to note the import source change for Sequence.
  • Reason this comment was not posted:
    Confidence changes required: 33%
    The import statement for Sequence was moved from collections.abc to typing, which is a minor change but should be noted for documentation purposes.

Workflow ID: wflow_E7c8x7XeehMSZSUW


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! A few nits.

hamilton/caching/stores/file.py Show resolved Hide resolved
hamilton/caching/stores/memory.py Show resolved Hide resolved
hamilton/caching/stores/memory.py Show resolved Hide resolved
hamilton/caching/stores/memory.py Outdated Show resolved Hide resolved
hamilton/caching/stores/sqlite.py Outdated Show resolved Hide resolved
tests/caching/metadata_store/test_base.py Outdated Show resolved Hide resolved
tests/caching/metadata_store/test_memory.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on e1a7b00 in 29 seconds

More details
  • Looked at 274 lines of code in 6 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 drafted comments based on config settings.
1. hamilton/caching/stores/sqlite.py:33
  • Draft comment:
    self._directory should be converted to a string for serialization.
        state["path"] = str(self._directory)
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The comment seems to address a potential issue with serialization of pathlib.Path objects. Since __getstate__ is meant for serialization, and pathlib.Path might not serialize correctly in all contexts, converting it to a string could be a valid suggestion. However, I need to ensure that this is indeed a necessary change and not just a preference.
    I might be overestimating the importance of converting pathlib.Path to a string. If the current implementation works correctly in all intended contexts, the suggestion might be unnecessary.
    Given that __getstate__ is used for serialization, and pathlib.Path objects might not serialize as expected, converting to a string could prevent potential issues. It's better to ensure compatibility across different serialization contexts.
    The comment is likely valid as it addresses a potential serialization issue with pathlib.Path. Converting to a string ensures compatibility in serialization contexts.
2. tests/caching/metadata_store/test_base.py:54
  • Draft comment:
    Consider re-adding metadata_store.delete_all() to ensure a clean test environment after each test.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The removal of metadata_store.delete_all() could lead to tests not having a clean state, which is important for test reliability. The comment is suggesting a change that could prevent potential test contamination. This seems like a valid concern, as ensuring a clean test environment is crucial for accurate test results.
    The comment assumes that the removal of metadata_store.delete_all() will lead to issues, but without more context on how the tests are isolated, it's speculative. The tests might be designed to handle state in another way.
    Even if the tests are designed to handle state differently, the removal of a cleanup step is a significant change that could impact test reliability. It's better to err on the side of caution and ensure tests are isolated.
    The comment should be kept as it addresses a potential issue with test isolation due to the removal of a cleanup step in the fixture.
3. tests/caching/test_integration.py:1
  • Draft comment:
    The code in this file is complex and would benefit from comments explaining the purpose and flow of the tests.
  • Reason this comment was not posted:
    Comment was on unchanged code.
4. tests/caching/result_store/test_base.py:28
  • Draft comment:
    This file is missing tests for shared properties across result store implementations. Consider adding tests to ensure compliance with the Single Responsibility Principle.
  • Reason this comment was not posted:
    Comment did not seem useful.

Workflow ID: wflow_iSWCNrxoyZZRJBPG


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on 29dbe1c in 28 seconds

More details
  • Looked at 17 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. hamilton/caching/stores/file.py:23
  • Draft comment:
    The set method is handling both path creation and result saving, which violates the Single Responsibility Principle. Consider refactoring to separate these responsibilities.
  • Reason this comment was not posted:
    Comment was on unchanged code.

Workflow ID: wflow_YTmC6Qa2eF4VEKkg


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@@ -14,6 +20,12 @@ def __init__(self, path: str, create_dir: bool = True) -> None:
if self.create_dir:
self.path.mkdir(exist_ok=True, parents=True)

def __getstate__(self) -> dict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The __getstate__ method only serializes the path attribute. If create_dir is also important for the object's state, consider including it in the serialization.

Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, a few more small nits

hamilton/caching/stores/memory.py Outdated Show resolved Hide resolved
tests/caching/metadata_store/test_memory.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on 13a0ec3 in 14 seconds

More details
  • Looked at 87 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 drafted comments based on config settings.
1. tests/caching/metadata_store/test_memory.py:37
  • Draft comment:
    InMemoryMetadataStore does not have a size attribute. Use len(in_memory_metadata_store) and len(metadata_store) to compare sizes instead.
  • Reason this comment was not posted:
    Comment was on unchanged code.
2. hamilton/caching/stores/memory.py:276
  • Draft comment:
    Consider handling the case where result is None to avoid ambiguity when a data_version is not found in result_store.
  • Reason this comment was not posted:
    Comment was on unchanged code.
3. hamilton/caching/stores/memory.py:276
  • Draft comment:
    Consider adding comments to the load_from method to explain the logic, especially around how data is retrieved from result_store and set in in_memory_result_store. This will aid in understanding the data flow and the purpose of each step.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The load_from method in InMemoryResultStore is complex and involves multiple steps to retrieve and set data. Adding comments would help clarify the process.

Workflow ID: wflow_d2C4hfGZ5MXVLHcU


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@zilto zilto merged commit 6f3799a into main Oct 28, 2024
24 checks passed
@zilto zilto deleted the feat/in-memory-store branch October 28, 2024 22:55
jernejfrank pushed a commit to jernejfrank/hamilton that referenced this pull request Oct 31, 2024
Added in-memory caching. Stores exist in the Python process, but are not 
persisted to disk after each node execution. The entire stored can be persisted
or loaded via new methods.  

Slight refactor to the `HamiltonCacheAdapter` means that implementations
supporting multiprocessing need to implement a `__getstate__` method.
Note that for `Python >= 3.11` all objects have a default implementation of
`__getstate__` but it's not the case for earlier versions.

---------

Co-authored-by: zilto <tjean@DESKTOP-V6JDCS2>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
caching enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants