feat: First-class caching #1104

zilto · 2024-08-23T00:59:52Z

This follows up on the closed PR #1039.

Current TODOs:

Clean up docstrings and comments that are outdated
[~] Write documentation
[~] Add special behavior (always recompute, don't fingerprint, don't cache)
Review the assumptions around the data model
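
The special behaviors named in the TODO above could be modeled as an enum with a string parser. This is a minimal sketch, not the PR's implementation: the commit log in this thread mentions a `CachingBehavior` enum with `from_string`, but the class name `Behavior`, its members, and the normalization rules below are assumptions made for illustration.

```python
import enum


class Behavior(enum.Enum):
    """Hypothetical per-node caching behaviors, named after the TODO item."""

    DEFAULT = "default"
    ALWAYS_RECOMPUTE = "always_recompute"
    DONT_FINGERPRINT = "dont_fingerprint"
    DONT_CACHE = "dont_cache"

    @classmethod
    def from_string(cls, value: str) -> "Behavior":
        """Parse a user-supplied string, ignoring case, spaces, and hyphens."""
        normalized = value.strip().lower().replace("-", "_").replace(" ", "_")
        try:
            return cls(normalized)
        except ValueError:
            valid = ", ".join(member.value for member in cls)
            raise ValueError(
                f"Unknown behavior {value!r}; expected one of: {valid}"
            ) from None
```

Parsing from a string keeps the behavior configurable from decorators, config files, or CLI flags without exposing the enum itself to every caller.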

Other TODOs:

structured the Hamilton modules
standardized naming: result_store and metadata_store (closer to S3 lingo), fingerprinting, fingerprint
introspection mechanism (pre-run, and post-run)
introspection visualization
[~] logging for cache adapter; could be improved for storage
checkpointing feature using run_id. Passing a run_id string or latest to result_from should do the trick
[~] OOP or protocols for Store subclasses. It's currently a bit confusing that result_store and metadata_store both inherit from the same BaseStore but have different method signatures for get and set
[~] Works with materializers. It does work, but has rough edges:
- for to.file or other destinations where the suffix doesn't match the ideal file extension.
- Can't pass kwargs to materializers
- Assumes the materializer's only required arg is path
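
One way to address the `BaseStore` concern in the list above is to give each store its own structural interface, so the differing `get`/`set` signatures are explicit. The following is a minimal sketch using `typing.Protocol`. The protocol names, method signatures, and in-memory classes are assumptions for illustration and are not taken from the PR.

```python
from typing import Any, Dict, Optional, Protocol


class MetadataStoreProtocol(Protocol):
    """Hypothetical interface: maps a cache key to a data version."""

    def get(self, cache_key: str) -> Optional[str]: ...

    def set(self, cache_key: str, data_version: str) -> None: ...


class ResultStoreProtocol(Protocol):
    """Hypothetical interface: maps a data version to a stored result."""

    def get(self, data_version: str) -> Any: ...

    def set(self, data_version: str, result: Any) -> None: ...


class InMemoryMetadataStore:
    def __init__(self) -> None:
        self._versions: Dict[str, str] = {}

    def get(self, cache_key: str) -> Optional[str]:
        return self._versions.get(cache_key)

    def set(self, cache_key: str, data_version: str) -> None:
        self._versions[cache_key] = data_version


class InMemoryResultStore:
    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def get(self, data_version: str) -> Any:
        return self._results[data_version]

    def set(self, data_version: str, result: Any) -> None:
        self._results[data_version] = result


def lookup(
    metadata_store: MetadataStoreProtocol,
    result_store: ResultStoreProtocol,
    cache_key: str,
) -> Any:
    """Return the cached result for cache_key, or None on a cache miss."""
    data_version = metadata_store.get(cache_key)
    if data_version is None:
        return None
    return result_store.get(data_version)
```

With protocols, neither in-memory class needs a shared base class, and a type checker can still verify that each one is passed in the right position.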

tests/caching/cases.py

tests/caching/test_integration.py

docs/concepts/caching.rst

elijahbenizzy · 2024-08-28T23:10:43Z

High-level:

We should have some user stories that map to common use cases. The docs are good at explaining the very high level, but we want people to connect it to their own workflow (maybe multiple notebooks?)
In the docs it feels like there's some assumed knowledge about terms (fingerprinting, etc...). I think we're exposing a bit at the lower level here -- user stories should help.

A few more user stories would help me evaluate the use cases. It seems technically solid overall, although I would want instructions on extending it (e.g., adding a global cache).

docs/concepts/caching.rst

hamilton/driver.py

elijahbenizzy

Lots of thoughts here, exciting stuff

elijahbenizzy · 2024-09-17T21:54:30Z

hamilton/driver.py

@@ -1696,6 +1709,14 @@ def validate_materialization(
        all_nodes = nodes | user_nodes
        self.graph_executor.validate(list(all_nodes))

+    @property
+    def cache(self) -> lifecycle.SmartCacheAdapter:


If this is user-facing, add more comments (e.g., many people might not know that caching is implemented by an adapter, or why they might want it).

hamilton/driver.py

hamilton/experimental/h_cache.py

hamilton/lifecycle/caching.py

docs/how-tos/caching.rst

elijahbenizzy · 2024-09-24T19:21:00Z

docs/reference/caching/caching-logic.rst

@@ -0,0 +1,52 @@
+=======================
+Caching logic


Add a brief overview of what this section covers.
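
As a rough illustration of what such an overview might describe: a node's result can be reused when its code version and the data versions of its dependencies are unchanged. The sketch below is a toy under stated assumptions, not the adapter's actual logic. `make_cache_key`, `run_node`, and the two module-level dicts are hypothetical names, and hashing the `repr` of a result stands in for real data versioning.

```python
import hashlib
from typing import Any, Callable, Dict, Tuple

# Hypothetical in-memory stand-ins for the metadata_store and result_store.
metadata_store: Dict[str, str] = {}  # cache_key -> data_version
result_store: Dict[str, Any] = {}    # data_version -> result


def _hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_cache_key(
    node_name: str, code_version: str, dep_versions: Dict[str, str]
) -> str:
    """Combine node name, code version, and dependency data versions."""
    deps = ",".join(f"{name}={ver}" for name, ver in sorted(dep_versions.items()))
    return _hash(f"{node_name}|{code_version}|{deps}")


def run_node(
    node_name: str,
    fn: Callable[..., Any],
    code_version: str,
    inputs: Dict[str, Any],
    dep_versions: Dict[str, str],
) -> Tuple[Any, str, bool]:
    """Return (result, data_version, cache_hit) for one node execution."""
    key = make_cache_key(node_name, code_version, dep_versions)
    data_version = metadata_store.get(key)
    if data_version is not None and data_version in result_store:
        # Cache hit: skip execution and load the stored result.
        return result_store[data_version], data_version, True
    # Cache miss: execute, version the result, and record both mappings.
    result = fn(**inputs)
    data_version = _hash(repr(result))
    metadata_store[key] = data_version
    result_store[data_version] = result
    return result, data_version, False
```

Changing either the code version or any dependency's data version yields a different key, which forces a recompute.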

hamilton/lifecycle/caching.py

hamilton/plugins/h_diskcache.py

hamilton/stores/base.py

hamilton/stores/file.py

added docstring to base64 encode fingerprinting notebook section added caching code and example dev checkpoint added Fingerprint class first-class caching v0.1 fixed tests updated module_6 test case updated docstrings documentation WIP added refactor file structure add direct access to cache via Driver remove Support Parallelizable/Collect; removed Fingerprint Refactored the logic of the CacheAdapter to have explicit operations. Many challenges arise from `Parallel` nodes that return a sequence of elements, but actually the data version of individual elements is what matters. Also, `Collect` can have difficulty accessing upstream data versions if these were computed in other threads/processes. The `Fingerprint` construct was removed because it obfuscated what information was relevant to pass around. Its name is also less evocative than "data version" and "code version". made SQLMetadataStore threadsafe added structured logging; move class to hamilton.lifecycle.caching structured log; data saver; cache decorator added from_string to CachingBehavior enum added docs and mermaid graph support fixed variable renaming issue updated docs requirements; adapted typing to 3.8 reverted Sequence type to Sequence class for singledispatch fixed typo added deprecation warnings to other caching methods fixed missing kwarg for recursive data versioning output structured log to file updated expand nodes handling; added ignore behavior updated caching behaviors and admonitions updated Sequence import for 3.8 support fixed bugs: sentinel values, log printing, failed nodes updated docs updated docstring; fixed materialization with parallel fixed 3.8 typing guard against setting cache twice registered separate function for versioning bytes refactored adapter to use internal hook replace HamiltonGraph by FunctionGraph add _get_node_role() method switched to internal hooks pre/post node refactored to key on run_id; refactor result_store refactored sqlitestore to hamilton.stores fixed type annotations fixed docs reference and docstrings updated result_store tests fixed materialization from result_store added roadmap to docs refactored to move to hamilton/caching fingerprinting.set_max_depth() added changed cache decorator to a class updated all docstrings for SmartCacheAdapter renamed context_key to cache_key added deprecation warnings using logging improved warning message added TODO
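
Several of the commits above touch data versioning: `singledispatch` registration on the `Sequence` class, a separate function for bytes, recursive versioning, and a maximum depth. The following is a minimal sketch of how those pieces could fit together. It is not the code from this PR: `hash_value`, the default depth, the depth-limit sentinel, and the choice of SHA-224 are all assumptions, and `set_max_depth` only borrows its name from the commit log.

```python
import functools
import hashlib
from collections.abc import Mapping, Sequence

_MAX_DEPTH = 4  # assumed default; the real default is not stated in the thread


def set_max_depth(depth: int) -> None:
    """Set how many levels of nesting are versioned recursively."""
    global _MAX_DEPTH
    _MAX_DEPTH = depth


def _digest(data: bytes) -> str:
    return hashlib.sha224(data).hexdigest()


@functools.singledispatch
def hash_value(obj, depth: int = 0) -> str:
    """Fallback for unregistered types: hash the repr of the object."""
    return _digest(repr(obj).encode("utf-8"))


@hash_value.register(bytes)
def _hash_bytes(obj, depth: int = 0) -> str:
    # bytes is a Sequence, so it needs its own, more specific registration.
    return _digest(obj)


@hash_value.register(str)
def _hash_str(obj, depth: int = 0) -> str:
    return _digest(obj.encode("utf-8"))


@hash_value.register(Sequence)
def _hash_sequence(obj, depth: int = 0) -> str:
    if depth >= _MAX_DEPTH:
        return _digest(b"<max depth reached>")
    parts = [hash_value(item, depth + 1) for item in obj]
    return _digest("|".join(parts).encode("utf-8"))


@hash_value.register(Mapping)
def _hash_mapping(obj, depth: int = 0) -> str:
    if depth >= _MAX_DEPTH:
        return _digest(b"<max depth reached>")
    # Sort hashed pairs so that insertion order does not change the version.
    pairs = sorted(
        (hash_value(k, depth + 1), hash_value(v, depth + 1)) for k, v in obj.items()
    )
    joined = "|".join(f"{k}:{v}" for k, v in pairs)
    return _digest(joined.encode("utf-8"))
```

Registering against the `collections.abc` classes rather than `typing` aliases is what lets `singledispatch` match lists, tuples, and dicts. The trade-off of a depth limit is that values nested deeper than the limit all receive the same version.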


skrawcz · 2024-09-30T04:41:14Z

Looking good! Does this work with adapters like Ray, or the graceful error one?
If there are things that this doesn't work with or behavior would be weird -- we should have a section mentioning that.

elijahbenizzy

Looking great! A few points on documentation. Caveats + testing:

mlflow adapter
hamilton UI (curious what this looks like)

But yeah, gotta get this out, merge tonight/tomorrow morning?

docs/concepts/caching.rst

examples/caching/README.md

hamilton/caching/adapter.py

elijahbenizzy · 2024-09-30T17:55:47Z

hamilton/caching/stores/sqlite.py

+from hamilton.caching.stores.base import MetadataStore
+
+
+class SQLiteMetadataStore(MetadataStore):


Comments on this -- it'll be something people want to look into for debugging

elijahbenizzy · 2024-09-30T17:56:02Z

hamilton/caching/stores/utils.py

+import pathlib
+
+
+def get_directory_size(directory: str) -> float:
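
The excerpt shows only the signature of `get_directory_size`. One plausible body is sketched here, under the assumption that the function sums file sizes recursively and returns bytes (the unit is not stated in the excerpt):

```python
import pathlib


def get_directory_size(directory: str) -> float:
    """Return the total size, in bytes, of all regular files under `directory`."""
    total = 0
    for path in pathlib.Path(directory).rglob("*"):
        # Count only files; directory entries themselves are skipped.
        if path.is_file():
            total += path.stat().st_size
    return float(total)
```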


hamilton/driver.py

elijahbenizzy · 2024-09-30T17:59:33Z

hamilton/driver.py

@@ -1905,6 +1931,11 @@ def with_adapters(self, *adapters: lifecycle_base.LifecycleAdapter) -> "Builder"
        :param adapter: Adapter to use.
        :return: self
        """
+        if any(isinstance(adapter, SmartCacheAdapter) for adapter in adapters):


nit, but we should store cache then add it to the adapters later I think?

Not a blocker
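
The suggestion above, keeping the cache adapter in its own slot and appending it to the adapter list at build time, might look like the following. This is a toy stand-in, not Hamilton's `Builder`: `ToyBuilder`, `ToyCacheAdapter`, and this `with_cache` method are hypothetical and only illustrate the ordering and the guard against setting the cache twice.

```python
from typing import Any, List, Optional


class ToyCacheAdapter:
    """Placeholder for a cache adapter."""


class ToyBuilder:
    def __init__(self) -> None:
        self._adapters: List[Any] = []
        self._cache: Optional[ToyCacheAdapter] = None

    def _set_cache(self, cache: ToyCacheAdapter) -> None:
        # Guard: only one cache adapter may be attached to a builder.
        if self._cache is not None:
            raise ValueError("A cache adapter was already set on this builder.")
        self._cache = cache

    def with_adapters(self, *adapters: Any) -> "ToyBuilder":
        for adapter in adapters:
            if isinstance(adapter, ToyCacheAdapter):
                self._set_cache(adapter)  # stored separately, not in the list
            else:
                self._adapters.append(adapter)
        return self

    def with_cache(self) -> "ToyBuilder":
        self._set_cache(ToyCacheAdapter())
        return self

    def build(self) -> List[Any]:
        """Return the final adapter list; the cache, if any, is appended last."""
        adapters = list(self._adapters)
        if self._cache is not None:
            adapters.append(self._cache)
        return adapters
```

Deferring the append to `build()` means the result does not depend on the order in which `with_cache` and `with_adapters` were called.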

hamilton/function_modifiers/metadata.py

elijahbenizzy

Looking good! Will glance over when we sync tomorrow, then approve + merge.

elijahbenizzy · 2024-10-03T05:25:14Z

hamilton/caching/adapter.py

+                # solution for `@dataloader` and `from_`
+                if behaviors.get(main_node, None) is not None:
+                    behaviors[node.name] = behaviors[main_node]
+                # this hacky section is required to support @load_from and provide


Hmm, not sure I follow what's happening here. Let's add some more docs.

Specifically, document desired behavior here/how the code maps to it

elijahbenizzy · 2024-10-03T05:27:49Z

hamilton/caching/adapter.py

                data_version = self._version_data(node_name=node_name, run_id=run_id, result=result)
+
+                # nodes collected in `._data_savers` return a dictionary of metadata


I'm not convinced we need to special-case these... But for now, let's call this a bit experimental (data savers/loaders) -- it's a very odd case that we don't want to dwell on more.

elijahbenizzy

Well done!

zilto added the enhancement New feature or request label Aug 23, 2024

skrawcz reviewed Aug 23, 2024


tests/caching/cases.py (outdated)

skrawcz reviewed Aug 23, 2024


tests/caching/test_integration.py (outdated)

skrawcz reviewed Aug 23, 2024


tests/caching/test_integration.py (outdated)

skrawcz reviewed Aug 28, 2024
