feat: First-class caching #1104

Merged
merged 26 commits into from
Oct 3, 2024

Conversation

zilto
Collaborator

@zilto zilto commented Aug 23, 2024

This follows up on the closed PR #1039

Current TODOs:

  • Clean up docstrings and comments that are outdated
  • [~] Write documentation
  • [~] Add special behavior (always recompute, don't fingerprint, don't cache)
  • Review the assumptions around the data model
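The special behaviors in the list above could be modeled as an enum that is parsed from user-supplied strings; a later commit in this PR mentions adding `from_string` to the `CachingBehavior` enum. The following is only a minimal sketch: the member names (`DEFAULT`, `RECOMPUTE`, `DISABLE`, `IGNORE`), their mapping to the behaviors above, and the error handling are assumptions, not the merged implementation.

```python
import enum


class CachingBehavior(enum.Enum):
    """Per-node caching behavior. Member names are assumed for illustration."""

    DEFAULT = 1    # version the node and read/write the cache as usual
    RECOMPUTE = 2  # always execute the node, then store the new result
    DISABLE = 3    # behave as if caching were off for this node
    IGNORE = 4     # leave this node out of downstream versioning

    @classmethod
    def from_string(cls, string: str) -> "CachingBehavior":
        """Parse a case-insensitive name such as "recompute"."""
        try:
            return cls[string.strip().upper()]
        except KeyError as e:
            valid = ", ".join(member.name.lower() for member in cls)
            raise KeyError(
                f"{string!r} is not a valid caching behavior; expected one of: {valid}"
            ) from e
```

Parsing by member name keeps the accepted strings in sync with the enum, so adding a behavior does not require touching a separate lookup table.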

Other TODOs:

  • structured the Hamilton modules
  • standardized naming: result_store and metadata_store (closer to S3 lingo), fingerprinting, fingerprint
  • introspection mechanism (pre-run, and post-run)
  • introspection visualization
  • [~] logging for cache adapter; could be improved for storage
  • checkpointing feature using run_id. Passing a run_id string or latest to result_from should do the trick
  • [~] OOP or protocols for Store subclasses. It's currently a bit confusing that result_store and metadata_store both inherit from the same BaseStore but have different method signatures for get and set
  • [~] Works with materializers. It does work, but has rough edges:
    • for to.file or other destinations where the suffix doesn't match the ideal file extension.
    • Can't pass kwargs to materializers
    • Assumes the materializer's only required arg is path
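To illustrate the concern about `result_store` and `metadata_store` sharing one `BaseStore` despite different `get`/`set` signatures, here is a hedged sketch that gives each store its own protocol. All class names, parameter names, and the `lookup` helper are hypothetical and chosen for illustration; they are not the interfaces merged in this PR.

```python
from typing import Any, Dict, Optional, Protocol


class ResultStore(Protocol):
    """Maps a data version to the stored result (hypothetical interface)."""

    def set(self, data_version: str, result: Any) -> None: ...
    def get(self, data_version: str) -> Optional[Any]: ...


class MetadataStore(Protocol):
    """Maps a cache key to the data version it produced (hypothetical interface)."""

    def set(self, cache_key: str, data_version: str, run_id: str) -> None: ...
    def get(self, cache_key: str) -> Optional[str]: ...


class InMemoryResultStore:
    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def set(self, data_version: str, result: Any) -> None:
        self._results[data_version] = result

    def get(self, data_version: str) -> Optional[Any]:
        return self._results.get(data_version)


class InMemoryMetadataStore:
    def __init__(self) -> None:
        self._versions: Dict[str, str] = {}
        self._run_ids: Dict[str, str] = {}

    def set(self, cache_key: str, data_version: str, run_id: str) -> None:
        self._versions[cache_key] = data_version
        self._run_ids[cache_key] = run_id  # remember which run wrote the entry

    def get(self, cache_key: str) -> Optional[str]:
        return self._versions.get(cache_key)


def lookup(cache_key: str, metadata_store: MetadataStore, result_store: ResultStore) -> Optional[Any]:
    """Two-step read: cache key -> data version -> result."""
    data_version = metadata_store.get(cache_key)
    if data_version is None:
        return None
    return result_store.get(data_version)
```

With separate protocols, a type checker can flag a metadata store passed where a result store is expected, which a shared base class with diverging signatures cannot do.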

@zilto zilto added the enhancement New feature or request label Aug 23, 2024
tests/caching/cases.py Outdated Show resolved Hide resolved
docs/concepts/caching.rst Outdated Show resolved Hide resolved
docs/concepts/caching.rst Outdated Show resolved Hide resolved
docs/concepts/caching.rst Outdated Show resolved Hide resolved
docs/concepts/caching.rst Outdated Show resolved Hide resolved
@elijahbenizzy
Collaborator

High-level:

  1. We should have some user stories that map to common cases. The docs are good for explaining the very high level, but we want people to attach it to their workflow (maybe multiple notebooks?)
  2. In the docs it feels like there's some assumed knowledge about terms (fingerprinting, etc...). I think we're exposing a bit at the lower level here -- user stories should help.

A few more user stories can help me evaluate the use cases. Seems technically largely solid, although I would want instructions on extending (e.g. adding a global cache)

docs/concepts/caching.rst Outdated Show resolved Hide resolved
hamilton/driver.py Outdated Show resolved Hide resolved
hamilton/driver.py Outdated Show resolved Hide resolved
hamilton/driver.py Outdated Show resolved Hide resolved
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Lots of thoughts here, exciting stuff

@@ -1696,6 +1709,14 @@ def validate_materialization(
        all_nodes = nodes | user_nodes
        self.graph_executor.validate(list(all_nodes))

    @property
    def cache(self) -> lifecycle.SmartCacheAdapter:

If this is user-facing, add more comments (e.g. many people might not know that caching is implemented by an adapter, or why you might want it...)

hamilton/driver.py Outdated Show resolved Hide resolved
hamilton/driver.py Show resolved Hide resolved
hamilton/driver.py Show resolved Hide resolved
hamilton/experimental/h_cache.py Outdated Show resolved Hide resolved
hamilton/lifecycle/caching.py Outdated Show resolved Hide resolved
hamilton/lifecycle/caching.py Outdated Show resolved Hide resolved
hamilton/lifecycle/caching.py Outdated Show resolved Hide resolved
hamilton/lifecycle/caching.py Outdated Show resolved Hide resolved
hamilton/lifecycle/caching.py Outdated Show resolved Hide resolved
docs/how-tos/caching.rst Outdated Show resolved Hide resolved
docs/how-tos/caching.rst Outdated Show resolved Hide resolved
docs/how-tos/caching.rst Outdated Show resolved Hide resolved
docs/how-tos/caching.rst Outdated Show resolved Hide resolved
@@ -0,0 +1,52 @@
=======================
Caching logic

Brief overview on what this section is

hamilton/lifecycle/caching.py Outdated Show resolved Hide resolved
hamilton/plugins/h_diskcache.py Outdated Show resolved Hide resolved
hamilton/stores/base.py Outdated Show resolved Hide resolved
hamilton/stores/base.py Outdated Show resolved Hide resolved
hamilton/stores/file.py Outdated Show resolved Hide resolved
zilto added 2 commits September 25, 2024 17:30
added docstring to base64 encode

fingerprinting notebook section

added caching code and example

dev checkpoint

added Fingerprint class

first-class caching v0.1

fixed tests

updated module_6 test case

updated docstrings

documentation WIP added

refactor file structure

add direct access to cache via Driver

remove

Support Parallelizable/Collect; removed Fingerprint

Refactored the logic of the CacheAdapter to have
explicit operations. Many challenges arise from
`Parallel` nodes that return a sequence of elements,
when it is actually the data version of individual
elements that matters. Also, `Collect` can have difficulty
accessing upstream data versions if these were computed
in other threads/processes.

The `Fingerprint` construct was removed because it
obfuscated what information was relevant to pass
around. Its name is also less evocative than
"data version" and "code version".

made SQLMetadataStore threadsafe

added structured logging; move class to hamilton.lifecycle.caching

structured log; data saver; cache decorator

added from_string to CachingBehavior enum

added docs and mermaid graph support

fixed variable renaming issue

updated docs requirements; adapted typing to 3.8

reverted Sequence type to Sequence class for singledispatch

fixed typo

added deprecation warnings to other caching methods

fixed missing kwarg for recursive data versioning

output structured log to file

updated expand nodes handling; added ignore behavior

updated caching behaviors and admonitions

updated Sequence import for 3.8 support

fixed bugs: sentinel values, log printing, failed nodes

updated docs

updated docstring; fixed materialization with parallel

fixed 3.8 typing

guard against setting cache twice

registered separate function for versioning bytes

refactored adapter to use internal hook

replace HamiltonGraph by FunctionGraph

add _get_node_role() method

switched to internal hooks pre/post node

refactored to key on run_id; refactor result_store

refactored sqlitestore to hamilton.stores

fixed type annotations

fixed docs reference and docstrings

updated result_store tests

fixed materialization from result_store

added roadmap to docs

refactored to move to hamilton/caching

fingerprinting.set_max_depth() added

changed cache decorator to a class

updated all docstrings for SmartCacheAdapter

renamed context_key to cache_key

added deprecation warnings using logging

improved warning message

added TODO
@zilto zilto force-pushed the feat/first-class-caching branch from 2c244d4 to 998ca73 Compare September 25, 2024 22:48
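Several commits above concern data versioning: versioning the individual elements of a sequence, registering a separate function for bytes, using the `Sequence` class with `singledispatch`, and a maximum recursion depth. The sketch below shows how those pieces could fit together. It is an illustration under assumptions: `version_data`, `version_elements`, and `MAX_DEPTH` are hypothetical names, and hashing `repr()` is a simplistic fallback, not the PR's fingerprinting logic.

```python
import functools
import hashlib
from collections.abc import Sequence

MAX_DEPTH = 3  # hypothetical recursion limit for nested containers


def _digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@functools.singledispatch
def version_data(obj, depth: int = 0) -> str:
    """Fallback: hash the repr. Only stable for objects with a deterministic repr."""
    return _digest(repr(obj).encode("utf-8"))


@version_data.register(bytes)
def _version_bytes(obj, depth: int = 0) -> str:
    # bytes can be hashed directly, without iterating over them as a sequence
    return _digest(obj)


@version_data.register(str)
def _version_str(obj, depth: int = 0) -> str:
    return _digest(obj.encode("utf-8"))


@version_data.register(Sequence)  # the abstract class, so lists and tuples dispatch here
def _version_sequence(obj, depth: int = 0) -> str:
    if depth >= MAX_DEPTH:
        # stop recursing into deeply nested containers
        return _digest(repr(obj).encode("utf-8"))
    hasher = hashlib.sha256()
    for element in obj:
        hasher.update(version_data(element, depth + 1).encode("ascii"))
    return hasher.hexdigest()


def version_elements(results):
    """One data version per element, as needed when each element feeds its own branch."""
    return [version_data(element) for element in results]
```

Because `str` and `bytes` are themselves sequences, registering them explicitly keeps them from being hashed element by element; `singledispatch` picks the most specific registered type.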
@skrawcz
Collaborator

skrawcz commented Sep 30, 2024

Looking good! Does this work with adapters like ray, or the graceful error one?
If there are things that this doesn't work with or behavior would be weird -- we should have a section mentioning that.

Collaborator

@elijahbenizzy elijahbenizzy left a comment

Looking great! A few points on documentation. Caveats + testing:

  • mlflow adapter
  • hamilton UI (curious what this looks like)

But yeah, gotta get this out, merge tonight/tomorrow morning?

docs/concepts/caching.rst Outdated Show resolved Hide resolved
docs/concepts/caching.rst Show resolved Hide resolved
docs/concepts/caching.rst Show resolved Hide resolved
examples/caching/README.md Show resolved Hide resolved
hamilton/caching/adapter.py Show resolved Hide resolved
from hamilton.caching.stores.base import MetadataStore


class SQLiteMetadataStore(MetadataStore):

Comments on this -- it'll be something people want to look into for debugging
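As an illustration of the kind of documented, debuggable store the comment asks for, and of the earlier commit that made the SQL metadata store threadsafe, here is a hedged sketch of a SQLite-backed store that uses one connection per thread. The class name, table layout, and method signatures are assumptions for illustration; they do not reproduce `SQLiteMetadataStore` as merged.

```python
import sqlite3
import threading
from typing import Optional


class SQLiteMetadataStoreSketch:
    """Illustrative metadata store: cache_key -> data_version, persisted in SQLite.

    sqlite3 connections should not be shared across threads by default, so each
    thread lazily opens its own connection to the same database file.
    """

    def __init__(self, path: str) -> None:
        self._path = path
        self._local = threading.local()
        conn = self._connection()
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS cache_metadata ("
                "cache_key TEXT PRIMARY KEY, "
                "data_version TEXT NOT NULL, "
                "run_id TEXT NOT NULL)"
            )

    def _connection(self) -> sqlite3.Connection:
        """Return this thread's connection, creating it on first use."""
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = sqlite3.connect(self._path)
            self._local.conn = conn
        return conn

    def set(self, cache_key: str, data_version: str, run_id: str) -> None:
        conn = self._connection()
        with conn:
            conn.execute(
                "INSERT OR REPLACE INTO cache_metadata VALUES (?, ?, ?)",
                (cache_key, data_version, run_id),
            )

    def get(self, cache_key: str) -> Optional[str]:
        row = self._connection().execute(
            "SELECT data_version FROM cache_metadata WHERE cache_key = ?",
            (cache_key,),
        ).fetchone()
        return row[0] if row else None
```

Keeping the table human-readable means a plain `SELECT * FROM cache_metadata` in the `sqlite3` shell is enough to inspect which run produced which data version.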

import pathlib


def get_directory_size(directory: str) -> float:

Comments
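A documented version of this helper might look like the sketch below. Only the signature comes from the diff; the body, the choice of bytes as the unit, and the recursive traversal are assumptions.

```python
import pathlib


def get_directory_size(directory: str) -> float:
    """Return the total size of all regular files under `directory`.

    Subdirectories are traversed recursively. The unit is assumed to be
    bytes here; the original diff does not show which unit is returned.
    """
    total = 0
    for path in pathlib.Path(directory).rglob("*"):
        if path.is_file():
            total += path.stat().st_size
    return float(total)
```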

hamilton/driver.py Outdated Show resolved Hide resolved
@@ -1905,6 +1931,11 @@ def with_adapters(self, *adapters: lifecycle_base.LifecycleAdapter) -> "Builder"
        :param adapter: Adapter to use.
        :return: self
        """
        if any(isinstance(adapter, SmartCacheAdapter) for adapter in adapters):

nit, but we should store cache then add it to the adapters later I think?

Not a blocker
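The suggestion to store the cache separately and add it to the adapters later, combined with the commit that guards against setting the cache twice, could be sketched as follows. `BuilderSketch`, `CacheAdapterSketch`, and `build_adapters` are hypothetical stand-ins, not Hamilton's `Builder` or `SmartCacheAdapter`, and raising `ValueError` is an assumed choice.

```python
from typing import Any, List, Optional


class CacheAdapterSketch:
    """Stand-in for the caching adapter (hypothetical)."""


class BuilderSketch:
    def __init__(self) -> None:
        self._adapters: List[Any] = []
        self._cache: Optional[CacheAdapterSketch] = None

    def with_cache(self, cache: Optional[CacheAdapterSketch] = None) -> "BuilderSketch":
        # Guard: a second cache would make it ambiguous which one is in effect.
        if self._cache is not None:
            raise ValueError("A cache was already set on this builder.")
        self._cache = cache if cache is not None else CacheAdapterSketch()
        return self

    def with_adapters(self, *adapters: Any) -> "BuilderSketch":
        for adapter in adapters:
            if isinstance(adapter, CacheAdapterSketch):
                # Route cache adapters through the same guard as with_cache().
                self.with_cache(adapter)
            else:
                self._adapters.append(adapter)
        return self

    def build_adapters(self) -> List[Any]:
        """Assemble the final adapter list, appending the cache only at build time."""
        adapters = list(self._adapters)
        if self._cache is not None:
            adapters.append(self._cache)
        return adapters
```

Holding the cache in its own attribute means the uniqueness check lives in one place, regardless of whether the cache arrives via `with_cache()` or `with_adapters()`.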

hamilton/function_modifiers/metadata.py Show resolved Hide resolved
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Looking good! Will glance over when we sync tomorrow, then approve + merge.

# solution for `@dataloader` and `from_`
if behaviors.get(main_node, None) is not None:
    behaviors[node.name] = behaviors[main_node]
# this hacky section is required to support @load_from and provide

Hmm, not sure I follow what's happening here. Let's add some more docs.

Specifically, document desired behavior here/how the code maps to it

data_version = self._version_data(node_name=node_name, run_id=run_id, result=result)

# nodes collected in `._data_savers` return a dictionary of metadata

I'm not convinced we need to special-case these... But for now, let's call this a bit experimental (data savers/loaders) -- it's a very odd case that we don't want to dwell on more.

Collaborator

@elijahbenizzy elijahbenizzy left a comment

Well done!

@zilto zilto merged commit a6b3898 into main Oct 3, 2024
24 checks passed
@zilto zilto deleted the feat/first-class-caching branch October 3, 2024 20:30
3 participants