
Better support for caching / checkpointing development workflow - umbrella issue #940

Closed
skrawcz opened this issue Jun 5, 2024 · 7 comments
Labels
core-work, enhancement, question

Comments

@skrawcz (Collaborator) commented Jun 5, 2024

Is your feature request related to a problem? Please describe.
We need a simpler caching & checkpointing story that gives users full visibility into what's going on.

Describe the solution you'd like

  1. Checkpointing -- i.e. cache outputs and restart from the latest point.
  2. Intelligent Caching -- i.e. cache nodes and only rerun things if code or data has changed.

These should come with the ability to:

  1. visualize what is going on when using them.
  2. work in a notebook / CLI / library context.
  3. extend how to hash data & where to store it.

Prior art

  1. You do it yourself outside of Hamilton and use the overrides argument in .execute()/.materialize(..., overrides={...}) to inject pre-computed values into the graph. That is, you run your code, save the outputs you want, then load and inject them using overrides= (see the first sketch after this list).
  2. You use the data savers & data loaders (i.e. materializers). This is similar to the above, but instead you use the materializers to save, then load and inject the data (see the second sketch after this list).
  3. You use the CachingGraphAdapter, which requires you to tag functions to cache along with the serialization format.
  4. You use the DiskCacheAdapter, which uses the diskcache library to store the results on disk.
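
A minimal sketch of option 1 (overrides), assuming a hypothetical module my_dataflow that defines nodes cleaned_data -> report:

import pickle

from hamilton import driver

import my_dataflow  # hypothetical module defining cleaned_data and report

dr = driver.Builder().with_modules(my_dataflow).build()

# First run: compute everything and persist the intermediate result yourself.
results = dr.execute(["cleaned_data", "report"])
with open("cleaned_data.pkl", "wb") as f:
    pickle.dump(results["cleaned_data"], f)

# Later run: load the saved value and inject it. Hamilton skips cleaned_data
# and everything upstream of it.
with open("cleaned_data.pkl", "rb") as f:
    cleaned = pickle.load(f)
results = dr.execute(["report"], overrides={"cleaned_data": cleaned})

And a sketch of option 2 (materializers) against the same hypothetical module; exact saver/loader signatures may vary by Hamilton version:

from hamilton import driver
from hamilton.io.materialization import from_, to

import my_dataflow  # hypothetical

dr = driver.Builder().with_modules(my_dataflow).build()

# Save: write cleaned_data to disk with a pickle saver.
dr.materialize(
    to.pickle(id="cleaned_data_save", dependencies=["cleaned_data"], path="./cleaned_data.pkl"),
)

# Load: on a later run, feed cleaned_data from disk instead of recomputing it.
dr.materialize(
    from_.pickle(target="cleaned_data", path="./cleaned_data.pkl"),
    additional_vars=["report"],
)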

Could use https://books.ropensci.org/targets/walkthrough.html#change-code as inspiration.

Additional context
Slack threads:

Next steps:

TODO: break the tasks in this issue into smaller, manageable chunks.

@skrawcz added the enhancement and question labels Jun 5, 2024
@jmbuhr (Contributor) commented Jun 5, 2024

I took one of the targets examples and transferred it one-to-one to Hamilton to see how the concepts compare. Both workflows are implemented in modules and make use of helper functions from a separate module. Both are then started interactively from Quarto documents, and their results and graphs are visualized in the rendered output of those notebooks: http://jmbuhr.de/targets-hamilton-comparison/ (source code: https://github.com/jmbuhr/targets-hamilton-comparison)

Again, this is for exploration of possibilities, not to impose paradigms on you :)

In this first pass I noticed two things I was missing in Hamilton, compared to targets, when it comes to caching:

  • changing a function that is used by a node, but is not itself a node, should also invalidate the cache of that node.
  • loading the cached result of any node independently of the dr.execute run, as with tar_load(<name of node>) (https://docs.ropensci.org/targets/reference/tar_load.html), is super helpful for interactively picking up where you left off in a workflow and working on different parts of it.

@jmbuhr (Contributor) commented Jun 5, 2024

For inspiration, the developer documentation of how targets does caching might come in handy: https://books.ropensci.org/targets-design/data.html#skipping-up-to-date-targets

@skrawcz (Collaborator, Author) commented Jul 18, 2024

Update - we've got a candidate API:

c = CacheStore()  # this could house various strategies, from basic checkpointing to more sophisticated fingerprinting
dr = driver.Builder()...with_cache(c, **kwargs).build()

# first run -- nothing cached
dr.execute([output1], inputs=A)

# change some code -- any code: upstream or downstream of what was run before
# rebuild driver
dr = driver.Builder()...with_cache(c, **kwargs).build()
# this should recompute as needed -- and recompute downstream as needed
dr.execute([output2], inputs=A)  

# no-op if run again
dr.execute([output2], inputs=A)  

# should only recompute what inputs impact -- going downstream as needed.
dr.execute([output2], inputs=A')

Then there's some nuance around:

  • annotating something as do_not_cache, and customizing which inputs to a function matter for caching (a hypothetical sketch follows this list).
  • customizing serialization / hashing
  • displaying/introspecting what is available
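
A hypothetical sketch of what the do_not_cache annotation could look like, built on Hamilton's existing tag() decorator; the actual decorator name and semantics were not settled at this point in the thread:

from hamilton.function_modifiers import tag

@tag(cache="do_not_cache")  # hypothetical convention: never serve this node from cache
def exchange_rates(currency: str) -> dict:
    """A non-deterministic node (e.g. an external API call) we always want recomputed."""
    ...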

@skrawcz added the core-work label Aug 11, 2024
@skrawcz (Collaborator, Author) commented Sep 4, 2024

Updates:

.with_cache() will be solely about the fingerprinting-based caching strategy.
.with_checkpointing() will be solely about checkpointing.
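
In builder terms, the split would look roughly like this (a sketch of the proposal above, not a final API; my_dataflow is a hypothetical module):

# fingerprint-based caching: rerun only when code or data has changed
dr = driver.Builder().with_modules(my_dataflow).with_cache().build()

# checkpointing: cache outputs and restart from the latest point
dr = driver.Builder().with_modules(my_dataflow).with_checkpointing().build()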

@zilto (Collaborator) commented Oct 3, 2024

Hi everyone! We just merged #1104 which introduces caching as a core Hamilton feature.

We invite you to try it via Google Colab and review the docs!
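
For reference, a minimal sketch of the merged API, based on the docs (my_dataflow and its nodes are hypothetical; see the linked docs for authoritative usage):

from hamilton import driver

import my_dataflow  # hypothetical module defining a processed node

dr = (
    driver.Builder()
    .with_modules(my_dataflow)
    .with_cache()  # enable caching with the default metadata and result stores
    .build()
)

# First run: computes everything and stores results plus fingerprints.
dr.execute(["processed"], inputs={"path": "data.csv"})

# Second run with unchanged code and inputs: results come from the cache.
dr.execute(["processed"], inputs={"path": "data.csv"})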

@zilto closed this as completed Oct 3, 2024
@skrawcz (Collaborator, Author) commented Oct 3, 2024

CC @jmbuhr

@jmbuhr (Contributor) commented Oct 13, 2024

Very cool, excited to try it!
