Caching Graph Adapter #195

elshize · 2023-06-20T17:31:55Z

Implementation of a caching adapter.

Changes

Adds an experimental caching adapter that caches nodes tagged with "cache".

How I tested this

Unit tests.

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

elshize · 2023-06-20T17:34:07Z

@skrawcz hey, I didn't really have the time to fill out the PR info above, and the code is essentially copy-pasted from what I implemented in my use case, but I figured you might want to give this a look sooner than later, and maybe provide any comments.

As I said, there might be some hacky stuff there, especially the way I'm testing, but it's a good start I think.

skrawcz · 2023-06-20T19:27:10Z

Thanks @elshize yep this is great. I'll queue it up for review for later this week/early next week.

skrawcz

Yeah this looks good. I like how simple it is. It can always be made fancier, but I think this probably serves as a good 80/20.

Otherwise to get this out, I think we just need some more documentation on usage, e.g. outlining how one might use it for development, as well as production. Listing any caveats.

hamilton/experimental/h_cache.py

tests/test_caching.py

elshize · 2023-06-28T00:45:57Z

Thanks, I'll try to get the changes ready soon.

elshize · 2023-06-29T23:22:31Z

Made some changes, still some stuff left, including more docs (with examples), and debug logs.

One thing I noticed when I ran tests on my other desktop is that I'm getting errors due to not having either feather or parquet support, despite installing all the deps.

I see that you do use both in other parts of the code, like in pandas_extensions.py so I'm assuming this is ok? And it's not in requirements because it's an optional dep?

skrawcz · 2023-06-30T01:09:10Z

Made some changes, still some stuff left, including more docs (with examples), and debug logs.

Thanks looking good thus far. Made a suggestion for clarity.

One thing I noticed when I ran tests on my other desktop is that I'm getting errors due to not having either feather or parquet support, despite installing all the deps.

I see that you do use both in other parts of the code, like in pandas_extensions.py so I'm assuming this is ok? And it's not in requirements because it's an optional dep?

Ah good point! Hmmm. We could do some "try/catch-ing" to determine what exists on import, and only include the things that are?

But... we probably want to make this adapter generic/extensible in a way. e.g. make it pluggable for polars/other object types.

So as to support someone doing this:

@tag(cache="json")
def my_df(...) -> pd.DataFrame:
    return df

@tag(cache="json")
def my_dict(...) -> dict:
    return {"a":"b"}

@tag(cache="json")
def my_polars_df(...) -> pl.DataFrame:
    return df

So we could build a registry of (format, type) -> serde function, but maybe that's outside the scope of this PR. Right now the design is simple. Format implies datatype and serialization -- did you think about that aspect at all?

So if anything maybe we just want to add isinstance checks to the writer functions to help surface a good error if someone screws this up. Thoughts?

Happy to jump on a call if that'll be easier to align on what we want to finalize here.

hamilton/experimental/h_cache.py

elshize · 2023-07-01T14:17:17Z

But... we probably want to make this adapter generic/extensible in a way. e.g. make it pluggable for polars/other object types.

Yeah, I see what you mean. I haven't really thought of it because I didn't need it when implementing it, but this is definitely going to come up.

To not make it too complicated at the moment, I would approach it this way: We can have default serializers for some number of output formats, and it would check what is passed to it and do the right thing -- for whatever number of types we want to support out of the box. If someone needs to support an additonal type, then they can pass an extended writer:

def extended_writer(data, path):
    if isinstance(data, SomeType):
        write_some_type(data)
    else:
        hamilton_writer(data, path)

It's probably not the most elegant way to approach it, but it's very simple, most people would be covered by the default types supported, and those who are not would have a way of extending it.

Then, we could think whether a more complex approach is needed and what it would be, and if there is one, we implement it later. Thoughts?

skrawcz · 2023-07-03T21:03:18Z

okay @elshize I came up with #211 to extend things. Thoughts?

elshize · 2023-07-07T23:28:54Z

I checked it out and commented on the other PR. This is better than my solution I spoke about above :)

So is there anything else missing there other than some examples in the module doc? I should be able to have some time over the weekend to write some down; how would you want me to add it? Should I use the other branch, branch off and create a new PR, or edit here, and you can rebase?

skrawcz · 2023-07-10T04:46:46Z

@elshize sorry, I missed this (the weekend is now over...). Yeah let's just get the module doc in order I think.

In terms of adding it, do whatever is easiest for you. The other branch is based off of this one. You could pull in all my commits to this branch, and then add the new one(s)?

We might want to squash some commits to keep history tidy but let's worry about that last.

Thanks!

elshize · 2023-07-11T02:21:02Z

You could pull in all my commits to this branch, and then add the new one(s)?

Yeah, I could do that.

We might want to squash some commits to keep history tidy but let's worry about that last.

Do you imagine there would be more than one commit for this? I was thinking of squashing all of it together, seems like a small enough and isolated change.

skrawcz · 2023-07-11T05:45:39Z

You could pull in all my commits to this branch, and then add the new one(s)?

Yeah, I could do that.
👍

We might want to squash some commits to keep history tidy but let's worry about that last.

Do you imagine there would be more than one commit for this? I was thinking of squashing all of it together, seems like a small enough and isolated change.

I hadn't looked hard to be honest. But a single squashed commit would work too.

elshize · 2023-07-19T23:03:43Z

@skrawcz Btw, there are more modules I had to install in order to fully run tests:

pandera pyspark ray dask pyarrow distributed

And then still some spark and ray tests failed for me locally. Not sure why, could be because versions of the ones above don't match what is required.

skrawcz · 2023-07-20T00:25:38Z

@skrawcz Btw, there are more modules I had to install in order to fully run tests:
pandera pyspark ray dask pyarrow distributed
And then still some spark and ray tests failed for me locally. Not sure why, could be because versions of the ones above don't match what is required.

Ah - what did you assume you needed, so I can fix up any docs. My assumption is (1) I should probably change https://github.com/DAGWorks-Inc/hamilton/blob/main/.ci/test.sh to instead refer to requirements-test.txt from within the graph adapter directories. (2) make a note in the general requirements-test.txt that it only applies to unit test, not the graph adapter integration test.

Note: Since we use circleci, you can also download the circleci CLI and run a job locally and it'll replicate the behavior of that test with docker -- but that is a little slow to iterate with.

Mind posting which tests? This is annoying if they're flakey.

elshize · 2023-07-20T00:52:15Z

here are logs: pytest.log notice that besides graph adapter tests, there is also one telemetry test that failed:

I didn't really need it for the Pr, I just wanted to run all tests :) so really not a problem for me. I just didn't know exactly what you expected to be in requirements, so wanted to mention it.

skrawcz · 2023-07-20T01:18:08Z

here are logs: pytest.log notice that besides graph adapter tests, there is also one telemetry test that failed:

I didn't really need it for the Pr, I just wanted to run all tests :) so really not a problem for me. I just didn't know exactly what you expected to be in requirements, so wanted to mention it.

Ah yep. The telemetry test fails locally; something I haven't dug into -- but it's not flakey on circleci proper.

elshize · 2023-07-21T00:01:54Z

@skrawcz ok, I think this is ready for you to have another look at. One more thing that we maybe want to do is to include info about it somewhere in the docs/. Let me know if and where you would want that.

I haven't squashed commits yet, but I think we should just squash to one single commit, it's a simple enough change, I don't see a need for any more than that, but let me know if you feel different.

skrawcz · 2023-07-21T04:52:27Z

@elshize for docs we'd add a new file to https://github.com/DAGWorks-Inc/hamilton/tree/main/docs/reference/graph-adapters (and add it to index.rst) and then perhaps a doc in https://github.com/DAGWorks-Inc/hamilton/tree/main/docs/how-tos explaining how you'd make use of it? We can always put the latter in a separate PR, but the first one would then at least add the adapter to the reference section.

This packs it all into h_cache.py for now. It uses singledispatch because that's an easy way to register functions for different object types. In terms of deserialization, we hack things, by including an empty expected type to get singldispatch to read from the right function, since singledispatch uses the first argument to resolve which function, and it only works in instantiated types, and not on a Type object. Longer term we should merge this with the datasavers -- i.e. the data savers should in general have an easy way to access SERDE functions, but without the boilerplate overhead... CAVEATS to make this truly generic is: 1. How to register these across different object types. Should these be defined in the extension classes? 2. KWARGS: we need to handle how to pass these through nicely. For the adapter, do we use tags? or allow some other means to inject kwargs? Ditto for getting this working with the data savers.

So that we can test the caching serialization.

Primitive types aren't subscriptable until 3.9

examples/caching_nodes/run.py

examples/caching_nodes/README.md

Co-authored-by: Stefan Krawczyk <stefan@dagworks.io>

elshize · 2023-07-22T13:05:56Z

@skrawcz ok, this is ready for you to review.

So that people can run it without problems. By default pyarrow is not installed, so the example breaks. Also setting sf-hamilton minimum based on future release.

skrawcz

Looks good. Just needed to adjust the requirements.txt for the example. Thanks for all the work!

For work on adding the caching adapter #195.

elshize · 2023-07-23T21:08:47Z

🔥

For work on adding the caching adapter #195.

skrawcz self-assigned this Jun 20, 2023

skrawcz reviewed Jun 27, 2023

View reviewed changes

skrawcz reviewed Jun 30, 2023

View reviewed changes

hamilton/experimental/h_cache.py Outdated Show resolved Hide resolved

skrawcz mentioned this pull request Jul 3, 2023

Example to extend #195 #211

Closed

7 tasks

elshize changed the title ~~Caching adapter~~ Caching Graph Adapter Jul 20, 2023

Michal Siedlaczek and others added 8 commits July 21, 2023 09:50

Caching adapter

97be1ff

Reset computed node state when collecting results

b75aa14

Import module instead of functions directly

cdb2c9f

Change doc format

cb9a15a

Document execute_node

da67aa5

Clarify docstring

00691f5

Adds pyarrow to test dependencies

decd6f5

So that we can test the caching serialization.

skrawcz and others added 7 commits July 21, 2023 09:50

Fixing h_cache annotations for 3.7 & 3.8

5bc3bf9

Primitive types aren't subscriptable until 3.9

Adds more documentation to explain h_cache set up

f93b43d

Correct writers/readers type hints

e91f8fb

Add examples in docs

ae4bdf6

Logging messages

26f376c

Reformat

ab3c277

Add reference entry to CachingGraphAdapter

12edca0

elshize force-pushed the caching-adapter branch from cd3e58d to 12edca0 Compare July 21, 2023 13:57

Add example project

66c9d76

skrawcz reviewed Jul 21, 2023

View reviewed changes

examples/caching_nodes/run.py Outdated Show resolved Hide resolved

skrawcz reviewed Jul 21, 2023

View reviewed changes

examples/caching_nodes/README.md Show resolved Hide resolved

elshize and others added 3 commits July 21, 2023 21:06

Update examples/caching_nodes/README.md

8296029

Co-authored-by: Stefan Krawczyk <stefan@dagworks.io>

Fix caching adapter example

30f4257

Finish documentation

fd2d58c

elshize requested a review from skrawcz July 22, 2023 13:05

Adjust requirements.txt for caching example

b2e6b8c

So that people can run it without problems. By default pyarrow is not installed, so the example breaks. Also setting sf-hamilton minimum based on future release.

skrawcz approved these changes Jul 23, 2023

View reviewed changes

skrawcz merged commit 985fb92 into DAGWorks-Inc:main Jul 23, 2023

skrawcz added a commit that referenced this pull request Jul 23, 2023

Adds Michał Siedlaczek (@elshize) to contributors

3184717

For work on adding the caching adapter #195.

skrawcz mentioned this pull request Jul 23, 2023

Adds Michał Siedlaczek (@elshize) to contributors #230

Merged

7 tasks

skrawcz added a commit that referenced this pull request Jul 23, 2023

Adds Michał Siedlaczek (@elshize) to contributors

36a7e22

For work on adding the caching adapter #195.

elshize deleted the caching-adapter branch August 3, 2023 12:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Caching Graph Adapter #195

Caching Graph Adapter #195

elshize commented Jun 20, 2023 •

edited

Loading

elshize commented Jun 20, 2023

skrawcz commented Jun 20, 2023

skrawcz left a comment

elshize commented Jun 28, 2023

elshize commented Jun 29, 2023

skrawcz commented Jun 30, 2023 •

edited

Loading

elshize commented Jul 1, 2023 •

edited

Loading

skrawcz commented Jul 3, 2023

elshize commented Jul 7, 2023

skrawcz commented Jul 10, 2023

elshize commented Jul 11, 2023

skrawcz commented Jul 11, 2023

elshize commented Jul 19, 2023 •

edited

Loading

skrawcz commented Jul 20, 2023

elshize commented Jul 20, 2023

skrawcz commented Jul 20, 2023

elshize commented Jul 21, 2023

skrawcz commented Jul 21, 2023

elshize commented Jul 22, 2023

skrawcz left a comment

elshize commented Jul 23, 2023

Caching Graph Adapter #195

Caching Graph Adapter #195

Conversation

elshize commented Jun 20, 2023 • edited Loading

Changes

How I tested this

Checklist

elshize commented Jun 20, 2023

skrawcz commented Jun 20, 2023

skrawcz left a comment

Choose a reason for hiding this comment

elshize commented Jun 28, 2023

elshize commented Jun 29, 2023

skrawcz commented Jun 30, 2023 • edited Loading

elshize commented Jul 1, 2023 • edited Loading

skrawcz commented Jul 3, 2023

elshize commented Jul 7, 2023

skrawcz commented Jul 10, 2023

elshize commented Jul 11, 2023

skrawcz commented Jul 11, 2023

elshize commented Jul 19, 2023 • edited Loading

skrawcz commented Jul 20, 2023

elshize commented Jul 20, 2023

skrawcz commented Jul 20, 2023

elshize commented Jul 21, 2023

skrawcz commented Jul 21, 2023

elshize commented Jul 22, 2023

skrawcz left a comment

Choose a reason for hiding this comment

elshize commented Jul 23, 2023

elshize commented Jun 20, 2023 •

edited

Loading

skrawcz commented Jun 30, 2023 •

edited

Loading

elshize commented Jul 1, 2023 •

edited

Loading

elshize commented Jul 19, 2023 •

edited

Loading