Caching Graph Adapter #195
Conversation
@skrawcz hey, I didn't really have the time to fill out the PR info above, and the code is essentially copy-pasted from what I implemented in my use case, but I figured you might want to give this a look sooner rather than later, and maybe provide some comments. As I said, there might be some hacky stuff there, especially the way I'm testing, but I think it's a good start.
Thanks @elshize, yep this is great. I'll queue it up for review for later this week/early next week.
Yeah this looks good. I like how simple it is. It can always be made fancier, but I think this probably serves as a good 80/20.
Otherwise, to get this out, I think we just need some more documentation on usage, e.g. outlining how one might use it for development as well as production, and listing any caveats.
Thanks, I'll try to get the changes ready soon.
Made some changes, still some stuff left, including more docs (with examples), and debug logs. One thing I noticed when I ran tests on my other desktop is that I'm getting errors due to not having either feather or parquet support, despite installing all the deps. I see that you do use both in other parts of the code, like in
Thanks, looking good thus far. Made a suggestion for clarity.
Ah good point! Hmmm. We could do some "try/catch-ing" to determine what exists on import, and only include the things that are available? But... we probably want to make this adapter generic/extensible in a way, e.g. make it pluggable for polars/other object types, so as to support someone doing this:

```python
@tag(cache="json")
def my_df(...) -> pd.DataFrame:
    return df


@tag(cache="json")
def my_dict(...) -> dict:
    return {"a": "b"}


@tag(cache="json")
def my_polars_df(...) -> pl.DataFrame:
    return df
```

So we could build a registry of (format, type) -> serde function, but maybe that's outside the scope of this PR. Right now the design is simple: format implies datatype and serialization -- did you think about that aspect at all? So if anything maybe we just want to add

Happy to jump on a call if that'll be easier to align on what we want to finalize here.
Yeah, I see what you mean. I haven't really thought of it because I didn't need it when implementing this, but it's definitely going to come up. To not make it too complicated at the moment, I would approach it this way: we can have default serializers for some number of output formats, and it would check what is passed to it and do the right thing -- for whatever number of types we want to support out of the box. If someone needs to support an additional type, then they can pass an extended writer:

```python
def extended_writer(data, path):
    if isinstance(data, SomeType):
        write_some_type(data)
    else:
        hamilton_writer(data, path)
```

It's probably not the most elegant way to approach it, but it's very simple: most people would be covered by the default types supported, and those who are not would have a way of extending it. Then we could think about whether a more complex approach is needed and what it would be, and if there is one, we implement it later. Thoughts?
I checked it out and commented on the other PR. This is better than the solution I spoke about above :) So is there anything else missing there other than some examples in the module doc? I should be able to have some time over the weekend to write some down; how would you want me to add it? Should I use the other branch, branch off and create a new PR, or edit here so that you can rebase?
@elshize sorry, I missed this (the weekend is now over...). Yeah, let's just get the module doc in order, I think. In terms of adding it, do whatever is easiest for you. The other branch is based off of this one. You could pull in all my commits to this branch and then add the new one(s)? We might want to squash some commits to keep the history tidy, but let's worry about that last. Thanks!
Yeah, I could do that.
Do you imagine there would be more than one commit for this? I was thinking of squashing all of it together; it seems like a small enough and isolated change.
I hadn't looked hard to be honest. But a single squashed commit would work too.
@skrawcz Btw, there are more modules I had to install in order to fully run tests:
And then still some spark and ray tests failed for me locally. Not sure why; it could be because the versions of the ones above don't match what is required.
Ah - what did you assume you needed, so I can fix up any docs? My assumption is (1) I should probably change https://github.com/DAGWorks-Inc/hamilton/blob/main/.ci/test.sh to instead refer to

Note: since we use circleci, you can also download the circleci CLI and run a job locally, and it'll replicate the behavior of that test with docker -- but that is a little slow to iterate with.

Mind posting which tests failed? This is annoying if they're flaky.
Here are the logs: pytest.log. Notice that besides the graph adapter tests, there is also one telemetry test that failed:

I didn't really need it for the PR, I just wanted to run all the tests :) so it's really not a problem for me. I just didn't know exactly what you expected to be in requirements, so I wanted to mention it.
Ah yep. The telemetry test fails locally; it's something I haven't dug into -- but it's not flaky on circleci proper.
@skrawcz ok, I think this is ready for you to have another look at. One more thing that we maybe want to do is to include info about it somewhere in the

I haven't squashed commits yet, but I think we should just squash to one single commit; it's a simple enough change, and I don't see a need for any more than that. Let me know if you feel differently.
@elshize for docs we'd add a new file to https://github.com/DAGWorks-Inc/hamilton/tree/main/docs/reference/graph-adapters (and add it to index.rst) and then perhaps a doc in https://github.com/DAGWorks-Inc/hamilton/tree/main/docs/how-tos explaining how you'd make use of it? We can always put the latter in a separate PR, but the first one would then at least add the adapter to the reference section.
This packs it all into h_cache.py for now. It uses singledispatch because that's an easy way to register functions for different object types. In terms of deserialization, we hack things by including an empty expected type to get singledispatch to read from the right function, since singledispatch uses the first argument to resolve which function to call, and it only works on instantiated types, not on a Type object. Longer term we should merge this with the data savers -- i.e. the data savers should in general have an easy way to access SERDE functions, but without the boilerplate overhead...

CAVEATS to make this truly generic:

1. How to register these across different object types. Should these be defined in the extension classes?
2. KWARGS: we need to handle how to pass these through nicely. For the adapter, do we use tags, or allow some other means to inject kwargs? Ditto for getting this working with the data savers.
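To make that deserialization trick concrete, here is a hedged, self-contained sketch of the singledispatch approach. It mirrors the idea described above, not the exact contents of h_cache.py, and the function names are placeholders:

```python
# Sketch of singledispatch-based serde; mirrors the idea above, not h_cache.py itself.
import json
import os
import tempfile
from functools import singledispatch

import pandas as pd


@singledispatch
def write(data, filepath: str) -> None:
    raise NotImplementedError(f"No writer registered for type {type(data)}")


@write.register(dict)
def _write_json(data: dict, filepath: str) -> None:
    with open(filepath, "w") as f:
        json.dump(data, f)


@write.register(pd.DataFrame)
def _write_parquet(data: pd.DataFrame, filepath: str) -> None:
    data.to_parquet(filepath)  # needs pyarrow or fastparquet installed


@singledispatch
def read(expected_type, filepath: str):
    raise NotImplementedError(f"No reader registered for type {type(expected_type)}")


@read.register(dict)
def _read_json(expected_type: dict, filepath: str) -> dict:
    with open(filepath, "r") as f:
        return json.load(f)


# The "hack": singledispatch resolves on the first argument's *instance* type,
# not on a Type object, so we pass an empty instance of the expected type
# (here an empty dict) to route deserialization to the right reader.
path = os.path.join(tempfile.gettempdir(), "my_dict.json")
write({"a": "b"}, path)
assert read(dict(), path) == {"a": "b"}
```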
So that we can test the caching serialization.
Primitive types aren't subscriptable until 3.9
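For context on that commit, a tiny illustration of the pre-3.9 limitation (PEP 585); the `CACHE_INDEX` name is just a placeholder for this sketch:

```python
# Built-in generics like dict[...] are only subscriptable from Python 3.9 on,
# so compatibility with older Pythons means using the typing module's aliases.
from typing import Dict, List

CACHE_INDEX: Dict[str, List[str]] = {}   # works on Python 3.7+
# CACHE_INDEX: dict[str, list[str]] = {}  # TypeError at import time on Python < 3.9
```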
Co-authored-by: Stefan Krawczyk <stefan@dagworks.io>
@skrawcz ok, this is ready for you to review.
So that people can run it without problems. By default pyarrow is not installed, so the example breaks. Also setting sf-hamilton minimum based on future release.
Looks good. Just needed to adjust the requirements.txt for the example. Thanks for all the work!
For work on adding the caching adapter #195.
🔥
For work on adding the caching adapter #195.
Implementation of a caching adapter.
Changes
Adds an experimental caching adapter that caches nodes tagged with "cache".
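As a hedged illustration of what using it might look like: the module path, class name, and constructor arguments below are assumptions about where the adapter lands, not necessarily the final API in this PR.

```python
# Hypothetical usage sketch -- CachingGraphAdapter's import path and constructor
# arguments are assumptions, not confirmed by this PR description.
import __main__ as my_functions  # in a real project, pass the module holding your functions

from hamilton import driver
from hamilton.function_modifiers import tag


@tag(cache="json")
def my_dict() -> dict:
    # Computed once, written to the cache directory, then read back on later runs.
    return {"a": "b"}


if __name__ == "__main__":
    # Assumed location/name for the adapter added in this PR.
    from hamilton.experimental.h_cache import CachingGraphAdapter

    adapter = CachingGraphAdapter("/tmp/hamilton_cache")  # directory for cached node outputs
    dr = driver.Driver({}, my_functions, adapter=adapter)
    print(dr.execute(["my_dict"]))
```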
How I tested this
Unit tests.
Checklist