Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching Graph Adapter #195

Merged
merged 20 commits into from
Jul 23, 2023
Merged

Caching Graph Adapter #195

merged 20 commits into from
Jul 23, 2023

Conversation

elshize
Copy link
Contributor

@elshize elshize commented Jun 20, 2023

Implementation of a caching adapter.

Changes

Adds an experimental caching adapter that caches nodes tagged with "cache".

How I tested this

Unit tests.

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@elshize
Copy link
Contributor Author

elshize commented Jun 20, 2023

@skrawcz hey, I didn't really have the time to fill out the PR info above, and the code is essentially copy-pasted from what I implemented in my use case, but I figured you might want to give this a look sooner than later, and maybe provide any comments.

As I said, there might be some hacky stuff there, especially the way I'm testing, but it's a good start I think.

@skrawcz
Copy link
Collaborator

skrawcz commented Jun 20, 2023

Thanks @elshize yep this is great. I'll queue it up for review for later this week/early next week.

@skrawcz skrawcz self-assigned this Jun 20, 2023
Copy link
Collaborator

@skrawcz skrawcz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this looks good. I like how simple it is. It can always be made fancier, but I think this probably serves as a good 80/20.

Otherwise to get this out, I think we just need some more documentation on usage, e.g. outlining how one might use it for development, as well as production. Listing any caveats.

hamilton/experimental/h_cache.py Outdated Show resolved Hide resolved
hamilton/experimental/h_cache.py Outdated Show resolved Hide resolved
hamilton/experimental/h_cache.py Outdated Show resolved Hide resolved
hamilton/experimental/h_cache.py Outdated Show resolved Hide resolved
tests/test_caching.py Outdated Show resolved Hide resolved
tests/test_caching.py Show resolved Hide resolved
tests/test_caching.py Outdated Show resolved Hide resolved
@elshize
Copy link
Contributor Author

elshize commented Jun 28, 2023

Thanks, I'll try to get the changes ready soon.

@elshize
Copy link
Contributor Author

elshize commented Jun 29, 2023

Made some changes, still some stuff left, including more docs (with examples), and debug logs.

One thing I noticed when I ran tests on my other desktop is that I'm getting errors due to not having either feather or parquet support, despite installing all the deps.

I see that you do use both in other parts of the code, like in pandas_extensions.py so I'm assuming this is ok? And it's not in requirements because it's an optional dep?

@skrawcz
Copy link
Collaborator

skrawcz commented Jun 30, 2023

Made some changes, still some stuff left, including more docs (with examples), and debug logs.

Thanks looking good thus far. Made a suggestion for clarity.

One thing I noticed when I ran tests on my other desktop is that I'm getting errors due to not having either feather or parquet support, despite installing all the deps.

I see that you do use both in other parts of the code, like in pandas_extensions.py so I'm assuming this is ok? And it's not in requirements because it's an optional dep?

Ah good point! Hmmm. We could do some "try/catch-ing" to determine what exists on import, and only include the things that are?

But... we probably want to make this adapter generic/extensible in a way. e.g. make it pluggable for polars/other object types.

So as to support someone doing this:

@tag(cache="json")
def my_df(...) -> pd.DataFrame:
    return df

@tag(cache="json")
def my_dict(...) -> dict:
    return {"a":"b"}

@tag(cache="json")
def my_polars_df(...) -> pl.DataFrame:
    return df

So we could build a registry of (format, type) -> serde function, but maybe that's outside the scope of this PR. Right now the design is simple. Format implies datatype and serialization -- did you think about that aspect at all?

So if anything maybe we just want to add isinstance checks to the writer functions to help surface a good error if someone screws this up. Thoughts?

Happy to jump on a call if that'll be easier to align on what we want to finalize here.

@elshize
Copy link
Contributor Author

elshize commented Jul 1, 2023

But... we probably want to make this adapter generic/extensible in a way. e.g. make it pluggable for polars/other object types.

Yeah, I see what you mean. I haven't really thought of it because I didn't need it when implementing it, but this is definitely going to come up.

To not make it too complicated at the moment, I would approach it this way: We can have default serializers for some number of output formats, and it would check what is passed to it and do the right thing -- for whatever number of types we want to support out of the box. If someone needs to support an additonal type, then they can pass an extended writer:

def extended_writer(data, path):
    if isinstance(data, SomeType):
        write_some_type(data)
    else:
        hamilton_writer(data, path)

It's probably not the most elegant way to approach it, but it's very simple, most people would be covered by the default types supported, and those who are not would have a way of extending it.

Then, we could think whether a more complex approach is needed and what it would be, and if there is one, we implement it later. Thoughts?

@skrawcz skrawcz mentioned this pull request Jul 3, 2023
7 tasks
@skrawcz
Copy link
Collaborator

skrawcz commented Jul 3, 2023

okay @elshize I came up with #211 to extend things. Thoughts?

@elshize
Copy link
Contributor Author

elshize commented Jul 7, 2023

I checked it out and commented on the other PR. This is better than my solution I spoke about above :)

So is there anything else missing there other than some examples in the module doc? I should be able to have some time over the weekend to write some down; how would you want me to add it? Should I use the other branch, branch off and create a new PR, or edit here, and you can rebase?

@skrawcz
Copy link
Collaborator

skrawcz commented Jul 10, 2023

@elshize sorry, I missed this (the weekend is now over...). Yeah let's just get the module doc in order I think.

In terms of adding it, do whatever is easiest for you. The other branch is based off of this one. You could pull in all my commits to this branch, and then add the new one(s)?

We might want to squash some commits to keep history tidy but let's worry about that last.

Thanks!

@elshize
Copy link
Contributor Author

elshize commented Jul 11, 2023

You could pull in all my commits to this branch, and then add the new one(s)?

Yeah, I could do that.

We might want to squash some commits to keep history tidy but let's worry about that last.

Do you imagine there would be more than one commit for this? I was thinking of squashing all of it together, seems like a small enough and isolated change.

@skrawcz
Copy link
Collaborator

skrawcz commented Jul 11, 2023

You could pull in all my commits to this branch, and then add the new one(s)?

Yeah, I could do that.
👍

We might want to squash some commits to keep history tidy but let's worry about that last.

Do you imagine there would be more than one commit for this? I was thinking of squashing all of it together, seems like a small enough and isolated change.

I hadn't looked hard to be honest. But a single squashed commit would work too.

@elshize
Copy link
Contributor Author

elshize commented Jul 19, 2023

@skrawcz Btw, there are more modules I had to install in order to fully run tests:

pandera pyspark ray dask pyarrow distributed

And then still some spark and ray tests failed for me locally. Not sure why, could be because versions of the ones above don't match what is required.

@skrawcz
Copy link
Collaborator

skrawcz commented Jul 20, 2023

@skrawcz Btw, there are more modules I had to install in order to fully run tests:

pandera pyspark ray dask pyarrow distributed

And then still some spark and ray tests failed for me locally. Not sure why, could be because versions of the ones above don't match what is required.

Ah - what did you assume you needed, so I can fix up any docs. My assumption is (1) I should probably change https://github.com/DAGWorks-Inc/hamilton/blob/main/.ci/test.sh to instead refer to requirements-test.txt from within the graph adapter directories. (2) make a note in the general requirements-test.txt that it only applies to unit test, not the graph adapter integration test.

Note: Since we use circleci, you can also download the circleci CLI and run a job locally and it'll replicate the behavior of that test with docker -- but that is a little slow to iterate with.

Mind posting which tests? This is annoying if they're flakey.

@elshize
Copy link
Contributor Author

elshize commented Jul 20, 2023

here are logs: pytest.log notice that besides graph adapter tests, there is also one telemetry test that failed:
image

I didn't really need it for the Pr, I just wanted to run all tests :) so really not a problem for me. I just didn't know exactly what you expected to be in requirements, so wanted to mention it.

@skrawcz
Copy link
Collaborator

skrawcz commented Jul 20, 2023

here are logs: pytest.log notice that besides graph adapter tests, there is also one telemetry test that failed: image

I didn't really need it for the Pr, I just wanted to run all tests :) so really not a problem for me. I just didn't know exactly what you expected to be in requirements, so wanted to mention it.

Ah yep. The telemetry test fails locally; something I haven't dug into -- but it's not flakey on circleci proper.

@elshize elshize changed the title Caching adapter Caching Graph Adapter Jul 20, 2023
@elshize
Copy link
Contributor Author

elshize commented Jul 21, 2023

@skrawcz ok, I think this is ready for you to have another look at. One more thing that we maybe want to do is to include info about it somewhere in the docs/. Let me know if and where you would want that.

I haven't squashed commits yet, but I think we should just squash to one single commit, it's a simple enough change, I don't see a need for any more than that, but let me know if you feel different.

@skrawcz
Copy link
Collaborator

skrawcz commented Jul 21, 2023

@elshize for docs we'd add a new file to https://github.com/DAGWorks-Inc/hamilton/tree/main/docs/reference/graph-adapters (and add it to index.rst) and then perhaps a doc in https://github.com/DAGWorks-Inc/hamilton/tree/main/docs/how-tos explaining how you'd make use of it? We can always put the latter in a separate PR, but the first one would then at least add the adapter to the reference section.

Michal Siedlaczek and others added 8 commits July 21, 2023 09:50
This packs it all into h_cache.py for now.

It uses singledispatch because that's an easy way
to register functions for different object types.

In terms of deserialization, we hack things, by including
an empty expected type to get singldispatch
to read from the right function, since singledispatch
uses the first argument to resolve which function, and it
only works in instantiated types, and not on a Type object.

Longer term we should merge this with the datasavers --
i.e. the data savers should in general have an easy way to
access SERDE functions, but without the boilerplate overhead...

CAVEATS to make this truly generic is:

1. How to register these across different object types. Should these
be defined in the extension classes?
2. KWARGS: we need to handle how to pass these through nicely.
For the adapter, do we use tags? or allow some other means to inject
kwargs? Ditto for getting this working with the data savers.
So that we can test the caching serialization.
@elshize elshize requested a review from skrawcz July 22, 2023 13:05
@elshize
Copy link
Contributor Author

elshize commented Jul 22, 2023

@skrawcz ok, this is ready for you to review.

So that people can run it without problems. By default
pyarrow is not installed, so the example breaks.

Also setting sf-hamilton minimum based on future release.
Copy link
Collaborator

@skrawcz skrawcz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just needed to adjust the requirements.txt for the example. Thanks for all the work!

@skrawcz skrawcz merged commit 985fb92 into DAGWorks-Inc:main Jul 23, 2023
skrawcz added a commit that referenced this pull request Jul 23, 2023
For work on adding the caching adapter #195.
@elshize
Copy link
Contributor Author

elshize commented Jul 23, 2023

🔥

skrawcz added a commit that referenced this pull request Jul 23, 2023
For work on adding the caching adapter #195.
@elshize elshize deleted the caching-adapter branch August 3, 2023 12:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants