
Declaring explicit node dependencies (upstream nodes) #1156

Closed
younggalois opened this issue Jan 15, 2022 · 3 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@younggalois

Description

I think it would be useful to be able to declare explicit node dependencies when creating nodes. Datasets that are not outputs of nodes, such as SQLQueryDataset, may still have upstream dependencies, but I'm not sure how to declare them as such.

Context

I may be missing an API for this, but currently it looks like node dependencies are only declared implicitly: a node depends on whatever lies in the sub-graph upstream of the datasets passed to it via the "inputs" param. I currently have a few nodes with upstream dependencies that are not in the sub-graph of anything passed in as an input. For example, with SQLQueryDataset it is possible for a node to rely on a dataset that is not itself emitted by any node.

For example, say node node_A writes to a SQL table dataset, table_A. I then define a SQLQueryDataset, query_table_A, that queries table_A. (Let's say I want query_table_A because I only need a small subset of the data in table_A; table_A itself is very large, and loading and saving all of it wastes time and memory.)

If I then have a node, node_B, that takes query_table_A as an input, clearly node_B needs table_A to exist, but I don't see a way to explicitly tell node_B that table_A is upstream, since query_table_A is a kind of "standalone dataset". It isn't an output of any node, so the crucial information that it relies on table_A doesn't get incorporated into its sub-graph anywhere; it only lives in the query itself, in the datasets.yml file. It looks like the serial order of execution of disconnected sub-graphs is non-deterministic, so sometimes in my pipeline node_B correctly runs after node_A (in which case all is well), but sometimes it runs before (which breaks the pipeline).
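A minimal sketch of the setup (dataset types, credentials and node functions here are illustrative, not my real project):

```yaml
# catalog.yml -- illustrative only
table_A:
  type: pandas.SQLTableDataSet        # written by node_A
  table_name: table_A
  credentials: db_credentials

query_table_A:
  type: pandas.SQLQueryDataSet        # read by node_B; nothing links it to node_A
  sql: SELECT id, value FROM table_A WHERE value > 0
  credentials: db_credentials
```

```python
from kedro.pipeline import Pipeline, node

def build_table_a(raw_data):          # hypothetical node functions
    return raw_data

def process_subset(query_result):
    return query_result.describe()

# node_B's only input is query_table_A, a free-standing dataset, so Kedro sees
# node_A and node_B as two disconnected sub-graphs with no ordering between them.
pipeline = Pipeline([
    node(build_table_a, inputs="raw_data", outputs="table_A", name="node_A"),
    node(process_subset, inputs="query_table_A", outputs="result", name="node_B"),
])
```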

Possible Implementation

From an API standpoint, we could have an additional parameter to node, named something like additional_upstream, which takes a list of strings (dataset names). In my example above, we could just pass additional_upstream=["table_A"] into the definition of node_B in its pipeline, which would ensure it only runs after the node that generates table_A, i.e. node_A, has run.
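Roughly like this (additional_upstream is of course hypothetical; it doesn't exist in the current node signature):

```python
from kedro.pipeline import node

# Hypothetical: additional_upstream is the proposed parameter, not a real one.
node_b = node(
    func=process_subset,               # same illustrative function as above
    inputs="query_table_A",
    outputs="result",
    name="node_B",
    additional_upstream=["table_A"],   # force node_A (producer of table_A) to run first
)
```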

From an implementation standpoint, I'm not sure how feasible this is.

@younggalois added the Issue: Feature Request label on Jan 15, 2022
@datajoely
Contributor

datajoely commented Jan 15, 2022

Hi @younggalois, this is a common question, and in truth Kedro isn't best suited to SQL-based workflows where execution happens outside of Python / the Python API.

Our current solution for this is documented in the spark.DeltaTableDataSet docs under the name 'out of DAG' transactions. Delta, if you aren't familiar with it, is similar to SQL in the sense that you can perform CRUD-type operations like update and delete. These don't translate neatly into the Kedro world, since we are in the habit of reading/writing (maybe appending) but not really mutating.

The 'trick' we recommend is essentially to declare the same dataset twice, which preserves the topological node ordering:
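Adapted to your example, that would look roughly like the sketch below (a sketch, not the exact snippet from the docs): node_B also lists table_A, the dataset node_A produces, among its inputs, even though it never uses it, so Kedro's topological sort puts node_A first. If loading the full table just for ordering is too expensive, node_A could instead emit a small marker dataset for node_B to ignore.

```python
from kedro.pipeline import Pipeline, node

def build_table_a(raw_data):                   # hypothetical, as in the earlier sketch
    return raw_data

def process_subset(query_result, _table_a):    # _table_a is never used; it exists
    return query_result.describe()             # only so that Kedro adds the edge

pipeline = Pipeline([
    node(build_table_a, inputs="raw_data", outputs="table_A", name="node_A"),
    # Listing "table_A" as an input makes node_B downstream of node_A, even though
    # the data node_B actually works on is loaded via query_table_A.
    node(process_subset, inputs=["query_table_A", "table_A"], outputs="result", name="node_B"),
])
```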

To work with SQL, Kedro is best suited to doing so via a Python API - this is possible with things like PySpark or, in the future, something like Snowpark. If you're really set on doing a lot of transformation that uses SQL as the execution engine, it's sometimes hard for us to argue that Kedro is the right fit for that sort of pipeline.

Personally, I'd consider combining a dbt transformation pipeline with Kedro for the modelling/experimentation pipeline, as those sections tend to live in the Python world.

@younggalois
Author

@datajoely Thanks so much for the info and recommendations.

Most of my pipeline works with pandas in Python, but there are a couple of nodes with the structure I mentioned. I was also thinking about the declare-it-twice approach, with one declaration there just to enforce the ordering. I think this makes the most sense.

I've looked at dbt a little but never used it. I'll definitely check it out. There's so much that's awesome about Kedro - one of my favorites is the ability to access everything (datasets, params) through the catalog variable. Having it all right there is a material quality-of-life improvement. :)

Honestly, Kedro is the perfect mix of research, development, productionization, best practices, testing, config, abstraction/concreteness, cloud/local, interoperability, etc. The UX is amazing. It's a crazy framework.

@datajoely
Contributor

Love the feedback, good luck on your Kedro journey @younggalois 🚀
