
Declaring explicit node dependencies (upstream nodes) #1156

Closed
younggalois opened this issue Jan 15, 2022 · 3 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@younggalois

Description

I think it would be useful to be able to declare explicit node dependencies when creating nodes. Datasets that are not outputs of nodes, such as SQLQueryDataset, may still have upstream dependencies, but I'm not sure how to declare them as such.

Context

I may be missing an API for this, but currently it looks like node dependencies are only declared implicitly: a node depends on whatever lies in the sub-graph upstream of the datasets passed to it via the "inputs" param. I currently have a few nodes with upstream dependencies that are not in the sub-graph of anything passed in as an input. For example, with SQLQueryDataset it is possible for a node to rely on a dataset that is not itself emitted by any node.

For example, say node node_A writes to a SQL table dataset, table_A. I then define a SQLQueryDataset, query_table_A, that queries table_A. (Let's say I want query_table_A because I only need a small subset of the data in table_A; table_A itself is very large, and loading and saving all of it wastes time and memory.)

If I then have a node, node_B, that takes query_table_A as an input, clearly node_B needs table_A to exist, but I don't see a way to explicitly tell node_B that table_A is upstream, since query_table_A is a kind of "standalone dataset". It isn't an output of any node, so the crucial information that it relies on table_A doesn't get incorporated into its sub-graph anywhere; it only lives in the query itself, in the datasets.yml file. It looks like the serial order of execution of disconnected sub-graphs is non-deterministic, so sometimes in my pipeline node_B correctly runs after node_A (in which case all is well), but sometimes it runs before (which breaks the pipeline).
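A minimal sketch of the setup (dataset types, credentials and node functions here are illustrative, not my real project):

```yaml
# catalog.yml -- illustrative only
table_A:
  type: pandas.SQLTableDataSet        # written by node_A
  table_name: table_A
  credentials: db_credentials

query_table_A:
  type: pandas.SQLQueryDataSet        # read by node_B; nothing links it to node_A
  sql: SELECT id, value FROM table_A WHERE value > 0
  credentials: db_credentials
```

```python
from kedro.pipeline import Pipeline, node

def build_table_a(raw_data):          # hypothetical node functions
    return raw_data

def process_subset(query_result):
    return query_result.describe()

# node_B's only input is query_table_A, a free-standing dataset, so Kedro sees
# node_A and node_B as two disconnected sub-graphs with no ordering between them.
pipeline = Pipeline([
    node(build_table_a, inputs="raw_data", outputs="table_A", name="node_A"),
    node(process_subset, inputs="query_table_A", outputs="result", name="node_B"),
])
```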

Possible Implementation

From an API standpoint, we could have an additional parameter to node, named something like additional_upstream, which takes a list of strings (dataset names). In my example above, we could just pass additional_upstream=["table_A"] into the definition of node_B in its pipeline, which would ensure it only runs after the node that generates table_A, i.e. node_A, has run.
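Roughly like this (additional_upstream is of course hypothetical; it doesn't exist in the current node signature):

```python
from kedro.pipeline import node

# Hypothetical: additional_upstream is the proposed parameter, not a real one.
node_b = node(
    func=process_subset,               # same illustrative function as above
    inputs="query_table_A",
    outputs="result",
    name="node_B",
    additional_upstream=["table_A"],   # force node_A (producer of table_A) to run first
)
```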

From an implementation standpoint, I'm not sure how feasible this is.

@younggalois added the Issue: Feature Request label on Jan 15, 2022
@datajoely
Contributor

datajoely commented Jan 15, 2022

Hi @younggalois, this is a common question, and in truth Kedro isn't best suited to SQL-based workflows where execution happens outside of Python / the Python API.

Our current solution for this is documented in the spark.DeltaTableDataSet docs under the name 'out of DAG' transactions. Delta, if you aren't familiar with it, is similar to SQL in the sense that you can perform CRUD-type operations like update and delete. These don't translate neatly into the Kedro world, since we are in the habit of reading/writing (maybe appending) but not really mutating.

The 'trick' we recommend is essentially to declare the same dataset twice, which preserves the topological node ordering:
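Adapted to your example, that would look roughly like the sketch below (a sketch, not the exact snippet from the docs): node_B also lists table_A, the dataset node_A produces, among its inputs, even though it never uses it, so Kedro's topological sort puts node_A first. If loading the full table just for ordering is too expensive, node_A could instead emit a small marker dataset for node_B to ignore.

```python
from kedro.pipeline import Pipeline, node

def build_table_a(raw_data):                   # hypothetical, as in the earlier sketch
    return raw_data

def process_subset(query_result, _table_a):    # _table_a is never used; it exists
    return query_result.describe()             # only so that Kedro adds the edge

pipeline = Pipeline([
    node(build_table_a, inputs="raw_data", outputs="table_A", name="node_A"),
    # Listing "table_A" as an input makes node_B downstream of node_A, even though
    # the data node_B actually works on is loaded via query_table_A.
    node(process_subset, inputs=["query_table_A", "table_A"], outputs="result", name="node_B"),
])
```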

To work with SQL, Kedro is best suited to doing so via a Python API - this is possible with things like PySpark or, in the future, something like Snowpark. If you're really set on doing a lot of transformation that uses SQL as the execution engine, it's sometimes hard for us to argue that Kedro is the right fit for that sort of pipeline.

Personally, I'd consider combining a dbt transformation pipeline with Kedro for the modelling/experimentation pipeline, as those sections tend to live in the Python world.

@younggalois
Author

@datajoely Thanks so much for the info and recommendations.

Most of my pipeline works with pandas in Python, but there are a couple of nodes with the structure I mentioned. I was also thinking about the declare-it-twice approach, with one declaration there just to enforce the ordering. I think this makes the most sense.

I've looked at dbt a little but never used it. I'll definitely check it out. There's so much that's awesome about Kedro - one of my favorites is the ability to access everything (datasets, params) through the catalog variable. Having it all right there is a material quality-of-life improvement. :)

Honestly, Kedro is the perfect mix of research, development, productionization, best practices, testing, config, abstraction/concreteness, cloud/local, interoperability, etc. The UX is amazing. It's a crazy framework.

@datajoely
Contributor

Love the feedback, good luck on your Kedro journey @younggalois 🚀
