Declaring explicit node dependencies (upstream nodes) #1156
Comments
Hi @younggalois, this is a common question, and in truth Kedro isn't best suited for SQL-based workflows where the execution happens outside of Python / a Python API. Our current solution for this is documented in our docs: the 'trick' we recommend is essentially to declare the same dataset twice, which preserves the topological node ordering.

To work with SQL, Kedro is best suited to do so via a Python API; this is possible with things like PySpark or, in the future, something like Snowpark. If you're really set on doing a lot of transformation that uses SQL as an execution engine, it's sometimes hard for us to argue that Kedro is the right fit for that sort of pipeline. Personally, I'd consider combining a dbt transformation pipeline with Kedro for the modelling/experimentation pipeline, as those sections tend to live in the Python world.
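The "declare the same dataset twice" trick can be sketched in catalog terms. This is a minimal illustration, not from the thread itself: all dataset, table, and credential names here are hypothetical, and the dataset types assume the `kedro-datasets` pandas SQL datasets.

```yaml
# catalog.yml (sketch; names are illustrative)

# node_A writes the full table through this entry:
table_A:
  type: pandas.SQLTableDataset
  table_name: table_a          # hypothetical table
  credentials: db_credentials  # hypothetical credentials key

# The same underlying table, registered a second time as a query
# dataset that node_B actually reads:
query_table_A:
  type: pandas.SQLQueryDataset
  sql: SELECT * FROM table_a WHERE some_filter = true
  credentials: db_credentials
```

The ordering comes from the pipeline definition, not the catalog: `node_B` lists both `query_table_A` and `table_A` as inputs (its function can simply ignore the second argument), so the runner sees a `node_A → node_B` edge and schedules them in the right order.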
@datajoely Thanks so much for the info and recommendations. Most of my pipeline works with pandas in Python, but there are a couple of nodes with the structure I mentioned. I was also thinking about declaring the dataset twice, with one declaration there just to enforce the ordering; I think that makes the most sense. I've looked at dbt a little but never used it; I'll definitely check it out.

There's so much that's awesome about Kedro, but one of my favorites is the ability to access everything (datasets, params). Honestly, Kedro is the perfect mix of research, development, productionization, best practices, testing, config, abstraction/concreteness, cloud/local, interoperability, etc. The UX is amazing. It's a crazy framework.
Love the feedback, good luck on your Kedro journey @younggalois 🚀
Description
I think it would be useful to be able to declare explicit node dependencies when creating nodes. Datasets that are not outputs of nodes, such as SQLQueryDataset, may still have upstream dependencies, but I'm not sure how to declare them as such.
Context
I may be missing an API for this, but currently it looks like node dependencies can only be declared implicitly, by being somewhere in the subgraph of a dataset passed in via the `inputs` param of the node function. I currently have a few nodes with upstream dependencies that are not in the subgraph of any dataset that gets passed in. For example, with SQLQueryDatasets, it is possible for a node to rely on a dataset that is not itself emitted by a node.
For example, say node `node_A` writes to a SQL table dataset `table_A`. I then define a SQLQueryDataset, `query_table_A`, that queries `table_A`. (Let's say I want `query_table_A` because I only need a small subset of data from `table_A`, and all of `table_A` is very large and wastes time and memory loading and saving.)

If I then have a node, `node_B`, that has `query_table_A` as an input, we can see that I need `table_A` to exist, but I don't see a way to explicitly tell `node_B` that `table_A` is an upstream dependency, since `query_table_A` is kind of a "standalone dataset". I.e. it's not an output of a node, so the crucial information that it relies on `table_A` doesn't get incorporated into its subgraph anywhere; it's only in the query itself, in the datasets.yml file. The serial order of execution of discrete subgraphs appears to be non-deterministic, so sometimes in my pipeline `node_B` correctly runs after `node_A` (in which case all is well), but sometimes it runs before (which breaks the pipeline).

Possible Implementation

From an API standpoint, we could add an additional parameter to `node` named something like `additional_upstream`, which takes a list of strings (dataset names). In my example above, we could pass `additional_upstream=['table_A']` into the definition of `node_B` in its pipeline, which would ensure it only runs after the node that generates `table_A`, i.e. `node_A`. From an implementation standpoint, I'm not sure how feasible this is.