Defer to prod manifest for partial runs #2527
Comments
cc @jtcohen6 |
This is a slightly different and probably better approach than what I was describing in #2253 (comment)
|
I'm leaving some thoughts below on the discussion questions from above. Let me know if this makes sense @drewbanin @beckjake, and where you disagree.
If it needs to be enabled explicitly, is this a model config, a run flag, or something else?
I think this is a top-level flag for the run:
dbt run --models fct_snowplow_sessions+ --defer-to [path/to/artifacts/]
I'm not hard-set on the syntax here. Whatever we decide on should cohere with plans for #2465. To that end, should we allow comparing against one set of artifacts and then deferring to a different one, or require that they be the same one? That's the difference between a more concise version:
dbt run --models state:modified+ --deferred --state prod-target/
And a more flexible one:
dbt run --models state:modified+ --state prod-target/ --defer-to different-target/
I'm trying to imagine a use case for having different targets, and it feels like it'd be very specific to bigger organizations that have genuinely different source data in their prod/staging/dev environments. They want to run models that are changed vs. prod, they want to avoid rebuilding big models, but they need to defer to a different version of that big model from the one that's in prod. I don't think we need to enable this in the first version, but I'm also wary of making choices that close the door on this entirely, especially since big/complex/multi-project organizations will be some of the biggest beneficiaries of this feature.
Is this automatic? Or should models be configured as "deferrable" in some way?
I don't think we need model-level flags to differentiate between small, easily recreated models and big, expensive ones. The user will have plenty of power to do this themselves via node selection. I imagine a common pattern might be
What happens if a [referenced] model is not selected and also is not in the prod manifest?
dbt should return an immediate DAG error, as it does today when a referenced model is missing/disabled in the project.
Does this have any impact on non-models?
I think an acceptable first version of this is as a flag available to
As far as non-model nodes (seeds and snapshots) that are parents of models, I think they should work like models:
If included in the
E.g. if there is a seed
In the case of no node selectors, i.e.
How does this impact the generated manifest for the run? Are the "borrowed" prod models included in the generated manifest? Should they render in the auto-generated documentation?
The manifest produced by
Implications:
|
@drewbanin / @jtcohen6; This comment is related to issue #1612
Instead of this:
Covering the #1612 concern would require a command like this:
The question then is: WHAT model(s) is/are deferred to prod-target? For the #1612 example I would argue that everything that is not found in the current target is deferred to prod-target. So, this would be a more explicit syntax:
It goes without saying that prod-target could/should be any sort of target. For the #1612 example I would actually not refer to the live production dataset, but to a Snowflake clone that is updated daily / where PII data has been obfuscated...
I agree with @jtcohen6 that we should exclude testing from the initial version of this functionality; make it work for ...as it could be that preceding models are not present in the environment the command runs...
|
It occurs to me that deferring to artifacts generated by a different dbt version may have surprising effects, in the event that the manifests look quite different. That's:
|
I think this is not a "dbt version" situation, but actually a "schema version" situation. |
I buy that. I don't think we have an issue open yet for adding "schema version" to all generated artifacts. I know it came up in our conversation about revising run results specifically. |
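To make the "schema version" idea concrete, here is a minimal sketch of a guard that could run before deferring to a set of artifacts. Everything in it is an assumption for illustration: the `metadata.schema_version` field does not exist in the artifacts discussed here, and `check_deferrable` and `IncompatibleArtifactError` are hypothetical names, not part of dbt.

```python
import json
from pathlib import Path
from typing import Iterable, Optional


class IncompatibleArtifactError(Exception):
    """Raised when a deferred manifest cannot be trusted to have the expected shape."""


def read_schema_version(manifest_path: Path) -> Optional[str]:
    # ASSUMPTION: the version lives at metadata.schema_version (hypothetical field).
    with manifest_path.open() as f:
        manifest = json.load(f)
    return manifest.get("metadata", {}).get("schema_version")


def check_deferrable(manifest_path: Path, supported: Iterable[str]) -> str:
    """Return the artifact's schema version, or raise if it is missing or unsupported."""
    version = read_schema_version(manifest_path)
    if version is None:
        raise IncompatibleArtifactError(
            f"{manifest_path} has no schema version; refusing to defer to it"
        )
    if version not in set(supported):
        raise IncompatibleArtifactError(
            f"{manifest_path} has schema version {version!r}, which is not supported"
        )
    return version
```

Keying the check on the artifact's own schema version, rather than on the dbt version that produced it, means two dbt releases that write the same manifest shape would remain compatible.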
See also #1612, #1603, and #2465
Describe the feature
In a partial run of dbt (e.g. dbt run -m my_model+), all of the parents of my_model and its children must exist for the run to succeed. To this end, we've built the @ selector, but another helpful approach would be to defer to relations created by a production run when these parent models are referenced.
When dbt is provided a prod manifest (#2465), it can rewrite references to models based on the information contained in this manifest. Specifically, references to models which are not included in the run (i.e. they are not selected, or they are explicitly excluded) should be interpolated as a relation captured in the prod manifest.
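The resolution rule described above can be sketched roughly as follows. This is an illustration only, not dbt's implementation: `Relation`, `resolve_ref`, and `MissingNodeError` are hypothetical names, and each manifest is reduced to a plain dict from model name to relation. The error branch for a model found in neither place is also an assumption, chosen to mirror how dbt already fails on a missing or disabled reference.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Relation:
    database: str
    schema: str
    identifier: str

    def render(self) -> str:
        return f"{self.database}.{self.schema}.{self.identifier}"


class MissingNodeError(Exception):
    """Raised when a referenced model cannot be resolved in either manifest."""


def resolve_ref(
    name: str,
    selected: Set[str],
    current: Dict[str, Relation],
    deferred: Dict[str, Relation],
) -> Relation:
    """Pick the relation that a ref to `name` should be interpolated as."""
    # Models built in this run resolve to the current target, as usual.
    if name in selected and name in current:
        return current[name]
    # Models not selected (or explicitly excluded) defer to the prod manifest.
    if name not in selected and name in deferred:
        return deferred[name]
    # Neither place knows the model: fail fast (assumed behaviour).
    raise MissingNodeError(f"'{name}' is not in this run or in the deferred manifest")
```

For example, with only my_model selected, a ref to an unselected parent renders as the prod relation, while a ref to my_model itself still renders against the development schema.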
This flow will support the following use-cases:
Things to discuss:
Additional context
Let's approach this separately from #2465 but keep this context in mind
Who will this benefit?