
Defer to prod manifest for partial runs #2527

Closed
drewbanin opened this issue Jun 10, 2020 · 7 comments · Fixed by #2656
Labels: enhancement (New feature or request), more_information_needed, state (Stateful selection: state:modified, defer)

Comments

@drewbanin
Contributor

See also #1612, #1603, and #2465

Describe the feature

In a partial run of dbt (e.g. dbt run -m my_model+), all of the parents of my_model and its children must already exist for the run to succeed. To this end, we've built the @ selector, but another helpful approach would be to defer to the relations created by a production run when these parent models are referenced.

When dbt is provided a prod manifest (#2465), it can rewrite references to models based on the information contained in that manifest. Specifically, references to models which are not included in the run (i.e. they are not selected, or they are explicitly excluded) should be resolved to the relation captured in the prod manifest.
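The deferral rule above can be sketched roughly as follows. This is an illustrative sketch, not dbt's actual implementation: the function name resolve_ref and the plain-dict manifests are hypothetical stand-ins for dbt's internal ref resolution.

```python
def resolve_ref(model_name, selected, dev_relations, prod_relations):
    """Return the relation a ref() should compile to.

    selected       -- set of model names included in this partial run
    dev_relations  -- model name -> relation for the current (dev) target
    prod_relations -- model name -> relation captured in the prod manifest
    """
    if model_name in selected:
        # Built in this run: use the current target's relation.
        return dev_relations[model_name]
    if model_name in prod_relations:
        # Not selected: defer to the relation from the prod manifest.
        return prod_relations[model_name]
    raise KeyError(f"{model_name} is neither selected nor in the prod manifest")

# Example: my_model is selected; its parent big_model is deferred to prod.
selected = {"my_model"}
dev = {"my_model": "dev_schema.my_model", "big_model": "dev_schema.big_model"}
prod = {"big_model": "analytics.big_model"}
```

The last branch corresponds to the open question below about a referenced model that is neither selected nor present in the prod manifest.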

This flow will support the following use-cases:

  • Slim CI builds that only run new and changed models (without building their parents)
  • Local development (don't run large Snowplow models in dev)
  • Orgs with multiple projects where users don't have permissions to run models in an upstream package

Things to discuss:

  • Is this automatic? Or should models be configured as "deferrable" in some way?
  • If it needs to be enabled explicitly, is this a model config, a run flag, or something else?
  • What happens if a model is not selected and also is not in the prod manifest?
  • Does this have any impact on non-models?
    • Seeds
    • Snapshots
    • Tests
  • How does this impact the generated manifest for the run? Are the "borrowed" prod models included in the generated manifest? Should they render in the auto-generated documentation?

Additional context

Let's approach this separately from #2465 but keep this context in mind

Who will this benefit?

  • CI
  • Local dev
  • Large, sprawling, microproject architecture deployments
@drewbanin drewbanin added the enhancement New feature or request label Jun 10, 2020
@drewbanin
Contributor Author

cc @jtcohen6

@clausherther
Contributor

clausherther commented Jun 12, 2020

This is a slightly different and probably better approach than what I was describing in #2253 (comment)
In my example, we're trying to hack this use case by materializing a model in a special "validation" schema in the prod database while all of its upstream dependencies are previously run production tables. We currently do this by simply overriding the schema in the relevant model config and then running just the updated model against prod. This use case often comes up late in model development where we're already comfortable with the code, but are left trying to validate the new model using production data or validate an updated model against its current production version.
Using the approach outlined in this issue would be a much cleaner take on this use case, by letting us materialize the model in the development database, while reading from production for its upstream dependencies. That would cut down the potential for accidentally running other production models. It would also mean the developer would not need to be able to write to production and would only need read access.
In an ideal world, this would be done with a very lightweight approach, e.g. a new flag or operator for the CLI. Something like:
dbt run --models my_model --upstream prod --target dev
meaning my_model is materialized in dev, while all of its upstream dependencies come from prod.

  • upstream should default to target unless specified on the CLI
  • If an upstream model is not explicitly selected via --models it would be assumed to be present in prod if --upstream is prod
  • If it does not exist in prod, the run would fail
  • If it was selected, it would be materialized in dev or whatever the target is
  • Presumably you could also do the inverse, depending on your target configurations, e.g. dbt run --models my_model --upstream dev --target prod, although I don't see a use case here.
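The bullet-point rules above could be sketched like this. The helper resolve_parent and the dict-per-environment shape are hypothetical, purely to make the proposed --upstream semantics concrete.

```python
def resolve_parent(parent, selected, target, upstream=None):
    """target/upstream map model name -> relation in that environment.

    upstream defaults to target when not given (the proposed CLI default).
    """
    env = upstream if upstream is not None else target
    if parent in selected:
        # Explicitly selected via --models: materialized in the target.
        return target[parent]
    if parent not in env:
        # Not selected and absent upstream: the run would fail.
        raise RuntimeError(f"{parent} does not exist in the upstream environment")
    # Not selected: assumed to be present upstream (e.g. prod).
    return env[parent]

target_env = {"my_model": "dev.my_model"}
upstream_env = {"my_model": "prod.my_model", "parent_model": "prod.parent_model"}
```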

@jtcohen6
Contributor

jtcohen6 commented Jul 9, 2020

I'm leaving some thoughts below on the discussion questions from above. Let me know if this makes sense @drewbanin @beckjake, and where you disagree.

If it needs to be enabled explicitly, is this a model config, a run flag, or something else?

I think this is a top-level flag for the run:

dbt run --models fct_snowplow_sessions+ --defer-to [path/to/artifacts/]

I'm not hard-set on the syntax here. Whatever we decide on should cohere with plans for #2465. To that end, should we allow comparing against one set of artifacts and then deferring to a different one, or require that they be the same one? That's the difference between a more concise version:

dbt run --models state:modified+ --deferred --state prod-target/

And a more flexible one:

dbt run --models state:modified+ --state prod-target/ --defer-to different-target/

I'm trying to imagine a use case for having different targets, and it feels like it'd be very specific to bigger organizations that have genuinely different source data in their prod/staging/dev environments. They want to run models that are changed vs. prod, they want to avoid rebuilding big models, but they need to defer to a different version of that big model from the one that's in prod. I don't think we need to enable this in the first version, but I'm also wary of making choices that close the door on this entirely, especially since big/complex/multi-project organizations will be some of the biggest beneficiaries of this feature.

Is this automatic? Or should models be configured as "deferrable" in some way?

I don't think we need model-level flags to differentiate between small, easily recreated models and big, expensive ones. The user will have plenty of power to do this themselves via node selection. I imagine a common pattern might be

dbt run --exclude +my_massive_model --defer-to prod-target/

What happens if a [referenced] model is not selected and also is not in the prod manifest?

dbt should return an immediate DAG error, as it does today when a referenced model is missing/disabled in the project.

Does this have any impact on non-models?

I think an acceptable first version of this is as a flag available to dbt run only. That's where I see most of the benefit. We shouldn't invest a lot of time and energy (yet) in thinking through the implications of dbt test --defer-to [path] or dbt snapshot --defer-to [path].

As far as non-model nodes (seeds and snapshots) that are parents of models, I think they should work like models: If included in the --models selection syntax, use the current namespace; otherwise, use the namespace in the provided manifest.

E.g. if there is a seed csv_country_codes that is a parent of fct_snowplow_sessions, then a deferred run with --models +fct_snowplow_sessions would use the current namespace, and a deferred run with --models fct_snowplow_sessions would use the deferred namespace.

In the case of no node selectors, i.e. dbt run, I think we should use the current compiled namespaces only. This would mean that dbt run --defer-to [path] has no effect and is identical to dbt run.
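The selection-dependent namespace rule described above, including the no-selector case, could be sketched as follows. The helper namespace_for and the dict shapes are hypothetical, for illustration only; the example reuses the csv_country_codes / fct_snowplow_sessions scenario from above.

```python
def namespace_for(node, selected, current_ns, deferred_ns):
    """Pick the namespace for a node in a deferred run.

    selected    -- set of selected node names, or None when no selector is given
    current_ns  -- node name -> relation in the current target
    deferred_ns -- node name -> relation in the provided (deferred) manifest
    """
    if selected is None:
        # dbt run with no selectors: deferral has no effect.
        return current_ns[node]
    # Selected nodes use the current namespace; others defer.
    return current_ns[node] if node in selected else deferred_ns[node]

current = {"csv_country_codes": "dev.csv_country_codes",
           "fct_snowplow_sessions": "dev.fct_snowplow_sessions"}
deferred = {"csv_country_codes": "prod.csv_country_codes",
            "fct_snowplow_sessions": "prod.fct_snowplow_sessions"}
```

With --models +fct_snowplow_sessions the seed is part of the selection and stays in the current namespace; with --models fct_snowplow_sessions it is unselected and defers.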

How does this impact the generated manifest for the run? Are the "borrowed" prod models included in the generated manifest? Should they render in the auto-generated documentation?

The manifest produced by dbt run --defer-to [path] should include all unselected nodes at their "deferred" namespaces.

Implications:

  • You could defer to the manifest of a past run, which was itself a partial/deferred run. I think that sounds trickier than it actually is, and it should be pretty reasonable in practice.
  • I suppose this has implications for auto-generated documentation, but insofar as we aren't including dbt docs generate --defer-to [path] in the first version of this, I don't think it's going to crop up often.
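The proposed contents of the output manifest could be sketched like this: selected nodes at their freshly built relations, every unselected node carried over at its deferred relation. Again a hypothetical sketch, not dbt's actual manifest-writing code.

```python
def build_output_manifest(selected, built_relations, deferred_relations):
    """Merge the run's results with the deferred manifest.

    selected           -- set of node names built in this run
    built_relations    -- node name -> relation produced by this run
    deferred_relations -- node name -> relation from the deferred manifest
    """
    manifest = {}
    for name, relation in deferred_relations.items():
        if name not in selected:
            manifest[name] = relation        # "borrowed" at its deferred namespace
    for name in selected:
        manifest[name] = built_relations[name]  # built in this run
    return manifest
```

Because unselected nodes are recorded at their deferred namespaces, the output manifest of a partial run can itself serve as the deferred manifest for a later run, which is the first implication noted above.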

@bashyroger

bashyroger commented Jul 13, 2020

@drewbanin / @jtcohen6; This comment is related to issue #1612
This is how I think the use case of #1612 could be incorporated in #2527:

Instead of this:
dbt run --exclude +my_massive_model --defer-to prod-target/

Covering the #1612 concern would require a command like this:
dbt run +my_other_model_based_on_massive_model --defer-to prod-target/

The question then is: which model(s) are deferred to prod-target? For the #1612 example, I would argue that everything not found in the current target is deferred to prod-target.

So, this would be a more explicit syntax:
dbt run +my_other_model_based_on_massive_model --not-found-defer-to prod-target/

It goes without saying that prod-target could (and should) be any sort of target. For the #1612 example I would actually not refer to the live production dataset, but to a Snowflake clone that is updated daily and where PII data has been obfuscated.

I agree with @jtcohen6 that we should exclude testing from the initial version of this functionality; make it work for dbt run only.
For the #1612 example this would mean that running this would fail in development:
dbt test +my_other_model_based_on_massive_model

...as it could be that preceding models are not present in the environment the command runs in.

@jtcohen6
Contributor

It occurs to me that deferring to artifacts generated by a different dbt version may have surprising effects, in the event that the manifests look quite different. That's:

  • a caveat to document
  • all the more reason we need to include dbt versions in all generated artifacts, so we can raise a concrete warning

@beckjake
Contributor

I think this is not a "dbt version" situation, but actually a "schema version" situation.

@jtcohen6
Contributor

I buy that. I don't think we have an issue yet open for adding "schema version" to all generated artifacts. I know it came up in our conversation about revising run results specifically.
