
Defer to prod manifest for partial runs #2527

Closed
drewbanin opened this issue Jun 10, 2020 · 7 comments · Fixed by #2656
Labels: enhancement (New feature or request), more_information_needed, state (Stateful selection: state:modified, defer)

Comments

@drewbanin
Contributor

See also #1612, #1603, and #2465

Describe the feature

In a partial run of dbt (e.g. dbt run -m my_model+), all of the parents of my_model and its children must already exist for the run to succeed. To this end, we've built the @ selector, but another helpful approach would be to defer to the relations created by a production run when these parent models are referenced.

When dbt is provided a prod manifest (#2465), it can rewrite references to models based on the information contained in that manifest. Specifically, references to models which are not included in the run (i.e. they are not selected, or they are explicitly excluded) should be resolved to the relation captured in the prod manifest.
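The deferral rule above can be sketched roughly as follows. This is an illustrative sketch, not dbt's actual implementation: the function name resolve_ref and the plain-dict manifests are hypothetical stand-ins for dbt's internal ref resolution.

```python
def resolve_ref(model_name, selected, dev_relations, prod_relations):
    """Return the relation a ref() should compile to.

    selected       -- set of model names included in this partial run
    dev_relations  -- model name -> relation for the current (dev) target
    prod_relations -- model name -> relation captured in the prod manifest
    """
    if model_name in selected:
        # Built in this run: use the current target's relation.
        return dev_relations[model_name]
    if model_name in prod_relations:
        # Not selected: defer to the relation from the prod manifest.
        return prod_relations[model_name]
    raise KeyError(f"{model_name} is neither selected nor in the prod manifest")

# Example: my_model is selected; its parent big_model is deferred to prod.
selected = {"my_model"}
dev = {"my_model": "dev_schema.my_model", "big_model": "dev_schema.big_model"}
prod = {"big_model": "analytics.big_model"}
```

The last branch corresponds to the open question below about a referenced model that is neither selected nor present in the prod manifest.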

This flow will support the following use-cases:

  • Slim CI builds that only run new and changed models (without building their parents)
  • Local development (don't run large Snowplow models in dev)
  • Orgs with multiple projects where users don't have permissions to run models in an upstream package

Things to discuss:

  • Is this automatic? Or should models be configured as "deferrable" in some way?
  • If it needs to be enabled explicitly, is this a model config, a run flag, or something else?
  • What happens if a model is not selected and also is not in the prod manifest?
  • Does this have any impact on non-models?
    • Seeds
    • Snapshots
    • Tests
  • How does this impact the generated manifest for the run? Are the "borrowed" prod models included in the generated manifest? Should they render in the auto-generated documentation?

Additional context

Let's approach this separately from #2465 but keep this context in mind

Who will this benefit?

  • CI
  • Local dev
  • Large, sprawling, microproject architecture deployments
@drewbanin drewbanin added the enhancement New feature or request label Jun 10, 2020
@drewbanin
Contributor Author

cc @jtcohen6

@clausherther
Contributor

clausherther commented Jun 12, 2020

This is a slightly different and probably better approach than what I was describing in #2253 (comment)
In my example, we're trying to hack this use case by materializing a model in a special "validation" schema in the prod database while all of its upstream dependencies are previously run production tables. We currently do this by simply overriding the schema in the relevant model config and then running just the updated model against prod. This use case often comes up late in model development where we're already comfortable with the code, but are left trying to validate the new model using production data or validate an updated model against its current production version.
Using the approach outlined in this issue would be a much cleaner take on this use case, by letting us materialize the model in the development database, while reading from production for its upstream dependencies. That would cut down the potential for accidentally running other production models. It would also mean the developer would not need to be able to write to production and would only need read access.
In an ideal world, this would be done with a very lightweight approach, e.g. a new flag or operator for the CLI. Something like:
dbt run --models my_model --upstream prod --target dev
meaning my_model is materialized in dev, while all of its upstream dependencies come from prod.

  • upstream should default to target unless specified on the CLI
  • If an upstream model is not explicitly selected via --models it would be assumed to be present in prod if --upstream is prod
  • If it does not exist in prod, the run would fail
  • If it was selected, it would be materialized in dev or whatever the target is
  • Presumably you could also do the inverse, depending on your target configurations, e.g. dbt run --models my_model --upstream dev --target prod, although I don't see a use case here.
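The bullet-point rules above could be sketched like this. The helper resolve_parent and the dict-per-environment shape are hypothetical, purely to make the proposed --upstream semantics concrete.

```python
def resolve_parent(parent, selected, target, upstream=None):
    """target/upstream map model name -> relation in that environment.

    upstream defaults to target when not given (the proposed CLI default).
    """
    env = upstream if upstream is not None else target
    if parent in selected:
        # Explicitly selected via --models: materialized in the target.
        return target[parent]
    if parent not in env:
        # Not selected and absent upstream: the run would fail.
        raise RuntimeError(f"{parent} does not exist in the upstream environment")
    # Not selected: assumed to be present upstream (e.g. prod).
    return env[parent]

target_env = {"my_model": "dev.my_model"}
upstream_env = {"my_model": "prod.my_model", "parent_model": "prod.parent_model"}
```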

@jtcohen6
Contributor

jtcohen6 commented Jul 9, 2020

I'm leaving some thoughts below on the discussion questions from above. Let me know if this makes sense @drewbanin @beckjake, and where you disagree.

If it needs to be enabled explicitly, is this a model config, a run flag, or something else?

I think this is a top-level flag for the run:

dbt run --models fct_snowplow_sessions+ --defer-to [path/to/artifacts/]

I'm not hard-set on the syntax here. Whatever we decide on should cohere with plans for #2465. To that end, should we allow comparing against one set of artifacts and then deferring to a different one, or require that they be the same one? That's the difference between a more concise version:

dbt run --models state:modified+ --deferred --state prod-target/

And a more flexible one:

dbt run --models state:modified+ --state prod-target/ --defer-to different-target/

I'm trying to imagine a use case for having different targets, and it feels like it'd be very specific to bigger organizations that have genuinely different source data in their prod/staging/dev environments. They want to run models that are changed vs. prod, they want to avoid rebuilding big models, but they need to defer to a different version of that big model from the one that's in prod. I don't think we need to enable this in the first version, but I'm also wary of making choices that close the door on this entirely, especially since big/complex/multi-project organizations will be some of the biggest beneficiaries of this feature.

Is this automatic? Or should models be configured as "deferrable" in some way?

I don't think we need model-level flags to differentiate between small, easily recreated models and big, expensive ones. The user will have plenty of power to do this themselves via node selection. I imagine a common pattern might be

dbt run --exclude +my_massive_model --defer-to prod-target/

What happens if a [referenced] model is not selected and also is not in the prod manifest?

dbt should return an immediate DAG error, as it does today when a referenced model is missing/disabled in the project.

Does this have any impact on non-models?

I think an acceptable first version of this is as a flag available to dbt run only. That's where I see most of the benefit. We shouldn't invest a lot of time and energy (yet) in thinking through the implications of dbt test --defer-to [path] or dbt snapshot --defer-to [path].

As far as non-model nodes (seeds and snapshots) that are parents of models, I think they should work like models: If included in the --models selection syntax, use the current namespace; otherwise, use the namespace in the provided manifest.

E.g. if there is a seed csv_country_codes that is a parent of fct_snowplow_sessions, then a deferred run with --models +fct_snowplow_sessions would use the current namespace, and a deferred run with --models fct_snowplow_sessions would use the deferred namespace.

In the case of no node selectors, i.e. dbt run, I think we should use the current compiled namespaces only. This would mean that dbt run --defer-to [path] has no effect and is identical to dbt run.
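The selection-dependent namespace rule described above, including the no-selector case, could be sketched as follows. The helper namespace_for and the dict shapes are hypothetical, for illustration only; the example reuses the csv_country_codes / fct_snowplow_sessions scenario from above.

```python
def namespace_for(node, selected, current_ns, deferred_ns):
    """Pick the namespace for a node in a deferred run.

    selected    -- set of selected node names, or None when no selector is given
    current_ns  -- node name -> relation in the current target
    deferred_ns -- node name -> relation in the provided (deferred) manifest
    """
    if selected is None:
        # dbt run with no selectors: deferral has no effect.
        return current_ns[node]
    # Selected nodes use the current namespace; others defer.
    return current_ns[node] if node in selected else deferred_ns[node]

current = {"csv_country_codes": "dev.csv_country_codes",
           "fct_snowplow_sessions": "dev.fct_snowplow_sessions"}
deferred = {"csv_country_codes": "prod.csv_country_codes",
            "fct_snowplow_sessions": "prod.fct_snowplow_sessions"}
```

With --models +fct_snowplow_sessions the seed is part of the selection and stays in the current namespace; with --models fct_snowplow_sessions it is unselected and defers.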

How does this impact the generated manifest for the run? Are the "borrowed" prod models included in the generated manifest? Should they render in the auto-generated documentation?

The manifest produced by dbt run --defer-to [path] should include all unselected nodes at their "deferred" namespaces.

Implications:

  • You could defer to the manifest of a past run, which was itself a partial/deferred run. I think that sounds trickier than it actually is, and it should be pretty reasonable in practice.
  • I suppose this has implications for auto-generated documentation, but insofar as we aren't including dbt docs generate --defer-to [path] in the first version of this, I don't think it's going to crop up often.
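The proposed contents of the output manifest could be sketched like this: selected nodes at their freshly built relations, every unselected node carried over at its deferred relation. Again a hypothetical sketch, not dbt's actual manifest-writing code.

```python
def build_output_manifest(selected, built_relations, deferred_relations):
    """Merge the run's results with the deferred manifest.

    selected           -- set of node names built in this run
    built_relations    -- node name -> relation produced by this run
    deferred_relations -- node name -> relation from the deferred manifest
    """
    manifest = {}
    for name, relation in deferred_relations.items():
        if name not in selected:
            manifest[name] = relation        # "borrowed" at its deferred namespace
    for name in selected:
        manifest[name] = built_relations[name]  # built in this run
    return manifest
```

Because unselected nodes are recorded at their deferred namespaces, the output manifest of a partial run can itself serve as the deferred manifest for a later run, which is the first implication noted above.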

@bashyroger

bashyroger commented Jul 13, 2020

@drewbanin / @jtcohen6; This comment is related to issue #1612
This is how I think the use case of #1612 could be incorporated in #2527:

Instead of this:
dbt run --exclude +my_massive_model --defer-to prod-target/

Covering the #1612 concern would require a command like this:
dbt run +my_other_model_based_on_massive_model --defer-to prod-target/

The question then is: which model(s) are deferred to prod-target? For the #1612 example, I would argue that everything not found in the current target is deferred to prod-target.

So, this would be a more explicit syntax:
dbt run +my_other_model_based_on_massive_model --not-found-defer-to prod-target/

It goes without saying that prod-target could (and should) be any sort of target. For the #1612 example I would actually not refer to the live production dataset, but to a Snowflake clone that is updated daily and where PII data has been obfuscated.

I agree with @jtcohen6 that we should exclude testing from the initial version of this functionality; make it work for dbt run only.
For the #1612 example this would mean that running this would fail in development:
dbt test +my_other_model_based_on_massive_model

...as it could be that preceding models are not present in the environment the command runs in.

@jtcohen6
Contributor

It occurs to me that deferring to artifacts generated by a different dbt version may have surprising effects, in the event that the manifests look quite different. That's:

  • a caveat to document
  • all the more reason we need to include dbt versions in all generated artifacts, so we can raise a concrete warning

@beckjake
Contributor

I think this is not a "dbt version" situation, but actually a "schema version" situation.

@jtcohen6
Contributor

I buy that. I don't think we have an issue yet open for adding "schema version" to all generated artifacts. I know it came up in our conversation about revising run results specifically.
