
Run Development Models using Production Data #1612

Closed
nickymikail opened this issue Jul 16, 2019 · 4 comments

Comments

@nickymikail

Feature

Run Development Models using Production Data

There should be a command that allows data analysts to test models using production data. With a DAG like this:

model_a \
          - model_c - model_d
model_b /

If an analyst wants to test a modification to model_d (or create a new dependent model_e), they would need to have models a, b, and c present in their development schema, or build those models at runtime as well. If model a, b, or c is a large dataset, or if any of those models has a significant run time, this could represent nontrivial costs in time and storage. A new command, dbt develop --models model_d, could parse any refs to models not named in the --models argument as references to the up-to-date production dataset, while still writing to a development target. These costs could then be avoided, and data model development could be significantly sped up.

Who will this benefit?

This would benefit larger analytics teams who observe significant storage costs from duplicate data, and teams with more generations in their dependency graphs.

@drewbanin
Contributor

Hey @nickymikail - thanks for making this issue!

Check out the thread over here: #1603

#1603 describes something different than what you're asking for, but I think there might be a feature we can pull out of that issue which would address your use case. The big idea is just that (some?) refs should address models in your development schema, while others should address the models in the production schema.

It would be really hard to build the generalized version of this: models can have all sorts of environment-aware logic which changes the destination schema/table names in prod vs. dev, for instance. Just knowing the location of the prod version of a model is going to be really tough in development!

Here are some alternative approaches that we recommend:

  1. Add filters to models in development, e.g.:

     select * from my_really_big_table
     {% if target.name == 'dev' %}
     where created_at > getdate() - interval '3 days'
     {% endif %}

  2. Implement use-case-specific logic in specific models:

     {% if target.name == 'dev' %}
       -- just make a view that points to the prod model
       {{ config(materialized='view') }}

       select * from production.table_name

     {% else %}
       -- build the model here
       {{ config(materialized='incremental') }}
       select ...

     {% endif %}

  3. Utilize "zero-copy clones" if possible (Snowflake only!)
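For option 3, a zero-copy clone on Snowflake can be sketched roughly like this (the database/schema/table names here are placeholders, not names from this thread):

```sql
-- Hypothetical sketch: clone a prod model into a dev schema on Snowflake.
-- "analytics.prod.pageviews" and "dev_scratch.pageviews" are placeholder names.
create or replace table dev_scratch.pageviews
  clone analytics.prod.pageviews;
```

Because a clone shares the underlying micro-partitions with the source table, it is near-instant and incurs no additional storage cost until the cloned data diverges.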

Curious what you think about all of this!

@bashyroger

bashyroger commented Jul 8, 2020

Hi @drewbanin / @nickymikail: I think the use case I just discussed with Dylan Baker on Slack is similar in nature.
What I am looking for is the following:

Initial state:

  • I have 2 years of snowplow event data
  • On top of that I have an incremental model in prod, aggregating it to the pageview level.

Change:

  • Now I want to create a new model, OR change an existing one called pageviews_per_month_aggregate, on top of the pageview model
  • Initially, that pageview model does not exist in my dev environment
  • Instead of (partially) rebuilding it, I would like that pageview model to be copied (or, in Snowflake, cloned) when the model is not found.

So, basically, when querying that pageview model in dev, my preferred way of handling this would be for dbt to do something like this auto-magically :) as part of the Jinja parsing:

IF model_used_in_ref NOT FOUND IN dev_target AND model_used_in_ref EXISTS IN prod_target THEN COPY or CLONE model_used_in_ref FROM prod_target END IF
...and then execute the query
Obviously, where dbt would copy/clone the data from should be configurable. I used prod_target in this example, but it might just as well come from an obfuscated copy/clone of prod (thinking about PII data).
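The pseudocode above could be approximated today with a custom Jinja macro wrapped around ref(). This is only a sketch: resolve_ref is not a real dbt macro, and the 'analytics'/'prod' fallback location is a placeholder; only adapter.get_relation and the target variable are real parts of dbt's Jinja context.

```sql
-- Hypothetical macro sketch: resolve to the dev relation if it exists,
-- otherwise fall back to a hard-coded prod location.
-- `resolve_ref` is NOT a built-in dbt macro; 'analytics'/'prod' are placeholders.
{% macro resolve_ref(model_name) %}
    {%- set dev_rel = adapter.get_relation(
            database=target.database,
            schema=target.schema,
            identifier=model_name) -%}
    {%- if dev_rel is not none -%}
        {{ ref(model_name) }}
    {%- else -%}
        analytics.prod.{{ model_name }}
    {%- endif -%}
{% endmacro %}
```

Note this resolves at runtime rather than cloning the missing model, so it sidesteps the copy step entirely, at the cost of reading directly from prod in dev.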

I could also imagine the new dbt command that @nickymikail suggested:
dbt develop --models model_d
dbt develop --models pageviews_per_month_aggregate

OR a new run flag for parent models like
dbt run --models +model_d --initialize_parents
dbt run --models +pageviews_per_month_aggregate --initialize_parents
The --initialize_parents flag would then take care of cloning / copying any parent models of model_d / pageviews_per_month_aggregate

@jtcohen6
Contributor

jtcohen6 commented Jul 9, 2020

@bashyroger You've outlined a really compelling use case. We've come a long way on our thinking here over the past several months.

If you're on Snowflake, zero-copy cloning is a massive help because there's almost no added cost for cloning more than you need to. You can paint with the broadest possible brush. GitLab uses cloning as part of their CI process today; check out @emilieschario's recent Discourse comment for links.

Copying is a much costlier operation, so you wouldn't want to copy any more objects than you absolutely need to. We're laying some significant groundwork to enable more precise approaches in the next release:

@jtcohen6
Contributor

jtcohen6 commented Sep 10, 2020

I'm going to close this issue, since we added dbt run --defer in v0.18.0 (#2656, docs).
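A minimal sketch of the --defer workflow that closed this issue, assuming production run artifacts have been saved to a prod-run-artifacts/ directory (the directory name and model name are placeholders):

```
# Hypothetical usage sketch: paths and model names are placeholders.
# 1. After a production run, save the manifest dbt wrote to target/.
cp target/manifest.json prod-run-artifacts/

# 2. In development, build only model_d. Any ref() to a model that is
#    not being built (and doesn't already exist in dev) resolves to the
#    production location recorded in the saved manifest.
dbt run --models model_d --defer --state prod-run-artifacts/
```

This addresses the original request directly: upstream models a, b, and c never need to exist in the development schema.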
