Provide option to allow tests to be ran before a model's data is updated #5687

adamcunnington-mlg · 2022-08-16T19:47:17Z

adamcunnington-mlg
Aug 16, 2022

Describe the feature

Context

Currently, DBT tests run after a model is built.
If a test fails, there is some characteristic of the data that is unexpected - but it's too late. That "bad" data is in the model and is impacting downstream models - the integrity of the data is broken and this could flow through to other models where crazy things might happen (row explosion uh oh) and use cases (i.e. real-time dashboards) suddenly show the wrong data.

There is no straight forward solution to this problem. On snowflake, there is a neat way of achieving blue/green (although I think it's a slight misuse of the term blue/green because we're talking about new data, not [necessarily] new software) because Snowflake supports partition swaps. Others have implemented more complex workarounds such as managing separate staging and production environments - orchestrated either via Airflow, or perhaps achieved in DBT via logical layers with different run commands. Calogica have a nice write-up about how they use DBT to achieve this here.

My conjecture though is that this is a really core need that should be possible, natively, within DBT.

Acknowledgements

I am aware that there are use cases where you don't want to prevent "bad" data flowing through in the event of a test failure - but I do believe it should be an option - and should probably be the default behaviour (ignoring for one moment compromises that might be made when rolling out such a feature as to not cause a breaking change).

Also, it's worth calling out that tests actually serve two purposes:

Validating a change in model configuration / SQL has not caused regression
Validating that new data meets the expected profile

These are arguably quite different things. The former is more like a conventional software test and we'd expect to run that before deploying our updated artifacts (in this case, updated model code) - which we do via a typical DBT dev workflow. The second is a runtime consideration relating to the assertion of our data's profile. This is a very different thing and why it's so important to be able to succeed the tests before updating the model's data (or at least have the option to do this). The first is a solved problem. The second is not.

Discussion

DBT makes decisions about what should be abstracted behind a common layer and what should be adapter-specific. It feels to me like the ability, and user-facing interface, for having tests run (logically) before a model is updated, should be adapter-agnostic - but I appreciate this is a totally non-trivial problem to solve. Perhaps adapter-specific could be a starting point and this could be homogenised later.

I'm also aware, at the other end of the spectrum, that if you are going to consider running tests "transactionally" before a model is updated, what about a model that is dependent on another model? You may wish to update both or none at all. Sure! But one step at a time.

I imagine a world where there is a per-adapter implementation but perhaps hidden behind a common model property which controls whether tests run before or after.

Specifically, and selfishly, I want this capability in BigQuery. BigQuery is quite limited in terms of tools it gives us to help but multi-statement transactions have been round a shortwhile and are due to move from preview to GA in the not too distant future. I know transactions are not a silver-bullet solution to everything but maybe there's at least some mileage in exploring the feasibility of updating model and then running all model & column-level tests within the same transaction and rolling back if any fail (maybe the failure behaviour is a config that belongs to the tests: array items rather than at the model level)?

Keen to start a discussion on this!

Describe alternatives you've considered

Managing a custom blue/green approach.

Who will this benefit?

A lot of people. I don't have any hard numbers to back this up but I suspect that the majority of DBT users would not expect a model to be updated if tests fail. I imagine some of those users implement workarounds which cost development time and processing resources and opportunity cost in complex support routines. I imagine that others don't do anything in particular but just set expectation with users (directly or otherwise) that when data breaks, it'll be fixed quickly - but it will be broken for a period of time. Most data analytics still isn't mission critical after all (despite what we all say).

Are you interested in contributing this feature?

Perhaps! Certainly dev resources within my team.

Anything else?

https://cloud.google.com/bigquery/docs/reference/standard-sql/transactions

rileyschack · 2022-08-16T20:32:15Z

rileyschack
Aug 16, 2022

+1 for this ability. Realize it may not be the easiest implementation. My org is starting to experiment with table/dataset cloning as a means to work with dbt's current functionality. Even contemplating building a clone materialization that would only run if all tests are passed (using dbt build).

0 replies

dbeatty10 · 2022-08-17T22:20:15Z

dbeatty10
Aug 17, 2022
Maintainer

#1054 has an interesting discussion on this topic.

0 replies

jtcohen6 · 2022-08-20T12:42:05Z

jtcohen6
Aug 20, 2022
Maintainer

@adamcunnington-mlg This is a tremendous start to a discussion! You've compiled some great research here. I am going to convert this to a discussion, in matter of fact.

I also agree with @dbeatty10 that #1054 offers some great historical context on the ideas we've had in the past, as well as the tripwires we've come across.

I appreciate the problem you're raising—when bad data's in the production pipeline, it's in the production pipeline—and I agree it's a very common one. There are a diversity of tactics folks use when solving it today. The two most common I see:

Define tests on sources, to validate that new data meets the expected profile. In the unified DAG of dbt build, those tests will run before any models depending on those sources are built, and those models will skip if the tests fail.
Use a "write-audit-publish" or "blue/green" pattern to run, test, and only then deploy models to production. You can do all three steps once per model, although the pattern I see more often is to swap an entire schema/database from "staging" to "production" when it's ready. This also avoids the risk of downstream queriers seeing inconsistent data, if one "final" model finishes building earlier than another "final" model.

While not perfect solutions, those two constructs have served many quite well. They care more about the holistic deployment of an entire DAG, rather than treating each model as its own microcosm.

I do think that dbt could do more to make the write-audit-publish / "blue-green" pattern more readily available. It is certainly easiest on Snowflake, due to its support of zero-copy cloning, though still requires some custom macros + operations to string together. BigQuery also supports zero-cost copying, for tables and datasets, and (I understand) is rolling out capabilities around table cloning (for discussion in dbt-labs/dbt-bigquery#270). Once there is a critical mass of support among a few of the most popular adapters, I feel more comfortable with the idea of building out an adapter-agnostic abstraction within dbt.

Each model to its own

There are three patterns I could see to support this on a model-by-model basis, all of which are much more case-by-case:

Tests within transactions: On data platforms that support transactions, run tests before the commit; and roll back if they fail. This is totally adapter-specific: Redshift supports transactions completely, Spark/Databricks not at all, Snowflake in very odd ways, and BigQuery only very recently. Within dbt, it would require us to rethink tests as a property of a model, which run during that model's materialization, rather than their own node types within the DAG. (How to treat tests that select from multiple models...?) I'm not dismissing this idea completely, just noting that results will vary significantly here, and I'd want to take account of the several adapter/materialization permutations before proceeding.
Tests within materializations: For certain materializations on certain adapters (namely incremental), which produce a temporary table before upserting/merging/overwriting into the existing result set, test the temporary results before upserting/merging/overwriting. This is an issue that Niall just opened, and to which I owe a response, though his motivation is more about performance of the test queries than rolling back if the tests fail: [CT-793] [Feature] Testable _tmp tables in incremental models #5427. The appeal of this approach doesn't strictly require transactions, because the temp table can be thrown out without a formal rollback—no harm, no foul—but it does get into the implementation details of specific materializations on specific adapters, and it raises the same questions about how tests need to be defined in the manifest and run in the DAG.
Database constraints: Historically, analytical databases have done something very ugly with constraints: they allow them, they use them in query optimization, but they do not enforce them. So, despite adding a unique constraint to a column, it may well have dupes—and the database's optimizer will act as if it doesn't, risking incorrect results! For that reason, dbt has really, really discouraged use of database constraints in the past. Still, there is a pattern of cross-applying dbt tests as database constraints for detection by other tools (https://hub.getdbt.com/snowflake-labs/dbt_constraints/latest/). On at least one data platform (Databricks), constraints are totally enforced at table creation/update time (Support Delta constraints databricks/dbt-databricks#71). That feels like a pretty good pattern, where it's available!

I agree that, in the ideal case, there would be a common abstraction in dbt to identify which tests are worth rolling back for—with adapter-specific implementations behind the scenes—just as dbt build has allowed users to say that all error-severity tests are worth stopping the DAG for. I'm also wary of building an abstraction that risks leaking beyond utility, because of how different these data platforms really are, behind their shiny SQL-y exteriors.

Let's keep the discussion going :)

0 replies

adamcunnington-mlg · 2022-08-22T08:05:02Z

adamcunnington-mlg
Aug 22, 2022
Author

@jtcohen6 thanks for the response - exactly the sort of discussion I was hoping to generate!

I hadn't fully appreciated (despite the fact that I am actually benefiting from the exact behaviour you described) that the models are dependent on the sources and so source tests do provide a significant safety barrier. It doesn't cause the discussion to be moot as we can still miss data issues through not catching edge cases in our source tests and then model tests may fail (and it's too late) but I think it would be good to increase the emphasis of this in docs. Somewhere during the tutorial / getting started pages, it could be really clear that tests run after models but the way dbt build works is such that models will not run at all if source test fails and thus source tests are a critical opportunity to verify upstream data integrity. This reduces the job of model tests to then be i) validating the model logic is good - which is part of typical software dev/test workflow and falls into the solved problem 1 category I outlined before, and ii) catching edge cases that are pertinent to the logic of the model - which should drive a feedback loop in identifying gaps in source tests and improving them.

In fact, the above leads me to think that it would be a valid conclusion for the original hope of this discussion to be met with "it's not feasible given the amount of effort, and that source tests run first" - this is not something I had appreciated at all at the top of this conversation.

Onto the options though.. thanks for listing them out - I guess as much as possible, for sake of code maintainability (and accommodating future adapters), you want as few strategies as possible for achieving this - some of what you have listed are broad-brush strategies and others (e.g. constraints) are specific to particular types of tests, nevermind particular adapters. I guess leaning on the former as much as possible (even with limitations) may be preferable. I.e. if any dev work to be done, would a good starting point be to implore the transaction rollback approach behind a common model property that only has an effect on adapters that support transactions, and just call this out as a limitation? It feels like it would be the smallest code change too. Logically:

A model property which controls whether tests run in a transaction or not.
If so, and if supported by adapter, wrap the existing model running + test running inside of transaction control statements, and handle the test errors.

8 replies

kparaju Nov 22, 2022

@jsnb-devoted

Create a "swap" table (eg my_model_tmp) based on the metadata in the schema.yml file

Run the model's compiled SQL as an INSERT statement into my_model_tmp

Execute the dbt tests against my_model_tmp

Forgive me if this is a stupid question (I'm not a Data Engineer, just adjacent to it) but will this work if the model is incremental? We might need to copy my_model to my_model_tmp, right?

Use-case: What if one of the tests calculates the avg across the whole entire table? In your proposal, we will only be doing tests against the "incremental" data that is located in my_model_tmp, not all of the data in my_model. This could cause a situation where the tests pass in my_model_tmp but fail when it's run on my_model.

jsnb-devoted Nov 23, 2022

@kparaju -- not a stupid question at all. Incremental models will have to look a little different. They would have to have a separate "materialization" but that isn't different from how dbt works now. Incremental models and a "swap table" strategy could work something like this:

Create a copy of the existing target table my_inc_model_tmp
Create a table for the incremental data to go my_inc_model_staging
Execute the merge or delete+insert against my_inc_model_tmp
Run your tests against my_inc_model_tmp and if the tests pass...
Drop my_inc_model
Rename my_inc_model_tmp to my_inc_model

kparaju Nov 23, 2022

@jsnb-devoted That makes sense, thank you.

the latter maybe being more costly for BigQuery or Snowflake users

+1; the cost will scale with the data in the target table for BQ/SF and this will cause significant compute-debt long term. Improvements like dbt-labs/dbt-bigquery#270 will be essential to enabling this feature so as to not skyrocket the compute cost for existing users of dbt test.

And echoing:

I still think it's a worthy enhancement to run tests within a transaction

Maybe this can be generalized at the DB level. Something like a multi-session transaction. We would:

Start a session (dbt run), create a transaction, get an ID
Make changes to the target table in that session
Close the session (dbt run finishes)
Start another session for (dbt test) using the transaction ID from (1)
Test
COMMIT or ROLLBACK based on the test result

tema-popov Nov 23, 2022

I would like to have the ability to check assertions before running the model, in the same way, DBT test work for post-model run checks.

I have two cases and they all affect specifically large incremental models:

I need to check the state of the incremental model before updating it to check the consistency of updated data.
I need to check if the incremental model has all expected increments already built to have all the data for another model that depends on it. For example, I have a "stats for last 7 days" model that expects the incremental "1-day stats model" to have all increments for the last 7 days built.

jsnb-devoted Nov 23, 2022

+1; the cost will scale with the data in the target table for BQ/SF and this will cause significant compute-debt long term.

Ya I think the CREATE TABLE AS SELECT... isn't really an efficient solution. Not really sure why I included it in retrospect. It really ought to be DROP TABLE and RENAME or even better use CLONE if your db supports it.

Maybe this can be generalized at the DB level. Something like a multi-session transaction.

Ultimately I think this is probably the right solution and more generalizable across different databases. We are using Snowflake and it hasn't always treated transactions the way I would expect. If we were doing something like this in a transactional db like postgres it would be a no brainer. Might still be the way to go with these MPP dbs.

jason-albrecht · 2023-07-16T12:21:25Z

jason-albrecht
Jul 16, 2023

Not sure if this is exactly related, but:

Being able to run newly developed tests and macros from a user's local machine against an environment where they don't have the rights to build models (create tables), would be extremely helpful.

Right now, if a user doesn't have CREATE TABLE rights, when they dbt test a model that does not have a new column yet (because it's being developed locally), the test fails.

Running the local dbt project in a temp environment against real data would be amazingly useful.

1 reply

adamcunnington-mlg Jul 17, 2023
Author

Hi @jason-albrecht, it's not related.
You are talking about a very common need and it's covered by https://docs.getdbt.com/reference/node-selection/defer

SoumayaMauthoorMOJ · 2023-08-22T21:24:35Z

SoumayaMauthoorMOJ
Aug 22, 2023

@adamcunnington-mlg very useful discussion! With the "run, test, and only then deploy models to production" approach, models downstream of a failed test will run needlessly. However with dbt build, models downstream of a failed test would not run. Another issue is that all tests need to pass before deploying to production. If any test fails, no tables will be published to production. With dbt build, only models downstream of a failed test would not run.

Since dbt build knows that it needs to run models and tests in the right order, how about modifying dbt build so that you can optionally run something like on-test-end, similar to 'on-run-end', but which runs after tests have completed for a particular model. You could pass different macro depending on whether test passed or failed, for example to swap the database / partition, or roll-back to a previous branch whatever fits your use case and/or adapter.

6 replies

SoumayaMauthoorMOJ Oct 2, 2023

One con with this approach is that I can't run something if my test fails e.g. to rollback a change

SoumayaMauthoorMOJ Oct 2, 2023

It's also incompatible with incremental models

SoumayaMauthoorMOJ Mar 31, 2024

I've come up with a WAP approach that works for incremental models, but only for table that have time travel querying capabilities e.g. Iceberg:

Create a ‘wap’ incremental version of your model:

--my_table_wap.sql
{{
    config(
        materialized='incremental',
        table_type='iceberg',
        incremental_strategy='append',
    )
}}

SELECT 1 as id

Create a view model that references the wap model but as of a set TIMESTAMP

--my_table.sql
{{
    config(
        materialized='view'
    )
}}

{% set current_timestamp = modules.datetime.datetime.now() %}

SELECT * 
FROM {{ ref('my_table_wap') }}
FOR TIMESTAMP AS OF TIMESTAMP '{{ current_timestamp }} UTC'

The view will only get refreshed to use the latest Iceberg table version if the tests passes

y0j0 May 31, 2024

awesome! This idea helped me a lot. But I found one case that it can't solve - when the tests fail during the --full-refresh, the snapshot used in the VIEW does not exist anymore because during the --full-refresh all old snapshots are deleted. Any possible solution that I found is to do the --full-refresh each time on the new schema. Or do you have any other ideas please?

SoumayaMauthoorMOJ Jul 23, 2024

Create a macro to delete all the data from an iceberg table which only runs in incremental mode and if overwrite = True

{% macro delete_from() %}

    {% if is_incremental() and config.get('overwrite', false) %}

            DELETE FROM {{ this }}
            {%- set query -%}
                deleted from {{ this.render_pure() }}
            {%- endset -%}
            {% do log(query, info=True) %}

    {%- endif -%}

{% endmacro %}

You can then call this macro in incremental models a pre-hook to delete all the data instead of doing a full refresh:

{{
    config(
        materialized ='incremental',
        incremental_strategy ='append',
        table_type ='iceberg',
        pre_hook ="{{ transform_table.delete_from() }}",
        table_properties = {'vacuum_max_snapshot_age_seconds':1},
        overwrite = var('OVERWRITE', false)
    )
}}

{% if not is_incremental() or config.get('overwrite', false) %}

    SELECT 2 AS id

{% else %}

    SELECT 1 AS id

{%- endif -%}

To run this model in overwrite mode you'll need to use vars: dbt run --vars 'OVERWRITE: true'

Spince · 2023-11-07T20:09:10Z

Spince
Nov 7, 2023

wondering if dbt clone changes things here.. thinking (similar to others here):

clone production data
run and test against clone
if passed: swap updated clone with production
drop clone

my current team doesn't have a pager and failing jobs can sometimes go days without being noticed. seems like this is a totally doable implementation especially given the benefit of only having tested data in production. I see this as a logical next step in line with the shift from dbt run to dbt build.

1 reply

mrcool4 May 8, 2024

I believe this only solves half of the problem, what about the 2nd problem?
"Validating that new data meets the expected profile"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide option to allow tests to be ran before a model's data is updated #5687

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 7 comments 16 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Provide option to allow tests to be ran before a model's data is updated #5687

Describe the feature

Context

Acknowledgements

Discussion

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Anything else?

Replies: 7 comments · 16 replies

dbeatty10 Aug 17, 2022 Maintainer

jtcohen6 Aug 20, 2022 Maintainer

Each model to its own

adamcunnington-mlg Aug 22, 2022 Author

adamcunnington-mlg Jul 17, 2023 Author

Replies: 7 comments 16 replies

dbeatty10
Aug 17, 2022
Maintainer

jtcohen6
Aug 20, 2022
Maintainer

adamcunnington-mlg
Aug 22, 2022
Author

adamcunnington-mlg Jul 17, 2023
Author