[Feature] Passthrough data tests #4451
Maybe instead of
Here is an example of what this might look like compared to how tests are currently handled (from an etic perspective):
-- mart_large_table_failed.sql
WITH tests AS (
SELECT
'nodata_null' AS reason,
*
FROM
{{ ref('stg_tif_postprocessed_fail_nodata_null') }}
UNION ALL
SELECT
'dtype_is_not_min_scalar' AS reason,
*
FROM
{{ ref('stg_tif_postprocessed_fail_dtype_is_not_min_scalar') }}
UNION ALL
SELECT
'processing failed' AS reason,
*
FROM
{{ ref('stg_tif_postprocessed_fail_processing') }}
UNION ALL
SELECT
'preprocessed metadata reading failed' AS reason,
*
FROM
{{ ref('stg_tif_preprocessed_fail_metadata') }}
)
SELECT
string_agg(
reason,
'; '
) AS reason,
{list_of_all_columns}
FROM
tests
GROUP BY
{list_of_all_columns}
-- mart_large_table.sql
SELECT
*
FROM
{{ ref('stg_tif_postprocessed') }}
EXCEPT
SELECT
*
FROM
{{ ref('mart_large_table_failed') }}

Maybe this would be too hard to implement with tests in a generic way. Using models is not too bad--just a little repetitive.
Hey @chapmanjacobd, thank you so much for providing all this information and linking potential issues related to this! CHEETOS is one of my favorite snacks too!! And sorry for the delayed response!

I can totally relate to why you want to do this as a test instead of writing it as a model. When I did my dbt onboarding project, I wrote probably 10 staging models that are roughly the same, and I wish there were some template that automatically did it.

Our approach to testing is better summed up as: tests are things about your models that should pass (or at least, should not fail too much). If they do, you have a problem; maybe you should even interrupt the DAG mid-build. So it sort of feels like you are approaching tests with a different idea: that they should be a way of quarantining or flagging data. There is another discussion about this if you are interested: #4306.

After talking to @jtcohen6, here are a few tools we think can help with this situation (tests that frequently fail because the underlying data is messy, but you still want to keep them around):
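For instance, one tool that exists today is the severity and store_failures test config: a warn-level test records its failing rows in a table without interrupting the build. A minimal sketch of a singular test using these configs; the file name and the nodata column are hypothetical, borrowed loosely from the example above:

-- tests/assert_postprocessed_nodata_not_null.sql (hypothetical)
{{ config(severity='warn', store_failures=true) }}

SELECT
    *
FROM
    {{ ref('stg_tif_postprocessed') }}
WHERE
    nodata IS NULL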
There might be some sort of dynamic way to template that out so that model_b actually looks like:
That would be using the graph variable, a.k.a. the source of many, many shenanigans in dbt projects in 2022, since it gives users programmatic access (in the Jinja context) to other objects defined in their project. This is not what we recommend, and we think it might not serve you well, but it's always good to know what's possible.
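As a minimal sketch (not a recommendation, and not the snippet originally posted), a dynamically templated model using the graph variable could look roughly like the following; the stg_tif_ / _fail_ naming is borrowed from the example earlier in the thread, everything else is hypothetical, and the usual caveats about dependency resolution with graph apply:

{% set failed_models = [] %}
{% if execute %}
    {% for node in graph.nodes.values() %}
        {# collect models whose names match the failure-model naming pattern #}
        {% if node.resource_type == 'model' and 'stg_tif_' in node.name and '_fail_' in node.name %}
            {% do failed_models.append(node.name) %}
        {% endif %}
    {% endfor %}
{% endif %}

{% for model_name in failed_models %}
SELECT
    '{{ model_name }}' AS reason,
    *
FROM
    {{ ref(model_name) }}
{% if not loop.last %}UNION ALL{% endif %}
{% endfor %}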
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.
Is there an existing feature request for this?
Describe the Feature
from https://getdbt.slack.com/archives/C2JRTGLLS/p1638891747103000
Similar to https://docs.getdbt.com/docs/building-a-dbt-project/tests#storing-test-failures, but there are some issues that are fundamental to the interface that was chosen (e.g., not being able to reference bifurcated pass/failed data in downstream models).
dbt currently captures the application testing of the data pipeline fairly well--i.e., testing that data pipeline changes are conceptually sound before deploying changes into production (not only schema tests but pipeline-specific tests; not data tests).
But it seems like, for some test features currently in dbt, there is a conflation of application testing and data testing. While conflating these things may have no impact in some cases, it does not make sense to do so in every case.
Instead of pass/fail data tests at the dbt application layer, it might be useful for folks to have the option to bifurcate and expose passing and failing data through views--at the same granularity as the existing severity option.
For example, this might look like a large_table_fail view for the failing data; likewise for all data passing all tests (large_table_pass). passthrough: could also be set at the project level to be the default behavior of tests, similar to severity.
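As a rough sketch, assuming a single not-null test on a hypothetical id column (the large_table_pass / large_table_fail names come from the example above; passthrough is the proposed behavior, not something dbt does today), the two generated views might compile to something like:

-- large_table_fail.sql
SELECT
    'id_not_null' AS reason,
    *
FROM
    {{ ref('large_table') }}
WHERE
    id IS NULL

-- large_table_pass.sql
SELECT
    *
FROM
    {{ ref('large_table') }}
WHERE
    id IS NOT NULL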
Some complications:
For tables that have passthrough: in any test, the downstream model would need to reference the passing table/view (large_table_pass). This dynamic behavior might not be desired, although it should be fairly clear what is going on to the end user in DAG visualization--and users wouldn't likely be setting this without knowing what it is doing (hopefully).
Describe alternatives you've considered
The easiest alternative will likely be to store my data tests as models. This is what I plan on doing if this idea isn't accepted or until it is in a beta release.
Who will this benefit?
People who use dbt-expectations, who sometimes have messy data but also might need to push data into production frequently, with multi-table data dependencies where automated filtering might not matter as much, and whose favorite food is CHEETOS® Crunchy FLAMIN' HOT® Cheese Flavored Snacks.
Are you interested in contributing this feature?
No response
Anything else?
potentially related to
Railroad-oriented programming
#3066
#2968
#2959
#517