Record failing rows for tests into an auditable table #903
Comments
@drewbanin - I totally get that we wouldn't want to bring back all the rows by default. What would be awesome would be some kind of executable SQL on fail. So the workflow would look like:
The reason I think this would work nicely is that our analysts are used to looking at the logs of a build to see if something has failed, and having the info right there in the logs would be a massive productivity gain - albeit with the config overhead of writing out explicitly what should happen on failure for each test.
Totally with you @alanmcruickshank. I think also related is #517. I'm into the idea, but need to give some more thought to how dbt would support this. We don't have a good way of configuring groups of tests at the moment. Very open to ideas if you have them!
I support and would use the use case of reporting on failed tests. I am more interested in using them for auditing than for debugging — I would like a nice, easy table to query via Looker for a conditional "Go Fix ETL" alert. When I went looking to do this I naturally reached for hooks, as I'm already using them for audit purposes, much as described in the documentation. I'm logging the individual model runs:

```yaml
on-run-start:
  - "create table if not exists {{ target.schema }}.audit_dbt_runs (model text, state text, time timestamp)"

models:
  pre-hook:
    - "insert into {{ target.schema }}.audit_dbt_runs (model, state, time) values ('{{this.name}}', 'start', getdate())"
  post-hook:
    - "insert into {{ target.schema }}.audit_dbt_runs (model, state, time) values ('{{this.name}}', 'end', getdate())"
```

As well as the overall results:

```yaml
on-run-start:
  - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"

on-run-end:
  - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"
```

(I have a simple macro `results_values` that formats the run results into a values list.)

In this case I would like the exact same thing for tests. Something exactly analogous to the above:

```yaml
on-test-start:
  - "create table if not exists {{ target.schema }}.audit_dbt_test_results (node text, status text, execution_time decimal(10,2), time timestamp)"

on-test-end:
  - "insert into {{ target.schema }}.audit_dbt_test_results (node, status, execution_time, time) values {{ results_values(results) }}"
```
That looks like a great start to a 'dbt_results_mart', where you could have models as a dim, dbt_runs as a dim and fact, and errors as an optional fact.
@ianterrell This is a great idea. Are you able & willing to share your `results_values` macro?
I'd be interested in getting a look at that macro as well.
Here's an easy macro for this.
Hi @Aylr and @gordonhwong! In case it's still useful for anyone, the following macro is what I use:

```jinja
{% macro results_values(results) %}
  {% for res in results -%}
    {% if loop.index > 1 %},{% endif %}
    ('{{ res.node.alias }}', '{{ res.status }}', {{ res.execution_time }}, getdate())
  {% endfor %}
{% endmacro %}
```

This pairs with:

```yaml
on-run-start:
  - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"

on-run-end:
  - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"
```
This is exactly what I am looking for! Of course, one could do this by bringing in other tools (like https://docs.greatexpectations.io, which I have seen suggested a few times).
Hey @bashyroger, thanks for bringing this topic back to the fore. I just opened a new issue to sketch out our latest thinking about how dbt should store tests in the database: #2593
I noticed that I also need to add a commit statement in the on-run-end hook to persist the record. Am I doing something wrong?
Hi @ianterrell, thanks!
Looks like I can only execute this using an on-run-end hook, per the dbt documentation.
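For readers hitting the same issue, a minimal sketch of the workaround being described: issuing an explicit commit as an additional on-run-end statement. Whether this is actually needed is an assumption and depends on the adapter's transaction handling and dbt version.

```yaml
on-run-end:
  - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"
  # assumption: some adapters may not auto-commit hook statements, so commit explicitly
  - "commit"
```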
@jtcohen6 How would you feel about re-opening this issue? I greatly appreciate the work you did in #3316, but I don't think it satisfies the request in this ticket - namely, an audit log of test results. In particular, the limitations of #3316 are:
Options that I see to generate an audit log:
Option 3 is @ianterrell's idea of using pre- and post-hooks to generate the test audit log. We do something similar to create a run audit log for some of our clients who use dbt, and it has worked very well. Aside from the work to do the implementation, are there blockers or reasons not to add pre- and post-hooks to tests? I would be happy to attempt the implementation with some (likely substantial) guidance.
@trevor-petach Thanks for the thoughtful comment! In my current view of the world, I see the solution to problems 1-6 being metadata. Just about all of the information you've enumerated is available in artifacts (e.g. run_results.json).
Given that this information is available in JSON artifacts, and in such a rich format, I'd rather encourage you to:
That said, I think you're onto something here:
I agree! I said something to this effect in response to a more recent issue: #3463. I think the basic change here is turning on these hooks for tests.
Wanted to share the approach we're taking for visibility of failing tests. I don't think it'd scale to all 1200+ tests, but right now we don't need it to - we're focused on exposing data quality issues to other teams, in the hope that showing them a chart with the number of errors going down will encourage them to keep the CRM tidy 🤞 We're using dbt-utils' get_relations_by_pattern and union_relations macros, and then we'll snapshot the results of that.
As much as we'd love to get amongst the artifact goodness, I don't think Redshift has a PUT equivalent, so we're probably dependent on a dbt Cloud-native solution for the moment.
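For illustration, a rough sketch of the union-then-snapshot model described above. The schema pattern is an assumption; it presumes stored test-failure relations land in a schema ending in `dbt_test__audit` (e.g. via `dbt test --store-failures`).

```sql
-- Sketch: union every stored test-failure relation into one model, which can then be snapshotted.
-- The schema pattern below is an assumption about where failing rows are stored.
{% set failure_relations = dbt_utils.get_relations_by_pattern(
    schema_pattern='%dbt_test__audit',
    table_pattern='%'
) %}

select
    current_timestamp as captured_at,
    *
from (
    {{ dbt_utils.union_relations(relations=failure_relations) }}
) as test_failures
```

The `_dbt_source_relation` column that `union_relations` adds identifies which test each failing row came from, which is what makes the unioned result useful for charting error counts per test over time.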
@jtcohen6 Agreed that artifacts have the needed information - the issue is getting that information somewhere useful (i.e. the database). For advanced users, this is not a problem (a quick script that kicks off after dbt test completes can write run_results.json to the database), but for more basic users, it is a substantial stumbling block. Most people we work with use BigQuery, so they face the same problem as @joellabes does in Redshift - there is no equivalent of PUT. As far as I can tell, the dbt Cloud API suffers from the same problem - I have to run the query and parse a relatively complex result on my own infrastructure. My goal is to be able to build a dashboard showing recent test failures, or set an alert/notification for test failures (including some information about what failed), for a user who just uses dbt Cloud to execute dbt run and dbt test. One minor issue with the metadata objects is that they are not available until all of the tests execute. If we had pre- and post-test hooks, then we could record the results immediately into an audit table.
Feature

Feature description

dbt should record rows that fail schema or data tests into tables, to be used for auditing. There are two general ways this could work; one involves an `errors` table (a hypothetical sketch of such a table follows below). The baked-in schema tests will need to be rewritten to provide debuggable information in these tables.

This should probably be an opt-in feature for tests. It could be configured either in `schema.yml` or in `dbt_project.yml`, though it would also be good to configure whole swaths of tests at once.

Who will this benefit?
dbt users could use this table to quickly determine why their tests failed. At present, it's surprisingly difficult to debug a failed test.
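As a purely hypothetical illustration of the kind of `errors` table described above (the table name, schema, and columns are assumptions for illustration, not the schema from the original issue):

```sql
-- Hypothetical sketch of an auditable errors table; all names and columns are assumptions.
create table if not exists analytics.dbt_test_errors (
    test_name       text,       -- which test failed
    model_name      text,       -- the model/relation the test ran against
    column_name     text,       -- the column under test, if applicable
    failing_row_id  text,       -- identifier of the offending row, if recoverable
    failure_detail  text,       -- e.g. the duplicated or null value that caused the failure
    detected_at     timestamp   -- when the test ran
);
```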