expression_is_true is costly when applied to a large table #683
Comments
BTW, I haven't checked whether the same thing is happening in other tests. It might be valuable to check this.
Hi @basdunn, thanks for opening this! There is prior art in dbt-labs/dbt-core#4777 if you wanted to follow the same pattern of bringing out
I'd noticed this a few times and came to raise an issue today, but found @basdunn had already raised it, thanks! I think the prior art could be improved by allowing the specification of a list of columns to include; the
I've added a PR to address the suggestion here, but if someone points me at some useful examples I'd be interested in adding the ability to specify a set of columns to extract as well, as that seems like more useful behaviour to me.
@elyobo my gut reaction here is that this is something that would make sense to define across multiple tests, so I'd recommend you open an issue in dbt-core. Something like this would be cool:
models:
  - name: my_model
    tests:
      - dbt_utils.expression_is_true:
          expression: a + b = c
          debugging_columns: [a, b, c, d] # horrible name, do not use this
    columns:
      - name: id
        tests:
          - not_null:
              debugging_columns: [id, e, f] # see that this is available to any test
And then we'd have something like:
{% macro default__get_debugging_columns(expression) %}
  {% if should_store_failures() and debugging_columns is not none %}
    {{ debugging_columns | join(", ") }}
  {% else %}
    * {# Should we default to pulling everything out where that access isn't billed? If not, then we don't need a second BQ-specific version of this #}
  {% endif %}
{% endmacro %}
{% macro bigquery__get_debugging_columns(expression) %}
  {% if should_store_failures() %}
    {% if debugging_columns is not none %}
      {{ debugging_columns | join(", ") }}
    {% else %}
      *
    {% endif %}
  {% else %}
    {{ expression }} {# or maybe just `1` 🤷 #}
  {% endif %}
{% endmacro %}
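The branching in the two macros proposed above can be hard to follow at a glance, so here is a minimal Python sketch of the same decision logic. The function name, the `adapter` argument and the `store_failures` flag are all hypothetical stand-ins for illustration; none of this is part of dbt or dbt_utils.

```python
def get_debugging_columns(expression, store_failures, debugging_columns=None,
                          adapter="default"):
    """Return the select-list fragment for a test's failing-rows query.

    Mirrors the branching of the proposed default__/bigquery__ macros.
    Every name here is hypothetical, not an existing dbt API.
    """
    if store_failures and debugging_columns is not None:
        # Failures are stored and specific columns were requested.
        return ", ".join(debugging_columns)
    if adapter == "bigquery" and not store_failures:
        # Nothing is stored, so select only what the test must evaluate.
        return expression
    # Every other case keeps the current behaviour of selecting everything.
    return "*"


# BigQuery without stored failures: only the expression is selected.
print(get_debugging_columns("a + b = c", store_failures=False, adapter="bigquery"))

# Stored failures with an explicit column list, on any adapter.
print(get_debugging_columns("a + b = c", store_failures=True,
                            debugging_columns=["a", "b", "c", "d"]))
```

The only adapter-specific branch is the one where failures are not stored, which is where the billed scan can be narrowed without losing any debugging output.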
Describe the bug
When running an expression_is_true test I noticed that the test required ~500 GB of data (on BigQuery), which in my opinion is extremely costly for a simple test. Because of the way the test is set up (SELECT * in the last statement), the total cost of the test is the same as doing a SELECT * FROM TABLE_THAT_WE_TEST, which we know can be quite expensive for long and wide tables.
Steps to reproduce
1. Run SELECT * FROM LARGE_TABLE and check the cost.
2. Run an expression_is_true test against LARGE_TABLE and see that it is equally costly as the SELECT *.
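To illustrate why the trailing SELECT * dominates the cost, here is a rough Python sketch of a simplified columnar billing model, assuming the engine bills for every byte of each referenced column and nothing else. The column names and sizes are invented for illustration and are not taken from the table in this issue.

```python
# Hypothetical per-column sizes (in GB) for a long, wide table.
# The numbers are made up for illustration only.
COLUMN_SIZES_GB = {
    "id": 4,
    "a": 8,
    "b": 8,
    "c": 8,
    "payload": 472,
}


def scanned_gb(selected_columns, column_sizes=COLUMN_SIZES_GB):
    """Estimate GB scanned under a simplified columnar model.

    Assumes each referenced column is read in full and unreferenced
    columns are not read at all; "*" stands for every column.
    """
    if "*" in selected_columns:
        return sum(column_sizes.values())
    # De-duplicate so a column referenced twice is only counted once.
    return sum(column_sizes[name] for name in set(selected_columns))


# A test whose final statement is SELECT * touches every column.
full_scan = scanned_gb(["*"])

# A test that only reads the columns in `a + b = c` touches three of them.
narrow_scan = scanned_gb(["a", "b", "c"])

print(f"SELECT *       : {full_scan} GB")
print(f"SELECT a, b, c : {narrow_scan} GB")
```

Under this model the cost of the test depends on the width of the table rather than on the expression being checked, which matches the behaviour described above.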
Expected results
I expect a simple test to be really cheap. I do not want to take into account the cost of a simple column test when developing.
Actual results
It's expensive.
Screenshots and log output
If applicable, add screenshots or log output to help explain your problem.
System information
The contents of your packages.yml file:
Which database are you using dbt with?
The output of dbt --version:
Additional context
Add any other context about the problem here. For example, if you think you know which line of code is causing the issue.
Are you interested in contributing the fix?
Sure!
expression_is_true.sql