expression_is_true is costly when applied to a large table #683

basdunn · 2022-09-21T13:58:47Z

Describe the bug

When running an expression_is_true test I noticed that the test required ~500gb of data (on BigQuery), which in my opinion is extremely costly for a simple test.

Because the way the test is setup (SELECT * in the last statement), the total cost of the test is the same as doing a SELECT * FROM TABLE_THAT_WE_TEST. Which we know can be quite expensive for long and wide tables.

Steps to reproduce

CREATE A LARGE TABLE
Run on BQ SELECT * FROM LARGE_TABLE and check the cost
Run a expression_is_true test against LARGE_TABLE and see that it is equally costly as the SELECT *

Expected results

I expect a simple test to be really cheap. I do not want to take into account the cost of a simple column test when developing.

Actual results

Its expensive.

Screenshots and log output

If applicable, add screenshots or log output to help explain your problem.

System information

The contents of your packages.yml file:

Which database are you using dbt with?

postgres
redshift
[X ] bigquery
snowflake
other (specify: ____________)

The output of dbt --version:

<output goes here>

Additional context

Add any other context about the problem here. For example, if you think you know which line of code is causing the issue.

Are you interested in contributing the fix?

Sure!

expression_is_true.sql:

{% test expression_is_true(model, expression, column_name=None, condition='1=1') %}
{# T-SQL has no boolean data type so we use 1=1 which returns TRUE #}
{# ref https://stackoverflow.com/a/7170753/3842610 #}
  {{ return(adapter.dispatch('test_expression_is_true', 'dbt_utils')(model, expression, column_name, condition)) }}
{% endtest %}

{% macro default__test_expression_is_true(model, expression, column_name, condition) %}

with meet_condition as (
    select * from {{ model }} where {{ condition }}
)

select
    1 -- Change the * to a fixed single column, there might be some relevant info you could pass here for debugging, but just not ALL columns.
from meet_condition
{% if column_name is none %}
where not({{ expression }})
{%- else %}
where not({{ column_name }} {{ expression }})
{%- endif %}

{% endmacro %}

The text was updated successfully, but these errors were encountered:

basdunn · 2022-09-21T14:00:24Z

BTW, I havent checked whether this same thing is happening in other tests. Might be valuable to check this.

joellabes · 2022-09-22T03:32:59Z

Hi @basdunn, thanks for opening this! There is prior art in dbt-labs/dbt-core#4777 if you wanted to follow the same pattern of bringing out * if store-failures is enabled, and just the expression if not.

elyobo · 2022-09-25T07:26:58Z

I'd noticed this a few times and came to raise an issue today, but found @basdunn had already raised it, thanks!

I think the prior art could be improved by allowing the specification of a list of columns to include; the 1 isn't all that informative when debugging a failing test, having a specified list of columns that are used (if should store failures) would be more useful and still cheaper than the whole table for people that have big enough tables to go to the effort of specifying the relevant columns.

Fixes dbt-labs#683

elyobo · 2022-09-26T23:30:56Z

I've added a PR to address the suggestion here, but if someone points me at some useful examples I'd be interested in adding the ability to specify a set of columns to extract as well, as that seems like more useful behaviour to me.

joellabes · 2022-09-28T02:29:17Z

adding the ability to specify a set of columns to extract as well

@elyobo my gut reaction here is that this would be something that would make sense to define across multiple tests. So I'd recommend you open an issue in dbt-core. Something like this would be cool:

models:
  - name: my_model
    tests: 
      - dbt_utils.expression_is_true:
          expression: a + b = c
          debugging_columns: [a, b, c, d] #horrible name do not use this
    columns: 
      - name: id
         tests: 
           - not_null:
               debugging_columns: [id, e, f] #See that this is available to any test

And then we'd something like

{% macro default__get_debugging_columns(expression) %}
    {% if should_store_failures() and debugging_columns is not none %}
      {{ debugging_columns.join(", ") }}
    {% else %}
      * {# Should we default to pulling everything out where that access isn't billed? If not, then we don't need a second BQ-specific version of this #}
    {% endif %}
{% endmacro %}

{% macro bigquery__get_debugging_columns(expression) %}
    {% if should_store_failures() %}
      {% if debugging_columns is not none %}
        {{ debugging_columns.join(", ") }}
      {% else %}
        *
      {% endif %}
    {% else %}
      {{ expression }} {#or maybe just `1` 🤷 #}
    {% endif %}
{% endmacro %}

basdunn added bug Something isn't working triage labels Sep 21, 2022

joellabes added good first issue and removed triage labels Sep 22, 2022

elyobo added a commit to elyobo/dbt-utils that referenced this issue Sep 25, 2022

Limit columns selected in expression_is_true if failures not stored.

ae1b6ad

Fixes dbt-labs#683

elyobo mentioned this issue Sep 25, 2022

Limit columns selected in expression_is_true if failures not stored. #686

Merged

17 tasks

elyobo added a commit to elyobo/dbt-utils that referenced this issue Sep 25, 2022

Limit columns selected in expression_is_true if failures not stored.

71f96f6

Fixes dbt-labs#683

This was referenced Sep 28, 2022

[CT-1264] [Feature] Support "debugging columns" across multiple tests dbt-labs/dbt-core#5952

Closed

Add optional condition to not_null_proportion test #691

Closed

joellabes closed this as completed in c15d55e Oct 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

expression_is_true is costly when applied to a large table #683

expression_is_true is costly when applied to a large table #683

basdunn commented Sep 21, 2022 •

edited

Loading

basdunn commented Sep 21, 2022

joellabes commented Sep 22, 2022

elyobo commented Sep 25, 2022

elyobo commented Sep 26, 2022

joellabes commented Sep 28, 2022

expression_is_true is costly when applied to a large table #683

expression_is_true is costly when applied to a large table #683

Comments

basdunn commented Sep 21, 2022 • edited Loading

Describe the bug

Steps to reproduce

Expected results

Actual results

Screenshots and log output

System information

Additional context

Are you interested in contributing the fix?

basdunn commented Sep 21, 2022

joellabes commented Sep 22, 2022

elyobo commented Sep 25, 2022

elyobo commented Sep 26, 2022

joellabes commented Sep 28, 2022

basdunn commented Sep 21, 2022 •

edited

Loading