[CT-1646] [CT-1170] [Bug] Insert overwrite incremental models not using dbt_tmp table #427

Closed
naveen-shankar opened this issue Sep 13, 2022 · 2 comments · Fixed by #630

Labels
bug Something isn't working help_wanted Extra attention is needed

Comments

@naveen-shankar

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When I run an incremental table update using the insert_overwrite strategy, the created dbt_tmp table isn't used to update the destination table. Instead, the model code is rerun from scratch.

Expected Behavior

Rather than rerunning the model code from scratch, I expected the destination table update to use the created dbt_tmp table.

Steps To Reproduce

The specific model is called gross_and_net_churn.sql. I'll exclude the main body of the model for brevity (I've confirmed this is happening with multiple models), but the config params are:

{% set partitions_to_replace = days_between_two_dates(var("subscriptions_crawler_start_date"), var("subscriptions_crawler_end_date")) %}

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by = {"field": "report_date", "data_type": "timestamp", "granularity": "day"},
        partitions = partitions_to_replace,
        cluster_by = ["report_date", "client"]
    )
}}

On dbt CLI, I ran:

dbt run --select models/subscription_retention/gross_and_net_churn.sql  

I then looked at which queries got executed in the BigQuery console.

The first query run created the dbt_tmp table:

/* {"app": "dbt", "dbt_version": "1.2.1", "profile_name": "data_science", "target_name": "dev", "node_id": "model.subscriptions.gross_and_net_churn"} */

  create or replace table `duolingo-data-science`.`dbt_naveen`.`gross_and_net_churn__dbt_tmp`
  partition by timestamp_trunc(report_date, day)
  cluster by report_date, client
  OPTIONS(
      description="""""",
    
      expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour)
    )
  as (

[Main body of model]    

  );

The second query run did the actual update:

/* {"app": "dbt", "dbt_version": "1.2.1", "profile_name": "data_science", "target_name": "dev", "node_id": "model.subscriptions.gross_and_net_churn"} */

    merge into `duolingo-data-science`.`dbt_naveen`.`gross_and_net_churn` as DBT_INTERNAL_DEST
        using (

[Main body of model]          

        ) as DBT_INTERNAL_SOURCE
        on FALSE

    when not matched by source
         and timestamp_trunc(DBT_INTERNAL_DEST.report_date, day) in (
              '2022-08-13', '2022-08-14', '2022-08-15', '2022-08-16', '2022-08-17', '2022-08-18', '2022-08-19', '2022-08-20', '2022-08-21', '2022-08-22', '2022-08-23', '2022-08-24', '2022-08-25', '2022-08-26', '2022-08-27', '2022-08-28', '2022-08-29', '2022-08-30', '2022-08-31', '2022-09-01', '2022-09-02', '2022-09-03', '2022-09-04', '2022-09-05', '2022-09-06', '2022-09-07', '2022-09-08', '2022-09-09', '2022-09-10', '2022-09-11', '2022-09-12'
          ) 
        then delete

    when not matched then insert
        (`report_date`, `client`, `subscription_type`, `is_new_years`, `subscription_tier`, `country_code`, `active_subscriptions`, `expirations`, `expirations_and_winbacks`, `winbacks`)
    values
        (`report_date`, `client`, `subscription_type`, `is_new_years`, `subscription_tier`, `country_code`, `active_subscriptions`, `expirations`, `expirations_and_winbacks`, `winbacks`)

As the documentation suggests, I would've expected the second query to use the gross_and_net_churn__dbt_tmp table rather than rerunning the entire main body of the model. In other words, it seems like the second query should be:

using (gross_and_net_churn__dbt_tmp)

rather than

using ([Main body of model])
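
To make this concrete, here is roughly what I'd have expected the second query to look like (sketched by hand from the queries above; the partition list is abridged):

    merge into `duolingo-data-science`.`dbt_naveen`.`gross_and_net_churn` as DBT_INTERNAL_DEST
        using `duolingo-data-science`.`dbt_naveen`.`gross_and_net_churn__dbt_tmp` as DBT_INTERNAL_SOURCE
        on FALSE

    when not matched by source
         and timestamp_trunc(DBT_INTERNAL_DEST.report_date, day) in (
              '2022-08-13', '2022-08-14' -- same partition list as in the query above, abridged here
          )
        then delete

    when not matched then insert
        (`report_date`, `client`, `subscription_type`, `is_new_years`, `subscription_tier`, `country_code`, `active_subscriptions`, `expirations`, `expirations_and_winbacks`, `winbacks`)
    values
        (`report_date`, `client`, `subscription_type`, `is_new_years`, `subscription_tier`, `country_code`, `active_subscriptions`, `expirations`, `expirations_and_winbacks`, `winbacks`)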

As far as I can tell, the created gross_and_net_churn__dbt_tmp table isn't used at all, anywhere.

So the output is correct, but dbt appears to be running the same query twice unnecessarily, which is a real performance hit.

Relevant log output

No response

Environment

- OS: Monterey 12.4
- Python: 3.8.0
- dbt: 1.2.1

Which database adapter are you using with dbt?

bigquery

Additional Context

I've seen this happen with a couple models. In one example, the first query took 17 minutes and the second query took 23 minutes.

@naveen-shankar naveen-shankar added bug Something isn't working triage labels Sep 13, 2022
@github-actions github-actions bot changed the title [Bug] <title> [CT-1170] [Bug] <title> Sep 13, 2022
@naveen-shankar naveen-shankar changed the title [CT-1170] [Bug] <title> [CT-1170] [Bug] Insert overwrite incremental models not using dbt_tmp table Sep 13, 2022
@jtcohen6 jtcohen6 self-assigned this Sep 21, 2022
@dbeatty10 (Contributor)

Thanks for noticing and reporting this, @naveen-shankar!

Reproduction case

Using dbt=1.3.1, I was able to reproduce the behavior you described using the following example (which I tried to make similar to #424):

models/incremental_example.sql

{% set partitions_to_replace = [
  "date_sub(current_date, interval 1 day)",
  "date_sub(current_date, interval 2 day)"
] %}

{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        cluster_by="id",
        partition_by={
            "field": "date_time",
            "data_type": "datetime",
            "granularity": "day"
        },
        partitions=partitions_to_replace,
        on_schema_change="sync_all_columns"
    )
}}


with data as (

    {% if not is_incremental() %}

        select 1 as id, cast('2020-01-01' as datetime) as date_time union all
        select 2 as id, cast('2020-01-01' as datetime) as date_time union all
        select 3 as id, cast('2020-01-01' as datetime) as date_time union all
        select 4 as id, cast('2020-01-01' as datetime) as date_time

    {% else %}

        -- we want to overwrite the 4 records in the 2020-01-01 partition
        -- with the 2 records below, but add two more in the 2020-01-02 partition
        select 10 as id, cast('2020-01-01' as datetime) as date_time union all
        select 20 as id, cast('2020-01-01' as datetime) as date_time union all
        select 30 as id, cast('2020-01-02' as datetime) as date_time union all
        select 40 as id, cast('2020-01-02' as datetime) as date_time

    {% endif %}

)

select * from data

After running the following commands, the logs in logs/dbt.log contain the output you described:

dbt run -s incremental_example.sql --full-refresh
dbt run -s incremental_example.sql

What's going on

I needed to do a deep dive to inspect what is happening here. Although I don't yet understand all the moving pieces, there are a couple of things I can say with confidence:

  • where the dbt_tmp table is used
  • where it isn't used

"As far as I can tell, the created gross_and_net_churn__dbt_tmp table isn't used at all, anywhere."

Your gross_and_net_churn__dbt_tmp table is actually used when looking for schema changes. The relevant output in the logs is something like this:

    In `your-project`.`your_schema`.`gross_and_net_churn`:
        Schema changed: False
        Source columns not in target: []
        Target columns not in source: []
        New column types: []

It uses the BigQuery Python API under the hood, which is why you don't see any SQL in the console that reads from the dbt_tmp table.
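
For reference, the schema comparison comes from dbt-core's on_schema_change helpers, which look roughly like this (a paraphrased sketch, not the exact source):

    {# paraphrased from check_for_schema_changes(source_relation, target_relation) #}
    {%- set source_columns = adapter.get_columns_in_relation(source_relation) -%}  {# the __dbt_tmp table #}
    {%- set target_columns = adapter.get_columns_in_relation(target_relation) -%}  {# the existing model table #}
    {# the two column lists are then diffed to produce the "Source columns not in target",
       "Target columns not in source", and "New column types" entries in the log output above #}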

"the created dbt_tmp table isn't used to update the destination table. Instead, the model code is rerun from scratch"

You're absolutely right -- it is rerunning it from scratch. And this is in contradiction to the docs, which currently state that the expected SQL is something like:

...
merge into {{ destination_table }} DEST
using {{ model_name }}__dbt_tmp SRC
on FALSE
...

I don't know why it rebuilds the model rather than using the _dbt_tmp table, though -- it might be intentional, but it might also be an accidental oversight.

Someone on our end will need to inspect this more deeply to determine if it can just reuse the temp table or if it must not use it for some reason.

Since this behavior appears to be specific to BigQuery, I'm going to move this issue to the relevant repository.

Acceptance criteria

  • We know whether the _dbt_tmp temp table can be safely reused in the using clause of the merge into statement
  • If it can be reused, then do it
  • If it can't be reused, then document why not

@dbeatty10 dbeatty10 removed the triage label Dec 12, 2022
@dbeatty10 dbeatty10 transferred this issue from dbt-labs/dbt-core Dec 12, 2022
@github-actions github-actions bot changed the title [CT-1170] [Bug] Insert overwrite incremental models not using dbt_tmp table [CT-1646] [CT-1170] [Bug] Insert overwrite incremental models not using dbt_tmp table Dec 12, 2022
@jtcohen6 jtcohen6 removed their assignment Jan 12, 2023
@jtcohen6 (Contributor) commented Jan 26, 2023

Going to mark this one as help_wanted. It would be a performance improvement, at parity with existing functionality. I've outlined below the code paths that a contributor would need to follow for the fix.


Doug is spot on that, when on_schema_change is enabled, we need to first create the model in a temp table that we can use to detect schema changes.

In that case, we should be setting tmp_relation_exists to True:
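
Roughly speaking, that part of the incremental materialization looks like this (a paraphrased sketch; the exact macro calls in the codebase differ a bit):

    {% set tmp_relation_exists = false %}
    {% if on_schema_change != 'ignore' or copy_partitions %}
      {# build the model into <model>__dbt_tmp so its schema can be compared against the target #}
      {% do run_query(create_table_as(True, tmp_relation, sql)) %}
      {% set tmp_relation_exists = true %}
      {% do process_schema_changes(on_schema_change, tmp_relation, existing_relation) %}
    {% endif %}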

We pass that tmp_relation_exists argument into bq_generate_incremental_insert_overwrite_build_sql:

{% set build_sql = bq_generate_incremental_insert_overwrite_build_sql(
tmp_relation, target_relation, sql, unique_key, partition_by, partitions, dest_columns, tmp_relation_exists, copy_partitions
) %}


[Aside: For some reason, the next two macros don't have a keyword argument named tmp_relation_exists — instead, it's called on_schema_change. This doesn't actually break the functionality, but it does make it much more confusing to follow the code path:]

{% macro bq_generate_incremental_insert_overwrite_build_sql(
tmp_relation, target_relation, sql, unique_key, partition_by, partitions, dest_columns, on_schema_change, copy_partitions
) %}

{% set build_sql = bq_insert_overwrite_sql(
tmp_relation, target_relation, sql, unique_key, partition_by, partitions, dest_columns, on_schema_change, copy_partitions
) %}


Finally, we see it clearly here: "dynamic" insert_overwrite accepts a boolean argument for tmp_relation_exists, but "static" insert_overwrite doesn't:

{% if partitions is not none and partitions != [] %} {# static #}
{{ bq_static_insert_overwrite_sql(tmp_relation, target_relation, sql, partition_by, partitions, dest_columns, copy_partitions) }}
{% else %} {# dynamic #}
{{ bq_dynamic_insert_overwrite_sql(tmp_relation, target_relation, sql, unique_key, partition_by, dest_columns, tmp_relation_exists, copy_partitions) }}
{% endif %}

So I think all we'd need to do to fix this is have the "static" version start accepting the tmp_relation_exists argument, and update this conditional logic to use it instead of just repeating {{ sql }}:

{%- set source_sql -%}
(
{%- if partition_by.time_ingestion_partitioning -%}
{{ wrap_with_time_ingestion_partitioning_sql(build_partition_time_exp(partition_by), sql, True) }}
{%- else -%}
{{sql}}
{%- endif -%}
)
{%- endset -%}
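
For illustration, the end state might look something like this (an untested sketch, not a patch):

    {% if partitions is not none and partitions != [] %} {# static #}
        {{ bq_static_insert_overwrite_sql(tmp_relation, target_relation, sql, partition_by, partitions, dest_columns, tmp_relation_exists, copy_partitions) }}
    {% else %} {# dynamic #}
        {{ bq_dynamic_insert_overwrite_sql(tmp_relation, target_relation, sql, unique_key, partition_by, dest_columns, tmp_relation_exists, copy_partitions) }}
    {% endif %}

    {# ...and inside bq_static_insert_overwrite_sql, select from the temp table when it already exists, #}
    {# instead of re-templating the model SQL: #}
    {%- set source_sql -%}
    (
    {%- if tmp_relation_exists -%}
      select * from {{ tmp_relation }}
    {%- elif partition_by.time_ingestion_partitioning -%}
      {{ wrap_with_time_ingestion_partitioning_sql(build_partition_time_exp(partition_by), sql, True) }}
    {%- else -%}
      {{ sql }}
    {%- endif -%}
    )
    {%- endset -%}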
