[CT-1646] [CT-1170] [Bug] Insert overwrite incremental models not using dbt_tmp table #427
Thanks for noticing and reporting this @naveen-shankar !

Reproduction case

Using dbt=1.3.1, I was able to reproduce the logs you described by using the following example (which I tried to make similar to #424):
{% set partitions_to_replace = [
"date_sub(current_date, interval 1 day)",
"date_sub(current_date, interval 2 day)"
] %}
{{
config(
materialized="incremental",
incremental_strategy="insert_overwrite",
cluster_by="id",
partition_by={
"field": "date_time",
"data_type": "datetime",
"granularity": "day"
},
partitions=partitions_to_replace,
on_schema_change="sync_all_columns"
)
}}
with data as (
{% if not is_incremental() %}
select 1 as id, cast('2020-01-01' as datetime) as date_time union all
select 2 as id, cast('2020-01-01' as datetime) as date_time union all
select 3 as id, cast('2020-01-01' as datetime) as date_time union all
select 4 as id, cast('2020-01-01' as datetime) as date_time
{% else %}
-- we want to overwrite the 4 records in the 2020-01-01 partition
-- with the 2 records below, but add two more in the 2020-01-02 partition
select 10 as id, cast('2020-01-01' as datetime) as date_time union all
select 20 as id, cast('2020-01-01' as datetime) as date_time union all
select 30 as id, cast('2020-01-02' as datetime) as date_time union all
select 40 as id, cast('2020-01-02' as datetime) as date_time
{% endif %}
)
select * from data

After running the following commands, I saw the logs you described:

dbt run -s incremental_example.sql --full-refresh
dbt run -s incremental_example.sql
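For reference, the run itself produces correct results (the bug is purely a double-compute, as noted below): after the second, incremental run the table should contain exactly these rows:

id | date_time
---|--------------------
10 | 2020-01-01 00:00:00
20 | 2020-01-01 00:00:00
30 | 2020-01-02 00:00:00
40 | 2020-01-02 00:00:00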
What's going on

I needed to do a deep dive to inspect what is happening here. Although I don't yet understand all the moving pieces, there are a couple of things I can say with confidence:
- Your partitions config is sending the model down the "static" insert overwrite code path
- It is using the __dbt_tmp table to handle on_schema_change, but not to build the final merge
You're absolutely right -- it is rerunning it from scratch. And this is in contradiction to the docs, which currently state that the expected SQL is something like:

...
merge into {{ destination_table }} DEST
using {{ model_name }}__dbt_tmp SRC
on FALSE
...

I don't know why it is rebuilding the data from scratch rather than using the __dbt_tmp table. Someone on our end will need to inspect this more deeply to determine if it can just reuse the temp table or if it must not use it for some reason. Since this behavior appears specific to BigQuery, I'm going to move this issue to the relevant repository.

Acceptance criteria
Going to mark this one as a confirmed bug. Doug is spot on: when on_schema_change is set (as in the repro above), the incremental materialization creates the __dbt_tmp table up front, but the "static" insert_overwrite strategy then ignores it and re-runs the model SQL. In that case, we should be setting tmp_relation_exists = true and actually honoring it downstream.

We pass that flag down through the insert_overwrite macros.
[Aside: for some reason, the next two macros don't have a keyword argument for it:]

dbt-bigquery/dbt/include/bigquery/macros/materializations/incremental_strategy/insert_overwrite.sql, lines 1 to 3 at 38fb796
dbt-bigquery/dbt/include/bigquery/macros/materializations/incremental_strategy/insert_overwrite.sql, lines 11 to 13 at 38fb796
Finally, we see it clearly here: the "dynamic" flavor does accept and use the temp table.

dbt-bigquery/dbt/include/bigquery/macros/materializations/incremental_strategy/insert_overwrite.sql, lines 41 to 45 at 38fb796
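To make that concrete, here is schematically the kind of script the dynamic flavor produces -- reconstructed from the documented merge pattern quoted above, not the macro's literal output. Project/dataset names are placeholders; the column names come from the repro model:

```sql
declare dbt_partitions_for_replacement array<datetime>;

-- 1. materialize the model SQL once, into the temp table
create or replace table `my-project`.`my_dataset`.`incremental_example__dbt_tmp`
partition by datetime_trunc(date_time, day)
as (
  select 10 as id, cast('2020-01-01' as datetime) as date_time  -- compiled model SQL
);

-- 2. collect the day partitions present in the new data
set (dbt_partitions_for_replacement) = (
  select as struct array_agg(distinct datetime_trunc(date_time, day))
  from `my-project`.`my_dataset`.`incremental_example__dbt_tmp`
);

-- 3. replace exactly those partitions, reading from the temp table
merge into `my-project`.`my_dataset`.`incremental_example` DEST
using `my-project`.`my_dataset`.`incremental_example__dbt_tmp` SRC
on FALSE
when not matched by source
  and datetime_trunc(DEST.date_time, day) in unnest(dbt_partitions_for_replacement)
  then delete
when not matched then insert (id, date_time) values (id, date_time);
```

The key point is step 3: the merge reads from the temp table, so the expensive model SQL runs exactly once.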
So I think all we'd need to fix this is to have the "static" version start accepting that same argument.

dbt-bigquery/dbt/include/bigquery/macros/materializations/incremental_strategy/insert_overwrite.sql, lines 58 to 66 at 38fb796
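A minimal sketch of that change, assuming a macro shape like the one in that file (the name, signature, and surrounding structure here are illustrative, not the exact source):

```jinja
{% macro bq_static_insert_overwrite_sql(tmp_relation, target_relation, sql, partition_by, partitions, tmp_relation_exists=false) %}

    {#-- If the materialization already built the temp table (it does
        whenever on_schema_change != 'ignore'), merge from it instead
        of re-running the model SQL. --#}
    {% if tmp_relation_exists %}
        {% set source_sql -%}
            ( select * from {{ tmp_relation }} )
        {%- endset %}
    {% else %}
        {% set source_sql -%}
            ( {{ sql }} )
        {%- endset %}
    {% endif %}

    {#-- ...the rest of the macro (building the merge against the static
        partitions list) would stay as-is... --#}

{% endmacro %}
```

With a default of false, existing callers keep their current behavior, and only the code path that has already paid for the temp table switches to reusing it.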
Is this a new bug in dbt-core?
Current Behavior
When I run an incremental table update using the insert_overwrite strategy, the created dbt_tmp table isn't used to update the destination table. Instead, the model code is rerun from scratch.

Expected Behavior
Rather than rerunning the model code from scratch, I expected the destination table update to use the created dbt_tmp table.

Steps To Reproduce
The specific model is called gross_and_net_churn.sql. I'll exclude the main body of the model for brevity (I've confirmed this is happening with multiple models), but the config params are:

On dbt CLI, I ran:
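For illustration, a config that triggers this behavior (per the diagnosis later in the thread: static partitions plus on_schema_change) looks something like the following; the field name and partition expression are hypothetical, not the actual model's:

```jinja
{#-- hypothetical config: field name and partition list are illustrative --#}
{{
    config(
        materialized="incremental",
        incremental_strategy="insert_overwrite",
        partition_by={"field": "date_day", "data_type": "date"},
        partitions=["current_date"],
        on_schema_change="sync_all_columns"
    )
}}
```

and the command would have been along the lines of:

```
dbt run -s gross_and_net_churn
```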
I then looked at which queries got executed in the BigQuery console.
The first query run created the gross_and_net_churn__dbt_tmp table; the second query run did the actual update.
Just as the documentation suggests, I would've expected the second query to use the gross_and_net_churn__dbt_tmp table rather than rerunning the entire main body of the model. In other words, the second query should select from the temp table instead of inlining the model SQL.
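That is, per the documented pattern quoted earlier in the thread, the expected second query would be roughly (same placeholder names as above):

```sql
merge into `my-project`.`my_dataset`.`gross_and_net_churn` DEST
using `my-project`.`my_dataset`.`gross_and_net_churn__dbt_tmp` SRC  -- reuse the temp table
on FALSE
when not matched by source and date_day in (current_date) then delete
when not matched then insert (id, date_day) values (id, date_day);
```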
As far as I can tell, the created gross_and_net_churn__dbt_tmp table isn't used at all, anywhere. So the output is correct, but dbt is seemingly running the same query twice unnecessarily, leading to a performance hit.
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
bigquery
Additional Context
I've seen this happen with a couple models. In one example, the first query took 17 minutes and the second query took 23 minutes.