
[Bug] New columns not populated with data in a snapshot #10088

Closed
elsander opened this issue May 3, 2024 · 11 comments
Labels: bug (Something isn't working), snapshots (Issues related to dbt's snapshot functionality)

Comments


elsander commented May 3, 2024

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I experienced an edge case where new columns get added to a snapshot, but are not populated with data.

Steps to reproduce:

  1. Create a table with static data
  2. Snapshot the table with check_cols='all'
  3. Manually add a new column to the table (populate it with non-null values). Do not update any of the data from step 1.
  4. Snapshot the table again

After step 2, I had the expected snapshot with one row per unique id, populated with the data from the table in step 1. After step 4, I had the same snapshot, but with a new column added. The new column was populated entirely with nulls.

Expected Behavior

In step 4, I expect the snapshot to add a new row for each unique id, which contains all data in the table, including the values in the new column. Instead, no new rows were added to the snapshot, and the new column values were not populated in the snapshot.

One interesting note: if any previously snapshotted value changes (even in only one row), ALL unique ids get a new snapshot row with data from the new columns. So you would only encounter this problem if the previously snapshotted data stays entirely static.
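For anyone trying to confirm the symptom on their own data, a query along these lines makes it easy to spot (illustrative only; the snapshot and column names match the reprex further down):

-- Count snapshot versions per id, and how many of them carry a value in the new column (BigQuery SQL)
select
    id,
    count(*) as n_versions,
    countif(val_2 is not null) as n_versions_with_val_2
from {{ ref('snapshot__snapshot_test') }}
group by id
order by id

In the buggy case described above, n_versions stays at 1 and n_versions_with_val_2 at 0 for every id, even after the second snapshot run.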

Steps To Reproduce

  1. Environment:
    Linux 0584341454eb 5.15.49-linuxkit #1 SMP Tue Sep 13 07:51:46 UTC 2022 x86_64 GNU/Linux
    Python version 3.9.19
    DBT version 1.6.13, bigquery plugin version 1.6.1
  2. Build a table called snapshot_test, filling in tbl with some valid table for your environment:
SELECT DISTINCT
  '123' as id,
  1 as val_1
FROM {{ tbl }}
UNION ALL
SELECT DISTINCT
  '456' as id,
  1 as val_1
FROM {{ tbl }}
  3. Create an associated snapshot for the table:
{% snapshot snapshot__snapshot_test %}

{{
    config(
      unique_key='id',
      strategy='check',
      invalidate_hard_deletes=True,
      check_cols='all'
    )
}}

SELECT *
FROM {{ ref('snapshot_test') }}

{% endsnapshot %}
  4. Update and rebuild snapshot_test:
SELECT DISTINCT
  '123' as id,
  1 as val_1,
  'test' as val_2,
FROM {{ tbl }}
UNION ALL
SELECT DISTINCT
  '456' as id,
  1 as val_1,
  'test_2' as val_2
FROM {{ tbl }}
  5. Rerun the snapshot

This results in a snapshot with only one row for each id, where the val_2 column is present but filled with nulls.
Note that if you run the following in step 4 instead (note the change to one of the val_1 values):

SELECT DISTINCT
  '123' as id,
  2 as val_1,
  'test' as val_2,
FROM {{ tbl }}
UNION ALL
SELECT DISTINCT
  '456' as id,
  1 as val_1,
  'test_2' as val_2
FROM {{ tbl }}

A new row will be added to the snapshot as expected, with non-null values for the new columns.

Relevant log output

No response

Environment

Linux 0584341454eb 5.15.49-linuxkit #1 SMP Tue Sep 13 07:51:46 UTC 2022 x86_64 GNU/Linux
Python version 3.9.19
DBT version 1.6.13, bigquery plugin version 1.6.1

Which database adapter are you using with dbt?

bigquery

Additional Context

No response

elsander added the bug and triage labels May 3, 2024
dbeatty10 added the snapshots label May 3, 2024
dbeatty10 self-assigned this May 4, 2024
dbeatty10 (Contributor) commented:

Thanks for reporting this @elsander.

I wasn't able to get the same result as you with dbt=1.6.5 and bigquery=1.6.9. Instead, I got this after the 2nd snapshot:

| id  | val_1 | val_2  | dbt_valid_from | dbt_valid_to |
| --- | ----- | ------ | -------------- | ------------ |
| 123 |     1 | test   |     2024-05-04 |              |
| 123 |     1 |        |     2024-05-04 |   2024-05-04 |
| 456 |     1 | test_2 |     2024-05-04 |              |
| 456 |     1 |        |     2024-05-04 |   2024-05-04 |

Could you try using the files below and let me know if there's something that can be tweaked to see the scenario you are reporting?

Reprex

macros/drop_table.sql

{% macro drop_table(name) %}

    {% set sql %}
        drop table {{ ref(name) }};
    {% endset %}

    {% do log("query: " ~ sql, info=True) %}
    {% do run_query(sql) %}
    {% do log("Table dropped", info=True) %}

{% endmacro %}

models/dual.sql

select 'X' as dummy

models/snapshot_test.sql

-- will be overwritten with one of two versions of the file from below

snapshot_test__v1.sql

SELECT DISTINCT
  '123' as id,
  1 as val_1
FROM {{ ref("dual") }}
UNION ALL
SELECT DISTINCT
  '456' as id,
  1 as val_1
FROM {{ ref("dual") }}

snapshot_test__v2.sql

SELECT DISTINCT
  '123' as id,
  1 as val_1,
  'test' as val_2
FROM {{ ref("dual") }}
UNION ALL
SELECT DISTINCT
  '456' as id,
  1 as val_1,
  'test_2' as val_2
FROM {{ ref("dual") }}

analyses/snapsphot_1.sql

select
    id,
    val_1,
    cast({{ date_trunc('day', 'dbt_valid_from') }} as date) as dbt_valid_from,
    cast({{ date_trunc('day', 'dbt_valid_to') }} as date) as dbt_valid_to

from {{ ref("snapshot__snapshot_test") }}

order by dbt_valid_from, id

analyses/snapsphot_2.sql

select
    id,
    val_1,
    val_2,
    cast({{ date_trunc('day', 'dbt_valid_from') }} as date) as dbt_valid_from,
    cast({{ date_trunc('day', 'dbt_valid_to') }} as date) as dbt_valid_to

from {{ ref("snapshot__snapshot_test") }}

order by dbt_valid_from, id

snapshots/snapshot__snapshot_test.sql

{% snapshot snapshot__snapshot_test %}

{{
    config(
      target_database=target.database,
      target_schema=target.schema,
      unique_key='id',
      strategy='check',
      invalidate_hard_deletes=True,
      check_cols='all'
    )
}}

SELECT *
FROM {{ ref('snapshot_test') }}

{% endsnapshot %}

Run these commands:

dbt run-operation drop_table --args '{name: snapshot__snapshot_test }'
dbt run -s dual
cp snapshot_test__v1.sql models/snapshot_test.sql
dbt run -s models/snapshot_test.sql
dbt snapshot -s snapshots/snapshot__snapshot_test.sql
dbt show -s analyses/snapsphot_1.sql
cp snapshot_test__v2.sql models/snapshot_test.sql
dbt run -s models/snapshot_test.sql
dbt snapshot -s snapshots/snapshot__snapshot_test.sql
dbt show -s analyses/snapsphot_2.sql

Hopefully I didn't forget any files -- just let me know if I did.

dbeatty10 removed their assignment May 4, 2024
elsander (Author) commented May 6, 2024

@dbeatty10 that is truly bizarre; I used your example and replicated the same issue I reported.

The only differences between your test and my replication of your test:

  • My dbt_project.yml pre-specifies the target database and schema for snapshots, so I did not explicitly provide those arguments.
  • I have my own table deletion macro, which I ran instead of yours (I confirmed that the tables were in fact deleted before running the test)
  • Instead of running dbt show, I queried the snapshot output in the BigQuery console.

First snapshot output: (screenshot)

Second snapshot output: (screenshot)

We are using slightly different microversions of dbt and the BigQuery plugin, and you're probably using a somewhat different OS environment. My best guess is that the OS environment or a microversion difference is the root cause here. Unfortunately, I can't easily tweak these, because I use a Docker image that I don't maintain.
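For anyone comparing environments, the exact installed versions can be double-checked inside the container with standard commands, for example:

# Confirm dbt-core and adapter versions inside the Docker image
dbt --version
pip show dbt-core dbt-bigquery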

dbeatty10 (Contributor) commented:

Thanks for trying this out @elsander 👍

The OS environment shouldn't make a difference here. I went ahead and adjusted my dbt_project.yml and snapshots/snapshot__snapshot_test.sql as follows:

modified files

dbt_project.yml

name: "my_project"
version: "1.0.0"
config-version: 2
profile: "bigquery"

snapshots:
  my_project:
    +target_database: "{{ target.database }}"
    +target_schema: "{{ target.schema }}"

snapshots/snapshot__snapshot_test.sql

{% snapshot snapshot__snapshot_test %}

{{
    config(
      unique_key='id',
      strategy='check',
      invalidate_hard_deletes=True,
      check_cols='all'
    )
}}

SELECT *
FROM {{ ref('snapshot_test') }}

{% endsnapshot %}

But when I tried this out with the same microversions as you, it still worked as expected for me:

17:17:54  Running with dbt=1.6.13
17:17:55  Registered adapter: bigquery=1.6.1
17:17:55  Found 2 models, 1 snapshot, 2 analyses, 3 seeds, 0 sources, 0 exposures, 0 metrics, 394 macros, 0 groups, 0 semantic models
17:17:55  
17:17:55  Concurrency: 10 threads (target='blue')
17:17:55  
17:17:57  Previewing node 'snapsphot_2':
| id  | val_1 | val_2  | dbt_valid_from | dbt_valid_to |
| --- | ----- | ------ | -------------- | ------------ |
| 123 |     1 |        |     2024-05-06 |   2024-05-06 |
| 123 |     1 | test   |     2024-05-06 |              |
| 456 |     1 |        |     2024-05-06 |   2024-05-06 |
| 456 |     1 | test_2 |     2024-05-06 |              |

To reduce as many confounding variables as possible, could you try using all the same files and commands as me and share the terminal output?

  • dbt_project.yml
  • The drop_table table deletion macro from above
  • snapshots/snapshot__snapshot_test.sql
  • All the same commands
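For example, something like this would capture the output of the two key steps into a log file you can attach here (the log file name is arbitrary):

# Capture stdout and stderr from the second snapshot run and the preview query
dbt snapshot -s snapshots/snapshot__snapshot_test.sql 2>&1 | tee snapshot_run.log
dbt show -s analyses/snapsphot_2.sql 2>&1 | tee -a snapshot_run.log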

elsander (Author) commented May 7, 2024

@dbeatty10 Really bizarre -- I tried your version, using all of the same files and commands, and I got the behavior you did, where it's correctly updating the table with new data. The main difference I could imagine being relevant is that I directly specified the target database and schema in the snapshot config.

Here is the snapshot section of my dbt_project.yml in case it's helpful for debugging:

snapshots:
  cohort_mart:
    +docs:
      node_color: "#E54F6D"
    ie_warehouse:
      +target_schema: snapshots__ie_warehouse
    cohort_utility_warehouse:
      +target_schema: snapshots__cohort_utility_warehouse
    golden_cohort_warehouse:
      +target_schema: snapshots__golden_cohort_warehouse
    cohort_self_storage:
      +target_schema: snapshots__cohort_self_storage

dbeatty10 (Contributor) commented:

That is bizarre, but also good news that we're narrowing in!

Could you try tweaking my version and see if you can get it to give the same unexpected results that you originally posted?

If you can do that, then we'll have a solid reproducible example ("reprex").

elsander (Author) commented May 8, 2024

Ok @dbeatty10, I spent some time on this and was able to reproduce the issue with the following changes:

dual.sql

{{ config(schema="cohort_self_storage", materialized="table") }}

select 'X' as dummy

snapshot_test__v1.sql

SELECT DISTINCT
  '123' as id,
  1 as val_1
FROM {{ ref("dual") }}
UNION ALL
SELECT DISTINCT
  '456' as id,
  1 as val_1
FROM {{ ref("dual") }}

snapshot_test__v2.sql

{{ config(schema="cohort_self_storage", materialized="table") }}

SELECT DISTINCT
  '123' as id,
  1 as val_1,
  'test' as val_2
FROM {{ ref("dual") }}
UNION ALL
SELECT DISTINCT
  '456' as id,
  1 as val_1,
  'test_2' as val_2
FROM {{ ref("dual") }}

Without both of those config options, I am not able to reproduce the issue; with them, I do. They match settings I had as defaults in my dbt_project.yml, and if I don't specify them explicitly in the files but rely on the yml defaults instead, I still replicate the issue. Let me know if you're able to reproduce on your end.
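For reference, project-level defaults like these would be the dbt_project.yml equivalent of the inline config above (the folder nesting here is illustrative, not copied verbatim from my project):

# dbt_project.yml (sketch)
models:
  cohort_mart:
    cohort_self_storage:
      +materialized: table
      +schema: cohort_self_storage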

dbeatty10 (Contributor) commented:

@elsander I still haven't been able to reproduce this.

Do you get something different when you use the files and commands below?

Do you see anything I've configured differently than the example you used to reproduce the issue?

Project files and commands

macros/drop_table.sql

{% macro drop_table(name) %}

    {% set sql %}
        drop table {{ ref(name) }};
    {% endset %}

    {% do log("query: " ~ sql, info=True) %}
    {% do run_query(sql) %}
    {% do log("Table dropped", info=True) %}

{% endmacro %}

models/dual.sql

{{ config(schema="cohort_self_storage", materialized="table") }}

select 'X' as dummy

snapshot_test__v1.sql

SELECT DISTINCT
  '123' as id,
  1 as val_1
FROM {{ ref("dual") }}
UNION ALL
SELECT DISTINCT
  '456' as id,
  1 as val_1
FROM {{ ref("dual") }}

snapshot_test__v2.sql

{{ config(schema="cohort_self_storage", materialized="table") }}

SELECT DISTINCT
  '123' as id,
  1 as val_1,
  'test' as val_2
FROM {{ ref("dual") }}
UNION ALL
SELECT DISTINCT
  '456' as id,
  1 as val_1,
  'test_2' as val_2
FROM {{ ref("dual") }}

analyses/snapsphot_1.sql

select
    id,
    val_1,
    cast({{ date_trunc('day', 'dbt_valid_from') }} as date) as dbt_valid_from,
    cast({{ date_trunc('day', 'dbt_valid_to') }} as date) as dbt_valid_to

from {{ ref("snapshot__snapshot_test") }}

order by dbt_valid_from, id

analyses/snapsphot_2.sql

select
    id,
    val_1,
    val_2,
    cast({{ date_trunc('day', 'dbt_valid_from') }} as date) as dbt_valid_from,
    cast({{ date_trunc('day', 'dbt_valid_to') }} as date) as dbt_valid_to

from {{ ref("snapshot__snapshot_test") }}

order by dbt_valid_from, id

Run these commands:

dbt run-operation drop_table --args '{name: snapshot__snapshot_test }'
dbt run -s dual
cp snapshot_test__v1.sql models/snapshot_test.sql
dbt run -s models/snapshot_test.sql
dbt snapshot -s snapshots/snapshot__snapshot_test.sql
dbt show -s analyses/snapsphot_1.sql
cp snapshot_test__v2.sql models/snapshot_test.sql
dbt run -s models/snapshot_test.sql
dbt snapshot -s snapshots/snapshot__snapshot_test.sql
dbt show -s analyses/snapsphot_2.sql

Get this output:

16:17:14  Previewing node 'snapsphot_2':
| id  | val_1 | val_2  | dbt_valid_from | dbt_valid_to |
| --- | ----- | ------ | -------------- | ------------ |
| 123 |     1 | test   |     2024-05-09 |              |
| 123 |     1 |        |     2024-05-09 |   2024-05-09 |
| 456 |     1 | test_2 |     2024-05-09 |              |
| 456 |     1 |        |     2024-05-09 |   2024-05-09 |

elsander (Author) commented:

Yeah, I reran it today with the same setup as before, and I couldn't reproduce it either. I'm really at a loss; it was on the same commit where I'd replicated the issue before. It's really bizarre.

elsander (Author) commented:

I think my next step will be to see if others on my team can reproduce the issue from my original setup, to figure out whether it's something about the dbt project config or something specific to my computer or Docker container.

graciegoheen (Contributor) commented:

@elsander I'm going to close this for now, but if you're able to reproduce let me know and we can re-open!

graciegoheen closed this as not planned (won't fix, can't repro, duplicate, stale) May 28, 2024
elsander (Author) commented:

Makes sense to me. I've spent some time working with folks on my end, and although we see the behavior sometimes, it has been difficult to track down a reproducible example outside our specific environment.
