Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2869] Add new snapshot records when source removes a field #8207

Open
3 tasks done
Tracked by #10151
owen-mc-git opened this issue Jul 25, 2023 · 10 comments
Open
3 tasks done
Tracked by #10151

[CT-2869] Add new snapshot records when source removes a field #8207

owen-mc-git opened this issue Jul 25, 2023 · 10 comments
Labels
enhancement New feature or request Refinement Maintainer input needed snapshots Issues related to dbt's snapshot functionality

Comments

@owen-mc-git
Copy link

owen-mc-git commented Jul 25, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Summary,
Using DBT snapshots, if a field is removed from source no new records are created in the snapshot table to show that the prior values no longer exists in source. This is to request that this functionality is added/supported.
This functionality exists when a new field is added but not the reverse.

Expand:
When using DBT snapshots, if a new field added in the source, it will also be added to the snapshot table and a new record is created for each existing record in the table to show the new field added/updated at this new point in time.

However if a field is removed from source no new records are created in the snapshot table to show that the prior value no longer exists in source. When a net new record is added we do see a null value appear in the removed field location.
Not having a new record created for data that is removed gives a false sense that the source still has these values.

Example:

Source orig COL_1 COL_2 COL_3
1 abc1 def2 ghi3
2 abc2 def3 ghi4
3 abc3 def4 ghi5
Snapshot orig COL_1 COL_2 COL_3 period_start period_end
1 abc1 def2 ghi3 23-Jul-2023 null
2 abc2 def3 ghi4 23-Jul-2023 null
3 abc3 def4 ghi5 23-Jul-2023 null

source adding a new column on the 24th

Source Add column COL_1 COL_2 COL_3 COL4
1 abc1 def2 ghi3 jkl4
2 abc2 def3 ghi4 jkl5
3 abc3 def4 ghi5 jkl6
Snapshot changes COL_1 COL_2 COL_3 COL_4 period_start period_end
1 abc1 def2 ghi3 null 23-Jul-2023 24-Jul-2023
2 abc2 def3 ghi4 null 23-Jul-2023 24-Jul-2023
3 abc3 def4 ghi5 null 23-Jul-2023 24-Jul-2023
1 abc1 def2 ghi3 jkl4 24-Jul-2023 null
2 abc2 def3 ghi4 jkl5 24-Jul-2023 null
3 abc3 def4 ghi5 jkl6 24-Jul-2023 null

source removing a column and adding new record on the 25th

Source remove column COL_1 COL_2 COL4
1 abc1 def2 jkl4
2 abc2 def3 jkl5
3 abc3 def4 jkl6
4 abc4 def5 jkl7
Snapshot unchanged COL_1 COL_2 COL_3 COL_4 period_start period_end
1 abc1 def2 ghi3 null 23-Jul-2023 24-Jul-2023
2 abc2 def3 ghi4 null 23-Jul-2023 24-Jul-2023
3 abc3 def4 ghi5 null 23-Jul-2023 24-Jul-2023
1 abc1 def2 ghi3 jkl4 24-Jul-2023 null
2 abc2 def3 ghi4 jkl5 24-Jul-2023 null
3 abc3 def4 ghi5 jkl6 24-Jul-2023 null
4 abc4 def5 null jkl7 25-Jul-2023 null

what I would expect to see in a true SCD table

Snapshot COL_1 COL_2 COL_3 COL_4 period_start period_end
1 abc1 def2 ghi3 null 23-Jul-2023 24-Jul-2023
2 abc2 def3 ghi4 null 23-Jul-2023 24-Jul-2023
3 abc3 def4 ghi5 null 23-Jul-2023 24-Jul-2023
1 abc1 def2 ghi3 jkl4 24-Jul-2023 25-Jul-2023
2 abc2 def3 ghi4 jkl5 24-Jul-2023 25-Jul-2023
3 abc3 def4 ghi5 jkl6 24-Jul-2023 25-Jul-2023
1 abc1 def2 null jkl4 25-Jul-2023 null
2 abc2 def3 null jkl5 25-Jul-2023 null
3 abc3 def4 null jkl6 25-Jul-2023 null
4 abc4 def5 null jkl7 25-Jul-2023 null

Describe alternatives you've considered

No response

Who will this benefit?

users of snapshots who wish to see the true state of a source at the correct points in time

Are you interested in contributing this feature?

No response

Anything else?

FAQ section "What happens if I add new columns to my snapshot query?" https://docs.getdbt.com/docs/build/snapshots

@owen-mc-git owen-mc-git added enhancement New feature or request triage labels Jul 25, 2023
@github-actions github-actions bot changed the title Add new snapshot records when source removes a field [CT-2869] Add new snapshot records when source removes a field Jul 25, 2023
@compilerqueen
Copy link

+1, would love to see this handled as a config option similar to invalidate_hard_deletes.

@mrogove
Copy link

mrogove commented Jul 27, 2023

+1

@dbeatty10 dbeatty10 added the snapshots Issues related to dbt's snapshot functionality label Jul 27, 2023
@dbeatty10
Copy link
Contributor

Thanks for such a nice write-up @owen-mc-git 🤩 -- made it easy to see what is it doing currently vs. your expectation. See "Reprex" below for the code that I used as a reproducible example.

Refinement

@compilerqueen if we can address this, do you think it should be a config option... or should we just roll it out (essentially) as a bug fix?

Would be curious to hear your thoughts on pros/cons either way.

Potential solution

If we wanted to make the minimal change to produce the behavior that @owen-mc-git described, we could just add the following line right above the return statement here:

    {%- set ns.column_added = intersection|length < query_columns|length or intersection|length < existing_cols|length -%}

👉 Of course we'd probably do more changes than that (e.g., rename variables and remove unused code), but it was pretty pleasing to see that the behavior can be achieved with such minimal code changes 😎

Result

id col_1 col_2 col_3 col_4 dbt_valid_from dbt_valid_to
1 abc1 def2 ghi3 2023-07-23 2023-07-24
2 abc2 def3 ghi4 2023-07-23 2023-07-24
3 abc3 def4 ghi5 2023-07-23 2023-07-24
1 abc1 def2 ghi3 jkl4 2023-07-24 2023-07-25
2 abc2 def3 ghi4 jkl5 2023-07-24 2023-07-25
3 abc3 def4 ghi5 jkl6 2023-07-24 2023-07-25
1 abc1 def2 jkl4 2023-07-25
2 abc2 def3 jkl5 2023-07-25
3 abc3 def4 jkl6 2023-07-25
4 abc4 def5 jkl7 2023-07-25

Reprex

Click here for details

These project files assume dbt-duckdb ~= 1.6 but it shouldn't be that hard to modify for a different adapter. duckcli is used to quickly run queries and display the output.

seeds/my_seed.v1.csv

id,col_1,col_2,col_3
1,abc1,def2,ghi3
2,abc2,def3,ghi4
3,abc3,def4,ghi5

seeds/my_seed.v2.csv

id,col_1,col_2,col_3,col_4
1,abc1,def2,ghi3,jkl4
2,abc2,def3,ghi4,jkl5
3,abc3,def4,ghi5,jkl6

seeds/my_seed.v3.csv

id,col_1,col_2,col_4
1,abc1,def2,jkl4
2,abc2,def3,jkl5
3,abc3,def4,jkl6
4,abc4,def5,jkl7

snapshots/my_snapshot.sql

{% snapshot my_snapshot %}

{{
    config(
      target_database=target.database,
      target_schema=target.schema,
      updated_at="'" ~ var("updated_at", '2023-07-23') ~ "'::date",
      unique_key='id',
      strategy='check',
      check_cols='all',
    )
}}

{% set day_number = var("day", 1) | int %}

{% if day_number <= 1 %}

select * from {{ ref("my_seed.v1") }}

{% elif day_number == 2 %}

select * from {{ ref("my_seed.v2") }}

{% elif day_number >= 3 %}

select * from {{ ref("my_seed.v3") }}

{% endif %}

{% endsnapshot %}

Run the snapshots one after another and view the output in between:

rm db.db
dbt seed
dbt snapshot --vars "{'day': 1, 'updated_at': '2023-07-23'}"
duckcli db.db -e "select id, col_1, col_2, col_3, date_trunc('day', dbt_valid_from)::date as dbt_valid_from, date_trunc('day', dbt_valid_to)::date as dbt_valid_to from my_snapshot order by dbt_valid_from, id" --table
dbt snapshot --vars "{'day': 2, 'updated_at': '2023-07-24'}"
duckcli db.db -e "select id, col_1, col_2, col_3, col_4, date_trunc('day', dbt_valid_from)::date as dbt_valid_from, date_trunc('day', dbt_valid_to)::date as dbt_valid_to from my_snapshot order by dbt_valid_from, id" --table
dbt snapshot --vars "{'day': 3, 'updated_at': '2023-07-25'}"
duckcli db.db -e "select id, col_1, col_2, col_3, col_4, date_trunc('day', dbt_valid_from)::date as dbt_valid_from, date_trunc('day', dbt_valid_to)::date as dbt_valid_to from my_snapshot order by dbt_valid_from, id" --table

Output:

01:46:26  Running with dbt=1.6.0
01:46:26  Registered adapter: duckdb=1.6.0
01:46:26  Unable to do partial parsing because config vars, config profile, or config target have changed
01:46:27  Found 1 model, 1 snapshot, 3 seeds, 0 sources, 0 exposures, 0 metrics, 349 macros, 0 groups, 0 semantic models
01:46:27  
01:46:27  Concurrency: 1 threads (target='duckdb')
01:46:27  
01:46:27  1 of 3 START seed file main.my_seed.v1 ......................................... [RUN]
01:46:27  1 of 3 OK loaded seed file main.my_seed.v1 ..................................... [INSERT 3 in 0.12s]
01:46:27  2 of 3 START seed file main.my_seed.v2 ......................................... [RUN]
01:46:27  2 of 3 OK loaded seed file main.my_seed.v2 ..................................... [INSERT 3 in 0.03s]
01:46:27  3 of 3 START seed file main.my_seed.v3 ......................................... [RUN]
01:46:27  3 of 3 OK loaded seed file main.my_seed.v3 ..................................... [INSERT 4 in 0.03s]
01:46:27  
01:46:27  Finished running 3 seeds in 0 hours 0 minutes and 0.41 seconds (0.41s).
01:46:27  
01:46:27  Completed successfully
01:46:27  
01:46:27  Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
01:46:29  Running with dbt=1.6.0
01:46:29  Registered adapter: duckdb=1.6.0
01:46:29  Unable to do partial parsing because config vars, config profile, or config target have changed
01:46:30  Found 1 model, 1 snapshot, 3 seeds, 0 sources, 0 exposures, 0 metrics, 349 macros, 0 groups, 0 semantic models
01:46:30  
01:46:30  Concurrency: 1 threads (target='duckdb')
01:46:30  
01:46:30  1 of 1 START snapshot main.my_snapshot ......................................... [RUN]
01:46:30  1 of 1 OK snapshotted main.my_snapshot ......................................... [success in 0.14s]
01:46:31  
01:46:31  Finished running 1 snapshot in 0 hours 0 minutes and 0.32 seconds (0.32s).
01:46:31  
01:46:31  Completed successfully
01:46:31  
01:46:31  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
| id | col_1 | col_2 | col_3 | dbt_valid_from | dbt_valid_to |
|----|-------|-------|-------|----------------|--------------|
| 1  | abc1  | def2  | ghi3  | 2023-07-23     | <null>       |
| 2  | abc2  | def3  | ghi4  | 2023-07-23     | <null>       |
| 3  | abc3  | def4  | ghi5  | 2023-07-23     | <null>       |
01:46:33  Running with dbt=1.6.0
01:46:33  Registered adapter: duckdb=1.6.0
01:46:33  Unable to do partial parsing because config vars, config profile, or config target have changed
01:46:34  Found 1 model, 1 snapshot, 3 seeds, 0 sources, 0 exposures, 0 metrics, 349 macros, 0 groups, 0 semantic models
01:46:34  
01:46:34  Concurrency: 1 threads (target='duckdb')
01:46:34  
01:46:34  1 of 1 START snapshot main.my_snapshot ......................................... [RUN]
01:46:34  1 of 1 OK snapshotted main.my_snapshot ......................................... [success in 0.35s]
01:46:34  
01:46:34  Finished running 1 snapshot in 0 hours 0 minutes and 0.53 seconds (0.53s).
01:46:34  
01:46:34  Completed successfully
01:46:34  
01:46:34  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
| id | col_1 | col_2 | col_3 | col_4  | dbt_valid_from | dbt_valid_to |
|----|-------|-------|-------|--------|----------------|--------------|
| 1  | abc1  | def2  | ghi3  | <null> | 2023-07-23     | 2023-07-24   |
| 2  | abc2  | def3  | ghi4  | <null> | 2023-07-23     | 2023-07-24   |
| 3  | abc3  | def4  | ghi5  | <null> | 2023-07-23     | 2023-07-24   |
| 1  | abc1  | def2  | ghi3  | jkl4   | 2023-07-24     | <null>       |
| 2  | abc2  | def3  | ghi4  | jkl5   | 2023-07-24     | <null>       |
| 3  | abc3  | def4  | ghi5  | jkl6   | 2023-07-24     | <null>       |
01:46:37  Running with dbt=1.6.0
01:46:37  Registered adapter: duckdb=1.6.0
01:46:37  Unable to do partial parsing because config vars, config profile, or config target have changed
01:46:38  Found 1 model, 1 snapshot, 3 seeds, 0 sources, 0 exposures, 0 metrics, 349 macros, 0 groups, 0 semantic models
01:46:38  
01:46:38  Concurrency: 1 threads (target='duckdb')
01:46:38  
01:46:38  1 of 1 START snapshot main.my_snapshot ......................................... [RUN]
01:46:38  1 of 1 OK snapshotted main.my_snapshot ......................................... [success in 0.34s]
01:46:38  
01:46:38  Finished running 1 snapshot in 0 hours 0 minutes and 0.51 seconds (0.51s).
01:46:38  
01:46:38  Completed successfully
01:46:38  
01:46:38  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
| id | col_1 | col_2 | col_3  | col_4  | dbt_valid_from | dbt_valid_to |
|----|-------|-------|--------|--------|----------------|--------------|
| 1  | abc1  | def2  | ghi3   | <null> | 2023-07-23     | 2023-07-24   |
| 2  | abc2  | def3  | ghi4   | <null> | 2023-07-23     | 2023-07-24   |
| 3  | abc3  | def4  | ghi5   | <null> | 2023-07-23     | 2023-07-24   |
| 1  | abc1  | def2  | ghi3   | jkl4   | 2023-07-24     | 2023-07-25   |
| 2  | abc2  | def3  | ghi4   | jkl5   | 2023-07-24     | 2023-07-25   |
| 3  | abc3  | def4  | ghi5   | jkl6   | 2023-07-24     | 2023-07-25   |
| 1  | abc1  | def2  | <null> | jkl4   | 2023-07-25     | <null>       |
| 2  | abc2  | def3  | <null> | jkl5   | 2023-07-25     | <null>       |
| 3  | abc3  | def4  | <null> | jkl6   | 2023-07-25     | <null>       |
| 4  | abc4  | def5  | <null> | jkl7   | 2023-07-25     | <null>       |

@dbeatty10 dbeatty10 added Refinement Maintainer input needed and removed triage labels Sep 21, 2023
@owen-mc-git
Copy link
Author

@dbeatty10 following up on this issue, have there been any new updates that can be shared, thanks

@graciegoheen
Copy link
Contributor

Hi @owen-mc-git - we opened up a new issue for this #10235 so am closing this one as a dupe :)

@graciegoheen
Copy link
Contributor

graciegoheen commented May 29, 2024

This is distinct from #10235 as this ticket is focused on handling cases where a source field is removed (vs. a source row, which is covered by #10235).

This feels similar to the on_schema_change config we have for incremental models - docs here.

@graciegoheen
Copy link
Contributor

graciegoheen commented May 30, 2024

@owen-mc-git Hi! Just wanted to clarify - is this request specific to when you have a snapshot with strategy: 'check' and check_cols: 'all' (or check_cols containing the column that has sense been removed)?

@owen-mc-git
Copy link
Author

owen-mc-git commented May 30, 2024

@graciegoheen When I encountered the issue I was using the strategy: "check" and check_cols:'all;
I was snapshotting reference type tables that do not have update columns but can have changes in source including columns being removed if the ref data model changes

I di not do any test to see what would happen for the same scenario under strategy timestamp

@graciegoheen
Copy link
Contributor

graciegoheen commented May 30, 2024

Thanks! I'll block off some time to confirm what happens with the timestamp strategy. Our docs say - "dbt will not delete columns in the destination snapshot table if they are removed from the source query" - but I would expect the field that was removed from the source to be updated to null in the snapshot for the same reason you said above, "Not having a new record created for data that is removed gives a false sense that the source still has these values." - so will confirm that's the case.

@jordivandooren
Copy link

The logic referred to in the potential solution above by @dbeatty10 has since moved to dbt-adapters. See #8906 and #9401.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Refinement Maintainer input needed snapshots Issues related to dbt's snapshot functionality
Projects
None yet
Development

No branches or pull requests

6 participants