Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-3179] [Feature] Handle multiple changes in one snapshot run #8760

Closed
3 tasks done
tajsg18 opened this issue Oct 3, 2023 · 1 comment
Closed
3 tasks done

[CT-3179] [Feature] Handle multiple changes in one snapshot run #8760

tajsg18 opened this issue Oct 3, 2023 · 1 comment
Labels
enhancement New feature or request snapshots Issues related to dbt's snapshot functionality wontfix Not a bug or out of scope for dbt-core

Comments

@tajsg18
Copy link

tajsg18 commented Oct 3, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

One run of dbt snapshot can currently only handle one change per unique key. It would be useful to support multiple changes to data in one dbt snapshot run. This would mean supporting also "immutable" tables (that contain new rows for every mutation of a key), not just mutable tables.

Even though "immutable" tables already keep history, it's not in an SCD2 format.

See an example below.

Input data:

name age update_time
Emily 30 2023-10-03T15:12:45.278Z
Emily 31 2023-10-03T15:13:02.056Z

Result of dbt snapshot:

name age update_time dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to
Emily 30 2023-10-03T15:16:14.187Z 29ae89d7a22d602801c0551b5bcaf93a 2023-10-03T15:16:14.187Z 2023-10-03T15:16:14.187Z null
Emily 31 2023-10-03T15:16:17.841Z 45fc06e3fb2503ed7ed5c8e841dc7476 2023-10-03T15:16:17.841Z 2023-10-03T15:16:17.841Z null

What I would expect:

name age update_time dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to
Emily 30 2023-10-03T15:16:14.187Z 29ae89d7a22d602801c0551b5bcaf93a 2023-10-03T15:16:14.187Z 2023-10-03T15:16:14.187Z 2023-10-03T15:16:17.841Z
Emily 31 2023-10-03T15:16:17.841Z 45fc06e3fb2503ed7ed5c8e841dc7476 2023-10-03T15:16:17.841Z 2023-10-03T15:16:17.841Z null

I propose to introduce this change with a configuration or snapshot strategy, as it would definitely produce different behavior and it could be a breaking change for current users (as also noted in 7138).

Describe alternatives you've considered

An alternative solution is to build a macro to transform "immutable" data to SCD2 format.
The disadvantage of this approach would be that different functionality would have a similar purpose: transform data to SCD2 format.

Who will this benefit?

Anybody who would like to bring "immutable" data to the same SCD2 format as dbt snapshot outputs.

Are you interested in contributing this feature?

If there is interest, yes.

Anything else?

No response

@tajsg18 tajsg18 added enhancement New feature or request triage labels Oct 3, 2023
@github-actions github-actions bot changed the title [Feature] Handle multiple changes in one snapshot run [CT-3179] [Feature] Handle multiple changes in one snapshot run Oct 3, 2023
@dbeatty10 dbeatty10 added the snapshots Issues related to dbt's snapshot functionality label Oct 3, 2023
@dbeatty10 dbeatty10 self-assigned this Oct 3, 2023
@dbeatty10
Copy link
Contributor

Thanks for a fantastic write-up @tajsg18 ! 💡

You did a great job of putting your finger on the most important piece here:

This would mean supporting also "immutable" tables (that contain new rows for every mutation of a key), not just mutable tables.

Even though "immutable" tables already keep history, it's not in an SCD2 format.

We recommend the alternative you described

We're inclined to follow the Unix DOTADIW principle ("Do One Thing And Do It Well") for snapshots and not expand it to cover the case you described. i.e., keep the intended domain of dbt snapshots restricted to capturing changes of current database tables rather than expanding it.

But you presented a totally legit and super common use-case!

To cover it, we recommend the alternative you described for tables that already have a full history of changes over time (populated from some kind of CDC source or audit / history tables). e.g., using lead/lag window functions to transform that source data into your desired format (whether that is an Type 2 SCD like snapshots, or some other SCD or CDC format).

Personally, I'd love to see someone contribute a dbt package of temporal macros and/or materializations related to SCD and put it up on the dbt Package Hub.

I'm going to close this as "not planned" for the reasons mentioned above.

@dbeatty10 dbeatty10 removed the triage label Oct 3, 2023
@dbeatty10 dbeatty10 removed their assignment Oct 3, 2023
@dbeatty10 dbeatty10 closed this as not planned Won't fix, can't repro, duplicate, stale Oct 3, 2023
@dbeatty10 dbeatty10 added the wontfix Not a bug or out of scope for dbt-core label Oct 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request snapshots Issues related to dbt's snapshot functionality wontfix Not a bug or out of scope for dbt-core
Projects
None yet
Development

No branches or pull requests

2 participants