Problems & (Potential) Solutions for Snapshots #7018

dbeatty10 · 2023-02-21T03:32:50Z

dbeatty10
Feb 21, 2023
Maintainer

Some things people have run into

Problem 1: It is possible to introduce data that isn't actually unique
- 💡 Feature idea 1: What if we could check if values are unique and raise an error when applicable before performing the snapshot (like described here)?
Problem 2: Since updated_at is configurable, it is possible to run snapshots out of order (as seen here)
- 💡 Feature idea 2: What if we could check if the new data is being introduced that is out-of-order and raise and error before the data is snapshotted?
Problem 3: updated_at can be a data type that is "aware" while that given by the snapshot_get_time() macro is "naive" (like described here), which can lead to overlapping data.
- 💡 Feature idea 3: What if the snapshot_get_time() macro always gave a data type that was the most rich/aware available within the database?
- 💡 Feature idea 4: What if there were a way to configure an alternative macro to use instead of snapshot_get_time()? Maybe a config named snapshot_time that can be an expression or a call to a macro (that yields an expression)? See hard_deletes_updated_at below for an alternative.
- 💡 Feature idea 5: What if there were a way to configure the default timezone to use for naive timestamps within a snapshot?
- 💡 Feature idea 6: What if there were a way to configure the default timezone to use for all naive timestamps across an entire dbt project?
Problem 4: It is possible to configure the hard-delete timestamp expression on a per snapshot basis for the check strategy, but it is not possible for the timestamp strategy. Rather, the only option for timestamp strategy is to override the snapshot_get_time() macro which applies to all snapshots.
- 💡 Feature idea 7: What if there were a hard_deletes_updated_at configuration? See snapshot_time above for an alternative.
Problem 5: It is possible to have multiple instances of dbt running at the same time (like here).
- 💡 Feature idea 8: What if dbt could determine if there were multiple snapshots executing at the same time and raise an error for all instances subsequent to the one currently running?

Naive vs. Aware timestamps

One of the trickiest things that can happen within snapshots is when there are naive timestamps (rather than aware). In those cases, we need clear ways for the user to configure a "mutual agreement" with the dbt system how to interpret those timestamps when they are actively involved in the snapshot configuration. Up until this point, the approach has been that naive timestamp must be UTC and there is no way to specify anything else. Some users may want to use an adjacent column that contains the relevant UTC offset or time zone (or configure the time zone globally for the dbt project (like this) or specifically for one model).

Comparison of `current_timestamp` and `snapshot_get_time` macros across adapters

Click to toggle

There are (4) different data types for timestamps observed across databases and programming languages.

Java 8’s Time API (JSR-310) provides exemplars of each type (ordered here from most to least precise):

java.time.ZonedDateTime
java.time.OffsetDateTime
java.time.Instant
java.time.LocalDateTime

Snowflake

Snowflake has 3 of the 4 types (but not the most precise one):

Postgres

Postgres has 2 of the 4 types (one "aware" and one "naive"):

Redshift

Redshift has 2 of the 4 types (one "aware" and one "naive"):

BigQuery

BigQuery has 2 of the 4 types (one "aware" and one "naive"):

Spark

Spark only has 1 of the 4 types (and it is "aware"):

dbeatty10 · 2023-06-15T13:34:58Z

dbeatty10
Jun 15, 2023
Maintainer Author

Problem 6: When a row is no longer currently valid, dbt_valid_to is updated but dbt_updated_at is not (as described here). This effectively means it is treated as a "row created at" rather than a "row updated at")
Problem 7: dbt_updated_at is a "system time" for the check strategy (a la SQL:2011), but it is an "application time" for the timestamp strategy.

💡 Feature idea 9: What if dbt either added a new column named dbt_modified_at or actually updated dbt_updated_at at the same time as dbt_valid_to?
💡 Feature idea 10: What if dbt treated dbt_modified_at/dbt_updated_at as a "system time" for both snapshot strategies (since dbt_valid_from & dbt_valid_to represent the "application time" for both strategies)?
💡 Feature idea 11: To complement dbt_modified_at, what if dbt added an immutable dbt_created_at column (representing the "system time" that the row was inserted)?

1 reply

dbeatty10 Jun 23, 2023
Maintainer Author

All 7 of the problems listed above can affect the timestamp strategy.

The check strategy isn't affected by problems 4 and 7 and it is able to avoid 2 and 3 also as long as updated_at is left as the default.

So the default configuration of the check strategy is only affected by problems 1, 5, and 6.

jvanpee · 2023-08-17T19:58:16Z

jvanpee
Aug 17, 2023

Regarding the naïve timestamp approach. I feel this is a pretty big problem because it can result in the wrong data being written. In our case, we use Fivetran to load Snowflake. We use a timestamp snapshot strategy with the _FIVETRAN_SYNCED column as the updated_at config. We also include "where _fivetran_deleted = FALSE" on our snapshots so when the row is deleted in the source the snapshot will indicate the row is no longer valid. The _FIVETRAN_SYNCED column is defined as a timestamp_tz and therefore when the initial snapshot occurs the dbt_valid_from and dbt_valid_to columns are also defined as timestamp_tz. When a record is then deleted from the source, the dbt snapshots sets the dbt_valid_to column to snapshot_get_time(). Snapshot_get_time() however returns a timestamp_ntz data type in the UTC time zone. This UTC time is then implicitly converted to a timestamp_tz which stores the UTC value and adds the time zone offset of the session, which in our case is currently -0500. This time is now 5 hours in the future, which is not the correct time.

As mentioned I feel this a bug and should be addressed in dbt-core. Some ideas.
- The snapshot should automatically convert the column specified in the updated_at config to timestamp_ntz.
○ This will result in the dbt_valid_from and dbt_valid_to column also being timestamp_ntz.
- The snapshot should be aware of the data type and as a result snapshot_get_time() should return that same data type.
- Return an error if the column specified in the updated_at config is anything but timestamp_ntz.

I realize some of these ideas are breaking changes so maybe a new snapshot strategy should be introduced that allows users to make the switch when ready.

Workarounds
- Manually convert the updated_at column to timestamp_ntz in the snapshot definition.
- Ensure every user performing the snapshot is configured in the UTC time zone.
- Utilize soft deletes in the source to avoid dbt using the hard delete snapshot logic.
○ This will require modifying downstream models so they account for the deleted flag.
- Override the snapshot_get_time() macro.
○ This is what we did. More details below.

We found that overriding the snapshot_get_time() was the best work around for us. I don't like the solution and it is a little risky but we changed the Snowflake command to:

to_varchar(convert_timezone('UTC', current_timestamp()),'YYYY-MM-DD HH24:MI:SS.FF3 TZHTZM')

This converts the timestamp to a varchar which retains the offset information so when it is stored to timestamp_tz or timestamp_ltz the offset is correct. When stored to a timestamp_ntz column it is also correct and in UTC because the offset is trimmed away. As mentioned this is risky because I am not sure all the places snapshot_get_time() is used and changing the data type could be problematic.

0 replies

T-Dunlap · 2023-09-14T00:06:23Z

T-Dunlap
Sep 14, 2023
Collaborator

@graciegoheen & @dbeatty10 - A few more features to add to the list, as requested by some of our high profile dbt-ers...

💡 Feature idea 12: Add a config that customizes the dbt_valid_to column. Companies often add an end date in the far future (9999-12-31) to accommodate their downstream processes and tools (like Tableau). This is a common ask, and our general advice is to create a downstream model to add this logic, but that necessitates another layer in the DAG! Wouldn't it be great if this were another config?
💡 Feature idea 13: Add a config such as insert_hard_delete_row= 'true' that... (1) inserts a new row for the hard deletes, (2) adds a flag indicating is_hard_delete = 1 (3) does not apply a dbt_valid_to date (4) deprecates the preceding record with a dbt_valid_to date. Often times, companies are stuck in situations where a hard delete occurs, only to re-appear at a later date (I call them Zombie Records). The best way to handle this is to acknowledge the hard delete while still being able to track history.

1 reply

graciegoheen May 29, 2024
Maintainer

For idea 12 - #10187
For idea 13 - #10235

jeremyyeo · 2023-09-28T00:22:16Z

jeremyyeo
Sep 28, 2023
Collaborator

TL;DR: Snapshots do not support renaming their metafields (e.g. dbt_valid_from) today.

Adding another feature here - some users want to customize the dbt metafields - for example, modify:

dbt_valid_from > acme_start_at
dbt_valid_to   > acme_end_at

Some more advanced dbt-users may even know of the "trick" of modifying dbt's built in behaviour of materializations that come out of the box - i.e. overwrite the macros that come with dbt (https://github.com/dbt-labs/dbt-core/tree/f65e4b6940a0775a0c7fca1a54d9754ef954e926/core/dbt/include/global_project/macros/materializations/snapshots) - however, this DOES NOT WORK because dbt spefically checks for the existence of the columns (dbt_scd_id, dbt_valid_from, dbt_valid_to) exactly.

dbt-core/core/dbt/adapters/base/impl.py

Lines 689 to 721 in f65e4b6

    
               def valid_snapshot_target(self, relation: BaseRelation) -> None: 
        
                   """Ensure that the target relation is valid, by making sure it has the 
        
                   expected columns. 
        
                   :param Relation relation: The relation to check 
        
                   :raises InvalidMacroArgType: If the columns are 
        
                       incorrect. 
        
                   """ 
        
                   if not isinstance(relation, self.Relation): 
        
                       raise MacroArgTypeError( 
        
                           method_name="valid_snapshot_target", 
        
                           arg_name="relation", 
        
                           got_value=relation, 
        
                           expected_type=self.Relation, 
        
                       ) 
        
                   columns = self.get_columns_in_relation(relation) 
        
                   names = set(c.name.lower() for c in columns) 
        
                   expanded_keys = ("scd_id", "valid_from", "valid_to") 
        
                   extra = [] 
        
                   missing = [] 
        
                   for legacy in expanded_keys: 
        
                       desired = "dbt_" + legacy 
        
                       if desired not in names: 
        
                           missing.append(desired) 
        
                           if legacy in names: 
        
                               extra.append(legacy) 
        
                   if missing: 
        
                       if extra: 
        
                           raise SnapshotTargetIncompleteError(extra, missing) 
        
                       else: 
        
                           raise SnapshotTargetNotSnapshotTableError(missing)

The typical/recommended method currently is to create a view on top of the snapshot which renames those metafields to whatever you want them to be.

1 reply

graciegoheen May 29, 2024
Maintainer

Opened an issue for this #10185 :)

dbeatty10 · 2023-10-25T18:22:18Z

dbeatty10
Oct 25, 2023
Maintainer Author

@T-Dunlap, @graciegoheen, and I got a chance to discuss snapshots on a video call today. Some of the items we discussed:

💡 Feature idea 14: Be able to rename metafields (e.g. dbt_valid_from, dbt_valid_to, etc.) as mentioned by @jeremyyeo above
💡 Feature idea 15: Support additional types of Slowly Changing Dimensions (e.g., type 6, etc.)
Being able to set dbt_valid_to to a far in the future date (instead of null)

0 replies

karenderer · 2024-05-30T19:59:29Z

karenderer
May 30, 2024

Hi, are there any plans for handling data type changes to a snapshotted table? Reading this thread, the solutions suggested are fairly manual, it would be amazing if dbt handled this more gracefully (e.g. some ability to set default behavior for example, renaming the column with the previous datatype and creating a new column) https://discourse.getdbt.com/t/snapshots-when-column-data-type-changes/10452/2

2 replies

dbeatty10 May 31, 2024
Maintainer Author

💡 This is an interesting idea @karenderer.

If the previous column was named customer_id, and its datatype were changing from an alphanumeric string into an integer, what are you thinking you'd name each of the columns in the next snapshot?

Presumably you'd backfill previous rows for the new column with nulls? And the old column would also have null for any new rows?

karenderer Jun 10, 2024

For column names, I'd be ok with something like customer_id__dbt_snapshot__string and customer_id__dbt_snapshot__int.

Presumably you'd backfill previous rows for the new column with nulls? And the old column would also have null for any new rows?

Yes, exactly!

dbeatty10 · 2024-05-31T21:04:18Z

dbeatty10
May 31, 2024
Maintainer Author

We’ve launched an initiative this quarter to give snapshots some ♥️ love ♥️ and:

improve their usability
ensure their accuracy

We'll be running a community feedback session in a couple weeks about dbt snapshots:

Thursday, 13 June, 8am Pacific: dbt snapshots as a first class citizen

Please join us to see the problems we’re trying to solve and the designs we’re considering.
We'd love to have you join and share your feedback with us!

Some supporting resources:

👉 Register here 👈

1 reply

dbeatty10 Jun 13, 2024
Maintainer Author

Recording of the session today -- awesome to have so many of you join and share your thoughts!

https://dbtlabs.zoom.us/rec/share/6eNFolKy4uJHS4FpMRpfTVLX2TP5CRk0dZi0igyaUNUjguJZ0BnsOyEqw-SD5m7n.mPw-3rBRPz03bG8W
Passcode: !9N6VXsv

jenna-jordan · 2024-06-04T17:59:52Z

jenna-jordan
Jun 4, 2024

Jumping in on this discussion, I'm not sure if I have a snapshot problem/solution or more of a general question, but I was wondering if this post from Claire is still the best/most recent recommendation around handling snapshots across different environments. Basically, what is the recommended config for snapshots when (for example) you have 3 different databases for dev, staging, and prod? And, how does defer possibly play into this? Since snapshots are more like sources than models, is there special behavior around how snapshots are handled... or should there be?

1 reply

graciegoheen Jun 12, 2024
Maintainer

Just opened up this issue - would love to get your take!

mikkosulonen · 2024-06-06T05:42:08Z

mikkosulonen
Jun 6, 2024

Hi!

Jumping a bit late to the discussion but here's some changes to snapshot that we've had to make:

Running snapshots per batch on a loop.

We get data from sources in a batch. One batch = one load = 1-n files
We want to track history in Snapshots over all of the unloaded batches. Think failed dbt loads over a weekend; we can easily have multiple unloaded batches waiting.
Example:
- 4 unloaded batches:
- A PK is inserted in Batch 1
- Updated in Batch 2
- Deleted in Batch 3
- Re-inserted in Batch 4
Our solution has been to modify the snapshot SQL to have a "batch_identifier" column, which is sortable. In snapshot macros, the unloaded batches are identified, and then executed one batch at a time starting from the oldest, thus gathering history in the correct order.

Some sources have hard-deletes happening inside a window. In essence, they are partial loads of a source table: "Always extract previous 7 days in one batch"

For these, we've added a new strategy to check for the min an max values of the window in question.
Only inside that window, hard-deletes are tracked.
The window column can be datetime, integer, or even a list of values

For some really huge tables (in the order 10s of Billions of rows), we've added yet another optimization (which doesn't allow for hard-deletes)

During the snapshot staging table, for already snapshotted data, we limit the target search to only the PKs present in the snapshot source. This resulted in huge improvements in Snowflake execution times.

1 reply

robinsmithuk Jun 13, 2024

I would love to see how we could accomplish this.

This is one of the main issues that we are facing when "Type 2"ing our data and stopping us from adopting snapshots. From my understanding of snapshots, whereby we receive multiple files which may affect a "record", we want to see all of the history and not just the latest, which I believe with snapshots will be lost. i.e. as the snapshot functionality compares between the snapshots not how the record got to where it is now.

If there is a way to capture the grain between that would be amazing. As at the moment we have created a lot of custom macro's to handle this.

jeremyyeo · 2024-06-07T02:39:47Z

jeremyyeo
Jun 7, 2024
Collaborator

Randomly came back here and couldn't believe no one called out the fact that we don't yet have a good CI workflow for snapshots. This one always catches folks out because they have some logic that varies the target_schema config by environment... e.g.

-- macros/gsn.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if target.name == 'prod' -%}
        {{ custom_schema_name }}
    {%- elif target.name == 'ci' -%}
        {{ default_schema }}
    {%- endif -%}
{%- endmacro %}

-- snapshots/snappy.sql
{% snapshot snappy %}
{{
    config(
        target_schema=generate_schema_name('snapshots'),
        unique_key='id',
        strategy='check',
        check_cols='all'
    )
}}

select 1 id, 'alice' as first_name
{% endsnapshot %}

When target.name == 'ci', the snapshots schema resolve to dbt_cloud_pr_123_456 but when target.name == 'prod', then the snapshots schema is snapshots. This trips the state:modified flag.

The outcome of something like this is that snapshots are then always marked "state:modified" in CI runs, and because snapshots are usually at the start of the DAG, users always have confusion as to why are so many models being executed in my CI run (state:modified+) even if I barely modified anything.

snappy > model_1 > model_2 > model_3
foo > bar

^ I modify foo only but because I have "dynamic target schema logic" for snapshots, the whole snappy chain get's rebuilt with dbt build -s state:modified+.

https://discourse.getdbt.com/t/using-dynamic-schemas-for-snapshots/1070/5

https://discourse.getdbt.com/t/why-are-ci-jobs-state-method-building-more-less-models-nodes-than-expected/7504/4

In any case, snapshots are usually snapping data that is ephemeral or changing without history - that in itself makes it difficult to have a CI workflow unlike just a normal typical model.

1 reply

graciegoheen Jun 12, 2024
Maintainer

Just opened up this issue - would love to get your take!

bruno-szdl · 2024-06-07T19:03:17Z

bruno-szdl
Jun 7, 2024

Jumping into the discussion to add one suggestion/idea. Sorry if this was already discussed, I couldn't find it.

One complaint I hear from time to time is that you can't create a snapshot if the source has duplicates.

Imagine we have a source like

id	status	date
1	created	2024-01-01
1	processed	2024-02-01

It would be nice if the snapshot, in its first run, could read this source and be built like

id	status	date	dbt_valid_from	dbt_valid_to
1	created	2024-01-01	2024-01-01	2024-02-01
1	processed	2024-02-01	2024-02-01	null

We could have something similar to incremental models

If the snapshot does not exist in the platform (first run), create a history of the source's data, instead of trying to insert only
if the snapshot exists in the platform, then go the default way

It would only work for the timestamp strategy, because the snapshot must know what is older and what is newer.

Just an idea, maybe there are other ways to do it. But just because it is bad when we can't create a snapshot on sources that already have some history.

By the way, loved that snapshots are in the spotlight! 🧡

6 replies

graciegoheen Jun 10, 2024
Maintainer

But just because it is bad when we can't create a snapshot on sources that already have some history.

Makes sense! Thank you for sharing and appreciate the example. Feels somewhat similar in spirit to this issue #10236 on validating uniqueness before snapshotting. But in your case, you would want the snapshot to "just work" when you do have "duplicates" in your source table - and translate those to the appropriate snapshot entries.

How are you handling this today? Are you just unable to snapshot sources like this? Or do you have a work-around in place?

graciegoheen Jun 10, 2024
Maintainer

I want the option of a snapshot that just appends a select * of a model, without checking for uniqueness

Would this not just be an incremental model with insert strategy? Or you still want the snapshot meta columns?

bruno-szdl Jun 11, 2024

But just because it is bad when we can't create a snapshot on sources that already have some history.

Makes sense! Thank you for sharing and appreciate the example. Feels somewhat similar in spirit to this issue #10236 on validating uniqueness before snapshotting. But in your case, you would want the snapshot to "just work" when you do have "duplicates" in your source table - and translate those to the appropriate snapshot entries.

How are you handling this today? Are you just unable to snapshot sources like this? Or do you have a work-around in place?

Exactly! You wrote a better definition than I did xD

In the snapshot's first run, dbt could translate the source to a snapshot format even if it has duplicates (it would need a date/timestamp column to work).

I don't have a workaround :(

I only snapshot new sources or sources that don't have duplicates.

karenderer Jun 11, 2024

Would this not just be an incremental model with insert strategy? Or you still want the snapshot meta columns?

That's true an incremental model append strategy would work if you know that's what you want beforehand.

I've seen a case where a snapshot model was successful for a few days and then it turned out the identified unique key not in fact unique. I suppose at that point we could give up on the snapshot model and make a new incremental model.

But if we want to keep using the same snapshot table then our workaround for now is to update the unique key with a surrogate key by concatenating multiple fields. I've also thought about forcing a primary key by generating a row_number on a partition but we haven't actually put that into practice yet.

andreqaugusto Jun 11, 2024

I had this issue in my previous job. We wanted to create a SCD type 2 model based on a source that was always appending data (think like CDC events coming from a backend table).

With the out-of-the-box support for valid_from, valid_to columns, dbt snapshot seemed to be the right tool for the job. However, it does not handle sources with appending data well enough. The solution was to do a new macro to handle these situations, inspired by this answer from another issue.

jeff-skoldberg-gmds · 2024-06-10T13:30:10Z

jeff-skoldberg-gmds
Jun 10, 2024

Problem:

dbt Snapshots build in prod, even when people are working in dev.

The rationale:

https://docs.getdbt.com/faqs/Snapshots/snapshot-target-schema

Thoughts:

This never made sense to me.

It should be impossible to write to prod when your target is dev
dbt markets to junior analytics engineers who need to do a lot of experimentation
It seems to accomplish the opposite of what is intended. The intention was "snapshots are super important so lets source them in prod" be instead, people inadvertently make changes to their prod objects without SDLC.

If you need prod-like data in your dev schema, you can defer or clone. If your warehouse doesn't support clone, just create a copy of the prod table in the usual way, CTAS.

Solution:

We need to consider backwards compatibility. Abruptly changing the default behavior would be disruptive.
A config called something like respect_target_schema, the default is False. All we need to do is add the config and set True and it will build our Snapshots in Dev.

2 replies

graciegoheen Jun 12, 2024
Maintainer

Just opened up this issue - would love to get your take!

jeff-skoldberg-gmds Jun 12, 2024

Graceie, I 💚 it. Thank you!!

alison985 · 2024-06-10T16:05:07Z

alison985
Jun 10, 2024

I don't have time to read this thread today sadly, but I have a design pattern(s) relying on current timestamp behavior. I hope sharing it earlier than later is useful. I presented it at a Chicago Meetup earlier this year. Slide deck. Example code.

0 replies

jeff-skoldberg-gmds · 2024-06-10T16:19:49Z

jeff-skoldberg-gmds
Jun 10, 2024

At one client, we had to modify the snapshot macro due to this:
https://discourse.getdbt.com/t/snapshot-behavior/9034

3 replies

graciegoheen Jun 10, 2024
Maintainer

Thanks for bringing this up! I've heard people call these "zombie records" - records that are deleted, and then later come back to life. Here's the issue we've opened up about this case, would love to get your thoughts :)

jeff-skoldberg-gmds Jun 11, 2024

Hi @graciegoheen , this looks like a great solution. But I want to mention one difference between the issue I faced (discourse thread) and this the issue you linked to.

In this highlighted text, the issue they are calling out is a gap in time.

My issue, however, was the record was never re-inserted. I believe the difference has to do with strategy. I think when you use strategy='timestamp' you get the gap. But when you use strategy='check' and the upstream model has some surrogate_key field to check, I think the result is nothing is re-inserted.

Our result after the zombie came back was:

order_id	dbt_valid_from	dbt_valid_to
1	2024-05-19	2024-05-20
2	2024-05-19

But yes, I do think the solution proposed above will solve the problem for us!

graciegoheen Jun 11, 2024
Maintainer

Ah I see - thanks for explaining that difference. The timestamp strategy relies on the updated_at column to track changes to your source records. So if order_id 1 is deleted and then comes back but the updated_at column does not change, then it will be ignored (i.e. nothing is re-inserted). This is true for all sorts of changes when using the timestamp strategy. The only way a new record would be inserted with the timestamp strategy is if updated_at field is changed.

bruno-szdl · 2024-06-11T21:40:37Z

bruno-szdl
Jun 11, 2024

I asked my network for suggestions and here is what they said:

"Make snapshot models a part of /models folder" and "config the snapshot materialization in the config block, like other materializations". I think these two would be achieved with the option 2 of this issue here [Feature] No more jinja block for snapshots - new snapshot design ideas #10246

Dmitry Pochuev
Make snapshot models a part of /models folder

Axel Thevenot
If it can be considered as a materialization on its own like incremental is, this would be for me the best improvement we can have 🙏
snapshots should not be built in prod when running in dev (Jeff explained his suggestion above)
Change the name from snapshot to something like SCD table.

Jeff Skoldberg
The other thing I don't like about snapshots is the name.
To me, a snapshot is when you insert the entire dataset every time. Like an incremental with no unique key at a unique key that includes current_date, so it takes only one snapshot per day.
Make snapshots like SCD type 7. Add a 'is_current' flag.

Cédric Olivier
Thanks for this information.

I personally overwrite the macros so that the snapshots correspond to a Type 7 Slow Changing Dimension - https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_7:_Hybrid%5B4%5D_-_Both_surrogate_and_natural_key

My implementation rules :

Individual or identical batch processing

The result of dbt snapshots must be the same whether the photos supplied as input are processed one after the other, or whether a multitude of photos are processed in the same batch.

Snapshot usage has to be as simple as possible.

dbt_valid_to is not null and set to 9999-12-31 for current line, allowing operator without be carefull about null.

Functional snapshot

Dbt Snapshot is used to made historisation on dimension. update_at value is the real dbt_valid_from (and dbt_valid_to) reference. Use of current_date ou current_timestamp is a non-sense.

I have done that on bigquery, and it's in our production for 2 years now.
Add some columns attributes like INSERT_TIMESTAMP, UPDATE_TIMESTAMP, and DELETE_TIMESTAMP.

BELAID C
Hello many thank's for this improvement,

Perhaps consider adding some SCD 2 columns attributes like INSERT_TIMESTAMP, UPDATE_TIMESTAMP, and DELETE_TIMESTAMP (if hard delete is activated). One of my favorite approaches is to retrieve these columns along with the EVENT_TYPE (Insert, Update, or Delete) and push them to the final table.

This would make it cleaner and easier to get incremental data for upstream models. and building DMT models based on Snapshot.

Moreover, this information is already available in the original code and is only used for merging. I did this change on custom macro and it more simple to use it.

Best regards,
Support for Backfills. This one is similar to my suggestion above on snapshots being able to snapshot sources with duplicates.

Erich Silva
Probably adding support for backfills? I believe this one might come up as a suggestion

Again, thanks for opening this discussion!

1 reply

jeff-skoldberg-gmds Jun 11, 2024

I debated starting a new comment with the "perhaps we should rename dbt Snapshots", but glad you posted it here.
Maybe they should be called scd tables or something. (Although sometimes I use them to track changes in facts, haha).
Usually I name my snapshot models <thing>_scd, so it would feel coherent to me at least of the model type was called snapshots.

Oh, can we call them models? I think we should be able to. LOL. I'm not sure why they would be considered a non-model materialization. So... +1 for what Dmitry Pochuev said above on the LinkedIn thread.

jenna-jordan · 2024-07-25T16:28:32Z

jenna-jordan
Jul 25, 2024

I've seen an antipattern where incremental models are used when snapshots should be used instead - since incremental models act so much like "upserts", it is very tempting to use them to accumulate data over time (with the intention of never fully refreshing them). But, using incremental models in this way breaks idempotency, and this is where snapshots should be used. I wonder if in the same way that model versions get a view that reflects the current version, snapshots could also all get a view created that is the more user friendly way to query the accumulated data (functionally what the incremental model version would look like, just getting the most recent version of the data).

7 replies

dbeatty10 Jul 25, 2024
Maintainer Author

Were you thinking syntax like this to get only the current data according to the snapshot?

select * from {{ ref("customers_snapshot", at="now") }}

And also be able to take a slice of the snapshot like this?

select * from {{ ref("customers_snapshot", at="2021-05-03 15:26:00 +00") }}

Or were you thinking some other kind of syntax?

jenna-jordan Jul 26, 2024

I think it would be handy to be able to grab the snapshot slice for any datetime, with perhaps a default to "now". And I like condensing the slicing into the downstream ref statement - that makes it work even more like model version, which is very good from a design cohesion POV.

tommyh Oct 24, 2024

@jenna-jordan - I 100% agree with you. To echo it back: incremental models can be accidentally misused to materialize the "point in time value" for a (joined in) column. While doing that is "nice" from the perspective that it works initially, it backs you into a corner because now you can't refresh the model without overwriting that historical data.

My main issue with using snapshots instead of incremental models for those situations (ie: snapshot__dim_users), is that you often want todo that in the middle of your dbt dag. Imagine the following dag where each model relies on the one before it:

stg_users (needs dbt run)
dim_users (needs dbt run)
snapshot__dim_users (needs dbt snapshot)
top_users_over_time (needs dbt run)

Right now, to solve that problem, there are 2 types of solutions:

make each "data model layer" it's own dbt project (mesh-y), which has a standard flow of (dbt_snapshot, dbt run)
keep it in one dbt project, but use a crazy set of tags to keep it orgranized (dbt snapshot -s tag:snapshot_sources, dbt run -s tag:staging_dim, dbt snapshot -s tag:snapshot_dim, dbt run -s tag:later_models)

If anyone has any other solutions, I'd love to know! :)

I believe the feature of "rewrite dbt snapshots to be a vanilla model which is executed by dbt run" was raised by @graciegoheen in a different dbt issue/brainstorm, but it was out of scope because of how difficult it would be to rewrite all of snapshots (which totally makes sense!).

Just wanted to raise it, so that I could connect those orthogonal ideas. @dbeatty1

mikkosulonen Oct 24, 2024

Right now, to solve that problem, there are 2 types of solutions:

make each "data model layer" it's own dbt project (mesh-y), which has a standard flow of (dbt_snapshot, dbt run)
keep it in one dbt project, but use a crazy set of tags to keep it orgranized (dbt snapshot -s tag:snapshot_sources, dbt run -s tag:staging_dim, dbt snapshot -s tag:snapshot_dim, dbt run -s tag:later_models)
If anyone has any other solutions, I'd love to know! :)

There's https://docs.getdbt.com/reference/commands/build 'dbt build' command to do that now 👍

The dbt build command will:

run models
test tests
snapshot snapshots
seed seeds
In DAG order, for selected >resources or an entire project.

jvanpee Oct 24, 2024

Our approach is to only snapshot sources, not dbt models. Then utilize a dbt model over those snapshots to re-create historical values.

So, in your case, you will need to snapshot every source that goes into stg_users and dim_users then use those snapshots in top_users_over_time.

The concern with taking a snapshot of a model is that a model definition can change. If a model definition changes, the historical values may not be able to be re-created. If however, the raw data in the sources is being captured, the historical values can be re-created with different definitions of the model.

I hope that helps.

Problems & (Potential) Solutions for Snapshots #7018

dbeatty10 Feb 21, 2023 Maintainer

Some things people have run into

Naive vs. Aware timestamps

Comparison of current_timestamp and snapshot_get_time macros across adapters

Snowflake

Postgres

Redshift

BigQuery

Spark

Replies: 16 comments · 28 replies

dbeatty10 Jun 15, 2023 Maintainer Author

dbeatty10 Jun 23, 2023 Maintainer Author

T-Dunlap Sep 14, 2023 Collaborator

graciegoheen May 29, 2024 Maintainer

jeremyyeo Sep 28, 2023 Collaborator

graciegoheen May 29, 2024 Maintainer

dbeatty10 Oct 25, 2023 Maintainer Author

dbeatty10 May 31, 2024 Maintainer Author

dbeatty10 May 31, 2024 Maintainer Author

dbeatty10 Jun 13, 2024 Maintainer Author

graciegoheen Jun 12, 2024 Maintainer

jeremyyeo Jun 7, 2024 Collaborator

graciegoheen Jun 12, 2024 Maintainer

graciegoheen Jun 10, 2024 Maintainer

graciegoheen Jun 10, 2024 Maintainer

Problem:

The rationale:

Thoughts:

Solution:

graciegoheen Jun 12, 2024 Maintainer

graciegoheen Jun 10, 2024 Maintainer

graciegoheen Jun 11, 2024 Maintainer

dbeatty10 Jul 25, 2024 Maintainer Author

dbeatty10
Feb 21, 2023
Maintainer

Comparison of `current_timestamp` and `snapshot_get_time` macros across adapters

Replies: 16 comments 28 replies

dbeatty10
Jun 15, 2023
Maintainer Author

dbeatty10 Jun 23, 2023
Maintainer Author

T-Dunlap
Sep 14, 2023
Collaborator

graciegoheen May 29, 2024
Maintainer

jeremyyeo
Sep 28, 2023
Collaborator

graciegoheen May 29, 2024
Maintainer

dbeatty10
Oct 25, 2023
Maintainer Author

dbeatty10 May 31, 2024
Maintainer Author

dbeatty10
May 31, 2024
Maintainer Author

dbeatty10 Jun 13, 2024
Maintainer Author

graciegoheen Jun 12, 2024
Maintainer

jeremyyeo
Jun 7, 2024
Collaborator

graciegoheen Jun 12, 2024
Maintainer

graciegoheen Jun 10, 2024
Maintainer

graciegoheen Jun 10, 2024
Maintainer

graciegoheen Jun 12, 2024
Maintainer

graciegoheen Jun 10, 2024
Maintainer

graciegoheen Jun 11, 2024
Maintainer

dbeatty10 Jul 25, 2024
Maintainer Author