
Make snapshot data staging atomic #1884

Closed
drewbanin opened this issue Nov 2, 2019 · 4 comments · Fixed by #2390
Labels
bigquery · bug (Something isn't working) · snapshots (Issues related to dbt's snapshot functionality) · snowflake

Comments

@drewbanin
Contributor

Describe the bug

Note: Further testing is required to determine that this bug exists, and that the description provided here accurately depicts the errant behavior and steps to reproduce it.

The dbt snapshot command selects from the source dataset twice: once to generate a set of inserts, and again to generate a set of updates. If these queries run outside of a transaction (i.e. on Snowflake or BigQuery), then changes to the underlying source data between the two queries can present two different views of the staging data to the snapshot query.

As such, it's possible for dbt to invalidate an existing snapshot record without inserting a corresponding new entry for that record. This leaves the snapshot table in a bad state: because there is no "current" record in the snapshot table, subsequent dbt snapshot invocations stop recording changes to the underlying source data for these "closed out" records, as identified by their unique keys.

Steps To Reproduce

This is in the category of a "race condition". It appears that this issue can manifest if the source data for a snapshot is modified between the execution of these two sql statements: https://github.com/fishtown-analytics/dbt/blob/f985902a002fba36f6f709c6aacf9ae20778e58c/core/dbt/include/global_project/macros/materializations/snapshot/snapshot.sql#L165-L174

We should either combine these queries into a single query, or stage the source data into a temporary table in a way that guarantees consistency between the two queries. The code here was written this way for simplicity, but there's no real reason why the two queries cannot be combined into one.
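The staging approach can be sketched outside of dbt with Python's sqlite3 (table and column names here are hypothetical, not dbt's actual macro code): materialize the source once into a temp table, and drive both passes from that staged copy so a concurrent write to the source cannot change the view between them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE source (id INTEGER, val TEXT, updated_at TEXT)")
cur.execute("INSERT INTO source VALUES (1, 'a', '2019-11-01')")

# Stage the source exactly once. Both the "insert" pass and the
# "update" pass below read from this staged copy, so a concurrent
# mutation of `source` cannot present two different views of the data.
cur.execute("CREATE TEMP TABLE staged AS SELECT * FROM source")

# Simulate a concurrent writer mutating the source mid-snapshot.
cur.execute("UPDATE source SET val = 'b', updated_at = '2019-11-02' WHERE id = 1")

insert_rows = cur.execute("SELECT * FROM staged").fetchall()
update_rows = cur.execute("SELECT * FROM staged").fetchall()

# Both passes see the identical pre-mutation view of the source.
assert insert_rows == update_rows
```

The mutation made after staging is simply not visible to either pass; it gets picked up on the next snapshot run, which is the expected behavior described below.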

Expected behavior

Mutations to source data during a dbt snapshot should be ignored and picked up in a subsequent snapshot job.

System information

Which database are you using dbt with?

  • bigquery
  • snowflake

The output of dbt --version:

>=0.14.0

Additional context

In addition to the primary defect noted here, dbt snapshot should be improved to handle (or help users fix) cases where there are no "current" records for a unique key in a snapshot table. This state is basically going to be dbt's fault 100% of the time, so dbt should make it easy to repair the snapshot table if it ever happens again.

Secondly, dbt should probably also enforce the constraints that it requires by default in snapshot definitions. If a table's primary key is non-unique, or if the updated_at timestamp is null in snapshots configured with the timestamp strategy, then dbt snapshot is going to fail very hard. We should help users determine when this is the case and prevent dbt snapshot from getting itself into a bad state.

@drewbanin drewbanin added bug Something isn't working snowflake bigquery snapshots Issues related to dbt's snapshot functionality labels Nov 2, 2019
@jrandrews

This caused some problems for us yesterday. We have a snapshot whose underlying source table added a column without incrementing the lastchanged field (don't ask, long story). So I tried to manually force a new version of each record in the underlying table for the snapshot by running this SQL command:

SELECT * EXCEPT (lastchanged), CURRENT_TIMESTAMP() AS lastchanged FROM {{ source('my_source_table') }}

Well, then this happened in the snapshot records (filtering for a stream of changes to one common id value; look at how the valid_to and valid_from timestamps overlap for the second and third rows):

[Screenshot: snapshot rows for a single id, showing overlapping valid_to/valid_from timestamps across consecutive versions]

My guess is that because it is two different SQL statements being run, the value of CURRENT_TIMESTAMP() was different at the exact moment that the two of them were being run. It would be great to get this issue addressed so things like this don't happen.

Luckily I have a backup of the snapshot from before I did this, so I can go back and re-try it another way, probably with a fixed timestamp like TIMESTAMP(DATETIME('2020-04-02 10:00:00'), 'UTC')
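The guess above is consistent with per-statement evaluation: each statement stamps rows with its own wall-clock time. A minimal sqlite3 sketch (hypothetical schema, not dbt's actual tables) of how two separately-stamped statements produce overlapping validity windows:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE snap (id INTEGER, version INTEGER, "
            "valid_from REAL, valid_to REAL)")

# An existing "current" record, valid since t=0.
cur.execute("INSERT INTO snap VALUES (1, 1, 0.0, NULL)")

# Statement 1: insert the new version, stamped at its own clock time.
t_insert = time.time()
cur.execute("INSERT INTO snap VALUES (1, 2, ?, NULL)", (t_insert,))

time.sleep(0.01)  # the gap between the two statements

# Statement 2: close out the old version, stamped slightly later.
t_update = time.time()
cur.execute("UPDATE snap SET valid_to = ? WHERE version = 1", (t_update,))

# The old version's valid_to now falls AFTER the new version's
# valid_from: the two validity windows overlap.
assert t_update > t_insert
```

Stamping both statements from a single value captured up front (or running both from one staged query) would keep the windows contiguous.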

@jrandrews

The reason, by the way, for the 9999-01-01 date in the dbt_valid_to column is that these SQL results were from a view that we have on top of the snapshot that casts NULLs in that column to "future eternity." But yes, I have checked in the underlying snapshot data, and the problem does exist there.

@jrandrews

Would this bug also cause cases where an "old" version of a record in a snapshot would be end-dated but a new record with a newer "lastchanged" timestamp would not be added? We're seeing this happen and regardless of how many times we run the "snapshot" command the new record never gets inserted.

@drewbanin
Contributor Author

@jrandrews I think it could! dbt runs:

  1. An insert for all brand-new records, plus new versions of existing records
  2. An update for all invalidated records

If a new data point which would invalidate an existing snapshot record is added to the snapshot query between (1) and (2), then the record in the snapshot table would be invalidated without a corresponding new version being inserted.
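That race can be sketched with sqlite3 (table and column names are hypothetical): a mutation landing between pass (1) and pass (2) lets the update pass close out a record that the insert pass never replaced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE source (id INTEGER, val TEXT)")
cur.execute("CREATE TABLE snapshot (id INTEGER, val TEXT, valid_to TEXT)")
cur.execute("INSERT INTO source VALUES (1, 'a')")
cur.execute("INSERT INTO snapshot VALUES (1, 'a', NULL)")

# Pass 1: insert new versions for changed records. Nothing has changed
# yet, so no rows are inserted.
changed = cur.execute(
    "SELECT s.id, s.val FROM source s JOIN snapshot t ON s.id = t.id "
    "WHERE t.valid_to IS NULL AND s.val != t.val").fetchall()
cur.executemany("INSERT INTO snapshot VALUES (?, ?, NULL)", changed)

# A concurrent writer mutates the source between the two passes.
cur.execute("UPDATE source SET val = 'b' WHERE id = 1")

# Pass 2: invalidate current records whose source row changed. This
# pass DOES see the mutation and closes out the record, even though
# pass 1 never inserted a replacement.
cur.execute(
    "UPDATE snapshot SET valid_to = 'now' WHERE valid_to IS NULL AND id IN "
    "(SELECT s.id FROM source s JOIN snapshot t ON s.id = t.id "
    "WHERE t.valid_to IS NULL AND s.val != t.val)")

current = cur.execute(
    "SELECT * FROM snapshot WHERE valid_to IS NULL").fetchall()
# No "current" record remains for id 1, and no snapshot run can
# bring it back: exactly the stuck state described in this thread.
assert current == []
```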

Lots of really compelling reasons to fix this issue for sure
