Make snapshot data staging atomic #1884
This caused some problems for us yesterday. We have a snapshot. A column was added to the underlying source table without incrementing the lastchanged field (don't ask, long story). So I tried to manually force a new version of the record in the underlying table for the snapshot by running this SQL command:
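(The original command isn't reproduced in this thread; a hypothetical sketch of that kind of statement, with an illustrative table name and filter:)

```sql
-- Manually bump lastchanged so the snapshot's timestamp strategy
-- treats the row as a new version on the next `dbt snapshot` run.
-- Hypothetical reconstruction; table name and id are placeholders.
UPDATE source_table
SET lastchanged = CURRENT_TIMESTAMP()
WHERE id = 42;
```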
Well, then this happened in the snapshot records (filtering for a stream of changes to one common id value; look at how the valid_to and valid_from timestamps overlap for the second and third rows). My guess is that because two different SQL statements are being run, the value of CURRENT_TIMESTAMP() was different at the exact moment each of them executed. It would be great to get this issue addressed so things like this don't happen. Luckily I have a backup of the snapshot from before I did this, so I can go back and re-try it another way, probably by hard-coding something like TIMESTAMP(DATETIME('2020-04-02 10:00:00'), 'UTC').
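A minimal sketch of that suspected mechanism, assuming the insert runs before the update (table, column names, and predicates here are illustrative, not the actual macro code):

```sql
-- Statement 1: insert the new version of the record. Its dbt_valid_from
-- is stamped with CURRENT_TIMESTAMP() evaluated at time T1.
INSERT INTO snapshot_table (id, dbt_valid_from, dbt_valid_to)
VALUES (42, CURRENT_TIMESTAMP(), NULL);

-- Statement 2: close out the superseded version (the real macro joins to
-- staged source data to pick these rows; the predicate here is simplified).
-- Its dbt_valid_to is stamped with CURRENT_TIMESTAMP() evaluated at a later
-- time T2, so the old row's validity window overlaps the new row's by T2 - T1.
UPDATE snapshot_table
SET dbt_valid_to = CURRENT_TIMESTAMP()
WHERE id = 42
  AND dbt_valid_to IS NULL
  AND dbt_valid_from < (SELECT MAX(dbt_valid_from)
                        FROM snapshot_table
                        WHERE id = 42);
```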
The reason, by the way, for the 9999-01-01 date in the dbt_valid_to column is that these SQL results came from a view we have on top of the snapshot that casts NULLs in that column to "future eternity." But yes, I have checked the underlying snapshot data, and the problem exists there as well.
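(Roughly this kind of view, sketched here in BigQuery syntax with placeholder names:)

```sql
-- Cast the open-ended NULL in dbt_valid_to to a "future eternity"
-- sentinel so consumers can use simple range predicates.
CREATE OR REPLACE VIEW my_snapshot_view AS
SELECT
  * EXCEPT (dbt_valid_to),
  COALESCE(dbt_valid_to, TIMESTAMP '9999-01-01 00:00:00 UTC') AS dbt_valid_to
FROM my_snapshot;
```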
Would this bug also cause cases where an "old" version of a record in a snapshot gets end-dated but a new record with a newer "lastchanged" timestamp never gets added? We're seeing this happen, and no matter how many times we run the "snapshot" command, the new record never gets inserted.
@jrandrews I think it could! dbt runs two statements back to back: (1) a query that generates the set of new record versions to insert, then (2) a query that generates the set of updates invalidating superseded records.
If a new data point which would invalidate an existing snapshot record is added to the snapshot query's source between (1) and (2), then the record in the snapshot table would be invalidated without a corresponding new version being inserted. Lots of really compelling reasons to fix this issue, for sure.
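To make the race window concrete, here is an illustrative sequence under the timestamp strategy (simplified SQL with placeholder names, not the actual macro code):

```sql
-- (1) Stage inserts: no new version of id 42 exists yet, so nothing
--     qualifies and no replacement row is staged.
INSERT INTO snapshot_table (id, lastchanged, dbt_valid_from, dbt_valid_to)
SELECT s.id, s.lastchanged, CURRENT_TIMESTAMP(), CAST(NULL AS TIMESTAMP)
FROM source_table s
LEFT JOIN snapshot_table t
  ON t.id = s.id AND t.dbt_valid_to IS NULL
WHERE t.id IS NULL OR s.lastchanged > t.lastchanged;

-- <-- a newer version of id 42 lands in source_table at this moment -->

-- (2) Stage updates: the new source row now invalidates the open snapshot
--     record for id 42, closing it out even though step (1) never inserted
--     a replacement. The key is left with no row where dbt_valid_to IS NULL.
UPDATE snapshot_table t
SET t.dbt_valid_to = CURRENT_TIMESTAMP()
FROM source_table s
WHERE t.id = s.id
  AND t.dbt_valid_to IS NULL
  AND s.lastchanged > t.lastchanged;
```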
Describe the bug
Note: Further testing is required to determine that this bug exists, and that the description provided here accurately depicts the errant behavior and steps to reproduce it.
The `dbt snapshot` command `select`s from the source dataset twice, once to generate a set of inserts, and then again to generate a set of updates. If these queries are run outside of a transaction (i.e. on Snowflake or BigQuery), then it's possible for changes to the underlying source data to present two different views of the staging data to the snapshot query. As such, it's possible for dbt to invalidate an existing snapshot record without correspondingly inserting a new entry for that record. This leaves the snapshot table in a bad state: since there is no "current" record in the snapshot table, subsequent `dbt snapshot` invocations do not record changes to the underlying source data for "closed out" records, as identified by their unique keys.

Steps To Reproduce
This is in the category of a "race condition". It appears that this issue can manifest if the source data for a snapshot is modified between the execution of these two SQL statements: https://github.com/fishtown-analytics/dbt/blob/f985902a002fba36f6f709c6aacf9ae20778e58c/core/dbt/include/global_project/macros/materializations/snapshot/snapshot.sql#L165-L174
We should either combine these queries into a single query, or stage the source data into a temporary table in a way that guarantees consistency between the two queries. The code here was written this way for simplicity, but there's no real reason why the two queries cannot be combined into one.
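For illustration, the temp-table approach could look something like this (a sketch with placeholder names, assuming the timestamp strategy; not the actual patch):

```sql
-- Stage the source exactly once. Every downstream statement reads this
-- frozen copy, so they all see a single consistent view of the data and
-- share a single snapshot timestamp.
CREATE TEMPORARY TABLE snapshot_staging AS
SELECT s.*, CURRENT_TIMESTAMP() AS dbt_snapshot_at
FROM source_table s;

-- Inserts derive from the staged copy...
INSERT INTO snapshot_table (id, lastchanged, dbt_valid_from, dbt_valid_to)
SELECT g.id, g.lastchanged, g.dbt_snapshot_at, CAST(NULL AS TIMESTAMP)
FROM snapshot_staging g
LEFT JOIN snapshot_table t
  ON t.id = g.id AND t.dbt_valid_to IS NULL
WHERE t.id IS NULL OR g.lastchanged > t.lastchanged;

-- ...and so do updates, using the same dbt_snapshot_at, so every record
-- invalidated here has a matching replacement with an identical boundary
-- timestamp: no gaps, no overlaps, no race.
UPDATE snapshot_table t
SET t.dbt_valid_to = g.dbt_snapshot_at
FROM snapshot_staging g
WHERE t.id = g.id
  AND t.dbt_valid_to IS NULL
  AND g.lastchanged > t.lastchanged;
```

On warehouses that support it, collapsing the insert and update into a single `MERGE` statement would get the same consistency in one atomic statement.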
Expected behavior
Mutations to source data during a `dbt snapshot` should be ignored and picked up in a subsequent snapshot job.

System information
Which database are you using dbt with?
The output of `dbt --version`:

Additional context
In addition to the primary defect noted here, `dbt snapshot` should be improved to handle (or help users fix) cases where there are no "current" records for a unique key in a snapshot table. This is basically going to be dbt's fault 100% of the time, so it should be easy to "fix" that if it should happen again in the future.

Secondly, dbt should probably also enforce the constraints that it requires by default in snapshot definitions. If a table's primary key is non-unique, or if the `updated_at` timestamp is null in snapshots configured with the `timestamp` strategy, then `dbt snapshot` is going to fail very hard. We should help users determine when this is the case and prevent `dbt snapshot` from getting itself into a bad state.
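As a stopgap for the first point, keys stuck in that state can at least be detected with a query along these lines (a sketch with a placeholder table name; assumes the standard `dbt_valid_to` column):

```sql
-- A key is in the bad state when it has history in the snapshot but no
-- open ("current") record, i.e. no row where dbt_valid_to IS NULL.
SELECT id
FROM snapshot_table
GROUP BY id
HAVING SUM(CASE WHEN dbt_valid_to IS NULL THEN 1 ELSE 0 END) = 0;
```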