
Make snapshot data staging atomic #1884

Closed
drewbanin opened this issue Nov 2, 2019 · 4 comments · Fixed by #2390
Labels
bigquery · bug (Something isn't working) · snapshots (Issues related to dbt's snapshot functionality) · snowflake

Comments

@drewbanin
Contributor

Describe the bug

Note: Further testing is required to determine that this bug exists, and that the description provided here accurately depicts the errant behavior and steps to reproduce it.

The dbt snapshot command selects from the source dataset twice: once to generate a set of inserts, and again to generate a set of updates. If these queries run outside of a transaction (i.e. on Snowflake or BigQuery), then changes to the underlying source data between the two queries can present two different views of the staging data to the snapshot query.

As such, it's possible for dbt to invalidate an existing snapshot record without inserting a corresponding new entry for that record. This leaves the snapshot table in a bad state: because there is no "current" record in the snapshot table, subsequent dbt snapshot invocations stop recording changes to the underlying source data for these "closed out" records, as identified by their unique keys.

Steps To Reproduce

This is in the category of a "race condition". It appears that this issue can manifest if the source data for a snapshot is modified between the execution of these two sql statements: https://github.com/fishtown-analytics/dbt/blob/f985902a002fba36f6f709c6aacf9ae20778e58c/core/dbt/include/global_project/macros/materializations/snapshot/snapshot.sql#L165-L174

We should either combine these queries into a single query, or stage the source data into a temporary table in a way that guarantees consistency between the two queries. The code here was written this way for simplicity, but there's no real reason why the two queries cannot be combined into one.
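The staging approach can be sketched outside of dbt with Python's sqlite3 (table and column names here are hypothetical, not dbt's actual macro code): materialize the source once into a temp table, and drive both passes from that staged copy so a concurrent write to the source cannot change the view between them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE source (id INTEGER, val TEXT, updated_at TEXT)")
cur.execute("INSERT INTO source VALUES (1, 'a', '2019-11-01')")

# Stage the source exactly once. Both the "insert" pass and the
# "update" pass below read from this staged copy, so a concurrent
# mutation of `source` cannot present two different views of the data.
cur.execute("CREATE TEMP TABLE staged AS SELECT * FROM source")

# Simulate a concurrent writer mutating the source mid-snapshot.
cur.execute("UPDATE source SET val = 'b', updated_at = '2019-11-02' WHERE id = 1")

insert_rows = cur.execute("SELECT * FROM staged").fetchall()
update_rows = cur.execute("SELECT * FROM staged").fetchall()

# Both passes see the identical pre-mutation view of the source.
assert insert_rows == update_rows
```

The mutation made after staging is simply not visible to either pass; it gets picked up on the next snapshot run, which is the expected behavior described below.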

Expected behavior

Mutations to source data during a dbt snapshot should be ignored and picked up in a subsequent snapshot job.

System information

Which database are you using dbt with?

  • bigquery
  • snowflake

The output of dbt --version:

>=0.14.0

Additional context

In addition to the primary defect noted here, dbt snapshot should be improved to handle (or help users fix) cases where there are no "current" records for a unique key in a snapshot table. This state is basically going to be dbt's fault 100% of the time, so dbt should make it easy to repair the snapshot table if it ever happens again.

Secondly, dbt should probably also enforce the constraints that it requires by default in snapshot definitions. If a table's primary key is non-unique, or if the updated_at timestamp is null in snapshots configured with the timestamp strategy, then dbt snapshot is going to fail very hard. We should help users determine when this is the case and prevent dbt snapshot from getting itself into a bad state.

@drewbanin drewbanin added bug Something isn't working snowflake bigquery snapshots Issues related to dbt's snapshot functionality labels Nov 2, 2019
@jrandrews

This caused some problems for us yesterday. We have a snapshot whose underlying source table added a column without incrementing the lastchanged field (don't ask, long story). So I tried to manually force a new version of each record in the underlying table for the snapshot by running this SQL command:

SELECT * EXCEPT (lastchanged), CURRENT_TIMESTAMP() AS lastchanged FROM {{ source('my_source_table') }}

Well, then this happened in the snapshot records (filtering for a stream of changes to one common id value; look at how the valid_to and valid_from timestamps overlap for the second and third rows):

[Screenshot: snapshot rows for a single id, showing overlapping valid_to/valid_from timestamps across consecutive versions]

My guess is that because it is two different SQL statements being run, the value of CURRENT_TIMESTAMP() was different at the exact moment that the two of them were being run. It would be great to get this issue addressed so things like this don't happen.

Luckily I have a backup of the snapshot from before I did this, so I can go back and re-try it another way, probably with a fixed timestamp like TIMESTAMP(DATETIME('2020-04-02 10:00:00'), 'UTC')
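The guess above is consistent with per-statement evaluation: each statement stamps rows with its own wall-clock time. A minimal sqlite3 sketch (hypothetical schema, not dbt's actual tables) of how two separately-stamped statements produce overlapping validity windows:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE snap (id INTEGER, version INTEGER, "
            "valid_from REAL, valid_to REAL)")

# An existing "current" record, valid since t=0.
cur.execute("INSERT INTO snap VALUES (1, 1, 0.0, NULL)")

# Statement 1: insert the new version, stamped at its own clock time.
t_insert = time.time()
cur.execute("INSERT INTO snap VALUES (1, 2, ?, NULL)", (t_insert,))

time.sleep(0.01)  # the gap between the two statements

# Statement 2: close out the old version, stamped slightly later.
t_update = time.time()
cur.execute("UPDATE snap SET valid_to = ? WHERE version = 1", (t_update,))

# The old version's valid_to now falls AFTER the new version's
# valid_from: the two validity windows overlap.
assert t_update > t_insert
```

Stamping both statements from a single value captured up front (or running both from one staged query) would keep the windows contiguous.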

@jrandrews

The reason, by the way, for the 9999-01-01 date in the dbt_valid_to column is that these SQL results were from a view that we have on top of the snapshot that casts NULLs in that column to "future eternity." But yes, I have checked in the underlying snapshot data, and the problem does exist there.

@jrandrews

Would this bug also cause cases where an "old" version of a record in a snapshot would be end-dated but a new record with a newer "lastchanged" timestamp would not be added? We're seeing this happen and regardless of how many times we run the "snapshot" command the new record never gets inserted.

@drewbanin
Contributor Author

@jrandrews I think it could! dbt runs:

  1. An insert for all brand-new records, plus new versions of existing records
  2. An update for all invalidated records

If a new data point which would invalidate an existing snapshot record is added to the snapshot query between (1) and (2), then the record in the snapshot table would be invalidated without a corresponding new version being inserted.
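That race can be sketched with sqlite3 (table and column names are hypothetical): a mutation landing between pass (1) and pass (2) lets the update pass close out a record that the insert pass never replaced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE source (id INTEGER, val TEXT)")
cur.execute("CREATE TABLE snapshot (id INTEGER, val TEXT, valid_to TEXT)")
cur.execute("INSERT INTO source VALUES (1, 'a')")
cur.execute("INSERT INTO snapshot VALUES (1, 'a', NULL)")

# Pass 1: insert new versions for changed records. Nothing has changed
# yet, so no rows are inserted.
changed = cur.execute(
    "SELECT s.id, s.val FROM source s JOIN snapshot t ON s.id = t.id "
    "WHERE t.valid_to IS NULL AND s.val != t.val").fetchall()
cur.executemany("INSERT INTO snapshot VALUES (?, ?, NULL)", changed)

# A concurrent writer mutates the source between the two passes.
cur.execute("UPDATE source SET val = 'b' WHERE id = 1")

# Pass 2: invalidate current records whose source row changed. This
# pass DOES see the mutation and closes out the record, even though
# pass 1 never inserted a replacement.
cur.execute(
    "UPDATE snapshot SET valid_to = 'now' WHERE valid_to IS NULL AND id IN "
    "(SELECT s.id FROM source s JOIN snapshot t ON s.id = t.id "
    "WHERE t.valid_to IS NULL AND s.val != t.val)")

current = cur.execute(
    "SELECT * FROM snapshot WHERE valid_to IS NULL").fetchall()
# No "current" record remains for id 1, and no snapshot run can
# bring it back: exactly the stuck state described in this thread.
assert current == []
```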

Lots of really compelling reasons to fix this issue for sure
