Fix/snapshot staging is not atomic #2390
I'm pretty confident I get what's going on here: this just moves the selection of updates and inserts into a single statement and creates the table in one `create table as select`, so "now" is the same time on both.
The test seems pretty cool too, slick use of pg_sleep.
Once you've got the changelog and all that done, this looks good to me! 🚢
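To make the point about "now" concrete, here is a minimal sketch using Python's built-in `sqlite3` rather than a dbt adapter. It is not dbt's actual snapshot SQL: the `source`, `snapshot`, and `staging` tables, the `status` and `changed_at` columns, and the `build_staging` helper are all hypothetical names chosen for illustration. It only shows the general shape of the idea: one `create table as select` produces both the insert rows and the update rows, so both halves see the same source data and the same timestamp.

```python
import sqlite3

# One statement builds the whole staging table. Everything here is an
# illustrative assumption, not dbt's real snapshot query.
STAGING_SQL = """
create table staging as
with changed as (
    -- source rows that are new, or differ from the current snapshot record
    select s.id as id, s.status as status, t.id as snapshot_id
    from source s
    left join snapshot t
      on t.id = s.id and t.dbt_valid_to is null
    where t.id is null or t.status <> s.status
)
select 'insert' as dbt_change_type, id, status, current_timestamp as changed_at
from changed
union all
-- only rows that already have a current snapshot record need an update
select 'update' as dbt_change_type, id, status, current_timestamp as changed_at
from changed
where snapshot_id is not null
"""


def build_staging(conn):
    """Rebuild the staging table in a single statement and return its rows."""
    conn.execute("drop table if exists staging")
    conn.execute(STAGING_SQL)
    return conn.execute(
        "select dbt_change_type, id, changed_at from staging "
        "order by id, dbt_change_type"
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table source (id integer primary key, status text);
        create table snapshot (
            id integer, status text, dbt_valid_from text, dbt_valid_to text
        );
        insert into source values (1, 'shipped'), (2, 'pending');
        insert into snapshot values (1, 'pending', '2020-01-01 00:00:00', null);
    """)
    return build_staging(conn)
```

In this toy data set, row 1 changed and row 2 is new, so `demo()` yields an insert and an update for row 1 plus an insert for row 2, all stamped with one `changed_at` value.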
…ated-at-timestamp-for-comparisons Fix for changing snapshot strategy types
This feels really good. The DDL/DML is much more straightforward and linear. Selfishly, it's going to make the spark implementation (dbt-labs/dbt-spark#76) a lot easier, too.
I did some local testing on the core four adapters, and everything looks good. Nice work on the `pg_sleep` test. Since this issue wasn't actually applicable to pg/redshift, is there any sense in trying to run `models-slow` on snowflake/bq? It might require a js udf.
While I was combing through the logs, I did notice that the snapshot materialization calls macros (`expand_target_column_types`, `get_missing_columns`, and `get_columns_in_relation`) which repeatedly execute the same `information_schema` or `describe` query. That's out of scope for the current PR, and I'm not sure if it's even worth opening an issue about; now that we run the more performant `describe` query on Snowflake, the additional runtime is hardly noticeable.
@drewbanin We are running a dbt snapshot on top of Snowflake. We're often encountering a
The source for the snapshot is a Snowflake table that is synced from a production SQL table using Fivetran. I'm not super-familiar with the inner workings of snapshots, but one hypothesis I have is that Fivetran is syncing the row in question while the snapshot is running. The image below shows the snapshot data for the row in question - can you tell from the data whether this hypothesis might be correct?
The next question is: How could my hypothesis be correct, given that this PR should make snapshots atomic? Well, I learned through painful experience that CTEs are not guaranteed to only evaluate once during a given query execution. So I wonder if
Thanks for your time :)
resolves #1884
Description
Previously, snapshots built a staging table in two discrete queries (one was a `create table as` and the other was an `insert`). This meant that if source data changed between the two statements (i.e. if new data was loaded), then dbt would generate an asymmetric set of inserts and updates.
This manifested as records in the snapshot table becoming invalidated (the `dbt_valid_to` column was set to a not-null value) without a new record being inserted for the corresponding change.
This PR combines the two queries (which is actually how this previously worked) to ensure that both inserts and updates are generated from a consistent view of the source table.
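The failure mode can be reproduced in miniature. The sketch below uses Python's `sqlite3` as a stand-in for a warehouse; the table layout, the `load_new_data` helper, and the order of the two statements (inserts first, updates second) are assumptions made for illustration, not a copy of dbt's queries. It stages changes once with two statements and once with one, simulating a data load landing at the worst possible moment.

```python
import sqlite3

# Rows whose current snapshot record no longer matches the source.
CHANGED = """
    select s.id as id
    from source s
    join snapshot t on t.id = s.id and t.dbt_valid_to is null
    where t.status <> s.status
"""


def new_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table source (id integer, status text);
        create table snapshot (id integer, status text, dbt_valid_to text);
        insert into source values (1, 'pending');
        insert into snapshot values (1, 'pending', null);
    """)
    return conn


def load_new_data(conn):
    # Simulates a loader changing the source while the snapshot is running.
    conn.execute("update source set status = 'shipped' where id = 1")


def stage_two_statements(conn):
    # Old behaviour (assumed order): each statement reads the source separately.
    conn.execute(
        f"create table staging as "
        f"select 'insert' as dbt_change_type, id from ({CHANGED})"
    )
    load_new_data(conn)  # the load lands between the two statements
    conn.execute(f"insert into staging select 'update', id from ({CHANGED})")


def stage_one_statement(conn):
    # New behaviour: inserts and updates come from a single statement.
    conn.execute(f"""
        create table staging as
        select 'insert' as dbt_change_type, id from ({CHANGED})
        union all
        select 'update', id from ({CHANGED})
    """)
    load_new_data(conn)  # the load is simply picked up by the next run


def unmatched_updates(stage):
    """Return ids that would be invalidated without a replacement record."""
    conn = new_db()
    stage(conn)
    rows = conn.execute("""
        select id from staging
        where dbt_change_type = 'update'
          and id not in (select id from staging where dbt_change_type = 'insert')
        order by id
    """).fetchall()
    return [r[0] for r in rows]
```

With this setup, `unmatched_updates(stage_two_statements)` returns `[1]` (row 1 would get a `dbt_valid_to` but no new record), while `unmatched_updates(stage_one_statement)` returns `[]`.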
Some notes:
Checklist
I have updated the `CHANGELOG.md` and added information about my change to the "dbt next" section.