[Bug] Timestamps are inconsistent between inserted/updated and deleted rows in snapshots #4347
@pcasteran Thanks for opening! At first, I thought this issue might be the same as #3710 (data type mismatches between Okay, let's dive in. Snapshots on BigQuery are basically the same as they are elsewhere. The The snapshot strategy determines the value of dbt-core/core/dbt/include/global_project/macros/materializations/snapshots/strategies.sql Lines 136 to 145 in 0d320c5
Why did we choose to do it that weird way a few years ago? It looks like it was to ensure that there was no overlap between a new record's In fact, I think we rendered this workaround obsolete, by creating a single atomic staging table in #2390 for use with either a single atomic So, I think we could take either of these approaches:
I think this could be a |
Hello @jtcohen6 @pcasteran 👋, happy to take this one if that's OK with you. Thanks 🙏 |
@kadero that's fine for me if you want to work on this one, thx :) |
❓ Should this ticket be re-opened or should I create a new issue?
Proposed solution
Converting this to a non-breaking change might be as simple as:
- {% set updated = snapshot_get_time() %}
+ {% set updated_at = config.get('updated_at', snapshot_get_time()) %}
Definitions
Problem overview
Some users have source tables that represent point-in-time snapshots. e.g., a table named
Example
This is a simple example showing changes to a table over 3 days, including insertion and deletions.
Configuring the table copies as a source
An example source configuration in dbt is:
version: 2
sources:
  - name: users
    database: raw
    schema: users
    tables:
      - name: user_{{ var("iso8601") }}
Configuring a dbt snapshot
As of this writing, specifying the
{% snapshot users_snapshot %}
{{
    config(
        target_database='analytics',
        target_schema='dbt_dbeatty_snapshots',
        unique_key='id',
        strategy='check',
        check_cols='all',
        updated_at="try_to_timestamp_ntz('" ~ var("iso8601") ~ "', 'YYYYMMDD')",
        invalidate_hard_deletes=True,
    )
}}
select * from {{ source("users", "user_" ~ var("iso8601")) }}
{% endsnapshot %}
Taking dbt snapshots
Processing successive days:
dbt snapshot --vars '{"iso8601": "20190701"}'
dbt snapshot --vars '{"iso8601": "20190702"}'
dbt snapshot --vars '{"iso8601": "20190703"}'
Output
Suppose consecutive dbt snapshots are taken for those 3 dates at the following timestamps:
Before #4513
After #4513
The output of the following columns differs before and after:
(I know, timestamps in tables aren't the easiest to read 😅 ) The proposed solution is intended to preserve the update and also reestablish the previous behavior. |
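As a rough illustration of the fallback in the proposed one-liner, here is a minimal Python sketch. It is not dbt code: a plain dict stands in for dbt's config object, and the string "current_timestamp()" is an assumed placeholder for whatever snapshot_get_time() renders on a given adapter. The function name resolve_updated_at is hypothetical.

```python
def resolve_updated_at(config, current_time_sql="current_timestamp()"):
    """Sketch of `config.get('updated_at', snapshot_get_time())`.

    `config` is a plain dict standing in for the snapshot config, and
    `current_time_sql` is an assumed rendering of snapshot_get_time().
    """
    # A user-supplied expression wins; otherwise fall back to the current time.
    return config.get("updated_at", current_time_sql)


# Snapshot configured with an explicit updated_at expression (as in the example above).
configured = {"updated_at": "try_to_timestamp_ntz('20190701', 'YYYYMMDD')"}
# Snapshot that relies on the default.
unconfigured = {}

print(resolve_updated_at(configured))
print(resolve_updated_at(unconfigured))
```

With this shape, snapshots that set updated_at keep their configured expression, while everyone else gets the current-time default.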
@dbeatty10 Nice catch, and thanks for the clear write-up! I agree our goal should be preserving the previous behavior for folks who have configured it as such, while also supporting the improved default for folks who haven't. Let's open a new ticket, and accompanying PR, since it sounds like the solution to accomplish that could be a quite-simple one-liner. |
Is there an existing issue for this?
Current Behavior
During the execution of a snapshot, the timestamp used for the dbt_updated_at, dbt_valid_from and dbt_valid_to columns is not consistent between the inserted, updated and deleted rows.
For the inserted and updated rows, the code is {{ strategy.updated_at }} which, in BigQuery at least, is calculated beforehand in another query and is equal to CURRENT_TIMESTAMP() in that previous query. The value is then expanded in the actual snapshot query:
On the other hand, for the deleted rows, the code is {{ snapshot_get_time() }}, which is expanded as CURRENT_TIMESTAMP() in the main query:
As there are multiple queries involved in the execution of a snapshot, evaluated at different points in time, the values returned by the two calls of CURRENT_TIMESTAMP() differ, which leads to different timestamps being used in the output.
Expected Behavior
A single execution of a snapshot should output the same "current timestamp" value for the inserted, updated and deleted rows.
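To make the mismatch described under Current Behavior concrete, here is a minimal Python sketch. It does not use dbt or BigQuery: a fake clock that advances on every read stands in for CURRENT_TIMESTAMP() being evaluated in separate queries, and the names make_clock and snapshot_timestamps, the start time, and the 2-second gap are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from itertools import count


def make_clock(start, step):
    """Return a fake clock that advances by `step` each time it is read,
    standing in for CURRENT_TIMESTAMP() evaluated in separate queries."""
    ticks = count()
    return lambda: start + step * next(ticks)


def snapshot_timestamps(clock, capture_once):
    """Return (timestamp for inserted/updated rows, timestamp for deleted rows)."""
    # First query: the strategy's timestamp is computed beforehand.
    updated_at = clock()
    # Main query: deleted rows either reuse the captured value
    # or read the clock a second time.
    deleted_at = updated_at if capture_once else clock()
    return updated_at, deleted_at


start = datetime(2021, 11, 26, 12, 0, 0, tzinfo=timezone.utc)
step = timedelta(seconds=2)

# Two independent clock reads: the timestamps drift apart.
inserted, deleted = snapshot_timestamps(make_clock(start, step), capture_once=False)
print(inserted == deleted)  # False

# One captured value reused everywhere: the timestamps agree.
inserted, deleted = snapshot_timestamps(make_clock(start, step), capture_once=True)
print(inserted == deleted)  # True
```

Reusing the value captured in the first query is the same idea as using one timestamp expression for all three kinds of rows.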
I guess the fix should be to use {{ strategy.updated_at }} instead of {{ snapshot_get_time() }} when processing the deleted rows.
Steps To Reproduce
No response
Relevant log output
No response
Environment
No response
What database are you using dbt with?
bigquery
Additional Context
No response