Update docs for snapshot configuration #2900

Open
1 task done
dbeatty10 opened this issue Feb 20, 2023 · 0 comments
Labels
content (Improvements or additions to content), improvement (Use this when an area of the docs needs improvement as it's currently unclear)


dbeatty10 commented Feb 20, 2023

Contributions

  • I have read the contribution docs, and understand what's expected of me.

Link to the page on docs.getdbt.com requiring updates

https://docs.getdbt.com/docs/build/snapshots

What part(s) of the page would you like to see updated?

Things to know

Python datetimes distinguish between timestamps that are "aware" and those that are "naive". Aware timestamps can represent a unique instant in time because they explicitly store the relevant UTC offset. Naive timestamps cannot represent a unique instant in time (unless the offset is determined by "mutual agreement").

An example of mutual agreement would be a producer and consumer both agreeing that a particular naive timestamp represents UTC. dbt has been a proponent of mutually agreeing that naive timestamps are implicitly in UTC and can be considered to represent a unique instant in time.
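
A minimal Python sketch of the distinction, using only the standard library `datetime` module. The helper names `is_aware` and `assume_utc` are made up for illustration; `assume_utc` encodes the "naive means UTC" agreement described above and is not part of dbt.

```python
from datetime import datetime, timezone


def is_aware(ts: datetime) -> bool:
    """Return True when ts carries a usable UTC offset."""
    return ts.tzinfo is not None and ts.utcoffset() is not None


def assume_utc(ts: datetime) -> datetime:
    """Apply the mutual agreement that naive timestamps are in UTC (hypothetical helper)."""
    return ts if is_aware(ts) else ts.replace(tzinfo=timezone.utc)


naive = datetime(2023, 2, 20, 12, 0, 0)                       # no offset stored
aware = datetime(2023, 2, 20, 12, 0, 0, tzinfo=timezone.utc)  # offset stored explicitly
```

Under the agreement, `assume_utc(naive)` and `aware` refer to the same instant; without it, `naive` could be any of several instants depending on who reads it.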

The data type returned by the snapshot_get_time() macro is often a "naive" timestamp type rather than an "aware" one. This has implications when the data or configuration of a snapshot includes aware timestamps, especially because of the implicit data type conversion performed by the database.
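
To illustrate why mixing the two kinds matters, here is a small Python sketch. Python refuses to order a naive value against an aware one, whereas a database may convert implicitly. The `interpret` helper and the -8 hour offset are assumptions chosen for the example; they stand in for whatever time zone a database might assume during conversion.

```python
from datetime import datetime, timedelta, timezone

# Stand-ins for a naive snapshot_get_time() value and an aware updated_at value.
naive_now = datetime(2023, 2, 20, 12, 0, 0)
aware_updated_at = datetime(2023, 2, 20, 12, 0, 0, tzinfo=timezone.utc)

# Python raises TypeError when ordering a naive value against an aware one.
try:
    naive_now < aware_updated_at
    comparison_allowed = True
except TypeError:
    comparison_allowed = False


def interpret(ts, assumed_offset_hours):
    """Attach an assumed UTC offset to a naive timestamp (illustrative only)."""
    return ts.replace(tzinfo=timezone(timedelta(hours=assumed_offset_hours)))


as_utc = interpret(naive_now, 0)
as_minus_8 = interpret(naive_now, -8)

# The same naive wall-clock value lands on two different instants.
shift = as_minus_8 - as_utc
```

The size of `shift` is the error that would creep into a valid from/to interval if the naive value were interpreted in a zone other than the one the producer intended.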

Overview

Here are some key pieces of information for us to communicate:

  • There are two different snapshot strategies that can be configured:
    1. timestamp
    2. check
  • There is an opt-in config that seems simple, but has tricky implications:
    1. Whether to find and invalidate hard-deleted records (records that no longer exist in the source)
  • There is a macro that can be configured, but that many people don't know about:
    1. The SQL expression returned by the snapshot_get_time() macro

Configuration options

Specific things that are configurable:

  1. The updated_at config
  2. The invalidate_hard_deletes config

updated_at config

  • For the timestamp strategy:
    • updated_at is required
    • updated_at must be a column name (at least for the Snowflake adapter, and possibly others as well); i.e., expressions will not work (which could be considered a bug to fix)
  • For the check strategy:
    • updated_at is optional (which is not clearly documented currently)
    • updated_at may be either a column name or an expression
  • For both strategies, the updated_at config is used to populate the dbt_valid_from, dbt_valid_to and dbt_updated_at columns.
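
The rules above can be summarized as a small validation sketch in Python. This is not dbt code: `check_updated_at` is a hypothetical helper, and the regular expression is only a crude stand-in for "is a plain column name" (real adapters have their own quoting and identifier rules).

```python
import re

# Crude approximation of an unquoted column name; an assumption for this sketch.
_COLUMN_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def check_updated_at(strategy, updated_at=None):
    """Return a list of problems with the updated_at config for a strategy."""
    problems = []
    if strategy == "timestamp":
        if updated_at is None:
            problems.append("updated_at is required for the timestamp strategy")
        elif not _COLUMN_NAME.match(updated_at):
            problems.append("updated_at should be a column name for the timestamp strategy")
    elif strategy == "check":
        # Optional here; either a column name or an expression is acceptable.
        pass
    else:
        problems.append(f"unknown strategy: {strategy!r}")
    return problems
```

For example, `check_updated_at("timestamp", "coalesce(a, b)")` reports a problem, while `check_updated_at("check", "coalesce(a, b)")` and `check_updated_at("check")` do not.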

invalidate_hard_deletes config

  • optional for both strategies (defaulting to false)
  • For the timestamp strategy:
    • when invalidate_hard_deletes is true, the snapshot_get_time() macro is used to represent when the record ceased to be valid.
  • For the check strategy:
    • when invalidate_hard_deletes is true, the updated_at config is used to represent when the record ceased to be valid
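
As a compact restatement of the list above, the following Python sketch maps each strategy to the source of the timestamp that marks a hard-deleted record as no longer valid. The function name `hard_delete_timestamp_source` is hypothetical and exists only for illustration.

```python
def hard_delete_timestamp_source(strategy, invalidate_hard_deletes=False):
    """Name where the 'ceased to be valid' timestamp comes from for a hard delete.

    Returns None when the opt-in config is off (the default).
    """
    if not invalidate_hard_deletes:
        return None
    sources = {
        "timestamp": "snapshot_get_time()",  # macro result
        "check": "updated_at",               # the configured column or expression
    }
    return sources[strategy]
```

Because the two strategies draw this value from different sources, the data type (naive or aware) of the resulting timestamp can differ between them.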

⚠️ Warnings and caveats

The valid from/to intervals could exhibit undesirable behavior if any of the following occurs:

  • updated_at is not monotonically increasing for each unique_key value across successive snapshots (e.g. out-of-order snapshot times)
  • updated_at has the timestamp_ntz data type, but it represents a local time zone other than UTC
  • invalidate_hard_deletes is true and the updated_at column or expression is an aware timestamp, but the snapshot_get_time() macro is a naive timestamp
  • multiple instances of dbt execute at the same time and perform snapshots
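
The first condition in the list above can be checked with a short Python sketch. It assumes the snapshot history is available as `(unique_key, updated_at)` pairs in the order the snapshots ran; the function name `keys_going_backwards` and the sample rows are made up for the example.

```python
from datetime import datetime


def keys_going_backwards(rows):
    """Return the unique_key values whose updated_at ever decreases.

    rows: iterable of (unique_key, updated_at) pairs in snapshot order.
    Equal consecutive values are allowed (the record simply did not change).
    """
    last_seen = {}
    offenders = set()
    for key, updated_at in rows:
        if key in last_seen and updated_at < last_seen[key]:
            offenders.add(key)
        last_seen[key] = updated_at
    return offenders


rows = [
    (1, datetime(2023, 2, 20)),
    (2, datetime(2023, 2, 20)),
    (1, datetime(2023, 2, 21)),  # key 1 moves forward
    (2, datetime(2023, 2, 19)),  # key 2 moves backwards
]
offenders = keys_going_backwards(rows)
```

All timestamps passed in should be of the same kind (all naive or all aware); otherwise Python raises a TypeError on comparison.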

Additional information

Related issue:

@dbeatty10 added the content and improvement labels Feb 20, 2023