You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Imagine order_id=1 is deleted from the table you're snapshotting.
If invalidate_hard_deletes=true, deleted records are invalidated by setting dbt_valid_to to the current timestamp:
order_id
dbt_valid_from
dbt_valid_to
1
2024-05-19
2024-05-20
2
2024-05-20
If invalidate_harde_deletes=false, deleted records are not invalidated, dbt_valid_to remains as NULL (you can’t really tell if they’ve been deleted):
order_id
dbt_valid_from
dbt_valid_to
1
2024-05-19
2
2024-05-20
What folks WANT, is tracking the “deleted” state as a state the record is in:
order_id
dbt_valid_from
dbt_valid_to
dbt_is_deleted
1
2024-05-19
2024-05-20
False
1
2024-05-20
True
2
2024-05-19
False
and maybe later it comes back as a zombie record “it comes back to life”:
order_id
dbt_valid_from
dbt_valid_to
dbt_is_deleted
1
2024-05-19
2024-05-20
False
1
2024-05-20
2024-06-03
True
1
2024-06-03
False
2
2024-05-19
False
Here’s the same exact same data as it exists today if you were using the invalidate_hard_deletes config. Note how there are no rows for order_id=1 between the time period of 2024-05-20 to 2024-06-03 — it has a “gap” during that time period. Contrast this with the proposed is_deleted column above that has “no gaps”.
order_id
dbt_valid_from
dbt_valid_to
1
2024-05-19
2024-05-20
1
2024-06-03
2
2024-05-19
Acceptance Criteria
Provide users a “new way to track deleted records”
invalidate_hard_deletes
track hard deletes by adding a new record when row become "deleted" in source
implicit delete
always explicit
has gaps
has no gaps
smaller datasets
larger datasets
new config hard_deletes (name tbd)
default (if it's not set) is hard_deletes = 'ignore', current behavior when invalidate_hard_deletes is not set today
replaces old invalidate_hard_deletes config, by setting hard_deletes = 'invalidate' (name tbd; other ideas 'invalidate_current_record')
new behavior is captured by setting hard_deletes = 'new_record' (name tbd; other ideas 'track', 'insert_new_record', 'is_deleted_flag', 'is_deleted_indicator', 'new_row')
snapshots:
my_project:
+hard_deletes: new_record
can set config as project default or for individual snapshot
what should happen when someone changes this config? [spike?]
we offer migration utils in an adapter-agnostic way
support this config for net-new snapshots
handling deletions when a record comes back to life but updated_at field has not changed
should we also add the new dbt_is_deleted meta-field when hard_deletes = 'invalidate'?
Notes
what should happen when someone changes this config?
should new meta field be dbt_is_deleted or dbt_deleted_at? i think dbt_is_deleted makes more sense because the deleted at timestamp can be derived from whatever dbt_valid_from is for the deleted row
order_id
dbt_valid_from
dbt_valid_to
dbt_is_deleted
dbt_deleted_at
1
2024-05-19
2024-05-20
False
1
2024-05-20
2024-06-03
True
2024-05-20
1
2024-06-03
False
2
2024-05-19
False
The text was updated successfully, but these errors were encountered:
@graciegoheen@gshank this summary of #8207 seems like its discussing a deleted flag for a row, whereas 8207 was the treatment of a column in a row when the column was removed in source.
e.g
we had columns a,b,c in our source table, with values 'x, 'y', 'z' in each
column c was deleted in source, we'd be expecting a new row with values 'x', 'y', null (because z no longer exists)
DBT handles a new column from source gracefully and records the change but the reverse was not catered for. That was the premise of 8207. from reading this issue I do not get the same sense.
graciegoheen
changed the title
Improve capturing the history of records in a deleted state
new record in snapshot to track deleted state (is_deleted column, zombie records)
Jun 17, 2024
set dbt_valid_to to current timestamp for latest record of every unique_key that no longer exists in the source
from ignore to new_record:
create new dbt_is_deleted column set to false for every record
set dbt_valid_to to current timestamp for latest record of every unique_key that no longer exists in the source
insert new record with dbt_valid_from as current timestamp, dbt_valid_to null, and dbt_is_deleted true, for every unique_key that no longer exists in the source
from invalidate to new_record:
create new dbt_is_deleted column set to false for every record
insert new record with dbt_valid_from as latest dbt_valid_to, dbt_valid_to null, and dbt_is_deleted true, for every unique_key that has 0 records in the snapshot with dbt_valid_to as null
from new_record to invalidate:
delete every record where dbt_is_deleted is true
delete dbt_is_deleted column
from invalidate to ignore:
update dbt_valid_to to null for the latest record of every unique_id that has 0 records in the snapshot with dbt_valid_to as null
from new_record to ignore:
delete every record where dbt_is_deleted is true
delete dbt_is_deleted column
update dbt_valid_to to null for the latest record of every unique_id that has 0 records in the snapshot with dbt_valid_to as null
Description
Imagine
order_id=1
is deleted from the table you're snapshotting.If
invalidate_hard_deletes=true
, deleted records are invalidated by settingdbt_valid_to
to the current timestamp:If
invalidate_harde_deletes=false
, deleted records are not invalidated,dbt_valid_to
remains asNULL
(you can’t really tell if they’ve been deleted):What folks WANT, is tracking the “deleted” state as a state the record is in:
and maybe later it comes back as a zombie record “it comes back to life”:
Here’s the same exact same data as it exists today if you were using the
invalidate_hard_deletes
config. Note how there are no rows fororder_id=1
between the time period of2024-05-20
to2024-06-03
— it has a “gap” during that time period. Contrast this with the proposedis_deleted
column above that has “no gaps”.Acceptance Criteria
Provide users a “new way to track deleted records”
hard_deletes
(name tbd)hard_deletes = 'ignore'
, current behavior wheninvalidate_hard_deletes
is not set todayinvalidate_hard_deletes
config, by settinghard_deletes = 'invalidate'
(name tbd; other ideas'invalidate_current_record'
)hard_deletes = 'new_record'
(name tbd; other ideas'track'
,'insert_new_record'
,'is_deleted_flag'
, 'is_deleted_indicator'
,'new_row'
)updated_at
field has not changeddbt_is_deleted
meta-field whenhard_deletes = 'invalidate'
?Notes
dbt_is_deleted
ordbt_deleted_at
? i thinkdbt_is_deleted
makes more sense because the deleted at timestamp can be derived from whateverdbt_valid_from
is for the deleted rowThe text was updated successfully, but these errors were encountered: