Destination V2: Keep deleted records for further processing #30211

dimoschi · 2023-09-06T16:57:48Z

What area the feature impact?

Connectors

Revelant Information

In V2 Destinations after removing the SCD tables, there is no safe way to generate SCD/historical records. In the past SCD tables had the all records, even the ones deleted by the source. Without SCD, those records can only be found in raw tables, which makes it very hard to use them. There are several occasions where deleted records are needed for analytical purposes.

We have several cases and I believe other people have too, where apps keep only transactional data that get deleted the moment they are no longer useful to the app itself. However, the same deleted data could be useful for analytical purposes.

Another point is that using CDC doesn't really make any sense if the intention is to mirror the sources as closely as possible. In that case, a select star from the source would be more than enough.

I think there is definitely a need to have for each record all past states as it gets updated and the final state even after deletion.

evantahler · 2023-09-06T22:30:44Z

In the past SCD tables had the all records, even the ones deleted by the source.

This statement was never guaranteed to be true - which is why we removed SCD tables (along with it being underutilized) - it was an inaccurate guarentee.

SCD implies "capture all changes for all rows". We can only do this if we are guaranteed, no matter when the sync runs, we get all the changes. This only works for CDC today, and is more of an implementation fluke than design choice. Every other source, depending on when the sync runs, takes a sample of the source at that time, missing whatever changes happened between syncs.
Technically this doesn't always work for CDC either - if the syncs are too far apart and the WAL log is reset, we'll do a full refresh and miss any changes between syncs
Some users might be Ok with 'hourly' or 'daily' snapshots and not need every change. If that's the case, they can use dbt to do that easily-ish with snapshots. But, we don't know what the user might want here (frequency, retention period, etc), and will need to build a way to ask this question into the product.

nina-j · 2023-09-07T08:37:01Z

I take this as a general discussion of SDC/CDC and not particular to Snowflake - we use BigQuery.

We evaluated Airbyte against Datastream when deciding on new ingestion methods for our datalake, and one important reason for choosing Airbyte was exactly the ability to retain deletion information. Of course, with CDC there's always the WAL lifespan caveat, but my expectation would be for Airbyte to be less opinionated and simply reflect the WAL. Providing guarantees is another matter and in my opinion something that could be mentioned in documentation.

Can you clarify what happens in an incremental append scenario? Would we have _deleted_at columns still available?

dimoschi · 2023-09-07T08:56:38Z

@evantahler I didn't really understand your point. I am discussing SCD tables in the context of CDC sync, which by definition follows the WAL log or the binlog to capture all intermediate states of the records and not just the final one, which can be captured easily with a simple select start query.

Technically this doesn't always work for CDC either - if the syncs are too far apart and the WAL log is reset, we'll do a full refresh and miss any changes between syncs

That's something that should be taken care of by the Airbyte user. Keep the WAL/binlog as long as it is required to avoid data loss.

Some users might be Ok with 'hourly' or 'daily' snapshots and not need every change. If that's the case, they can use dbt to do that easily-ish with snapshots. But, we don't know what the user might want here (frequency, retention period, etc), and will need to build a way to ask this question into the product.

With CDC and SCD tables, you didn't have "to ask this question into the product". The user could choose, how often to sync data and it was guaranteed that all data would be captured in the SCD table (syncs far apart are already addressed).

dimoschi · 2023-09-07T12:28:13Z

Also, let me present a specific use case where SCD was really needed. In our application exist users and licenses and the relationship between them is many-to-many, so another license_user table exists. Now we have many customers that perform seat rotation (we allow that), so every now and then, when a user has finished the training of a license, the admins will remove that user (User A) and add a new one (User B). Now the product wants to know, how often this is happening and by how many users a license seat is utilized during the license duration. The app itself deletes the record from the license_user table once the user is removed from it.

Hope that it makes sense. This is only one use-case that I'm able to provide.

evantahler · 2023-09-07T21:46:22Z

For clarity, I think there are 2 distinct issues being discussed here:

In the final tables, perhaps we should be keeping deleted records. As we send CDC deletes as a 'virtual column' already from sources (_ab_cdc_deleted_at), perhaps we shouldn't be doing any special processing on this column at all, and leave deleted records (albeit with a deleted_at timestamp)
Destinations V2 keeps only the most-recent records in the final table when deduping (expected) and the raw tables. We delete old raw records here. This thread describes use-cases in which keeping the entire history might be useful (and cannot be solved with snapshots). However, this seems like a great use-case for "append" (not "dedupe") syncs - every change to the table will be kept.

I believe #2 is already possible by using an "append" sync. So for clarity, this issue should only focus on part 1 - keeping CDC deleted records.

dimoschi · 2023-09-12T11:20:23Z

@evantahler point 2. seems to be the solution to my problem. Although it is required to set up a new connection, as this is a one-time job it's not a huge deal. Thank you for providing a working solution to us.

As for point 1. even though keeping the deleted records is useful is contradictive to your objective, which was to have a table in the destination as similar as possible to the one in the source. Since point 2. addresses the issue of keeping deleted and intermediate states of records, from my side I'm covered.

nina-j · 2023-09-12T11:36:56Z

@evantahler Thank you for the clarification - we use incremental append and my confusion stemmed from it not being completely clear when records were deleted.

evantahler · 2023-09-12T22:26:42Z

Glad this clears things up, @dimoschi and @nina-j! Please close this issue if you have what you need.

In general, if there is something in our docs (e.g.) or UI that we can add to make what happens in the various sync modes more clear, please let us know!

kiwamizamurai · 2024-05-13T02:44:16Z

The current behavior of Incremental is not able to handle source schema changes yet, for example, when a column is added, renamed or deleted from an existing table etc. It is recommended to trigger a Full refresh - Overwrite to correctly replicate the data to the destination with the new schema changes.
https://docs.airbyte.com/using-airbyte/core-concepts/sync-modes/incremental-append

i think that there is still a problem. we have to do [Full refresh - Overwrite] when the schema changes, thereby deleting the history

evantahler · 2024-05-13T15:24:19Z

I think what you might be looking for @kiwamizamurai is this option to "propagate field changes only" when a schema change occurs.

kiwamizamurai · 2024-05-13T21:57:31Z

@evantahler I see. Thank you!

dimoschi added needs-triage type/enhancement New feature or request labels Sep 6, 2023

octavia-squidington-iii added autoteam community team/use labels Sep 6, 2023

dimoschi mentioned this issue Sep 6, 2023

[Roadmap] Normalization update to 1 table per stream #25194

Closed

evantahler added the team/destinations Destinations team's backlog label Sep 6, 2023

evantahler changed the title ~~Destination V2 Snowflake: Keep deleted records for further processing~~ Destination V2: Keep deleted records for further processing Sep 7, 2023

dimoschi closed this as completed Sep 14, 2023

marcosmarxm removed the needs-triage label May 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Destination V2: Keep deleted records for further processing #30211

Destination V2: Keep deleted records for further processing #30211

dimoschi commented Sep 6, 2023

evantahler commented Sep 6, 2023 •

edited

Loading

nina-j commented Sep 7, 2023

dimoschi commented Sep 7, 2023 •

edited

Loading

dimoschi commented Sep 7, 2023

evantahler commented Sep 7, 2023 •

edited

Loading

dimoschi commented Sep 12, 2023

nina-j commented Sep 12, 2023

evantahler commented Sep 12, 2023

kiwamizamurai commented May 13, 2024 •

edited

Loading

evantahler commented May 13, 2024

kiwamizamurai commented May 13, 2024

Destination V2: Keep deleted records for further processing #30211

Destination V2: Keep deleted records for further processing #30211

Comments

dimoschi commented Sep 6, 2023

What area the feature impact?

Revelant Information

evantahler commented Sep 6, 2023 • edited Loading

nina-j commented Sep 7, 2023

dimoschi commented Sep 7, 2023 • edited Loading

dimoschi commented Sep 7, 2023

evantahler commented Sep 7, 2023 • edited Loading

dimoschi commented Sep 12, 2023

nina-j commented Sep 12, 2023

evantahler commented Sep 12, 2023

kiwamizamurai commented May 13, 2024 • edited Loading

evantahler commented May 13, 2024

kiwamizamurai commented May 13, 2024

evantahler commented Sep 6, 2023 •

edited

Loading

dimoschi commented Sep 7, 2023 •

edited

Loading

evantahler commented Sep 7, 2023 •

edited

Loading

kiwamizamurai commented May 13, 2024 •

edited

Loading