Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Destination V2: Keep deleted records for further processing #30211

Closed
dimoschi opened this issue Sep 6, 2023 · 11 comments
Closed

Destination V2: Keep deleted records for further processing #30211

dimoschi opened this issue Sep 6, 2023 · 11 comments
Labels
autoteam community team/destinations Destinations team's backlog team/use type/enhancement New feature or request

Comments

@dimoschi
Copy link

dimoschi commented Sep 6, 2023

What area the feature impact?

Connectors

Revelant Information

In V2 Destinations after removing the SCD tables, there is no safe way to generate SCD/historical records. In the past SCD tables had the all records, even the ones deleted by the source. Without SCD, those records can only be found in raw tables, which makes it very hard to use them. There are several occasions where deleted records are needed for analytical purposes.

We have several cases and I believe other people have too, where apps keep only transactional data that get deleted the moment they are no longer useful to the app itself. However, the same deleted data could be useful for analytical purposes.

Another point is that using CDC doesn't really make any sense if the intention is to mirror the sources as closely as possible. In that case, a select star from the source would be more than enough.

I think there is definitely a need to have for each record all past states as it gets updated and the final state even after deletion.

@evantahler
Copy link
Contributor

evantahler commented Sep 6, 2023

In the past SCD tables had the all records, even the ones deleted by the source.

This statement was never guaranteed to be true - which is why we removed SCD tables (along with it being underutilized) - it was an inaccurate guarentee.

  • SCD implies "capture all changes for all rows". We can only do this if we are guaranteed, no matter when the sync runs, we get all the changes. This only works for CDC today, and is more of an implementation fluke than design choice. Every other source, depending on when the sync runs, takes a sample of the source at that time, missing whatever changes happened between syncs.
  • Technically this doesn't always work for CDC either - if the syncs are too far apart and the WAL log is reset, we'll do a full refresh and miss any changes between syncs
  • Some users might be Ok with 'hourly' or 'daily' snapshots and not need every change. If that's the case, they can use dbt to do that easily-ish with snapshots. But, we don't know what the user might want here (frequency, retention period, etc), and will need to build a way to ask this question into the product.

@nina-j
Copy link

nina-j commented Sep 7, 2023

I take this as a general discussion of SDC/CDC and not particular to Snowflake - we use BigQuery.

We evaluated Airbyte against Datastream when deciding on new ingestion methods for our datalake, and one important reason for choosing Airbyte was exactly the ability to retain deletion information. Of course, with CDC there's always the WAL lifespan caveat, but my expectation would be for Airbyte to be less opinionated and simply reflect the WAL. Providing guarantees is another matter and in my opinion something that could be mentioned in documentation.

Can you clarify what happens in an incremental append scenario? Would we have _deleted_at columns still available?

@dimoschi
Copy link
Author

dimoschi commented Sep 7, 2023

@evantahler I didn't really understand your point. I am discussing SCD tables in the context of CDC sync, which by definition follows the WAL log or the binlog to capture all intermediate states of the records and not just the final one, which can be captured easily with a simple select start query.

Technically this doesn't always work for CDC either - if the syncs are too far apart and the WAL log is reset, we'll do a full refresh and miss any changes between syncs

That's something that should be taken care of by the Airbyte user. Keep the WAL/binlog as long as it is required to avoid data loss.

Some users might be Ok with 'hourly' or 'daily' snapshots and not need every change. If that's the case, they can use dbt to do that easily-ish with snapshots. But, we don't know what the user might want here (frequency, retention period, etc), and will need to build a way to ask this question into the product.

With CDC and SCD tables, you didn't have "to ask this question into the product". The user could choose, how often to sync data and it was guaranteed that all data would be captured in the SCD table (syncs far apart are already addressed).

@dimoschi
Copy link
Author

dimoschi commented Sep 7, 2023

Also, let me present a specific use case where SCD was really needed. In our application exist users and licenses and the relationship between them is many-to-many, so another license_user table exists. Now we have many customers that perform seat rotation (we allow that), so every now and then, when a user has finished the training of a license, the admins will remove that user (User A) and add a new one (User B). Now the product wants to know, how often this is happening and by how many users a license seat is utilized during the license duration. The app itself deletes the record from the license_user table once the user is removed from it.

Hope that it makes sense. This is only one use-case that I'm able to provide.

@evantahler evantahler changed the title Destination V2 Snowflake: Keep deleted records for further processing Destination V2: Keep deleted records for further processing Sep 7, 2023
@evantahler
Copy link
Contributor

evantahler commented Sep 7, 2023

For clarity, I think there are 2 distinct issues being discussed here:

  1. In the final tables, perhaps we should be keeping deleted records. As we send CDC deletes as a 'virtual column' already from sources (_ab_cdc_deleted_at), perhaps we shouldn't be doing any special processing on this column at all, and leave deleted records (albeit with a deleted_at timestamp)
  2. Destinations V2 keeps only the most-recent records in the final table when deduping (expected) and the raw tables. We delete old raw records here. This thread describes use-cases in which keeping the entire history might be useful (and cannot be solved with snapshots). However, this seems like a great use-case for "append" (not "dedupe") syncs - every change to the table will be kept.

I believe #2 is already possible by using an "append" sync. So for clarity, this issue should only focus on part 1 - keeping CDC deleted records.

@dimoschi
Copy link
Author

@evantahler point 2. seems to be the solution to my problem. Although it is required to set up a new connection, as this is a one-time job it's not a huge deal. Thank you for providing a working solution to us.

As for point 1. even though keeping the deleted records is useful is contradictive to your objective, which was to have a table in the destination as similar as possible to the one in the source. Since point 2. addresses the issue of keeping deleted and intermediate states of records, from my side I'm covered.

@nina-j
Copy link

nina-j commented Sep 12, 2023

@evantahler Thank you for the clarification - we use incremental append and my confusion stemmed from it not being completely clear when records were deleted.

@evantahler
Copy link
Contributor

Glad this clears things up, @dimoschi and @nina-j! Please close this issue if you have what you need.

In general, if there is something in our docs (e.g.) or UI that we can add to make what happens in the various sync modes more clear, please let us know!

@kiwamizamurai
Copy link

kiwamizamurai commented May 13, 2024

The current behavior of Incremental is not able to handle source schema changes yet, for example, when a column is added, renamed or deleted from an existing table etc. It is recommended to trigger a Full refresh - Overwrite to correctly replicate the data to the destination with the new schema changes.
https://docs.airbyte.com/using-airbyte/core-concepts/sync-modes/incremental-append

i think that there is still a problem. we have to do [Full refresh - Overwrite] when the schema changes, thereby deleting the history

@evantahler
Copy link
Contributor

I think what you might be looking for @kiwamizamurai is this option to "propagate field changes only" when a schema change occurs.

Screenshot 2024-05-13 at 8 23 01 AM

@kiwamizamurai
Copy link

@evantahler I see. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
autoteam community team/destinations Destinations team's backlog team/use type/enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants