Destination V2: Keep deleted records for further processing #30211
Comments
This statement was never guaranteed to be true, which is why we removed SCD tables (along with them being underutilized). It was an inaccurate guarantee.
I take this as a general discussion of SCD/CDC and not particular to Snowflake - we use BigQuery. We evaluated Airbyte against Datastream when deciding on new ingestion methods for our data lake, and one important reason for choosing Airbyte was exactly the ability to retain deletion information. Of course, with CDC there's always the WAL lifespan caveat, but my expectation would be for Airbyte to be less opinionated and simply reflect the WAL. Providing guarantees is another matter and, in my opinion, something that could be mentioned in the documentation. Can you clarify what happens in an incremental append scenario? Would we have
@evantahler I didn't really understand your point. I am discussing SCD tables in the context of CDC sync, which by definition follows the WAL or the binlog to capture all intermediate states of the records and not just the final one, which can be captured easily with a simple select star query.
That's something that should be taken care of by the Airbyte user. Keep the WAL/binlog as long as it is required to avoid data loss.
With CDC and SCD tables, you didn't have "to ask this question into the product". The user could choose how often to sync data, and it was guaranteed that all data would be captured in the SCD table (syncs far apart are already addressed).
Also, let me present a specific use case where SCD was really needed. In our application exist Hope that it makes sense. This is only one use-case that I'm able to provide.
For clarity, I think there are 2 distinct issues being discussed here:
I believe #2 is already possible by using an "append" sync. So for clarity, this issue should only focus on part 1 - keeping CDC deleted records.
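To make the "append" suggestion concrete, here is a minimal sketch of deriving SCD-style history rows from an append-only table. It assumes each appended row carries a primary key, a change timestamp, and a deletion timestamp; the column names `_ab_cdc_updated_at` and `_ab_cdc_deleted_at` are assumed here and may differ per connector, and the function `build_scd2` and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def build_scd2(rows, key="id",
               updated="_ab_cdc_updated_at",    # assumed column name
               deleted="_ab_cdc_deleted_at"):   # assumed column name
    """Turn append-only change rows into history rows with validity ranges."""
    history = []
    # Order by primary key, then by change time (ISO-8601 strings sort correctly).
    ordered = sorted(rows, key=itemgetter(key, updated))
    for _, group in groupby(ordered, key=itemgetter(key)):
        versions = list(group)
        # Pair each version with the next one; the last version stays open-ended.
        for current, nxt in zip(versions, versions[1:] + [None]):
            history.append({
                **current,
                "valid_from": current[updated],
                "valid_to": nxt[updated] if nxt else None,
                "is_deleted": current.get(deleted) is not None,
            })
    return history


# Hypothetical rows as an append sync might land them.
rows = [
    {"id": 1, "status": "open", "_ab_cdc_updated_at": "2023-09-01T10:00:00Z", "_ab_cdc_deleted_at": None},
    {"id": 1, "status": "paid", "_ab_cdc_updated_at": "2023-09-01T11:00:00Z", "_ab_cdc_deleted_at": None},
    {"id": 1, "status": "paid", "_ab_cdc_updated_at": "2023-09-01T12:00:00Z", "_ab_cdc_deleted_at": "2023-09-01T12:00:00Z"},
    {"id": 2, "status": "open", "_ab_cdc_updated_at": "2023-09-01T10:30:00Z", "_ab_cdc_deleted_at": None},
]

scd = build_scd2(rows)
for row in scd:
    print(row["id"], row["status"], row["valid_from"], row["valid_to"], row["is_deleted"])
```

The same logic would normally be a window function in the warehouse; the Python version is only meant to show that the intermediate states and the final deleted state are both recoverable from appended rows.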
@evantahler point 2 seems to be the solution to my problem. Although it requires setting up a new connection, this is a one-time job, so it's not a huge deal. Thank you for providing a working solution. As for point 1, even though keeping the deleted records is useful, it contradicts your objective, which was to have a table in the destination as similar as possible to the one in the source. Since point 2 addresses the issue of keeping deleted and intermediate states of records, I'm covered from my side.
@evantahler Thank you for the clarification - we use incremental append and my confusion stemmed from it not being completely clear when records were deleted.
I think that there is still a problem. We have to do [Full refresh - Overwrite] when the schema changes, thereby deleting the history.
I think what you might be looking for @kiwamizamurai is this option to "propagate field changes only" when a schema change occurs.
@evantahler I see. Thank you!
What area does the feature impact?
Connectors
Relevant Information
In V2 Destinations, after the removal of the SCD tables, there is no safe way to generate SCD/historical records. In the past, SCD tables had all the records, even the ones deleted by the source. Without SCD, those records can only be found in the raw tables, which makes them very hard to use. There are several occasions where deleted records are needed for analytical purposes.
We have several cases, and I believe other people have too, where apps keep only transactional data that gets deleted the moment it is no longer useful to the app itself. However, the same deleted data could be useful for analytical purposes.
Another point is that using CDC doesn't really make any sense if the intention is to mirror the sources as closely as possible. In that case, a select star from the source would be more than enough.
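A toy illustration of this point, assuming a hypothetical change log for a single sync interval: replaying the log the way a `select *` would see the table at the end keeps only the surviving rows, while a log-based capture keeps every event, including the delete. The names `apply_event`, `snapshot_sync`, and `cdc_sync` are made up for this sketch and do not correspond to any Airbyte API.

```python
def apply_event(state, event):
    """Apply one change event to an in-memory table keyed by primary key."""
    op, pk, value = event
    if op == "delete":
        state.pop(pk, None)
    else:  # insert or update
        state[pk] = value
    return state


# Hypothetical changes that happen in the source between two syncs.
change_log = [
    ("insert", 1, "cart created"),
    ("update", 1, "cart checked out"),
    ("delete", 1, None),
    ("insert", 2, "cart created"),
]


def snapshot_sync(log):
    """What a select-star sync would see: only the state at sync time."""
    state = {}
    for event in log:
        apply_event(state, event)
    return state


def cdc_sync(log):
    """What a log-based sync can capture: every event, in order."""
    return list(log)


snapshot = snapshot_sync(change_log)
captured = cdc_sync(change_log)
print(snapshot)       # row 1 is gone entirely
print(len(captured))  # all four events are retained
```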
I think there is definitely a need to keep, for each record, all past states as it gets updated, as well as the final state even after deletion.