Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CDCSDK] Provide support to read old image of the changed Column #9352

Closed
suranjan opened this issue Jul 19, 2021 · 2 comments
Closed

[CDCSDK] Provide support to read old image of the changed Column #9352

suranjan opened this issue Jul 19, 2021 · 2 comments
Labels
area/cdcsdk CDC SDK kind/bug This issue is a bug priority/medium Medium priority issue
Projects

Comments

@suranjan
Copy link
Contributor

suranjan commented Jul 19, 2021

Jira Link: DB-3519
For the CDC changes, there is a requirement to provide the old image as part of the CDCRecrod.

We need to finalize the approach and decide on the priority.

@suranjan suranjan created this issue from a note in CDC (To do) Jul 19, 2021
@suranjan suranjan added the area/cdcsdk CDC SDK label Jul 19, 2021
@vaibhav-yb vaibhav-yb changed the title Provide support to read old image of the changed Column [CDCSDK] Provide support to read old image of the changed Column Jan 12, 2022
@FeixiangZhao
Copy link

FeixiangZhao commented Jul 11, 2022

In PostgreSQL, there is a table option REPLICA IDENTITY.

This form changes the information which is written to the write-ahead log to identify rows which are updated or deleted. In most cases, the old value of each column is only logged if it differs from the new value; however, if the old value is stored externally, it is always logged regardless of whether it changed. This option has no effect except when logical replication is in use.

DEFAULT
Records the old values of the columns of the primary key, if any. This is the default for non-system tables.

USING INDEX index_name
Records the old values of the columns covered by the named index, that must be unique, not partial, not deferrable, and include only columns marked NOT NULL. If this index is dropped, the behavior is the same as NOTHING.

FULL
Records the old values of all columns in the row.

NOTHING
Records no information about the old row. This is the default for system tables.

By default, the PostgreSQL will send the updated data and primary key to the CDC event.
If we use the ALTER TABLE command to change the table option to FULL. It will send all the columns to the CDC event.
Just as:

=> ALTER TABLE public.payout_order REPLICA IDENTITY FULL;
ALTER TABLE

But at the moment, if we execute the same command in the Yugabyte DB, it will back an error message:

2022-07-08_07:54:51=# select version();
                                                                                         version
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.2-YB-2.13.2.0-b0 on x86_64-pc-linux-gnu, compiled by clang version 12.0.1 (https://github.com/yugabyte/llvm-project.git bdb147e675d8c87cee72cc1f87c4b82855977d94), 64-bit
(1 row)

2022-07-08_07:54:58=#
2022-07-08_07:54:59=# ALTER TABLE ONLY public.sub_account_override REPLICA IDENTITY FULL;
ERROR:  ALTER TABLE REPLICA IDENTITY not supported yet
LINE 1: ALTER TABLE ONLY public.sub_account_override REPLICA IDENTIT...
                                                     ^
HINT:  See https://github.com/YugaByte/yugabyte-db/issues/1124. Click '+' on the description to raise its priority

So it means that the Yugabyte DB does not support setting the REPLICA IDENTITY to FULL, not supporting sending all columns to the CDC event. In some case, the application both need the updated column, but also need the other column for the application.

We hope the YugabyteDB also can support the REPLICA IDENTITY to FULL. It will be the same future as PostgreSQL.

@aaron-wang-awx
Copy link

aaron-wang-awx commented Jul 13, 2022

In terms of CDC applications, there are basically two types.

  1. Use CDC for synchronization between multiple databases, such as backups between two PostgreSQL databases or data synchronization between two Yugabyte clusters. In this scenario, since the database stores the state prior to each record, sending only primary key and partially modified data can meet the requirement for field modification.

  2. Use CDC data for data analysis and application. We mostly use CDC for data application scenarios. In this scenario, the downstream does not store the complete database state, but analyzes and processes each piece of data. In this case, if the data contains only the modified fields, the data application requirements cannot be met. Restoring complete fields in downstream data applications requires storing database state in memory or some external storage, which increases downstream application complexity and resource consumption.

For example, if there is a table which size is 1TB, if we want to recovery the columns in downstream data application, we need to store the table state in memory and this is impossible. Otherwise, the application need to get the full columns from a database by primary key each time when a new CDC message comes in.

Therefore, we hope Yugabyte can support sending full fields. Although this will add some network IO since the data size will increase, but for the second application scenario, the whole data processing pipeline will be relatively simple.

@suranjan suranjan moved this from To do to In progress in CDC Sep 15, 2022
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Sep 15, 2022
@yugabyte-ci yugabyte-ci added status/awaiting-triage Issue awaiting triage and removed status/awaiting-triage Issue awaiting triage labels Nov 8, 2022
CDC automation moved this from In progress to Done Nov 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cdcsdk CDC SDK kind/bug This issue is a bug priority/medium Medium priority issue
Projects
CDC
  
Done
Development

No branches or pull requests

5 participants