[docs] Logical replication explore docs sections (#23231)
* logical replication skeleton

* move docs in subsection

* Fix sections

* Fix subsection

* Fix file name

* Fix name

* fix links

* Fix links

* review comments

* page titles

* minor edits

---------

Co-authored-by: Dwight Hodge <ghodge@yugabyte.com>
siddharth2411 and ddhodge committed Jul 17, 2024
1 parent 87fe4b8 commit c212f4b
Showing 21 changed files with 293 additions and 177 deletions.
@@ -1,7 +1,7 @@
---
title: Logical replication in YugabyteDB
headerTitle: CDC - logical replication
linkTitle: CDC - logical replication
title: Change data capture using Logical Replication in YugabyteDB
headerTitle: CDC using Logical Replication
linkTitle: CDC using Logical Replication
description: Learn how YugabyteDB supports asynchronous replication of data changes (inserts, updates, and deletes) to external databases or applications.
headContent: Asynchronous replication of data changes (inserts, updates, and deletes) to external databases or applications
badges: ea
@@ -15,7 +15,7 @@ type: docs

Change data capture (CDC) in YugabyteDB provides technology to ensure that any changes in data due to operations such as inserts, updates, and deletions are identified, captured, and made available for consumption by applications and other tools.

CDC in YugabyteDB is based on the Postgres Logical Replication model. The fundamental concept here is that of the Replication Slot. A Replication Slot represents a stream of changes that can be replayed to the client in the order they were made on the origin server in a manner that preserves transactional consistency. This is the basis for the support for Transactional CDC in YugabyteDB. Where the strict requirements of Transactional CDC are not present, multiple replication slots can be used to stream changes from unrelated tables in parallel.
CDC in YugabyteDB is based on the PostgreSQL Logical Replication model. The fundamental concept is that of the Replication Slot. A Replication Slot represents a stream of changes that can be replayed to the client in the order they were made on the origin server, in a manner that preserves transactional consistency. This is the basis for supporting Transactional CDC in YugabyteDB. Where the strict requirements of Transactional CDC are not present, multiple replication slots can be used to stream changes from unrelated tables in parallel.

## Architecture

@@ -31,50 +31,50 @@ The following are the main components of the Yugabyte CDC solution -

### Data Flow

Logical replication starts by copying a snapshot of the data on the publisher database. Once that is done, changes on the publisher are streamed to the server as they occur in near real time.
Logical replication starts by copying a snapshot of the data on the publisher database. After that is done, changes on the publisher are streamed to the client as they occur in near real time.

To setup Logical Replication, an application will first have to create a replication slot. When a replication slot is created a boundary is established between the snapshot data and the streaming changes. This boundary or consistent_point is a consistent state of the source database. It corresponds to a commit time (HybridTime value). Data from transactions with commit time <= commit time corresponding to the consistent_point are consumed as part of the initial snapshot. Changes from transactions with commit time > commit time of the consistent_point are consumed in the streaming phase in transaction commit time order.
To set up Logical Replication, an application must first create a replication slot. When a replication slot is created, a boundary is established between the snapshot data and the streaming changes. This boundary or `consistent_point` is a consistent state of the source database. It corresponds to a commit time (HybridTime value). Data from transactions with commit time <= commit time corresponding to the `consistent_point` are consumed as part of the initial snapshot. Changes from transactions with commit time greater than the commit time of the `consistent_point` are consumed in the streaming phase in transaction commit time order.
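
For example, a publication and a replication slot can be created from a `ysqlsh` session as follows. This is a minimal sketch: the table, publication, and slot names are placeholders, and the `pgoutput` plugin is assumed to be one of the output plugins supported by your YugabyteDB version.

```sql
-- Publication that determines which tables the slot streams (names are placeholders).
CREATE PUBLICATION my_publication FOR TABLE orders;

-- Creating the slot establishes the consistent_point boundary described above.
SELECT * FROM pg_create_logical_replication_slot('my_slot', 'pgoutput');
```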

#### Initial Snapshot

The initial snapshot data for each table is consumed by executing a corresponding snapshot query (SELECT statement) on that table. This snapshot query should be executed as of the database state corresponding to the consistent_point. This database state is represented by a value of HybridTime. 
The initial snapshot data for each table is consumed by executing a corresponding snapshot query (SELECT statement) on that table. This snapshot query should be executed as of the database state corresponding to the `consistent_point`. This database state is represented by a value of HybridTime.

First, a `SET LOCAL yb_read_time TO <consistent_point commit time> ht` command should be executed on the connection (session). The SELECT statement corresponding to the snapshot query should then be executed as part of the same transaction.
First, a `SET LOCAL yb_read_time TO '<consistent_point commit time> ht'` command should be executed on the connection (session). The SELECT statement corresponding to the snapshot query should then be executed as part of the same transaction.

The HybridTime value to use in the `SET LOCAL yb_read_time` command is the value of the `snapshot_name` field that is returned by the `CREATE_REPLICATION_SLOT` command. Alternatively, it can be obtained by querying the `pg_replication_slots` view.
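
For illustration, snapshot consumption for a single table could look like the following sketch. The table name and the HybridTime value are placeholders; substitute the `snapshot_name` value returned for your slot.

```sql
BEGIN;
-- Placeholder HybridTime value; use the snapshot_name value for your slot.
SET LOCAL yb_read_time TO '1720000000000000 ht';
-- Snapshot query executed in the same transaction, as of the consistent_point.
SELECT * FROM orders;
COMMIT;
```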

During Snapshot consumption, the snapshot data from all tables will be from the same consistent state (consistent_point). At the end of Snapshot consumption, the state of the target system is at/based on the consistent_point. History of the tables as of the consistent_point is retained on the source until the snapshot is consumed.
During Snapshot consumption, the snapshot data from all tables will be from the same consistent state (`consistent_point`). At the end of Snapshot consumption, the state of the target system is at/based on the `consistent_point`. History of the tables as of the `consistent_point` is retained on the source until the snapshot is consumed.

#### Streaming Data Flow

YugabyteDB automatically splits user tables into multiple shards (also called tablets) using either a hash- or range-based strategy. The primary key for each row in the table uniquely identifies the tablet in which the row is located.

Each tablet has its own WAL. The WAL is not in-memory; it is persisted to disk. Each WAL preserves the changes made by transactions on that tablet, as well as additional metadata related to those transactions.

**Step 1 - Data flow from the tablets’ WAL to the VWAL**
**Step 1 - Data flow from the tablet WAL to the VWAL**

![CDCService-VWAL](/images/architecture/cdc_service_vwal_interaction.png)

Each tablet sends changes in transaction commit time order. Further, within a transaction, the changes are in the order in which the operations were performed in the transaction.
Each tablet sends changes in transaction commit time order. Further, within a transaction, the changes are in the order in which the operations were performed.

**Step 2 - Sorting in the VWAL and sending transactions to the Walsender**

![VWAL-Walsender](/images/architecture/vwal_walsender_interaction.png)

VWAL collects changes across multiple tablets, assembles the transactions, assigns LSN to each change and transaction boundary (BEGIN, COMMIT) record and sends the changes to the Walsender in transaction commit time order.
VWAL collects changes across multiple tablets, assembles the transactions, assigns LSN to each change and transaction boundary (BEGIN, COMMIT) record, and sends the changes to the Walsender in transaction commit time order.

**Step 3 - Walsender to client**

Walsender sends changes to the output plugin, which filters them according to the slot's publication and converts them into the client's desired format. These changes are then streamed to the client using the appropriate streaming replication protocols determined by the output plugin. Yugabyte follows the same streaming replication protocols as defined in PostgreSQL.
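
As a sketch of what this looks like on the wire, the following replication-protocol commands (issued over a replication connection, for example one opened with `replication=database`) create a slot and start streaming with the `pgoutput` plugin. The slot and publication names are placeholders, and the exact options depend on the output plugin in use.

```sql
-- Create the slot and begin streaming from the start of the slot's change stream.
CREATE_REPLICATION_SLOT my_slot LOGICAL pgoutput;
START_REPLICATION SLOT my_slot LOGICAL 0/0 (proto_version '1', publication_names 'my_publication');
```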

{{< note title="Note" >}}
<!--TODO (Siddharth): Fix the Link to the protocol section. -->
Please refer to [Replication Protocol](../../../explore/logical-replication/#Streaming-Protocol) section for more details.
Refer to [Replication Protocol](../../../explore/logical-replication/#Streaming-Protocol) for more details.

{{< /note >}}

{{< tip title="Explore" >}}
<!--TODO (Siddharth): Fix the Link to the getting started section. -->
See [Getting Started with Logical Replication](../../../explore/logical-replication/getting-started) in Explore to setup Logical Replication in YugabyteDB.
See [Getting Started with Logical Replication](../../../explore/logical-replication/getting-started) to set up Logical Replication in YugabyteDB.

{{< /tip >}}
@@ -1,7 +1,7 @@
---
title: Change data capture (CDC) in YugabyteDB
headerTitle: CDC - gRPC replication
linkTitle: CDC - gRPC replication
title: Change data capture (CDC) gRPC Replication in YugabyteDB
headerTitle: CDC using gRPC Replication
linkTitle: CDC using gRPC Replication
description: Learn how YugabyteDB supports asynchronous replication of data changes (inserts, updates, and deletes) to external databases or applications.
badges: ea
aliases:
Expand All @@ -14,34 +14,42 @@ menu:
type: docs
---

Change data capture (CDC) in YugabyteDB provides technology to ensure that any changes in data due to operations such as inserts, updates, and deletions are identified, captured, and automatically applied to another data repository instance, or made available for consumption by applications and other tools. CDC provides the following guarantees.

- [Ordering is maintained per-tablet](#per-tablet-ordered-delivery)
- [At-least once delivery](#at-least-once-delivery)
- [No gaps](#no-gaps-in-change-stream)

## Architecture

![Stateless CDC Service](/images/architecture/stateless_cdc_service.png)

Every YB-TServer has a `CDC service` that is stateless. The main APIs provided by the CDC service are the following:

* `createCDCSDKStream` API for creating the stream on the database.
* `getChangesCDCSDK` API that can be used by the client to get the latest set of changes.
- `createCDCSDKStream` API for creating the stream on the database.
- `getChangesCDCSDK` API that can be used by the client to get the latest set of changes.

## CDC streams

Creating a new CDC stream returns a stream UUID. This is facilitated via the [yb-admin](../../../admin/yb-admin/#change-data-capture-cdc-commands) tool.

## Debezium
YugabyteDB automatically splits user tables into multiple shards (also called tablets) using either a hash- or range-based strategy. The primary key for each row in the table uniquely identifies the tablet in which the row is located.

Each tablet has its own WAL file. The WAL is not in-memory; it is persisted to disk. Each WAL preserves the order in which transactions (or changes) happened. The hybrid timestamp, operation ID, and additional metadata about each transaction are also preserved.

![How does CDC work](/images/explore/cdc-overview-work2.png)

YugabyteDB normally purges WAL segments after some period of time. This means that the connector does not have the complete history of all changes that have been made to the database. Therefore, when the connector first connects to a particular YugabyteDB database, it starts by performing a consistent snapshot of each of the database schemas.

The Debezium YugabyteDB connector captures row-level changes in the schemas of a YugabyteDB database. The first time it connects to a YugabyteDB cluster, the connector takes a consistent snapshot of all schemas. After that snapshot is complete, the connector continuously captures row-level changes that insert, update, and delete database content, and that were committed to a YugabyteDB database.

![How does CDC work](/images/explore/cdc-overview-work.png)

The connector produces a change event for every row-level insert, update, and delete operation that was captured, and sends change event records for each table in a separate Kafka topic. Client applications read the Kafka topics that correspond to the database tables of interest, and can react to every row-level event they receive from those topics. For each table, the default behavior is that the connector streams all generated events to a separate Kafka topic for that table. Applications and services consume data change event records from that topic.

The core primitive of CDC is the _stream_. Streams can be enabled and disabled on databases. Every change to a watched database table is emitted as a record in a configurable format to a configurable sink. Streams scale to any YugabyteDB cluster independent of its size and are designed to impact production traffic as little as possible.

To consume the events generated by CDC, Debezium is used as the connector. Debezium is an open-source distributed platform that needs to be pointed at the database using the stream ID. For information on how to set up Debezium for YugabyteDB CDC, see [Debezium integration](../../../integrations/cdc/debezium/).
![How does CDC work](/images/explore/cdc-overview-work3.png)

## Pushing changes to external systems
## CDC guarantees

Using the Debezium connector for YugabyteDB, changes are pushed from YugabyteDB to a Kafka topic, which can then be used by any end-user application for the processing and analysis of the records.
CDC in YugabyteDB provides technology to ensure that any changes in data due to operations (such as inserts, updates, and deletions) are identified, captured, and automatically applied to another data repository instance, or made available for consumption by applications and other tools. CDC provides the following guarantees.

## Per-tablet ordered delivery
### Per-tablet ordered delivery

All data changes for one row or multiple rows in the same tablet are received in the order in which they occur. Due to the distributed nature of the problem, however, there is no guarantee for the order across tablets.

Expand All @@ -53,13 +61,13 @@ Consider the following scenario:

In this case, it is possible for CDC to push the later update corresponding to `row #2` change to Kafka before pushing the earlier update, corresponding to `row #1`.

## At-least-once delivery
### At-least-once delivery

Updates for rows are pushed at least once. With at-least-once delivery, you never lose a message; however, a message might be delivered to a CDC consumer more than once. This can happen in the case of a tablet leader change, where the old leader already pushed changes to Kafka, but the latest pushed `op id` was not updated in the CDC metadata.

For example, a CDC client has received changes for a row at times `t1` and `t3`. It is possible for the client to receive those updates again.

## No gaps in change stream
### No gaps in change stream

When you have received a change for a row for timestamp `t`, you do not receive a previously unseen change for that row from an earlier timestamp. This guarantees that receiving any change implies that all earlier changes have been received for a row.
