Skip to content

Commit

Permalink
improvements to postgres + cloudsql docs (airbytehq#29415)
Browse files Browse the repository at this point in the history
  • Loading branch information
Hesperide authored Aug 14, 2023
1 parent 8d19017 commit 86ba823
Show file tree
Hide file tree
Showing 6 changed files with 196 additions and 28 deletions.
46 changes: 20 additions & 26 deletions docs/integrations/sources/postgres.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@ From your [Airbyte Cloud](https://cloud.airbyte.com/workspaces) or Airbyte Open

To fill out the required information:
1. Enter the hostname, port number, and name for your Postgres database.
2. You may optionally opt to list each of the schemas you want to sync. These are case sensitive, and multiple schemas may be entered. By default, `public` is the only selected schema.
2. You may optionally opt to list each of the schemas you want to sync. These are case-sensitive, and multiple schemas may be entered. By default, `public` is the only selected schema.
3. Enter the username and password you created in [Step 1](#step-1-create-a-dedicated-read-only-postgres-user).
4. Select an SSL mode. You will most frequently choose `require` or `verify-ca`. Both of these always require encryption. `verify-ca` also requires certificates from your Postgres database. See here to learn about other SSL modes and SSH tunneling.
5. Select `Standard (xmin)` from available replication methods. This uses the [xmin system column](#xmin) to reliably replicate data from your database.
Expand Down Expand Up @@ -105,11 +105,11 @@ ALTER USER <user_name> REPLICATION;

#### Step 3: Enable logical replication on your Postgres database

To enable logical replication on bare metal, VMs (EC2/GCE/etc), or Docker, configure the following parameters in the <a href="http://https://www.postgresql.org/docs/current/config-setting.html">postgresql.conf file</a> for your Postgres database:
To enable logical replication on bare metal, VMs (EC2/GCE/etc), or Docker, configure the following parameters in the <a href="https://www.postgresql.org/docs/current/config-setting.html">postgresql.conf file</a> for your Postgres database:

| Parameter | Description | Set value to |
| --------------------- | ------------------------------------------------------------------------------ |------------------------------------------------------------------------------------------------------------------------------------|
| wal_level | Type of coding used within the Postgres write-ahead log | `logical ` |
| Parameter | Description | Set value to |
|-----------------------|--------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|
| wal_level | Type of coding used within the Postgres write-ahead log | `logical ` |
| max_wal_senders | The maximum number of processes used for handling WAL changes | `min: 1` |
| max_replication_slots | The maximum number of replication slots that are allowed to stream WAL changes | `1` (if Airbyte is the only service reading subscribing to WAL changes. More than 1 if other services are also reading from the WAL) |

Expand All @@ -128,31 +128,30 @@ az postgres server restart --resource-group group --name server

#### Step 4: Create a replication slot on your Postgres database

Airbyte requires a replication slot configured only for its use. Only one source should be configured that uses this replication slot. See Setting up CDC for Postgres for instructions.
Airbyte requires a replication slot configured only for its use. Only one source should be configured that uses this replication slot.

For this step, Airbyte requires use of the pgoutput plugin. To create a replication slot called `airbyte_slot` using pgoutput, run:
For this step, Airbyte requires use of the pgoutput plugin. To create a replication slot called `airbyte_slot` using pgoutput, run as the user with the newly granted `REPLICATION` role:

```
SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');
```

The output of this command will include the name of the replication slot to fill into the Airbyte source setup page.

#### Step 5: Create publication and replication identities for each Postgres table

For each table you want to replicate with CDC, add the replication identity (the method of distinguishing between rows) first:
For each table you want to replicate with CDC, follow the steps below:

To use primary keys to distinguish between rows for tables that don't have a large amount of data per row, run:
1. Add the replication identity (the method of distinguishing between rows) for each table you want to replicate:

```
ALTER TABLE tbl1 REPLICA IDENTITY DEFAULT;
```

In case your tables use data types that support [TOAST](https://www.postgresql.org/docs/current/storage-toast.html) and have very large field values, use:

```
ALTER TABLE tbl1 REPLICA IDENTITY FULL;
```
In rare cases, if your tables use data types that support [TOAST](https://www.postgresql.org/docs/current/storage-toast.html) or have very large field values, consider instead using replica identity type full: `
ALTER TABLE tbl1 REPLICA IDENTITY FULL;`.

After setting the replication identity, run:
2. Create the Postgres publication. You should include all tables you want to replicate as part of the publication:

```
CREATE PUBLICATION airbyte_publication FOR TABLE <tbl1, tbl2, tbl3>;`
Expand All @@ -161,11 +160,6 @@ CREATE PUBLICATION airbyte_publication FOR TABLE <tbl1, tbl2, tbl3>;`
The publication name is customizable. Refer to the [Postgres docs](https://www.postgresql.org/docs/10/sql-alterpublication.html) if you need to add or remove tables from your publication in the future.

:::note
You must add the replication identity before creating the publication. Otherwise, `ALTER`/`UPDATE`/`DELETE` statements may fail if Postgres cannot determine how to uniquely identify rows.
Also, the publication should include all the tables and only the tables that need to be synced. Otherwise, data from these tables may not be replicated correctly.
:::

:::warning
The Airbyte UI currently allows selecting any tables for CDC. If a table is selected that is not part of the publication, it will not be replicated even though it is selected. If a table is part of the publication but does not have a replication identity, that replication identity will be created automatically on the first run if the Airbyte user has the necessary permissions.
:::

Expand Down Expand Up @@ -226,14 +220,14 @@ When using an SSH tunnel, you are configuring Airbyte to connect to an intermedi

To connect to a Postgres instance via an SSH tunnel:

1. While [setting up](#setup-guide) the Postgres source connector, from the SSH tunnel dropdown, select:
1. While [setting up](#step-2-create-a-new-postgres-source-in-airbyte-ui) the Postgres source connector, from the SSH tunnel dropdown, select:
- SSH Key Authentication to use a private as your secret for establishing the SSH tunnel
- Password Authentication to use a password as your secret for establishing the SSH Tunnel
2. For **SSH Tunnel Jump Server Host**, enter the hostname or IP address for the intermediate (bastion) server that Airbyte will connect to.
3. For **SSH Connection Port**, enter the port on the bastion server. The default port for SSH connections is 22.
4. For **SSH Login Username**, enter the username to use when connecting to the bastion server. **Note:** This is the operating system username and not the Postgres username.
5. For authentication:
- If you selected **SSH Key Authentication**, set the **SSH Private Key** to the [private Key](#generating-a-private-key) that you are using to create the SSH connection.
- If you selected **SSH Key Authentication**, set the **SSH Private Key** to the [private Key](#generating-a-private-key-for-ssh-tunneling) that you are using to create the SSH connection.
- If you selected **Password Authentication**, enter the password for the operating system user to connect to the bastion server. **Note:** This is the operating system password and not the Postgres password.

#### Generating a private key for SSH Tunneling
Expand All @@ -255,7 +249,7 @@ To see connector limitations, or troubleshoot your Postgres connector, see more
According to Postgres [documentation](https://www.postgresql.org/docs/14/datatype.html), Postgres data types are mapped to the following data types when synchronizing data. You can check the test values examples [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-postgres/src/test-integration/java/io/airbyte/integrations/io/airbyte/integration_tests/sources/PostgresSourceDatatypeTest.java). If you can't find the data type you are looking for or have any problems feel free to add a new test!

| Postgres Type | Resulting Type | Notes |
| ------------------------------------- | -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
|---------------------------------------|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
| `bigint` | number | |
| `bigserial`, `serial8` | number | |
| `bit` | string | Fixed-length bit string (e.g. "0100"). |
Expand Down Expand Up @@ -306,7 +300,7 @@ According to Postgres [documentation](https://www.postgresql.org/docs/14/datatyp
## Changelog

| Version | Date | Pull Request | Subject |
| ------- | ---------- | -------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- |
|---------|------------|----------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 3.1.3 | 2023-08-03 | [28708](https://github.com/airbytehq/airbyte/pull/28708) | Enable checkpointing snapshots in CDC connections |
| 3.1.2 | 2023-08-01 | [28954](https://github.com/airbytehq/airbyte/pull/28954) | Fix an issue that prevented use of tables with names containing uppercase letters |
| 3.1.1 | 2023-07-31 | [28892](https://github.com/airbytehq/airbyte/pull/28892) | Fix an issue that prevented use of cursor columns with names containing uppercase letters |
Expand Down Expand Up @@ -409,13 +403,13 @@ According to Postgres [documentation](https://www.postgresql.org/docs/14/datatyp
| 0.4.43 | 2022-08-03 | [15226](https://github.com/airbytehq/airbyte/pull/15226) | Make connectionTimeoutMs configurable through JDBC url parameters |
| 0.4.42 | 2022-08-03 | [15273](https://github.com/airbytehq/airbyte/pull/15273) | Fix a bug in `0.4.36` and correctly parse the CDC initial record waiting time |
| 0.4.41 | 2022-08-03 | [15077](https://github.com/airbytehq/airbyte/pull/15077) | Sync data from beginning if the LSN is no longer valid in CDC |
| | 2022-08-03 | [14903](https://github.com/airbytehq/airbyte/pull/14903) | Emit state messages more frequently (⛔ this version has a bug; use `1.0.1` instead |
| | 2022-08-03 | [14903](https://github.com/airbytehq/airbyte/pull/14903) | Emit state messages more frequently (⛔ this version has a bug; use `1.0.1` instead |
| 0.4.40 | 2022-08-03 | [15187](https://github.com/airbytehq/airbyte/pull/15187) | Add support for BCE dates/timestamps |
| | 2022-08-03 | [14534](https://github.com/airbytehq/airbyte/pull/14534) | Align regular and CDC integration tests and data mappers |
| 0.4.39 | 2022-08-02 | [14801](https://github.com/airbytehq/airbyte/pull/14801) | Fix multiple log bindings |
| 0.4.38 | 2022-07-26 | [14362](https://github.com/airbytehq/airbyte/pull/14362) | Integral columns are now discovered as int64 fields. |
| 0.4.37 | 2022-07-22 | [14714](https://github.com/airbytehq/airbyte/pull/14714) | Clarified error message when invalid cursor column selected |
| 0.4.36 | 2022-07-21 | [14451](https://github.com/airbytehq/airbyte/pull/14451) | Make initial CDC waiting time configurable (⛔ this version has a bug and will not work; use `0.4.42` instead) | |
| 0.4.36 | 2022-07-21 | [14451](https://github.com/airbytehq/airbyte/pull/14451) | Make initial CDC waiting time configurable (⛔ this version has a bug and will not work; use `0.4.42` instead) | |
| 0.4.35 | 2022-07-14 | [14574](https://github.com/airbytehq/airbyte/pull/14574) | Removed additionalProperties:false from JDBC source connectors |
| 0.4.34 | 2022-07-17 | [13840](https://github.com/airbytehq/airbyte/pull/13840) | Added the ability to connect using different SSL modes and SSL certificates. |
| 0.4.33 | 2022-07-14 | [14586](https://github.com/airbytehq/airbyte/pull/14586) | Validate source JDBC url parameters |
Expand Down
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit 86ba823

Please sign in to comment.