diff --git a/docs/integrations/sources/postgres.md b/docs/integrations/sources/postgres.md
index 1af4e013d05e..976f5b653052 100644
--- a/docs/integrations/sources/postgres.md
+++ b/docs/integrations/sources/postgres.md
@@ -44,7 +44,7 @@ From your [Airbyte Cloud](https://cloud.airbyte.com/workspaces) or Airbyte Open

 To fill out the required information:
 1. Enter the hostname, port number, and name for your Postgres database.
-2. You may optionally opt to list each of the schemas you want to sync. These are case sensitive, and multiple schemas may be entered. By default, `public` is the only selected schema.
+2. You may optionally opt to list each of the schemas you want to sync. These are case-sensitive, and multiple schemas may be entered. By default, `public` is the only selected schema.
 3. Enter the username and password you created in [Step 1](#step-1-create-a-dedicated-read-only-postgres-user).
 4. Select an SSL mode. You will most frequently choose `require` or `verify-ca`. Both of these always require encryption. `verify-ca` also requires certificates from your Postgres database. See here to learn about other SSL modes and SSH tunneling.
 5. Select `Standard (xmin)` from available replication methods. This uses the [xmin system column](#xmin) to reliably replicate data from your database.
@@ -105,11 +105,11 @@ ALTER USER REPLICATION;

 #### Step 3: Enable logical replication on your Postgres database

-To enable logical replication on bare metal, VMs (EC2/GCE/etc), or Docker, configure the following parameters in the postgresql.conf file for your Postgres database:
+To enable logical replication on bare metal, VMs (EC2/GCE/etc), or Docker, configure the following parameters in the postgresql.conf file for your Postgres database:

-| Parameter | Description | Set value to |
-| --------------------- | ------------------------------------------------------------------------------ |------------------------------------------------------------------------------------------------------------------------------------|
-| wal_level | Type of coding used within the Postgres write-ahead log | `logical ` |
+| Parameter | Description | Set value to |
+|-----------------------|--------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|
+| wal_level | Type of coding used within the Postgres write-ahead log | `logical ` |
 | max_wal_senders | The maximum number of processes used for handling WAL changes | `min: 1` |
 | max_replication_slots | The maximum number of replication slots that are allowed to stream WAL changes | `1` (if Airbyte is the only service reading subscribing to WAL changes. More than 1 if other services are also reading from the WAL) |

@@ -128,31 +128,30 @@ az postgres server restart --resource-group group --name server

 #### Step 4: Create a replication slot on your Postgres database

-Airbyte requires a replication slot configured only for its use. Only one source should be configured that uses this replication slot. See Setting up CDC for Postgres for instructions.
+Airbyte requires a replication slot configured only for its use. Only one source should be configured that uses this replication slot.
-For this step, Airbyte requires use of the pgoutput plugin. To create a replication slot called `airbyte_slot` using pgoutput, run:
+For this step, Airbyte requires use of the pgoutput plugin. To create a replication slot called `airbyte_slot` using pgoutput, run as the user with the newly granted `REPLICATION` role:

 ```
 SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');
 ```

+The output of this command will include the name of the replication slot to fill into the Airbyte source setup page.
+
 #### Step 5: Create publication and replication identities for each Postgres table

-For each table you want to replicate with CDC, add the replication identity (the method of distinguishing between rows) first:
+For each table you want to replicate with CDC, follow the steps below:

-To use primary keys to distinguish between rows for tables that don't have a large amount of data per row, run:
+1. Add the replication identity (the method of distinguishing between rows) for each table you want to replicate:

 ```
 ALTER TABLE tbl1 REPLICA IDENTITY DEFAULT;
 ```

-In case your tables use data types that support [TOAST](https://www.postgresql.org/docs/current/storage-toast.html) and have very large field values, use:
-
-```
-ALTER TABLE tbl1 REPLICA IDENTITY FULL;
-```
+In rare cases, if your tables use data types that support [TOAST](https://www.postgresql.org/docs/current/storage-toast.html) or have very large field values, consider instead using replica identity type full: `
+ALTER TABLE tbl1 REPLICA IDENTITY FULL;`.

-After setting the replication identity, run:
+2. Create the Postgres publication. You should include all tables you want to replicate as part of the publication:

 ```
 CREATE PUBLICATION airbyte_publication FOR TABLE ;`
@@ -161,11 +160,6 @@ CREATE PUBLICATION airbyte_publication FOR TABLE ;`

 The publication name is customizable.
Refer to the [Postgres docs](https://www.postgresql.org/docs/10/sql-alterpublication.html) if you need to add or remove tables from your publication in the future.

 :::note
-You must add the replication identity before creating the publication. Otherwise, `ALTER`/`UPDATE`/`DELETE` statements may fail if Postgres cannot determine how to uniquely identify rows.
-Also, the publication should include all the tables and only the tables that need to be synced. Otherwise, data from these tables may not be replicated correctly.
-:::
-
-:::warning
 The Airbyte UI currently allows selecting any tables for CDC. If a table is selected that is not part of the publication, it will not be replicated even though it is selected. If a table is part of the publication but does not have a replication identity, that replication identity will be created automatically on the first run if the Airbyte user has the necessary permissions.
 :::

@@ -226,14 +220,14 @@ When using an SSH tunnel, you are configuring Airbyte to connect to an intermedi

 To connect to a Postgres instance via an SSH tunnel:

-1. While [setting up](#setup-guide) the Postgres source connector, from the SSH tunnel dropdown, select:
+1. While [setting up](#step-2-create-a-new-postgres-source-in-airbyte-ui) the Postgres source connector, from the SSH tunnel dropdown, select:
    - SSH Key Authentication to use a private as your secret for establishing the SSH tunnel
    - Password Authentication to use a password as your secret for establishing the SSH Tunnel
 2. For **SSH Tunnel Jump Server Host**, enter the hostname or IP address for the intermediate (bastion) server that Airbyte will connect to.
 3. For **SSH Connection Port**, enter the port on the bastion server. The default port for SSH connections is 22.
 4. For **SSH Login Username**, enter the username to use when connecting to the bastion server. **Note:** This is the operating system username and not the Postgres username.
 5.
For authentication:
-   - If you selected **SSH Key Authentication**, set the **SSH Private Key** to the [private Key](#generating-a-private-key​) that you are using to create the SSH connection.
+   - If you selected **SSH Key Authentication**, set the **SSH Private Key** to the [private Key](#generating-a-private-key-for-ssh-tunneling) that you are using to create the SSH connection.
    - If you selected **Password Authentication**, enter the password for the operating system user to connect to the bastion server. **Note:** This is the operating system password and not the Postgres password.

 #### Generating a private key for SSH Tunneling

@@ -255,7 +249,7 @@ To see connector limitations, or troubleshoot your Postgres connector, see more
 According to Postgres [documentation](https://www.postgresql.org/docs/14/datatype.html), Postgres data types are mapped to the following data types when synchronizing data. You can check the test values examples [here](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-postgres/src/test-integration/java/io/airbyte/integrations/io/airbyte/integration_tests/sources/PostgresSourceDatatypeTest.java). If you can't find the data type you are looking for or have any problems feel free to add a new test!

 | Postgres Type | Resulting Type | Notes |
-| ------------------------------------- | -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
+|---------------------------------------|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
 | `bigint` | number | |
 | `bigserial`, `serial8` | number | |
 | `bit` | string | Fixed-length bit string (e.g. "0100").
|
@@ -306,7 +300,7 @@ According to Postgres [documentation](https://www.postgresql.org/docs/14/datatyp
 ## Changelog

 | Version | Date | Pull Request | Subject |
-| ------- | ---------- | -------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --- |
+|---------|------------|----------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | 3.1.3 | 2023-08-03 | [28708](https://github.com/airbytehq/airbyte/pull/28708) | Enable checkpointing snapshots in CDC connections |
 | 3.1.2 | 2023-08-01 | [28954](https://github.com/airbytehq/airbyte/pull/28954) | Fix an issue that prevented use of tables with names containing uppercase letters |
 | 3.1.1 | 2023-07-31 | [28892](https://github.com/airbytehq/airbyte/pull/28892) | Fix an issue that prevented use of cursor columns with names containing uppercase letters |
@@ -409,13 +403,13 @@ According to Postgres [documentation](https://www.postgresql.org/docs/14/datatyp
 | 0.4.43 | 2022-08-03 | [15226](https://github.com/airbytehq/airbyte/pull/15226) | Make connectionTimeoutMs configurable through JDBC url parameters |
 | 0.4.42 | 2022-08-03 | [15273](https://github.com/airbytehq/airbyte/pull/15273) | Fix a bug in `0.4.36` and correctly parse the CDC initial record waiting time |
 | 0.4.41 | 2022-08-03 | [15077](https://github.com/airbytehq/airbyte/pull/15077) | Sync data from beginning if the LSN is no longer valid in CDC |
-| | 2022-08-03 | [14903](https://github.com/airbytehq/airbyte/pull/14903) | Emit state messages more frequently (⛔ this version has a bug; use `1.0.1` instead |
+| | 2022-08-03 | [14903](https://github.com/airbytehq/airbyte/pull/14903) | Emit state messages more frequently (⛔
this version has a bug; use `1.0.1` instead |
 | 0.4.40 | 2022-08-03 | [15187](https://github.com/airbytehq/airbyte/pull/15187) | Add support for BCE dates/timestamps |
 | | 2022-08-03 | [14534](https://github.com/airbytehq/airbyte/pull/14534) | Align regular and CDC integration tests and data mappers |
 | 0.4.39 | 2022-08-02 | [14801](https://github.com/airbytehq/airbyte/pull/14801) | Fix multiple log bindings |
 | 0.4.38 | 2022-07-26 | [14362](https://github.com/airbytehq/airbyte/pull/14362) | Integral columns are now discovered as int64 fields. |
 | 0.4.37 | 2022-07-22 | [14714](https://github.com/airbytehq/airbyte/pull/14714) | Clarified error message when invalid cursor column selected |
-| 0.4.36 | 2022-07-21 | [14451](https://github.com/airbytehq/airbyte/pull/14451) | Make initial CDC waiting time configurable (⛔ this version has a bug and will not work; use `0.4.42` instead) | |
+| 0.4.36 | 2022-07-21 | [14451](https://github.com/airbytehq/airbyte/pull/14451) | Make initial CDC waiting time configurable (⛔ this version has a bug and will not work; use `0.4.42` instead) | |
 | 0.4.35 | 2022-07-14 | [14574](https://github.com/airbytehq/airbyte/pull/14574) | Removed additionalProperties:false from JDBC source connectors |
 | 0.4.34 | 2022-07-17 | [13840](https://github.com/airbytehq/airbyte/pull/13840) | Added the ability to connect using different SSL modes and SSL certificates.
|
 | 0.4.33 | 2022-07-14 | [14586](https://github.com/airbytehq/airbyte/pull/14586) | Validate source JDBC url parameters |
diff --git a/docs/integrations/sources/postgres/assets/airbyte_cloud_sql_postgres_add_network.png b/docs/integrations/sources/postgres/assets/airbyte_cloud_sql_postgres_add_network.png
new file mode 100644
index 000000000000..134273ddbd5c
Binary files /dev/null and b/docs/integrations/sources/postgres/assets/airbyte_cloud_sql_postgres_add_network.png differ
diff --git a/docs/integrations/sources/postgres/assets/airbyte_cloud_sql_postgres_db.png b/docs/integrations/sources/postgres/assets/airbyte_cloud_sql_postgres_db.png
new file mode 100644
index 000000000000..817da0a78532
Binary files /dev/null and b/docs/integrations/sources/postgres/assets/airbyte_cloud_sql_postgres_db.png differ
diff --git a/docs/integrations/sources/postgres/assets/airbyte_cloud_sql_postgres_logical_replication_flag.png b/docs/integrations/sources/postgres/assets/airbyte_cloud_sql_postgres_logical_replication_flag.png
new file mode 100644
index 000000000000..8372afc93b8e
Binary files /dev/null and b/docs/integrations/sources/postgres/assets/airbyte_cloud_sql_postgres_logical_replication_flag.png differ
diff --git a/docs/integrations/sources/postgres/cloud-sql-postgres.md b/docs/integrations/sources/postgres/cloud-sql-postgres.md
new file mode 100644
index 000000000000..0dd9bcf5ee3e
--- /dev/null
+++ b/docs/integrations/sources/postgres/cloud-sql-postgres.md
@@ -0,0 +1,169 @@
+# Cloud SQL for PostgreSQL
+
+Airbyte's certified Postgres connector offers the following features:
+* Multiple methods of keeping your data fresh, including [Change Data Capture (CDC)](https://docs.airbyte.com/understanding-airbyte/cdc) and replication using the [xmin system column](https://docs.airbyte.com/integrations/sources/postgres#xmin).
+* All available [sync modes](https://docs.airbyte.com/cloud/core-concepts#connection-sync-modes), providing flexibility in how data is delivered to your destination.
+* Reliable replication at any table size with [checkpointing](https://docs.airbyte.com/understanding-airbyte/airbyte-protocol/#state--checkpointing) and chunking of database reads.
+
+![Airbyte Postgres Connection](https://raw.githubusercontent.com/airbytehq/airbyte/c078e8ed6703020a584d9362efa5665fbe8db77f/docs/integrations/sources/postgres/assets/airbyte_postgres_source.png?raw=true)
+
+## Quick Start
+
+![Cloud SQL for PostgreSQL](./assets/airbyte_cloud_sql_postgres_db.png)
+
+Here is an outline of the minimum required steps to configure a connection to Postgres on Google Cloud SQL:
+1. Create a dedicated read-only Postgres user with permissions for replicating data
+2. Create a new Postgres source in the Airbyte UI using the `xmin` system column
+3. (Airbyte Cloud Only) Allow inbound traffic from Airbyte IPs
+
+Once this is complete, you will be able to select Postgres as a source for replicating data.
+
+#### Step 1: Create a dedicated read-only Postgres user
+
+These steps create a dedicated read-only user for replicating data. Alternatively, you can use an existing Postgres user in your database. To create a user, first [connect to your database](https://cloud.google.com/sql/docs/postgres/connect-overview#external-connection-methods). If you are getting started, you can use [Cloud Shell to connect directly from the UI](https://cloud.google.com/sql/docs/postgres/connect-instance-cloud-shell).
+
+The following command creates a new user:
+
+```sql
+CREATE USER <user_name> PASSWORD 'your_password_here';
+```
+
+Now, provide this user with read-only access to relevant schemas and tables. Re-run these commands for each schema you expect to replicate data from (e.g.
`public`):
+
+```sql
+GRANT USAGE ON SCHEMA <schema_name> TO <user_name>;
+GRANT SELECT ON ALL TABLES IN SCHEMA <schema_name> TO <user_name>;
+ALTER DEFAULT PRIVILEGES IN SCHEMA <schema_name> GRANT SELECT ON TABLES TO <user_name>;
+```
+
+#### Step 2: Create a new Postgres source in Airbyte UI
+
+From your [Airbyte Cloud](https://cloud.airbyte.com/workspaces) or Airbyte Open Source account, select `Sources` from the left navigation bar, search for `Postgres`, then create a new Postgres source.
+
+![Create an Airbyte source](https://github.com/airbytehq/airbyte/blob/c078e8ed6703020a584d9362efa5665fbe8db77f/docs/integrations/sources/postgres/assets/airbyte_source_selection.png?raw=true)
+
+To fill out the required information:
+1. Enter the hostname, port number, and name for your Postgres database.
+2. Optionally, list each of the schemas you want to sync. Schema names are case-sensitive, and multiple schemas may be entered. By default, `public` is the only selected schema.
+3. Enter the username and password you created in [Step 1](#step-1-create-a-dedicated-read-only-postgres-user).
+4. Select an SSL mode. You will most frequently choose `require` or `verify-ca`. Both of these always require encryption. `verify-ca` also requires certificates from your Postgres database. See here to learn about other SSL modes and SSH tunneling.
+5. Select `Standard (xmin)` from available replication methods. This uses the [xmin system column](https://docs.airbyte.com/integrations/sources/postgres#xmin) to reliably replicate data from your database.
+    1. If your database is particularly large (> 500 GB), you will benefit from [configuring your Postgres source using logical replication (CDC)](https://docs.airbyte.com/integrations/sources/postgres#cdc).
+
+#### Step 3: (Airbyte Cloud Only) Allow inbound traffic from Airbyte IPs
+
+If you are on Airbyte Cloud, you will always need to modify your database configuration to allow inbound traffic from Airbyte IPs. To allowlist IPs in Cloud SQL:
+1.
In your Google Cloud SQL database dashboard, select `Connections` from the left menu. Then, select `Add Network` under the `Connectivity` section.
+
+![Add a Network](./assets/airbyte_cloud_sql_postgres_add_network.png)
+
+2. Add a new network, and enter Airbyte's IPs:
+
+```text
+34.106.109.131
+34.106.196.165
+34.106.60.246
+34.106.229.69
+34.106.127.139
+34.106.218.58
+34.106.115.240
+34.106.225.141
+13.37.4.46
+13.37.142.60
+35.181.124.238
+```
+
+Now, click `Set up source` in the Airbyte UI. Airbyte then tests the connection to your database. Once this succeeds, you've configured an Airbyte Postgres source!
+
+## Advanced Configuration
+
+### Setup using CDC
+
+Airbyte uses [logical replication](https://www.postgresql.org/docs/10/logical-replication.html) of the Postgres write-ahead log (WAL) to incrementally capture deletes using a replication plugin:
+* See [here](https://docs.airbyte.com/understanding-airbyte/cdc) to learn more on how Airbyte implements CDC.
+* See [here](https://docs.airbyte.com/integrations/sources/postgres/postgres-troubleshooting#cdc-requirements) to learn more about Postgres CDC requirements and limitations.
+
+We recommend configuring your Postgres source with CDC when:
+- You need a record of deletions.
+- You have a very large database (500 GB or more).
+- Your table has a primary key but doesn't have a reasonable cursor field for incremental syncing (such as `updated_at`).
+
+These are the additional steps required (after following the [quick start](#quick-start)) to configure your Postgres source using CDC:
+1. Provide additional `REPLICATION` permissions to the read-only user
+2. Enable logical replication on your Postgres database
+3. Create a replication slot on your Postgres database
+4. Create publication and replication identities for each Postgres table
+5.
Enable CDC replication in the Airbyte UI
+
+#### Step 1: Prepopulate your Postgres source configuration
+
+We recommend following the steps in the [quick start](#quick-start) section to confirm that Airbyte can connect to your Postgres database prior to configuring CDC settings.
+
+For CDC, you must connect to primary/master databases. Pointing the connector configuration to replica database hosts for CDC will lead to failures.
+
+#### Step 2: Provide additional permissions to read-only user
+
+To configure CDC for the Postgres source connector, grant `REPLICATION` permissions to the user created in [step 1 of the quick start](#step-1-create-a-dedicated-read-only-postgres-user):
+```
+ALTER USER <user_name> REPLICATION;
+```
+
+#### Step 3: Enable logical replication on your Postgres database
+
+To enable logical replication on Cloud SQL for Postgres, set the `cloudsql.logical_decoding` flag to `on`. You can find the `Flags` section in the `Edit Configuration` view of your database:
+
+![Enable Logical Decoding](./assets/airbyte_cloud_sql_postgres_logical_replication_flag.png)
+
+#### Step 4: Create a replication slot on your Postgres database
+
+Airbyte requires a replication slot configured only for its use. Only one source should be configured that uses this replication slot.
+
+For this step, Airbyte requires use of the pgoutput plugin. To create a replication slot called `airbyte_slot` using pgoutput, provide the instance superuser (default `postgres`) with `REPLICATION` permissions, and run the following:
+
+```
+ALTER USER postgres WITH REPLICATION;
+SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');
+```
+
+The output of this command will include the name of the replication slot to fill into the Airbyte source setup page.
+
+#### Step 5: Create publication and replication identities for each Postgres table
+
+For each table you want to replicate with CDC, follow the steps below:
+
+1.
Add the replication identity (the method of distinguishing between rows) for each table you want to replicate:
+
+```
+ALTER TABLE tbl1 REPLICA IDENTITY DEFAULT;
+```
+
+In rare cases, if your tables use data types that support [TOAST](https://www.postgresql.org/docs/current/storage-toast.html) or have very large field values, consider instead using replica identity type full: `
+ALTER TABLE tbl1 REPLICA IDENTITY FULL;`.
+
+2. Create the Postgres publication. You should include all tables you want to replicate as part of the publication:
+
+```
+CREATE PUBLICATION airbyte_publication FOR TABLE tbl1, tbl2, tbl3;
+```
+
+The publication name is customizable. Refer to the [Postgres docs](https://www.postgresql.org/docs/10/sql-alterpublication.html) if you need to add or remove tables from your publication in the future.
+
+:::note
+The Airbyte UI currently allows selecting any tables for CDC. If a table is selected that is not part of the publication, it will not be replicated even though it is selected. If a table is part of the publication but does not have a replication identity, that replication identity will be created automatically on the first run if the Airbyte user has the necessary permissions.
+:::
+
+#### Step 6: Enable CDC replication in Airbyte UI
+
+In your Postgres source, change the replication mode to `Logical Replication (CDC)`, and enter the replication slot and publication you just created.
+
+## Postgres Replication Methods
+
+The Postgres source currently offers three methods of replicating updates to your destination: CDC, xmin, and standard (with a user-defined cursor). See [here](https://docs.airbyte.com/integrations/sources/postgres#postgres-replication-methods) for more details.
+
+## Connecting with SSL or SSH Tunnel
+
+See [these instructions](https://docs.airbyte.com/integrations/sources/postgres#connecting-with-ssl-or-ssh-tunneling) to learn more about SSL modes and connecting via SSH tunnel.
+
+## Limitations & Troubleshooting
+
+To see connector limitations or troubleshoot your Postgres connector, see [our Postgres troubleshooting guide](https://docs.airbyte.com/integrations/sources/postgres/postgres-troubleshooting).
diff --git a/docusaurus/sidebars.js b/docusaurus/sidebars.js
index 4c7e60eca012..74907ddf4cd1 100644
--- a/docusaurus/sidebars.js
+++ b/docusaurus/sidebars.js
@@ -37,11 +37,16 @@ const sourcePostgres = {
     id: 'integrations/sources/postgres',
   },
   items: [
-      {
+    {
+      type: "doc",
+      label: "Cloud SQL for Postgres",
+      id: "integrations/sources/postgres/cloud-sql-postgres",
+    },
+    {
       type: "doc",
       label: "Troubleshooting",
       id: "integrations/sources/postgres/postgres-troubleshooting",
-      }
+    }
   ],
 };