Redshift Destination: update spec (#12100)
* Redshift Destination: update spec

* update spec.json

* update links in spec.json

* added more links to spec.json | refactoring

* updated docs with standard connector template

* added hyperlink to documentation for part_size field
VitaliiMaltsev authored Apr 27, 2022
1 parent c856d79 commit b16e13e
Showing 2 changed files with 101 additions and 85 deletions.
@@ -49,22 +49,22 @@
"title": "Default Schema"
},
"s3_bucket_name": {
"title": "S3 Bucket Name",
"title": "S3 Bucket Name (Optional)",
"type": "string",
"description": "The name of the staging S3 bucket to use if utilising a COPY strategy. COPY is recommended for production workloads for better speed and scalability. See <a href=\"https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html\">AWS docs</a> for more details.",
"description": "The name of the staging S3 bucket to use if utilising a COPY strategy. COPY is recommended for production workloads for better speed and scalability. See <a href=\"https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html\">AWS docs</a> for more details.",
"examples": ["airbyte.staging"]
},
"s3_bucket_path": {
"title": "S3 Bucket Path",
"title": "S3 Bucket Path (Optional)",
"type": "string",
"description": "The directory under the S3 bucket where data will be written. If not provided, then defaults to the root directory.",
"description": "The directory under the S3 bucket where data will be written. If not provided, then defaults to the root directory. See <a href=\"https://docs.aws.amazon.com/prescriptive-guidance/latest/defining-bucket-names-data-lakes/faq.html#:~:text=be%20globally%20unique.-,For%20S3%20bucket%20paths,-%2C%20you%20can%20use\">path's name recommendations</a> for more details.",
"examples": ["data_sync/test"]
},
"s3_bucket_region": {
"title": "S3 Bucket Region",
"title": "S3 Bucket Region (Optional)",
"type": "string",
"default": "",
"description": "The region of the S3 staging bucket to use if utilising a copy strategy.",
"description": "The region of the S3 staging bucket to use if utilising a COPY strategy. See <a href=\"https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html#:~:text=In-,Region,-%2C%20choose%20the%20AWS\">AWS docs</a> for details.",
"enum": [
"",
"us-east-1",
@@ -94,28 +94,28 @@
},
"access_key_id": {
"type": "string",
"description": "The Access Key Id granting allow one to access the above S3 staging bucket. Airbyte requires Read and Write permissions to the given bucket.",
"title": "S3 Key Id",
"description": "This ID grants access to the above S3 staging bucket. Airbyte requires Read and Write permissions to the given bucket. See <a href=\"https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys\">AWS docs</a> on how to generate an access key ID and secret access key.",
"title": "S3 Key Id (Optional)",
"airbyte_secret": true
},
"secret_access_key": {
"type": "string",
"description": "The corresponding secret to the above access key id.",
"title": "S3 Access Key",
"description": "The corresponding secret to the above access key id. See <a href=\"https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys\">AWS docs</a> on how to generate an access key ID and secret access key.",
"title": "S3 Access Key (Optional)",
"airbyte_secret": true
},
"part_size": {
"type": "integer",
"minimum": 10,
"maximum": 100,
"examples": ["10"],
"description": "Optional. Increase this if syncing tables larger than 100GB. Only relevant for COPY. Files are streamed to S3 in parts. This determines the size of each part, in MBs. As S3 has a limit of 10,000 parts per file, part size affects the table size. This is 10MB by default, resulting in a default limit of 100GB tables. Note, a larger part size will result in larger memory requirements. A rule of thumb is to multiply the part size by 10 to get the memory requirement. Modify this with care.",
"title": "Stream Part Size"
"description": "Increase this if syncing tables larger than 100GB. Only relevant for COPY. Files are streamed to S3 in parts. This determines the size of each part, in MBs. As S3 has a limit of 10,000 parts per file, part size affects the table size. This is 10MB by default, resulting in a default limit of 100GB tables. Note: a larger part size will result in larger memory requirements. A rule of thumb is to multiply the part size by 10 to get the memory requirement. Modify this with care. See <a href=\"https://docs.airbyte.com/integrations/destinations/redshift/#:~:text=above%20key%20id.-,Part%20Size,-Affects%20the%20size\",> docs</a> for details.",
"title": "Stream Part Size (Optional)"
},
"purge_staging_data": {
"title": "Purge Staging Files and Tables",
"title": "Purge Staging Files and Tables (Optional)",
"type": "boolean",
"description": "Whether to delete the staging files from S3 after completing the sync. See the docs for details. Only relevant for COPY. Defaults to true.",
"description": "Whether to delete the staging files from S3 after completing the sync. See <a href=\"https://docs.airbyte.com/integrations/destinations/redshift/#:~:text=the%20root%20directory.-,Purge%20Staging%20Data,-Whether%20to%20delete\"> docs</a> for details.",
"default": true
}
}
158 changes: 87 additions & 71 deletions docs/integrations/destinations/redshift.md
@@ -1,124 +1,139 @@
# Redshift

## Overview
This page guides you through the process of setting up the Redshift destination connector.

## Prerequisites

The Airbyte Redshift destination allows you to sync data to Redshift.

This Redshift destination connector has two replication strategies:

1. INSERT: Replicates data via SQL INSERT queries. This is built on top of the destination-jdbc code base and is configured to rely on JDBC 4.2 standard drivers provided by Amazon via Mulesoft [here](https://mvnrepository.com/artifact/com.amazon.redshift/redshift-jdbc42) as described in Redshift documentation [here](https://docs.aws.amazon.com/redshift/latest/mgmt/jdbc20-install.html). **Not recommended for production workloads as this does not scale well**.
2. COPY: Replicates data by first uploading data to an S3 bucket and issuing a COPY command. This is the recommended loading approach described by Redshift [best practices](https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html). Requires an S3 bucket and credentials.

Airbyte automatically picks an approach depending on the given configuration - if S3 configuration is present, Airbyte will use the COPY strategy and vice versa.

We recommend users use INSERT for testing, to avoid any additional setup, and switch to COPY for production workloads.
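
A minimal, hedged sketch of the kind of `COPY` statement the COPY strategy relies on once a staging file is in S3; the schema, table, bucket path, and credentials below are illustrative placeholders, not connector defaults.

```sql
-- Illustrative only: load one staged CSV file into a raw table with Redshift COPY.
-- Schema, table, bucket, path, and credentials are hypothetical placeholders.
COPY airbyte_schema._airbyte_raw_users
FROM 's3://airbyte-staging/data_sync/test/users.csv'
CREDENTIALS 'aws_access_key_id=<ACCESS_KEY_ID>;aws_secret_access_key=<SECRET_ACCESS_KEY>'
REGION 'us-east-1'
CSV;
```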

### Sync overview

#### Output schema

Each stream will be output into its own raw table in Redshift. Each table will contain 3 columns:

* `_airbyte_ab_id`: a uuid assigned by Airbyte to each event that is processed. The column type in Redshift is `VARCHAR`.
* `_airbyte_emitted_at`: a timestamp representing when the event was pulled from the data source. The column type in Redshift is `TIMESTAMP WITH TIME ZONE`.
* `_airbyte_data`: a json blob representing with the event data. The column type in Redshift is `VARCHAR` but can be be parsed with JSON functions.

#### Features

| Feature | Supported?\(Yes/No\) | Notes |
| :--- | :--- | :--- |
| Full Refresh Sync | Yes | |
| Incremental - Append Sync | Yes | |
| Incremental - Deduped History | Yes | |
| Namespaces | Yes | |
| SSL Support | Yes | |
For the INSERT strategy:
* **Host**
* **Port**
* **Username**
* **Password**
* **Schema**
* **Database**
* This database needs to exist within the cluster provided.

#### Target Database
2. COPY: Replicates data by first uploading data to an S3 bucket and issuing a COPY command. This is the recommended loading approach described by Redshift [best practices](https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html). Requires an S3 bucket and credentials.

You will need to choose an existing database or create a new database that will be used to store synced data from Airbyte.
Airbyte automatically picks an approach depending on the given configuration - if an S3 configuration is present, Airbyte will use the COPY strategy; otherwise, it will use INSERT.

## Getting started
For the COPY strategy:

### Requirements
* **S3 Bucket Name**
* See [this](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) to create an S3 bucket.
* **S3 Bucket Region**
* Place the S3 bucket and the Redshift cluster in the same region to save on networking costs.
* **Access Key Id**
* See [this](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) on how to generate an access key.
* We recommend creating an Airbyte-specific user. This user will require [read and write permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_s3_rw-bucket.html) to objects in the staging bucket.
* **Secret Access Key**
* Corresponding key to the above key id.
* **Part Size**
* Affects the size limit of an individual Redshift table. Optional. Increase this if syncing tables larger than 100GB. Files are streamed to S3 in parts. This determines the size of each part, in MBs. As S3 has a limit of 10,000 parts per file, part size affects the table size. This is 10MB by default, resulting in a default table limit of 100GB. Note, a larger part size will result in larger memory requirements. A rule of thumb is to multiply the part size by 10 to get the memory requirement. Modify this with care.

1. Active Redshift cluster
2. Allow connections from Airbyte to your Redshift cluster \(if they exist in separate VPCs\)
3. A staging S3 bucket with credentials \(for the COPY strategy\).
Optional parameters:
* **Bucket Path**
* The directory within the S3 bucket to place the staging data. For example, if you set this to `yourFavoriteSubdirectory`, we will place the staging data inside `s3://yourBucket/yourFavoriteSubdirectory`. If not provided, defaults to the root directory.
* **Purge Staging Data**
* Whether to delete the staging files from S3 after completing the sync. Specifically, the connector will create CSV files named `bucketPath/namespace/streamName/syncDate_epochMillis_randomUuid.csv` containing three columns (`ab_id`, `data`, `emitted_at`). Normally these files are deleted after the `COPY` command completes; if you want to keep them for other purposes, set `purge_staging_data` to `false`.

:::info

Even if your Airbyte instance is running on a server in the same VPC as your Redshift cluster, you may need to place them in the **same security group** to allow connections between the two.
## Step 1: Set up Redshift

:::
1. [Log in](https://aws.amazon.com/console/) to the AWS Management Console.
If you don't have an AWS account already, you’ll need to [create](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/) one.
2. Go to the AWS Redshift service.
3. [Create](https://docs.aws.amazon.com/ses/latest/dg/event-publishing-redshift-cluster.html) and activate an AWS Redshift cluster if you don't have one ready.
4. (Optional) [Allow](https://aws.amazon.com/premiumsupport/knowledge-center/cannot-connect-redshift-cluster/) connections from Airbyte to your Redshift cluster \(if they exist in separate VPCs\).
5. (Optional) [Create](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) a staging S3 bucket \(for the COPY strategy\).

### Setup guide
## Step 2: Set up the destination connector in Airbyte

#### 1. Make sure your cluster is active and accessible from the machine running Airbyte
**For Airbyte Cloud:**

This is dependent on your networking setup. The easiest way to verify if Airbyte is able to connect to your Redshift cluster is via the check connection tool in the UI. You can check AWS Redshift documentation with a tutorial on how to properly configure your cluster's access [here](https://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-authorize-cluster-access.html)
1. [Log into your Airbyte Cloud](https://cloud.airbyte.io/workspaces) account.
2. In the left navigation bar, click **Destinations**. In the top-right corner, click **+ new destination**.
3. On the destination setup page, select **Redshift** from the Destination type dropdown and enter a name for this connector.
4. Fill in all the required fields to use the INSERT or COPY strategy.
5. Click `Set up destination`.

#### 2. Fill up connection info
**For Airbyte OSS:**

Next is to provide the necessary information on how to connect to your cluster such as the `host` whcih is part of the connection string or Endpoint accessible [here](https://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-connect-to-cluster.html#rs-gsg-how-to-get-connection-string) without the `port` and `database` name \(it typically includes the cluster-id, region and end with `.redshift.amazonaws.com`\).
1. Go to the local Airbyte page.
2. In the left navigation bar, click **Destinations**. In the top-right corner, click **+ new destination**.
3. On the destination setup page, select **Redshift** from the Destination type dropdown and enter a name for this connector.
4. Fill in all the required fields to use the INSERT or COPY strategy.
5. Click `Set up destination`.

You should have all the requirements needed to configure Redshift as a destination in the UI. You'll need the following information to configure the destination:

* **Host**
* **Port**
* **Username**
* **Password**
* **Schema**
* **Database**
* This database needs to exist within the cluster provided.
## Supported sync modes

#### 2a. Fill up S3 info \(for COPY strategy\)
The Redshift destination connector supports the following [sync modes](https://docs.airbyte.com/cloud/core-concepts/#connection-sync-mode):
- Full Refresh
- Incremental - Append Sync
- Incremental - Deduped History

Provide the required S3 info.
## Performance considerations

* **S3 Bucket Name**
* See [this](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html) to create an S3 bucket.
* **S3 Bucket Region**
* Place the S3 bucket and the Redshift cluster in the same region to save on networking costs.
* **Access Key Id**
* See [this](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) on how to generate an access key.
* We recommend creating an Airbyte-specific user. This user will require [read and write permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_examples_s3_rw-bucket.html) to objects in the staging bucket.
* **Secret Access Key**
* Corresponding key to the above key id.
* **Part Size**
* Affects the size limit of an individual Redshift table. Optional. Increase this if syncing tables larger than 100GB. Files are streamed to S3 in parts. This determines the size of each part, in MBs. As S3 has a limit of 10,000 parts per file, part size affects the table size. This is 10MB by default, resulting in a default table limit of 100GB. Note, a larger part size will result in larger memory requirements. A rule of thumb is to multiply the part size by 10 to get the memory requirement. Modify this with care.
Synchronization performance depends on the amount of data to be transferred.
Cluster scaling issues can be resolved directly using the cluster settings in the AWS Redshift console.

Optional parameters:
* **Bucket Path**
* The directory within the S3 bucket to place the staging data. For example, if you set this to `yourFavoriteSubdirectory`, we will place the staging data inside `s3://yourBucket/yourFavoriteSubdirectory`. If not provided, defaults to the root directory.
* **Purge Staging Data**
* Whether to delete the staging files from S3 after completing the sync. Specifically, the connector will create CSV files named `bucketPath/namespace/streamName/syncDate_epochMillis_randomUuid.csv` containing three columns (`ab_id`, `data`, `emitted_at`). Normally these files are deleted after the `COPY` command completes; if you want to keep them for other purposes, set `purge_staging_data` to `false`.
## Connector-specific features & highlights

## Notes about Redshift Naming Conventions
### Notes about Redshift Naming Conventions

From [Redshift Names & Identifiers](https://docs.aws.amazon.com/redshift/latest/dg/r_names.html):

### Standard Identifiers
#### Standard Identifiers

* Begin with an ASCII single-byte alphabetic character or underscore character, or a UTF-8 multibyte character two to four bytes long.
* Subsequent characters can be ASCII single-byte alphanumeric characters, underscores, or dollar signs, or UTF-8 multibyte characters two to four bytes long.
* Be between 1 and 127 bytes in length, not including quotation marks for delimited identifiers.
* Contain no quotation marks and no spaces.

### Delimited Identifiers
#### Delimited Identifiers

Delimited identifiers \(also known as quoted identifiers\) begin and end with double quotation marks \("\). If you use a delimited identifier, you must use the double quotation marks for every reference to that object. The identifier can contain any standard UTF-8 printable characters other than the double quotation mark itself. Therefore, you can create column or table names that include otherwise illegal characters, such as spaces or the percent symbol. ASCII letters in delimited identifiers are case-insensitive and are folded to lowercase. To use a double quotation mark in a string, you must precede it with another double quotation mark character.

Therefore, the Airbyte Redshift destination will create tables and schemas using unquoted identifiers when possible, and fall back to quoted identifiers if the names contain special characters.
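
As a hedged illustration of this fallback, the statements below contrast the two identifier styles; the schema and stream names are hypothetical.

```sql
-- A stream named "users" maps to a standard, unquoted identifier (folded to lowercase).
CREATE TABLE airbyte_schema.users (id INT);

-- A stream name containing spaces or other special characters requires a delimited identifier.
CREATE TABLE airbyte_schema."user events %" (id INT);
```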

## Data Size Limitations
### Data Size Limitations

Redshift specifies a maximum limit of 65535 bytes to store the raw JSON record data. When a row is too big to fit, the Redshift destination fails to load it and currently ignores that record.
See [docs](https://docs.aws.amazon.com/redshift/latest/dg/r_Character_types.html) for more details.
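
As a rough diagnostic (the table name is a hypothetical example), a query like the following can flag raw records approaching the 65535-byte ceiling; records that already exceeded it were skipped during load and will not appear at all.

```sql
-- Illustrative check: list raw records approaching Redshift's 65535-byte VARCHAR limit.
SELECT _airbyte_ab_id,
       OCTET_LENGTH(_airbyte_data) AS json_size_bytes
FROM airbyte_schema._airbyte_raw_users
WHERE OCTET_LENGTH(_airbyte_data) > 60000
ORDER BY json_size_bytes DESC;
```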

## Encryption
### Encryption

All Redshift connections are encrypted using SSL.

### Output schema

Each stream will be output into its own raw table in Redshift. Each table will contain 3 columns:

* `_airbyte_ab_id`: a uuid assigned by Airbyte to each event that is processed. The column type in Redshift is `VARCHAR`.
* `_airbyte_emitted_at`: a timestamp representing when the event was pulled from the data source. The column type in Redshift is `TIMESTAMP WITH TIME ZONE`.
* `_airbyte_data`: a JSON blob representing the event data. The column type in Redshift is `VARCHAR` but can be parsed with JSON functions.
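
The sketch below shows what such a raw table might look like and how the JSON blob can be queried with Redshift JSON functions; the schema and stream names are illustrative assumptions, not the connector's exact DDL.

```sql
-- Hypothetical raw table matching the three columns described above.
CREATE TABLE airbyte_schema._airbyte_raw_users (
  _airbyte_ab_id      VARCHAR,
  _airbyte_emitted_at TIMESTAMPTZ,
  _airbyte_data       VARCHAR(MAX)
);

-- The JSON blob can be unpacked with Redshift JSON functions, for example:
SELECT JSON_EXTRACT_PATH_TEXT(_airbyte_data, 'email') AS email
FROM airbyte_schema._airbyte_raw_users;
```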

## Data type mapping

| Redshift Type | Airbyte Type | Notes |
| :--- | :--- | :--- |
| `boolean` | `boolean` | |
| `int` | `integer` | |
| `float` | `number` | |
| `varchar` | `string` | |
| `date/varchar` | `date` | |
| `time/varchar` | `time` | |
| `timestamptz/varchar` | `timestamp_with_timezone` | |
| `varchar` | `array` | |
| `varchar` | `object` | |

## Changelog

| Version | Date | Pull Request | Subject |
@@ -142,3 +157,4 @@ All Redshift connections are encrypted using SSL
| 0.3.12 | 2021-07-21 | [3555](https://github.com/airbytehq/airbyte/pull/3555) | Enable partial checkpointing for halfway syncs |
| 0.3.11 | 2021-07-20 | [4874](https://github.com/airbytehq/airbyte/pull/4874) | allow `additionalProperties` in connector spec |

