Tutorial and documentation for config-based connectors #15027

Merged
merged 99 commits into from
Aug 12, 2022
4855a72
5-step tutorial
girarda Jul 25, 2022
138bd52
move
girarda Jul 26, 2022
637c2a7
tiny bit of editing
girarda Jul 26, 2022
9fabad8
Merge branch 'master' into alex/lowcodeTutorial
girarda Jul 28, 2022
ff775e3
Update tutorial
girarda Jul 28, 2022
6ebee74
update docs
girarda Aug 1, 2022
ff2b602
reset
girarda Aug 1, 2022
906f915
move files
girarda Aug 1, 2022
a64c758
record selector, request options, and more links
girarda Aug 1, 2022
2099b24
update
girarda Aug 1, 2022
8bab845
update
girarda Aug 1, 2022
a03e6f3
connector definition
girarda Aug 1, 2022
d444d74
link
girarda Aug 1, 2022
71e0a5b
links
girarda Aug 2, 2022
a9512ab
Merge branch 'master' into alex/lowcodeTutorial
girarda Aug 2, 2022
7b36ca6
update example
girarda Aug 2, 2022
218bfd9
footnote
girarda Aug 2, 2022
78599e3
typo
girarda Aug 2, 2022
c9bfb99
document string interpolation
girarda Aug 2, 2022
58567c5
note on string interpolation
girarda Aug 2, 2022
76a95ae
update
girarda Aug 2, 2022
8feb1a6
Merge branch 'master' into alex/lowcodeTutorial
girarda Aug 2, 2022
ecf9b34
fix code sample
girarda Aug 2, 2022
990d44a
fix
girarda Aug 2, 2022
f9b1b68
update sample
girarda Aug 2, 2022
945cc3e
fix
girarda Aug 2, 2022
a3349df
use the actual config
girarda Aug 2, 2022
318e613
Update as per comments
girarda Aug 7, 2022
c54b0c4
Merge branch 'master' into alex/lowcodeTutorial
girarda Aug 8, 2022
9cc1e4b
write as yaml
girarda Aug 8, 2022
f096296
typo
girarda Aug 8, 2022
8bd35b4
Clarify options overloading
girarda Aug 8, 2022
cfb4528
clarify that docker must be running
girarda Aug 8, 2022
85d5afb
remove extra footnote
girarda Aug 8, 2022
61a75b5
use venv directly
girarda Aug 8, 2022
7e1dc95
Apply suggestions from code review
girarda Aug 8, 2022
3df5071
signup instructions
girarda Aug 8, 2022
b074832
update
girarda Aug 8, 2022
672eb16
clarify that both dot and bracket notations are interchangeable
girarda Aug 8, 2022
6575c9d
Clarify how check works
girarda Aug 8, 2022
e747b4a
create spec and config before updating connector definition
girarda Aug 8, 2022
d5ac31d
clarify what now_local() is
girarda Aug 8, 2022
fdce2c6
rename to yaml structure
girarda Aug 8, 2022
198b421
Go through tutorial and update end of section code samples
girarda Aug 9, 2022
18bc40f
fix link
girarda Aug 9, 2022
f4e5ed4
update
girarda Aug 9, 2022
83e3845
update code samples
girarda Aug 9, 2022
bab017b
Update code samples
girarda Aug 9, 2022
37d1fde
Update to bracket notation
girarda Aug 9, 2022
1fc83fe
remove superfluous comments
girarda Aug 9, 2022
6944916
Update docs/connector-development/config-based/tutorial/2-install-dep…
girarda Aug 9, 2022
c317a49
Update docs/connector-development/config-based/tutorial/3-connecting-…
girarda Aug 9, 2022
096a370
Update docs/connector-development/config-based/tutorial/3-connecting-…
girarda Aug 9, 2022
ff804be
Update docs/connector-development/config-based/tutorial/3-connecting-…
girarda Aug 9, 2022
49be031
Update docs/connector-development/config-based/tutorial/3-connecting-…
girarda Aug 9, 2022
6790422
Update docs/connector-development/config-based/tutorial/3-connecting-…
girarda Aug 9, 2022
34214b4
Update docs/connector-development/config-based/tutorial/4-reading-dat…
girarda Aug 9, 2022
bf9a205
fix path
girarda Aug 9, 2022
74e4de8
update
girarda Aug 9, 2022
ca0f93c
motivation blurp
girarda Aug 9, 2022
46b2ee4
Merge branch 'master' into alex/lowcodeTutorial
girarda Aug 10, 2022
9cfd223
warning
girarda Aug 10, 2022
65a966c
warning
girarda Aug 10, 2022
dd4437c
fix code block
girarda Aug 10, 2022
365c0dc
update code samples
girarda Aug 10, 2022
ebaa701
update code sample
girarda Aug 10, 2022
aacc30a
update code samples
girarda Aug 10, 2022
3b1e85f
small updates
girarda Aug 10, 2022
b4498f3
update yaml structure
girarda Aug 10, 2022
306e9e5
custom class example
girarda Aug 10, 2022
c2d9b86
language annotations
girarda Aug 10, 2022
562844b
update warning
girarda Aug 11, 2022
faada9a
Merge branch 'master' into alex/lowcodeTutorial
girarda Aug 11, 2022
08487f7
Update tutorial to use dpath extractor
girarda Aug 11, 2022
30d25c0
Update record selector docs
girarda Aug 11, 2022
63a295c
unit test
girarda Aug 11, 2022
019cc0a
link to contributing
girarda Aug 12, 2022
117ee2f
tiny update
girarda Aug 12, 2022
3a00dac
$ in front of commands
girarda Aug 12, 2022
b2040fc
$ in front of commands
girarda Aug 12, 2022
db243a8
More readings
girarda Aug 12, 2022
cc0d76c
link to existing config-based connectors
girarda Aug 12, 2022
6cbdaa0
index
girarda Aug 12, 2022
619bf37
update
girarda Aug 12, 2022
9a4f1c9
delete broken link
girarda Aug 12, 2022
5337868
supported features
girarda Aug 12, 2022
e4919d5
update
girarda Aug 12, 2022
e27fade
Add some links
girarda Aug 12, 2022
048bddb
Update docs/connector-development/config-based/overview.md
girarda Aug 12, 2022
019aad2
Update docs/connector-development/config-based/record-selector.md
girarda Aug 12, 2022
cc308f2
Update docs/connector-development/config-based/overview.md
girarda Aug 12, 2022
e3cffc8
Update docs/connector-development/config-based/overview.md
girarda Aug 12, 2022
785a3e4
Update docs/connector-development/config-based/overview.md
girarda Aug 12, 2022
2db7694
mention the unit
girarda Aug 12, 2022
2a7d5fc
headers
girarda Aug 12, 2022
eba0322
remove mentions of interpolating on stream slice, etc.
girarda Aug 12, 2022
e7f0023
Merge branch 'master' into alex/lowcodeTutorial
girarda Aug 12, 2022
8353121
update
girarda Aug 12, 2022
e6637d3
exclude config-based docs
girarda Aug 12, 2022
Original file line number Diff line number Diff line change
@@ -70,7 +70,7 @@ def token(self) -> str:

class BasicHttpAuthenticator(AbstractHeaderAuthenticator):
"""
Builds auth based off the basic authentication scheme as defined by RFC 7617, which transmits credentials as USER ID/password pairs, encoded using bas64
Builds auth based off the basic authentication scheme as defined by RFC 7617, which transmits credentials as USER ID/password pairs, encoded using base64
https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication#basic_authentication_scheme

The header is of the form
@@ -4,6 +4,7 @@

from typing import Mapping, Type

from airbyte_cdk.sources.declarative.auth.oauth import DeclarativeOauth2Authenticator
from airbyte_cdk.sources.declarative.auth.token import ApiKeyAuthenticator, BasicHttpAuthenticator, BearerAuthenticator
from airbyte_cdk.sources.declarative.datetime.min_max_datetime import MinMaxDatetime
from airbyte_cdk.sources.declarative.declarative_stream import DeclarativeStream
@@ -56,6 +57,7 @@
"ListStreamSlicer": ListStreamSlicer,
"MinMaxDatetime": MinMaxDatetime,
"NoPagination": NoPagination,
"OAuthAuthenticator": DeclarativeOauth2Authenticator,
"OffsetIncrement": OffsetIncrement,
"RecordSelector": RecordSelector,
"RemoveFields": RemoveFields,
@@ -15,7 +15,7 @@ class YamlParser(ConnectionDefinitionParser):
"""
Parses a Yaml string to a ConnectionDefinition

In addition to standard Yaml parsing, the input_string can contain refererences to values previously defined.
In addition to standard Yaml parsing, the input_string can contain references to values previously defined.
This parser will dereference these values to produce a complete ConnectionDefinition.

References can be defined using a *ref(<arg>) string.
@@ -24,7 +24,7 @@ class LimitPaginator(Paginator):
* updates the request path with "{{ response._metadata.next }}"
paginator:
type: "LimitPaginator"
limit_value: 10
page_size: 10
limit_option:
option_type: request_parameter
field_name: page_size
@@ -41,7 +41,7 @@ class LimitPaginator(Paginator):
`
paginator:
type: "LimitPaginator"
limit_value: 5
page_size: 5
limit_option:
option_type: header
field_name: page_size
@@ -58,7 +58,7 @@ class LimitPaginator(Paginator):
`
paginator:
type: "LimitPaginator"
limit_value: 5
page_size: 5
limit_option:
option_type: request_parameter
field_name: page_size
70 changes: 70 additions & 0 deletions docs/connector-development/config-based/authentication.md
@@ -0,0 +1,70 @@
# Authentication

The `Authenticator` defines how to configure outgoing HTTP requests to authenticate on the API source.

## Authenticators

### ApiKeyAuthenticator

The `ApiKeyAuthenticator` sets an HTTP header on outgoing requests.
The following definition will set the header "Authorization" with a value "Bearer hello":

```
authenticator:
  type: "ApiKeyAuthenticator"
  header: "Authorization"
  token: "Bearer hello"
```

### BearerAuthenticator

The `BearerAuthenticator` is a specialized `ApiKeyAuthenticator` that always sets the header "Authorization" with the value "Bearer {token}".
The following definition will set the header "Authorization" with a value "Bearer hello":

```
authenticator:
  type: "BearerAuthenticator"
  token: "hello"
```

More information on bearer authentication can be found [here](https://swagger.io/docs/specification/authentication/bearer-authentication/).
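
As a rough illustration of what the two definitions above produce, the following sketch builds the headers by hand. The function names `api_key_header` and `bearer_header` are hypothetical helpers written for this example; they are not part of the CDK.

```python
from typing import Dict


def api_key_header(header: str, token: str) -> Dict[str, str]:
    """Place the token verbatim under the configured header name."""
    return {header: token}


def bearer_header(token: str) -> Dict[str, str]:
    """Always use the "Authorization" header and prefix the token with "Bearer "."""
    return api_key_header("Authorization", f"Bearer {token}")


# Both YAML definitions shown above lead to the same outgoing header.
print(api_key_header("Authorization", "Bearer hello"))
print(bearer_header("hello"))
```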

### BasicHttpAuthenticator

The `BasicHttpAuthenticator` sets the "Authorization" header with a (USER ID/password) pair, encoded using base64 as per [RFC 7617](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication#basic_authentication_scheme).
The following definition will set the header "Authorization" with a value "Basic <encoded credentials>".

The encoding scheme is:

1. Concatenate the username and the password with `":"` in between
2. Encode the resulting string in base64
3. Decode the resulting bytes as a UTF-8 string

```
authenticator:
  type: "BasicHttpAuthenticator"
  username: "hello"
  password: "world"
```

The password is optional. Authenticating with APIs using Basic HTTP and a single API key can be done as:

```
authenticator:
  type: "BasicHttpAuthenticator"
  username: "hello"
```
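
The three-step encoding scheme can be sketched in a few lines of Python. `basic_auth_header` is a hypothetical name used only for illustration, and treating a missing password as an empty string is an assumption of this sketch rather than something the text above confirms.

```python
import base64
from typing import Dict


def basic_auth_header(username: str, password: str = "") -> Dict[str, str]:
    # 1. Concatenate the username and the password with ":" in between.
    credentials = f"{username}:{password}"
    # 2. Encode the resulting string in base64 (b64encode works on bytes).
    encoded = base64.b64encode(credentials.encode("utf-8"))
    # 3. Decode the bytes back into a str so they can be placed in a header.
    return {"Authorization": f"Basic {encoded.decode('utf-8')}"}


print(basic_auth_header("hello", "world"))
print(basic_auth_header("hello"))  # assumption: no password means an empty one
```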

### OAuth

OAuth authentication is supported through the `OAuthAuthenticator`, which requires the following parameters:

- token_refresh_endpoint: The endpoint to refresh the access token
- client_id: The client id
- client_secret: The client secret
- refresh_token: The token used to refresh the access token
- scopes: The scopes to request
- token_expiry_date: The access token expiration date
- access_token_name: THe field to extract access token from in the response
Contributor

Suggested change
- access_token_name: THe field to extract access token from in the response
- access_token_name: The field to extract access token from in the response

- expires_in_name: The field to extract expires_in from in the response
- refresh_request_body: The request body to send in the refresh request
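
To show how these parameters might fit together, here is a minimal sketch of a refresh-token exchange that stops short of sending the request. Everything in it is an assumption made for illustration: the helper names, the `grant_type` and `scopes` payload keys, the default field names, and the rule that extra body fields never override the standard ones. It is not the `OAuthAuthenticator` implementation.

```python
from typing import Any, Dict, Mapping, Optional, Sequence, Tuple


def build_refresh_payload(
    client_id: str,
    client_secret: str,
    refresh_token: str,
    scopes: Optional[Sequence[str]] = None,
    refresh_request_body: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble the body that would be posted to token_refresh_endpoint."""
    payload: Dict[str, Any] = {
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }
    if scopes:
        payload["scopes"] = list(scopes)  # key name is an assumption
    # Assumption: extra fields are added without overriding the ones above.
    for key, value in (refresh_request_body or {}).items():
        payload.setdefault(key, value)
    return payload


def parse_refresh_response(
    response_json: Mapping[str, Any],
    access_token_name: str = "access_token",
    expires_in_name: str = "expires_in",
) -> Tuple[str, int]:
    """Extract the access token and its lifetime using the configured field names."""
    return response_json[access_token_name], int(response_json[expires_in_name])


print(build_refresh_payload("id", "secret", "refresh", scopes=["read"]))
print(parse_refresh_response({"token": "abc", "ttl": "3600"}, "token", "ttl"))
```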
173 changes: 173 additions & 0 deletions docs/connector-development/config-based/error-handling.md
@@ -0,0 +1,173 @@
# Error handling

By default, only server errors (HTTP 5XX) and too many requests (HTTP 429) are retried, up to 5 times with exponential backoff.
Other HTTP errors will result in a failed read.

Other behaviors can be configured through the `Requester`'s `error_handler` field.

## Defining errors

Response filters can be used to define how to handle requests resulting in responses with a specific HTTP status code.
For instance, this example will configure the handler to also retry responses with a 404 error:

```
requester:
  <...>
  error_handler:
    response_filters:
      - http_codes: [404]
        action: RETRY
```

Response filters can be used to specify HTTP errors to ignore instead of retrying.
For instance, this example will configure the handler to ignore responses with a 404 error:

```
requester:
  <...>
  error_handler:
    response_filters:
      - http_codes: [404]
        action: IGNORE
```

Errors can also be defined by parsing the error message.
For instance, this error handler will ignore responses if the error message contains the string "ignorethisresponse":

```
requester:
  <...>
  error_handler:
    response_filters:
      - error_message_contain: "ignorethisresponse"
        action: IGNORE
```

This can also be done through a more generic string interpolation strategy with the following parameters:

- response:

This example ignores errors where the response contains a "code" field:

```
requester:
  <...>
  error_handler:
    response_filters:
      - predicate: "{{ 'code' in response }}"
        action: IGNORE
```

The error handler can have multiple response filters.
The following example is configured to ignore 404 errors, and retry 429 errors:

```
requester:
  <...>
  error_handler:
    response_filters:
      - http_codes: [404]
        action: IGNORE
      - http_codes: [429]
        action: RETRY
```
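
One way to picture how response filters combine with the default behavior is the sketch below. It assumes that filters are evaluated in order and that the first matching filter decides the action; that ordering, the `ResponseFilter` dataclass, the `decide` function, and the `"SUCCESS"` and `"FAIL"` labels are all hypothetical and only mirror the YAML fields shown above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ResponseFilter:
    action: str  # "RETRY" or "IGNORE"
    http_codes: Sequence[int] = ()
    error_message_contain: Optional[str] = None

    def matches(self, status_code: int, error_message: str = "") -> bool:
        if status_code in self.http_codes:
            return True
        return (
            self.error_message_contain is not None
            and self.error_message_contain in error_message
        )


def decide(filters: Sequence[ResponseFilter], status_code: int, error_message: str = "") -> str:
    if status_code < 400:
        return "SUCCESS"
    # Assumption: the first matching filter wins.
    for response_filter in filters:
        if response_filter.matches(status_code, error_message):
            return response_filter.action
    # Default behavior described at the top of this page.
    if status_code == 429 or 500 <= status_code < 600:
        return "RETRY"
    return "FAIL"


filters = [
    ResponseFilter(action="IGNORE", http_codes=[404]),
    ResponseFilter(action="RETRY", http_codes=[429]),
]
for code in (200, 404, 429, 503, 400):
    print(code, decide(filters, code))
```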

## Backoff Strategies

The error handle supports a few backoff strategies, which are described in the following sections.
Contributor

Suggested change
The error handle supports a few backoff strategies, which are described in the following sections.
The error handler supports a few backoff strategies, which are described in the following sections.


### Exponential backoff

This is the default backoff strategy. The requester will back off with an exponentially increasing interval.
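
For intuition, an exponentially increasing interval can be sketched as follows. The doubling base and the `factor` parameter are assumptions chosen for illustration; the text above does not state the exact formula the CDK uses.

```python
def exponential_backoff(attempt: int, factor: float = 1.0) -> float:
    """Return the wait time in seconds before retry number `attempt` (0-based)."""
    return factor * 2 ** attempt


# Wait times for the 5 default retries, under the assumptions above.
print([exponential_backoff(attempt) for attempt in range(5)])
```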

### Constant Backoff

When using the `ConstantBackoffStrategy`, the requester will back off with a constant interval.

### Wait time defined in header

When using the `WaitTimeFromHeaderBackoffStrategy`, the requester will back off by an interval specified in the response header.
In this example, the requester will back off by the response's "wait_time" header value:

```
requester:
  <...>
  error_handler:
    <...>
    backoff_strategies:
      - type: "WaitTimeFromHeaderBackoffStrategy"
        header: "wait_time"
```

Optionally, a regex can be configured to extract the wait time from the header value.

```
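requester:
  <...>
  error_handler:
    <...>
    backoff_strategies:
      - type: "WaitTimeFromHeaderBackoffStrategy"
        header: "wait_time"
        # Single quotes keep the backslash literal; "\d" is not a valid escape in a double-quoted YAML string.
        regex: '[-+]?\d+'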
```
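
The extraction step can be sketched in Python as follows. `wait_time_from_header` is a hypothetical function, and returning `None` when the header is missing or unparsable is an assumption of this sketch.

```python
import re
from typing import Mapping, Optional


def wait_time_from_header(
    headers: Mapping[str, str], header: str, regex: Optional[str] = None
) -> Optional[float]:
    """Return the wait time in seconds found in `header`, or None if unavailable."""
    value = headers.get(header)
    if value is None:
        return None
    if regex is not None:
        match = re.search(regex, value)
        if match is None:
            return None
        value = match.group()
    try:
        return float(value)
    except ValueError:
        return None


print(wait_time_from_header({"wait_time": "12"}, "wait_time"))
print(wait_time_from_header({"wait_time": "retry in 30 seconds"}, "wait_time", r"[-+]?\d+"))
```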

### Wait until time defined in header

When using the `WaitUntilTimeFromHeaderBackoffStrategy`, the requester will back off until the time specified in the response header.
In this example, the requester will wait until the time specified in the "wait_until" header value:

```
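requester:
  <...>
  error_handler:
    <...>
    backoff_strategies:
      - type: "WaitUntilTimeFromHeaderBackoffStrategy"
        header: "wait_until"
        # Single quotes keep the backslash literal; "\d" is not a valid escape in a double-quoted YAML string.
        regex: '[-+]?\d+'
        min_wait: 5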
```

The strategy accepts an optional regex to extract the time from the header value, and a minimum time to wait.
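
A minimal sketch of this strategy is shown below. It assumes the header carries a Unix timestamp in seconds and that `min_wait` acts as a lower bound on the computed wait; both are assumptions, as is the function name. The current time is passed in explicitly so the example is deterministic.

```python
import re
from typing import Mapping, Optional


def wait_until_from_header(
    headers: Mapping[str, str],
    header: str,
    now: float,
    regex: Optional[str] = None,
    min_wait: Optional[float] = None,
) -> Optional[float]:
    """Return how many seconds to wait until the time given in `header`."""
    value = headers.get(header)
    if value is None:
        return None
    if regex is not None:
        match = re.search(regex, value)
        if match is None:
            return None
        value = match.group()
    try:
        wait_until = float(value)  # assumption: Unix timestamp in seconds
    except ValueError:
        return None
    wait = max(wait_until - now, 0.0)
    if min_wait is not None:
        wait = max(wait, min_wait)
    return wait


print(wait_until_from_header({"wait_until": "1000"}, "wait_until", now=990.0))
print(wait_until_from_header({"wait_until": "1000"}, "wait_until", now=998.0, min_wait=5))
```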

## Advanced error handling

The error handler can have multiple backoff strategies, allowing it to fall back if a strategy cannot be evaluated.
For instance, the following defines an error handler that will read the backoff time from a header, and default to a constant backoff if the wait time could not be extracted from the response:

```
requester:
  <...>
  error_handler:
    <...>
    backoff_strategies:
      - type: "WaitTimeFromHeaderBackoffStrategy"
        header: "wait_time"
      - type: "ConstantBackoffStrategy"
        backoff_time_in_seconds: 5
```
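
The fallback behavior can be sketched as trying each strategy in order and keeping the first one that yields a value. The names below are hypothetical, and modeling a strategy as a function from response headers to an optional wait time is a simplification made for this example.

```python
from typing import Callable, Mapping, Optional, Sequence

BackoffStrategy = Callable[[Mapping[str, str]], Optional[float]]


def wait_time_from_header(header: str) -> BackoffStrategy:
    def strategy(headers: Mapping[str, str]) -> Optional[float]:
        value = headers.get(header)
        try:
            return float(value) if value is not None else None
        except ValueError:
            return None

    return strategy


def constant_backoff(seconds: float) -> BackoffStrategy:
    return lambda headers: seconds


def backoff_time(strategies: Sequence[BackoffStrategy], headers: Mapping[str, str]) -> Optional[float]:
    """Return the wait time from the first strategy that can be evaluated."""
    for strategy in strategies:
        wait = strategy(headers)
        if wait is not None:
            return wait
    return None


strategies = [wait_time_from_header("wait_time"), constant_backoff(5)]
print(backoff_time(strategies, {"wait_time": "30"}))
print(backoff_time(strategies, {}))
```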

The `requester` can be configured to use a `CompositeErrorHandler`, which sequentially iterates over a list of error handlers, enabling different retry mechanisms for different types of errors.

In this example, a constant backoff of 5 seconds will be applied if the response contains a "code" field, and an exponential backoff will be applied if the error code is 403:

```
requester:
  <...>
  error_handler:
    type: "CompositeErrorHandler"
    error_handlers:
      - response_filters:
          - predicate: "{{ 'code' in response }}"
            action: RETRY
        backoff_strategies:
          - type: "ConstantBackoffStrategy"
            backoff_time_in_seconds: 5
      - response_filters:
          - http_codes: [ 403 ]
            action: RETRY
        backoff_strategies:
          - type: "ExponentialBackoffStrategy"
```
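
The sequential iteration can be sketched as below. Each handler is modeled as a function returning an `(action, backoff)` pair, or `None` when none of its filters match. The assumptions here are that the first handler with a matching filter decides the outcome, that a response matched by no handler fails, and that the exponential backoff doubles per attempt; all names are hypothetical.

```python
from typing import Callable, Mapping, Optional, Sequence, Tuple

Decision = Tuple[str, Optional[float]]
Handler = Callable[[int, Mapping[str, object], int], Optional[Decision]]


def code_field_handler(status_code: int, body: Mapping[str, object], attempt: int) -> Optional[Decision]:
    # Mirrors the predicate "{{ 'code' in response }}" with a constant 5 second backoff.
    if "code" in body:
        return ("RETRY", 5.0)
    return None


def forbidden_handler(status_code: int, body: Mapping[str, object], attempt: int) -> Optional[Decision]:
    # Mirrors http_codes: [403] with an exponential backoff (doubling is an assumption).
    if status_code == 403:
        return ("RETRY", float(2 ** attempt))
    return None


def composite_handler(
    handlers: Sequence[Handler], status_code: int, body: Mapping[str, object], attempt: int
) -> Decision:
    for handler in handlers:
        decision = handler(status_code, body, attempt)
        if decision is not None:
            return decision
    return ("FAIL", None)


handlers = [code_field_handler, forbidden_handler]
print(composite_handler(handlers, 400, {"code": 1}, attempt=0))
print(composite_handler(handlers, 403, {}, attempt=2))
print(composite_handler(handlers, 400, {}, attempt=0))
```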
90 changes: 90 additions & 0 deletions docs/connector-development/config-based/overview.md
@@ -0,0 +1,90 @@
# Config-based connectors overview
Contributor

I think something which would be great to have somewhere (not sure if this page is the right place, but seems like the most natural) is the motivation for this framework. Something like the "All APIs are the same" section in the PRD


The goal of this document is to give enough technical specifics to understand how config-based connectors work.
When you're ready to start building a connector, you can start with [the tutorial](../../../config-based/tutorial/0-getting-started.md) or dive into the [reference documentation](https://airbyte-cdk.readthedocs.io/en/latest/api/airbyte_cdk.sources.declarative.html).

## Overview

Config-based connectors work by parsing a YAML configuration describing the Source, then running the configured connector using a Python backend.

The process then submits HTTP requests to the API endpoint, and extracts records out of the response.

## Source

Config-based connectors are a declarative way to define HTTP API sources.

A source is defined by 2 components:

1. The source's `Stream`s, which define the data to read
2. A `ConnectionChecker`, which describes how to run the `check` operation to test the connection to the API source

## Stream

Streams define the schema of the data of interest, as well as how to read it from the underlying API source.
A stream generally corresponds to a resource within the API. Streams are analogous to tables in an RDBMS source.

A stream is defined by:

1. Its name
2. A primary key: used to uniquely identify records, enabling deduplication
Contributor

The CDK does technically support a pk, list of pks, and even a list of nested lists of pks. Is it our intention to lock this down to just a single string?

Contributor Author

updated with full type definition

3. A schema: describes the data to sync
4. A data retriever: describes how to retrieve the data from the API
5. A cursor field: used to identify the stream's state from a record
Contributor

We also support a list of cursor field strings. I'm thinking Maybe we just get rid of "A ..." and just start with Cursor field(s), Checkpoint Interval, Schema, etc. Also a nit, but capitalizing the first letter after the : tends to be better grammar. Schema: Describes the data to sync

Contributor Author

fixed

6. A set of transformations to be applied on the records read from the source before emitting them to the destination
7. A checkpoint interval: defines when to checkpoint syncs.

More details on streams and sources can be found in the [basic concepts section](../cdk-python/basic-concepts.md).
More details on cursor fields and checkpointing can be found in the [incremental-stream section](../cdk-python/incremental-stream.md).

## Data retriever

The data retriever defines how to read the data from an API source, and acts as an orchestrator for the data retrieval flow.
There is currently only one implementation, the `SimpleRetriever`, which is defined by:

1. Requester: describes how to submit requests to the API source
2. Paginator[^1]: describes how to navigate through the API's pages
3. Record selector: describes how to select records from an HTTP response
4. Stream Slicer: describes how to partition the stream, enabling incremental syncs and checkpointing

Each of those components (and their subcomponents) is defined by an explicit interface and one or more implementations.
The developer can choose and configure the implementation they need, depending on the specifications of the API they are integrating with.

### Data flow

The retriever acts as a coordinator, moving the data between its components before emitting `AirbyteMessage`s that can be read by the platform.
The `SimpleRetriever`'s data flow can be described as follows:

1. Given the connection config and the current stream state, the `StreamSlicer` computes the stream slices to read.
2. Iterate over all the stream slices defined by the stream slicer.
3. For each stream slice,
   1. Submit a request as defined by the requester
   2. Select the records from the response
   3. Repeat for as long as the paginator points to a next page
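
The data flow described in the steps above can be sketched as a nested loop over stream slices and pages. This is a simplified illustration and not the `SimpleRetriever` code: `read_records`, its callable parameters, and the in-memory `FAKE_API` are all hypothetical stand-ins for the stream slicer, requester, record selector, and paginator.

```python
from typing import Any, Callable, Iterable, Iterator, Mapping, Optional

Record = Mapping[str, Any]


def read_records(
    stream_slices: Iterable[Mapping[str, Any]],
    send_request: Callable[[Mapping[str, Any], Optional[str]], Mapping[str, Any]],
    select_records: Callable[[Mapping[str, Any]], Iterable[Record]],
    next_page_token: Callable[[Mapping[str, Any]], Optional[str]],
) -> Iterator[Record]:
    for stream_slice in stream_slices:
        page_token: Optional[str] = None
        while True:
            # Submit a request as defined by the requester.
            response = send_request(stream_slice, page_token)
            # Select the records from the response.
            yield from select_records(response)
            # Repeat for as long as the paginator points to a next page.
            page_token = next_page_token(response)
            if page_token is None:
                break


# In-memory stand-in for an API: two pages for the first slice, one for the second.
FAKE_API = {
    ("2022-08-01", None): {"data": [{"id": 1}, {"id": 2}], "next": "p2"},
    ("2022-08-01", "p2"): {"data": [{"id": 3}], "next": None},
    ("2022-08-02", None): {"data": [{"id": 4}], "next": None},
}

records = list(
    read_records(
        stream_slices=[{"date": "2022-08-01"}, {"date": "2022-08-02"}],
        send_request=lambda stream_slice, token: FAKE_API[(stream_slice["date"], token)],
        select_records=lambda response: response["data"],
        next_page_token=lambda response: response["next"],
    )
)
print(records)
```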

More details on the paginator can be found in the [pagination section](pagination.md).
More details on the record selector can be found in the [record selector section](record-selector.md).
More details on the stream slicers can be found in the [stream slicers section](stream-slicers.md).

## Requester

The `Requester` defines how to prepare HTTP requests to send to the source API [^2].
There is currently only one implementation, the `HttpRequester`, which is defined by:

1. A base url: the root of the API source
2. A path: the specific endpoint to fetch data from for a resource
3. The HTTP method: the HTTP method to use (GET or POST)
4. A request options provider: defines the request parameters and headers to set on outgoing HTTP requests
5. An authenticator: defines how to authenticate to the source
6. An error handler: defines how to handle errors

More details on authentication can be found in the [authentication section](authentication.md).
More details on error handling can be found in the [error handling section](error-handling.md).

## Connection Checker

The `ConnectionChecker` defines how to test the connection to the integration.

The only implementation as of now is `CheckStream`, which tries to read a record from a specified list of streams and fails if no records could be read.
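
A check of this kind can be sketched as follows. The sketch assumes that every listed stream must yield at least one record for the check to pass; the text above leaves open whether one successful stream would be enough. `check_streams` and `read_stream` are hypothetical names and do not reflect the `CheckStream` interface.

```python
from typing import Any, Callable, Iterable, Mapping, Optional, Sequence, Tuple


def check_streams(
    stream_names: Sequence[str],
    read_stream: Callable[[str], Iterable[Mapping[str, Any]]],
) -> Tuple[bool, Optional[str]]:
    """Try to read one record from each stream and report the first failure."""
    for name in stream_names:
        try:
            records = iter(read_stream(name))
            next(records)
        except StopIteration:
            return False, f"no records could be read from stream {name!r}"
        except Exception as error:
            return False, f"unable to read stream {name!r}: {error}"
    return True, None


# In-memory stand-in for the connector's streams.
FAKE_STREAMS = {"rates": [{"date": "2022-08-01", "rate": 1.02}], "empty": []}

print(check_streams(["rates"], lambda name: FAKE_STREAMS[name]))
print(check_streams(["rates", "empty"], lambda name: FAKE_STREAMS[name]))
```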

[^1]: The paginator is conceptually more related to the requester than the data retriever, but is part of the `SimpleRetriever` because it inherits from `HttpStream` to increase code reusability.
[^2]: As of today, the requester acts as a config object and is not directly responsible for preparing the HTTP requests. This is done in the `SimpleRetriever`'s parent class `HttpStream`.