
[docs] Use correct header heirarchy in airbyte-protocol docs #18917

Merged 3 commits on Nov 7, 2022
64 changes: 32 additions & 32 deletions docs/understanding-airbyte/airbyte-protocol.md
@@ -181,7 +181,7 @@ For the sake of brevity, we will not re-describe `spec` and `check`. They are ex

This concludes the overview of the Actor Interface. The remaining content will dive deeper into each concept covered so far.

- # Actor Specification
+ ## Actor Specification

The specification allows the Actor to share information about itself.

@@ -244,9 +244,9 @@ ConnectorSpecification:
"$ref": "#/definitions/DestinationSyncMode"
```

- # Catalog
+ ## Catalog

- ## Overview
+ ### Overview

An `AirbyteCatalog` is a struct that is produced by the `discover` action of a source. It is a list of `AirbyteStream`s. Each `AirbyteStream` describes the data available to be synced from the source. After a source produces an `AirbyteCatalog` or `AirbyteStream`, it should be treated as read-only. A `ConfiguredAirbyteCatalog` is a list of `ConfiguredAirbyteStream`s. Each `ConfiguredAirbyteStream` describes how to sync an `AirbyteStream`.
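As an illustrative sketch (field names follow the schemas later in this document, but the values here are made up), the relationship between the two structs can be shown with plain dictionaries:

```python
# A catalog as `discover` might emit it: a list of streams, treated as read-only.
catalog = {
    "streams": [
        {
            "name": "users",
            "json_schema": {"type": "object", "properties": {"id": {"type": "integer"}}},
            "supported_sync_modes": ["full_refresh", "incremental"],
        }
    ]
}

# A configured catalog wraps each stream with the sync choices made for it.
configured_catalog = {
    "streams": [
        {
            "stream": catalog["streams"][0],
            "sync_mode": "incremental",
            "destination_sync_mode": "append_dedup",
        }
    ]
}

print(configured_catalog["streams"][0]["sync_mode"])  # incremental
```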

@@ -324,21 +324,21 @@ e.g. In the case where the API has two endpoints `api/customers` and `api/produc

**Note:** Stream and field names can be any UTF8 string. Destinations are responsible for cleaning these names to make them valid table and column names in their respective data stores.
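How a destination cleans names is up to the destination; as a hypothetical sketch, a sanitizer for a SQL-flavored store might look like this (the function and its rules are illustrative, not part of the protocol):

```python
import re

def sanitize_identifier(name: str) -> str:
    """Illustrative only: map an arbitrary UTF8 stream/field name to a
    conservative SQL-style identifier. Real destinations define their own rules."""
    # Replace anything outside [A-Za-z0-9_] with an underscore.
    cleaned = re.sub(r"\W", "_", name, flags=re.ASCII)
    # Identifiers usually cannot start with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned.lower()

print(sanitize_identifier("Café Orders 2022"))  # caf__orders_2022
```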

- ## Namespace
+ ### Namespace

Technical systems often group their underlying data into namespaces, with each namespace's data isolated from the others. This isolation allows for better organisation and flexibility, leading to better usability.

An example of a namespace is the RDBMS's `schema` concept. An API namespace might be used for multiple accounts (e.g. `company_a` vs `company_b`, each having a "users" and "purchases" stream). Some common use cases for schemas are enforcing permissions, segregating test and production data and general data organization.

The `AirbyteStream` represents this concept through an optional field called `namespace`. Additional documentation on Namespaces can be found [here](namespaces.md).

- ## Cursor
+ ### Cursor

- The cursor is how sources track which records are new or updated since the last sync.
- A "cursor field" is the field that is compared across records to make this determination.
- If a configuration requires a cursor field, it is specified as an array of strings that serves as a path to the desired field. e.g. if the structure of a stream is `{ value: 2, metadata: { updated_at: 2020-11-01 } }` the `default_cursor_field` might be `["metadata", "updated_at"]`.
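A cursor field path like the one above can be resolved against a record by walking its keys in order; a minimal sketch:

```python
def get_cursor_value(record: dict, cursor_field: list):
    """Follow a path of keys (e.g. ["metadata", "updated_at"]) into a record."""
    value = record
    for key in cursor_field:
        value = value[key]
    return value

record = {"value": 2, "metadata": {"updated_at": "2020-11-01"}}
print(get_cursor_value(record, ["metadata", "updated_at"]))  # 2020-11-01
```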

- ## AirbyteStream
+ ### AirbyteStream

This section will document the meaning of each field in an `AirbyteStream`.

@@ -347,11 +347,11 @@ This section will document the meaning of each field in an `AirbyteStream`
- `source_defined_cursor` - If a source supports the `INCREMENTAL` sync mode, and it sets this field to true, it is responsible for determining internally how it tracks which records in a source are new or updated since the last sync. When set to `true`, `default_cursor_field` should also be set.
- `default_cursor_field` - If a source supports the `INCREMENTAL` sync mode, it may, optionally, set this field. If this field is set, and the user does not override it with the `cursor_field` attribute in the `ConfiguredAirbyteStream` \(described below\), this field will be used as the cursor. It is an array of keys to a field in the schema.

- ### Data Types
+ #### Data Types

Airbyte maintains a set of types that intersects with those of JSONSchema but also includes its own. More information on supported data types can be found in [Supported Data Types](supported-data-types.md).

- ## ConfiguredAirbyteStream
+ ### ConfiguredAirbyteStream

This section will document the meaning of each field in a `ConfiguredAirbyteStream`.

@@ -405,18 +405,18 @@ DestinationSyncMode:
- If an `AirbyteStream` does not define a `cursor_field` or a `default_cursor_field`, then `ConfiguredAirbyteStream` must define a `cursor_field`.
- `destination_sync_mode` - The sync mode that will be used by the destination to sync that stream. The value in this field MUST be present in the `supported_destination_sync_modes` array in the specification for the Destination.

- ### Source Sync Modes
+ #### Source Sync Modes

- `incremental` - send all the data for the Stream that is new since the last sync (i.e. since the state message passed to the Source). This is the most common sync mode. It only sends new data.
- `full_refresh` - resend all data for the Stream on every sync. Ignores State. Should only be used in cases where data is very small, there is no way to keep a cursor into the data, or it is necessary to capture a snapshot in time of the whole dataset. Be careful using this, because misuse can lead to sending much more data than expected.

- ### Destination Sync Modes
+ #### Destination Sync Modes

- `append` - add new data from the sync to the end of whatever data already exists.
- `append_dedup` - add new data from the sync to the end of whatever data already exists and deduplicate it on the primary key. This is the most **common** sync mode. It does require that a primary key exists in the data. This is also known as SCD Type 1 & 2.
- `overwrite` - replace whatever data exists in the destination data store with the data that arrives in this sync.
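As a rough sketch (not destination code), `append_dedup` can be thought of as appending the new records and then keeping only the latest record per primary key, assuming later records win:

```python
def append_dedup(existing: list, incoming: list, primary_key: str) -> list:
    """Illustrative: append incoming records, keep the last record seen per key."""
    latest = {}
    for record in existing + incoming:
        latest[record[primary_key]] = record  # later records overwrite earlier ones
    return list(latest.values())

existing = [{"id": 1, "name": "Alice"}]
incoming = [{"id": 1, "name": "Alice Updated"}, {"id": 2, "name": "Bob"}]
print(append_dedup(existing, incoming, "id"))
```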

- ## Logic for resolving the Cursor Field
+ ### Logic for resolving the Cursor Field

This section lays out how a cursor field is determined in the case of a Stream that is doing an `incremental` sync.

@@ -425,7 +425,7 @@ This section lays out how a cursor field is determined in the case of a Stream t
- If `default_cursor_field` in `AirbyteStream` is set, then the source uses that field as the cursor. If it is not set, continue...
- Illegal - If `source_defined_cursor`, `cursor_field`, and `default_cursor_field` are all falsy, this is an invalid configuration.
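The resolution order above can be sketched as a function (a simplification; the real configuration lives in the structs described in this document):

```python
def resolve_cursor_field(source_defined_cursor: bool,
                         cursor_field,
                         default_cursor_field):
    """Illustrative resolution order for an incremental stream's cursor."""
    if source_defined_cursor:
        return None  # the source tracks the cursor internally
    if cursor_field:
        return cursor_field  # user override from ConfiguredAirbyteStream
    if default_cursor_field:
        return default_cursor_field  # default suggested by the source
    raise ValueError("invalid configuration: no cursor available")

print(resolve_cursor_field(False, None, ["updated_at"]))  # ['updated_at']
```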

- ## Schema Mismatches
+ ### Schema Mismatches

Over time, it is possible for the catalog to become out of sync with the underlying data store it represents. The Protocol is designed to be resilient to this. It should never fail due to a mismatch.

@@ -438,15 +438,15 @@ Over time, it is possible for the catalog to become out of sync with the underly

In short, if the catalog is ever out of sync with the schema of the underlying data store, it should never block replication for data that is present.

- # State & Checkpointing
+ ## State & Checkpointing

Sources are able to emit state in order to allow checkpointing data replication. The goal is that given wherever a sync stops (whether this is due to all data available at the time being replicated or due to a failure), the next time the Source attempts to extract data it can pick up where it left off and not have to go back to the beginning.

This concept enables incremental syncs--syncs that only replicate data that is new since the previous sync.

State also enables Partial Success. In the case where a sync fails before all data has been extracted and committed, if all records up to a certain state are committed, then the next sync can start from that state as opposed to going back to the beginning. Partial Success is powerful, because especially in the case of high data volumes and long syncs, being able to pick up from wherever the failure occurred avoids costly re-syncing of data that has already been replicated.

- ## State & Source
+ ### State & Source

This section will step through how state is used to allow a Source to pick up where it left off. A Source takes state as an input. A Source should be able to take that input and use it to determine where it left off the last time. The contents of the state are a black box to the Protocol. The Protocol provides an envelope for the Source to put its state in and then passes the state back in that envelope. The Protocol never needs to know anything about the contents of the state. Thus, the Source can track state in whatever way makes the most sense to it.

Expand All @@ -461,7 +461,7 @@ In Sync 2, the last state that was emitted from Sync 1 is passed into the Source

While this example demonstrates a success case, we can see how this process helps in failure cases as well. Let's say that in Sync 1, after emitting the first state message and before emitting the record for Carl, the Source lost connectivity with Postgres due to a network blip and the Source Actor crashed. When Sync 2 runs, it will get the state record with 2022/01/01 instead, so it will replicate Carl and Drew, but it skips Alice and Bob. While in this toy example this procedure only saves replicating a small amount of data, in a production use case, being able to checkpoint regularly can save having to resend huge amounts of data due to transient issues.
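The toy scenario above can be simulated in a few lines (the names and dates follow the example; the cursor comparison and state shape are illustrative, not protocol code):

```python
# Records in the source, keyed by an updated_at-style cursor.
records = [
    ("Alice", "2022/01/01"),
    ("Bob", "2022/01/01"),
    ("Carl", "2022/01/02"),
    ("Drew", "2022/01/02"),
]

def incremental_sync(state):
    """Emit records newer than the cursor in `state`; return the data and new state."""
    emitted = [name for name, cursor in records if state is None or cursor > state]
    new_state = max(cursor for _, cursor in records)
    return emitted, new_state

# Sync 1 crashed after checkpointing "2022/01/01"; Sync 2 resumes from that state.
emitted, state = incremental_sync("2022/01/01")
print(emitted)  # ['Carl', 'Drew'] -- Alice and Bob are skipped
```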

- ## State & the Whole Sync
+ ### State & the Whole Sync

The previous section, for the sake of clarity, looked exclusively at the life cycle of state relative to the Source. In reality, knowing that a record was emitted from the Source is NOT enough of a guarantee to know that we can skip sending the record in future syncs. For example, imagine the Source successfully emits the record, but the Destination fails. If we skip that record in the next sync, it means it never truly made it to its destination. This insight means that a State should only ever be passed to a Source in the next run if it was emitted from both the Source and the Destination.

@@ -475,11 +475,11 @@ The normal success case (T3, not depicted) would be that all the records would m

-- [link](https://whimsical.com/state-TYX5bSCVtVF4BU1JbUwfpZ) to source image

- ## V1
+ ### V1

The state for an actor is emitted as a complete black box. When emitted, it is wrapped in the [AirbyteStateMessage](#airbytestatemessage-v1). The contents of the `data` field are what is passed to the Source on start up. This gives the Source leeway to decide how to track the state of each stream. That being said, a common pattern is a `Map<StreamDescriptor, StreamStateBlob>`. Nothing outside the source can make any inference about the state of the object EXCEPT that, if it is null, it can be concluded that there is no state and the Source will start at the beginning.
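For example, a source following that map-like pattern might emit a state message shaped like this (the contents of `data` are hypothetical and opaque to everything but the source):

```python
import json

# The protocol-level envelope: only the wrapper fields matter to the platform.
state_message = {
    "type": "STATE",
    "state": {
        "data": {  # black box: only this source can interpret it
            "users": {"updated_at": "2022-01-01"},
            "purchases": {"updated_at": "2021-12-31"},
        }
    },
}

print(json.dumps(state_message["state"]["data"]["users"]))
```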

- ## V2 (coming soon!)
+ ### V2 (coming soon!)

In addition to allowing a Source to checkpoint data replication, the state object is structured to allow for the ability to configure and reset streams in isolation from each other. For example, if adding or removing a stream, it is possible to do so without affecting the state of any other stream in the Source.

@@ -502,15 +502,15 @@ This table breaks down attributes of these state types.
- **Stream-Level Replication Isolation** means that a Source could be run in parallel by splitting up its streams across running instances. This is only possible for Stream state types, because they are the only state type that can update its current state completely on a per-stream basis. This is one of the main drawbacks of Sources that use Global state; it is not possible to increase their throughput through parallelization.
- **Single state message describes full state for Source** means that any state message contains the full state information for a Source. Stream does not meet this condition because each state message is scoped by stream. This means that in order to build a full picture of the state for the Source, the state messages for each configured stream must be gathered.

- # Messages
+ ## Messages

- ## Common
+ ### Common

For forwards compatibility, all messages should allow for unknown properties (in JSONSchema parlance, that is `additionalProperties: true`).

Messages are structs emitted by actors.
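In practice this means a message parser should tolerate properties it does not recognize rather than failing; a minimal sketch (the set of known fields here is a subset for illustration):

```python
import json

KNOWN_FIELDS = {"type", "record", "state", "log"}  # illustrative subset

def parse_message(raw: str) -> dict:
    """Parse an AirbyteMessage-style JSON line without rejecting unknown fields."""
    message = json.loads(raw)
    # Forwards compatibility: fields added by newer protocol versions are
    # simply preserved and ignored, never treated as an error.
    return message

msg = parse_message('{"type": "RECORD", "record": {}, "future_field": 1}')
print(msg["type"])  # RECORD
```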

- ### StreamDescriptor
+ #### StreamDescriptor

A stream descriptor contains all information required to identify a Stream:

@@ -533,7 +533,7 @@ StreamDescriptor:
type: string
```

- ## AirbyteMessage
+ ### AirbyteMessage

The output of each method in the actor interface is wrapped in an `AirbyteMessage`. This struct is an envelope for the return value of any message in the described interface.
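A sketch of the envelope pattern, where each payload rides in a field matching its `type` (the helper function is hypothetical, not an official SDK API):

```python
import json

def wrap_record(stream: str, data: dict, emitted_at: int) -> str:
    """Wrap a record payload in an AirbyteMessage envelope (illustrative)."""
    message = {
        "type": "RECORD",
        "record": {"stream": stream, "data": data, "emitted_at": emitted_at},
    }
    return json.dumps(message)

print(wrap_record("users", {"id": 1}, 1667000000000))
```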

@@ -578,7 +578,7 @@ AirbyteMessage:
"$ref": "#/definitions/AirbyteTraceMessage"
```

- ## AirbyteRecordMessage
+ ### AirbyteRecordMessage

The record message contains the actual data that is being replicated.

@@ -612,7 +612,7 @@ AirbyteRecordMessage:
type: integer
```

- ## AirbyteStateMessage (V1)
+ ### AirbyteStateMessage (V1)

The state message enables the Source to emit checkpoints while replicating data. These checkpoints mean that if replication fails before completion, the next sync is able to start from the last checkpoint instead of returning to the beginning of the previous sync. The details of this process are described in [State & Checkpointing](#state--checkpointing).

@@ -631,7 +631,7 @@ AirbyteStateMessage:
existingJavaType: com.fasterxml.jackson.databind.JsonNode
```

- ## AirbyteStateMessage (V2 -- coming soon!)
+ ### AirbyteStateMessage (V2 -- coming soon!)

The state message enables the Source to emit checkpoints while replicating data. These checkpoints mean that if replication fails before completion, the next sync is able to start from the last checkpoint instead of returning to the beginning of the previous sync. The details of this process are described in [State & Checkpointing](#state--checkpointing).

@@ -696,7 +696,7 @@ AirbyteGlobalState:
"$ref": "#/definitions/AirbyteStreamState"
```

- ## AirbyteConnectionStatus Message
+ ### AirbyteConnectionStatus Message

This message reports whether an Actor was able to connect to its underlying data store with all the permissions it needs to succeed. The goal is that, if a successful status is returned, the user should be confident that using that Actor will succeed. The depth of the verification is not specified in the protocol. More robust verification is preferred, but going too deep can create undesired performance tradeoffs.

@@ -717,15 +717,15 @@ AirbyteConnectionStatus:
type: string
```

- ## ConnectorSpecification Message
+ ### ConnectorSpecification Message

This message returns the `ConnectorSpecification` struct, which is described in detail in [Actor Specification](#actor-specification).

- ## AirbyteCatalog Message
+ ### AirbyteCatalog Message

This message returns the `AirbyteCatalog` struct, which is described in detail in [Catalog](#catalog).

- ## AirbyteLogMessage
+ ### AirbyteLogMessage

Logs are helpful for debugging an Actor. In order for a log emitted from an Actor to be properly parsed, it should be emitted as an `AirbyteLogMessage` wrapped in an `AirbyteMessage`.
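For instance, an actor written in Python might build and print a wrapped log line like this (a sketch of the pattern, not an official SDK API):

```python
import json

def format_log(level: str, message: str) -> str:
    """Build an AirbyteLogMessage wrapped in an AirbyteMessage, as one JSON line."""
    envelope = {
        "type": "LOG",
        "log": {"level": level, "message": message},
    }
    return json.dumps(envelope)

# Actors emit messages as JSON lines on stdout.
print(format_log("INFO", "Starting sync"))
```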

@@ -757,7 +757,7 @@ AirbyteLogMessage:
type: string
```

- ## AirbyteTraceMessage
+ ### AirbyteTraceMessage

The trace message allows an Actor to emit metadata about the runtime of the Actor. As currently implemented, it allows an Actor to surface information about errors. This message is designed to grow to handle other use cases, including progress and performance metrics.

@@ -804,7 +804,7 @@ AirbyteErrorTraceMessage:
- config_error
```

- ## AirbyteControlMessage
+ ### AirbyteControlMessage

An `AirbyteControlMessage` is for connectors to signal to the Airbyte Platform or Orchestrator that an action with a side-effect should be taken. This means that the Orchestrator will likely be altering some stored data about the connector, connection, or sync.

@@ -830,7 +830,7 @@ AirbyteControlMessage:
"$ref": "#/definitions/AirbyteControlConnectorConfigMessage"
```

- ### AirbyteControlConnectorConfigMessage
+ #### AirbyteControlConnectorConfigMessage

`AirbyteControlConnectorConfigMessage` allows a connector to update its configuration in the middle of a sync. This is valuable for connectors with short-lived or single-use credentials.

@@ -853,6 +853,6 @@

For example, if the currently persisted config file is `{"api_key": 123, "start_date": "01-01-2022"}` and the following `AirbyteControlConnectorConfigMessage` is output: `{"type": "ORCHESTRATOR", "connectorConfig": {"config": {"api_key": 456}, "emitted_at": <current_time>}}`, then the persisted configuration is merged and will become `{"api_key": 456, "start_date": "01-01-2022"}`.
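The merge described above behaves like a shallow dictionary update; sketched (the `emitted_at` value stands in for the `<current_time>` placeholder):

```python
persisted = {"api_key": 123, "start_date": "01-01-2022"}
control_message = {
    "type": "ORCHESTRATOR",
    "connectorConfig": {"config": {"api_key": 456}, "emitted_at": 1667000000000},
}

# Keys in the incoming config win; untouched keys are preserved.
merged = {**persisted, **control_message["connectorConfig"]["config"]}
print(merged)  # {'api_key': 456, 'start_date': '01-01-2022'}
```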

- # Acknowledgements
+ ## Acknowledgements

We'd like to note that we were initially inspired by Singer.io's [specification](https://github.com/singer-io/getting-started/blob/master/docs/SPEC.md#singer-specification) and would like to acknowledge that some of their design choices helped us bootstrap our project. We've since made a lot of modernizations to our protocol and specification, but don't want to forget the tools that helped us get started.