config db data catalog #16427

cgardens · 2022-09-08T05:06:23Z

What

Got some feedback that our database schema is poorly documented (especially the places where we are storing JSON blobs in the db). Goal here is to document each table. Specifically make sure that all cases where we store a JSON blob are clearly described and link to the schema of the JSON blob. Also add explanation around what each table and (most) columns mean.

How

Took a first pass at this for the config db. Want to get this reviewed and then can do jobs db and cloud db next (or someone else can jump on it).

To do

get feedback from subject matter experts to confirm accuracy.

cgardens · 2022-09-08T05:07:53Z

docs/understanding-airbyte/database-data-catalog.md

+  * The `schedule_type` column defines what type of schedule is being used. If the `schedule_type` is manual, then `schedule_data` will be null. Otherwise, the `schedule_data` column is a JSON blob with the schema of [StandardSync#scheduleData](airbyte-config/config-models/src/main/resources/types/StandardSync.yaml#79) that defines the actual schedule. The columns `manual` and `schedule` are deprecated and should be ignored (they will be dropped soon).
+  * The `namespace_type` column configures whether the namespace for the connection should use that defined by the source, the destination, or a user-defined format (`custom`). If `custom`, the `namespace_format` column defines the string that will be used as the namespace.
+  * The `status` column describes the activity level of the connection: `active` - current schedule is respected, `inactive` - current schedule is ignored (the connection does not run) but it could be switched back to active, and `deprecated` - the connection is permanently off (cannot be moved to active or inactive).
+* `state`


@gosusnp can you sanity check that everything I said about the state table is true?

A couple of extra columns:

The `type` column describes the type of the state of the row. `type` can be `STREAM`, `GLOBAL`, or `LEGACY`.

The `connection_id` is a foreign key to the connection for which we are tracking state.
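
To make the row shapes concrete, here is a minimal sketch of how a reader of the `state` table might branch on the `type` column. `StateRow`, its `blob` field, and `opaque_payload` are hypothetical names made up for illustration, not platform code. The only details taken from this PR are the three `type` values, the `connection_id` foreign key, and the top-level `state` key in legacy blobs.

```python
from dataclasses import dataclass
from typing import Any
from uuid import UUID, uuid4

# The three values of the `type` column named in the review comment.
STATE_TYPES = {"STREAM", "GLOBAL", "LEGACY"}


@dataclass(frozen=True)
class StateRow:
    """Hypothetical in-memory view of one row of the `state` table."""
    connection_id: UUID  # foreign key to the connection whose state is tracked
    type: str            # STREAM, GLOBAL or LEGACY
    blob: dict           # assumed name for the column holding the JSON blob


def opaque_payload(row: StateRow) -> Any:
    """Return the part of the blob that only the connector understands."""
    if row.type not in STATE_TYPES:
        raise ValueError(f"unknown state type: {row.type!r}")
    if row.type == "LEGACY":
        # Legacy blobs wrap the connector-owned state in a top-level `state` key.
        return row.blob["state"]
    # STREAM and GLOBAL blobs are treated as a black box in their entirety.
    return row.blob


legacy = StateRow(uuid4(), "LEGACY", {"state": {"cursor": 5}})
stream = StateRow(uuid4(), "STREAM", {"cursor": 7})
```

With these rows, `opaque_payload(legacy)` yields the inner dictionary, while `opaque_payload(stream)` hands back the whole blob untouched.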

cgardens · 2022-09-08T05:08:07Z

docs/understanding-airbyte/database-data-catalog.md

+  * Operations are scoped by workspace, using the `workspace_id` column.
+* `connection_operation`
+  * This table joins the `operation` table to the `connection` for which it is configured. 
+* `workspace_service_account`


@sherifnada do you know what this table is all about?

@subodh1810 is this the one used for staging by default?

@subodh1810 bump

cgardens · 2022-09-08T05:08:30Z

docs/understanding-airbyte/database-data-catalog.md

+* `secrets`
+  * This table is used to store secrets in open-source versions of the platform that have not set some other secrets store. This table allows us to use the same code path for secrets handling regardless of whether an external secrets store is set or not. This table is used by default for the open-source product.
+* `airbyte_configs_migrations` is a metadata table used by Flyway (our database migration tool). It is not used for any application use cases.
+* `airbyte_configs`


@sherifnada or @davinchia do you remember what this table was for?

I think this is where we kept raw json configs

I don't think it's used anymore. @benmoriceau or @subodh1810 do you remember why this wasn't removed?

I think this is where we kept everything before @subodh1810 did the work to normalise tables.

The latest created_at/modified_at timestamps in prod are from early Jan, so I think we can drop this now.

Should it be kept to preserve the history of the configs from before the normalization?

okay. so clarification:

airbyte_configs in the config db is empty (so there's nothing to save here anyway).

airbyte_configs in the cloud db still has some old data (not inclined to think it is useful at this point).

Given that, I'm inclined to drop both.

cgardens · 2022-09-08T05:09:38Z

docs/understanding-airbyte/database-data-catalog.md

+    * `stream` - this column is a JSON blob that is a black box to the platform and known only to the connector that generated it.
+    * `global` - this column is a JSON blob that is a black box to the platform and known only to the connector that generated it. This is true for both the states for each stream and the shared state.
+    * `legacy` - this column is a JSON blob with a top-level key called `state`. The value under `state` is a black box to the platform and known only to the connector that generated it.
+* `stream_reset`


@alovew does this description accurately represent this table?

yes, basically. I might say 'that is enqueued to be reset or is currently being reset'

cgardens · 2022-09-08T05:10:13Z

docs/understanding-airbyte/database-data-catalog.md

+  * Each record in this table configures a connection (`source_id`, `destination_id`, and relevant configuration).
+  * The `resource_requirements` field sets a default resource requirement for the connection. This overrides both the global default that applies to all connector definitions and the default set on the individual connector definition. The column is a JSON blob with the schema defined in [ResourceRequirements.yaml](airbyte-config/config-models/src/main/resources/types/ResourceRequirements.yaml).
+  * The `source_catalog_id` column is a foreign key to the `actor_catalog` table and represents the catalog that was used to configure the connection. This should not be confused with the `catalog` column which contains the [ConfiguredCatalog](airbyte-protocol.md#catalog) for the connection.
+  * The `schedule_type` column defines what type of schedule is being used. If the `schedule_type` is manual, then `schedule_data` will be null. Otherwise, the `schedule_data` column is a JSON blob with the schema of [StandardSync#scheduleData](airbyte-config/config-models/src/main/resources/types/StandardSync.yaml#79) that defines the actual schedule. The columns `manual` and `schedule` are deprecated and should be ignored (they will be dropped soon).


@mfsiega-airbyte does this description accurately represent how we save schedules in the db?

@mfsiega-airbyte bump
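
While this question is open, the rule as written in the hunk above can be stated as a small check. This is only a sketch of the documented invariant, not code from the platform: `schedule_is_consistent` is a hypothetical helper, and the only schedule type value taken from the doc is `manual`; any other string stands in for a non-manual type.

```python
from typing import Optional


def schedule_is_consistent(schedule_type: str, schedule_data: Optional[dict]) -> bool:
    """Check the documented rule linking `schedule_type` and `schedule_data`.

    manual      -> `schedule_data` must be null (None here)
    otherwise   -> `schedule_data` must hold the JSON blob describing the schedule
    """
    if schedule_type == "manual":
        return schedule_data is None
    return schedule_data is not None


# "cron" is just an example of a non-manual value; its payload shape is an assumption.
manual_ok = schedule_is_consistent("manual", None)
manual_bad = schedule_is_consistent("manual", {"any": "payload"})
scheduled_ok = schedule_is_consistent("cron", {"any": "payload"})
scheduled_bad = schedule_is_consistent("cron", None)
```

The deprecated `manual` and `schedule` columns are deliberately left out of the check, since the doc says to ignore them.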

gosusnp

A couple of fixes on the state part.

gosusnp · 2022-09-08T22:00:22Z

docs/understanding-airbyte/database-data-catalog.md

+  * The `schedule_type` column defines what type of schedule is being used. If the `schedule_type` is manual, then `schedule_data` will be null. Otherwise, the `schedule_data` column is a JSON blob with the schema of [StandardSync#scheduleData](airbyte-config/config-models/src/main/resources/types/StandardSync.yaml#79) that defines the actual schedule. The columns `manual` and `schedule` are deprecated and should be ignored (they will be dropped soon).
+  * The `namespace_type` column configures whether the namespace for the connection should use that defined by the source, the destination, or a user-defined format (`custom`). If `custom`, the `namespace_format` column defines the string that will be used as the namespace.
+  * The `status` column describes the activity level of the connection: `active` - current schedule is respected, `inactive` - current schedule is ignored (the connection does not run) but it could be switched back to active, and `deprecated` - the connection is permanently off (cannot be moved to active or inactive).
+* `state`


A couple of extra columns:

The `type` column describes the type of the state of the row. `type` can be `STREAM`, `GLOBAL`, or `LEGACY`.

The `connection_id` is a foreign key to the connection for which we are tracking state.
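
The `status` values quoted in the hunk above read like a small state machine, so a sketch may help reviewers check the wording. The transition table below is one reading of that description (active and inactive can be swapped freely, deprecated is terminal). `ALLOWED_TRANSITIONS`, `can_transition`, and `runs_on_schedule` are hypothetical names, and nothing here claims to mirror how the platform enforces this.

```python
# One reading of the documented `status` semantics for a connection.
ALLOWED_TRANSITIONS = {
    "active": {"inactive", "deprecated"},
    "inactive": {"active", "deprecated"},
    "deprecated": set(),  # permanently off: cannot move back to active or inactive
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the doc's description permits moving current -> target."""
    return target in ALLOWED_TRANSITIONS[current]


def runs_on_schedule(status: str) -> bool:
    """Only an active connection has its current schedule respected."""
    return status == "active"
```

Under this reading, `can_transition("inactive", "active")` is allowed while `can_transition("deprecated", "active")` is not.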

alovew · 2022-09-08T22:14:15Z

docs/understanding-airbyte/database-data-catalog.md

+  * The `actor_definition_id` column is a foreign key to the connector definition that this record is implementing.
+  * The `configuration` column is a JSON blob. The schema of this JSON blob matches the schema given in the `connectionSpecification` field of the JSON blob in the `spec` column. Keep in mind this schema is specific to each connector (e.g. the schemas of Postgres and Salesforce are different), which is why this column has to be a JSON blob.
+* `actor_catalog`
+  * Each record contains a catalog for an actor. The records in this table are meant to be immutable.


does this mean a connector catalog - as in, the streams/fields that are defined for a particular connector? Are they actually immutable? What if streams are added/removed/modified?

we can always fetch a new catalog for a particular connector. when we do, we will then use the new catalog instead of the old one. this is saying that we will keep around the old catalogs. we never mutate the old ones.
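
The "keep old catalogs, never mutate them" behavior described in this reply can be pictured as an append-only history. The sketch below is a hypothetical in-memory model, not the table's actual layout: `CatalogRecord`, `CatalogHistory`, and the `actor_id` key are assumptions introduced only for illustration.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a stored record cannot be modified afterwards
class CatalogRecord:
    actor_id: str       # hypothetical key identifying the connector instance
    catalog_json: str   # serialized catalog as fetched


class CatalogHistory:
    """Append-only store: a new fetch adds a record, old ones stay as they were."""

    def __init__(self) -> None:
        self._records: List[CatalogRecord] = []

    def record_fetch(self, actor_id: str, catalog: dict) -> CatalogRecord:
        record = CatalogRecord(actor_id, json.dumps(catalog, sort_keys=True))
        self._records.append(record)
        return record

    def latest(self, actor_id: str) -> Optional[CatalogRecord]:
        # The most recent fetch is the catalog that gets used from now on.
        for record in reversed(self._records):
            if record.actor_id == actor_id:
                return record
        return None

    def all_for(self, actor_id: str) -> List[CatalogRecord]:
        return [r for r in self._records if r.actor_id == actor_id]


history = CatalogHistory()
old = history.record_fetch("actor-1", {"streams": ["users"]})
new = history.record_fetch("actor-1", {"streams": ["users", "orders"]})
```

After the second fetch, `history.latest("actor-1")` is the new record, and the first record is still present and unchanged in `history.all_for("actor-1")`.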

alovew · 2022-09-08T22:16:13Z

docs/understanding-airbyte/database-data-catalog.md

+  * The `catalog` column is a JSON blob. The schema of this JSON blob matches the [catalog](airbyte-protocol.md#catalog) model in the Airbyte Protocol. Because the protocol object is JSON, this has to be a JSON blob. The `catalog_hash` column is a 32-bit murmur3 hash (x86 variant) of the `catalog` field to make comparisons easier.
+  * todo (cgardens) - should we remove the `modified_at` column? These records should be immutable.
+* `actor_catalog_fetch_event`
+  * Each record represents an attempt to fetch the catalog for an actor. The records in this table are meant to be immutable.


is a catalog fetch event the same as a source schema refresh?

yeah. it's any time we fetch a catalog. so the original time we fetch the schema for a connector and source schema refreshes.

davinchia · 2022-09-09T16:39:56Z

docs/understanding-airbyte/database-data-catalog.md

+* `actor_catalog`
+  * Each record contains a catalog for an actor. The records in this table are meant to be immutable.
+  * The `catalog` column is a JSON blob. The schema of this JSON blob matches the [catalog](airbyte-protocol.md#catalog) model in the Airbyte Protocol. Because the protocol object is JSON, this has to be a JSON blob. The `catalog_hash` column is a 32-bit murmur3 hash (x86 variant) of the `catalog` field to make comparisons easier.
+  * todo (cgardens) - should we remove the `modified_at` column? These records should be immutable.


+1 created at should be good enough if each record is not meant to be modified.
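
For readers unfamiliar with the hash mentioned in the hunk above, here is a self-contained sketch of 32-bit murmur3 (x86 variant) in pure Python, plus a hypothetical `catalog_hash` wrapper. The hash function itself follows the public MurmurHash3 algorithm. How the catalog is serialized before hashing, and the seed of 0, are assumptions made for this example and may not match what the platform does.

```python
import json

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3, 32-bit x86 variant, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    body_len = n - (n % 4)

    # Body: mix the input four bytes at a time (little-endian).
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[body_len:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: force all bits of the hash to avalanche.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def catalog_hash(catalog: dict) -> int:
    """Hypothetical helper: hash a canonical (sorted-key) JSON encoding."""
    canonical = json.dumps(catalog, sort_keys=True, separators=(",", ":"))
    return murmur3_32(canonical.encode("utf-8"))
```

Because the wrapper sorts keys before hashing, two catalogs that differ only in key order produce the same value, which is the kind of cheap equality check the `catalog_hash` column is described as enabling.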

davinchia

Thanks Charles! Very important to document this as we scale.

Left some comments. Generally looks good to me.

We also have some entity diagrams:

I'm not sure those are up to date or public, so I don't think we can include them in the public docs. I do think we want to figure out a setup where the diagrams are auto-generated. Leaving a note here for folks reviewing this PR.

cgardens · 2022-09-09T19:16:51Z

> I do think we want to figure out a set up where the diagrams are auto-generated. Leaving a note here for folks on reviewing this PR.

+1 this would be great. if we could keep the image up to date in source code and then just link to it that would be great.

cgardens · 2022-10-09T04:11:03Z

Merging over branch protections because the comments in the "request changes" have been addressed.

…vation * master: (22 commits) Update full-refresh-append.md (#17784) Update full-refresh-overwrite.md (#17783) Update incremental-append.md (#17785) Update incremental-deduped-history.md (#17786) Update cdc.md (#17787) 🪟 🔧 Ignore classnames during jest snapshot comparison (#17773) feat: replace openjdk with amazoncorretto:17.0.4 on connectors for seсurity compliance (#17511) Start testing buildpulse. (#17712) Add missing types to the registry (#17763) jobs db descriptions (#16543) config db data catalog (#16427) Update lowcode docs (#17752) db migrations to support new webhook operations (#17671) Bump Airbyte version from 0.40.13 to 0.40.14 (#17762) September Release Notes (#17754) Revert "Use java-datadog-tracer-base image (#17625)" (#17759) Add connection migrations for schema changes (#17651) Connection Form Refactor - Part Two (#16748) Improve E2E testing around the Connection Form (#17577) Bump strict encrypt version (#17747) ...

cgardens added 2 commits September 7, 2022 11:28

wip

e055780

first draft of config database

b06ef73

github-actions bot added the area/documentation Improvements or additions to documentation label Sep 8, 2022

add to sidebar

d4c6a55

cgardens requested review from alovew, gosusnp and salima-airbyte September 8, 2022 05:07

cgardens commented Sep 8, 2022

cgardens requested a review from jdpgrailsdev September 8, 2022 19:03

gosusnp requested changes Sep 8, 2022

alovew reviewed Sep 8, 2022

davinchia reviewed Sep 9, 2022

davinchia approved these changes Sep 9, 2022

cgardens mentioned this pull request Sep 9, 2022

db data catalog jobs #16543

Merged

1 task

cgardens added 2 commits October 8, 2022 22:57

Merge branch 'master' into cgardens/db-data-catalog

e0a5bf8

pr feedback

720e0bc

cgardens merged commit 2237654 into master Oct 9, 2022

cgardens deleted the cgardens/db-data-catalog branch October 9, 2022 04:11

This was referenced Oct 13, 2022

Bump Airbyte version from 0.40.14 to 0.40.15 #17961

Closed

Bump Airbyte version from 0.40.14 to 0.40.15 #17970

Merged

jhammarstedt pushed a commit to jhammarstedt/airbyte that referenced this pull request Oct 31, 2022

config db data catalog (airbytehq#16427)

1c88d33
