config db data catalog #16427
* The `schedule_type` column defines what type of schedule is being used. If the `schedule_type` is manual, then `schedule_data` will be null. Otherwise, the `schedule_data` column is a JSON blob with the schema of [StandardSync#scheduleData](airbyte-config/config-models/src/main/resources/types/StandardSync.yaml#79) that defines the actual schedule. The columns `manual` and `schedule` are deprecated and should be ignored (they will be dropped soon).
* The `namespace_type` column configures whether the namespace for the connection should use that defined by the source, the destination, or a user-defined format (`custom`). If `custom`, the `namespace_format` column defines the string that will be used as the namespace.
* The `status` column describes the activity level of the connection: `active` - current schedule is respected, `inactive` - current schedule is ignored (the connection does not run) but it could be switched back to active, and `deprecated` - the connection is permanently off (cannot be moved to active or inactive).
* `state`
@gosusnp can you sanity check that everything I said about the state table is true?
A couple of extra columns:
- The `type` column describes the type of the state of the row. `type` can be `STREAM`, `GLOBAL` or `LEGACY`.
- The `connection_id` is a foreign key to the connection for which we are tracking state.
added.
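The `status` rule described in the diff above can be sketched as a small transition check. This is only an illustration, not platform code: `ConnectionStatus` and `can_transition` are hypothetical names, and the single rule encoded is the one stated in the description (a `deprecated` connection cannot be moved back to `active` or `inactive`).

```python
from enum import Enum


class ConnectionStatus(Enum):
    """Hypothetical mirror of the values described for the `status` column."""

    ACTIVE = "active"
    INACTIVE = "inactive"
    DEPRECATED = "deprecated"


def can_transition(current: ConnectionStatus, target: ConnectionStatus) -> bool:
    """Return True if moving from `current` to `target` is allowed.

    Only the rule from the description is encoded: `deprecated` is terminal.
    """
    if current is ConnectionStatus.DEPRECATED:
        # A deprecated connection is permanently off; staying put is a no-op.
        return target is ConnectionStatus.DEPRECATED
    # Active and inactive connections can be switched freely or deprecated.
    return True
```

For example, `can_transition(ConnectionStatus.INACTIVE, ConnectionStatus.ACTIVE)` is allowed, while any move out of `ConnectionStatus.DEPRECATED` is rejected.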
* Operations are scoped by workspace, using the `workspace_id` column.
* `connection_operation`
* This table joins the `operation` table to the `connection` for which it is configured.
* `workspace_service_account`
@sherifnada do you know what this table is all about?
@subodh1810 is this the one used for staging by default?
@subodh1810 bump
* `secrets`
* This table is used to store secrets in open-source versions of the platform that have not set some other secrets store. This table allows us to use the same code path for secrets handling regardless of whether an external secrets store is set or not. This table is used by default for the open-source product.
* `airbyte_configs_migrations` is a metadata table used by Flyway (our database migration tool). It is not used for any application use cases.
* `airbyte_configs`
@sherifnada or @davinchia do you remember what this table was for?
I think this is where we kept raw json configs
I don't think it's used anymore. @benmoriceau or @subodh1810 do you remember why this wasn't removed?
I think this is where we kept everything before @subodh1810 did the work to normalise tables.
The latest `created_at`/`modified_at` timestamps in prod are from early Jan, so I think we can drop this now.
Should it be kept to preserve the history of the configs from before the normalization?
okay. so clarification:
- `airbyte_configs` in the config db is empty (so there's nothing to save here anyway).
- `airbyte_configs` in the cloud db still has some old data (not inclined to think it is useful at this point).

Given that, I'm inclined to drop both.
* `stream` - this column is a JSON blob that is a black box to the platform and known only to the connector that generated it.
* `global` - this column is a JSON blob that is a black box to the platform and known only to the connector that generated it. This is true for both the states for each stream and the shared state.
* `legacy` - this column is a JSON blob with a top-level key called `state`. The value under `state` is a black box to the platform and known only to the connector that generated it.
* `stream_reset`
@alovew does this description accurately represent this table?
yes, basically. I might say 'that is enqueued to be reset or is currently being reset'
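To make the `stream`, `global`, and `legacy` state descriptions in the diff above concrete, here is a minimal sketch of how a reader of the `state` table might pull out the connector-owned part of a row. The function name `connector_state` and the idea of passing the `type` value and the JSON text as two arguments are assumptions for illustration; only the `STREAM`/`GLOBAL`/`LEGACY` values and the top-level `state` key for legacy rows come from the discussion.

```python
import json
from typing import Any


def connector_state(state_type: str, blob: str) -> Any:
    """Return the part of a state blob that only the connector understands.

    `state_type` is the value of the `type` column; `blob` is the JSON text
    stored for that row (hypothetical calling convention).
    """
    kind = state_type.upper()
    parsed = json.loads(blob)
    if kind == "LEGACY":
        # Legacy blobs wrap the connector state under a top-level `state` key.
        return parsed["state"]
    if kind in ("STREAM", "GLOBAL"):
        # The whole blob is opaque to the platform; hand it back unchanged.
        return parsed
    raise ValueError(f"unknown state type: {state_type!r}")
```

The platform side never inspects the returned value; it only stores it and passes it back to the connector that produced it.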
* Each record in this table configures a connection (`source_id`, `destination_id`, and relevant configuration).
* The `resource_requirements` field sets a default resource requirement for the connection. This overrides both the platform-wide default that applies to all connector definitions and the default set on the individual connector definition. The column is a JSON blob with the schema defined in [ResourceRequirements.yaml](airbyte-config/config-models/src/main/resources/types/ResourceRequirements.yaml).
* The `source_catalog_id` column is a foreign key to the `source_catalog` table and represents the catalog that was used to configure the connection. This should not be confused with the `catalog` column which contains the [ConfiguredCatalog](airbyte-protocol.md#catalog) for the connection.
* The `schedule_type` column defines what type of schedule is being used. If the `schedule_type` is manual, then `schedule_data` will be null. Otherwise, the `schedule_data` column is a JSON blob with the schema of [StandardSync#scheduleData](airbyte-config/config-models/src/main/resources/types/StandardSync.yaml#79) that defines the actual schedule. The columns `manual` and `schedule` are deprecated and should be ignored (they will be dropped soon).
@mfsiega-airbyte does this description accurately represent how we save schedules in the db?
@mfsiega-airbyte bump
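The relationship between `schedule_type` and `schedule_data` in the description under review can be sketched as follows. This is a hedged illustration rather than the platform's code: `load_schedule` is a hypothetical helper, the row is assumed to arrive as a plain mapping, and the literal spelling `"manual"` for the manual schedule type is an assumption. The contents of `schedule_data` are left uninterpreted because their schema lives in `StandardSync.yaml`.

```python
import json
from typing import Any, Mapping, Optional


def load_schedule(row: Mapping[str, Any]) -> Optional[Any]:
    """Return the parsed `schedule_data` for a connection row, or None if manual."""
    schedule_type = row["schedule_type"]
    raw = row.get("schedule_data")
    # The deprecated `manual` and `schedule` columns are deliberately not read.
    if schedule_type == "manual":  # assumed spelling of the manual type
        if raw is not None:
            raise ValueError("manual connections should have null schedule_data")
        return None
    if raw is None:
        raise ValueError(f"schedule_type {schedule_type!r} requires schedule_data")
    # Assumption: the driver may return the blob as text or already decoded.
    return json.loads(raw) if isinstance(raw, str) else raw
```

Raising on a mismatch is a design choice for the sketch; it surfaces rows that contradict the documented invariant instead of silently guessing a schedule.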
A couple of fixes on the state part.
* The `actor_definition_id` column is a foreign key to the connector definition that this record is implementing.
* The `configuration` column is a JSON blob. The schema of this JSON blob is the one given in the `connectionSpecification` field of the JSON blob in the `spec` column. Keep in mind this schema is specific to each connector (e.g. the schemas of Postgres and Salesforce are different), which is why this column has to be a JSON blob.
* `actor_catalog`
* Each record contains a catalog for an actor. The records in this table are meant to be immutable.
does this mean a connector catalog - as in, the streams/fields that are defined for a particular connector? Are they actually immutable? What if streams are added/removed/modified?
we can always fetch a new catalog for a particular connector. when we do, we will then use the new catalog instead of the old one. this is saying that we will keep around the old catalogs. we never mutate the old ones.
* The `catalog` column is a JSON blob. The schema of this JSON blob matches the [catalog](airbyte-protocol.md#catalog) model in the Airbyte Protocol. Because the protocol object is JSON, this has to be a JSON blob. The `catalog_hash` column is a 32-bit murmur3 hash (x86 variant) of the `catalog` field to make comparisons easier.
* todo (cgardens) - should we remove the `modified_at` column? These records should be immutable.
* `actor_catalog_fetch_event`
* Each record represents an attempt to fetch the catalog for an actor. The records in this table are meant to be immutable.
is a catalog fetch event the same as a source schema refresh?
yeah. it's any time we fetch a catalog. so the original time we fetch the schema for a connector and source schema refreshes.
* `actor_catalog`
* Each record contains a catalog for an actor. The records in this table are meant to be immutable.
* The `catalog` column is a JSON blob. The schema of this JSON blob matches the [catalog](airbyte-protocol.md#catalog) model in the Airbyte Protocol. Because the protocol object is JSON, this has to be a JSON blob. The `catalog_hash` column is a 32-bit murmur3 hash (x86 variant) of the `catalog` field to make comparisons easier.
* todo (cgardens) - should we remove the `modified_at` column? These records should be immutable.
+1, `created_at` should be good enough if each record is not meant to be modified.
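For readers unfamiliar with the `catalog_hash` idea mentioned in the diff above, here is a minimal pure-Python sketch of a 32-bit murmur3 (x86 variant) hash applied to a catalog. Several things here are assumptions and not taken from the PR: the helper names `murmur3_32` and `catalog_hash`, the seed of 0, and the canonical JSON serialization (sorted keys, compact separators). The platform may serialize the catalog differently, so this sketch is not expected to reproduce the values stored in the table.

```python
import json

_MASK = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & _MASK


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & _MASK
    n_blocks = len(data) // 4
    # Body: mix each full 4-byte little-endian block into the hash.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (_rotl32((k * c1) & _MASK, 15) * c2) & _MASK
        h = (_rotl32(h ^ k, 13) * 5 + 0xE6546B64) & _MASK
    # Tail: mix the remaining 1 to 3 bytes, if any.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        h ^= (_rotl32((k * c1) & _MASK, 15) * c2) & _MASK
    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data) & _MASK
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & _MASK
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & _MASK
    h ^= h >> 16
    return h


def catalog_hash(catalog: dict, seed: int = 0) -> int:
    """Hash a catalog via an assumed canonical JSON form (sorted, compact)."""
    canonical = json.dumps(catalog, sort_keys=True, separators=(",", ":"))
    return murmur3_32(canonical.encode("utf-8"), seed)
```

Comparing `catalog_hash(old) == catalog_hash(new)` is a cheap first check for whether a freshly fetched catalog differs from a stored one; because the hash is only 32 bits, equal hashes should still be confirmed by comparing the catalogs themselves.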
Thanks Charles! Very important to document this as we scale.
Left some comments. Generally looks good to me.
We also have some entity diagrams:
- https://lucid.app/lucidchart/c9958c78-feac-4a01-9104-9530e6996ad2/edit?invitationId=inv_96efb302-e60d-44c9-b687-0246fa0b7328&page=0_0#
- https://lucid.app/lucidchart/5284ccc7-f8d4-4b40-9d0b-4cb8ffeec395/edit?invitationId=inv_c504870d-b06b-4126-bdc9-77414c7914fb&page=0_0#
I'm not sure those are updated or public. So I don't think we can include them in the public docs. I do think we want to figure out a set up where the diagrams are auto-generated. Leaving a note here for folks on reviewing this PR.
+1 this would be great. if we could keep the image up to date in source code and then just link to it that would be great.
Merging over branch protections because the comments in the "request changes" have been addressed.
What
Got some feedback that our database schema is poorly documented (especially the places where we are storing JSON blobs in the db). Goal here is to document each table. Specifically make sure that all cases where we store a JSON blob are clearly described and link to the schema of the JSON blob. Also add explanation around what each table and (most) columns mean.
How
To do