config db data catalog #16427

Merged
merged 5 commits into from
Oct 9, 2022

Conversation

Contributor

@cgardens cgardens commented Sep 8, 2022

What

Got some feedback that our database schema is poorly documented (especially the places where we are storing JSON blobs in the db). Goal here is to document each table. Specifically make sure that all cases where we store a JSON blob are clearly described and link to the schema of the JSON blob. Also add explanation around what each table and (most) columns mean.

How

  • Took a first pass at this for the config db. Want to get this reviewed and then can do jobs db and cloud db next (or someone else can jump on it).

To do

  • get feedback from subject matter experts to confirm accuracy.

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Sep 8, 2022
* The `schedule_type` column defines what type of schedule is being used. If `schedule_type` is manual, then `schedule_data` will be null. Otherwise, the `schedule_data` column is a JSON blob with the schema of [StandardSync#scheduleData](airbyte-config/config-models/src/main/resources/types/StandardSync.yaml#79) that defines the actual schedule. The columns `manual` and `schedule` are deprecated and should be ignored (they will be dropped soon).
* The `namespace_type` column configures whether the namespace for the connection should use that defined by the source, the destination, or a user-defined format (`custom`). If `custom`, the `namespace_format` column defines the string that will be used as the namespace.
* The `status` column describes the activity level of the connection: `active` - current schedule is respected, `inactive` - current schedule is ignored (the connection does not run) but it could be switched back to active, and `deprecated` - the connection is permanently off (cannot be moved to active or inactive).
* `state`
Contributor Author

@gosusnp can you sanity check that everything I said about the state table is true?

Contributor

A couple of extra columns:

  • The type column describes the type of the state of the row. type can be STREAM, GLOBAL or LEGACY
  • The connection_id is a foreign key to the connection for which we are tracking state

Contributor Author

added.
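
The `status` semantics quoted in the diff above can be sketched as a small transition check. This is only an illustration of the described rules, not code from the platform: the `ConnectionStatus` enum and the `can_transition` helper are hypothetical names, and treating a same-status change as "not a transition" is an assumption.

```python
from enum import Enum


class ConnectionStatus(Enum):
    # Values taken from the `status` column description in the diff.
    ACTIVE = "active"
    INACTIVE = "inactive"
    DEPRECATED = "deprecated"


def can_transition(current: ConnectionStatus, target: ConnectionStatus) -> bool:
    """Return True if moving from `current` to `target` fits the description.

    Hypothetical helper: `deprecated` is treated as terminal, and a change to
    the same status is (by assumption) not counted as a transition.
    """
    if current is ConnectionStatus.DEPRECATED:
        # "permanently off (cannot be moved to active or inactive)"
        return False
    return current is not target
```

Under these assumptions, `active` and `inactive` can be swapped freely, while nothing leaves `deprecated`.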

* Operations are scoped by workspace, using the `workspace_id` column.
* `connection_operation`
* This table joins the `operation` table to the `connection` for which it is configured.
* `workspace_service_account`
Contributor Author

@sherifnada do you know what this table is all about?

Contributor

@subodh1810 is this the one used for staging by default?

* `secrets`
* This table is used to store secrets in open-source versions of the platform that have not set some other secrets store. This table allows us to use the same code path for secrets handling regardless of whether an external secrets store is set or not. This table is used by default for the open-source product.
* `airbyte_configs_migrations` is a metadata table used by Flyway (our database migration tool). It is not used for any application use cases.
* `airbyte_configs`
Contributor Author

@sherifnada or @davinchia do you remember what this table was for?

Contributor

I think this is where we kept raw json configs

Contributor

I don't think it's used anymore. @benmoriceau or @subodh1810 do you remember why this wasn't removed?

Contributor

@davinchia davinchia Sep 9, 2022

I think this is where we kept everything before @subodh1810 did the work to normalise tables.

The latest created_at/modified_at timestamps in prod are from early Jan, so I think we can drop this now.

Contributor

Should it be kept to preserve the history of the configs from before the normalization?

Contributor Author

okay. so clarification:

  • airbyte_configs in the config db is empty (so there's nothing to save here anyway).
  • airbyte_configs in the cloud db still has some old data (not inclined to think it is useful at this point).

Given that, I'm inclined to drop both.

* `stream` - this column is a JSON blob that is a black box to the platform and known only to the connector that generated it.
* `global` - this column is a JSON blob that is a black box to the platform and known only to the connector that generated it. This is true for both the states for each stream and the shared state.
* `legacy` - this column is a JSON blob with a top-level key called `state`. The value under that `state` key is a black box to the platform and known only to the connector that generated it.
* `stream_reset`
Contributor Author

@alovew does this description accurately represent this table?

Contributor

yes, basically. I might say 'that is enqueued to be reset or is currently being reset'
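
To make the state descriptions in this thread concrete, here is a minimal sketch of how a reader of the `state` table might pull out the connector-owned payload. It assumes a row is available as a dict with a `type` key (`STREAM`, `GLOBAL` or `LEGACY`, as noted in the review) and a `state` key holding the JSON blob as text; the `state` key name and the `connector_state` helper are assumptions for illustration, not the platform's API.

```python
import json
from typing import Any, Dict


def connector_state(row: Dict[str, Any]) -> Any:
    """Return the opaque, connector-defined part of a state row.

    Hypothetical helper. The platform treats the returned value as a
    black box; only the connector that produced it knows its shape.
    """
    blob = json.loads(row["state"])
    state_type = row["type"]
    if state_type == "LEGACY":
        # Legacy blobs wrap the connector state under a top-level `state` key.
        return blob["state"]
    if state_type in ("STREAM", "GLOBAL"):
        # Stream and global blobs are opaque as a whole.
        return blob
    raise ValueError(f"unknown state type: {state_type!r}")
```

The only structural knowledge the sketch relies on is the top-level `state` key for legacy rows; everything else is passed through untouched.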

* Each record in this table configures a connection (`source_id`, `destination_id`, and relevant configuration).
* The `resource_requirements` field sets a default resource requirement for the connection. This overrides both the global default we set for all connector definitions and the default set on the individual connector definition. The column is a JSON blob with the schema defined in [ResourceRequirements.yaml](airbyte-config/config-models/src/main/resources/types/ResourceRequirements.yaml).
* The `source_catalog_id` column is a foreign key to the `source_catalog` table and represents the catalog that was used to configure the connection. This should not be confused with the `catalog` column, which contains the [ConfiguredCatalog](airbyte-protocol.md#catalog) for the connection.
* The `schedule_type` column defines what type of schedule is being used. If `schedule_type` is manual, then `schedule_data` will be null. Otherwise, the `schedule_data` column is a JSON blob with the schema of [StandardSync#scheduleData](airbyte-config/config-models/src/main/resources/types/StandardSync.yaml#79) that defines the actual schedule. The columns `manual` and `schedule` are deprecated and should be ignored (they will be dropped soon).
Contributor Author

@mfsiega-airbyte does this description accurately represent how we save schedules in the db?
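
As a rough sketch of the schedule rule described in the diff above, the following reads a connection row and returns its schedule, ignoring the deprecated `manual` and `schedule` columns. It assumes the row is a dict, that the manual type is stored as the literal string `manual`, and that `schedule_data` arrives as JSON text or an already-decoded dict; `effective_schedule` is a hypothetical helper, and the contents of the blob are deliberately not interpreted here.

```python
import json
from typing import Any, Dict, Optional


def effective_schedule(row: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the decoded `schedule_data` blob, or None for manual connections.

    Hypothetical helper: the deprecated `manual` and `schedule` columns are
    never consulted, even if they are present in the row.
    """
    if row["schedule_type"] == "manual":
        # Per the description, schedule_data is null for manual schedules.
        return None
    data = row.get("schedule_data")
    if data is None:
        raise ValueError("non-manual schedule_type requires schedule_data")
    # Accept either raw JSON text or an already-decoded object.
    return json.loads(data) if isinstance(data, str) else data
```

Raising on a missing blob for a non-manual type is a design choice of this sketch; it surfaces rows that contradict the documented invariant instead of silently treating them as manual.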

Contributor

@gosusnp gosusnp left a comment

A couple of fixes on the state part.

docs/understanding-airbyte/database-data-catalog.md

* The `actor_definition_id` column is a foreign key to the connector definition that this record is implementing.
* The `configuration` column is a JSON blob. The schema of this JSON blob matches the schema specified in the `spec` column in the `connectionSpecification` field of the JSON blob. Keep in mind this schema is specific to each connector (e.g. the schemas of Postgres and Salesforce are different), which is why this column has to be a JSON blob.
* `actor_catalog`
* Each record contains a catalog for an actor. The records in this table are meant to be immutable.
Contributor

does this mean a connector catalog - as in, the streams/fields that are defined for a particular connector? Are they actually immutable? What if streams are added/removed/modified?

Contributor Author

we can always fetch a new catalog for a particular connector. when we do, we will then use the new catalog instead of the old one. this is saying that we will keep around the old catalogs. we never mutate the old ones.

* The `catalog` column is a JSON blob. The schema of this JSON blob matches the [catalog](airbyte-protocol.md#catalog) model in the Airbyte Protocol. Because the protocol object is JSON, this has to be a JSON blob. The `catalog_hash` column is a 32-bit murmur3 hash (x86 variant) of the `catalog` field to make comparisons easier.
* todo (cgardens) - should we remove the `modified_at` column? These records should be immutable.
* `actor_catalog_fetch_event`
* Each record represents an attempt to fetch the catalog for an actor. The records in this table are meant to be immutable.
Contributor

is a catalog fetch event the same as a source schema refresh?

Contributor Author

yeah. it's any time we fetch a catalog. so the original time we fetch the schema for a connector and source schema refreshes.

* `actor_catalog`
* Each record contains a catalog for an actor. The records in this table are meant to be immutable.
* The `catalog` column is a JSON blob. The schema of this JSON blob matches the [catalog](airbyte-protocol.md#catalog) model in the Airbyte Protocol. Because the protocol object is JSON, this has to be a JSON blob. The `catalog_hash` column is a 32-bit murmur3 hash (x86 variant) of the `catalog` field to make comparisons easier.
* todo (cgardens) - should we remove the `modified_at` column? These records should be immutable.
Contributor

+1 created at should be good enough if each record is not meant to be modified.
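
For readers unfamiliar with the hash mentioned for `catalog_hash`, below is a pure-Python sketch of the 32-bit x86 variant of murmur3, plus a hypothetical `catalog_hash` helper. The hash function follows the public MurmurHash3 algorithm. How the platform serializes the catalog before hashing, and which seed it uses, is not stated in this PR, so the sorted-key JSON encoding and seed 0 here are assumptions for illustration only.

```python
import json

_MASK = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & _MASK


def _mix_k(k: int) -> int:
    """Scramble one (up to) 4-byte block before it is folded into the hash."""
    k = (k * 0xCC9E2D51) & _MASK
    k = _rotl32(k, 15)
    return (k * 0x1B873593) & _MASK


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3, x86 32-bit variant, returned as an unsigned int."""
    h = seed & _MASK
    n = len(data)
    body_len = n - (n % 4)
    # Body: consume the input four bytes at a time, little-endian.
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        h ^= _mix_k(k)
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & _MASK
    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[body_len:]
    if tail:
        h ^= _mix_k(int.from_bytes(tail, "little"))
    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & _MASK
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & _MASK
    h ^= h >> 16
    return h


def catalog_hash(catalog: dict) -> int:
    """Hash a catalog dict (hypothetical helper).

    Assumption: a compact, sorted-key JSON encoding is used so that
    equal catalogs always hash to the same value.
    """
    encoded = json.dumps(catalog, sort_keys=True, separators=(",", ":"))
    return murmur3_x86_32(encoded.encode("utf-8"))
```

Comparing two stored hashes is cheaper than comparing two full catalog blobs, which matches the "make comparisons easier" rationale in the diff; with only 32 bits, a matching hash would still need a full comparison to rule out a collision.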

Contributor

@davinchia davinchia left a comment

Thanks Charles! Very important to document this as we scale.

Left some comments. Generally looks good to me.

We also have some entity diagrams:

  1. https://lucid.app/lucidchart/c9958c78-feac-4a01-9104-9530e6996ad2/edit?invitationId=inv_96efb302-e60d-44c9-b687-0246fa0b7328&page=0_0#
  2. https://lucid.app/lucidchart/5284ccc7-f8d4-4b40-9d0b-4cb8ffeec395/edit?invitationId=inv_c504870d-b06b-4126-bdc9-77414c7914fb&page=0_0#

I'm not sure those are up to date or public, so I don't think we can include them in the public docs. I do think we want to figure out a setup where the diagrams are auto-generated. Leaving a note here for folks reviewing this PR.

@cgardens cgardens mentioned this pull request Sep 9, 2022
1 task
Contributor Author

cgardens commented Sep 9, 2022

> I do think we want to figure out a setup where the diagrams are auto-generated. Leaving a note here for folks reviewing this PR.

+1 this would be great. if we could keep the image up to date in source code and then just link to it that would be great.

Contributor Author

cgardens commented Oct 9, 2022

Merging over branch protections because the comments in the "request changes" have been addressed.

@cgardens cgardens merged commit 2237654 into master Oct 9, 2022
@cgardens cgardens deleted the cgardens/db-data-catalog branch October 9, 2022 04:11
letiescanciano added a commit that referenced this pull request Oct 10, 2022
…vation

* master: (22 commits)
  Update full-refresh-append.md (#17784)
  Update full-refresh-overwrite.md (#17783)
  Update incremental-append.md (#17785)
  Update incremental-deduped-history.md (#17786)
  Update cdc.md (#17787)
  🪟 🔧 Ignore classnames during jest snapshot comparison (#17773)
  feat: replace openjdk with amazoncorretto:17.0.4 on connectors for security compliance (#17511)
  Start testing buildpulse. (#17712)
  Add missing types to the registry (#17763)
  jobs db descriptions (#16543)
  config db data catalog (#16427)
  Update lowcode docs (#17752)
  db migrations to support new webhook operations (#17671)
  Bump Airbyte version from 0.40.13 to 0.40.14 (#17762)
  September Release Notes (#17754)
  Revert "Use java-datadog-tracer-base image (#17625)" (#17759)
  Add connection migrations for schema changes (#17651)
  Connection Form Refactor - Part Two (#16748)
  Improve E2E testing around the Connection Form (#17577)
  Bump strict encrypt version (#17747)
  ...
jhammarstedt pushed a commit to jhammarstedt/airbyte that referenced this pull request Oct 31, 2022
Labels
area/documentation Improvements or additions to documentation

6 participants