
Add migrations to support progress bar. #19191

Merged
9 commits merged into master on Nov 11, 2022

Conversation

@davinchia (Contributor) commented Nov 9, 2022

What

Follow up to #18953.

Implement all the DB migrations required for a progress bar.

The main change here is to support saving:

  • the estimated records/bytes at the sync level
  • the estimated records/bytes and emitted records/bytes at the stream level

After this, I'll put up a PR for the persistence layer changes, which will handle writing to and reading from these columns.

Finally, I'll wire this into the API changes, which are currently stubs.

To see a version of how this will be consumed, please look at:

  • here to see how/when stats will be saved.
  • here to see how the persistence layer will interact with these columns.
  • obviously the above code isn't clean, so expect better code!

How

  • add the estimated_records and estimated_bytes columns to the SyncStats table.
  • add the stream_stats column to the SyncStats table. This will contain a JSON blob with per-stream stats.
  • create a stream_stats table (see the sketch after this list)
    • estimated and emitted records/bytes columns
    • contains attempt_id and stream_name columns, with a unique constraint across these two columns.
    • foreign key on attempt_id referencing the attempts table.
    • this table hopefully sets us up for the parallel sync work.
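For a rough picture, here's a minimal sketch of what these migrations could look like in the jOOQ style used in the excerpts below. Table and column names come from this description; the method, constraint, and key names are illustrative assumptions, not the actual migration code.

private static void addEstimatedColumnsToSyncStats(final DSLContext ctx) {
  // Columns are nullable: existing rows, and sources that emit no estimates, won't have values.
  ctx.alterTable("sync_stats")
      .add(
          DSL.field("estimated_records", SQLDataType.BIGINT.nullable(true)),
          DSL.field("estimated_bytes", SQLDataType.BIGINT.nullable(true)),
          DSL.field("stream_stats", SQLDataType.JSONB.nullable(true)))
      .execute();
}

private static void addStreamStatsTable(final DSLContext ctx) {
  ctx.createTableIfNotExists("stream_stats")
      .column("id", SQLDataType.UUID.nullable(false))
      .column("attempt_id", SQLDataType.INTEGER.nullable(false))
      .column("stream_name", SQLDataType.VARCHAR.nullable(false))
      .column("records_emitted", SQLDataType.BIGINT.nullable(true))
      .column("bytes_emitted", SQLDataType.BIGINT.nullable(true))
      .column("estimated_records", SQLDataType.BIGINT.nullable(true))
      .column("estimated_bytes", SQLDataType.BIGINT.nullable(true))
      .constraints(
          DSL.constraint("pk_stream_stats").primaryKey("id"),
          // One stats row per attempt/stream pair.
          DSL.constraint("uq_stream_stats_attempt_stream").unique("attempt_id", "stream_name"),
          DSL.constraint("fk_stream_stats_attempt_id").foreignKey("attempt_id").references("attempts", "id"))
      .execute();
}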

Recommended reading order

The one file.

🚨 User Impact 🚨

Are there any breaking changes? What is the end result perceived by the user? If yes, please merge this PR with the 🚨🚨 emoji so changelog authors can further highlight this if needed.

Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the connector is published, connector added to connector index as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here
Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub and connector version bumped by running the /publish command described here
Connector Generator
  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed

Tests

Unit

Put your unit tests output here.

Integration

Put your integration tests output here.

Acceptance

Put your acceptance tests output here.

@davinchia davinchia marked this pull request as ready for review November 9, 2022 00:48
// Add nullable estimated-stats columns and a per-stream stats JSON blob to sync_stats.
.add(
DSL.field("estimated_records", SQLDataType.BIGINT.nullable(true)),
DSL.field("estimated_bytes", SQLDataType.BIGINT.nullable(true)),
DSL.field("stream_stats", SQLDataType.JSONB.nullable(true)))
Contributor

What will this blob contain? Is it worth adding it in a new table?

Contributor Author

I wrote out some detail here.

Essentially this will contain a blob that is a list of per-stream stats.

There is an argument for splitting the per-stream stats into another table. I don't think it's necessary, since all the current access patterns I know of always aggregate each record over a sync key, so I don't see much value in splitting it out now. I am happy to split this out if people feel strongly.
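For illustration only, a hypothetical shape for that blob, written as a Java text block; the field names are assumptions, not the final schema:

final String streamStatsBlob = """
    [
      {"streamName": "users",
       "estimatedRecords": 1000, "estimatedBytes": 500000,
       "recordsEmitted": 250, "bytesEmitted": 125000},
      {"streamName": "orders",
       "estimatedRecords": 5000, "estimatedBytes": 2000000,
       "recordsEmitted": 0, "bytesEmitted": 0}
    ]
    """; // one entry per stream; consumers aggregate over the sync key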

Contributor

I am seeing it in the context of partial stats, which could be written in parallel. I think those stats will be an issue because one write can erase the blob of another job. WDYT?

// As the database schema changes, the generated jOOQ code may become deprecated, so an
// old migration may not compile if it contains any generated code.
final DSLContext ctx = DSL.using(context.getConnection());
ctx.alterTable("sync_stats")
Contributor

When will the estimation be updated? It feels weird to have some entries in this table created before the job runs and some after the job runs (from what I understand from the hack presentation).

Contributor Author

the current plan is for this to be updated by the workers as the sync runs - see this branch for an example.

so this entry is created as the job runs - not before!

Contributor

Thanks for the link. I still think we will need some updates to how we write the stats currently; it might erase or duplicate the stats for this job: https://github.com/airbytehq/airbyte/blob/master/airbyte-persistence/job-persistence/src/main/java/io/airbyte/persistence/job/DefaultJobPersistence.java#L369

I wonder if we will need to rate limit the updates later on, since we potentially emit a state per record and thus put a lot of load on the DB, but that's probably not a concern for this PR.
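To make the rate-limiting idea concrete, a minimal sketch (the class, names, and the 30-second interval are all illustrative assumptions, not anything in this PR): flush stats only when a minimum interval has elapsed, and keep accumulating in memory otherwise.

import java.time.Duration;
import java.time.Instant;

class ThrottledStatsWriter {
  // Illustrative flush interval; tune against real DB load.
  private static final Duration FLUSH_INTERVAL = Duration.ofSeconds(30);
  private Instant lastFlush = Instant.MIN;

  // Called on every state/record update; only touches the DB when the interval has elapsed.
  void maybePersist(final Runnable persistStats) {
    final Instant now = Instant.now();
    if (Duration.between(lastFlush, now).compareTo(FLUSH_INTERVAL) >= 0) {
      persistStats.run();
      lastFlush = now;
    }
  }
}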

@davinchia (Contributor Author)

@gosusnp @benmoriceau from your comments I realised I could have added more details into the PR description. I have updated it to include more context on how the changes here will be used. Sorry and thanks!

// old migration may not compile if there is any generated code.
try (final DSLContext ctx = DSL.using(context.getConnection())) {
addEstimatedColumnsToSyncStats(ctx);
addStreamStatsTable(ctx);
Contributor Author

heads up @gosusnp, I spoke to @benmoriceau and we decided creating a new table now is lower effort than trying to sidestep the potential migration when we have parallel syncs. Letting you know in case you want to re-review.

// Metadata Columns
final Field<UUID> id = field("id", SQLDataType.UUID.nullable(false));
final Field<Integer> attemptId = field("attempt_id", SQLDataType.INTEGER.nullable(false));
final Field<String> streamName = field("stream_name", SQLDataType.VARCHAR.nullable(false));
Contributor

We should add stream_namespace as well, since stream_name could collide.

Contributor Author

good point!

@evantahler stream_namespace is something we might need to change on the FE too.
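For illustration, in jOOQ terms that fix is one more column on the new table, with the uniqueness guarantee then spanning the namespace too. A sketch (the nullability is an assumption, since some streams have no namespace):

// Nullable: streams without an explicit namespace leave this unset.
final Field<String> streamNamespace = field("stream_namespace", SQLDataType.VARCHAR.nullable(true));
// The unique constraint then needs all three columns:
// unique(attempt_id, stream_namespace, stream_name)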

@davinchia davinchia merged commit 8a0d12f into master Nov 11, 2022
@davinchia davinchia deleted the davinchia/progress-bar-persistence branch November 11, 2022 06:35
davinchia added a commit that referenced this pull request Nov 16, 2022
…nWorker. (#19360)

#19191 made me realise the DefaultReplicationWorker's metric tracking today has a bug where we aren't accounting for namespace, i.e. streams with the same name but different namespaces will have their metrics merged.

While reading the code to figure out a fix, I realised we don't have a good conceptual representation of stream namespace <> name pairs within the platform today. We use a concatenated string. Though this works, it will become harder and harder to read/track as we do more operations that involve namespaces, e.g. progress bars and column selection.

This PR introduces the AirbyteStreamNameNamespacePair object into the platform code to make it more convenient to work with Streams in the future. (Especially if we proceed with the project to make streams a first-class citizen!)

The AirbyteStreamNameNamespacePair object was written to deal with the same issue of namespace <> name pair manipulation within the Java destination code. It implements the Comparable interface, which makes it convenient to use for Collections operations.

For an example of how this is consumed, see #19361.
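As a rough sketch of the idea (the real class in the destination code carries more utility methods; this simplified version just shows the Comparable contract):

import java.util.Comparator;
import java.util.Objects;

public class AirbyteStreamNameNamespacePair implements Comparable<AirbyteStreamNameNamespacePair> {

  private static final Comparator<String> NULLS_FIRST = Comparator.nullsFirst(Comparator.naturalOrder());

  private final String name;
  private final String namespace; // may be null: not all sources set a namespace

  public AirbyteStreamNameNamespacePair(final String name, final String namespace) {
    this.name = name;
    this.namespace = namespace;
  }

  @Override
  public int compareTo(final AirbyteStreamNameNamespacePair other) {
    // Order by namespace first, then name, so sorted Collections group streams by namespace.
    final int byNamespace = NULLS_FIRST.compare(namespace, other.namespace);
    return byNamespace != 0 ? byNamespace : name.compareTo(other.name);
  }

  @Override
  public boolean equals(final Object o) {
    if (!(o instanceof AirbyteStreamNameNamespacePair)) {
      return false;
    }
    final AirbyteStreamNameNamespacePair other = (AirbyteStreamNameNamespacePair) o;
    return Objects.equals(name, other.name) && Objects.equals(namespace, other.namespace);
  }

  @Override
  public int hashCode() {
    return Objects.hash(name, namespace);
  }
}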
akashkulk pushed a commit that referenced this pull request Dec 2, 2022
Follow up to #18953.

Implement all the DB migrations required for a progress bar.

The main change here is to support saving:

- the estimated records/bytes at the sync level
- the estimated records/bytes and emitted records/bytes at the stream level
After this, I'll put up a PR for the persistence layer changes, which will handle writing to and reading from these columns.

Finally, I'll wire this into the API changes, which are currently stubs.

- add the estimated_records and estimated_bytes columns to the SyncStats table.
- create a stream_stats table
  - estimated and emitted records/bytes columns
  - contains attempt_id and stream_name columns, with a unique constraint across these two columns.
  - foreign key on attempt_id referencing the attempts table.
  - this table hopefully sets us up for the parallel sync work.
akashkulk pushed a commit that referenced this pull request Dec 2, 2022
…nWorker. (#19360)

#19191 made me realise the DefaultReplicationWorker's metric tracking today has a bug where we aren't accounting for namespace, i.e. streams with the same name but different namespaces will have their metrics merged.

While reading the code to figure out a fix, I realised we don't have a good conceptual representation of stream namespace <> name pairs within the platform today. We use a concatenated string. Though this works, it will become harder and harder to read/track as we do more operations that involve namespaces, e.g. progress bars and column selection.

This PR introduces the AirbyteStreamNameNamespacePair object into the platform code to make it more convenient to work with Streams in the future. (Especially if we proceed with the project to make streams a first-class citizen!)

The AirbyteStreamNameNamespacePair object was written to deal with the same issue of namespace <> name pair manipulation within the Java destination code. It implements the Comparable interface, which makes it convenient to use for Collections operations.

For an example of how this is consumed, see #19361.
davinchia added a commit that referenced this pull request Dec 28, 2022
Implement the persistence layer changes following #19191.

This PR handles writing and reading stats to the new stream_stats table and the new columns in the existing sync_stats table.

At the same time, we introduce upserts of stats records - i.e. merging updates into a single record - in preparation for real-time stats updates, vs the current approach where a new stat record is always written.

There will be two remaining PRs after this:
- First PR will be to fully wire up and test the API.
- Second PR will be to actually save stats while jobs are running.
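To make the upsert idea concrete, a hedged sketch in the same jOOQ style (column names follow the migration sketch earlier; the values and the conflict target are assumptions, not the actual persistence code):

// Merge per-attempt, per-stream stats into a single row instead of appending a new record
// each time. Illustrative values; conflict target assumes the unique constraint discussed above.
final long attemptId = 123L;
final String streamNamespace = "public";
final String streamName = "users";
final long recordsEmitted = 42L;
final long bytesEmitted = 4096L;

ctx.insertInto(DSL.table("stream_stats"))
    .columns(
        DSL.field("attempt_id"), DSL.field("stream_namespace"), DSL.field("stream_name"),
        DSL.field("records_emitted"), DSL.field("bytes_emitted"))
    .values(attemptId, streamNamespace, streamName, recordsEmitted, bytesEmitted)
    .onConflict(DSL.field("attempt_id"), DSL.field("stream_namespace"), DSL.field("stream_name"))
    .doUpdate()
    .set(DSL.field("records_emitted"), recordsEmitted)
    .set(DSL.field("bytes_emitted"), bytesEmitted)
    .execute();

One caveat worth flagging: Postgres treats NULLs as distinct in unique constraints, so rows with a null stream_namespace would never match this conflict target; the real persistence code would need to handle that case.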