Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase Database Source SELECT Batch Size #19514

Merged
merged 17 commits into from
Nov 28, 2022
Merged

Conversation

evantahler
Copy link
Contributor

@evantahler evantahler commented Nov 16, 2022

#12400 added dynamic batch sizes to our SQL sources to prevent OOM problems. However, I believe that we are currently creating batch sizes that are too small for optimal speed. This is validated in 2 ways:

  • We have shown that Airbyte's speed increases with wider rows
  • We are not seeing OOM errors on our database sources any more, and we are not utilizing more than 60% of available memory in our source containers

unnamed

This PR does 2 things:

  1. Bumps the total allocation of the source's available memory for the JDBC buffer from 50% to 60%
  2. Rather than assuming the worst-case UTF8 byte-size for RECORD strings (4-byte/char), we move to 3-byte/char which is likely more common - assuming most chars fit in the ASCII space.

@evantahler evantahler temporarily deployed to more-secrets November 16, 2022 23:58 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 17, 2022 00:29 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 17, 2022 01:26 Inactive
@evantahler
Copy link
Contributor Author

@edgao I'm network-bound at home testing this out... what's the process for getting a connector tested on cloud or a dev server?

@edgao
Copy link
Contributor

edgao commented Nov 17, 2022

I think roughly this? https://internal-docs.airbyte.io/Things-To-Know/Cloud-Development-Environments - you'd need to push your connector image to dockerhub under some special tag + update the cloud connector version mask to point at that tag

but probably worth checking with the cloud folks for more details, I actually haven't done this before either :/

@@ -10,7 +10,7 @@ public final class FetchSizeConstants {
// This size is not enforced. It is only used to calculate a proper
// fetch size. The max row size the connector can handle is actually
// limited by the heap size.
public static final double TARGET_BUFFER_SIZE_RATIO = 0.5;
public static final double TARGET_BUFFER_SIZE_RATIO = 0.6;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious, why not go higher?

if primary memory usage is due to this buffer, I think we can go as high as 80%. Usual java overhead is ~~10% so a 20% buffer seems good enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can keep ramping this up... but in an environment where I can't robustly test this, I want to go step-by-step

@evantahler evantahler temporarily deployed to more-secrets November 21, 2022 19:06 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 21, 2022 19:06 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 21, 2022 22:28 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 21, 2022 22:28 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 21, 2022 23:11 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 21, 2022 23:11 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 21, 2022 23:17 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 21, 2022 23:17 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 21, 2022 23:37 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 21, 2022 23:37 Inactive
@evantahler evantahler marked this pull request as ready for review November 21, 2022 23:49
@evantahler
Copy link
Contributor Author

I've done some local testing and thing seem ok... @edgao / @davinchia please let me know what I should do to be confident that this is safe to merge in

@github-actions
Copy link
Contributor

github-actions bot commented Nov 22, 2022

Affected Connector Report

NOTE ⚠️ Changes in this PR affect the following connectors. Make sure to do the following as needed:

  • Run integration tests
  • Bump connector or module version
  • Add changelog
  • Publish the new version

⚠ Sources (5)

Connector Version Changelog Publish
source-alloydb 1.0.17
source-alloydb-strict-encrypt 1.0.17
(not in seed)
source-mysql 1.0.14
source-mysql-strict-encrypt 1.0.14
(not in seed)
source-postgres-strict-encrypt 1.0.28
(not in seed)
  • See "Actionable Items" below for how to resolve warnings and errors.

✅ Destinations (0)

Connector Version Changelog Publish
  • See "Actionable Items" below for how to resolve warnings and errors.

✅ Other Modules (0)

Actionable Items

(click to expand)

Category Status Actionable Item
Version
mismatch
The version of the connector is different from its normal variant. Please bump the version of the connector.

doc not found
The connector does not seem to have a documentation file. This can be normal (e.g. basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.
Changelog
doc not found
The connector does not seem to have a documentation file. This can be normal (e.g. basic connector like source-jdbc is not published or documented). Please double-check to make sure that it is not a bug.

changelog missing
There is no chnagelog for the current version of the connector. If you are the author of the current version, please add a changelog.
Publish
not in seed
The connector is not in the seed file (e.g. source_definitions.yaml), so its publication status cannot be checked. This can be normal (e.g. some connectors are cloud-specific, and only listed in the cloud seed file). Please double-check to make sure that it is not a bug.

diff seed version
The connector exists in the seed file, but the latest version is not listed there. This usually means that the latest version is not published. Please use the /publish command to publish the latest version.

@evantahler evantahler temporarily deployed to more-secrets November 22, 2022 19:14 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 22, 2022 19:14 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 28, 2022 19:31 Inactive
@evantahler evantahler temporarily deployed to more-secrets November 28, 2022 19:31 Inactive
@evantahler
Copy link
Contributor Author

evantahler commented Nov 28, 2022

/publish connector=connectors/source-mysql

🕑 Publishing the following connectors:
connectors/source-mysql
https://github.com/airbytehq/airbyte/actions/runs/3568069054


Connector Did it publish? Were definitions generated?
connectors/source-mysql

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@evantahler
Copy link
Contributor Author

evantahler commented Nov 28, 2022

/publish connector=connectors/source-mysql-strict-encrypt

🕑 Publishing the following connectors:
connectors/source-mysql-strict-encrypt
https://github.com/airbytehq/airbyte/actions/runs/3568069503


Connector Did it publish? Were definitions generated?
connectors/source-mysql-strict-encrypt

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@evantahler
Copy link
Contributor Author

evantahler commented Nov 28, 2022

/publish connector=connectors/source-postgres

🕑 Publishing the following connectors:
connectors/source-postgres
https://github.com/airbytehq/airbyte/actions/runs/3568194289


Connector Did it publish? Were definitions generated?
connectors/source-postgres

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@evantahler
Copy link
Contributor Author

evantahler commented Nov 28, 2022

/publish connector=connectors/source-postgres-strict-encrypt

🕑 Publishing the following connectors:
connectors/source-postgres-strict-encrypt
https://github.com/airbytehq/airbyte/actions/runs/3568196333


Connector Did it publish? Were definitions generated?
connectors/source-postgres-strict-encrypt

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets November 28, 2022 20:48 Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets November 28, 2022 20:48 Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants