
Properly close Storage API batch connections #31710

Merged
merged 3 commits into apache:master on Jun 28, 2024

Conversation

ahmedabu98 (Contributor) commented Jun 28, 2024

Storage API connections in batch mode are left open and not closed properly.
This happens because we pin the underlying StreamAppendClient twice: once for the bundle and once for the cache.
When we are finished with the stream, however, we only unpin it once, for the bundle (and not for the cache).

Batch mode already creates many more streams and connections than streaming mode (one stream/connection per destination per bundle). Leaving connections unclosed leads to many concurrent connections and can quickly exhaust the quota.

This change adds a line to invalidate the cached client after we finish using it in a bundle (see the sketch below).

It also adds a counter to keep track of active connections.
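
To make the lifecycle above concrete, here is a minimal, self-contained sketch of the double-pin bookkeeping and the added invalidation step. It is an illustration only, not the actual Beam implementation: the class FakeAppendClient, the cache key, and the main method are hypothetical stand-ins for the real StreamAppendClient and the sink's client cache.

```java
import java.util.concurrent.atomic.AtomicInteger;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;

/** Hypothetical sketch of the pin/unpin lifecycle; not the actual Beam code. */
public class AppendClientLifecycleSketch {

  /** Stand-in for the real append client: the connection closes only once every pin is released. */
  static class FakeAppendClient {
    private final AtomicInteger pins = new AtomicInteger();

    void pin() {
      pins.incrementAndGet();
    }

    void unpin() {
      if (pins.decrementAndGet() == 0) {
        close();
      }
    }

    void close() {
      System.out.println("connection closed");
    }
  }

  public static void main(String[] args) throws Exception {
    // The cache holds one pin per entry; a removal listener releases it when the entry is evicted.
    RemovalListener<String, FakeAppendClient> onEviction =
        removal -> removal.getValue().unpin();
    Cache<String, FakeAppendClient> cache =
        CacheBuilder.newBuilder().removalListener(onEviction).build();

    FakeAppendClient client =
        cache.get(
            "project.dataset.table", // hypothetical destination key
            () -> {
              FakeAppendClient c = new FakeAppendClient();
              c.pin(); // pin #1: held by the cache
              return c;
            });
    client.pin(); // pin #2: held by the current bundle

    // ... append rows for the bundle ...

    client.unpin(); // bundle finished: release pin #2
    // Before this fix, processing stopped here, so pin #1 (the cache's) was never released and
    // the connection stayed open until it timed out. The added invalidation below fires the
    // removal listener, drops the last pin, and closes the connection.
    cache.invalidate("project.dataset.table");
  }
}
```

The active-connection counter mentioned above is not shown; conceptually it would be incremented when a client opens its connection and decremented when the last pin is released and the connection closes.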

ahmedabu98 (Contributor, Author) commented Jun 28, 2024

I ran two identical pipelines (writing 10B records) before and after this change to measure the difference (in a project with a large quota):

Job before fix (2024-06-27_22_27_46-11409968544737429803):

  • 330 workers

  • Creates 1,950 connections and never closes them (they eventually time out).

  • ~290 GiB append rows throughput

  • Finishes in 22.5 min

  • Dataflow cost estimate: $10.00

Job after fix (2024-06-27_22_53_21-1016146269936299010):

  • 283 workers

  • Up to 550 connections at a time; all connections get cleaned up before the pipeline finishes.

  • ~250 GiB append rows throughput

  • Finishes in 21.5 min

  • Dataflow cost estimate: $6.93

Side-by-side comparison (the jobs were run sequentially): [screenshot]

There is a significant reduction (roughly 70%) in concurrent connections. We can reliably expect the number of concurrent connections per destination to be capped at the number of parallel DoFns (or vCPUs). In other words, concurrent connections <= (num vCPUs) x (num destinations).
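
As a purely illustrative example of that bound (the worker shape and destination count below are made up, not taken from the jobs above): a job running 100 workers with 4 vCPUs each and writing to 2 destinations would be capped at 100 x 4 x 2 = 800 concurrent connections.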


Note that AppendRows throughput stayed roughly the same: [screenshot]

ahmedabu98 marked this pull request as ready for review on June 28, 2024 06:39
ahmedabu98 (Contributor, Author) commented:

R: @reuvenlax
R: @Abacn


Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

ahmedabu98 changed the title from "Properly close Storage API connections" to "Properly close Storage API batch connections" on Jun 28, 2024
Abacn (Contributor) commented Jun 28, 2024

thanks, LGTM, this is a great improvement!

reuvenlax (Contributor) commented Jun 28, 2024 via email

ahmedabu98 merged commit 20aa916 into apache:master on Jun 28, 2024
18 checks passed
liferoad (Collaborator) commented:

@ahmedabu98 can we add this to CHANGES.md, given it's quite an important fix?

ahmedabu98 (Contributor, Author) commented:

> @ahmedabu98 can we add this to CHANGES.md, given it's quite an important fix?

Yup, forgot to add it here. Adding it in #31721.

acrites pushed a commit to acrites/beam that referenced this pull request on Jul 17, 2024
* properly close connections; add active connection counter

* only invalidate stream at teardown for PENDING type

* cleanup
4 participants