
Properly close Storage API batch connections #31710

Merged
merged 3 commits into apache:master on Jun 28, 2024

Conversation

ahmedabu98 (Contributor) commented Jun 28, 2024

Storage API connections in batch mode are left open and not closed properly.
This happens because we pin the underlying StreamAppendClient twice: once for the bundle and once for the cache.
When we are finished with the stream, however, we only unpin it once, for the bundle (and not for the cache).

Batch mode already creates many more streams and connections than streaming mode (one stream/connection per destination per bundle). Leaving connections unclosed leads to many concurrent connections and can quickly exhaust the quota.

This change adds a line to invalidate the cached client after we finish using it in a bundle (see the sketch below).

It also adds a counter to keep track of active connections.
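
To make the lifecycle above concrete, here is a minimal, self-contained sketch of the double-pin bookkeeping and the added invalidation step. It is an illustration only, not the actual Beam implementation: the class FakeAppendClient, the cache key, and the main method are hypothetical stand-ins for the real StreamAppendClient and the sink's client cache.

```java
import java.util.concurrent.atomic.AtomicInteger;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;

/** Hypothetical sketch of the pin/unpin lifecycle; not the actual Beam code. */
public class AppendClientLifecycleSketch {

  /** Stand-in for the real append client: the connection closes only once every pin is released. */
  static class FakeAppendClient {
    private final AtomicInteger pins = new AtomicInteger();

    void pin() {
      pins.incrementAndGet();
    }

    void unpin() {
      if (pins.decrementAndGet() == 0) {
        close();
      }
    }

    void close() {
      System.out.println("connection closed");
    }
  }

  public static void main(String[] args) throws Exception {
    // The cache holds one pin per entry; a removal listener releases it when the entry is evicted.
    RemovalListener<String, FakeAppendClient> onEviction =
        removal -> removal.getValue().unpin();
    Cache<String, FakeAppendClient> cache =
        CacheBuilder.newBuilder().removalListener(onEviction).build();

    FakeAppendClient client =
        cache.get(
            "project.dataset.table", // hypothetical destination key
            () -> {
              FakeAppendClient c = new FakeAppendClient();
              c.pin(); // pin #1: held by the cache
              return c;
            });
    client.pin(); // pin #2: held by the current bundle

    // ... append rows for the bundle ...

    client.unpin(); // bundle finished: release pin #2
    // Before this fix, processing stopped here, so pin #1 (the cache's) was never released and
    // the connection stayed open until it timed out. The added invalidation below fires the
    // removal listener, drops the last pin, and closes the connection.
    cache.invalidate("project.dataset.table");
  }
}
```

The active-connection counter mentioned above is not shown; conceptually it would be incremented when a client opens its connection and decremented when the last pin is released and the connection closes.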

ahmedabu98 (Contributor, Author) commented Jun 28, 2024

I ran two identical pipelines (writing 10B records) before and after this change to measure the difference (in a project with a large quota):

Job before fix (2024-06-27_22_27_46-11409968544737429803):

  • 330 workers

  • Creates 1,950 connections and never closes them (they eventually time out).

  • ~290 GiB append rows throughput

  • Finishes in 22.5 min

  • Dataflow cost estimate: $10.00

Job after fix (2024-06-27_22_53_21-1016146269936299010):

  • 283 workers

  • Up to 550 connections at a time; all connections get cleaned up before the pipeline finishes.

  • ~250 GiB append rows throughput

  • Finishes in 21.5 min

  • Dataflow cost estimate: $6.93

Side-by-side comparison (the jobs were run sequentially): [screenshot]

There is a significant reduction (roughly 70%) in concurrent connections. We can reliably expect the number of concurrent connections per destination to be capped at the number of parallel DoFns (or vCPUs). In other words, concurrent connections <= (num vCPUs) x (num destinations).
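
As a purely illustrative example of that bound (the worker shape and destination count below are made up, not taken from the jobs above): a job running 100 workers with 4 vCPUs each and writing to 2 destinations would be capped at 100 x 4 x 2 = 800 concurrent connections.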


Note that AppendRows throughput stayed roughly the same: [screenshot]

ahmedabu98 marked this pull request as ready for review on June 28, 2024 06:39
ahmedabu98 (Contributor, Author) commented:

R: @reuvenlax
R: @Abacn


Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

ahmedabu98 changed the title from "Properly close Storage API connections" to "Properly close Storage API batch connections" on Jun 28, 2024
Abacn (Contributor) commented Jun 28, 2024

thanks, LGTM, this is a great improvement!

reuvenlax (Contributor) commented Jun 28, 2024 via email

ahmedabu98 merged commit 20aa916 into apache:master on Jun 28, 2024
18 checks passed
liferoad (Collaborator) commented:

@ahmedabu98 can we add this to CHANGES.md, given it's quite an important fix?

ahmedabu98 (Contributor, Author) commented:

> @ahmedabu98 can we add this to CHANGES.md, given it's quite an important fix?

Yup, forgot to add it here. Adding it in #31721.

acrites pushed a commit to acrites/beam that referenced this pull request on Jul 17, 2024
* properly close connections; add active connection counter

* only invalidate stream at teardown for PENDING type

* cleanup
4 participants