cluster: expose cloud storage usage across the entire cluster #9305
Conversation
This commit adds a new `cloud_log_size` method to `cluster::partition`. It returns the sum of the sizes of all *tracked* log segments that have been uploaded to cloud storage. A segment is tracked if it is present in the manifest and has not been slated for removal.
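The bookkeeping described above can be sketched as follows. The types and names here are hypothetical stand-ins, not the actual `cluster::partition` or archival manifest types:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical segment metadata; the real manifest types differ.
struct segment_meta {
    uint64_t size_bytes;
    bool slated_for_removal; // pending deletion from the bucket
};

// A segment is "tracked" if it is present in the manifest and has not
// been slated for removal; cloud_log_size sums only tracked segments.
inline uint64_t cloud_log_size(const std::vector<segment_meta>& manifest) {
    uint64_t total = 0;
    for (const auto& s : manifest) {
        if (!s.slated_for_removal) {
            total += s.size_bytes;
        }
    }
    return total;
}
```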
Nice job +1
LGTM
A new controller RPC is introduced in this commit: `cloud_storage_usage`. The request takes a list of partitions broken down by shard, and the response contains the total number of bytes used by the cloud storage logs of the specified partitions, along with any partitions that could not be found. This RPC is a building block for a cluster-wide map-reduce operation that computes the total size of all cloud logs.
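The request/response shapes might look roughly like the sketch below. All field and type names here are assumptions for illustration, not the actual controller RPC definitions:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Assumed request shape: partitions to query, grouped by shard.
struct cloud_storage_usage_request {
    std::map<uint32_t, std::vector<int64_t>> partitions_by_shard;
};

// Assumed response shape: total bytes found, plus the partitions
// this node could not locate.
struct cloud_storage_usage_response {
    uint64_t total_size_bytes{0};
    std::vector<int64_t> missing_partitions;
};

// Toy handler: looks up each requested partition in a local
// partition-id -> cloud log size table.
inline cloud_storage_usage_response handle_usage_request(
  const cloud_storage_usage_request& req,
  const std::map<int64_t, uint64_t>& local_sizes) {
    cloud_storage_usage_response resp;
    for (const auto& [shard, partitions] : req.partitions_by_shard) {
        for (int64_t p : partitions) {
            auto it = local_sizes.find(p);
            if (it == local_sizes.end()) {
                resp.missing_partitions.push_back(p);
            } else {
                resp.total_size_bytes += it->second;
            }
        }
    }
    return resp;
}
```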
This commit introduces a utility class that walks the topic table and generates batches of partitions and their current replicas. This operation only makes sense if the topic table is stable throughout, so an exception is thrown if the topic table mutates between batches.
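The stability check can be modeled with a revision counter that bumps on every topic table mutation. This is a simplified sketch with assumed names, not the actual utility class:

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy stand-in for iterating the topic table: a flat list of partition
// ids plus a revision counter that bumps on every mutation.
class partition_batcher {
public:
    partition_batcher(std::vector<int64_t> partitions, uint64_t revision)
      : _partitions(std::move(partitions)), _expected_revision(revision) {}

    // Returns the next batch of up to `n` partitions. Throws if the
    // table mutated (revision changed) since the walk started.
    std::vector<int64_t> next_batch(size_t n, uint64_t current_revision) {
        if (current_revision != _expected_revision) {
            throw std::runtime_error("topic table mutated between batches");
        }
        size_t end = std::min(_pos + n, _partitions.size());
        std::vector<int64_t> batch(
          _partitions.begin() + _pos, _partitions.begin() + end);
        _pos = end;
        return batch;
    }

    bool done() const { return _pos == _partitions.size(); }

private:
    std::vector<int64_t> _partitions;
    size_t _pos{0};
    uint64_t _expected_revision;
};
```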
This commit introduces a utility class that performs a map-reduce operation across the cluster to determine the sum of the cloud log sizes of all partitions.
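The reduce step is essentially a fold over the per-node responses. The sketch below shows only that fold, with assumed names; the real code would also dispatch the RPCs and handle retries:

```cpp
#include <cstdint>
#include <vector>

// Assumed per-node result shape: total bytes reported by one node,
// plus the partitions that node could not find.
struct usage_result {
    uint64_t total_size_bytes;
    std::vector<int64_t> missing_partitions;
};

// Fold the per-node results into a cluster-wide total.
inline usage_result reduce_usage(const std::vector<usage_result>& results) {
    usage_result acc{0, {}};
    for (const auto& r : results) {
        acc.total_size_bytes += r.total_size_bytes;
        acc.missing_partitions.insert(
          acc.missing_partitions.end(),
          r.missing_partitions.begin(),
          r.missing_partitions.end());
    }
    return acc;
}
```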
This commit introduces a new debug route: `/v1/debug/cloud_storage_usage` which returns the total number of bytes used by the cloud log for all partitions in the cluster. This route is only to be used for testing purposes.
Sweet, LGTM
/backport v23.1.x
Failed to run cherry-pick command.
looks great
This PR introduces infrastructure that exposes the total cloud storage usage across
the cluster. The intention is to provide this as input for the billing service.
The central component here is the `cloud_storage_size_reducer`. It performs a
map-reduce operation over the cluster by iterating over the topic table in batches
and, for each batch, sending `cloud_storage_usage` RPC requests to each node in the
cluster. Each request contains the partitions being queried, grouped by shard.
The semantics of the usage returned by `cloud_storage_size_reducer::reduce` are as
follows: the sum of all segment sizes above the start offset in any node-local
partition manifest.
"Above the start offset" is relevant because the returned usage can run ahead of the
actual size in the bucket. Retention in cloud storage is a two-step process: the start
offset is advanced first, and then segments below the start offset are removed. The
reason for this approach is to avoid over-reporting if the delete requests fail.
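In other words, a segment contributes to the reported usage only while its base offset is at or above the start offset, regardless of whether its delete request has completed. A minimal sketch with hypothetical types:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical manifest entry; real manifest types differ.
struct segment {
    int64_t base_offset;
    uint64_t size_bytes;
};

// Segments below the start offset are excluded: retention has already
// logically removed them, even if the delete requests are still pending.
inline uint64_t usage_above_start_offset(
  const std::vector<segment>& manifest, int64_t start_offset) {
    uint64_t total = 0;
    for (const auto& s : manifest) {
        if (s.base_offset >= start_offset) {
            total += s.size_bytes;
        }
    }
    return total;
}
```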
Also note that the metadata stored in the cloud alongside the actual segment files is
not included in the usage reporting. The amount of metadata in the cloud is very small
compared to the actual user data (~1MiB of metadata per 3000 segments, which
correspond to 375GiB of data at the default cloud segment size), and the ratio will
become even smaller when the manifest encoding changes for v23.2. Excluding it greatly
simplifies the implementation.
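A quick back-of-the-envelope check of the quoted ratio, assuming a 128MiB default cloud segment size (implied by 375GiB / 3000 segments):

```cpp
#include <cstdint>

// Arithmetic check of the metadata-to-data ratio quoted above.
constexpr uint64_t mib = 1024ull * 1024;
constexpr uint64_t gib = 1024ull * mib;

// 3000 segments at an assumed 128 MiB default cloud segment size.
constexpr uint64_t user_data = 3000ull * 128 * mib;
static_assert(user_data == 375ull * gib, "3000 segments ~ 375 GiB");

// ~1 MiB of manifest metadata describes all of that data:
// a ratio of 1 : 384000, i.e. well under 0.001%.
constexpr uint64_t metadata = 1 * mib;
static_assert(metadata * 384000 == user_data, "ratio is 1:384000");
```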
TODO: Extend tests to include leadership changes.
Backports Required
Release Notes