[CORE-1198] cluster: cloud storage self test #17586

Merged

Contributor

@WillemKauf WillemKauf commented Apr 3, 2024

Adds a cloud storage check to the cluster self test, as requested in #9225.
Depending on read/write permissions, the cloud storage test uses the cloud_storage::remote object configured at the application layer to:

  • Upload an object to S3.
  • List objects in a bucket.
  • Download an object from the bucket.
  • Delete the original object, if it was uploaded.
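As a rough illustration of the sequence above, here is a minimal Python sketch run against an in-memory stand-in for the bucket. It is not the C++ implementation in this PR: `FakeBucket`, `run_cloud_check`, the object key, and the 128-byte payload are hypothetical, and it assumes that upload and delete are gated on remote write, that list and download are gated on remote read, and that a download is only attempted when the listing returns at least one key.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBucket:
    """In-memory stand-in for a cloud storage bucket (hypothetical)."""
    objects: dict = field(default_factory=dict)


def run_cloud_check(bucket, remote_read, remote_write,
                    key="self-test-object", payload=b"\x00" * 128):
    """Run the permission-gated operations; return the names of those performed."""
    performed = []
    uploaded = False

    if remote_write:
        # Upload a small test object.
        bucket.objects[key] = payload
        uploaded = True
        performed.append("upload")

    if remote_read:
        # List the bucket, then download the first key if there is one.
        listed = sorted(bucket.objects)
        performed.append("list")
        if listed:
            _ = bucket.objects[listed[0]]
            performed.append("download")

    if uploaded:
        # Only delete the object this run created.
        del bucket.objects[key]
        performed.append("delete")

    return performed


if __name__ == "__main__":
    for read in (False, True):
        for write in (False, True):
            print(read, write, run_cloud_check(FakeBucket(), read, write))
```

Tracking `uploaded` separately keeps the delete step from touching objects the check did not create, which matches the "if it was uploaded" condition in the description.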

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Features

  • Adds a cloud storage check to the cluster's self test.

To run, use rpk cluster self-test start. The following flags have also been added to rpk for use with the cloud storage test:

  • --cloud-backoff-ms uint, the backoff in milliseconds for a cloud storage request.
  • --cloud-timeout-ms uint, the timeout in milliseconds for a cloud storage request.
  • --only-cloud-test, in order to run only the cloud storage test.
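For scripting, the invocation can be assembled programmatically. The sketch below only builds the argument list and does not run `rpk`; the helper name `self_test_start_args` and the example values are hypothetical, while the subcommand and flag names are the ones listed above.

```python
def self_test_start_args(cloud_timeout_ms=None, cloud_backoff_ms=None,
                         only_cloud_test=False):
    """Build an argv list for `rpk cluster self-test start` (hypothetical helper)."""
    args = ["rpk", "cluster", "self-test", "start"]
    for flag, value in (("--cloud-timeout-ms", cloud_timeout_ms),
                        ("--cloud-backoff-ms", cloud_backoff_ms)):
        if value is None:
            continue  # leave the flag out so rpk uses its own default
        if value < 0:
            # Both flags are documented as uint, so reject negatives early.
            raise ValueError(f"{flag} must be non-negative, got {value}")
        args += [flag, str(value)]
    if only_cloud_test:
        args.append("--only-cloud-test")
    return args


if __name__ == "__main__":
    # Example values are arbitrary, chosen only for illustration.
    print(" ".join(self_test_start_args(cloud_timeout_ms=5000,
                                        cloud_backoff_ms=100,
                                        only_cloud_test=True)))
```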

@WillemKauf WillemKauf changed the title Cluster cloud storage validation testing Cluster cloud storage self test Apr 3, 2024
@WillemKauf WillemKauf requested a review from andrwng April 3, 2024 16:24
@WillemKauf WillemKauf force-pushed the cluster_cloud_storage_validation_testing branch 3 times, most recently from 6adb3db to f867c05 Compare April 3, 2024 18:11
@WillemKauf WillemKauf changed the title Cluster cloud storage self test cluster: cloud storage self test Apr 3, 2024
@dotnwat dotnwat requested review from dotnwat and graphcareful April 3, 2024 22:27
Member

@dotnwat dotnwat left a comment

still need to read through it in more detail, but it's looking good. it will need to have tests added.

Comment on lines 362 to 363
const cloud_storage_clients::bucket_name& bucket,
const cloud_storage_clients::object_key& key) {
Member

general best practice, unless performance sensitive, is to have coroutines take parameters by value, since a reference parameter can dangle once the coroutine suspends.

Comment on lines 194 to 201
// Amount of fibers to run per shard
uint16_t parallelism{10};
Member

this seems to be unused, did i miss it? separately, is it desired for the cloud self-test to perform a benchmark vs only verify that connectivity and basic operations are working? i'm not sure what the right answer is.

Contributor Author

@WillemKauf WillemKauf Apr 4, 2024

Yes, currently this is unused. I also had the open question of whether more rigorous benchmarking across multiple shards was desired, or if just verifying connectivity with cloud storage was ideal.

@WillemKauf WillemKauf force-pushed the cluster_cloud_storage_validation_testing branch 2 times, most recently from 45be6ca to 2ebaec8 Compare April 4, 2024 15:23
Contributor

@andrwng andrwng left a comment

Great work! Looking pretty solid

@WillemKauf WillemKauf force-pushed the cluster_cloud_storage_validation_testing branch 3 times, most recently from 0015d8e to bb13248 Compare April 5, 2024 14:47
@dotnwat dotnwat changed the title cluster: cloud storage self test CORE-1198: cluster: cloud storage self test Apr 5, 2024
@@ -2740,17 +2740,20 @@ admin_server::self_test_start_handler(std::unique_ptr<ss::http::request> req) {
r.dtos.push_back(cluster::diskcheck_opts::from_json(obj));
} else if (test_type == "network") {
r.ntos.push_back(cluster::netcheck_opts::from_json(obj));
} else if (test_type == "cloud") {
Member

i'd recommend splitting out the changes to the admin server into a commit that comes right before the rpk changes that use those new http interfaces. makes it much easier to review.

case self_test_stage::cloud:
return "cloud";
default:
__builtin_unreachable();
Member

you can probably drop __builtin_unreachable if the cases are a covering set for the enum. if they aren't a covering set and you want to fail on default, then use vassert. if they are a covering set but you get a compiler warning, move __builtin_unreachable outside the switch statement.

@WillemKauf WillemKauf force-pushed the cluster_cloud_storage_validation_testing branch 3 times, most recently from 4459db0 to c399af3 Compare April 10, 2024 17:38
@WillemKauf
Contributor Author

still need to read through it in more detail, but it's looking good. it will need to have tests added.

The cloud storage self test will be run in the existing self_test_test.py ducktape test. I have added a commit that modifies it slightly in order to assert the correct number of self test reports.

Let me know what you think!

@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 16, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/47869#018ee7eb-f4d9-40b0-a136-68e114f85b8e:

"rptest.tests.self_test_test.SelfTestTest.test_self_test"

new failures in https://buildkite.com/redpanda/redpanda/builds/47869#018ee7eb-f4d4-4c5e-a17a-ecc57728cc49:

"rptest.tests.self_test_test.SelfTestTest.test_self_test_node_crash"

new failures in https://buildkite.com/redpanda/redpanda/builds/47869#018ee7eb-f4dc-4598-99e9-0d0a7e4bbec5:

"rptest.tests.self_test_test.SelfTestTest.test_self_test_cancellable"

new failures in https://buildkite.com/redpanda/redpanda/builds/47885#018ee86a-e740-4d78-9d33-f46f75ac8b0f:

"rptest.tests.self_test_test.SelfTestTest.test_self_test"

new failures in https://buildkite.com/redpanda/redpanda/builds/47885#018ee86a-e73a-429f-9e6c-08f703304971:

"rptest.tests.self_test_test.SelfTestTest.test_self_test_node_crash"

new failures in https://buildkite.com/redpanda/redpanda/builds/47895#018ee8e3-a7b3-4196-b91b-35895ff4e5e5:

"rptest.tests.self_test_test.SelfTestTest.test_self_test"

new failures in https://buildkite.com/redpanda/redpanda/builds/47895#018ee8e3-a7ae-40be-9fc1-6f5a5dba3224:

"rptest.tests.self_test_test.SelfTestTest.test_self_test_node_crash"

new failures in https://buildkite.com/redpanda/redpanda/builds/47895#018ee8eb-3320-407d-8c7a-de7d02439a7d:

"rptest.tests.self_test_test.SelfTestTest.test_self_test"

new failures in https://buildkite.com/redpanda/redpanda/builds/47895#018ee8eb-331b-4625-887a-0e177fa527eb:

"rptest.tests.self_test_test.SelfTestTest.test_self_test_node_crash"

@WillemKauf WillemKauf force-pushed the cluster_cloud_storage_validation_testing branch from c399af3 to 79983dc Compare April 16, 2024 18:32
andrwng previously approved these changes Apr 29, 2024
Comment on lines +82 to +72
assert report['error'] == error_msg

Contributor

nit: can we also add an assertion that cloud_storage is among the reports we've received?

Contributor Author

Added assert statement to confirm number of received cloud storage reports.
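One possible shape for such an assertion, as a sketch only: the `test_type` field name, the `"cloud_storage"` value, and both helper names are assumptions made for illustration, not the code in self_test_test.py.

```python
def cloud_storage_reports(reports):
    """Return only the reports produced by the cloud storage check.

    Assumes each report is a dict with a 'test_type' key (hypothetical).
    """
    return [r for r in reports if r.get("test_type") == "cloud_storage"]


def assert_cloud_report_count(reports, expected_count):
    """Fail loudly if the number of cloud storage reports is unexpected."""
    found = cloud_storage_reports(reports)
    assert len(found) == expected_count, (
        f"expected {expected_count} cloud_storage reports, got {len(found)}")
    return found


if __name__ == "__main__":
    sample = [
        {"test_type": "disk", "info": "write"},
        {"test_type": "cloud_storage", "info": "upload"},
        {"test_type": "cloud_storage", "info": "delete"},
    ]
    print(assert_cloud_report_count(sample, 2))
```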

assert_fail(
report,
'Remote read is not enabled for this cluster.')
elif report['info'] in ['upload', 'delete']:
Contributor

nit: maybe make this an else and then assert report['info'] in ['upload', 'delete']? Just so any future cloud_storage report types that get added necessarily have to be considered here

Contributor Author

Changed!
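The else-plus-assert pattern suggested in this thread can be sketched as follows. This is an illustration rather than the code that was merged: `required_permission` is a hypothetical helper, and the read-side labels `list` and `download` are assumed, since only `upload` and `delete` appear in the snippet under review.

```python
READ_INFOS = ("list", "download")   # assumed labels for read-side reports
WRITE_INFOS = ("upload", "delete")  # labels taken from the reviewed snippet


def required_permission(report):
    """Map a cloud storage report to the permission it depends on."""
    info = report["info"]
    if info in READ_INFOS:
        return "remote_read"
    else:
        # A bare else with an assert forces any newly added report type
        # to be handled here explicitly instead of being silently skipped.
        assert info in WRITE_INFOS, f"unhandled cloud storage report: {info!r}"
        return "remote_write"


if __name__ == "__main__":
    for info in ("upload", "list", "download", "delete"):
        print(info, required_permission({"info": info}))
```

With the original `elif`, a report whose `info` matched neither branch would pass through unchecked; with the assert it fails the test run instead.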

Comment on lines 33 to 49
remote_read = False
remote_write = False
if hasattr(ctx, 'injected_args') \
and ctx.injected_args is not None:
if 'cloud_storage_enable_remote_read' in ctx.injected_args:
remote_read = ctx.injected_args[
'cloud_storage_enable_remote_read']
if 'cloud_storage_enable_remote_write' in ctx.injected_args:
remote_write = ctx.injected_args[
'cloud_storage_enable_remote_write']

super(SelfTestTest, self).__init__(
test_context=ctx,
si_settings=SISettings(
test_context=ctx,
cloud_storage_enable_remote_read=remote_read,
cloud_storage_enable_remote_write=remote_write))
Contributor

Just FYI, there are other test base classes that allow you to start Redpanda in the test body. EndToEndTest is one example.

Feel free to keep this as is, but in case you want to avoid depending on low level things like injected_args, you could switch SelfTestTest over to using it and calling start_redpanda() in the test body

Contributor Author

Thanks for letting me know about this! It's much nicer than dealing with injected_args.

Fixed.

@WillemKauf WillemKauf force-pushed the cluster_cloud_storage_validation_testing branch from 006c70b to 38631cc Compare April 29, 2024 23:17
@WillemKauf WillemKauf requested a review from Deflaimun as a code owner April 29, 2024 23:17
@WillemKauf WillemKauf force-pushed the cluster_cloud_storage_validation_testing branch from 38631cc to 4d1376e Compare April 29, 2024 23:34
Member

@dotnwat dotnwat left a comment

yeh this looks awesome. i think if we squash the fixes back into previous commits this is looking like it's pretty much ready.

Comment on lines 54 to 61
* Cloud tests:
* Latency test: 1024-bit object.
* Depending on read/write permissions, a series of cloud storage operations are performed.

Member

@r-vasquez IIUC the way this will work is that if the cluster is already set up with TS then these tests will use those configurations.

The cloud storage check performs the following operations:
 - Upload an object to the configured S3 bucket
 - List from the bucket
 - Download from the bucket
 - Delete from the bucket

The cluster self-test contains disk and network tests to validate
and benchmark those subsystems. A test for cloud storage helps to
ensure credentials and permissions have been correctly configured
in redpanda.

Adds the `self_test::cloudcheck` object to `cluster::self_test_backend`
so that it can be invoked alongside the existing self-test routines.

To allow for customization of cloud storage self-test options,
`admin_server::self_test_start_handler()` will now be able to
read options set in the JSON request created on the `rpk` side.

Adds cloud storage self-test bindings to `rpk`, allowing users
to invoke the self-test.

Also adds the following flags for use with the cloud storage test:
- `--cloud-timeout-ms`, the timeout in ms for a cloud storage request
- `--cloud-backoff-ms`, the backoff in ms for a cloud storage request
- `--only-cloud-test`, in order to run only the cloud storage test

Add 'self_test_stage' as an indicator to the user of which self-test
routine is currently running. This commit appends the `self_test_stage`
to the `self_test_backend` and `self_test_frontend`.

Currently, a user can request the status of self-test in 'rpk' using
'cluster self-test status'. However, this only indicates whether a
test is running on a node or if the tests are finished, but not which
test is being run.

After adding bindings on the `rpk` side, this will enable users to see
which self-test is currently running.

`admin_server` will now set the self test stage in its
status reports presented to the user.

`rpk` will now generate a report of running nodes with node ID,
as well as the current stage of the self test on those running nodes.

This results in more detailed status updates presented to the user
when `cluster self-test status` is run in `rpk`.

Users will now see which test is currently being run on which node,
if a test is running.

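To make the idea of a per-node stage report concrete, here is a small Python sketch. It is not the Go code added to `rpk` and does not reproduce its actual output format: the `node_id`, `status`, and `stage` field names, the `"running"` value, the message wording, and the `format_running_nodes` helper are all assumptions for illustration. Only the `cloud` stage name is taken from the diff in this PR.

```python
def format_running_nodes(statuses):
    """Return one line per node that is still running a self-test stage.

    Each status is assumed to be a dict with 'node_id', 'status' and
    'stage' keys (hypothetical field names).
    """
    lines = []
    for s in sorted(statuses, key=lambda s: s["node_id"]):
        if s["status"] == "running":
            lines.append(f"Node {s['node_id']} is running the {s['stage']} self-test")
    return lines


if __name__ == "__main__":
    sample = [
        {"node_id": 2, "status": "running", "stage": "cloud"},
        {"node_id": 0, "status": "idle", "stage": "idle"},
        {"node_id": 1, "status": "running", "stage": "network"},
    ]
    for line in format_running_nodes(sample):
        print(line)
```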
Adds cloud storage self test result parsing to `SelfTestTest`.

Also adds configuration for the cloud storage self-test to
`clients/rpk.py`.
@WillemKauf WillemKauf force-pushed the cluster_cloud_storage_validation_testing branch from 4d1376e to 3ab7798 Compare April 30, 2024 12:52
@WillemKauf WillemKauf requested a review from dotnwat April 30, 2024 12:58
@dotnwat dotnwat requested a review from andrwng April 30, 2024 16:04
@@ -48,6 +51,15 @@ of the cluster. Available tests to run:
* Unique pairs of Redpanda nodes each act as a client and a server.
* The test pushes as much data over the wire, within the test parameters.

* Cloud tests:
* Latency test: 1024-bit object.
* Depending on cluster read/write permissions (cloud_storage_enable_remote_read, cloud_storage_enable_remote_write), a series of cloud storage operations are performed.

Suggested change
- * Depending on cluster read/write permissions (cloud_storage_enable_remote_read, cloud_storage_enable_remote_write), a series of cloud storage operations are performed.
+ * Depending on cluster read/write permissions (cloud_storage_enable_remote_read, cloud_storage_enable_remote_write), a series of cloud storage operations are performed:

Feediver1 previously approved these changes Apr 30, 2024
@Feediver1 Feediver1 left a comment

Left minor punctuation update.

Extended the self test ducktape test coverage for all combinations of
`cloud_storage_enable_remote_read` and `cloud_storage_enable_remote_write`
permissions.
@WillemKauf WillemKauf force-pushed the cluster_cloud_storage_validation_testing branch from 3ab7798 to 62dbcd7 Compare April 30, 2024 16:19
Member

@dotnwat dotnwat left a comment

everything looks good. just wondering about the necessity of using the e2e test fixture in ducktape.

@dotnwat dotnwat merged commit ab18cc6 into redpanda-data:dev Apr 30, 2024
25 checks passed
@WillemKauf WillemKauf deleted the cluster_cloud_storage_validation_testing branch May 2, 2024 01:38