self-test disk test enhancements #20590

Merged · 3 commits · Jul 19, 2024
72 changes: 66 additions & 6 deletions src/go/rpk/pkg/cli/cluster/selftest/start.go
@@ -48,8 +48,11 @@ Available tests to run:
* Disk tests:
** Throughput test: 512 KB messages, sequential read/write
*** Uses larger request message sizes and a deeper I/O queue depth to write/read more bytes in a shorter amount of time, at the cost of IOPS/latency.
** Latency test: 4 KB messages, sequential read/write
*** Uses smaller request message sizes and lower levels of parallelism to achieve higher IOPS and lower latency.
** Latency and io depth tests: 4 KB messages, sequential read/write, varying io depth
*** Uses small IO sizes and varying levels of parallelism to determine the relationship between io depth and IOPS
*** Includes one test without using dsync (fdatasync) on each write to establish the cost of dsync
** 16 KB test
*** One high io depth test at 16 KB to reflect performance at Redpanda's default chunk size
* Network tests:
** Throughput test: 8192-bit messages
*** Unique pairs of Redpanda nodes each act as a client and a server.
@@ -123,7 +126,7 @@ func assembleTests(onlyDisk bool, onlyNetwork bool, onlyCloud bool, durationDisk
diskcheck := []any{
// One test weighted for better throughput results
rpadmin.DiskcheckParameters{
Name: "512KB sequential r/w throughput disk test",
Name: "512KB sequential r/w",
DSync: true,
SkipWrite: false,
SkipRead: false,
@@ -133,16 +136,73 @@ func assembleTests(onlyDisk bool, onlyNetwork bool, onlyCloud bool, durationDisk
Parallelism: 4,
Type: rpadmin.DiskcheckTagIdentifier,
},
// .. and another for better latency/iops results
// .. and then a series of 4KB write-only tests at increasing io depth
rpadmin.DiskcheckParameters{
Name: "4KB sequential r/w latency/iops disk test",
Name: "4KB sequential r/w, low io depth",
DSync: true,
SkipWrite: false,
SkipRead: false,
DataSize: 1 * units.GiB,
RequestSize: 4 * units.KiB,
DurationMs: durationDisk,
Parallelism: 2,
Parallelism: 1,
Type: rpadmin.DiskcheckTagIdentifier,
},
rpadmin.DiskcheckParameters{
Name: "4KB sequential write, medium io depth",
DSync: true,
SkipWrite: false,
SkipRead: true,
DataSize: 1 * units.GiB,
RequestSize: 4 * units.KiB,
DurationMs: durationDisk,
Parallelism: 8,
Type: rpadmin.DiskcheckTagIdentifier,
},
rpadmin.DiskcheckParameters{
Name: "4KB sequential write, high io depth",
DSync: true,
SkipWrite: false,
SkipRead: true,
DataSize: 1 * units.GiB,
RequestSize: 4 * units.KiB,
DurationMs: durationDisk,
Parallelism: 64,
Type: rpadmin.DiskcheckTagIdentifier,
},
rpadmin.DiskcheckParameters{
Name: "4KB sequential write, very high io depth",
DSync: true,
SkipWrite: false,
SkipRead: true,
DataSize: 1 * units.GiB,
RequestSize: 4 * units.KiB,
DurationMs: durationDisk,
Parallelism: 256,
Type: rpadmin.DiskcheckTagIdentifier,
},
// ... and a 4KB test as above but with dsync off
rpadmin.DiskcheckParameters{
Name: "4KB sequential write, no dsync",
DSync: false,
SkipWrite: false,
SkipRead: true,
DataSize: 1 * units.GiB,
RequestSize: 4 * units.KiB,
DurationMs: durationDisk,
Parallelism: 64,
Type: rpadmin.DiskcheckTagIdentifier,
},
// ... and a 16KB test as above as another important size for redpanda
rpadmin.DiskcheckParameters{
Name: "16KB sequential r/w, high io depth",
Member:
Did you intentionally not add something like 4k @ 256 iodepth?

Member Author @travisdowns, Jun 28, 2024:
Are you asking more about "why not 4K" or "why not 256 iodepth"?

In any case it was intentional, but I'm open to ideas here. One thing to note is that the parallelism factor here is then multiplied by the shard count, so on modest 8-shard nodes we are already at a very high 512 io depth for parallelism=64, which IME is larger than what you need to get max throughput even on large local SSD configurations (though of course this may not be the case on some other storage configurations, especially high-throughput, longer-latency network-attached storage).

I don't actually like this multiplication because it (a) adds a confounding factor when comparing results against different clusters which may have different shard counts (but at least now we see the effective iodepth in the output) and (b) it means you can't run an iodepth=1 test except on a cluster with 1-shard nodes.

About 4K vs 16K, my goal was to add a 16K test to see the difference between 4K and 16K, i.e., how much performance varies in the range of block sizes Redpanda is already writing with default settings. Then I also wanted to add a "series" of varying iodepth tests, which I sort of arbitrarily chose to be the 16K one. I didn't want to do both, to keep the number of tests down, and I think maybe I favored 16K over 4K in part because 4K already had parallelism=2, and I wanted 1 and didn't want to change the existing 4K test, to keep some continuity with old results.

That said, very open to changing it. What is your view on the ideal series of tests to run?

Member:
> One thing to note is that the parallelism factor here is then multiplied by the shard count

Wait but right now this all happens on shard zero only. Are you saying we still multiply it by the shard count?

> That said, very open to changing it. What is your view on the ideal series of tests to run?

I don't feel strongly. Just really coming from the classic 4k test and I guess it matches the min amount we write.

I guess the 512Kib test is actually the least relevant one for RP as we never write sizes bigger than 16Kib (only when fetching from TS).

Member Author @travisdowns, Jun 29, 2024:
> Wait but right now this all happens on shard zero only. Are you saying we still multiply it by the shard count?

No, I was simply mistaken. I thought this ran on all shards, but as you say it seems to run on only one shard. I was thrown off especially by this comment and also this code and comment. Perhaps vestigial?

So I will adjust the numbers to hit higher io depths, and maybe add 1 more test.

> Just really coming from the classic 4k test and I guess it matches the min amount we write.

I'll change it to 4K.

> I guess the 512Kib test is actually the least relevant one for RP as we never write sizes bigger than 16Kib (only when fetching from TS).

It's definitely the least useful for evaluating RP performance at the default settings. As a test to understand more about the disk, especially disks with characteristics different than the most common ones we run on, I think it's fine because it is a "max throughput" test, and if it gets a much higher number than the other tests with small blocks then we've learned something.

DSync: false,
SkipWrite: false,
SkipRead: false,
DataSize: 1 * units.GiB,
RequestSize: 16 * units.KiB,
DurationMs: durationDisk,
Parallelism: 64,
Type: rpadmin.DiskcheckTagIdentifier,
},
}
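The 4 KB write-only sweep added above repeats the same parameter literal with only the name, SkipRead, and Parallelism fields changing. Purely as an illustration (not part of this PR), the same sweep could be generated in a loop; field names follow the diff, while the exact field types of rpadmin.DiskcheckParameters and of durationDisk are assumptions here:

```go
// Illustrative sketch only: builds the 4KB dsync sweep at io depths 1, 8, 64 and 256,
// mirroring the literals added in assembleTests above. Field types are assumed.
for _, t := range []struct {
	name        string
	parallelism uint
	skipRead    bool
}{
	{"4KB sequential r/w, low io depth", 1, false},
	{"4KB sequential write, medium io depth", 8, true},
	{"4KB sequential write, high io depth", 64, true},
	{"4KB sequential write, very high io depth", 256, true},
} {
	diskcheck = append(diskcheck, rpadmin.DiskcheckParameters{
		Name:        t.name,
		DSync:       true, // the separate "no dsync" variant is kept out of this sweep
		SkipWrite:   false,
		SkipRead:    t.skipRead,
		DataSize:    1 * units.GiB,
		RequestSize: 4 * units.KiB,
		DurationMs:  durationDisk,
		Parallelism: t.parallelism,
		Type:        rpadmin.DiskcheckTagIdentifier,
	})
}
```

The PR keeps the explicit literals instead, which leaves each configuration easy to read at a glance in the diff.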
3 changes: 2 additions & 1 deletion src/v/cluster/self_test/diskcheck.cc
@@ -144,7 +144,8 @@ diskcheck::run_configured_benchmarks(ss::file& file) {
auto write_metrics = co_await do_run_benchmark<read_or_write::write>(file);
auto result = write_metrics.to_st_result();
result.name = _opts.name;
result.info = "write run";
result.info = fmt::format(
"write run (iodepth: {}, dsync: {})", _opts.parallelism, _opts.dsync);
result.test_type = "disk";
if (_cancelled) {
result.warning = "Run was manually cancelled";
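With this change each write-phase report's info string also records the effective io depth and dsync setting, so, for example, the high-io-depth dsync run above would be labelled "write run (iodepth: 64, dsync: true)" rather than the previous bare "write run".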
4 changes: 2 additions & 2 deletions src/v/cluster/self_test_rpc_types.h
@@ -46,7 +46,7 @@ struct diskcheck_opts
: serde::
envelope<diskcheck_opts, serde::version<0>, serde::compat_version<0>> {
/// Descriptive name given to test run
ss::sstring name{"512K sequential r/w disk test"};
ss::sstring name{"unspecified"};
/// Where files this benchmark will read/write to exist
std::filesystem::path dir{config::node().disk_benchmark_path()};
/// Open the file with O_DSYNC flag option
@@ -56,7 +56,7 @@ struct diskcheck_opts
/// Set to true to disable the read portion of the benchmark
bool skip_read{false};
/// Total size of all benchmark files to exist on disk
uint64_t data_size{10ULL << 30}; // 1GiB
uint64_t data_size{10ULL << 30}; // 10GiB
/// Size of individual read and/or write requests
size_t request_size{512 << 10}; // 512KiB
/// Total duration of the benchmark
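For reference on the comment fix above: 10ULL << 30 is 10 × 2^30 bytes, i.e. 10 GiB, so the old "// 1GiB" annotation was incorrect; only the comment changes here and the default data_size value itself stays the same.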
6 changes: 3 additions & 3 deletions tests/rptest/tests/self_test_test.py
@@ -43,7 +43,7 @@ def all_idle():
return not any([x['status'] == 'running'
for x in node_reports]), node_reports

return wait_until_result(all_idle, timeout_sec=30, backoff_sec=1)
return wait_until_result(all_idle, timeout_sec=90, backoff_sec=1)

@cluster(num_nodes=3)
@matrix(remote_read=[True, False], remote_write=[True, False])
@@ -101,9 +101,9 @@ def assert_fail(report, error_msg):
# on specific results, but rather what tests are observed to have run
reports = flat_map(lambda node: node['results'], node_reports)

# Ensure 4 disk tests per node, read/write & latency/throughput
# Ensure 10 disk tests per node (see the RPK code for the full list)
disk_results = [r for r in reports if r['test_type'] == 'disk']
expected_disk_results = num_nodes * 4
expected_disk_results = num_nodes * 10
assert len(
disk_results
) == expected_disk_results, f"Expected {expected_disk_results} disk reports observed {len(disk_results)}"
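Where the new count of 10 comes from, assuming (as before) one report per read or write phase: three of the seven disk configurations in start.go run both phases (512KB, the low-io-depth 4KB test, and the 16KB test) and contribute two reports each, while the four write-only 4KB variants contribute one each, so 3 × 2 + 4 × 1 = 10 reports per node (previously 2 × 2 = 4).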