
tls: improve write performance by reducing copying #14053

Closed
wants to merge 11 commits

Conversation

ggreenway (Contributor)

Signed-off-by: Greg Greenway <ggreenway@apple.com>

Commit Message: This change tries to heuristically decide what block size to
write. 16kb blocks are most efficient in the TLS layer, but there were
cases where Envoy would memcpy nearly all the data in order to create
blocks of this size. This change improves performance by sometimes
writing smaller blocks in order to reduce memcpy.

Additional Description:
Risk Level: Medium
Testing: Added UT, all existing tests pass
Docs Changes: none
Release Notes: not needed; no functional change
Platform Specific Features: none
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]

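To make the heuristic concrete, here is a minimal sketch of the decision the commit message describes. It is an illustration only, pieced together from the maybeLinearize(max_size, desired_min_size) signature and the diff hunks shown below; it is not the PR's exact implementation, and the slices_/linearize()/length() internals of Buffer::OwnedImpl are assumed from the surrounding code.

// Illustrative sketch (assumed, not the actual patch): return the front slice
// untouched when copying would be mostly wasted memcpy, otherwise copy up to
// max_size bytes into a single slice so the TLS layer can emit full records.
RawSlice OwnedImpl::maybeLinearize(uint32_t max_size, uint32_t desired_min_size) {
  if (slices_.empty()) {
    return {nullptr, 0};
  }
  // The front slice is already big enough to be an efficient record payload.
  if (slices_.front()->dataSize() >= desired_min_size) {
    return {slices_.front()->data(), slices_.front()->dataSize()};
  }
  // The next slice is already large; copying it would not pay for itself, so
  // emit the small front slice as-is.
  if (slices_.size() >= 2 && slices_[1]->dataSize() >= max_size) {
    return {slices_.front()->data(), slices_.front()->dataSize()};
  }
  // Otherwise, linearize up to max_size bytes into one contiguous slice.
  const uint64_t copy_size = std::min<uint64_t>(max_size, length());
  return {linearize(static_cast<uint32_t>(copy_size)), copy_size};
}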

Signed-off-by: Greg Greenway <ggreenway@apple.com>
@@ -266,6 +266,30 @@ void* OwnedImpl::linearize(uint32_t size) {
return slices_.front()->data();
}

RawSlice OwnedImpl::maybeLinearize(uint32_t max_size, uint32_t desired_min_size) {
Member

This implementation only looks at the 1st and 2nd slices in the buffer. Would it make more sense to return a linearized group of whole slices, stopping just before it would exceed max_size?

Contributor Author

Possibly. This is very much a heuristic, and there are a lot of ways it could potentially be improved. This version improved performance in the case I was benchmarking, didn't show any degradation for low-throughput (smaller request/response) traffic patterns, and was pretty simple to reason about and predict.

Member

Yeah, but I suggested the above because then we don't need the second parameter, while keeping pretty much the same behavior otherwise.

Contributor Author

Can you elaborate on what you mean by "a linearized group of whole slices"? I don't understand what you're suggesting.

lizan (Member), Nov 17, 2020

like:

uint64_t size = 0;
for (const auto& slice : slices_) {
  if (size + slice->dataSize() > max_size) break;
  size += slice->dataSize();
}

return {linearize(size), size};

Contributor Author

At some threshold the memcpy of linearization exceeds the overhead of the extra TLS record. For instance, if all slices were 16383 bytes (1 less than a full record), I don't believe it's faster to linearize everything to 16384; emitting slightly smaller records will be faster.

I made a wild guess at picking 25% of a record (4096 bytes of the 16384-byte record size) as the threshold at which we should definitely linearize, and perf tests indicated that helped. But maybe I need to come up with a small benchmark to try to quantify this relationship.

Contributor

Right, but here we're comparing emitting many small TLS records vs. combining many small buffers into a single TLS record, not copying data between slices to completely fill records.

Contributor Author

In the example I was considering, I was asking about slices in the 4-8kb range. In that range it's not clear to me whether the memcpy cost will be more than the overhead of generating a TLS record.

Contributor

The difference between 2x 8KiB vs 1x 16KiB is probably negligible, but I'm pretty sure that memcpy overhead is smaller than writing additional TLS record(s) to the wire.

Contributor

...but that's an educated guess, feel free to benchmark this (e.g. by comparing proxy throughput, not userland microbenchmarks).

@@ -84,6 +84,10 @@ ContextImpl::ContextImpl(Stats::Scope& scope, const Envoy::Ssl::ContextConfig& c
int rc = SSL_CTX_set_app_data(ctx.ssl_ctx_.get(), this);
RELEASE_ASSERT(rc == 1, Utility::getLastCryptoError().value_or(""));

constexpr uint32_t mode = SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER;
rc = SSL_CTX_set_mode(ctx.ssl_ctx_.get(), mode);
Contributor

When is this needed? maybeLinearize returns either an unmodified slice or a linearized slice, but I don't think there is a case in which an unmodified slice would later be linearized (assuming desired_min_size <= max_size), so the buffer shouldn't move... or am I missing something?

ggreenway (Contributor Author), Nov 18, 2020

It probably isn't strictly necessary, and with the current implementation I don't think the buffer will move. But the maybeLinearize interface doesn't guarantee this property, and I didn't want to keep the existing bookkeeping to ensure the write buffer doesn't change. It doesn't look to me like anything is more expensive in BoringSSL when this mode is set; it just removes a check that isn't gaining us anything for how we use the API.

Contributor

Sure, but if the maybeLinearize implementation changes enough to require moving buffers, then we can enable this mode. Right now, it removes the default sanity check for no reason.

Contributor Author

Huh, looking into this I realize I lost part of the change (I did a bunch of moving code between branches to chop an originally big change into manageable pieces). To simplify this code, I removed bytes_to_retry_. A subsequent call can then end up with a larger buffer from maybeLinearize if data was added to the buffer since the last attempt at SSL_write. That's why I was setting this option here.

I'm trying to remember why I made that change originally; I think it may have been to make it easier to read and reason about. I don't think it had a measurable performance impact. Any preference on whether I make that change or not?

Contributor

A subsequent call can then end up with a larger buffer from maybeLinearize if data was added to the buffer since the last attempt at SSL_write. That's why I was setting this option here.

I don't believe that SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER allows the buffer to grow between retries. AFAIK, the buffer data has to stay the same, but it can be available at a different address than before.

Contributor Author

The docs say it is allowed:

// In TLS, a non-blocking |SSL_write| differs from non-blocking |write| in that
// a failed |SSL_write| still commits to the data passed in. When retrying, the
// caller must supply the original write buffer (or a larger one containing the
// original as a prefix). By default, retries will fail if they also do not
// reuse the same |buf| pointer. This may be relaxed with
// |SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER|, but the buffer contents still must be
// unchanged.

Contributor

The last sentence literally says the buffer contents still must be unchanged.

Contributor

...but there is also (or a larger one containing the original as a prefix), hmm. Maybe it's allowed after all.
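For reference, a minimal sketch of the retry pattern under discussion, assuming SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER has been set as in the diff above. The maybeLinearize call mirrors the one added to SslSocket::doWrite later in this PR; everything else (the ssl handle, event handling) is elided.

// Sketch only, not the PR's SslSocket code. The first SSL_write fails with
// SSL_ERROR_WANT_WRITE; the retry may present the same committed bytes at a
// different address, possibly inside a larger slice that contains the original
// data as a prefix (per the BoringSSL comment quoted above).
const auto slice = write_buffer.maybeLinearize(16384, 4096);
int rc = SSL_write(ssl, slice.mem_, slice.len_);
if (rc <= 0 && SSL_get_error(ssl, rc) == SSL_ERROR_WANT_WRITE) {
  // ... later, once the socket is writable again; more data may have been
  // appended to write_buffer in the meantime ...
  const auto retry_slice = write_buffer.maybeLinearize(16384, 4096);
  rc = SSL_write(ssl, retry_slice.mem_, retry_slice.len_);  // same prefix, possibly moved
}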

@antoniovicente antoniovicente self-assigned this Nov 18, 2020
@antoniovicente antoniovicente self-requested a review November 18, 2020 21:14

// The next slice will already be of the desired size, so don't copy and
// return the front slice.
if (slices_.size() >= 2 && slices_[1]->dataSize() >= max_size) {
Contributor

Worth considering a generalization of this logic so we refuse to copy whenever we encounter a next slice that is larger than some copy threshold?

That way a buffer containing {1, 1, 1, 1, 1, 16kb} would only end up copying 5 bytes when called with parameters like linearize(16kb, 4000). A rough sketch of this idea follows.
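The sketch below is my reading of the suggestion, not code from this PR; copy_threshold is a hypothetical parameter playing the role of desired_min_size.

// Accumulate small leading slices, but stop as soon as the next slice is already
// at least copy_threshold bytes, so large slices never get memcpy'd just to round
// out a record. For {1, 1, 1, 1, 1, 16kb} with copy_threshold=4000, this copies 5 bytes.
uint64_t to_copy = 0;
for (const auto& slice : slices_) {
  if (slice->dataSize() >= copy_threshold || to_copy + slice->dataSize() > max_size) {
    break;
  }
  to_copy += slice->dataSize();
}
if (to_copy == 0) {
  // The front slice is itself large; return it unmodified.
  return {slices_.front()->data(), slices_.front()->dataSize()};
}
return {linearize(static_cast<uint32_t>(to_copy)), to_copy};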

@@ -248,15 +248,17 @@ Network::IoResult SslSocket::doWrite(Buffer::Instance& write_buffer, bool end_st
while (bytes_to_write > 0) {
// TODO(mattklein123): As it relates to our fairness efforts, we might want to limit the number
// of iterations of this loop, either by pure iterations, bytes written, etc.
const auto slice = write_buffer.maybeLinearize(16384, 4096);
Contributor

Some of the concerns about resuming from a different region could be addressed by skipping the call to linearize when doing a retry. When retrying, we know that the first slice contains roughly bytes_to_retry_ bytes.

@@ -248,15 +248,17 @@ Network::IoResult SslSocket::doWrite(Buffer::Instance& write_buffer, bool end_st
while (bytes_to_write > 0) {
// TODO(mattklein123): As it relates to our fairness efforts, we might want to limit the number
// of iterations of this loop, either by pure iterations, bytes written, etc.
const auto slice = write_buffer.maybeLinearize(16384, 4096);
Contributor

I recommend setting the copy threshold to 4000 bytes instead of 4096. This is related to the buffer's default slice size being 4096 - sizeof(OwnedSlice), which is about 4032 bytes. Setting it to 4096 will result in a lot of spurious copies. See also #14054 (comment)

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
ggreenway (Contributor Author)

Here are the results of the benchmark code I just pushed. The test params are: full_linearize/short_slice_size/num_short_slices.

full_linearize==0 means use the new maybeLinearize(); ==1 means use linearize(16384).
short_slice_size is the size of the short slices that are put at the beginning of the buffer.
num_short_slices is how many short slices are added.

The short slices are there to try to force degenerate behavior.

Please review the test code and make sure it's measuring what we want to measure.

One thing I went back and forth on was whether to use a real socket or a memory BIO to communicate between client and server. Real sockets capture a real cost (kernel-mode transitions), but I had some inconsistency between test runs, and I wonder if the syscalls contributed.
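For readers skimming the numbers below, a rough sketch of how the three test parameters presumably map onto Google Benchmark arguments; this is an assumption about the shape of the test, not the actual benchmark source.

#include "benchmark/benchmark.h"

// testThroughput/<full_linearize>/<short_slice_size>/<num_short_slices>
static void testThroughput(benchmark::State& state) {
  const bool full_linearize = state.range(0) != 0;  // 1: linearize(16384), 0: maybeLinearize(16384, 4096)
  const int64_t short_slice_size = state.range(1);  // size of each short slice prepended to the buffer
  const int64_t num_short_slices = state.range(2);  // how many short slices to prepend
  for (auto _ : state) {
    // ... build a buffer with num_short_slices slices of short_slice_size bytes
    // followed by the bulk payload, then push it through SSL_write using the
    // strategy selected by full_linearize, recording counters ...
  }
}
BENCHMARK(testThroughput)->Args({0, 4096, 1})->Args({1, 4096, 1});
BENCHMARK_MAIN();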

$ sudo chrt -f 99 ./bazel-bin/test/extensions/transport_sockets/tls/tls_throughput_benchmark -- --benchmark_min_time=2
2020-11-20 15:00:33
Running ./bazel-bin/test/extensions/transport_sockets/tls/tls_throughput_benchmark
Run on (24 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x12)
  L1 Instruction 32 KiB (x12)
  L2 Unified 256 KiB (x12)
  L3 Unified 15360 KiB (x2)
Load Average: 1.29, 1.23, 1.02
----------------------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------
testThroughput/0/0/0          97.7 us         97.7 us        28015 throughput=1.84405G/s writes_per_iteration=11
testThroughput/1/0/0          99.2 us         99.1 us        28303 throughput=1.81789G/s writes_per_iteration=11
testThroughput/0/1/1           100 us          100 us        27779 throughput=1.7994G/s writes_per_iteration=12
testThroughput/1/1/1           100 us          100 us        28316 throughput=1.79611G/s writes_per_iteration=11
testThroughput/0/128/1        99.4 us         99.4 us        28503 throughput=1.81276G/s writes_per_iteration=12
testThroughput/1/128/1         100 us          100 us        28521 throughput=1.79555G/s writes_per_iteration=11
testThroughput/0/4096/1       90.5 us         90.5 us        30447 throughput=1.99064G/s writes_per_iteration=12
testThroughput/1/4096/1       99.2 us         99.2 us        27864 throughput=1.81592G/s writes_per_iteration=11
testThroughput/0/1/2           101 us          101 us        27687 throughput=1.77774G/s writes_per_iteration=11
testThroughput/1/1/2           100 us          100 us        27704 throughput=1.80209G/s writes_per_iteration=11
testThroughput/0/128/2        98.5 us         98.5 us        28761 throughput=1.83011G/s writes_per_iteration=12
testThroughput/1/128/2         101 us          101 us        28496 throughput=1.79126G/s writes_per_iteration=11
testThroughput/0/4096/2       93.9 us         93.9 us        30517 throughput=1.91974G/s writes_per_iteration=13
testThroughput/1/4096/2       99.4 us         99.4 us        27828 throughput=1.81344G/s writes_per_iteration=11
testThroughput/0/1/3          99.8 us         99.8 us        27712 throughput=1.80647G/s writes_per_iteration=11
testThroughput/1/1/3          99.9 us         99.9 us        26832 throughput=1.80452G/s writes_per_iteration=11
testThroughput/0/128/3        99.7 us         99.7 us        28752 throughput=1.8075G/s writes_per_iteration=12
testThroughput/1/128/3         101 us          101 us        26020 throughput=1.77846G/s writes_per_iteration=11
testThroughput/0/4096/3        101 us          101 us        30035 throughput=1.78075G/s writes_per_iteration=14
testThroughput/1/4096/3       98.8 us         98.8 us        27776 throughput=1.82495G/s writes_per_iteration=11

ggreenway (Contributor Author)

@antoniovicente The benchmark in this PR might help measure effects in #14111

num_writes++;
}

state.counters["writes_per_iteration"] = num_writes;
Contributor

Could we capture a count of the number of times linearize did something?
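One possible way to wire such a counter into the benchmark loop (a sketch under assumptions, not the counter the author actually added): count a copy whenever the slice returned by maybeLinearize no longer aliases the buffer's original front slice. frontSlice() here stands in for whatever accessor the test uses to observe the front slice's address.

// Sketch: count how often maybeLinearize actually copied data.
uint64_t num_times_linearize_did_something = 0;
while (write_buffer.length() > 0) {
  const void* front_before = write_buffer.frontSlice().mem_;
  const auto slice = write_buffer.maybeLinearize(16384, 4096);
  if (slice.mem_ != front_before) {
    num_times_linearize_did_something++;
  }
  // ... SSL_write(ssl, slice.mem_, slice.len_) and drain the written bytes ...
}
state.counters["num_times_linearize_did_something"] = num_times_linearize_did_something;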

ggreenway (Contributor Author), Nov 30, 2020

Good idea!

These results were run under bazel, so ignore timing. But the num_times_linearize_did_something is deterministic.

----------------------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------
testThroughput/0/0/0           330 us          328 us            1 num_times_linearize_did_something=10 throughput=548.771M/s writes_per_iteration=11
testThroughput/1/0/0           276 us          275 us            1 num_times_linearize_did_something=10 throughput=655.961M/s writes_per_iteration=11
testThroughput/0/1/1           285 us          285 us            1 num_times_linearize_did_something=10 throughput=633.111M/s writes_per_iteration=12
testThroughput/1/1/1           342 us          341 us            1 num_times_linearize_did_something=11 throughput=528.143M/s writes_per_iteration=11
testThroughput/0/128/1         290 us          289 us            1 num_times_linearize_did_something=9 throughput=623.632M/s writes_per_iteration=12
testThroughput/1/128/1         294 us          293 us            1 num_times_linearize_did_something=10 throughput=614.242M/s writes_per_iteration=11
testThroughput/0/4096/1        257 us          256 us            1 num_times_linearize_did_something=1 throughput=704.526M/s writes_per_iteration=12
testThroughput/1/4096/1        285 us          285 us            1 num_times_linearize_did_something=11 throughput=633.298M/s writes_per_iteration=11
testThroughput/0/1/2           283 us          283 us            1 num_times_linearize_did_something=11 throughput=637.848M/s writes_per_iteration=11
testThroughput/1/1/2           288 us          287 us            1 num_times_linearize_did_something=11 throughput=628.116M/s writes_per_iteration=11
testThroughput/0/128/2         316 us          316 us            1 num_times_linearize_did_something=7 throughput=571.135M/s writes_per_iteration=12
testThroughput/1/128/2         271 us          270 us            1 num_times_linearize_did_something=10 throughput=667.464M/s writes_per_iteration=11
testThroughput/0/4096/2        272 us          271 us            1 num_times_linearize_did_something=1 throughput=664.256M/s writes_per_iteration=13
testThroughput/1/4096/2        282 us          282 us            1 num_times_linearize_did_something=11 throughput=640.216M/s writes_per_iteration=11
testThroughput/0/1/3           283 us          282 us            1 num_times_linearize_did_something=11 throughput=639.23M/s writes_per_iteration=11
testThroughput/1/1/3           275 us          274 us            1 num_times_linearize_did_something=11 throughput=656.697M/s writes_per_iteration=11
testThroughput/0/128/3         281 us          280 us            1 num_times_linearize_did_something=5 throughput=642.652M/s writes_per_iteration=12
testThroughput/1/128/3         272 us          271 us            1 num_times_linearize_did_something=10 throughput=664.379M/s writes_per_iteration=11
testThroughput/0/4096/3        261 us          260 us            1 num_times_linearize_did_something=1 throughput=693.156M/s writes_per_iteration=14
testThroughput/1/4096/3        273 us          273 us            1 num_times_linearize_did_something=11 throughput=660.858M/s writes_per_iteration=11

antoniovicente (Contributor), Nov 30, 2020

num_times_linearize_did_something looks high in the cases where maybeLinearize is used. The number of copies done by both implementations is the same most of the time.

"if (slices_.size() >= 2 && slices_[1]->dataSize() >= max_size) {" should be using desired_min_size

Results after this change:

Benchmark                        Time             CPU   Iterations UserCounters...
testThroughput/0/0/0          60.5 us         60.5 us        11572 num_times_linearize_did_something=0 throughput=2.97744G/s writes_per_iteration=12
testThroughput/1/0/0          62.9 us         62.9 us        11116 num_times_linearize_did_something=10 throughput=2.86485G/s writes_per_iteration=11
testThroughput/0/1/1          62.1 us         62.1 us        11279 num_times_linearize_did_something=0 throughput=2.90119G/s writes_per_iteration=13
testThroughput/1/1/1          64.0 us         64.0 us        10919 num_times_linearize_did_something=11 throughput=2.81688G/s writes_per_iteration=11
testThroughput/0/128/1        64.4 us         64.4 us        10880 num_times_linearize_did_something=0 throughput=2.79695G/s writes_per_iteration=14
testThroughput/1/128/1        63.3 us         63.3 us        11091 num_times_linearize_did_something=10 throughput=2.84636G/s writes_per_iteration=11
testThroughput/0/4096/1       62.6 us         62.6 us        11181 num_times_linearize_did_something=0 throughput=2.87997G/s writes_per_iteration=13
testThroughput/1/4096/1       63.5 us         63.5 us        11035 num_times_linearize_did_something=11 throughput=2.83919G/s writes_per_iteration=11
testThroughput/0/1/2          60.9 us         60.9 us        11496 num_times_linearize_did_something=1 throughput=2.96085G/s writes_per_iteration=12
testThroughput/1/1/2          63.9 us         63.8 us        10956 num_times_linearize_did_something=11 throughput=2.82267G/s writes_per_iteration=11
testThroughput/0/128/2        63.3 us         63.2 us        11093 num_times_linearize_did_something=1 throughput=2.85011G/s writes_per_iteration=13
testThroughput/1/128/2        63.1 us         63.1 us        11066 num_times_linearize_did_something=10 throughput=2.85719G/s writes_per_iteration=11
testThroughput/0/4096/2       64.4 us         64.4 us        10890 num_times_linearize_did_something=0 throughput=2.80025G/s writes_per_iteration=14
testThroughput/1/4096/2       63.5 us         63.5 us        11029 num_times_linearize_did_something=11 throughput=2.83725G/s writes_per_iteration=11
testThroughput/0/1/3          61.0 us         60.9 us        11461 num_times_linearize_did_something=1 throughput=2.957G/s writes_per_iteration=12
testThroughput/1/1/3          64.0 us         64.0 us        10950 num_times_linearize_did_something=11 throughput=2.81811G/s writes_per_iteration=11
testThroughput/0/128/3        63.8 us         63.8 us        11085 num_times_linearize_did_something=1 throughput=2.82643G/s writes_per_iteration=13
testThroughput/1/128/3        63.1 us         63.0 us        11112 num_times_linearize_did_something=10 throughput=2.85861G/s writes_per_iteration=11
testThroughput/0/4096/3       64.7 us         64.7 us        10804 num_times_linearize_did_something=1 throughput=2.78513G/s writes_per_iteration=14
testThroughput/1/4096/3       63.3 us         63.3 us        11051 num_times_linearize_did_something=11 throughput=2.84711G/s writes_per_iteration=11

antoniovicente (Contributor)

./bazel-bin/test/extensions/transport_sockets/tls/tls_throughput_benchmark

When I try to run tls_throughput_benchmark, I get:
assert failure: err > 0. Details: SSL_CTX_use_certificate_file

antoniovicente added a commit to antoniovicente/envoy that referenced this pull request Nov 24, 2020
Signed-off-by: Antonio Vicente <avd@google.com>
ggreenway (Contributor Author)

./bazel-bin/test/extensions/transport_sockets/tls/tls_throughput_benchmark

When I try to run tls_throughput_benchmark, I get:
assert failure: err > 0. Details: SSL_CTX_use_certificate_file

Does it also fail when you run it under bazel, e.g. bazel test -c opt //test/extensions/transport_sockets/tls:tls_throughput_benchmark_test?

antoniovicente (Contributor)

./bazel-bin/test/extensions/transport_sockets/tls/tls_throughput_benchmark

When I try to run tls_throughput_benchmark, I get:
assert failure: err > 0. Details: SSL_CTX_use_certificate_file

Does it also fail when you run it under bazel, e.g. bazel test -c opt //test/extensions/transport_sockets/tls:tls_throughput_benchmark_test?

That passes, but it runs with the test flags that reduce the number of iterations. I think the binary runs successfully if I cd to the directory where the binary lives.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
}

// The next slice will already be of the desired size, so don't copy and
// return the front slice.
Contributor

There's an error in the next line; it should be:
if (slices_.size() >= 2 && slices_[1]->dataSize() >= desired_min_size) {

Contributor Author

Given that it's a heuristic, I wouldn't say it's an error, just a choice. If the next slice is slightly larger than 4k, and the current slice is 1 byte, what's the best behavior?

It turns out the slice sizes are terrible, as you've noted, due to the inline storage of the OwnedSlice. The second slice contains just slightly less than 16k (I think it's 64 bytes less), which results in a bunch of copies on subsequent slices.

I think the next step is to remove the inline-storage from the slice (#14111), then re-evaluate this PR.

Contributor

Sorry, my read of the comment made me think that you intended to compare against desired_min_size.

I think an interesting case is the write behavior for HTTP/2, which involves a 9-byte data frame header followed by up to 16kb of data. The change in #14111 will have some consequences for how those writes interact with linearize, but I think it would result in little to no performance difference, since both versions of the buffer class would end up copying about the same amount of data during linearize.

Contributor Author

Sorry, my read of the comment made me think that you intended to compare against desired_min_size.

I see the confusion now. That comment was written when the parameter had a different (less clear) name. I'll clarify the comment.


github-actions (bot)

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale label (stalebot believes this issue/PR has not been touched recently) on Jan 15, 2021
Base automatically changed from master to main January 15, 2021 23:01
github-actions (bot)

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

Labels: stale, waiting

4 participants