
feat: improve prom write requests decode performance #3478

Merged
merged 2 commits into from
Mar 12, 2024

Conversation

v0y4g3r
Contributor

@v0y4g3r v0y4g3r commented Mar 11, 2024

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

N/A

What's changed and what's your intention?

This PR aims to improve the decode performance of PromWriteRequest. This was previously addressed in #3425, but we still observe some room for optimization.

Decoding WriteRequest

  • First, we can eliminate one copy_to_bytes invocation by inlining prost::encoding::bytes::merge.
  • Then we dive into copy_to_bytes itself and observe that it is much slower than Golang's byte slice operation:
    • For a bytes = bytes[..idx] operation repeated 120k times (which is the case when decoding labels from 10k timeseries in a WriteRequest), Rust's Bytes takes 1.2ms, while Golang's byte slice takes 30us.
    • The reason for this difference is that Bytes also handles reference counting when built from a Vec<u8>, while Golang's byte slice only has ptr, len and cap fields; the garbage collector handles lifetimes, which brings little overhead when buffers are pooled.
    • The bytes fields of PromLabel are short-lived: they are converted to strings and added to TableBuilder soon after each PromTimeseries is decoded, so the original decompressed Bytes always outlives PromLabel::name and PromLabel::value. We can therefore use some unsafe operations, such as directly constructing a new Bytes from the raw pointer plus len/cap fields; that is what servers::proto::copy_to_bytes does.
    • For the same test mentioned above, servers::proto::copy_to_bytes takes 200us, still slower than Golang, but fairly acceptable.
    • With this optimization, decoding a 10k-timeseries WriteRequest further improves from 3.0ms to 1.8ms. For reference, VictoriaMetrics' WriteRequest decoding takes 1.2ms.

Decode-only benchmark result:

Running benches/bench_prom.rs (target/release/deps/bench_prom-9768faca31abd429)
Gnuplot not found, using plotters backend 
decode/write_request  time:     [7.1515 ms 7.1673 ms 7.1832 ms]
                      change:   [-0.7329% -0.3899% -0.0453%]  (p=0.03<0.05)
                      Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

decode/prom_write_request
                        time:   [1.8225 ms 1.8308 ms 1.8379 ms]
                        change: [-1.2093% -0.5679% +0.0813%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  15 (15.00%) low mild
  1 (1.00%) high mild

Building RowInsertRequests

Aside from decoding, we also found that PromWriteRequest::as_row_insert_requests takes more time than decoding itself.

  • While decoding each PromTimeseries, we need to build the schema for each metric (table), which involves two hashmap lookups per label; that sums to 120k hashmap lookups for a 10k-timeseries request.
  • We can use an ahash-based hashmap instead of Rust's default SipHash-based hashmap to get a roughly 40% improvement in hashtable lookups (Rust's default hashmap is quite slow, btw).
  • Compared to SipHash, aHash is not a cryptographically secure hasher, but that doesn't matter in our case.
  • I also compared other hashmaps like FxHashMap, which is used internally in rustc, but did not notice a significant improvement.

Results

As before, decoding a WriteRequest with 10k timeseries and 60k labels and converting it to RowInsertRequests:

     Running benches/bench_prom.rs (target/release/deps/bench_prom-ac6fa2abc4d24957)
decode/write_request    time:   [13.816 ms 13.882 ms 13.954 ms]
                        change: [-0.9281% -0.3477% +0.2455%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
decode/prom_write_request
                        time:   [5.7520 ms 5.7591 ms 5.7666 ms]
                        change: [+0.5254% +0.7116% +0.9096%] (p = 0.00 < 0.05)
                        Change within noise threshold.

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR does not require documentation updates.

@github-actions github-actions bot added the docs-not-required This change does not impact docs. label Mar 11, 2024
@v0y4g3r v0y4g3r marked this pull request as ready for review March 11, 2024 08:22
@v0y4g3r v0y4g3r requested review from evenyag and waynexia March 11, 2024 08:25

codecov bot commented Mar 11, 2024

Codecov Report

Attention: Patch coverage is 73.68421%, with 15 lines in your changes missing coverage. Please review.

Project coverage is 84.93%. Comparing base (a9d42f7) to head (07710c6).
Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3478      +/-   ##
==========================================
- Coverage   85.44%   84.93%   -0.51%     
==========================================
  Files         895      900       +5     
  Lines      147093   149647    +2554     
==========================================
+ Hits       125685   127105    +1420     
- Misses      21408    22542    +1134     

@killme2008
Contributor

Great job! Do you think we could submit these enhancements to the main prost project?

@v0y4g3r
Contributor Author

v0y4g3r commented Mar 11, 2024

Great job! Do you think we could submit these enhancements to the main prost project?

No. This optimization is safe if and only if:

  • Buf is backed by a contiguous byte array;
  • callers can ensure the decoded result never outlives the original Bytes.

@waynexia waynexia enabled auto-merge March 12, 2024 11:45
@waynexia waynexia added this pull request to the merge queue Mar 12, 2024
Merged via the queue into GreptimeTeam:main with commit 9afe327 Mar 12, 2024
18 checks passed
        }
        if buf.remaining() != limit {
            return Err(DecodeError::new("delimited length exceeded"));
        }
        self.series.add_to_table_data(&mut self.table_data);
    }
    3u32 => {
        // We can ignore metadata for now.
        prost::encoding::skip_field(wire_type, tag, &mut buf, ctx.clone())?;
        // TODO(hl): metadata are skipped.
Collaborator

What exactly is the TODO?

Could we create an issue for it with a short description?

}
}
}

#[inline(always)]
fn copy_to_bytes(data: &mut Bytes, len: usize) -> Bytes {
Collaborator

IIRC this is already the impl of bytes::Bytes?

We copy it here for inline?

https://github.com/tokio-rs/bytes/blob/ca004117f86afccd36148dee7c8413cfaf9de6a4/src/bytes.rs#L583-L591
