Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fully-reproducible online tracer for banking #29196

Merged
merged 43 commits into from
Jan 25, 2023

Conversation

ryoqun
Copy link
Member

@ryoqun ryoqun commented Dec 10, 2022

Problem

It's hard to reproduce/examine past banking-related performance problems which are observed only at mainnet-beta.

The underlying problem is that it's not easy to create a similar load on test/dev environments. Worse, we also don't have good visibility of actual load due to limited set of metrics, which are indirect by nature and aggregated in ad-hoc definitions.
This is especially so for banking, say unlike replay performance. That's because any discarded packets are unavailable afterwards like later investigating/benchmarking.

Lastly, the performance profile of banking isn't linear by any means as well. There's various factors affecting its throughput (account access pattern, qos, resource contentions across banking threads...)

Summary of Changes

So, let's get to bottom of this.

Namely, start to collect all of REAL trace data fed into BankingStage on production environment.
Once provided from validator-node operators, we will be able to consume these trace data later with solana-ledger-tool simulate-leader-blocks (not included in this pr)

The obvious concerns is of any overhead. So, background trace writer thread is introduced and its work is rather simple: just writing packet blobs. Casual testing shows 1GB/s write throughput is possible.

On the other hand, any critical-code path's overhead will be creating/cloning Arcs and sending PacketBatches off over crossbeam_channel. Specifically, this results in additional 100ns per Vec<PacketBatch>, which should be negligible.

Also, the trace file will be discarded 14days old or more than 14G is written. So, ram/disk usage is capped.
Hard to imagine 10G link (if employed) is fully saturated under quic, even if all of three banking channels (non-vote, tpu-vote, gossip-vote) are combined.

So, i think attack are correctly mitigated as well.

Currently, banking tracing is opt-in. but planned to be promoted to opt-out.

along side the obvious packets tracing, banking start and end events are recorded as well.

Also, backport to mb is desired. the pr is prepared in that mind (I'll spin off a preparatory cleanup pr...).
Unlike all of this other draft my prs, this pr is ready for review (except the diff size, warranting the cleanup pr....) and I've polished it up to the extreme.

spin-off pr statuses/tasks

Copy link
Contributor

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a short look this morning, leaving some initial comments. Still need to make it through the remaining 90%.

Side note: from the looks of it, I'm guessing you and I will have a lot of conflicts in the near future as I tried splitting up BankingStage into different sub-components (DecisionMaking, Consuming, Forwarding, Committing). As I post PRs for those I'll be sure to include you so we can make coordinating easier.

core/src/banking_trace.rs Outdated Show resolved Hide resolved
core/src/banking_trace.rs Show resolved Hide resolved
core/src/banking_trace.rs Outdated Show resolved Hide resolved
core/Cargo.toml Outdated Show resolved Hide resolved
core/src/banking_trace.rs Outdated Show resolved Hide resolved
core/src/packet_deserializer.rs Outdated Show resolved Hide resolved
core/src/packet_deserializer.rs Outdated Show resolved Hide resolved
@ryoqun
Copy link
Member Author

ryoqun commented Dec 10, 2022

Took a short look this morning, leaving some initial comments. Still need to make it through the remaining 90%.

appreciate quick feedback! however, please remind that I'm not forcing you a weekend code reviews.. it's just that I'm working 24/7 whenever i find some spare time. ;)

Side note: from the looks of it, I'm guessing you and I will have a lot of conflicts in the near future as I tried splitting up BankingStage into different sub-components (DecisionMaking, Consuming, Forwarding, Committing). As I post PRs for those I'll be sure to include you so we can make coordinating easier.

got it. although there will be some conflicts, i don't think this tracer thing is duplicating your works. if wanted, i can bite the intense git merge conflict session. :)

@apfitzge apfitzge self-requested a review December 12, 2022 15:33
core/src/banking_trace.rs Outdated Show resolved Hide resolved
core/src/banking_trace.rs Outdated Show resolved Hide resolved
core/src/banking_trace.rs Outdated Show resolved Hide resolved
core/src/banking_trace.rs Outdated Show resolved Hide resolved
core/benches/banking_trace.rs Show resolved Hide resolved
sdk/src/packet.rs Outdated Show resolved Hide resolved
validator/src/cli.rs Show resolved Hide resolved
validator/src/cli.rs Outdated Show resolved Hide resolved
validator/src/cli.rs Outdated Show resolved Hide resolved
core/src/banking_trace.rs Show resolved Hide resolved
@ryoqun
Copy link
Member Author

ryoqun commented Dec 15, 2022

@apfitzge hey, thanks for very detailed review comments! I'll address these shortly after with joy! also, this is very hot press but the simulation code is here as requested: ryoqun#2 (it's VERY rough...)

@ryoqun ryoqun force-pushed the banking-tracer branch 2 times, most recently from ca94c60 to db4732c Compare December 20, 2022 14:19
@ryoqun
Copy link
Member Author

ryoqun commented Dec 20, 2022

@apfitzge i haven't fully replied to your all comments. however, could you look again at your convenient time as i pushed numerous commits to address them? also, i put bunch of my 2 cents. ;)

@apfitzge apfitzge self-requested a review December 20, 2022 23:26
@ryoqun ryoqun force-pushed the banking-tracer branch 3 times, most recently from 1482637 to b76ed3d Compare December 21, 2022 13:03
bors bot added a commit to jonasbb/serde_with that referenced this pull request Dec 24, 2022
536: Mention "Bytes" on README.md and lib.rs r=jonasbb a=ryoqun

I think #277 can be more promoted. :)

As I wrote the current status of serialization of const generics like bytes (solana-labs/solana#29196 (comment)), i think `"Bytes"` is best solution so far.

`@jonasbb` hey, this functionary saved us. however it took a while to spot this. I think `Bytes` can be promoted. :)

Co-authored-by: Ryo Onodera <ryoqun@gmail.com>
@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 6, 2023
@mschneider
Copy link
Contributor

Really cool idea, I actually though we might get a similar trace by just capturing the quic network stream on a switch and stream copy it to another machine for recording. Having this as a software feature is of course way more accessible

@github-actions github-actions bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 13, 2023
@apfitzge
Copy link
Contributor

Really cool idea, I actually though we might get a similar trace by just capturing the quic network stream on a switch and stream copy it to another machine for recording. Having this as a software feature is of course way more accessible

Also has an added benefit of somewhat isolating it from sigverify time, as we get the time packets enter banking stage after being verified 😄

#[error("Integer Cast Error: {0}")]
IntegerCastError(#[from] std::num::TryFromIntError),

#[error("Trace directory's byte limit is too small (must be larger than {1}): {0}")]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

apfitzge
apfitzge previously approved these changes Jan 25, 2023
Copy link
Contributor

@apfitzge apfitzge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - happy with how this ended up and thanks for all the hard work @ryoqun. Love the commit links you leave for resolving each conversation

@mergify mergify bot dismissed apfitzge’s stale review January 25, 2023 07:35

Pull request has been modified.

@ryoqun ryoqun force-pushed the banking-tracer branch 2 times, most recently from 947701c to 84e9520 Compare January 25, 2023 08:26
@ryoqun ryoqun merged commit 40bbf99 into solana-labs:master Jan 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants