
Fix BankingStage packet starvation (1.9) #25236

Merged
merged 5 commits on May 16, 2022

Conversation

@buffalu (Contributor) commented May 15, 2022

Problem

  • BankingStage gets starved of new packets if it gets stuck with transactions that exceed the cost model or repeatedly hit AccountInUse/other errors; those packets are marked retryable and take a while to drain from the buffer.
  • Issue: Performance degradation (small blocks) after updating from 1.8 to 1.9 #24163
  • BankingStage should attempt to read in new packets more frequently.

Summary of Changes

  • Remove the looping over process_buffered_packets.
  • BankingStage will attempt to go through each thread's backlog once before reading new packets.
  • If there are buffered packets, set the recv_timeout to 0s so the receive behaves like a try_recv (see the sketch after the note below).

NOTE:

  • We are still wasting time trying to pack batches for txs where the cost model is already saturated (serialization, QoS, account locking, etc.) but this seems like a major improvement over the status quo.
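To make the intended control flow concrete, here is a minimal sketch of the outer-loop shape described above. It assumes a std::sync::mpsc channel as a stand-in for the real verified-packet channel; PacketBatch and process_buffered_packets are simplified placeholders, not the actual BankingStage types or signatures.

```rust
use std::{
    collections::VecDeque,
    sync::mpsc::{Receiver, RecvTimeoutError},
    time::Duration,
};

// Hypothetical stand-in for a verified packet batch coming out of sigverify.
struct PacketBatch;

// Placeholder for the real BankingStage::process_buffered_packets: attempt to
// consume/forward each buffered batch once, retaining retryable packets
// (cost-model throttled, AccountInUse, etc.).
fn process_buffered_packets(_buffer: &mut VecDeque<PacketBatch>) {}

// Simplified outer loop illustrating the change: drain the backlog once per
// iteration, then poll the sigverify channel. A zero timeout acts like a
// try_recv when the buffer is non-empty; otherwise block briefly.
fn banking_loop(verified_receiver: Receiver<PacketBatch>) {
    let mut buffered_packet_batches: VecDeque<PacketBatch> = VecDeque::new();

    loop {
        // One pass over the backlog per iteration; no inner loop that could
        // starve the channel while retryable packets churn in the buffer.
        if !buffered_packet_batches.is_empty() {
            process_buffered_packets(&mut buffered_packet_batches);
        }

        // With work still buffered, don't block waiting for new batches.
        let recv_timeout = if buffered_packet_batches.is_empty() {
            Duration::from_millis(100)
        } else {
            Duration::from_millis(0)
        };

        match verified_receiver.recv_timeout(recv_timeout) {
            Ok(batch) => buffered_packet_batches.push_back(batch),
            Err(RecvTimeoutError::Timeout) => (),
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
}
```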

@mergify mergify bot added the community Community contribution label May 15, 2022
@mergify mergify bot requested a review from a team May 15, 2022 20:25
@buffalu buffalu changed the title Make BankingStage::process_buffered_packets run once per outer loop Fix BankingStage packet starvation May 15, 2022
@buffalu (Contributor, Author) commented May 15, 2022

from @carllin:

yeah I think we have to make two decisions:

  1. How to split the buffer for cost-model-throttled transactions that are stuck in PacketBatches in 1.9, vs 1.10/master where every packet is independent
  2. How often to read from the sigverify pipe into BankingStage. The most straightforward decision is, as @buffalu suggested, to do one iteration of receive_and_buffer after one round of attempted execution

this should solve 2.

note that this doesn't solve 1, but it should definitely make starvation way better


let (_, slot_metrics_checker_check_slot_boundary_time) = Measure::this(
    |_| {
        let current_poh_bank = {
            // TODO (B): constant spinning here may have an impact on poh, discuss in PR
Contributor Author:
something worth discussing

Contributor Author:
if you have one packet in buffered_packet_batches that exceeds the cost model, it'll probably spin super fast grabbing this lock a ton, so it might be a good idea to rate limit it

Contributor:
yeah this is a good point, we can wrap it in a check that only checks this once every SLOT_BOUNDARY_CHECK_MS: usize = 10
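For illustration only, a sketch of that rate-limiting idea, assuming a hypothetical SlotBoundaryChecker helper and the suggested 10ms constant; this is not the code that landed in the PR.

```rust
use std::time::{Duration, Instant};

// Hypothetical constant mirroring the suggested SLOT_BOUNDARY_CHECK_MS = 10.
const SLOT_BOUNDARY_CHECK_MS: u64 = 10;

// Rate limiter for the slot-boundary check, so the loop only grabs the
// poh_recorder lock to fetch the current bank once per window instead of on
// every spin.
struct SlotBoundaryChecker {
    last_check: Instant,
}

impl SlotBoundaryChecker {
    fn new() -> Self {
        Self {
            last_check: Instant::now(),
        }
    }

    // Returns true at most once per SLOT_BOUNDARY_CHECK_MS window.
    fn should_check(&mut self) -> bool {
        if self.last_check.elapsed() >= Duration::from_millis(SLOT_BOUNDARY_CHECK_MS) {
            self.last_check = Instant::now();
            true
        } else {
            false
        }
    }
}
```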

@ckamm (Contributor) commented May 15, 2022

Does it even make sense to mark transactions that aren't processed due to cost limits as retryable in bank::load_and_execute_transactions()? Retrying in the same block can work, but only if some other tx's costs are later reverted for some reason. But most likely retrying in the same block would have the same result.

@buffalu (Contributor, Author) commented May 15, 2022

ideally they should probably be moved somewhere else (maybe a separate buffer?) and on a new slot inserted back into banking stage at the front, perhaps spread across packet batches to try to get better parallelism as opposed to one batch?

that's probably an issue for a separate PR. if 1.10 is coming out soon, i have a feeling this will be "good enough" for 1.9
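Purely as a sketch of that "separate buffer" idea (hypothetical names and structure, not code from this PR or the Solana codebase), it could look something like this:

```rust
use std::collections::VecDeque;

// Parking lot for cost-model-throttled transactions; `T` stands in for
// whatever the buffered unit is (a packet, or a whole batch).
struct ThrottledBuffer<T> {
    throttled: Vec<T>,
}

impl<T> ThrottledBuffer<T> {
    fn new() -> Self {
        Self { throttled: Vec::new() }
    }

    // Park a transaction that exceeded the block cost limits this slot.
    fn park(&mut self, tx: T) {
        self.throttled.push(tx);
    }

    // On a slot boundary, reinsert the parked transactions at the front of
    // the main buffer, chunked into small batches rather than one big one so
    // account-lock conflicts are spread across batches.
    fn reinsert_front(&mut self, main: &mut VecDeque<Vec<T>>, batch_size: usize) {
        let mut batches: Vec<Vec<T>> = Vec::new();
        while !self.throttled.is_empty() {
            let take = batch_size.min(self.throttled.len());
            batches.push(self.throttled.drain(..take).collect());
        }
        // Push in reverse so the relative order of batches is preserved.
        for batch in batches.into_iter().rev() {
            main.push_front(batch);
        }
    }
}
```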

Comment on lines 958 to 972
Self::process_buffered_packets(
    &my_pubkey,
    &socket,
    poh_recorder,
    cluster_info,
    &mut buffered_packet_batches,
    &forward_option,
    transaction_status_sender.clone(),
    &gossip_vote_sender,
    &banking_stage_stats,
    &recorder,
    data_budget,
    &qos_service,
    &mut slot_metrics_tracker,
)
Contributor:
This also has a poh lock grab inside, so I think it's worth wrapping this in an if !buffered_packet_batches.is_empty() check

Contributor Author:
ah yeah, that's probably something we should keep an eye on too. fixed this comment
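For reference, the suggested guard would look roughly like the fragment below, wrapped around the call quoted above (the surrounding Measure timing is omitted for brevity):

```rust
// Skip the call, and the poh_recorder lock it takes internally, when there
// is no buffered work to process.
if !buffered_packet_batches.is_empty() {
    Self::process_buffered_packets(
        &my_pubkey,
        &socket,
        poh_recorder,
        cluster_info,
        &mut buffered_packet_batches,
        &forward_option,
        transaction_status_sender.clone(),
        &gossip_vote_sender,
        &banking_stage_stats,
        &recorder,
        data_budget,
        &qos_service,
        &mut slot_metrics_tracker,
    );
}
```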



@carllin carllin requested a review from sakridge May 15, 2022 21:48
@buffalu buffalu changed the title Fix BankingStage packet starvation Fix BankingStage packet starvation (1.9) May 16, 2022
@sakridge (Member) commented:
@carllin can you take a look when you get a chance?
