Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stateless_validation): Limit size of outgoing receipts to keep size of source_receipt_proofs under control #11492

Merged
merged 21 commits into from
Jun 7, 2024

Conversation

jancionear
Copy link
Contributor

@jancionear jancionear commented Jun 5, 2024

This is a basic fix for: #11295

The problem is that the size of source_receipt_proofs could be really large in some scenarios. If all 6 shards send a 4MB outgoing receipt to shard 0, then source_receipt_proofs for shard 0 will be of size 6 * 4MB = 24MB.
That's way too big, the network probably won't be able to distribute that in time. And as we add more shards to the network, the effect will get worse and worse.

This fix deals with the problem by allowing only one chosen shard to send large receipts to the other shard. All other shards are only allowed to send ~100kB of receipts. So instead of 6 shards sending 4MB, we end up with 5 shards sending 100kB and one shard sending 4MB, which adds up to 4.5MB, much more manageable.

The mapping of "who is able to send a lot of outgoing receipts to whom" changes with every block height:

image

In this example at block height 2:

  • shard 0 can send:
    • 100kB of receipts to shard 0
    • 100kB of receipts to shard 1
    • 4.5MB of receipts to shard 2
  • shard 1 can send:
    • 4.5MB of receipts to shard 0
    • 100kB of receipts to shard 1
    • 100kB of receipts to shard 2
  • shard 2 can send:
    • 100kB of receipts to shard 0
    • 4.5MB of receipts to shard 1
    • 100kB of receipts to shard 2

At every height a receiving shard will receive large receipts from only one shard, so the size of source_receipt_proofs stays under control.
The mapping changes, so every large receipt will eventually be sent out when the mapping allows to send it to the destination.

The new limits are:

  • outgoing_receipts_usual_size_limit: 102_400 (100kiB)
  • outgoing_receipts_big_size_limit: 4_718_592 (4.5MiB - a bit larger than the 4MiB receipt limit to make sure that 4MiB receipts can get through)

Flaws

This is a basic solution which has some flaws. It limits the witness size, but it affects throughput in certain scenarios.
It can serve as a starting point for further improvements, something that we can get in before the code freeze.

  • Flaw 1: big receipts can block small receipts
    Shard tries to send outgoing receipts in the order in which they were created. When a large receipt is at the front of the queue, the shard won't send anything until it can send out this large receipt. This means that the shard might not send out anything for a few blocks.
    This could be fixed by having a separate queue for large outgoing receipts.
  • Flaw 2: missing chunks
    When a chunk is missing, the next chunk receives receipts from two block heights. This means that it could receive two 4MB receipts. This could be fixed by disallowing sending large receipts to shard that just had missing chunks

TODO

The implementation is pretty much ready, I should probably write some tests, but for now I have to do other stuff.
Posting the PR as is for now.

@jancionear jancionear added the A-stateless-validation Area: stateless validation label Jun 5, 2024
@jancionear jancionear requested a review from a team as a code owner June 5, 2024 15:38
Add two new runtime parameters which describe the limits on size of outgoing receipts.

/// The standard size limit for outgoing receipts aimed at a single shard.
/// This limit is pretty small to keep the size of source_receipt_proofs under control.
pub outgoing_receipts_usual_size_limit: usize,

/// Large size limit for outgoing receipts to a shard, used when it's safe
/// to send a lot of receipts without making the state witness too large.
pub outgoing_receipts_big_size_limit: usize,
There's going to be a size limit and a gas limit,
so let's rename the function that calculates the gas limit
accordingly to avoid confusion.
…oofs under control

Main implementation, described in the PR
This test tried to send a large (~170kB) receipt and
expected the receipt to be processed immediately.
The receipt wasn't processed immediately because it's
aboce the usual outgoing size limit.
The easiest way to fix this is to increase the limit
for this particular test.
This test kept failing on the check:
```
assert!(congestion_info.receipt_bytes() > next_congestion_info.receipt_bytes());
```

It seems flaky, sometimes it passes and sometimes it doesn't.
I think it's related to the recent changes (tx and receipt size limits).

Let's change the way in which we test that the chain is doing progress
to measuring burned gas on the congested shard, this is more reliable
and makes the test pass.
@jancionear
Copy link
Contributor Author

test_in_memory_trie_node_consistency is failing in the CI, but I don't think it is because of this PR, it also fails on master: #11496

Copy link

codecov bot commented Jun 5, 2024

Codecov Report

Attention: Patch coverage is 86.60714% with 15 lines in your changes missing coverage. Please review.

Project coverage is 71.40%. Comparing base (337a81e) to head (c0ee219).
Report is 2 commits behind head on master.

Files Patch % Lines
core/primitives/src/errors.rs 0.00% 5 Missing ⚠️
runtime/runtime/src/verifier.rs 55.55% 4 Missing ⚠️
core/parameters/src/parameter_table.rs 33.33% 0 Missing and 2 partials ⚠️
core/parameters/src/view.rs 50.00% 2 Missing ⚠️
core/primitives/src/types.rs 85.71% 0 Missing and 1 partial ⚠️
runtime/runtime/src/congestion_control.rs 95.23% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11492      +/-   ##
==========================================
- Coverage   71.42%   71.40%   -0.02%     
==========================================
  Files         788      788              
  Lines      159889   159926      +37     
  Branches   159889   159926      +37     
==========================================
- Hits       114203   114202       -1     
- Misses      40719    40754      +35     
- Partials     4967     4970       +3     
Flag Coverage Δ
backward-compatibility 0.23% <0.00%> (-0.01%) ⬇️
db-migration 0.23% <0.00%> (-0.01%) ⬇️
genesis-check 1.36% <0.00%> (-0.01%) ⬇️
integration-tests 37.65% <53.57%> (-0.03%) ⬇️
linux 68.83% <42.85%> (+0.01%) ⬆️
linux-nightly 70.93% <86.60%> (-0.02%) ⬇️
macos 52.44% <21.64%> (-0.01%) ⬇️
pytests 1.57% <0.00%> (-0.01%) ⬇️
sanity-checks 1.37% <0.00%> (-0.01%) ⬇️
unittests 66.11% <82.82%> (-0.02%) ⬇️
upgradability 0.28% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wacban wacban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but please see my comment about aligning special and allowed shards.

Comment on lines 35 to 36
OutgoingReceiptsUsualSizeLimit,
OutgoingReceiptsBigSizeLimit,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can you add the comments here as well?

Comment on lines 49 to 54
/// Limits for outgoing receipts to a shard.
/// Receipts are sent out until the limit is hit, after that they're buffered.
pub(crate) struct OutgoingLimit {
pub gas: Gas,
pub size: usize,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

Comment on lines 106 to 119
let mut gas_limit =
other_congestion_control.outgoing_gas_limit(apply_state.shard_id);
if shard_id == apply_state.shard_id {
// No gas limits on receipts that stay on the same shard. Backpressure
// wouldn't help, the receipt takes the same memory if buffered or
// in the delayed receipts queue.
gas_limit = Gas::MAX;
}

let size_limit = if special_shard == Some(shard_id) {
apply_state.config.witness_config.outgoing_receipts_big_size_limit
} else {
apply_state.config.witness_config.outgoing_receipts_usual_size_limit
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we align the allowed shard and special shard to be the same?

  1. It reduces complexity slightly
  2. It would relieve me from thinking whether we can ever send a byteful and gasful receipt unless the stars align.

This would require us to make some change to the allowed shard - currently it's only set under congestion we would need to always set it. This is a deviation from the congestion control NEP but I think it's fine here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think I second the suggestion to make them the same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How exactly should that work?
AFAIU when all shards are fully congested one shard is allowed to process receipts to avoid deadlocks.
But I can't set the special shard to this allowed shard for every sender shard, because then multiple senders would be allowed to send large receipts to this one shard, and source_receipt_proofs for the allowed shards could explode.
I don't fully understand how it works in congestion control, so maybe I'm missing something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In CC, every shard pucks an allowed shard from the set of other shards. This shard is allowed to send to the congested shard.

AFAIU when all shards are fully congested one shard is allowed to process receipts to avoid deadlocks.

No, allowed shard kicks in a soon as a single shard is fully congested. It guarantees ALL shards always have at least one shard allowed sending to them. Processing receipts is never stopped or even limited in our congestion control design, only the forwarding is limited.

But I can't set the special shard to this allowed shard for every sender shard, because then multiple senders would be allowed to send large receipts to this one shard

No, there is only one allowed shard per receiving shard per block height.

Does that help and clear things up?

Copy link
Contributor Author

@jancionear jancionear Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I see that I misunderstood how that works. Looks like a very similar concept. I'll read into the code and try to merge them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll need to set the allowed shard always, not only when the shard is congested. Otherwise you should be pretty much able to use it as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small difference is that special_shard can be equal to own_shard, but allowed_shard is supposed to be different from own_shard. I guess I'll just allow allowed_shard to be the same as own_shard. It might lower the throughput in super-high congestion scenarios, but it's at most 1/num_shards of a difference.

runtime/runtime/src/congestion_control.rs Show resolved Hide resolved
@@ -462,3 +488,32 @@ fn overflow_storage_err() -> StorageError {
fn safe_add_gas_to_u128(a: u128, b: Gas) -> Result<u128, IntegerOverflowError> {
a.checked_add(b as u128).ok_or(IntegerOverflowError {})
}

/// Each shard receieves receipts from all other shards, but it could happen
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo receives

if *forward_limit > gas_to_forward {
let size_to_forward = receipt_size(&receipt)?;

if forward_limit.gas > gas_to_forward && forward_limit.size > size_to_forward {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sanity check - We should either make really sure that the special forward limit is larger than maximum potential receipt size or make it a soft limit in the special case. I think the former is already the case but just making sure in case the receipt size is a soft limit or some other subtlety.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outgoing_receipts_big_size_limit is 4.5MiB, which was chosen to be a bit above 4MiB to allow 4MiB receipts to pass. It's a hard limit, so it has to be a bit bigger.

But now I'm starting to have doubts, do we have a size limit for receipts? I thought that there'd be something analogous to max_transaction_size, but I don't see it 0_o. Maybe we'll have to add max_receipt_size limit as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe I was aiming for doubts :)
The issue is that just one receipt over the limit would block the queue forever. Making it a soft limit for the special shard should solve this problem but there may be better alternatives. Are large receipts a more general problem that we need to solve or is it only an issue in cross shard receipts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not entirely sure but I believe a single receipts can be much larger than 4MB today

To quote past Jakob, writing about potentially introducing a receipt size limit:

~5MB for bandwidth per receipts seems reasonable to me. A receipt can produce lots of network bandwidth by making a cross contract call. Today, it's possible to make many calls at once and attach an argument of up to 4MB to each. We could say the total limit per receipt is 5MB, then users can still make a single cross contract call with a 4MB argument, or two calls with 2MB each. But they couldn't do two calls with 4MB each. I don't think that's a problem for anyone.

source: #9378 (comment)

Based on that, a receipt could be around 100MB if it contains 25 function calls with 4 MB payload each. The gas cost for creating all those actions would only be below 250 Tgas, so plenty of room for the initial call and promise overhead costs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on that, a receipt could be around 100MB if it contains 25 function calls with 4 MB payload each. The gas cost for creating all those actions would only be below 250 Tgas, so plenty of room for the initial call and promise overhead costs.

Damn, that's not good. I guess we'll have to add a receipt size limit then. I hope it's safe to do it, rejecting receipts sounds dangerous 0_O

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my I sparked more doubts than I was hoping for. If the receipts can be so large then indeed even a soft limit would not be sufficient. @jancionear Do you want to share the news on zulip or at the sync meeting later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 510 to 512
if num_shards == 0 {
return None;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems off. Can you clarify under what circumstances this may happen? Also if we can avoid it this would be great but I can't tell without context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'll remove it, it's a remnant of the previous version where special shard couldn't be the same as the source shard 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, num_shards comes from ApplyState::congestion_info, and I'm not 100% sure that it'll always have nonzero length. It's safer to assume that it can be zero, and just not choose a special shard then.

let num_shards = apply_state.congestion_info.len();
let special_shard: Option<ShardId> =
    choose_special_shard(apply_state.shard_id, apply_state.block_height, num_shards);

Copy link
Contributor

@jakmeier jakmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, to be frank, this seems a bit rushed to get in before the code freeze tomorrow. I have several concerns that I would like to have more time to get a better understanding.

Concern 1: Adding a receipt size limit

We don't even have a receipt limit today. Here we implicitly add one and we add even more restrictions on top.

Even just adding a simple receipt size limit is something I thought we would need approval by the working group in a formal NEP, since it restricts what users can do on Near Protocol. It might seem like no problem to us but since it's not a purely technical decisions, I don't think it's ours to make. At least as far as I understand the protocol development process.

Concern 2: Impact on throughput

In the congestion control limits, we put a lot of effort in double-checking we don't throttle throughput unnecessarily, doing simulations with traffic representing important contracts and also extreme cases. Is there an investigation or analysis to look at for this limit?

Like, how many times in the last month was a shard sending more than 100kB of receipts to another shard? This seems like important information to decide if we can tolerate blocking an entire shard on a single large receipt until it is chosen to be special.

Another question that comes to mind: Will aurora be affects by this? (I think their submit calls can have decently large receipt sizes.)

Concern 3: Interaction with congestion control

For congestion control, we fine-tuned 10 parameters that are now in a sort of balance. Changing them individually showed major differences in how it performs.

With this new factor of how buffers are filled, we might need to redo our simulations on congestion control. Buffered receipts were a very rare occasion in our simulations that only happens when the receiving shard has a congestion problem. Thus we fine-tuned for a small outgoing buffer and we apply pretty aggressive backpressure on the signal of a full outgoing buffer. This includes the rejections of ALL new incoming transactions to that shard as soon as we have >= 500 Tgas of receipt in the buffer.

I suspect your change would breach the 500 Tgas buffer almost everytime we have a slightly large receipt. In that case, we would reject all incoming transactions and even clean out all the transaction in the pool, which currently results in an unfortunate timeout on the user side without proper handling. I think we would have to incrase the outgoing buffer limit substantially, and then fine-tune again all other parameters to find a new balance.

Sorry for all the critique, I know this is only trying to solve an important issue. And sorry if these issue have already been discussed elsewhere. I don't want to block you from progress, just letting you know of what I think could go wrong if we merge the PR as it is right now.

/// Same as `RuntimeNode::new`, but allows to modify the RuntimeConfig.
pub fn new_with_modified_config(
account_id: &AccountId,
modify_config: impl Fn(&mut RuntimeConfig),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optioanl/nit: FnOnce would make it a bit more general, which could be useful if the closure needs to capture objects like CongestionControlConfig and overwrite it on the field inside.

/// Receipts are sent out until the limit is hit, after that they're buffered.
pub(crate) struct OutgoingLimit {
pub gas: Gas,
pub size: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Could we make it u32 or u64 instead of usize? I don't really like worrying about potential architecture difference. Like, what if we want to put the entire runtime inside a Wasm runtime (it has been discussed before) which has usize as 32 bits instead of 64? Suddenly we might have subtle differences. In my opinion, it's better to avoid usize in structs like this.

Comment on lines 92 to 95
// One of the shards is designated as "special". We are allowed to send more receipts to this one "special" shard.
let num_shards = apply_state.congestion_info.len();
let special_shard: Option<ShardId> =
choose_special_shard(apply_state.shard_id, apply_state.block_height, num_shards);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Can we find a better name than "special"? :D Something that indicated that the chosen shard has higher bytes throughput.

Copy link
Contributor Author

@jancionear jancionear Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "high_flow"? Because there'll be a lot of receipts flowing there.
I don't like thinking too much about names because it's hard to find a really good name and it feels a bit like bikeshedding.
Do you have any suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, naming is hard and can quickly derail a discussion into bikeshedding. And yet, in my opinion, bad names are one of the highest contributors to unreadable and unmaintainable code. In this case, it's not quite as bad so I suggested it as an optional change to reconsider.

I like "high_flow" better than "special". I think it can also be a good name if we combine "allowed" from congestion control and this shard into the same. But your opinion is as valid as mine.

And again, renaming this is optional and not something to block on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can also be a good name if we combine "allowed" from congestion control and this shard into the same.

Yeah, merging with allowed should fix the problem :)

Comment on lines 106 to 119
let mut gas_limit =
other_congestion_control.outgoing_gas_limit(apply_state.shard_id);
if shard_id == apply_state.shard_id {
// No gas limits on receipts that stay on the same shard. Backpressure
// wouldn't help, the receipt takes the same memory if buffered or
// in the delayed receipts queue.
gas_limit = Gas::MAX;
}

let size_limit = if special_shard == Some(shard_id) {
apply_state.config.witness_config.outgoing_receipts_big_size_limit
} else {
apply_state.config.witness_config.outgoing_receipts_usual_size_limit
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think I second the suggestion to make them the same

@jancionear
Copy link
Contributor Author

jancionear commented Jun 6, 2024

Hm, to be frank, this seems a bit rushed to get in before the code freeze tomorrow. I have several concerns that I would like to have more time to get a better understanding.

Yeah it isn't as thought out as it should be :/

I wanted to quickly implement some sort of receipt limit before the code freeeze because it feels wrong to release SV to mainnet without limiting the receipts. Having a limit that may impact throughput is better than having a dead network if the witness gets too large. This approach seemed simple and maybe good enough.

But now with all the concerns I don't have as much confidence in this solution, I'll think about it more. Maybe we should postpone the freeze :/ or treat this limit as a bugfix.

Thanks for the feedback!

@jancionear
Copy link
Contributor Author

Well despite the flaws this solution has, the decision from the last meeting is to merge it before the code freeze, to at least have some sort of limit.
Improvements to deal with the flaws can be implemented in follow-up PRs, and added to the SV launch as needed.

I'll make an effort to make this PR mergable by tomorrow.

@jancionear
Copy link
Contributor Author

v2:

  • Fix nits/typos from review
  • Use allowed_shard to decide who can send more receipts, get rid of special_shard
    • Always set allowed_shard, even when not congested
    • allowed_shard can now be the same as own_shard
    • Disable validation checks for allowed_shards, @wacban said that there's a bug in the preexisting code, and he's working on a fix
  • Add a 4MiB receipt size limit. I scanned the last year of blockchain history and all receipts were smaller than 4MB (4_000_000 bytes), so it shouldn't break anything.
  • Don't add a new protocol version, keep all stateless validation changes in protocol 87

@wacban @jakmeier please take a look

I'll try to test this change somehow.

@jancionear
Copy link
Contributor Author

In the past year (31,536,000 blocks) there were 29,607 transactions or receipts larger than 100 kB, that's ~1 per 1000 blocks. I think the 100kB limit shouldn't cause any problems under normal mainnet traffic.

Copy link
Contributor

@wacban wacban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 258 to 262
// TODO(congestion_control) validate allowed shard
// #[test]
// fn test_bad_congestion_info_corrupt_allowed_shard() {
// test_bad_congestion_info_impl(BadCongestionInfoMode::CorruptAllowedShard);
// }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: instead of commenting it out it would be better to mark this test ignore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh that's nice, 👍

Comment on lines 4 to +5
max_transaction_size: {old: 4_194_304, new: 1_572_864}
max_receipt_size: {old: 999_999_999_999_999, new: 4_194_304}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to use the same value for transaction and receipt limits? It seems a bit more intuitive. Do you have any data on whether there were any receipts above that limit in the history?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be possible to lower the receipt limit further, but it would break some receipts, there are some 3.9MB ones. It would require some more thinking about, for now 4MiB seems like a safe and good enough choice, we can lower it in a follow-up PR.
https://near.zulipchat.com/#narrow/stream/295306-contract-runtime/topic/How.20large.20can.20internal.20receipts.20get.3F/near/443309601

Comment on lines 35 to 37
/// The standard size limit for outgoing receipts aimed at a single shard.
/// This limit is pretty small to keep the size of source_receipt_proofs under control.
OutgoingReceiptsUsualSizeLimit,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify that this is a limit on the total sum of receipts and not on every individual receipt? Same below.

pub fn outgoing_size_limit(
&self,
sender_shard: ShardId,
runtime_config: &RuntimeConfig,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes me wonder if we should move the new parameters to the congestion control config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, moved to `CongestionControlConfig"

.get(index as usize)
.expect("`checked_rem` should have ensured array access is in bound");
}
// checked_rem failed, hence other_shards.len() is 0
// checked_rem failed, hence all_shards.len() is 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The all_shards.len() == 0 condition is messed up but it is the case right now since we get it from the congestion info. I will refactor that in a follow up.

) -> ShardId {
if congestion_level < 1.0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self - this is fine and shouldn't affect congestion control since in the method to get outgoing gas limit we also check if congestion level is equal to 1.0.

@@ -922,6 +922,15 @@ pub mod chunk_extra {
Self::V3(v3) => v3.congestion_info.into(),
}
}

// Dirty workaround for broken allowed shard validation
pub fn with_zeroed_allowed_shard(&self) -> ChunkExtra {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a todo to get rid of it? I'll do it later.

Comment on lines 830 to 834
let node = RuntimeNode::new_with_modified_config(&relayer, |runtime_config| {
// Increase the outgoing receipts limit to allow the large receipt to be processed immediately.
// Without this change the receipt would be processed somewhere in the next few blocks.
runtime_config.witness_config.outgoing_receipts_usual_size_limit = 200_000;
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Again it feels like those should be part of the congestion control config and the RuntimeConfigStore::test_congestion_control_disabled() should maybe included that override.

Comment on lines 101 to 108
let mut gas_limit =
other_congestion_control.outgoing_gas_limit(apply_state.shard_id);
if shard_id == apply_state.shard_id {
// No gas limits on receipts that stay on the same shard. Backpressure
// wouldn't help, the receipt takes the same memory if buffered or
// in the delayed receipts queue.
gas_limit = Gas::MAX;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

let gas_limit = if if shard_id != apply_state.shard_id {
  other_congestion_control.outgoing_gas_limit(apply_state.shard_id)
} else {
  // No gas limits on receipts that stay on the same shard. Backpressure
  // wouldn't help, the receipt takes the same memory if buffered or
  // in the delayed receipts queue.
  Gas::MAX
};

Comment on lines +297 to +305
let receipt_size: u64 =
borsh::to_vec(receipt).unwrap().len().try_into().expect("Can't convert usize to u64");
if receipt_size > limit_config.max_receipt_size {
return Err(ReceiptValidationError::ReceiptSizeExceeded {
size: receipt_size,
limit: limit_config.max_receipt_size,
});
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me but I'll leave that to @jakmeier to judge if this is the right way to do it in runtime.

@jancionear jancionear added this pull request to the merge queue Jun 7, 2024
Merged via the queue into near:master with commit 43f5a90 Jun 7, 2024
29 checks passed
@jancionear jancionear deleted the outsized branch June 7, 2024 17:26
Copy link
Contributor

@jakmeier jakmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I wasn't able to review again yesterday. But I reviewed it now and couldn't find anything worth mentioning.

In the past year (31,536,000 blocks) there were 29,607 transactions or receipts larger than 100 kB, that's ~1 per 1000 blocks. I think the 100kB limit shouldn't cause any problems under normal mainnet traffic.

Thanks for collecting this data! Sounds like with the traffic of the last year, we would experience a hick-up around once every 20min on average.
A couple of extra seconds latency for a few dozen transactions seems tolerable. But probably we need something to avoid the transaction pool clearing effect. I will bring it up in the congestion control meeting to see what we can do about it.

github-merge-queue bot pushed a commit that referenced this pull request Jun 11, 2024
* Refactored the Block::shards_congestion_info to block_congestion_info
-> BlockCongestionInfo
* Changed the HashMap to BTreeMap because hash map adds randomness. 
* Added back the strict allowed shard check that was removed in #11492 
* Removed some temporary hacks from the same PR. 

This restores the deterministic validation of allowed shard. This is the
minimum that we need for the release. It is not in line with the
Congestion Control NEP that leaves the chunk producer's freedom to
choose the allowed shard as they see fit. I am happy to keep it as it is
in this PR and implement the chunk producer's "free will" in subsequent
releases if we find it beneficial.

Best reviewed commit by commit. The first commit is titled refactor but
it actually has an important change - HashMap -> BTreeMap. Sorry about
that.
telezhnaya pushed a commit that referenced this pull request Jun 13, 2024
* Refactored the Block::shards_congestion_info to block_congestion_info
-> BlockCongestionInfo
* Changed the HashMap to BTreeMap because hash map adds randomness. 
* Added back the strict allowed shard check that was removed in #11492 
* Removed some temporary hacks from the same PR. 

This restores the deterministic validation of allowed shard. This is the
minimum that we need for the release. It is not in line with the
Congestion Control NEP that leaves the chunk producer's freedom to
choose the allowed shard as they see fit. I am happy to keep it as it is
in this PR and implement the chunk producer's "free will" in subsequent
releases if we find it beneficial.

Best reviewed commit by commit. The first commit is titled refactor but
it actually has an important change - HashMap -> BTreeMap. Sorry about
that.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-stateless-validation Area: stateless validation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants