This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

[DNM] approval-voting: approvals signature checks optimizations #7482

Open · wants to merge 15 commits into master from feature/approval_voting_lazy_signature_checking

Conversation

@alexggh (Contributor) commented Jul 10, 2023

Overview

This implements the lazy signature checks solution proposed here: approval-voting: approvals gossip and lazy signature checks optimizations (paritytech/polkadot-sdk#604).

Problem

While profiling Versi for paritytech/polkadot-sdk#732, we concluded that during peak times with 250 validators, approval-distribution and approval-voting need to process between 10k and 15k messages per second (see the profiling breakdown in that issue).

Proposed Solution

The amount of work for assignments is already being addressed by #6782, so this PR tries to reduce the amount of work we have to do when processing an approval. This is achieved by implementing a fast path for processing approval votes, consisting of the following steps (a rough code sketch follows the list):

  1. Disable gossiping of approval votes on the fast path.
  2. Make each validator send their approval votes to all of their peers.
  3. Do not check the signature of an approval vote if the peer_id that sent us the message is the originator of the message.
  4. If finality is lagging, use the enable_aggression thresholds to fall back on gossiping the approvals (the slow path).
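
Below is a minimal sketch of the fast-path decision described by these steps; all type names and the threshold value are assumptions for illustration, not the actual implementation.

```rust
// Sketch only: `PeerId`, `Action` and the threshold value are assumptions,
// not polkadot's real types.
#[derive(PartialEq, Eq, Clone, Copy)]
struct PeerId(u64);
type BlockNumber = u32;

/// If finality lags more than this many blocks, fall back to gossip
/// (the slow path). The value is illustrative.
const AGGRESSION_THRESHOLD: BlockNumber = 12;

enum Action {
    Import { check_signature: bool, gossip: bool },
}

fn process_approval(sender: PeerId, originator: PeerId, finality_lag: BlockNumber) -> Action {
    if sender == originator && finality_lag < AGGRESSION_THRESHOLD {
        // Fast path: the vote came directly from its author, so skip the
        // signature check and do not re-gossip it (steps 1-3).
        Action::Import { check_signature: false, gossip: false }
    } else {
        // Slow path (step 4): relayed votes, or aggression mode, are fully
        // verified and gossiped as before.
        Action::Import { check_signature: true, gossip: true }
    }
}
```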

FAQ

Why is it OK not to check approval vote signatures?

If we received the message directly from the originator, we can trust that the vote is legit; and since we aren't enforcing any accountability on approval votes, checking the signature before taking the vote into account is not really necessary and is not security critical. In the future, we might envision a system where we lazily check the vote if accountability is required.

Note! The signature will still be checked later in the cases where we are dealing with a dispute and have to import the votes on chain. So validators are incentivised to correctly sign their votes, because otherwise their votes won't be imported on chain and they will miss out on dispute rewards.

Wouldn't this make it risk-free for attackers to try to approve invalid blocks?

This is already possible in our current implementation, because we do not punish validators that try to approve invalid blocks in any way, and we allow them to change their votes during disputes. However, that is already part of our threat model; see paritytech/polkadot-sdk#635 for why this is alright (we do slash the backers).

What happens if the network is segmented and peers cannot reach each other directly?

Finality will lag, and once the configured threshold is passed we will enable aggression, which falls back on gossiping approvals.

Does this increase the risk of malicious nodes attacking honest nodes without the network observing?

No. We are still gossiping assignments, so they will spread out much faster and more randomly than the approvals; if honest nodes get DoS-ed before sending all of their approvals, they will be marked as no-shows and a new tranche will kick in.

Can approvals arrive before their corresponding assignments?

Yes, and to deal with this case we have to store the approval and process it later, when the assignment is received. The buffer for storing these approvals has to be reasonably bounded; given that we already store all the assignments and approvals we receive in approval-distribution, this shouldn't affect our memory profile too much. Each validator gets just one slot per candidate per unfinalized block.
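
A minimal sketch of such a bounded buffer, assuming hypothetical names and bound; the real keying would be (block hash, candidate index, validator index) as in approval-distribution's `MessageSubject`.

```rust
use std::collections::HashMap;

// All names and the bound are illustrative assumptions.
const MAX_PENDING_APPROVALS: usize = 10_000;

#[derive(Hash, PartialEq, Eq)]
struct MessageSubject(u64); // stand-in for (block, candidate, validator)
struct ApprovalVote;        // stand-in for the signed vote

#[derive(Default)]
struct PendingApprovals {
    pending: HashMap<MessageSubject, ApprovalVote>,
}

impl PendingApprovals {
    /// Buffer an approval that arrived before its assignment. Each subject
    /// gets exactly one slot, and the whole buffer is bounded so it cannot
    /// grow without limit.
    fn buffer(&mut self, subject: MessageSubject, vote: ApprovalVote) -> bool {
        if self.pending.len() >= MAX_PENDING_APPROVALS {
            return false; // over budget: drop instead of growing unboundedly
        }
        self.pending.entry(subject).or_insert(vote);
        true
    }

    /// When the matching assignment arrives, take the buffered approval
    /// so it can finally be processed.
    fn on_assignment(&mut self, subject: &MessageSubject) -> Option<ApprovalVote> {
        self.pending.remove(subject)
    }
}
```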

Alternative solution paths

There are two other paths that I explored for optimization:

Coalescing approvals paritytech/polkadot-sdk#701

This is an alternative optimization that we could pursue instead of this PR. It should give us major gains, but I would estimate them to be lower than the ones in this PR, and it would also increase the size of the votes we have to import on chain for disputes/slashing; see paritytech/polkadot-sdk#701.

Parallelization of processing approval-distribution/approval-voting

The measurements I did here: #7393 (comment) suggest that this should give us some gains, but they won't be enough for our goal of 1k validators, so we would still need either this PR or the coalescing of approvals described above. However, the good part is that parallelization can be stacked on top of optimizations that reduce the amount of work. So, once we reach consensus on which of the two options (lazy checking / approval coalescing) we want to implement, I think we should see some benefit from parallelising the work done in approval-distribution and approval-voting.

TODOs

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh alexggh changed the title [DNM] approval-voting: approvals gossip and lazy signature checks optimizations [DNM] approval-voting: approvals signature checks optimizations Jul 10, 2023
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
... to ease up quick testing in versi

@ordian (Member) left a comment

I have a couple of concerns regarding the proposed change in the issue #7458.

The first one is minor. The current state machine in approval-distribution was carefully designed by @rphmeier with spam/memory-growth (DoS) protection in mind, with no malicious/wasteful sequence of messages leading to a non-negative reputation change or unbounded growth. I'm not sure this is the case now, especially when accepting approvals before assignments.

Now the second one. If we're not checking signatures of approvals when they are sent directly, does that mean an attacker gets free tries avoiding slashing of approval-voters? I guess that's fine as long as the backers are getting slashed. But that also means we're losing incentives for approval-voting as outlined in #7229.

So I'd like to hear more from @burdges and @eskimor about how these features would interact.

Comment on lines +534 to +537
gum::trace!(
target: LOG_TARGET,
"Approval checking signature was invalid, so just ignore it"
);
Member:

this should never happen in practice though, unless the approval-voter was malicious, no?
if so, I would suggest elevating this to a warn
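
A sketch of the suggested change, using the same gum macro as the snippet above with the level raised:

```rust
gum::warn!(
    target: LOG_TARGET,
    "Approval signature was invalid, ignoring the vote"
);
```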

@@ -48,6 +48,20 @@ use std::{
time::Duration,
};

// TODO: Disable will be removed in the final version and will be replaced with a runtime configuration
// const ACTIVATION_BLOCK_NUMBER: u32 = 10;
Member:

take a look at `let disputes = match has_required_runtime(`; this function could be extracted to a common crate, e.g. node-util

@alexggh (Contributor, Author):

Yeah, will do, that part is unfinished for now.

@@ -210,34 +224,30 @@ struct Knowledge {
// When there is no entry, this means the message is unknown
// When there is an entry with `MessageKind::Assignment`, the assignment is known.
// When there is an entry with `MessageKind::Approval`, the assignment and approval are known.
known_messages: HashMap<MessageSubject, MessageKind>,
known_messages: HashMap<MessageSubject, Vec<MessageKind>>,
Member:

is this vec of at most 2 entries? Could you elaborate on why we need a vec here? Is it related to accepting approvals before assignments?

@alexggh (Contributor, Author):

Yes. In the current implementation it is not needed, because it works under the (correct) assumption that approvals won't arrive before their assignments, so the approval always overrides the assignment, and functions like `contains` assume that if we have an approval stored here, then we have definitely seen an assignment for it.

However, with the changes in this PR that is no longer the case: since we are sending the approvals directly to all nodes, an approval might arrive before an assignment, so we can no longer assume that a subsequent assignment is a duplicate once we have received an approval. Hence, we have to store both of them.
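
A simplified sketch of what storing both kinds might look like; the names mirror the diff above, but the logic is an assumption, not the PR's exact code:

```rust
use std::collections::HashMap;

#[derive(PartialEq, Clone, Copy)]
enum MessageKind { Assignment, Approval }

#[derive(Hash, PartialEq, Eq)]
struct MessageSubject(u64); // stand-in for (block, candidate, validator)

#[derive(Default)]
struct Knowledge {
    // At most two entries per subject: one per message kind.
    known_messages: HashMap<MessageSubject, Vec<MessageKind>>,
}

impl Knowledge {
    fn insert(&mut self, subject: MessageSubject, kind: MessageKind) -> bool {
        let kinds = self.known_messages.entry(subject).or_default();
        if kinds.contains(&kind) {
            false // a true duplicate: same subject, same kind
        } else {
            kinds.push(kind); // an approval may now precede its assignment
            true
        }
    }

    // Knowing an approval no longer implies the assignment was seen, so
    // each kind has to be checked explicitly.
    fn contains(&self, subject: &MessageSubject, kind: MessageKind) -> bool {
        self.known_messages.get(subject).map_or(false, |ks| ks.contains(&kind))
    }
}
```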

@sandreim (Contributor) commented:

I have a couple of concerns regarding the proposed change in the issue paritytech/polkadot-sdk#604.

The first one is minor. The current state machine in approval-distribution was carefully designed by @rphmeier with spam/memory-growth (DoS) protection in mind, with no malicious/wasteful sequence of messages leading to a non-negative reputation change or unbounded growth. I'm not sure this is the case now, especially when accepting approvals before assignments.

My thinking here is that we should have a limit on how many of these orphan (potentially malicious) approvals we accept and keep in memory, which could be up to the first level of the distribution aggression threshold (which is 12 blocks IIRC): 1/3 * n_validators * n_cores * 12. This could still be a number that is too big, and maybe we could limit it even further considering needed_approvals.

Now the second one. If we're not checking signatures of approvals when they are sent directly, does that mean an attacker gets free tries avoiding slashing of approval-voters? I guess that's fine as long as the backers are getting slashed. But that also means we're losing incentives for approval-voting as outlined in paritytech/polkadot-sdk#635.

That's a good point. Later, if the sig is found to be invalid, we would just have to drop the vote, as the authenticity of the message cannot be proved on chain.
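
For a rough sense of scale of that bound (the validator and core counts below are assumptions, not Versi's actual configuration):

```rust
// 1/3 of validators malicious, each voting on every core, buffered for up
// to 12 blocks (the first aggression threshold). All parameters hypothetical.
const N_VALIDATORS: usize = 250;
const N_CORES: usize = 50;
const AGGRESSION_BLOCKS: usize = 12;

// (250 / 3) * 50 * 12 = 49_800 buffered approvals in the worst case, which
// is why bounding further by needed_approvals looks attractive.
const MAX_ORPHAN_APPROVALS: usize = (N_VALIDATORS / 3) * N_CORES * AGGRESSION_BLOCKS;
```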

@sandreim (Contributor) commented Jul 18, 2023

Could we still batch-check all of these approvals before considering the candidate approved, and discard invalid signatures at that point? Doing all of them in one loop could be much faster, and we could do it in a separate task so as not to block the hot loop processing the messages.
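
A minimal sketch of that idea, with placeholder types and a plain thread standing in for a subsystem task; the real code would use the subsystem's own task-spawning and a real batch-verification routine:

```rust
use std::sync::mpsc;
use std::thread;

struct SignedApproval; // placeholder for (validator key, message, signature)

// Stand-in for the real (possibly batched) signature verification.
fn verify(_vote: &SignedApproval) -> bool {
    true
}

/// Spawn a verifier task: batches of approvals go in, per-vote validity
/// results come out, keeping the hot message-processing loop unblocked.
fn spawn_batch_verifier() -> (mpsc::Sender<Vec<SignedApproval>>, mpsc::Receiver<Vec<bool>>) {
    let (batch_tx, batch_rx) = mpsc::channel::<Vec<SignedApproval>>();
    let (result_tx, result_rx) = mpsc::channel();
    thread::spawn(move || {
        for batch in batch_rx {
            // Checking the whole batch in one loop amortizes the overhead;
            // invalid signatures can be discarded before the candidate is
            // considered approved.
            let results: Vec<bool> = batch.iter().map(verify).collect();
            let _ = result_tx.send(results);
        }
    });
    (batch_tx, result_rx)
}
```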

@alexggh (Contributor, Author) commented Jul 18, 2023

I have a couple of concerns regarding the proposed change in the issue paritytech/polkadot-sdk#604.

The first one is minor. The current state machine in approval-distribution was carefully designed by @rphmeier with spam/memory-growth (DoS) protection in mind, with no malicious/wasteful sequence of messages leading to a non-negative reputation change or unbounded growth. I'm not sure this is the case now, especially when accepting approvals before assignments.

My thinking here is that we should have a limit on how many of these orphan (potentially malicious) approvals we accept and keep in memory, which could be up to the first level of the distribution aggression threshold (which is 12 blocks IIRC): 1/3 * n_validators * n_cores * 12. This could still be a number that is too big, and maybe we could limit it even further considering needed_approvals.

Looking at Versi right now with 250 validators, it seems there are around 15-20 approvals per second that arrive early, so I tend to think we can find a reasonable number to limit the growth of this queue.

Now the second one. If we're not checking signatures of approvals when they are sent directly, does that mean an attacker gets free tries avoiding slashing of approval-voters? I guess that's fine as long as the backers are getting slashed. But that also means we're losing incentives for approval-voting as outlined in paritytech/polkadot-sdk#635.

That's a good point. Later, if the sig is found to be invalid, we would just have to drop the vote, as the authenticity of the message cannot be proved on chain.

Yeah, that was one of my concerns as well, so I think there are two worlds here:

  1. The current world, where we don't punish approval voters for approving invalid candidates; in that case checking the signature doesn't bring us any value, since we don't import it anywhere, so it is still free for approval checkers to try to approve invalid candidates.

  2. The world where we implement Slash approval voters on approving invalid blocks - dynamically polkadot-sdk#635 and try to punish approvers of invalid blocks; in that case we have to either incentivise approval voters to correctly sign their votes, or use the solution proposed by @sandreim:

Could we still batch-check all of these approvals before considering the candidate approved, and discard invalid signatures at that point? Doing all of them in one loop could be much faster, and we could do it in a separate task so as not to block the hot loop processing the messages.

@burdges (Contributor) commented Jul 18, 2023

  1. Disable gossiping of approval votes on the fast path.
  2. Make each validator send their approval votes to all of their peers.

Are we sure gossip can be disabled here? I'm unsure if finality could safely block on non-gossiped messages, aka messages which malicious actors could send to only some peers.

It's likely fine since nodes would simply see different no-shows, which likely causes only liveness problems, but this depends upon doing some clearer analysis. cc @AlistairStewart

  3. Do not check the signature of an approval vote if the peer_id that sent us the message is the originator of the message.

We expect skipping checks saves more than merging multiple approval votes, which itself saves more signature checks than core groups, although core groups produce savings elsewhere too, but do we have any idea how big these gaps are?

If say, core groups saves 3x, merging saves another 2x beyond core groups, and skipping saves another 2x beyond merging, then actually skipping saves relatively little beyond merging, when compared with our current usage.

  4. If finality is lagging, use the enable_aggression thresholds to fall back on gossiping the approvals (the slow path).

Again fast vs slow paths risk an adversary pushing us onto the slow path (and this one sounds particularly irksome).

@alexggh (Contributor, Author) commented Jul 19, 2023

We expect skipping checks saves more than merging multiple approval votes, which itself saves more signature checks than core groups, although core groups produce savings elsewhere too, but do we have any idea how big these gaps are?

If say, core groups saves 3x, merging saves another 2x beyond core groups, and skipping saves another 2x beyond merging, then actually skipping saves relatively little beyond merging, when compared with our current usage.

Skipping the signature check + no-gossip is proposed as an alternative to merging; the possible ballpark gains are:

  1. From the measurements here (approval-distribution: process assignments and votes in parallel polkadot-sdk#732), what we found is that approval-voting currently spends 35% of its time doing just signature checks. So that is time we gain.
  2. Network load will be drastically reduced because approval-distribution will send and receive far fewer messages. I don't have any CPU improvement numbers here, but just looking at the rate of messages approval-distribution has to process, we can see a 40% reduction in total messages.
(screenshot, 2023-07-19: approval-distribution message rate)

And that also seems to be reflected in some CPU reduction for the network-worker (disclaimer: these are preliminary tests).

(screenshot, 2023-07-18: network-worker CPU usage)

The merits of this approach (if it can safely be implemented) over coalescing of approvals:

  1. I think the gains would be higher, since it reduces the load on both approval-voting and the network.
  2. With coalescing of approvals, importing them on chain would require more block space and would make the implementation a bit more convoluted, see approval-voting: coalesce multiple approvals in a single signed message polkadot-sdk#701; albeit disabling gossiping is convoluted in itself, so maybe this is not an argument.

Are we sure gossip can be disabled here? I'm unsure if finality could safely block on non-gossiped messages, aka messages which malicious actors could send to only some peers.

It's likely fine since nodes would simply see different no-shows, which likely causes only liveness problems, but this depends upon doing some clearer analysis. cc @AlistairStewart

Could you expand a bit more on what you would want to see here? Is it something I can address myself?

@burdges (Contributor) commented Jul 19, 2023

Could you expand a bit more on what you would want to see here? Is it something I can address myself?

It's not performance related. It's a conversation between @chenda-w3f, @AlistairStewart, and me, probably.

@AlistairStewart commented:

Again fast vs slow paths risk an adversary pushing us onto the slow path (and this one sounds particularly irksome).

Yes, this should be easy. So easy, it might happen accidentally.

Are we sure gossip can be disabled here? I'm unsure if finality could safely block on non-gossiped messages, aka messages which malicious actors could send to only some peers.

Basically, if approvals from some people only get through to 99% of recipients, then without gossip we'll end up in the situation where, for each candidate, most validators see it as having enough approvals but 1% of validators will not. They will see a no-show or two, but it's unlikely that any of the 1% will assign themselves to replace them. Then we are left waiting for 100 or so tranches until some of the 1% assign themselves.

If this happens for different 1%s of validators for 40 different candidates in one block, then this will be over the 34% of validators required to halt finality. So we'd fall back, if that kicks in before 100 tranches.

@burdges (Contributor) commented Jul 19, 2023

In fact, the eclipsed 1% would eventually assign themselves. I'm unsure if we treat our own approval vote as overriding what others do, but all those empty tranches count as overcoming the no-shows they observe. If I recall, an empty tranche cannot override being below needed_approvals though. We could change this, but doing so risks eclipse attacks.

@alexggh (Contributor, Author) commented Jul 20, 2023

Basically, if approvals from some people only get through to 99% of recipients, then without gossip we'll end up in the situation where, for each candidate, most validators see it as having enough approvals but 1% of validators will not. They will see a no-show or two, but it's unlikely that any of the 1% will assign themselves to replace them. Then we are left waiting for 100 or so tranches until some of the 1% assign themselves.

If this happens for different 1%s of validators for 40 different candidates in one block, then this will be over the 34% of validators required to halt finality. So we'd fall back, if that kicks in before 100 tranches.

It took me a while to understand why we need the 1% to assign themselves: because the 99% think the candidate is approved, they won't trigger their tranches anymore, hence we have to wait until the 1% self-assign.

In that case, does it make sense for us to build some resilience in there, by making the nodes that are missing the approvals ask for them from other topology peers or random nodes?

In that case, I think we would not need the slow path at all; we could just let nodes that are potentially missing approvals ask for them on demand from peers. I could envision something like this (sketched in code after the list):

  1. At NO_SHOW/2, ask for the approval from a few random nodes.
  2. At 3 * NO_SHOW / 4, ask for the approval from more nodes.
  ...
  N. Ask for the approval from all topology peers.
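
A sketch of that escalation schedule; the tick thresholds and fan-out sizes are illustrative assumptions:

```rust
enum FetchTarget {
    RandomNodes(usize),
    AllTopologyPeers,
}

/// Decide whom to ask for a missing approval, escalating as the no-show
/// deadline approaches. `no_show` is the no-show duration in ticks.
fn fetch_escalation(ticks_waiting: u32, no_show: u32) -> Option<FetchTarget> {
    if ticks_waiting >= no_show {
        Some(FetchTarget::AllTopologyPeers) // final level: ask everyone
    } else if ticks_waiting >= 3 * no_show / 4 {
        Some(FetchTarget::RandomNodes(8)) // ask more nodes
    } else if ticks_waiting >= no_show / 2 {
        Some(FetchTarget::RandomNodes(2)) // ask a few random nodes
    } else {
        None // still early: the approval may simply be in flight
    }
}
```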

The benefit here is that we don't unnecessarily increase the number of messages on the fast path, and on the slow path we just give honest nodes the opportunity to obtain the needed approvals, without unnecessarily increasing the number of messages we process.

What do you think?

@burdges (Contributor) commented Jul 20, 2023

We should discuss if our own vote overrides what we see from others: Alice approved X in R, nobody else approved X but every other Y in R is approved by others. Can Alice vote for R in grandpa? I suspect currently no, but maybe the answer should be yes. This is really a corner case, but also wholly orthogonal.

I think fast vs slow path optimizations create compounding risks: an adversary could typically push nodes onto the slow path, so this becomes fast vs secure path. We ourselves never optimize the slow path enough. We'll never review or audit the slow path as carefully, so we'll miss more serious vulnerabilities there. Imagine the odds of a security-critical bug increase like paths^2 or something. We risk fast-path optimizations that're really pessimisations for the slow secure path.

I've therefore always held not having fast vs slow paths as an overarching design principle. I always expected we'd add some later, of course, but I wanted to delay that until we felt the system was basically sound. I do not mind the availability fast vs slow paths because we kinda understand everything which could happen there, and other optimizations really should happen anyways. I'm more worried about lazy signature checks though.

As concrete objections:

  • We likely cannot reconcile lazy signature checks with slashing approval checkers ala Slash approval voters on approving invalid blocks - dynamically polkadot-sdk#635, which adds considerable security. We'd consider giving this up anyways, but maybe only if we could analyze a wholly non-gossiped approval scheme.
  • We'd have architectural trouble merging signatures, which signals an architectural flaw. I'd expect laziness complicates this further, making merging signatures even harder architecturally, while also hiding the necessity that the secure path be speedy; aka laziness is a pessimisation for the secure path.

Anyways, I think this sits in a problematic place that's kinda the worst of both worlds, gossiped vs ungossiped approvals. I'm fine to think about each world separately, but right now I do not feel like we can realistically mix these two worlds.

... add option to malus nodes that withholds messages sent by
approval-distribution, to simulate approvals not being sent to some of
the peers, and test that finality is not lagging

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh force-pushed the feature/approval_voting_lazy_signature_checking branch from 2020bf8 to 8abed2a on July 20, 2023 12:59
@paritytech-cicd-pr commented:

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: test-linux-stable
Logs: https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/3229285

@alexggh (Contributor, Author) commented Jul 20, 2023

We should discuss if our own vote overrides what we see from others: Alice approved X in R, nobody else approved X but every other Y in R is approved by others. Can Alice vote for R in grandpa? I suspect currently no, but maybe the answer should be yes. This is really a corner case, but also wholly orthogonal.

@burdges
@burdges
This one was easy; the code is here: https://github.com/paritytech/polkadot/blob/master/node/core/approval-voting/src/lib.rs#L2094. Looking at it, it seems that when counting we treat our own vote like any other vote. So, if we are dealing with a no-show, other tranches would still have to kick in.

... still processing the rest of the message :)

@alexggh (Contributor, Author) commented Jul 20, 2023

I think fast vs slow path optimizations create compounding risks: an adversary could typically push nodes onto the slow path, so this becomes fast vs secure path. We ourselves never optimize the slow path enough. We'll never review or audit the slow path as carefully, so we'll miss more serious vulnerabilities there. Imagine the odds of a security-critical bug increase like paths^2 or something. We risk fast-path optimizations that're really pessimisations for the slow secure path.

I've therefore always held not having fast vs slow paths as an overarching design principle. I always expected we'd add some later, of course, but I wanted to delay that until we felt the system was basically sound. I do not mind the availability fast vs slow paths because we kinda understand everything which could happen there, and other optimizations really should happen anyways. I'm more worried about lazy signature checks though.

I do agree that having two paths increases the security risk.
One thing to note here is that even with gossip we already have two paths: the normal path, and the aggression path for the unlikely situation when approval-checking lags and we start gossiping even more.

We likely cannot reconcile lazy signature checks with slashing approval checkers ala paritytech/polkadot-sdk#635, which adds considerable security. We'd consider giving this up anyways, but maybe only if we could analyze a wholly non-gossiped approval scheme.

If we lazily check the signature before counting an approval and discard it if it is not correctly signed, that would force everyone to sign, so we can reconcile it with paritytech/polkadot-sdk#635. Am I missing something?

We'd have architectural trouble merging signatures, which signals an architectural flaw. I'd expect laziness complicates this further, making merging signatures even harder architecturally, while also hiding the necessity that the secure path be speedy; aka laziness is a pessimisation for the secure path.

Agreed, it would be more complicated. If I understand correctly, your suggestion is to go down the paritytech/polkadot-sdk#701 path rather than the one from this PR.

Anyways, I think this sits in a problematic place that's kinda the worst of both worlds, gossiped vs ungossiped approvals. I'm fine to think about each world separately, but right now I do not feel like we can realistically mix these two worlds.

What about the above idea (#7482 (comment)), where we wouldn't gossip the approval vote at all and would just allow nodes to fetch it from peers if they didn't get it directly? That would be clean enough, and it has the advantage of utilizing the network and CPU better, rather than gossiping approval votes blindly.

@alexggh (Contributor, Author) commented Jul 25, 2023

Putting this on hold in favor of paritytech/polkadot-sdk#701.

@eskimor (Member) commented Jul 28, 2023

If this happens for different 1%s of validators for 40 different candidates in one block, then this will be over the 34% of validators required to halt finality. So we'd fall back, if that kicks in before 100 tranches.

What kind of scenario is that? How would that even be possible? Connections that work for one candidate but not for another, and this on a significant number of nodes?

Also, in principle we do have a reliable transport mechanism: TCP. This means we know which peers we have a working open connection with and which not, and we should also be able to know which messages got through. We can either have application-level confirmations or just track TCP status: as long as the connection was never closed, we can assume messages got through. If we detect a re-connect, we can resend our approvals.
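
A minimal sketch of that bookkeeping, assuming hypothetical identifiers; real code would hook into the network bridge's connect/disconnect events:

```rust
use std::collections::{HashMap, HashSet};

type PeerId = u64;     // placeholder
type ApprovalId = u64; // placeholder

#[derive(Default)]
struct ResendTracker {
    sent: HashMap<PeerId, HashSet<ApprovalId>>,
}

impl ResendTracker {
    /// Record a send; while the TCP connection stays open, delivery is assumed.
    fn on_send(&mut self, peer: PeerId, approval: ApprovalId) {
        self.sent.entry(peer).or_default().insert(approval);
    }

    /// A reconnect means messages may have been lost in flight, so return
    /// everything previously sent to this peer for resending.
    fn on_reconnect(&self, peer: PeerId) -> Vec<ApprovalId> {
        self.sent
            .get(&peer)
            .map(|s| s.iter().copied().collect())
            .unwrap_or_default()
    }
}
```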

If we want to be more robust against direct connections not being possible, we could also think of techniques like tunneling via some random node. This would require some work as well, but the nice thing is that it could be part of the networking stack: we just tell it we want a connection, then it tries a direct connection; if that fails, it tries tunneling via a random peer, and so on. Not a short-term solution, but longer term, relying on direct connections should give us way better performance than always sending the same message multiple times.

@burdges (Contributor) commented Jul 29, 2023

Assignment VRFs must not be sent like this, so there is a politeness failure which you've never acknowledged: we must retain all votes related to all imported blocks, without VRFs limiting their accumulation. If we have many relay chain forks, then adversaries voting for all candidates adds up fast.

Connections that work for one candidate but not for another, and this on a significant number of nodes?

If we do not gossip votes, then adversaries could choose which nodes see their signatures, so only a small portion of nodes see them no-show. As nobody replaces them quickly, I'd expect those nodes cover the adversarial no-shows by empty tranches, assuming rob copied that "odd" feature of my prototype.

It's possible rob changed this empty no-show tranche rule. It's also possible someone argues successfully against the empty no-show tranche rule, like maybe @AlistairStewart or @chenda-w3f. If so, then we'd need the self override rule I mentioned above, but this adds another sort of fragility, and anyways risks the long delays which worried @AlistairStewart there. This proposal is not orthogonal to how no-shows work, hence fragile.

In this, I've only discussed the simplest, most direct attack, but we've no idea what's really possible with, say, BGP attacks, etc., or what happens by accident.

Also in principle we do have a reliable transport mechanism: TCP

I'd typically assume TCP makes everything worse, in that an ideal adversary can halt the connection exactly when they desire, and anyways adversaries know TCP far better than we do. In principle, an "unreliable" UDP transport could have reliability designed around our specific problems, but doing so requires mountains of work.

If we detect a re-connect, we can resend our approvals.

I'd expect this incurs considerable latency. I doubt TCP exposes what succeeded or failed, so we'd need our own ack-of-ack layer on top.

If we want to be more robust against direct connections not being possible, we could also think of techniques like tunneling via some random node.

Again, more special-case complexity which we'll never know if we've properly implemented or debugged. Worse, routing is rarely a simple task.

In this proposal, we'd need a new politeness design, which becomes problematic, special TCP handling which breaks other transports, and some special extra routing layer. We'd need some really heroic testing effort, since none of these trigger during normal execution.

Not a short-term solution, but longer term, relying on direct connections should give us way better performance than always sending the same message multiple times.

I do not rule this out entirely, and we'll happily discuss it more with @chenda-w3f etc., but afaik it requires re-inventing too much of how distributed systems work, both in practice and in theory. This proposal is hard!

It demands trashing paritytech/polkadot-sdk#635 near-term for these reasons.

We need topology fixes anyways, so assuming those work, I'd wildly guesstimate:

  1. Approval votes by hash inversion vs faster assignment VRFs polkadot-sdk#593 - Scary but not excessively hard; only one fallback mode for relay chain equivocations; performance equivalent to this, except under relay chain equivocations.
  2. Approve multiple candidates with a single signature #7554 - Sounds straightforward in some sense; no fallback logic; performance is maybe 70% of this.
  3. Rabin-Williams signatures - Major key-store changes, some of which we'd love anyways; no fallback logic; performance is maybe 80% of this, unless bandwidth costs more than expected. It demands some outside research contractors to explain the risks. Also, 100% Rust costs lots in contractors & auditors.

We cannot have 1+3, but 1+2 and 2+3 work, though doing two is maybe already excessive.
