-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Asynchronous Backing Spec & Tracking Issue #3779
Comments
@rphmeier I have a question. If I'm correct, when a node produces a block, it also needs to import it after (which is done before propagating it). If we increase execution time to 4s (as it was discussed) in the case of 1 block delay contextual execution, there will be 8s (production + import) before the block is propagated, isn't it too long and bringing the risk of missing the relay block timings. |
This is a great idea. I don't fully understand the implementation details regarding parachain block-authoring, but currently all parachains are running at 12 second block times, and ideally we want to improve that to 6 seconds. |
|
@crystalin The idea is that the block is propagated immediately after authoring, which also means we can use more time for transferring larger PoVs. With 1 block delay we have 12s from the time we import the relay-parent R to the relay-block R2 where the parachain block is backed. We'd split this time something like:
Steps 2 and 3 can be sort of combined and will really be able to be combined once we have paritytech/polkadot-sdk#968 or something like it. Also note that after step 2 or the import of R1 (whichever comes 2nd), a collator can already start to build a block on top of R1 which extends the same chain. |
Thank you @rphmeier , that sounds good. |
@crystalin This depends on the type of parachain. If there's a plausibly unique block author, the ideal honest node would broadcast to validators at the same time as gossiping to other parachain nodes. If not, you need some credential of legitimacy or anti-spam - that'd be a |
You are right, I forgot the PoV was not included in that case. |
The underlying "contextual execution" is correct, but the explanations on the left are wrong. A parachain candidate needs to be available to be included. The approval phase is coming later and doesn't influence inclusion. |
I added a bunch more info to this issue, and will file some more concrete issues to build a roadmap to completion. |
@bkchr one interesting byproduct of this model is that we don't necessarily require the relay-parent to increase from candidate to candidate. We could accept 2 candidates in a row with the same relay-parent. But on the Cumulus side this would probably break things, because we use relay-parent number as a 'slot' for BABE or SASSAFRAS or Aura. If the relay parent doesn't need to increase, then existing parachains could be hijacked for a few blocks in a row by the same author. However I'd expect the relay chain to have a minimum relay-parent, and that already forces it to increase because the window of acceptable relay parents will move upwards over time. I see 3 options:
Not urgent, but worth thinking about |
I'm not sure how impactful those options are, but one thing to consider, is that parachains (at least Moonbeam) can rely on the relay block for its author selection consensus (as part of the random input, to ensure the author changes for each relay block even if no parachain block has been produced) |
Yes, that is what I'm saying. If the relay parent doesn't advance, the same author may be selected multiple times in a row on a parachain, so we may need parachains to adapt to that. |
DMP currently requires that if the queue is non-empty each parachain block must process at least one message. This is an issue with the current model of asynchronous backing because we want to base acceptance by the relay-chain off of the relay-chain state at time of backing, not the state at the context. Are there any ways we can change this condition so it works better with asynchronous backing? We could track the first unprocessed DMP message height for each queue, but that'd take a runtime migration. |
Can you fill me in a bit on where this requirement comes from right now? I checked code and guide, the restriction is mentioned a couple of times, but not motivated. I guess we just want to ensure that parachains process their messages eventually. If it is not actually an issue with relay chain consistency, but merely an issue with validity of the parachain execution, we could probably just check this as part of the PVF execution (which happens in the right context/has access to the context), instead of here. |
We want parachain messages to be reliable because adding financial reliability one layer up becomes way more complex than adding mere network reliability, like think state channels, locked balanced, etc. We'll eventually develop SPREEs which remain insulated from the parachain, meaning the parachain cannot alter the SPREE code or state. Yes, it's likely SPREEs could simplify implementing reliable messaging, but SPREEs seem far more complex overall. We could discuss the relative complexity before doing anything either way however. We already have delays in parachains enacting code, but yes we'll want to prevent parachains from changing their code to alter behavior on a specific message, which also becomes complex. |
@burdges here we talk about DMP though not the lateral messaging passing which you seem to refer. @eskimor DMP, where messages go down from the relay-chain to the parachain, was designed with the unbounded buffer. That makes the relay-chain be able to send message anytime. That makes pallets that send messages free from introducing fallbacks on what to do if the queue is full. To give you the example of thinking back then: A parachain opens up an HRMP channel with another parachain. When that happens the recipient must receive a DMP notification about this. What happens if the queue of the recipient is clogged? Does the message is just silently dropped or the whole HRMP open request is cancelled before executing because the last step, sending DMP, is not possible? However, I must note that since then our vision changed quite a bit and I am not entirely sure if this construction meets today's requirements. |
Our consensus was to change the DMP rule to require advancement if and only if the earliest message in the queue was present at the relay-parent. |
@rphmeier What's the current direction concerning the requirement for a parachain block to have a unique relay block (and so to not allow parachain blocks to "recover" skipped blocks) ? This is becoming important to us as not only our authoring consensus is based on that assumption, but we are also designing a more robust randomness source which would include relay block information as input. (note that I'm not against it, but it would require significant change in our design. Specially considering an attacker could even pretend to not have the latest relay block and build on top of a previous one to influence the seed for our random source) (cc @4meta5 ) |
@crystalin At the moment, we're enforcing that the relay-parent must increase from parachain block to parachain block. I believe that we could relax this in the future and quite possibly gain from it, but it would have to be preceded by changes to parachains' own consensus rules or it'd be vulnerable to attack. As you mention. |
@crystalin if you require that you are building always on a new relay chain block, the you need to add a check in your runtime. You can register a listener for |
#4786 introduced a buffer of relay parents available for building candidates on top of polkadot/runtime/parachains/src/shared.rs Lines 46 to 61 in 0f3ef99
This buffer gets reset every session change. Note that we also define "is async backing enabled?" by the ParachainHost version here (a)Now what we want during an upgrade is that this ancestry tracker never contains relay chain block for which async backing is disabled and never crosses session boundaries. Nodes learn about allowed relay parent from Given the above, the proposed upgrade sequence is as follows:
Nodes should of course be backwards-compatible and upgrade in advance. |
Tracking
Board: https://github.com/orgs/paritytech/projects/37
This is a long-standing idea which has been pending a writeup for some time.
I previously referred to this as "Contextual Execution", but I'd prefer to rename it to "Asynchronous Backing", which is the goal, rather than "contextual execution" which doesn't describe the purpose very well and is fact an implementation detail.
Terminology
base_parent(relay_parent) -> Parachain Block Hash
P
andbase_parent(R)
. This could be described as a functioncontext(R, [intermediate_blocks]) -> Constraints
.Motivation
Currently, the relay-chain runtime accepts only parachain blocks anchored to the most recent relay-chain block as a
relay_parent
. This means that the runtime only accepts parachain blocks which havecontext(most_recent_rp, [])
. This implicitly creates a requirement for a parachain block to be authored, submitted to relay-chain validators, executed, and backed all within a 6-second window. Practically, this leaves only around 0.5 seconds for the collator to author a block, which is not enough time to include a meaningful amount of transactions. Backing, at the moment, is heavily synchronized with the progress of the relay chain, and we'd like to make it more asynchronous, and therefore, less dependent.We'd like to expand the set of permitted
relay_parent
s accepted by the relay-chain to include the most recent K blocks of the relay chain so collators can begin building their blocks earlier with the intent to include them into the relay chain later on. Even doubling the 6-second window to a 12-second window might quadruple the time given to execution and therefore quadruple throughput on parachains.Some diagrams to show the difference follow. "R*" denotes a relay chain block, "P*(b)" denotes a parachain block that is backed in a particular relay-chain block. "P*(i)" denotes a parachain block that is included in a particular relay-chain block. We only need to consider 1 parachain for the purposes of these diagrams, because the same algorithm can be run in parallel for all parachains at once.
In the 'without contextual execution' example, we have only a 1-block window to begin building a parachain block to getting it backed in a relay-chain block. Also note that this example assumes that we begin working on a backed parent before it is included, something which we currently don't do. Still, we only have 6 seconds (1 relay-chain blocktime) to author, distribute, and back the parachain block.
In the contextual execution example, we start building parachain blocks earlier. P1 is in
context(R, [])
. P2 is incontext(R1, [P1])
. P3 is incontext(R2, [P2])
. The 'x's mark the points where the respective parachain block is actually built & signed, which is the earliest point that the parachain block can be built upon. Typically, we expect that the next relay-chain block and the parablock being accepted by the first validators will happen at around the same time. In the contextual execution example, we always have around 12 seconds to author, distribute, and back the parachain block. Doubling this period actually more than doubles the amount of execution and data that can be done in a single parachain block, because there are constant overheads associated with the parachain backing process.Contexts
The context of a parachain candidate dictates the constraints on the inputs and outputs of the candidate. The context dictates:
Note that a context is created from a relay chain state and a series of amendments The context of a relay-chain block is the context, drawn from the post-state of that relay-chain block. Each amendment updates the context by altering message queues, updating the parent head, and so on.
Contexts are independent from availability. They only dictate constraints on the inputs and outputs of a PVF execution, not whether it is prudent to vouch for a particular parachain execution or even whether that PVF execution is valid.
Expressed in pseudocode
Another way to think about contexts is as a stack. We have a base context, plus a stack of prospective changes to that context. From the perspective of the runtime, in a state S we only need to accept candidates that are valid under the context drawn from state S. So the runtime only deals with 'flat' contexts. The job of the node-side code is to anticipate the future state of the relay chain so it can build stacks of prospective contexts that it believes may equal the relay chain's own state at some point in the near future.
We say that a context underestimates another, that is
A < B
when every parablock b valid under A would also be valid under B. Stated differently, we denote the 'validity set' of a context C as the set of parablocks valid under a context.A < B
iffvalidity_set(A)
is a subset ofvalidity_set(B)
Underestimates are useful because they let us maximize the number of possible future relay-chain blocks a candidate can be included in.
Considerations
1. Prospective Backing
Asynchronous backing means that we seek to de-couple the extension of parachains from the extension of the relay-chain. This can mean a lot of different things, but one property that is beneficial to the network that we'd like to pursue is "whenever a relay-chain block is authored, there should be a parachain block ready to be backed on-chain". This means that relay-chain validators need to build prospective parachains, likely in-memory.
While this is aimed at efficiency gains, it can also be the source of wasted effort. If there is a parachain A<-B<-C<-D where only A is referenced by the relay-chain, if it turns out that B is unavailable then the work backing C and D was wasted. Likewise if an alternate block B' ends up actually being on-chain, as C and D are no longer possible to be accepted.
When a backing group is majority-malicious, there is little we can do to prevent wasted effort. As is already done, we can avoid accepting more candidates than would reasonably be produced by such a backing group. The best way for a parachain to avoid validators accidentally wasting effort (and thereby spending less time on legitimate validation) is for the parachain to avoid forks with a plausibly-unique authoring mechanism like BABE, Aura, SASSAFRAS, or PoW.
2. Message Queues
Parachains have both incoming and outgoing message queues. Our solution should allow prospective parachains to both add and remove upwards, downwards, and horizontal messages without surpassing queue size or weight limits.
Upwards, downwards, and horizontal messages are intentionally designed to be hash-chain ranging from a head to a tail. The head is the last unprocessed message and the tail is the first unprocessed message. We'd denote such a list as
[head, tail]
.Each parachain is the only producer of messages for its outgoing message queues. Therefore, as long as the context under-estimates the amount of space remaining in the message queue, it will never write too many messages into the queue to be accepted by the relay-chain at a later point.
Likewise, each parachain is the only consumer of messages for its incoming message queues. Therefore, as long as the context under-estimates the amount of messages available to be processed, it will never read more messages than exist in the queue.
Here's an example. Imagine the base context of parachain P at relay-parent R2 has space for 5 messages in its upward message queue. At R2, the latest parachain block included is P1. Off-chain, there's P2 building on R0 with 4 upward mesasges and P3 building on R2 with 1 upward message. By the time P3 is backed in R4, there might be more message space is available. But that's fine, because our underestimate ensured that we didn't use more space than there was available.
The downside of this approach is that it introduces additional latency between parachains. If 2 parachains are communicating but the chains are built asynchronously with an N block delay between being backed and actually included on the relay-chain, that means that there's a 2N*6s latency for messages to pass between them.
To alleviate the downside of higher latency for message-passing, we could relax the 'underestimate' condition for nodes to choose which chains to build on, and instead allow them to assume that by the time a particular candidate is included, the message queues of 1 or more other parachains will have a particular prefix. This increases the risk of wasted effort: if all parachains expect that all others' prospective chains will, in fact, be included, one parachain block being different than expected would be able to set off a cascade of un-backable prospective chains. This bears further research.
3. Limitations from Approval Checking
Although modifications to allow prospective chains to be built may in theory allow us to extend execution time and PoV size almost indefinitely in a tradeoff for cross-chain message latency, the approval checking protocol has to run more or less synchronously with the growth of the relay chain. Approval-checking will become the ultimate guide for maximum PoV size and execution time limits.
4. Cross-session blocks
At the moment, parablocks never cross session boundaries. That is, if a parablock is backed in session S but is not included before the end of S, it is immediately dropped. This has deeper implications for prospective chains - we don't propose to change this restriction at the moment, but it does mean that there will be more wasted effort at session boundaries or potentially a period of 2 or 3 relay chain blocks at the beginning of each session where there is no parachain block.
5. Prospective Availability
The initial goal of contextual execution is to make backing networking & execution no longer a bottleneck for parachain progression. However, when there is a candidate pending availability, that also impedes progress of the parachain. We might look to extend availability distribution to have validators fetch chunks for candidates that are only backed off-chain (prospectively), so all that remains after inclusion is to sign bitfields, which is cheap. Availability currently suffers from the same issues as backing, with respect to the short relay-chain block time - there are 6 seconds between every relay-chain block, which gives only a few seconds to recover chunks before signing a bitfield. If the window of opportunity for enough backers to recover their chunk of a candidate is missed even by a tiny margin, the parachain is delayed by 6 whole seconds. Making availability asynchronous would be ideal.
The text was updated successfully, but these errors were encountered: