Idea: Introduce additional parallelism within application of a single shard chunk #11319
In an effort to implement this, the very first thing that will need to happen is a refactor of the VM runner, as there are a couple of issues to resolve there.
I'm exploring moving the import code into the respective VM implementations in order to make the interfaces to the import code a little more flexible (and possibly less dependent on the VMLogic struct.) Part of #11319
I always found it weird that we had VM-specific code in what's a generic part of the near-vm-runner. Well -- it has actually started bothering me in real ways, so it gets moved. I didn't do very much to make sure it ends up pretty, and I don't think I want to for the VMs that aren't wasmtime or near_vm. The good thing is that the wasmer0/2 code can be largely ignored sans changes to `VMLogic` (which I'm considering addressing next...) Based on top of #11503 Part of #11319
This largely mirrors the code in the near_vm_runner module. I heard some people pondering what it would be like to use a higher-quality backend. Outside of LLVM, Cranelift is by far the next in line in terms of produced code quality. Since we already have wasmtime in place, we might as well wire it up completely for the full experience. Based on top of #11529 Part of #11319
This is part of an effort to drop the lifetime from the `VMLogic` structure in order to make it more straightforward to have it live for a longer duration alongside the instance, without forcing our hand into a strict "create, execute contract, drop" flow. Part of #11319
…11615) `VMLogic` containing a lifetime makes it difficult to have it live for any longer than a short `instantiate-link-run` sequence, and is one of the reasons why we're forced to have some unsafe code in our linking code. This refactor replaces some of the reference fields with `Arc`s, `Box`es, etc. This is not a complete refactor; I intend to do the remainder as a follow-up. Based on #11614 Part of #11319
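To illustrate the kind of change this refers to (hypothetical types and field names, not the actual `VMLogic` layout): a struct that borrows its dependencies drags a lifetime parameter around, while switching those fields to `Arc`/`Box` yields an owned value that can be stored or moved freely.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for the configuration and fee tables that the real
// VMLogic borrows; the names do not match the actual nearcore types.
struct Config { max_gas: u64 }
struct Fees { per_byte: u64 }

// Before: reference fields force a lifetime parameter onto the whole struct,
// so a value of this type cannot outlive the scope that owns Config and Fees.
#[allow(dead_code)]
struct LogicBorrowed<'a> {
    config: &'a Config,
    fees: &'a Fees,
}

// After: shared ownership removes the lifetime, so the value can be stored
// alongside the instance or sent to another thread.
struct LogicOwned {
    config: Arc<Config>,
    fees: Arc<Fees>,
}

fn main() {
    let logic = LogicOwned {
        config: Arc::new(Config { max_gas: 300_000_000_000_000 }),
        fees: Arc::new(Fees { per_byte: 1 }),
    };
    // LogicOwned is self-contained: it can be returned, boxed, or moved into
    // another thread without borrow-checker gymnastics.
    std::thread::spawn(move || {
        println!("max_gas = {}", logic.config.max_gas);
        println!("per_byte = {}", logic.fees.per_byte);
    })
    .join()
    .unwrap();
}
```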
In a world where we have pipelined compilation, instantiation and execution, `VMLogic` will have to move between threads, which requires that it becomes `Send`. That in turn has required some other types to become not only `Send` but also `Sync`, because they are currently stored behind a `&` reference (which allows multiple copies to exist; there are better places to explain why `Sync` becomes necessary here...) I'm not sure if all of these types will continue requiring `Sync`. In particular, `TrieUpdate`, which is stored in `RuntimeExt`, is now held by reference, but I eventually want to also make `VMLogic: 'static`, which would require finding some owning pointer structure that would work for `TrieUpdate`... Or I might be able to use scoped threads... in which case we're looking at `Sync` anyway... I think the changes here are largely straightforward, but overall things are shaping up to be pretty involved, eh? Part of near#11319
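For context, the `Sync` requirement follows directly from how Rust treats shared references: `&T` is `Send` only if `T: Sync`, because sending a `&T` to another thread means two threads may read `T` at the same time. A minimal stand-alone illustration (not nearcore code):

```rust
use std::cell::Cell;
use std::thread;

fn main() {
    // i32 is Sync, so &i32 is Send and a shared reference may cross threads.
    let shared = 42_i32;
    thread::scope(|s| {
        let r: &i32 = &shared;
        s.spawn(move || println!("read from another thread: {r}"));
    });

    // Cell<i32> is not Sync: two threads holding &Cell<i32> could race on set().
    let not_sync = Cell::new(0_i32);
    let r = &not_sync;
    // The following fails to compile:
    //   `&Cell<i32>` cannot be sent between threads safely
    // thread::scope(|s| { s.spawn(move || r.set(1)); });
    r.set(1); // fine on a single thread
    println!("single-threaded Cell value: {}", not_sync.get());
}
```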
The split boundary has been chosen to be what's necessary to compute a `VMOutcome`, which in turn now allows us to load a contract without constructing a `VMLogic` or contract memory quite yet. This might very well resolve issues I've been working through by attempting to remove lifetimes and such from `VMLogic`...? As with previous changes, this one makes quite some sense in isolation regardless of the ongoing projects. While I imagine there are more changed lines, they will mostly be due to the fact that in many places the previous code now needs to go through an additional field projection to get to the types it needs to operate on. @Ekleog-NEAR I think you'll appreciate these as I recall you've struggled with the VMLogic nonsense as well in the past. Part of #11319
This is a very exciting step forward! We finally got to the point where we can do some of the work of preparing a contract to run separately from the actual running of the contract. And all of this is encapsulated in a very neat API that gives out `Send + 'static` types for users to pass around between threads or whatever, so that they can pipeline these processes. It remains to be seen whether the requirement to have `&External` and `&VMContext` in both calls is a problem, and how much of a problem it is, but that might very well be solvable with some scoped threads or smart use of channels, or even just `Arc<Mutex>`, especially since both of these structures generally tend to be unique to a contract execution... Part of #11319
cc #11808
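To make the shape of such an API concrete, here is a hedged sketch; the names `PreparedContract`, `prepare_contract` and `run_prepared` are made up for illustration and do not match the actual near-vm-runner interface. Preparation produces a `Send + 'static` value on a worker thread, and only the final invocation stays on the critical path:

```rust
use std::thread;

// Illustrative stand-in for the result of loading, compiling and instantiating
// a contract; in reality this would wrap the compiled artifact and memory.
struct PreparedContract {
    account_id: String,
    code_len: usize,
}

// In this sketch preparation takes only the code (the real API also threads
// &External and &VMContext through both calls), so it can run early and on
// any thread; the result is Send + 'static by construction.
fn prepare_contract(account_id: &str, code: &[u8]) -> PreparedContract {
    PreparedContract { account_id: account_id.to_string(), code_len: code.len() }
}

// Only this step has to happen at the point where the receipt is applied.
fn run_prepared(prepared: PreparedContract, method: &str) -> u64 {
    println!("running {}::{} ({} bytes)", prepared.account_id, method, prepared.code_len);
    0 // pretend gas burnt
}

fn main() {
    let code = vec![0u8; 1024]; // placeholder for wasm bytes
    // Prepare on a worker thread while the main thread does other work...
    let handle = thread::spawn(move || prepare_contract("alice.near", &code));
    // ...then pick the prepared value up and run it on the critical path.
    let prepared = handle.join().unwrap();
    let gas = run_prepared(prepared, "do_work");
    println!("gas burnt: {gas}");
}
```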
…ers (near#11810) Best reviewed commit-by-commit. This ultimately lifts contract preparation up through several function-call layers in the transaction runtime, up to the layer where all the currently necessary data are available. This PR also establishes in broad strokes where the pipelining decisions will be made (`ApplyProcessingReceiptState`) and makes some minor changes to that type to have it contain local receipts (in addition to the previously contained delayed receipts, etc.) in a queue of sorts, which would allow the pipelining code to look ahead of the ongoing processing work and queue up preparation of the upcoming contracts there. This work so far is intended to have no functional changes. Part of near#11319
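A rough sketch of what that look-ahead could mean (hypothetical and heavily simplified; the real `ApplyProcessingReceiptState` holds considerably more): the state keeps pending receipts in a queue that the pipelining code can scan without consuming, so preparation can be queued for upcoming function calls.

```rust
use std::collections::VecDeque;

// Hypothetical, heavily simplified receipt and state types.
enum Receipt {
    FunctionCall { receiver: String, method: String },
    Transfer,
}

struct ProcessingState {
    // Local receipts are queued alongside delayed/incoming ones so that the
    // pipelining code can peek ahead of the receipt currently being applied.
    pending: VecDeque<Receipt>,
}

impl ProcessingState {
    // Scan upcoming receipts without consuming them and report every function
    // call whose contract could be prepared ahead of time.
    fn upcoming_function_calls(&self) -> impl Iterator<Item = (&str, &str)> + '_ {
        self.pending.iter().filter_map(|r| match r {
            Receipt::FunctionCall { receiver, method } => {
                Some((receiver.as_str(), method.as_str()))
            }
            Receipt::Transfer => None,
        })
    }
}

fn main() {
    let state = ProcessingState {
        pending: VecDeque::from([
            Receipt::Transfer,
            Receipt::FunctionCall { receiver: "alice.near".into(), method: "run".into() },
        ]),
    };
    for (receiver, method) in state.upcoming_function_calls() {
        println!("queueing preparation for {receiver}::{method}");
    }
}
```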
I have a question: instead of thinking about dynamic vs. static, could we just use an approach where, at the beginning of `apply`, we eagerly go over the receipts and load contracts into a queue, and then, when we need to execute a function call receipt, we pop prepared contracts from the front of the queue? We need a concurrent queue here, but otherwise it should be easier to implement.
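A minimal sketch of that eager-queue idea, assuming a plain channel stands in for the concurrent queue (names are illustrative, not nearcore code):

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for a contract that has been loaded and instantiated ahead of time.
struct Prepared {
    receipt_id: usize,
}

fn main() {
    let receipts: Vec<usize> = (0..5).collect();

    // The channel acts as the concurrent queue: a worker pushes prepared
    // contracts in receipt order, the applying thread pops from the front.
    let (tx, rx) = mpsc::channel::<Prepared>();
    let to_prepare = receipts.clone();
    let worker = thread::spawn(move || {
        for receipt_id in to_prepare {
            // Real code would load and compile the contract here.
            tx.send(Prepared { receipt_id }).unwrap();
        }
    });

    for expected in receipts {
        // Blocks only if preparation hasn't caught up with execution yet.
        let prepared = rx.recv().unwrap();
        assert_eq!(prepared.receipt_id, expected);
        println!("executing receipt {}", prepared.receipt_id);
    }
    worker.join().unwrap();
}
```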
I would argue that the described solution is a form of a static approach, one where effectively all of the contracts are prepared up front. Traditional queues are also a somewhat poor data structure for this problem. For instance, local receipts are constructed within `apply` and are processed right after they are constructed, before delayed or incoming receipts. But then, if those local receipts do not fit into the current chunk, they get pushed to the back of the delayed receipt queue. Thus the ordering in which the receipts get executed is quite dynamic in the grand scheme of things, and is somewhat ambiguous at the beginning of `apply`. For these reasons I don't think ordered data structures, or starting all of the preparation work eagerly up front, are a great fit here.
The proof of concept (wherein we only pipeline contract preparation and block any processing for accounts which have had a contract deployment in the chunk) is showing some mildly positive results. We're successfully moving about 80% of the time spent on preparation to other worker threads. I had initially thought that there's a significant amount of overhead in thread pool management introduced by rayon, but it turns out to be mostly due to sleeping when a thread in the pool goes idle. We are still spending ~20% of the original time on the critical path doing those same preparations. This is -- to an extent -- intended for now: it is how we keep the latency minimal in case the pipeline implementation can't get to processing the action before it is time to execute it. We could improve on this.

For an MVP, what remains is to also offload deployment action processing into the pipeline implementation here. This is mostly necessary in order to establish the necessary data dependency/ordering information between deployment and function call actions. As a bonus, this should also move the arguably most expensive (and unpredictable, if rare) action away from the critical path, leaving more time for other tasks. Ultimately, this is required to bring down the worst-case execution time (so that we can reduce the gas fees.) Otherwise any receipt that begins with a deployment action immediately disables any improvements seen here.
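The fallback described above (keeping latency minimal when the pipeline can't get to an action in time) could look roughly like this hypothetical sketch: the critical path first asks the pipeline for a finished preparation and, if none is ready yet, does the work inline instead of waiting.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

struct Prepared(&'static str);

// Stand-in for the expensive load/compile/instantiate step.
fn prepare(account: &'static str) -> Prepared {
    thread::sleep(Duration::from_millis(10));
    Prepared(account)
}

fn main() {
    let (tx, rx) = mpsc::channel::<Prepared>();
    // The pipeline prepares the contract on a worker thread.
    thread::spawn(move || {
        let _ = tx.send(prepare("alice.near"));
    });

    // Critical path: use the pipelined result if it is already there,
    // otherwise do the preparation inline so latency stays bounded instead
    // of waiting for the worker to get around to it.
    let prepared = match rx.try_recv() {
        Ok(p) => p,
        Err(_) => prepare("alice.near"),
    };
    println!("running contract for {}", prepared.0);
}
```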
Pipelines deeper than 1 action with a function call per shard give no observable benefits.
Part of #11319 and the final change in the integration with the transaction runtime, as all interesting receipt types are handled now. There are also receipt types, like yield timeouts, which only result in the generation of new (delayed) receipts, so they don't need to be handled by this mechanism.
I don't have a separate issue for parallelizing receipts themselves within a chunk, but one interesting thought came up in my discussion with @akhi3030 that I wanted to write down. There's a problem that we can parallelize receipts all we want, but if all of them target the same account, they would still need to be executed sequentially, so we wouldn't be able to reduce gas costs for that reason. Akhi mentioned that we could instead have chunk producers determine a number of "threads" of execution and then raise the gas limit to a multiple of the determined thread count. This way, even if all receipts were targeting the same account, we would put all of those receipts on one thread and this account would not get any more than 1 Pgas per chunk anyway. But we would have N-1 threads' worth of spare capacity for other accounts and their actions. So while the throughput for e.g. a single heavily-used account wouldn't improve, the chunk as a whole could get through more work.
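A toy sketch of that scheduling idea (names, the hashing scheme and the numbers are all hypothetical): receipts are bucketed by receiver account, so a hot account stays on a single thread bounded by the ordinary per-thread gas limit, while the remaining threads provide capacity for everyone else.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const THREADS: usize = 4;
const GAS_LIMIT_PER_THREAD: u64 = 1_000_000_000_000_000; // 1 Pgas, illustrative

struct Receipt {
    receiver: &'static str,
    gas: u64,
}

// All receipts for the same account land in the same bucket, so they still
// execute sequentially relative to each other; the other buckets hold spare
// capacity for unrelated accounts.
fn bucket_of(receiver: &str) -> usize {
    let mut h = DefaultHasher::new();
    receiver.hash(&mut h);
    (h.finish() as usize) % THREADS
}

fn main() {
    let receipts = vec![
        Receipt { receiver: "hot.near", gas: 400_000_000_000_000 },
        Receipt { receiver: "hot.near", gas: 400_000_000_000_000 },
        Receipt { receiver: "other.near", gas: 200_000_000_000_000 },
    ];

    let mut buckets: Vec<Vec<Receipt>> = (0..THREADS).map(|_| Vec::new()).collect();
    for r in receipts {
        buckets[bucket_of(r.receiver)].push(r);
    }

    for (i, bucket) in buckets.iter().enumerate() {
        let used: u64 = bucket.iter().map(|r| r.gas).sum();
        println!(
            "thread {i}: {} receipts, {used} of {GAS_LIMIT_PER_THREAD} gas used",
            bucket.len()
        );
    }
}
```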
When executing the transactions and receipts within a chunk in `apply`, the flow of execution is currently very sequential and looks roughly as follows:

[diagram of the current sequential flow omitted]

I've already talked in #11118 about the fact that we don't necessarily need to have `commit` be as serial as it is currently, and that we can start applying the next chunk before the `commit` for the previous chunk has fully concluded. If we look closer and notice that `vm_runner::run` is composed of multiple currently sequential steps, namely `vm_runner::load_and_instantiate` and `vm_runner::invoke_method`, and that only the `invoke_method` calls need to be executed serially, we could pipeline/parallelize the processing of a single chunk much more, to look more like this (each horizontal line is a separate "thread" of execution):

[diagram of the pipelined flow omitted]

Notice how the latency of prefetching – and much more importantly – loading and instantiating the contract gets hidden "behind" the time the runtime spends actually executing the contract code in the background.
This sort of pipelining can be dynamic (automatically adjusting how many contracts are preloaded in advance) or static (preloading N contracts in advance.) Either approach has benefits and downsides.
Ultimately I think if we work towards implementing something like this we should start with a static and configurable pipeline depth. There's a good chance that an optimal depth will turn out to be one or two calls deep, in which case throwing away that work isn't going to be particularly painful. Especially knowing that doing this work might place some of the necessary data in the caches so that the next time around these preparatory steps get executed faster and the pipeline is quicker to fill.
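One way to picture a static, configurable depth, as a purely hypothetical sketch: a bounded channel of capacity `PIPELINE_DEPTH`, so the worker never prepares more than N contracts ahead of the one currently executing.

```rust
use std::sync::mpsc;
use std::thread;

const PIPELINE_DEPTH: usize = 2; // the static, configurable knob

struct Prepared(usize);

fn main() {
    // A sync_channel bounds how far preparation may run ahead of execution.
    let (tx, rx) = mpsc::sync_channel::<Prepared>(PIPELINE_DEPTH);

    let producer = thread::spawn(move || {
        for receipt in 0..10 {
            // Blocks once PIPELINE_DEPTH prepared contracts are waiting,
            // capping memory use and wasted work if apply ends early.
            tx.send(Prepared(receipt)).unwrap();
        }
    });

    for _ in 0..10 {
        let prepared = rx.recv().unwrap();
        println!("executing prepared contract for receipt {}", prepared.0);
    }
    producer.join().unwrap();
}
```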
@bowenwang1996 had raised another idea: we could consider preparing all the contracts even before we head into the `apply` call (also in parallel.) This is also an option to keep in mind, although it remains to be seen how much more of an improvement it would be over a basic, localized and static pipeline. There's also an unsolved question of how many of these "all" we would need, since the number of contracts that get executed in a single `apply` is dynamic and dependent on gas and compute cost limits.

cc @Ekleog-NEAR this approach is another idea to deal with the cost of allocating memory maps for the contract data. Thoughts appreciated.