Idea: Introduce additional parallelism within application of a single shard chunk #11319
In an effort to implement this, the very first thing that will need to happen is a refactor of the VM runner, as there are a couple of issues to resolve there.
I'm exploring moving the import code into the respective VM implementations in order to make the interfaces to the import code a little more flexible (and possibly less dependent on the VMLogic struct.) Part of #11319
I always found it weird that we had VM-specific code in what's a generic part of the near-vm-runner. Well -- it has actually started bothering me in real ways, so it gets moved. I didn't do very much to make sure it ends up pretty, and I don't think I want to for the VMs that aren't wasmtime or near_vm. The good thing is that the wasmer0/2 code can be largely ignored sans changes to `VMLogic` (which I'm considering addressing next...) Based on top of #11503 Part of #11319
This largely mirrors the code in the near_vm_runner module. I heard some people pondering what it would be like to use a higher-quality backend. Outside of LLVM, Cranelift is by far the next in line in terms of produced code quality. Since we already have wasmtime in place, we might as well wire it up completely for the full experience. Based on top of #11529 Part of #11319
This is part of an effort to drop the lifetime from the `VMLogic` structure in order to make it more straightforward to have it live for a longer duration alongside the instance, without forcing our hand into a strict "create, execute contract, drop" flow. Part of #11319
…11615) `VMLogic` containing a lifetime makes it difficult to have it live for any longer than a short `instantiate-link-run` sequence, and is one of the reasons why we're forced to have some unsafe code in our linking code. This refactor replaces some of the reference fields with `Arc`s, `Box`es, etc. This is not a complete refactor; I intend to do the remainder as a follow-up. Based on #11614 Part of #11319
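To illustrate the kind of change this refers to (hypothetical types and field names, not the actual `VMLogic` layout): a struct that borrows its dependencies drags a lifetime parameter around, while switching those fields to `Arc`/`Box` yields an owned value that can be stored or moved freely.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for the configuration and fee tables that the real
// VMLogic borrows; the names do not match the actual nearcore types.
struct Config { max_gas: u64 }
struct Fees { per_byte: u64 }

// Before: reference fields force a lifetime parameter onto the whole struct,
// so a value of this type cannot outlive the scope that owns Config and Fees.
#[allow(dead_code)]
struct LogicBorrowed<'a> {
    config: &'a Config,
    fees: &'a Fees,
}

// After: shared ownership removes the lifetime, so the value can be stored
// alongside the instance or sent to another thread.
struct LogicOwned {
    config: Arc<Config>,
    fees: Arc<Fees>,
}

fn main() {
    let logic = LogicOwned {
        config: Arc::new(Config { max_gas: 300_000_000_000_000 }),
        fees: Arc::new(Fees { per_byte: 1 }),
    };
    // LogicOwned is self-contained: it can be returned, boxed, or moved into
    // another thread without borrow-checker gymnastics.
    std::thread::spawn(move || {
        println!("max_gas = {}", logic.config.max_gas);
        println!("per_byte = {}", logic.fees.per_byte);
    })
    .join()
    .unwrap();
}
```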
In a world where we have pipelined compilation, instantiation and execution, `VMLogic` will have to move between threads, which requires that it becomes `Send`. That in turn has required some other types to become not only `Send` but also `Sync`, because they are currently stored behind a `&` reference (which allows multiple copies to exist; there are better places to explain why `Sync` becomes necessary here...) I'm not sure if all of these types will continue requiring `Sync`. In particular, `TrieUpdate`, which is stored in `RuntimeExt`, is now held by reference, but I eventually want to also make `VMLogic: 'static`, which would require finding some owning pointer structure that would work for `TrieUpdate`... Or I might be able to use scoped threads... in which case we're looking at `Sync` anyway... I think the changes here are largely straightforward, but overall things are shaping up to be pretty involved, eh? Part of near#11319
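For context, the `Sync` requirement follows directly from how Rust treats shared references: `&T` is `Send` only if `T: Sync`, because sending a `&T` to another thread means two threads may read `T` at the same time. A minimal stand-alone illustration (not nearcore code):

```rust
use std::cell::Cell;
use std::thread;

fn main() {
    // i32 is Sync, so &i32 is Send and a shared reference may cross threads.
    let shared = 42_i32;
    thread::scope(|s| {
        let r: &i32 = &shared;
        s.spawn(move || println!("read from another thread: {r}"));
    });

    // Cell<i32> is not Sync: two threads holding &Cell<i32> could race on set().
    let not_sync = Cell::new(0_i32);
    let r = &not_sync;
    // The following fails to compile:
    //   `&Cell<i32>` cannot be sent between threads safely
    // thread::scope(|s| { s.spawn(move || r.set(1)); });
    r.set(1); // fine on a single thread
    println!("single-threaded Cell value: {}", not_sync.get());
}
```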
The split boundary has been chosen to be what's necessary to compute a `VMOutcome`, which in turn now allows us to load a contract without constructing a `VMLogic` or contract memory quite yet. This might very well resolve issues I've been working through by attempting to remove lifetimes and such from `VMLogic`...? As with previous changes, this one makes quite some sense in isolation regardless of the ongoing projects. While I imagine there are more changed lines, they will mostly be due to the fact that in many places the previous code now needs to go through an additional field projection to get to the types it needs to operate on. @Ekleog-NEAR I think you'll appreciate these as I recall you've struggled with the VMLogic nonsense as well in the past. Part of #11319
This is a very exciting step forward! We finally got to the point where we can do some of the work of preparing a contract to run separately from the actual running of the contract. And all of this is encapsulated in a very neat API that gives out `Send + 'static` types for users to pass around between threads or whatever, so that they can pipeline these processes. It remains to be seen whether the requirement to have `&External` and `&VMContext` in both calls is a problem, and how much of a problem it is, but that might very well be solvable with some scoped threads or smart use of channels, or even just `Arc<Mutex>`, especially since both of these structures generally tend to be unique to a contract execution... Part of #11319
cc #11808
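To make the shape of such an API concrete, here is a hedged sketch; the names `PreparedContract`, `prepare_contract` and `run_prepared` are made up for illustration and do not match the actual near-vm-runner interface. Preparation produces a `Send + 'static` value on a worker thread, and only the final invocation stays on the critical path:

```rust
use std::thread;

// Illustrative stand-in for the result of loading, compiling and instantiating
// a contract; in reality this would wrap the compiled artifact and memory.
struct PreparedContract {
    account_id: String,
    code_len: usize,
}

// In this sketch preparation takes only the code (the real API also threads
// &External and &VMContext through both calls), so it can run early and on
// any thread; the result is Send + 'static by construction.
fn prepare_contract(account_id: &str, code: &[u8]) -> PreparedContract {
    PreparedContract { account_id: account_id.to_string(), code_len: code.len() }
}

// Only this step has to happen at the point where the receipt is applied.
fn run_prepared(prepared: PreparedContract, method: &str) -> u64 {
    println!("running {}::{} ({} bytes)", prepared.account_id, method, prepared.code_len);
    0 // pretend gas burnt
}

fn main() {
    let code = vec![0u8; 1024]; // placeholder for wasm bytes
    // Prepare on a worker thread while the main thread does other work...
    let handle = thread::spawn(move || prepare_contract("alice.near", &code));
    // ...then pick the prepared value up and run it on the critical path.
    let prepared = handle.join().unwrap();
    let gas = run_prepared(prepared, "do_work");
    println!("gas burnt: {gas}");
}
```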
…ers (near#11810) Best reviewed commit-by-commit. This ultimately lifts contract preparation up through several function-call layers in the transaction runtime, up to the layer where all the currently necessary data are available. This PR also establishes in broad strokes where the pipelining decisions will be made (`ApplyProcessingReceiptState`) and makes some minor changes to that type to have it contain local receipts (in addition to the previously contained delayed receipts, etc.) in a queue of sorts, which would allow the pipelining code to look ahead of the ongoing processing work and queue up preparation of the upcoming contracts there. This work so far is intended to have no functional changes. Part of near#11319
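A rough sketch of what that look-ahead could mean (hypothetical and heavily simplified; the real `ApplyProcessingReceiptState` holds considerably more): the state keeps pending receipts in a queue that the pipelining code can scan without consuming, so preparation can be queued for upcoming function calls.

```rust
use std::collections::VecDeque;

// Hypothetical, heavily simplified receipt and state types.
enum Receipt {
    FunctionCall { receiver: String, method: String },
    Transfer,
}

struct ProcessingState {
    // Local receipts are queued alongside delayed/incoming ones so that the
    // pipelining code can peek ahead of the receipt currently being applied.
    pending: VecDeque<Receipt>,
}

impl ProcessingState {
    // Scan upcoming receipts without consuming them and report every function
    // call whose contract could be prepared ahead of time.
    fn upcoming_function_calls(&self) -> impl Iterator<Item = (&str, &str)> + '_ {
        self.pending.iter().filter_map(|r| match r {
            Receipt::FunctionCall { receiver, method } => {
                Some((receiver.as_str(), method.as_str()))
            }
            Receipt::Transfer => None,
        })
    }
}

fn main() {
    let state = ProcessingState {
        pending: VecDeque::from([
            Receipt::Transfer,
            Receipt::FunctionCall { receiver: "alice.near".into(), method: "run".into() },
        ]),
    };
    for (receiver, method) in state.upcoming_function_calls() {
        println!("queueing preparation for {receiver}::{method}");
    }
}
```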
I have a question: instead of thinking about dynamic vs. static, could we just use an approach where, at the beginning of `apply`, we eagerly go over the receipts and load contracts into a queue, and then, when we need to execute a function call receipt, we pop prepared contracts from the front of the queue? We need a concurrent queue here, but otherwise it should be easier to implement.
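A minimal sketch of that eager-queue idea, assuming a plain channel stands in for the concurrent queue (names are illustrative, not nearcore code):

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for a contract that has been loaded and instantiated ahead of time.
struct Prepared {
    receipt_id: usize,
}

fn main() {
    let receipts: Vec<usize> = (0..5).collect();

    // The channel acts as the concurrent queue: a worker pushes prepared
    // contracts in receipt order, the applying thread pops from the front.
    let (tx, rx) = mpsc::channel::<Prepared>();
    let to_prepare = receipts.clone();
    let worker = thread::spawn(move || {
        for receipt_id in to_prepare {
            // Real code would load and compile the contract here.
            tx.send(Prepared { receipt_id }).unwrap();
        }
    });

    for expected in receipts {
        // Blocks only if preparation hasn't caught up with execution yet.
        let prepared = rx.recv().unwrap();
        assert_eq!(prepared.receipt_id, expected);
        println!("executing receipt {}", prepared.receipt_id);
    }
    worker.join().unwrap();
}
```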
I would argue that the described solution is a form of a static approach, one where effectively all of the contracts are prepared up front. Traditional queues are also a somewhat poor data structure for this problem. For instance, local receipts are constructed within `apply` and are processed right after they are constructed, before delayed or incoming receipts. But then, if those local receipts do not fit into the current chunk, they get pushed to the back of the delayed receipt queue. Thus the ordering in which the receipts get executed is quite dynamic in the grand scheme of things, and is somewhat ambiguous at the beginning of `apply`. For these reasons I don't think ordered data structures, or starting all of the preparation work eagerly up front, are a great fit here.
The proof of concept (wherein we only pipeline contract preparation and block any processing for accounts which have had a contract deployment in the chunk) is showing some mildly positive results. We're successfully moving about 80% of the time spent on preparation to other worker threads. I had initially thought that there's a significant amount of overhead in thread pool management introduced by rayon, but it turns out to be mostly due to sleeping when a thread in the pool goes idle. We are still spending ~20% of the original time on the critical path doing those same preparations. This is -- to an extent -- intended for now: it is how we keep the latency minimal in case the pipeline implementation can't get to processing the action before it is time to execute it. We could improve on this.

For an MVP, what remains is to also offload deployment action processing into the pipeline implementation here. This is mostly necessary in order to establish the necessary data dependency/ordering information between deployment and function call actions. As a bonus, this should also move the arguably most expensive (and unpredictable, if rare) action away from the critical path, leaving more time for other tasks. Ultimately, this is required to bring down the worst-case execution time (so that we can reduce the gas fees.) Otherwise any receipt that begins with a deployment action immediately disables any improvements seen here.
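The fallback described above (keeping latency minimal when the pipeline can't get to an action in time) could look roughly like this hypothetical sketch: the critical path first asks the pipeline for a finished preparation and, if none is ready yet, does the work inline instead of waiting.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

struct Prepared(&'static str);

// Stand-in for the expensive load/compile/instantiate step.
fn prepare(account: &'static str) -> Prepared {
    thread::sleep(Duration::from_millis(10));
    Prepared(account)
}

fn main() {
    let (tx, rx) = mpsc::channel::<Prepared>();
    // The pipeline prepares the contract on a worker thread.
    thread::spawn(move || {
        let _ = tx.send(prepare("alice.near"));
    });

    // Critical path: use the pipelined result if it is already there,
    // otherwise do the preparation inline so latency stays bounded instead
    // of waiting for the worker to get around to it.
    let prepared = match rx.try_recv() {
        Ok(p) => p,
        Err(_) => prepare("alice.near"),
    };
    println!("running contract for {}", prepared.0);
}
```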
Pipelines deeper than 1 action with a function call per shard give no observable benefits.
Part of #11319 and the final change in the integration with the transaction runtime, as all interesting receipt types are handled now. There are also receipt types, like yield timeouts, which only result in the generation of new (delayed) receipts, so they don't need to be handled by this mechanism.
I don't have a separate issue for parallelizing receipts themselves within a chunk, but one interesting thought came up in my discussion with @akhi3030 that I wanted to write down. There's a problem that we can parallelize receipts all we want, but if all of them target the same account, they would still need to be executed sequentially, so we wouldn't be able to reduce gas costs for that reason. Akhi mentioned that we could instead have chunk producers determine a number of "threads" of execution and then raise the gas limit to a multiple of the determined thread count. This way, even if all receipts were targeting the same account, we would put all of those receipts on one thread and this account would not get any more than 1 Pgas per chunk anyway. But we would have N-1 threads' worth of spare capacity for other accounts and their actions. So while the throughput for e.g. a single heavily-used account wouldn't improve, the chunk as a whole could get through more work.
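A toy sketch of that scheduling idea (names, the hashing scheme and the numbers are all hypothetical): receipts are bucketed by receiver account, so a hot account stays on a single thread bounded by the ordinary per-thread gas limit, while the remaining threads provide capacity for everyone else.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const THREADS: usize = 4;
const GAS_LIMIT_PER_THREAD: u64 = 1_000_000_000_000_000; // 1 Pgas, illustrative

struct Receipt {
    receiver: &'static str,
    gas: u64,
}

// All receipts for the same account land in the same bucket, so they still
// execute sequentially relative to each other; the other buckets hold spare
// capacity for unrelated accounts.
fn bucket_of(receiver: &str) -> usize {
    let mut h = DefaultHasher::new();
    receiver.hash(&mut h);
    (h.finish() as usize) % THREADS
}

fn main() {
    let receipts = vec![
        Receipt { receiver: "hot.near", gas: 400_000_000_000_000 },
        Receipt { receiver: "hot.near", gas: 400_000_000_000_000 },
        Receipt { receiver: "other.near", gas: 200_000_000_000_000 },
    ];

    let mut buckets: Vec<Vec<Receipt>> = (0..THREADS).map(|_| Vec::new()).collect();
    for r in receipts {
        buckets[bucket_of(r.receiver)].push(r);
    }

    for (i, bucket) in buckets.iter().enumerate() {
        let used: u64 = bucket.iter().map(|r| r.gas).sum();
        println!(
            "thread {i}: {} receipts, {used} of {GAS_LIMIT_PER_THREAD} gas used",
            bucket.len()
        );
    }
}
```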
When executing the transactions and receipts within a chunk in `apply`, the flow of execution is currently very sequential and looks roughly as follows:

[diagram of the current sequential flow omitted]

I've already talked in #11118 about the fact that we don't necessarily need to have `commit` be as serial as it is currently, and that we can start applying the next chunk before the `commit` for the previous chunk has fully concluded. If we look closer and notice that `vm_runner::run` is composed of multiple currently sequential steps, namely `vm_runner::load_and_instantiate` and `vm_runner::invoke_method`, and that only the `invoke_method` calls need to be executed serially, we could pipeline/parallelize the processing of a single chunk much more, to look more like this (each horizontal line is a separate "thread" of execution):

[diagram of the pipelined flow omitted]

Notice how the latency of prefetching – and much more importantly – loading and instantiating the contract gets hidden "behind" the time the runtime spends actually executing the contract code in the background.
This sort of pipelining can be dynamic (automatically adjusting how many contracts are preloaded in advance) or static (preloading N contracts in advance.) Either approach has benefits and downsides.
Ultimately I think if we work towards implementing something like this we should start with a static and configurable pipeline depth. There's a good chance that an optimal depth will turn out to be one or two calls deep, in which case throwing away that work isn't going to be particularly painful. Especially knowing that doing this work might place some of the necessary data in the caches so that the next time around these preparatory steps get executed faster and the pipeline is quicker to fill.
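One way to picture a static, configurable depth, as a purely hypothetical sketch: a bounded channel of capacity `PIPELINE_DEPTH`, so the worker never prepares more than N contracts ahead of the one currently executing.

```rust
use std::sync::mpsc;
use std::thread;

const PIPELINE_DEPTH: usize = 2; // the static, configurable knob

struct Prepared(usize);

fn main() {
    // A sync_channel bounds how far preparation may run ahead of execution.
    let (tx, rx) = mpsc::sync_channel::<Prepared>(PIPELINE_DEPTH);

    let producer = thread::spawn(move || {
        for receipt in 0..10 {
            // Blocks once PIPELINE_DEPTH prepared contracts are waiting,
            // capping memory use and wasted work if apply ends early.
            tx.send(Prepared(receipt)).unwrap();
        }
    });

    for _ in 0..10 {
        let prepared = rx.recv().unwrap();
        println!("executing prepared contract for receipt {}", prepared.0);
    }
    producer.join().unwrap();
}
```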
@bowenwang1996 had raised another idea: we could consider preparing all the contracts even before we head into the `apply` call (also in parallel.) This is also an option to keep in mind, although it remains to be seen how much more of an improvement it would be over a basic, localized and static pipeline. There's also an unsolved question of how many of these "all" we would need, since the number of contracts that get executed in a single `apply` is dynamic and dependent on gas and compute cost limits.

cc @Ekleog-NEAR this approach is another idea to deal with the cost of allocating memory maps for the contract data. Thoughts appreciated.