
Idea: Introduce additional parallelism within application of a single shard chunk #11319

Closed
nagisa opened this issue May 15, 2024 · 7 comments
Labels: A-contract-runtime, A-transaction-runtime, T-contract-runtime



nagisa commented May 15, 2024

When executing the transactions and receipts within a chunk in apply, the flow of execution is currently very sequential, roughly:

[prefetch A] → [vm_runner::run A] → [prefetch B] → [vm_runner::run B] ... → [commit]

I've already argued in #11118 that [commit] doesn't need to be as serial as it currently is: we can start applying the next chunk before the [commit] for the previous chunk has fully concluded. Looking closer, [vm_runner::run] is itself composed of multiple currently-sequential steps, namely [vm_runner::load_and_instantiate] and [vm_runner::invoke_method], and only the invoke_method calls actually need to be executed serially. We could therefore pipeline/parallelize the processing of a single chunk much more, to look like this (each horizontal line is a separate "thread" of execution):

[prefetch A] → [vm_runner::instantiate A] → [vm_runner::invoke_method A] → [vm_runner::invoke_method B] → [vm_runner::invoke_method C] ... → [commit]
                                 [prefetch B] [vm_runner::instantiate B]-^ 
                                                                [prefetch C] [vm_runner::instantiate C]-^

Notice how the latency of prefetching and, much more importantly, of loading and instantiating the contract gets hidden "behind" the time the runtime spends actually executing contract code.
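To make the shape of this concrete, here is a minimal sketch of a depth-1 pipeline in plain Rust; `prepare` and `invoke_method` are hypothetical stand-ins for the prefetch+instantiate and invocation steps, not actual nearcore APIs:

```rust
use std::thread;

struct Prepared(&'static str);

// Stand-in for [prefetch] + [vm_runner::instantiate].
fn prepare(id: &'static str) -> Prepared {
    Prepared(id)
}

// Stand-in for [vm_runner::invoke_method]; these calls stay strictly serial.
fn invoke_method(p: Prepared) {
    println!("invoking {}", p.0);
}

fn main() {
    let ids = ["A", "B", "C"];
    // The first preparation has nothing to hide its latency behind.
    let mut next = prepare(ids[0]);
    for &id in &ids[1..] {
        // Start preparing the next contract on a worker thread...
        let handle = thread::spawn(move || prepare(id));
        // ...so its latency hides behind the current invocation.
        invoke_method(next);
        next = handle.join().unwrap();
    }
    invoke_method(next);
}
```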

This sort of pipelining can be dynamic (automatically adjusting how many contracts are preloaded in advance) or static (preloading N contracts in advance.) Each has benefits and downsides:

  • Static:
    • Pro: simpler to implement
    • Pro: simpler to reason about
    • Con: with deeper pipelines we may "waste" some work when gas/compute limits prevent us from invoking methods for which we've already preloaded the contracts;
    • Con: therefore we need to carefully select an optimal pipeline depth...
  • Dynamic:
    • Pro: we don't need to fiddle with pipeline depth parameters
    • Pro: adapts to how long contract methods spend executing, always maintaining roughly the optimal pipeline depth;
    • Pro: less wasted work
    • Con: more difficult to reason about and implement.

Ultimately, I think if we work towards implementing something like this, we should start with a static and configurable pipeline depth. There's a good chance that the optimal depth will turn out to be one or two calls deep, in which case throwing away that work isn't going to be particularly painful, especially since the discarded preparation may still leave some of the necessary data in caches, so that the next time around these preparatory steps execute faster and the pipeline refills quicker.
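One way to picture the static variant (a sketch with made-up names, not the eventual implementation): the capacity of a bounded channel is the pipeline depth, and hitting the gas/compute limit is exactly the point where prepared-but-unused work gets discarded:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

struct Prepared(&'static str);

const PIPELINE_DEPTH: usize = 2; // "preload N contracts in advance"

fn prepare(id: &'static str) -> Prepared { Prepared(id) }
fn invoke_method(p: Prepared) { println!("invoking {}", p.0); }

fn main() {
    // The channel capacity *is* the static pipeline depth.
    let (tx, rx) = sync_channel::<Prepared>(PIPELINE_DEPTH);
    let preparer = thread::spawn(move || {
        for id in ["A", "B", "C", "D"] {
            // Blocks once PIPELINE_DEPTH contracts are prepared but not
            // yet invoked; stops early if the receiver hangs up.
            if tx.send(prepare(id)).is_err() {
                break;
            }
        }
    });
    let (mut gas_used, gas_limit) = (0u64, 2u64);
    for prepared in rx.iter() {
        if gas_used >= gas_limit {
            break; // up to PIPELINE_DEPTH prepared contracts are wasted here
        }
        invoke_method(prepared);
        gas_used += 1; // toy gas accounting
    }
    drop(rx); // hang up so the preparer stops doing work
    preparer.join().unwrap();
}
```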

@bowenwang1996 raised another idea: we could prepare all the contracts even before we head into the apply call (also in parallel.) This is an option worth considering, although it remains to be seen how much of an improvement it would be over a basic, localized, static pipeline. There's also the unsolved question of how many of these "all" we would need, since the number of contracts that get executed in a single apply is dynamic and depends on the gas and compute cost limits.

cc @Ekleog-NEAR this approach is another idea to deal with the cost of allocating memory maps for the contract data. Thoughts appreciated.

nagisa added the A-transaction-runtime, A-contract-runtime and T-contract-runtime labels May 15, 2024
nagisa self-assigned this May 16, 2024

nagisa commented May 20, 2024

In an effort to implement this, the very first thing that will need to happen is a refactor of the VM runner.

There are several issues to resolve:

  • Split up the near_vm_runner::run method into an instantiate and a run_method;
  • Instantiation must be split up so that the start method can be run at a later time. This may be particularly troublesome for VMs whose implementation we don't control. For instance, wasmtime provides a "pre-instantiate" sort of method, but it does less work than we'd ideally like.
  • In near_vm_runner we already have this split, but the instantiation is mixed up with the loading of the executable, which is probably something we should fix as well.
    • All the caches make things more difficult than they ideally should be...
    • We may need to keep the cache of loaded contracts for use-cases like ReadRPC, which will not be able to predict upcoming requests, but we could possibly drop that cache in nearcore.
  • Figure out what type represents an instance: do we store it within a VM (making the VM only able to run a single method?) or do we do some type-erasure shenanigans to obtain a type that makes sense to return from an object-safe trait? (See the sketch after this list.)
  • Implementing this requires changes to the code structure significant enough to need a protocol version change. How do we maintain backward compatibility when the old runtimes cannot allow for the same sort of logic flow...?
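On the last two points, one possible shape, type erasure behind an object-safe trait, could look roughly like this (all names here are hypothetical, not the actual near-vm-runner API):

```rust
// Formerly the second half of `run`; the only part that must stay serial.
trait PreparedInstance: Send {
    fn invoke_method(self: Box<Self>, method: &str) -> Result<(), String>;
}

// Object-safe: `instantiate` returns a type-erased instance instead of a
// VM-specific one, so callers can hold a `Box<dyn VM>`.
trait VM {
    // Formerly the first half of `run`: load + instantiate, without yet
    // running `start` or the requested method.
    fn instantiate(&self, code: &[u8]) -> Result<Box<dyn PreparedInstance>, String>;
}

struct WasmtimeVM;
struct WasmtimeInstance(); // would hold the instance, store, memory, ...

impl PreparedInstance for WasmtimeInstance {
    fn invoke_method(self: Box<Self>, method: &str) -> Result<(), String> {
        println!("running {method}");
        Ok(())
    }
}

impl VM for WasmtimeVM {
    fn instantiate(&self, _code: &[u8]) -> Result<Box<dyn PreparedInstance>, String> {
        Ok(Box::new(WasmtimeInstance()))
    }
}

fn main() {
    let vm: Box<dyn VM> = Box::new(WasmtimeVM);
    let instance = vm.instantiate(b"\0asm").unwrap();
    instance.invoke_method("main").unwrap();
}
```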

github-merge-queue bot pushed a commit that referenced this issue Jun 7, 2024
I'm exploring moving the import code to the respective VM implementations in order to make the interfaces to the import code a little more flexible (and possibly less dependent on the VMLogic struct.)

Part of #11319
github-merge-queue bot pushed a commit that referenced this issue Jun 10, 2024
I always found it weird that we had VM-specific code in what's a generic part of the near-vm-runner. Well, it has actually started bothering me in real ways, so it gets moved.

I didn't do very much to make sure it ends up pretty, and I don't think I want to for the VMs that aren't wasmtime or near_vm. The good thing is that the wasmer0/2 code can be largely ignored sans changes to `VMLogic` (which I'm considering addressing next...)

Based on top of #11503
Part of #11319
github-merge-queue bot pushed a commit that referenced this issue Jun 19, 2024
This largely mirrors the code in the near_vm_runner module. I heard some people pondering what it would be like to use a higher-quality backend. Outside of LLVM, Cranelift is by far the next in line in produced code quality. Since we already have wasmtime in place, we might as well wire it up completely for the full experience.

Based on top of #11529
Part of #11319
github-merge-queue bot pushed a commit that referenced this issue Jun 19, 2024
This is part of an effort to drop the lifetime from the `VMLogic` structure, in order to make it more straightforward to have it live for a longer duration alongside the instance, rather than forcing our hand into a "create, execute contract, drop" flow.

Part of #11319
github-merge-queue bot pushed a commit that referenced this issue Jun 20, 2024
…11615)

`VMLogic` containing a lifetime makes it difficult to have it live for any longer than a short `instantiate-link-run` sequence, and is one of the reasons why we're forced to have some unsafe code in our linking code.

This refactor replaces some of the reference fields with `Arc`s, `Box`es, etc. It is not a complete refactor; I intend to do the remainder as a follow-up.

Based on #11614
Part of #11319
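For illustration, the gist of such a lifetime-to-owning-pointer refactor (the field and type names below are made up, not the real `VMLogic` layout):

```rust
use std::sync::Arc;

struct Config;

// Before: the lifetime chains the logic to the stack frame that created
// it, forcing the short create-execute-drop flow.
#[allow(dead_code)]
struct VMLogicBefore<'a> {
    config: &'a Config,
}

// After: an owning pointer lets the logic outlive its creator and travel
// alongside the instance (e.g. to a worker thread).
struct VMLogicAfter {
    config: Arc<Config>,
}

fn main() {
    let config = Arc::new(Config);
    let logic = VMLogicAfter { config: Arc::clone(&config) };
    // No borrow of a local remains, so `logic` is free to move threads.
    std::thread::spawn(move || {
        let _ = &logic.config;
    })
    .join()
    .unwrap();
}
```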
jakmeier pushed a commit to jakmeier/nearcore that referenced this issue Jun 22, 2024
In a world where we have pipelined compilation, instantiation and execution, `VMLogic` will have to move between threads, which requires that it become `Send`. That in turn has required some other types to become not only `Send` but also `Sync`, because they are currently stored as `&` references, which allow multiple copies (the sketch after this commit message shows why `Sync` then becomes necessary.)

I'm not sure all of these types will continue to require `Sync`. In particular, `TrieUpdate`, which is stored in `RuntimeExt`, is currently held by reference, but I eventually want to make `VMLogic: 'static`, which would require finding some owning pointer structure that works for `TrieUpdate`... Or I might be able to use scoped threads, in which case we're looking at `Sync` anyway...

I think the changes here are largely straightforward, but overall things are shaping up to be pretty involved, eh?

Part of near#11319
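The `Sync` requirement above follows from the standard library's rule that `&T` is `Send` only if `T` is `Sync`: a reference can be freely copied, so sending one to another thread means two threads may read the value at once. A toy demonstration with scoped threads (the `TrieUpdate` here is a hypothetical stand-in, not the real type):

```rust
struct TrieUpdate {
    committed: u64,
}

fn main() {
    let update = TrieUpdate { committed: 7 };
    let shared = &update;
    std::thread::scope(|s| {
        // Moving `shared` (a `&TrieUpdate`) into the worker compiles only
        // because `TrieUpdate: Sync`; both threads now read it in parallel.
        s.spawn(move || println!("worker sees {}", shared.committed));
        println!("main sees {}", shared.committed);
    });
}
```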
github-merge-queue bot pushed a commit that referenced this issue Jun 25, 2024
The split boundary has been chosen to be what's necessary to compute a VMOutcome, which in turn now allows us to load a contract without constructing a VMLogic or contract memory quite yet.

This might very well resolve issues I've been working through in attempting to remove lifetimes and such from `VMLogic`...? Like the previous changes, this one makes quite some sense in isolation, regardless of the ongoing projects. While I imagine the diff has more lines, that is mostly because in many places the previous code now needs to go through an additional field projection to get to the types it needs to operate on.

@Ekleog-NEAR I think you'll appreciate these changes, as I recall you've struggled with the VMLogic nonsense in the past as well.

Part of #11319
github-merge-queue bot pushed a commit that referenced this issue Jul 1, 2024
This is a very exciting step forward! We finally got to the point where we can do the work of preparing a contract to run separately from actually running it. And all of this is encapsulated in a very neat API that hands out `Send + 'static` types for users to pass around between threads (or wherever) so that they can pipeline these processes.

It remains to be seen whether the requirement to have `&External` and `&VMContext` in both calls is a problem, and how much of a problem it is, but that might very well be solvable with some scoped threads, smart use of channels, or even just `Arc<Mutex>`, especially since both of these structures generally tend to be unique to a contract execution...

Part of #11319
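Since the whole point is that the preparation result can cross threads, a compile-time assertion is a cheap way to pin that property down; a sketch with an illustrative type, not the actual API:

```rust
#[allow(dead_code)]
struct PreparedContract {
    account: String, // plus the compiled artifact, instance, memory, ...
}

// Fails to compile if the type ever regains a lifetime or a !Send field.
fn assert_send_static<T: Send + 'static>() {}

fn main() {
    assert_send_static::<PreparedContract>();
    println!("PreparedContract can be handed to another thread");
}
```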

nagisa commented Jul 18, 2024

cc #11808

VanBarbascu pushed a commit to VanBarbascu/nearcore that referenced this issue Jul 23, 2024
…ers (near#11810)

Best reviewed commit-by-commit.

This ultimately lifts the contract preparation up through several function-call layers in the transaction runtime, to the layer where all the currently-necessary data is available.

This PR also establishes in broad strokes where the pipelining decisions will be made (`ApplyProcessingReceiptState`) and makes some minor changes to that type so that it contains local receipts (in addition to the previously-contained delayed receipts etc.) in a queue of sorts, which will allow the pipelining code to look ahead of the ongoing processing work and queue up preparation of the upcoming contracts.

This work so far is intended to have no functional changes.

Part of near#11319
@bowenwang1996

I have a question: instead of thinking about dynamic vs. static, could we just use an approach where, at the beginning of apply, we eagerly go over the receipts and load contracts into a queue; then, when we need to execute a function call receipt, we pop prepared contracts from the front of the queue? We'd need a concurrent queue here, but otherwise it should be easier to implement.
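(A rough sketch of that idea, with hypothetical types; in reality preparation would overlap with execution, so pops would need to block on entries that aren't ready yet:)

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

struct PreparedContract(String);

fn prepare(account: &str) -> PreparedContract {
    PreparedContract(account.to_string())
}

fn main() {
    let receipts = ["alice.near", "bob.near", "carol.near"];
    let queue = Arc::new(Mutex::new(VecDeque::new()));

    // At the beginning of apply: eagerly prepare every contract.
    let preparer = {
        let queue = Arc::clone(&queue);
        thread::spawn(move || {
            for account in receipts {
                queue.lock().unwrap().push_back(prepare(account));
            }
        })
    };
    preparer.join().unwrap(); // simplification: wait for all preparations

    // When a function call receipt comes up: pop from the front.
    for _ in receipts {
        let prepared = queue.lock().unwrap().pop_front().unwrap();
        println!("executing {}", prepared.0);
    }
}
```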


nagisa commented Jul 25, 2024

I would argue that the described solution is a form of the static approach, where the number N of preloaded contracts is implicitly bounded by available memory or by the maximum number of delayed+incoming+local receipts. If we don't carefully manage the number of prepared receipts, we'd be looking at >1 TiB of memory to store the prepared contracts (assuming ~20k delayed receipts and 64 MiB of contract data memory per prepared contract.)
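(For reference, the arithmetic behind that figure: 20,000 receipts × 64 MiB per prepared contract = 1,280,000 MiB ≈ 1.22 TiB.)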

Traditional queues are also a somewhat poor data structure for this problem. For instance, local receipts are constructed within apply and processed right after they are constructed, before delayed or incoming receipts. But if those local receipts do not fit into the current chunk, they get pushed to the back of the delayed receipt queue. Thus the order in which receipts get executed is quite dynamic in the grand scheme of things, and somewhat ambiguous at the beginning of apply.

For these reasons I don't think ordered data structures or starting preparation work in apply are the right way forward.


nagisa commented Aug 12, 2024

The proof of concept (wherein we only pipeline contract preparation and block any processing for accounts that have had a contract deployment in the chunk) is showing some mildly positive results: we successfully move about 80% of the time spent on preparation to other worker threads.

I had initially thought that rayon introduces a significant amount of thread pool management overhead, but it turns out to be mostly due to threads sleeping when they go idle in the pool.

We are still spending ~20% of the original time on the critical path doing those same preparations. This is, to an extent, currently intended: it is how we keep latency minimal in case the pipeline implementation can't get to processing an action before it is time to execute it. We could improve on this further.

For an MVP, what remains is to also offload deployment action processing into the pipeline implementation. This is mostly necessary in order to establish the data dependency/ordering information between deployment and function call actions. As a bonus, it should also move the arguably most expensive (and, if rare, unpredictable) action away from the critical path, leaving more time for other tasks. Ultimately, this is required to bring down the worst-case execution time (so that we can reduce the gas fees); otherwise, any receipt that begins with a deployment action immediately negates the improvements seen here.


nagisa commented Aug 12, 2024

Pipelines deeper than one function-call action per shard give no observable benefits.

github-merge-queue bot pushed a commit that referenced this issue Aug 13, 2024
…)

Part of #11319 and the final change in the integration with the transaction runtime, as all interesting receipt types are handled now. There are also receipt types, such as yield timeouts, which only result in the generation of new (delayed) receipts, so they don't need to be handled by this mechanism.

nagisa commented Aug 16, 2024

I don't have a separate issue for parallelizing the receipts themselves within a chunk, but one interesting thought came up in my discussion with @akhi3030 that I wanted to write down. The problem is that we can parallelize receipts all we want, but if all of them target the same account, they still need to be executed sequentially, so we wouldn't be able to reduce gas costs for that reason.

Akhi mentioned that we could instead have chunk producers determine a number of "threads" of execution, and then raise the gas limit to a multiple of the determined thread count. This way, even if all receipts were targeting the same account, we would put all of those receipts on one thread, and that account would not get any more than 1 Pgas per chunk anyway. But we would have N-1 threads' worth of spare capacity for other accounts and their actions.

So while the throughput for e.g. large.near would not necessarily improve, we could put humongous.near on that same shard without worrying much about it taking capacity away from large.near.
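A sketch of that partitioning idea, with made-up numbers and types (none of this is implemented): receipts hash by receiver account onto a fixed lane, all receipts for one account stay sequential on their lane, and each lane carries the ordinary per-chunk gas limit, so the total limit scales with the lane count:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const THREADS: usize = 4; // determined by the chunk producer
const GAS_LIMIT_PER_THREAD: u64 = 1_000_000_000_000_000; // 1 Pgas

struct Receipt {
    receiver: String,
}

// All receipts for one receiver map to the same lane, preserving their
// sequential execution order.
fn lane_for(receiver: &str) -> usize {
    let mut h = DefaultHasher::new();
    receiver.hash(&mut h);
    (h.finish() as usize) % THREADS
}

fn main() {
    let receipts = vec![
        Receipt { receiver: "large.near".into() },
        Receipt { receiver: "humongous.near".into() },
        Receipt { receiver: "large.near".into() },
    ];
    let mut lanes: Vec<Vec<Receipt>> = (0..THREADS).map(|_| Vec::new()).collect();
    for r in receipts {
        let lane = lane_for(&r.receiver);
        lanes[lane].push(r);
    }
    // large.near still caps out at GAS_LIMIT_PER_THREAD, but the other
    // lanes' capacity stays available to everyone else.
    for (i, lane) in lanes.iter().enumerate() {
        println!("lane {i}: {} receipts, limit {GAS_LIMIT_PER_THREAD}", lane.len());
    }
}
```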

akhi3030 closed this as completed Oct 1, 2024