Limited Replayability Tracking Issue #8197

Open
4 of 9 tasks
Ekleog-NEAR opened this issue Dec 9, 2022 · 22 comments
Labels
A-contract-runtime Area: contract compilation and execution, virtual machines, etc A-transaction-runtime Area: transaction runtime (transaction and receipts processing, state transition, etc) C-housekeeping Category: Refactoring, cleanups, code quality C-tracking-issue Category: a tracking issue P-medium Priority: medium T-contract-runtime Team: issues relevant to the contract runtime team

Comments

@Ekleog-NEAR
Collaborator

Ekleog-NEAR commented Dec 9, 2022

Today, any development on neard is required to maintain exactly the behavior seen in all past versions of the program. Among other things, this is necessary to enable replaying the blockchain all the way back to genesis, as implemented by the various neard view-state sub-commands.

Some of the prerequisite infrastructure to reconcile the operational needs and the desire to modify the protocol already exists in the form of protocol versioning. In practice, however, protocol versioning has been shown to incur a significant hit to development velocity. Every time a change is made, engineers need to carefully consider the interaction between the change and protocol-visible behaviour. In cases where the change is intentionally a breaking change to the protocol, care needs to be taken to also maintain the former behaviour for the previous version of the protocol. In extreme cases the only feasible way to maintain compatibility is to duplicate the affected subsystem in full. The first time this happens, logic to switch between the different versions of the functionality must be implemented as well. Such logic further impedes future evolution of the implementation.

We are not able to consistently verify that our efforts to maintain compatible behaviour are implemented correctly in practice. Verifying that protocol versioning is implemented correctly by replaying even portions of the history is an extremely time-consuming process (taking weeks to months per experiment) and requires significant effort to set up. This makes verification of code changes quite onerous, to the point where there has been only one attempt, by the Contract Runtime team back in 2021.

It would be ideal if we could make the compatibility requirements satisfiable by construction and remove the burden this functionality imposes on the development process.

See the proposal thread for further details.

Current status

  • Write down the proposal;
  • Gather sufficient stakeholder (node, protocol teams’) involvement and approval;
  • Nail down the implementation strategy and similar such details;
  • Merge currently-published crates that have no reason to be separate from near-vm-runner into near-vm-runner;
  • Remove all the dependencies of near-vm-runner
  • Consider whether to do the same to node-runtime (and rename it to near-transaction-runtime); if yes, s/near-vm-runner/near-transaction-runtime below;
  • Define and document a proper API to near-vm-runner;
  • Limit replayability of near-vm-runner, removing all of the dead code;
  • Consider future steps
@Ekleog-NEAR Ekleog-NEAR added C-housekeeping Category: Refactoring, cleanups, code quality P-medium Priority: medium C-tracking-issue Category: a tracking issue labels Dec 9, 2022
@Ekleog-NEAR
Collaborator Author

We discussed this in yesterday’s protocol meeting. The summary is as follows:

  • The issues of complete replayability: old security-flawed code staying around, development velocity issues due to having to introduce conditions for all the new changes, and the complexity of refactoring when needing to keep supporting old behavior
  • The few advantages of complete replayability: as of today, we don’t even know whether we can replay all of history. @marcelo-gonzalez is currently running an experiment that will tell us if we definitely can’t, but for computing-power reasons we won’t be able to know for sure that we can. The main thing complete replayability brings us is that, if someone needs to replay an old block, we have the state for that block at hand. With limited replayability, we would instead need some way to either find the old state somewhere, or somehow un-migrate it from our current archival nodes.
  • The different ways to do limited replayability:
    • Limited-to-some (the lowest-hanging fruit, and thus @mm-near’s favorite), where we say we officially support running a protocol version only on the last nearcore release that has it, but we only actually cut down on old releases manually, when we feel the need to
    • Limited-to-three (thanks @jakmeier for pointing out that we need three for the canary nodes), basically the same, but we always keep the last 3 protocol versions alive in a single neard instance
    • Limited-to-one (my and @nagisa’s favorite), where we make each neard binary support only a single protocol version, and then we make sure to orchestrate execution properly so that we can hand execution over from one neard binary to the next. It is the option that brings us the best development velocity, as we can do any refactoring we want without having to care even once about backwards compatibility (except for stored data); but the cost is more upfront development
  • The different ways to do limited-to-three (or -to-some) while still getting development velocity benefits for the runtime at least:
    • Splitting at the contract/transaction-runtime level, into a separate binary that gets talked to by RPC
    • Same, but by having nearcore depend on different crate versions of the contract/transaction-runtime (probably better)

We overall agreed that:

  • Limited-to-one replayability is currently more effort than it’s worth
  • We should definitely limit replayability at least -to-some, as the use cases for actually replaying old blocks seem very few and unlikely to ever happen in practice. We should still have a plan for what to do the day someone comes to us needing it, but we do not need to actually do it yet. We will be keeping all historical data anyway, and so can rebuild whatever we need to.
  • Since contract runtime is currently the most affected team, it should do anything it wants so long as it keeps at least limited-to-three replayability at all times. Other teams can then look at the results of the experiment there and consider the way forward.

@Ekleog-NEAR
Collaborator Author

We then discussed it again today in the contract runtime meeting. Our current plan is to:

  • Define a clear API for the transaction runtime that is pure and does not depend on storage or anything else other than the callbacks passed to it as parameters. This will probably have to be a trait. (And storage might want to be a trait too?)
  • Implement a transaction runtime switcher that decides which transaction runtime implementation (trait object) to use to execute the actions (see the sketch after this list)
  • Make a new transaction runtime version that only has the last version of the protocol, and defer all older protocol versions to the old "shove-all-in" transaction runtime version via the transaction runtime switcher
  • Make a new transaction runtime version that is able to load a transaction runtime from a .so file, and makes us able to compile any transaction runtime version to a .so file
  • Introduce neard calls that allow a validator to switch the transaction runtime for some protocol version to a .so file on disk, so that they could get security updates with zero downtime (all new actions using this protocol version would immediately go to the updated transaction runtime)
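
To make the first two bullets concrete, here is a minimal sketch of what a pure runtime trait and the switcher could look like. All names here (TransactionRuntime, ApplyState, ApplyResult, select_runtime, the version constant) are hypothetical illustrations, not nearcore's actual API:

```rust
// Hypothetical sketch only; none of these are nearcore's real types.
struct ApplyState {
    protocol_version: u32,
    // ...block height, gas price, and other context would live here
}

struct ApplyResult; // state changes, outgoing receipts, etc.

/// The pure API: no direct storage access, only what is passed in.
trait TransactionRuntime {
    fn apply(&self, state: &ApplyState) -> ApplyResult;
}

/// The old "shove-all-in" runtime: supports every past protocol version.
struct LegacyRuntime;
impl TransactionRuntime for LegacyRuntime {
    fn apply(&self, _state: &ApplyState) -> ApplyResult {
        ApplyResult
    }
}

/// The new runtime: implements only the latest protocol version.
struct CurrentRuntime;
impl TransactionRuntime for CurrentRuntime {
    fn apply(&self, _state: &ApplyState) -> ApplyResult {
        ApplyResult
    }
}

const LATEST_PROTOCOL_VERSION: u32 = 64; // illustrative value

/// The switcher: picks a runtime implementation (trait object) per version.
fn select_runtime(protocol_version: u32) -> Box<dyn TransactionRuntime> {
    if protocol_version == LATEST_PROTOCOL_VERSION {
        Box::new(CurrentRuntime)
    } else {
        Box::new(LegacyRuntime)
    }
}
```

Dispatching through a trait object would keep the calling code oblivious to which runtime version actually executes the actions.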

@Ekleog-NEAR Ekleog-NEAR self-assigned this Feb 24, 2023
@jakmeier
Contributor

A quick overview of the existing traits / interfaces between the 3 runtimes could be useful to inform a decision on where a cut could be made; a simplified sketch of these boundaries follows the lists below.

chain -> NightshadeRuntime

  • trait RuntimeAdapter for -> direction
    • notably includes fn store(&self) -> &Store; that gives full mutable DB access
  • trait EpochInfoProvider gives some access in the <- direction (I think that's the only access that way)

NightshadeRuntime <-> transaction runtime (struct Runtime {})

  • we don't use a trait for the -> direction, but all necessary state is passed in and out via complex structs.
    • &mut TrieUpdate is used for read-only DB access and a way to in-memory add changes to be committed to DB later by the chain
    • &ApplyState has all context needed to apply transactions and receipts (protocol version / current block height / gas price / ...)
    • trait CompiledContractCache gives write access to the DB in a round-about way
    • ApplyResult and the changes made inside &mut TrieUpdate make up all the changes that go <-
  • trait EpochInfoProvider is passed in and gives some access in <- direction

transaction runtime (Runtime) <-> contract runtime (VMLogic)

  • trait VM defines everything ->
  • all output is defined in VMResult
  • all <- accesses are defined in trait External in runtime/near-vm-logic/src/dependencies.rs
  • &dyn CompiledContractCache is also passed in; pretty sure this is the only place where we have write access to a DB (we could easily move this to a separate DB, nobody except the VM cares about the compiled code cache)
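
For orientation, here is a heavily simplified sketch of the boundaries listed above, reusing the trait names from this comment. Every signature is an illustrative stub rather than nearcore's real definition:

```rust
// Illustrative stubs only; the real traits have many more methods.
struct VMResult; // all output of a contract execution

/// <- direction: epoch/validator information lookups.
trait EpochInfoProvider {}

/// Round-about write access to the DB for compiled contracts.
trait CompiledContractCache {}

/// <- direction accesses available to the contract runtime
/// (the real trait lives in runtime/near-vm-logic/src/dependencies.rs).
trait External {}

/// -> direction: the transaction runtime invoking the contract runtime.
trait VM {
    fn run(
        &self,
        code: &[u8],
        ext: &mut dyn External,
        cache: &dyn CompiledContractCache,
    ) -> VMResult;
}
```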

@jakmeier
Contributor

Corrections on the previous comment are welcome; I'm not sure I got it all correct.

But based on what I described there, my opinion is that a clean API for the transaction runtime would probably be a cleaned-up version of what currently sits between struct Runtime {} and NightshadeRuntime. But cleaning it up could be a bit of work.

@Ekleog-NEAR
Collaborator Author

So I just checked the current crate structure: in order to publish crates on crates.io, we would need to publish near-vm-runner and at least all of its dependency tree:

  • near-vm-runner
  • near-cache
  • near-vm-logic
  • near-vm-errors
  • near-primitives
  • near-stable-hasher
  • near-test-contracts
  • near-crypto
  • near-account-id
  • near-o11y
  • near-primitives-core
  • near-rpc-error-macro
  • probably others, I stopped going down the crate tree at this point

This makes me reconsider the idea of starting with manually publishing crates, even though I still think @nagisa’s idea from today’s meeting (which I’ll let you describe) is the best long-term.

Maybe we’d be better off just copy-pasting near-vm-runner for today’s lowest-hanging-fruit goal?

And then in a second step I think we should reintroduce the automated deploy-from-CI behavior, and get an auto-deploy of literally all crates with a major release on every nearcore version, ideally numbering that release with the latest protocol version this set of crates supports for clarity. It’d also make it easier for other teams to integrate limited replayability in their own workflows.

As a side-note, I remember seeing people from the rust community angry at… was it solana? for doing exactly this kind of huge update-all-crates event regularly. So if we actually go with this solution we may want to sponsor even a moderate recurring amount to the Rust Foundation, so that we could reasonably answer that we’re paying for the resources we’re using should people complain about it. @akhi3030 maybe that’s a thing you could look into? (or maybe we already do, which’d be awesome)

@nagisa
Collaborator

nagisa commented Mar 10, 2023

So here's the idea we came up with in our meeting today, which we think is the best and easiest way forward in the short term.

  1. We publish a crate that contains our current runtime without any changes. Where we make the split in terms of what is a "published runtime" and what is not is still TBD, but this can be easily changed later, so I wouldn't overthink this too much. This crate will thus implement all the legacy versions of the protocols that we have support for up to this point in time.

  2. We nuke all the old code available in the published crate, getting the runtime to an idealized state, and add a dependency along the lines of:

    [dependencies]
    previous_runtime_crate = { package = "runtime", version = "=CURRENT - 1" }
  3. Finally, a branching construct along the following lines is added as an entrypoint:

    fn run_stuff(...) {
        if we_are(REQUESTED_VERSION) {
            run_code_with_this_runtime(code)
        } else {
            // Potentially adjusting for any API differences between current neard and what the old runtime expected.
            previous_runtime_crate::run_stuff(...)
        }
    }
  4. Every time we make a change to the runtime, we publish a new crate, bumping the previous_runtime_crate and the current crate. With this setup, we effectively achieve limited-to-one replayability from the developer's perspective.

The cost of this is that every time we publish a runtime, the size of the final neard binary will increase by the size of a single runtime binary (sans whatever LLVM manages to deduplicate.) Another cost is linearly increasing compilation time, which I think we can deal with until we get more time to work out something better.


> So I just checked the current crate structure, and in order to publish crates on crates.io we would need to publish crates for near-vm-runner and all its dependency tree at least.

I think ideally we make the runtime crate(s) self-contained as much as possible, reducing the number of crates the runtime is split into. There's no inherent reason why we need to maintain near-vm-* crates individually, and, I guess, getting rid of some of the other dependencies would mean crafting a hermetic API after all… This may change where we make the initial split.

One observation is that building a .so (cdylib) would produce self-contained binaries (we could publish these somewhere else?) and, with smart use of git tags, these could be made always buildable and reproducible. A concern here is global variable duplication between neard-the-binary and the .so (especially for things like tracing), but I think there are solutions for that too. These .sos could use largely the same scheme as the crates above, except that instead of [dependencies] it'd be trying to load an older version via libloading or something.
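
For the .so route, a minimal sketch of loading and calling an older runtime via libloading might look like the following. The symbol name (runtime_apply) and its C-ABI signature are assumptions; a real design would also need #[no_mangle] exports and a stable ABI on the runtime side:

```rust
use libloading::{Library, Symbol};

/// Load an older runtime from a shared object and invoke its entry point.
/// Both the path and the exported symbol are hypothetical.
fn run_with_old_runtime(so_path: &str, input: &[u8]) -> Result<i32, libloading::Error> {
    unsafe {
        // Loading a library runs its initializers, hence the unsafe block.
        let lib = Library::new(so_path)?;
        // Look up the exported entry point by symbol name.
        let apply: Symbol<unsafe extern "C" fn(*const u8, usize) -> i32> =
            lib.get(b"runtime_apply")?;
        Ok(apply(input.as_ptr(), input.len()))
    }
}
```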

@nagisa
Collaborator

nagisa commented Mar 10, 2023

> As a side-note, I remember seeing people from the rust community angry at… was it solana? for doing exactly this kind of huge update-all-crates events regularly.

There are some projects like that – this was handled pretty well by deprioritizing these crates in places like e.g. the docs.rs queue, where interactive uses of the service disproportionately suffer from added latency spikes. Otherwise publishing crates in batches is not a big deal AFAIU.

@Ekleog-NEAR
Collaborator Author

Right, so we probably do need to plan on spending quite some time refactoring, as the API of near-vm-runner will probably be much larger than the fn run we were initially thinking of (e.g. near-o11y, near-primitives, near-stable-hasher and/or near-crypto are probably going to be required by things outside the runtime itself, so we’ll need to decide how to deal with them).

> Otherwise publishing crates in batches is not a big deal AFAIU.

Right, I don’t think it’s a big deal either, but PR is not a rational thing and if Pagoda can avoid a PR hit (or even turn it into positive PR) for like $10/mo it’s probably worth it :)

@bowenwang1996
Collaborator

@Ekleog-NEAR @nagisa is the eventual goal to extend this idea to nearcore as a crate? If not, what is the plan for limited replayability outside of runtime?

@nagisa
Collaborator

nagisa commented Jun 12, 2023

We had hoped this would be a nearcore-wide thing; however, when discussing the idea with other teams, the overall conclusion was that no other team feels the impact of replayability requirements as much as contract runtime does. For them the resources-to-benefit ratio is significantly worse than for the contract runtime, especially with how busy they are right now with higher-priority work anyhow.

Thus we're pursuing a design that the contract runtime team can implement on its own. Unfortunately, this also means that the solution is unlikely to be easily expandable past the runtime boundaries, or at least the end result wouldn't make much sense, holistically speaking.

(This is my opinion, Leo’s may be different)

@bowenwang1996
Collaborator

To me one of the most important goals of limited replayability is to improve the maintainability of the codebase overall. Restricting our attention to the contract runtime will fall short of that goal. That said, I think it is okay to move gradually and potentially start with contract runtime first, but the design should be extendable to other parts of the codebase. cc @akhi3030 @jakmeier

@akhi3030
Collaborator

@bowenwang1996: I would also like to see this project improve the maintainability of the entire codebase. I think it is good to start with just runtime as the first milestone and then see where we are.

@jakmeier
Contributor

As I remember the discussion, the major pushback came because the proposed solutions that work holistically require multiple neard binary versions on validator nodes. (During a protocol upgrade, they hot-switch from one to the other. Very tricky to do with RocksDB, network state, and so on, considering the time to upgrade is only one block time.)
I think there could be other solutions that don't require multiple binaries, but as I remember, we didn't discuss that much.

But regardless of the strategy, it feels like the RocksDB storage layer and the network stack have their own concepts of versioning that are not easily solvable in a general way. For those components we have to think carefully about backwards compatibility on each upgrade anyway, and keeping multiple versions of the code around might only make things worse.

For the contract runtime OTOH, we can just include multiple versions in the same binary and pick between them dynamically. This idea can be extended to the transaction runtime, if we can define appropriate interfaces. Those two components are also where we have the most protocol-version-dependent if-branches at the moment.

Note that this doesn't drop support for old protocol versions. It only makes runtime code cleaner and easier to maintain.
If we want to remove support for old protocol versions, we could do so in a more manual process. For example, once a year we decide to drop support for versions older than 1 year. That would allow us to delete code everywhere in nearcore with a much higher benefit/effort ratio.

@bowenwang1996
Collaborator

@jakmeier the separation between what we do manually and what we implement programmatically makes sense. I guess one thing we need to consider is what to do in case there is a need to replay some old history with old versions. Today it is possible to do with the same binary, but if we manually remove some old versions then we need some other solution.

@jakmeier
Contributor

The simplest solution would probably be to offload the work to those who actually want to replay a complete history. They would have to check out old nearcore tags, compile them with the specified toolchain, and then use the different binary versions for different block-height ranges.

I'm not sure if anything more sophisticated is worth building. At least I am not aware of anyone actually doing such replays.

@akhi3030
Collaborator

I agree. I would wait and see if someone actually requests something more sophisticated before we devote resources to it.

nikurt pushed a commit that referenced this issue Jul 12, 2023
Part of #8197

This changes the behavior, in ways that hopefully would not be problematic.

@jakmeier AFAICT you’re one of the last ones to have touched io traces, WDYT about this change?
nikurt added a commit that referenced this issue Jul 12, 2023
* fix(state-sync): finalize using flat storage

* fix(state-sync): Applied values from state parts should be inlined when applicable (#9265)

* fix(state-sync): Applied values from state parts should be inlined when applicable

* fix(state-sync): Applied values from state parts should be inlined when applicable

* fix(state-sync): Applied values from state parts should be inlined when applicable

* fix(state-sync): Applied values from state parts should be inlined when applicable

* clippy

* fmt

* logging

* scan from a hash

* refactor: add FlatStateValue constructor for on-disk values (#9269)

This logic is used in multiple places, so it makes sense to abstract it behind a method in `FlatStateValue`.
`on_disk` name is chosen over `new` since this logic is only applicable for values to be persisted on disk. Later we will also add inlining for values as part of Flat Storage Delta and those will have lower inlining threshold, so we will have another constructor.

* refactor: Move adjust-db tool to database tool (#9264)

Refactoring of `adjust-db` tool, moving it to the `database` toolset where database tools should live.

* feat: remove fixed shards (#9219)

In this PR I'm removing the fixed shards field from the shard layout. I removed the fixed_shards field directly from the ShardLayoutV1. This implementation is quite hacky as it does not follow the typical flow for making protocol changes. This approach needs to be discussed and agreed on first but here is an implementation just to get a feel for it. 

The motivation for removing fixed shards is that fixed_shards breaks the contiguity of account ids within a shard. A sub-account of a fixed shard is assigned to the same shard as the fixed account, but it may fall in the middle of the account range of another shard. For example, let's consider the following shard layout:
```
fixed_shards = ['fixed']
boundary_accounts = ['middle']
```
In this shard layout we have three shards: 
- 0 - 'fixed' and all sub-accounts
- 1 - [..'middle'] with exception for any sub-account of fixed
- 2 - ['middle'..] with exception for any sub-account of fixed

Because of the fixed account, shards 1 and 2 are now full of holes for any subaccounts of 'fixed'. This property of fixed_shards makes it hard to perform any range operations on a shard such as resharding or deletion. 

The proper way to do it would be to add ShardLayoutV2 - a copy of V1 with fixed_shards removed and the version bumped. Due to the way we store state data - using the shard_uid as a prefix of the storage key - this approach would require a careful migration. One way to go about it would be to perform a null-resharding in the epoch preceding the shard layout version bump. There the node would have to create a copy of the state and store it in storage with the new version in the shard_uid. Once the state is duplicated, the node can keep applying changes to both states, same as in resharding. Such a migration would need to be prepared very carefully and would take a lot of time and effort.

The reason why I believe that we can get away without new shard layout version is that fixed_shards are not used in production and the shard version is not actually stored in the state. The ShardLayoutV1 with empty fixed_shards is semantically equivalent to ShardLayoutV1 with the fixed_shards field removed. 

It's worth mentioning that the migration pains would be removed if we decided to change the storage structure to one that doesn't have the shard_uid in the storage key.

* remove near-cache dep from near-vm-runner (#9273)

Part of #8197

* fix(state-sync): Clear flat storage outside of ClientActor (#9268)

Deleting multiple GBs from RocksDB is definitely taking more than 100ms.
In case a node catches up, it can potentially make a chunk producer or a block producer miss its chunk/block.

This PR moves deletion of Flat State to `SyncJobsActor` to avoid blocking `ClientActor`.

* remove near-stable-hasher dep from near-vm-runner (#9270)

Part of #8197

* fix(state-sync): Applied values from state parts should be inlined when applicable (#9265)

* fix(state-sync): Applied values from state parts should be inlined when applicable

* fix(state-sync): Applied values from state parts should be inlined when applicable

* fix(state-sync): Applied values from state parts should be inlined when applicable

* fix(state-sync): Applied values from state parts should be inlined when applicable

* clippy

* fmt

* fix(state-sync): Finalize state sync using flat storage

* fix(state-sync): Finalize state sync using flat storage

---------

Co-authored-by: Anton Puhach <anton@near.org>
Co-authored-by: Jure Bajic <jure@near.org>
Co-authored-by: wacban <wacban@users.noreply.github.com>
Co-authored-by: Ekleog-NEAR <96595974+Ekleog-NEAR@users.noreply.github.com>
Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
@nagisa
Collaborator

nagisa commented Jul 13, 2023

One axis of consideration that I feel our summary of possible split points does not capture is where the serialized forms of chain data materialize. Unfortunately it seems that right now we have a bunch of that all over the place. Some of it is in near-vm-runner, which borsh-serializes Actions for various delegation-related functionality, and a lot of it is in the chain component. Surprisingly (to me) the node-runtime crate proper has very little serialization code, but there's a bunch of serializable stuff in the near-primitives crate, which currently sits between near-vm-runner and node-runtime:

near-primitives-core → near-vm-runner → near-primitives → node-runtime → NightShade → chain

Ideally, I feel, we should end up in a situation where all of these schemas are part of our limited replayability story, which would mean, I feel, that we'd need to pull a bunch of BorshSerialize types up the dependency chain so that they end up above whatever split point we end up picking.

near-bulldozer bot pushed a commit that referenced this issue Aug 10, 2023
Rather than keeping these types inside near-vm-runner where they don't belong conceptually and interfere with limited replayability due to being schemas for serialization on-chain, we move them to near_primitives and expose the necessary interfaces to the `near-vm-runner` via `Externals`.

With this none of the near-vm-runner code is on-chain schema sensitive, which is one less thing to worry about.

“But wait,” you ask “shouldn’t limited-replayability help with not needing to think about what changes are made in the first place?” Yes, indeed it should be that way, but it isn’t entirely clear to me if we don’t or won’t have intentional or accidental functionality where the data is not serialized, but rather deserialized, even without a need to invoke the runtime to execute the code. This might be as simple a tool as something that shows the stored blockchain structure (e.g. near explorer.)

(But more seriously, I have petitioned that it would make sense to have all schemas part of the limited replayability story as well, but that didn't really gain as much traction as I had hoped, so I’m just punting this problem upwards and away from the runtime crates.)

cc  #8197
nagisa added a commit that referenced this issue Aug 16, 2023
Here the noted protocol features have been replaced with runtime
configuration via parameters instead.

This would be one solution/option to getting rid of compile-time
features in contract runtime which is interfering with limited
replayability (compile-time feature control means that all of the crates
still need to be built as a single compilation unit with consistent
options.)

cc #8197
nagisa added a commit that referenced this issue Aug 18, 2023
Here the noted protocol features have been replaced with runtime
configuration via parameters instead.

This would be one solution/option to getting rid of compile-time
features in contract runtime which is interfering with limited
replayability (compile-time feature control means that all of the crates
still need to be built as a single compilation unit with consistent
options.)

cc #8197
near-bulldozer bot pushed a commit that referenced this issue Aug 18, 2023
…eatures to runtime configuration (#9364)

This PR is built on top of a previous PR (see the base branch). Here the noted protocol features have been replaced with runtime configuration via parameters instead.

This would be one solution/option to getting rid of compile-time features in contract runtime which is interfering with limited replayability (compile-time feature control means that all of the crates still need to be built as a single compilation unit with consistent options.)

cc @jakmeier 

cc  #8197
nikurt pushed a commit that referenced this issue Aug 24, 2023
Rather than keeping these types inside near-vm-runner where they don't belong conceptually and interfere with limited replayability due to being schemas for serialization on-chain, we move them to near_primitives and expose the necessary interfaces to the `near-vm-runner` via `Externals`.

With this none of the near-vm-runner code is on-chain schema sensitive, which is one less thing to worry about.

“But wait,” you ask “shouldn’t limited-replayability help with not needing to think about what changes are made in the first place?” Yes, indeed it should be that way, but it isn’t entirely clear to me if we don’t or won’t have intentional or accidental functionality where the data is not serialized, but rather deserialized, even without a need to invoke the runtime to execute the code. This might be as simple a tool as something that shows the stored blockchain structure (e.g. near explorer.)

(But more seriously, I have petitioned that it would make sense to have all schemas part of the limited replayability story as well, but that didn't really gain as much traction as I had hoped, so I’m just punting this problem upwards and away from the runtime crates.)

cc  #8197
nikurt pushed a commit to nikurt/nearcore that referenced this issue Aug 24, 2023
Rather than keeping these types inside near-vm-runner where they don't belong conceptually and interfere with limited replayability due to being schemas for serialization on-chain, we move them to near_primitives and expose the necessary interfaces to the `near-vm-runner` via `Externals`.

With this none of the near-vm-runner code is on-chain schema sensitive, which is one less thing to worry about.

“But wait,” you ask “shouldn’t limited-replayability help with not needing to think about what changes are made in the first place?” Yes, indeed it should be that way, but it isn’t entirely clear to me if we don’t or won’t have intentional or accidental functionality where the data is not serialized, but rather deserialized, even without a need to invoke the runtime to execute the code. This might be as simple a tool as something that shows the stored blockchain structure (e.g. near explorer.)

(But more seriously, I have petitioned that it would make sense to have all schemas part of the limited replayability story as well, but that didn't really gain as much traction as I had hoped, so I’m just punting this problem upwards and away from the runtime crates.)

cc  near#8197
nikurt pushed a commit to nikurt/nearcore that referenced this issue Aug 24, 2023
…eatures to runtime configuration (near#9364)

This PR is built on top of a previous PR (see the base branch). Here the noted protocol features have been replaced with runtime configuration via parameters instead.

This would be one solution/option to getting rid of compile-time features in contract runtime which is interfering with limited replayability (compile-time feature control means that all of the crates still need to be built as a single compilation unit with consistent options.)

cc @jakmeier 

cc  near#8197
nikurt pushed a commit that referenced this issue Aug 28, 2023
Rather than keeping these types inside near-vm-runner where they don't belong conceptually and interfere with limited replayability due to being schemas for serialization on-chain, we move them to near_primitives and expose the necessary interfaces to the `near-vm-runner` via `Externals`.

With this none of the near-vm-runner code is on-chain schema sensitive, which is one less thing to worry about.

“But wait,” you ask “shouldn’t limited-replayability help with not needing to think about what changes are made in the first place?” Yes, indeed it should be that way, but it isn’t entirely clear to me if we don’t or won’t have intentional or accidental functionality where the data is not serialized, but rather deserialized, even without a need to invoke the runtime to execute the code. This might be as simple a tool as something that shows the stored blockchain structure (e.g. near explorer.)

(But more seriously, I have petitioned that it would make sense to have all schemas part of the limited replayability story as well, but that didn't really gain as much traction as I had hoped, so I’m just punting this problem upwards and away from the runtime crates.)

cc  #8197
nikurt pushed a commit that referenced this issue Aug 28, 2023
…eatures to runtime configuration (#9364)

This PR is built on top of a previous PR (see the base branch). Here the noted protocol features have been replaced with runtime configuration via parameters instead.

This would be one solution/option to getting rid of compile-time features in contract runtime which is interfering with limited replayability (compile-time feature control means that all of the crates still need to be built as a single compilation unit with consistent options.)

cc @jakmeier 

cc  #8197
@nagisa
Collaborator

nagisa commented Jan 5, 2024

A point for myself, to move this forward, is to:

  1. Define the scope and the end state implementation that we're striving for. Drop all the alternatives and options. Just pick one thing and work towards that. Do this in the context of what we have learned working on this so far;
    • This is especially relevant in the context of the last message I made. We should not have any more "ideally this is a part of the story" questions. It should be written down exactly what is included.
  2. Split the remaining work that is necessary to achieve the end state into individually actionable tasks (zkASM project is a good example to learn from here.)

@nagisa
Collaborator

nagisa commented Jan 16, 2024

Scope

Before this message the scope of this task was quite ambiguous. We were not sure if we wanted the split to be entirely at the boundary of near-vm-runner or if it should occur somewhere further up, within the transaction runtime. The reasons for this are multifaceted: near_vm_runner::run is a very nice singular chokepoint to make the split at, but it is also one that doesn't achieve as much as we would like the project to (modifying the transaction runtime is also a major pain.)

Well, it turns out that the ambiguity has caused us more harm than good. Progress had to be put on hold for us to rethink the approach here. Eventually we realized that we should focus on putting out a working MVP prototype. Once we have a working MVP, we can plan our future steps without making this a huge change that touches everything in the system.

Note that this explicitly does not include in the limited replayability scheme some of the very stable common dependencies – near-crypto or near-primitives-core – that near-vm-runner currently depends on.

For the MVP, let's depend on near-vm-runner versions via regular crate dependencies, with an integrator crate set up to invoke older versions of the runtime as necessary. We can explore using the semver hack, dynamic libraries, or any other fancier mechanism we've thought of as a refinement of the MVP.

Fortunately there shouldn’t be too much work remaining to achieve all of this. near-vm-runner has been cleaned up both through our earlier efforts towards limited replayability and through ad-hoc cleanups we’ve done during the downtime.

Remaining work

  • near-vm-runner is sufficiently self-contained as a crate for limited replayability;
  • Wait for the next near-vm-runner release;
  • Add a configuration parameter to choose a version of the runtime to invoke, and a table in the integrating crate to pick the relevant crate based on that parameter;
  • Add multiple versions of the near-vm-runner dependency to the integrating crate, plus the code to select between the versions based on the parameter introduced above, thus making the cut (see the sketch after this list).
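
A minimal sketch of that integrating crate, with the two renamed dependencies stubbed out as in-file modules so the example is self-contained; crate names, version numbers, and the cut-over constant are all hypothetical:

```rust
// In the real integrating crate these modules would be Cargo dependencies,
// renamed so several versions can coexist, e.g. (hypothetical versions):
//   near_vm_runner_old = { package = "near-vm-runner", version = "=0.16.0" }
//   near_vm_runner_new = { package = "near-vm-runner", version = "=0.17.0" }
mod near_vm_runner_old {
    pub fn run(_code: &[u8]) -> Result<(), String> {
        Ok(()) // stub: the previously published runtime
    }
}
mod near_vm_runner_new {
    pub fn run(_code: &[u8]) -> Result<(), String> {
        Ok(()) // stub: the current runtime
    }
}

/// First protocol version served by the new crate (illustrative).
const FIRST_VERSION_IN_NEW_RUNTIME: u32 = 64;

/// The "table": route each protocol version to the crate that implements it.
fn run_contract(protocol_version: u32, code: &[u8]) -> Result<(), String> {
    if protocol_version >= FIRST_VERSION_IN_NEW_RUNTIME {
        near_vm_runner_new::run(code)
    } else {
        near_vm_runner_old::run(code)
    }
}
```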

@Ekleog-NEAR
Collaborator Author

I’ll ask, for the remaining dependencies of near-vm-runner: do we want to stabilize them, releasing a 1.0 for them like near-account-id, to disincentivize ourselves from making breaking changes to them?

@nagisa
Collaborator

nagisa commented Jan 19, 2024

The three dependencies that we currently have are near-crypto, near-primitives-core and near-parameters (and everything they depend on transitively, I guess.) However, the problem is less that such a dependency exists than that it appears in the public interface of the near-vm-runner crate.

We clearly cannot freeze near-parameters – our configuration schema is expected to be pretty fluid into perpetuity. Unfortunately it also shows up in the public API of near-vm-runner, so changes of some sort will be necessary there (I removed the relevant checkbox in the remaining-work section to that effect.)

near-crypto and near-primitives-core are both crates that I wouldn’t lose any sleep over freezing, but there’s always the option of making copies of the few types used in the public API, too.

@nagisa
Collaborator

nagisa commented May 16, 2024

This is not currently being worked on, so unassigning.

@Ekleog-NEAR Ekleog-NEAR added A-transaction-runtime Area: transaction runtime (transaction and receipts processing, state transition, etc) A-contract-runtime Area: contract compilation and execution, virtual machines, etc T-contract-runtime Team: issues relevant to the contract runtime team labels May 17, 2024