Exclude contract code out of state witness & distribute separately #11099

Closed
Tracked by #82 ...
walnut-the-cat opened this issue Apr 17, 2024 · 3 comments
Labels
A-stateless-validation Area: stateless validation

Comments

@walnut-the-cat
Contributor

walnut-the-cat commented Apr 17, 2024

Relevant discussion

Link

Issue

During the stateless validation forknet test, we observed a node crash with the following error:

2024-04-16T20:21:23.545144Z DEBUG chunk_tracing{chunk_hash=HnFSQEoLMEnMXK2pxnnnbv7GkwFobanyrd7JJbNS2Rrj}:new_chunk{shard_id=3}:apply_chunk{shard_id=3}:process_state_update:apply{protocol_version=84 num_transactions=19}:process_receipt{receipt_id=GHhLncT5GM2ksuwVzUqPMkzCp132V7xToQZPfUbKeRgP predecessor=operator.meta-pool.near receiver=lockup-meta-pool.near id=GHhLncT5GM2ksuwVzUqPMkzCp132V7xToQZPfUbKeRgP}:run{code.hash=EXekfV3kpFHHsTi4JUDh2MVLCKS3hpKdPbXMuRirxrvY vm_kind=NearVm}: vm: close time.busy=49.3µs time.idle=3.42µs
thread '<unnamed>' panicked at core/store/src/trie/trie_storage.rs:317:16:
!!!CRASH!!!: MissingTrieValue(TrieMemoryPartialStorage, 5FWvfWAJxH1mbCHuzLGwBfL9EYjH8YWVin6Pmp3H8gdM)
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <near_store::trie::trie_storage::TrieMemoryPartialStorage as near_store::trie::trie_storage::TrieStorage>::retrieve_raw_bytes
   4: near_store::trie::Trie::internal_retrieve_trie_node
   5: near_store::trie::Trie::retrieve_raw_node
   6: near_store::trie::Trie::lookup_from_state_column
   7: near_store::trie::Trie::get_optimized_ref
   8: near_store::trie::Trie::get
   9: near_store::trie::update::TrieUpdate::get
  10: near_store::get_code
  11: node_runtime::actions::execute_function_call
  12: node_runtime::Runtime::apply_action
  13: node_runtime::Runtime::apply_action_receipt
  14: node_runtime::Runtime::apply::{{closure}}
  15: node_runtime::Runtime::apply
  16: <near_chain::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::apply_chunk
  17: near_chain::update_shard::apply_new_chunk
  18: core::ops::function::FnOnce::call_once{{vtable.shim}}
  19: <rayon_core::job::HeapJob<BODY> as rayon_core::job::Job>::execute
  20: rayon_core::registry::WorkerThread::wait_until_cold

@Longarithm mentioned that

  10: near_store::get_code

is due to contract code missing from the state witness.

From the debug log, @staffik confirmed that this was likely the case and that the crash was happening with different contracts, including lockup-meta-pool.near and pack.promotional.basketball.playible.near.

@Longarithm's understanding of how this can cause a node crash is as follows:

  • Chunk producer reads code from cache and doesn't go to trie for the code;
  • so trie nodes required for reading contract code are never read and recorded;
  • so chunk validator doesn't know where to take it.
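
The failure mode described above can be reproduced in a toy model. The sketch below is not nearcore code: `Producer`, `validate`, and the string keys are hypothetical stand-ins, and the "witness" is simply the map of values the producer read through the trie.

```rust
use std::collections::{HashMap, HashSet};

/// Toy chunk producer (hypothetical type). `state` plays the role of the trie,
/// `recorded` is what ends up in the state witness.
pub struct Producer {
    pub state: HashMap<String, Vec<u8>>,
    pub compiled_cache: HashSet<String>,
    pub recorded: HashMap<String, Vec<u8>>,
}

impl Producer {
    pub fn new() -> Self {
        Producer {
            state: HashMap::new(),
            compiled_cache: HashSet::new(),
            recorded: HashMap::new(),
        }
    }

    /// Executes a function call on `account_id`.
    pub fn call(&mut self, account_id: &str) {
        if self.compiled_cache.contains(account_id) {
            // Cache hit: the trie is never read, so nothing is recorded.
            return;
        }
        let key = format!("code:{account_id}");
        if let Some(code) = self.state.get(&key) {
            // Cache miss: the read goes through the trie and is recorded.
            self.recorded.insert(key, code.clone());
            self.compiled_cache.insert(account_id.to_string());
        }
    }
}

/// Toy chunk validator: it can only read what the producer recorded.
pub fn validate(witness: &HashMap<String, Vec<u8>>, account_id: &str) -> Result<(), String> {
    let key = format!("code:{account_id}");
    match witness.get(&key) {
        Some(_) => Ok(()),
        None => Err(format!("MissingTrieValue({key})")),
    }
}
```

With a warm producer cache, `recorded` stays empty and `validate` returns the `MissingTrieValue` error, which mirrors the panic in the backtrace above.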

Timeline

April 17

@Longarithm is preparing a quick patch to bypass the issue in Forknet for now, but we need a proper solution in place before the MainNet launch.

April 18

The team discussed the proper solution and decided to separate contract code from the state witness. When a chunk validator realizes that it does not have the contract code needed to validate an incoming state witness, it will reactively request the missing code from its peers. As a result, a chunk miss may happen, but the chunk validator should then have the compiled contract code ready for future validation.

The project involves the following work, among other things:

  • Introduce a new network message to request contract code
    • Saketh's tip on how to do so: link
  • Remove contract code from state witness
@walnut-the-cat added the A-stateless-validation (Area: stateless validation) label on Apr 17, 2024
@walnut-the-cat changed the title from "Incomplete state witness when contract code is cached." to "Separate contract code out of state witness" on Apr 18, 2024
@walnut-the-cat changed the title from "Separate contract code out of state witness" to "Separate contract code out of state witness & reactive contract request by chunk validator" on Apr 18, 2024
@tayfunelmas self-assigned this on Apr 18, 2024
@walnut-the-cat
Contributor Author

For now, @tayfunelmas will continue making progress on building the network message, but @Longarithm will pause and focus on #11124 until we have clear evidence that including contract code in the state witness does not work for the MVP launch. Relevant discussion can be found here: link

@nagisa
Collaborator

nagisa commented Jul 26, 2024

Does this issue in principle imply that the contract code would no longer be a part of The State (i.e. no longer stored in the trie, with all that entails), or is this only a partial step towards such a future and the remaining work would need to be documented as a separate issue?

@tayfunelmas changed the title from "Separate contract code out of state witness & reactive contract request by chunk validator" to "Separate contract code out of state witness & distribute separately" on Sep 25, 2024
@tayfunelmas changed the title from "Separate contract code out of state witness & distribute separately" to "Exclude contract code out of state witness & distribute separately" on Sep 25, 2024
@tayfunelmas
Contributor

Adding a description of the implementation in the context of this issue below.


Distributing contracts separately from state witness

The current chunk state witness structure is inefficient due to a significant portion being consumed by contract code. This document describes an implementation for optimizing the state witness size. The changes described are guarded by the protocol feature ExcludeContractCodeFromStateWitness.

With the ExcludeContractCodeFromStateWitness feature, we reduce the state witness size by distributing the contract code separately from the witness, based on the following observations:

  1. New contracts are deployed infrequently, so the function calls are often made to the same contract and we distribute the same contract code in the witness many times at different heights.
  2. Once deployed, a contract is compiled and stored in the compiled-contract cache. This cache is stored on disk and persists across epochs until the VM configuration changes. The chunk application uses this cache to bypass the trie traversal and storage reads needed to fetch the uncompiled contract code from storage. Both chunk producers and chunk validators maintain the cache with high hit rates.

Deploying a contract to an account

A contract is deployed and called through a Near account.
When a contract is deployed to an account with id account_id (by running a DeployContractAction), the deployment of the contract is recorded in State in two places:

  1. The code_hash field of the Account struct (pointed to by the key TrieKey::Account{account_id}) is updated to store the hash of the (uncompiled) contract code (link). Note that when no contract is deployed, this field contains CryptoHash::default().
  2. A new entry is created with key TrieKey::ContractCode{account_id} and value containing the (uncompiled) code of the contract.
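
The two writes made by a deployment can be sketched as follows. This is a simplified model under stated assumptions, not the nearcore implementation: the state is a plain `HashMap`, `Value` and `deploy_contract` are hypothetical names, and `CodeHash` is a non-cryptographic `u64` produced by `DefaultHasher` in place of `CryptoHash`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for `CryptoHash` (assumption: the real type is a 32-byte
/// cryptographic hash, this one is not).
pub type CodeHash = u64;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum TrieKey {
    Account { account_id: String },
    ContractCode { account_id: String },
}

/// Hypothetical value type: only the fields relevant to deployment are modeled.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Value {
    Account { code_hash: CodeHash },
    Code(Vec<u8>),
}

/// Placeholder hash used only to keep the sketch std-only.
pub fn hash_code(code: &[u8]) -> CodeHash {
    let mut hasher = DefaultHasher::new();
    code.hash(&mut hasher);
    hasher.finish()
}

/// Records a deployment in both places: the account entry gets the code hash,
/// and a separate entry holds the uncompiled code itself.
pub fn deploy_contract(state: &mut HashMap<TrieKey, Value>, account_id: &str, code: Vec<u8>) {
    let code_hash = hash_code(&code);
    state.insert(
        TrieKey::Account { account_id: account_id.to_string() },
        Value::Account { code_hash },
    );
    state.insert(
        TrieKey::ContractCode { account_id: account_id.to_string() },
        Value::Code(code),
    );
}
```

Pre-compilation and the on-disk compiled-contract cache are left out of this sketch.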

When the contract is deployed, it is also pre-compiled, and the compiled code is persisted on disk in the compiled-contract cache. When applying a function call to a pre-compiled contract, the implementation skips reading from the trie and directly invokes the compiled code from the cache.

Calling a function from contract

In stateless validation, all accesses to the state by actions are recorded in the state witness, so that the chunk validators can validate the chunk by applying the same actions.
A FunctionCallAction represents calling a method of the contract deployed to an account.
The function call is fulfilled by either retrieving the pre-compiled contract from the cache or compiling the code right before execution.

Note that not every chunk validator necessarily has the contract in its compiled-contract cache.
Thus, the contract code (which is also part of the State) should be available when validating the state witness.
To address this, regardless of whether the compiled-contract cache was hit, every access to the uncompiled code is recorded by explicitly reading from the key TrieKey::ContractCode{account_id}, which records all the internal trie nodes from the state root up to the leaf node, as well as the value (code) itself.
Thus, each function call contributes at least the size of the contract code to the state witness, which may be as large as 4 MB.

Excluding contract code from state witness

When ExcludeContractCodeFromStateWitness is enabled, we distribute the following in parallel to the state witness:

  1. Hashes of the contract code called during the chunk application. We do not include the contract code in this message. Instead, the chunk validators request the contracts missing from their compiled-contract cache.
  2. Code of the contracts deployed during the chunk application. We distribute this message only to the validators other than the chunk validators, since the chunk validators can access the new contract code and update their cache by applying the deploy actions (contained in the incoming receipts) in the state witness.

Collecting contract accesses and deployments

In order to identify which contracts to distribute, we collect (1) the hashes of the contracts called by a FunctionCallAction and (2) contract code deployed by a DeployContractAction.
When ExcludeContractCodeFromStateWitness is enabled, the chunk producer performs the following when applying the receipts in a chunk (note that this is done by all the chunk producers tracking the same shard):

  • For function calls, it skips recording the read of the value from TrieKey::ContractCode{account_id}. Instead, it just records the hash of the contract code. The TrieUpdate::record_contract_call function, called when executing a FunctionCallAction, implements the different behaviors with and without the feature enabled.
  • For contract deployments, it records the code deployed when executing a DeployContractAction, by calling the TrieUpdate::record_contract_deploy function.

Both pieces of information are collected in the ContractStorage in the TrieUpdate.
When finalizing the TrieUpdate after the chunk has been applied, we take out this information, package it in a struct called ContractUpdates, and pass it to upstream callers in the ApplyChunk and then to ApplyChunkResult. Finally, we write the data in ContractUpdates to the database in the StoredChunkStateTransitionData along with the partial state recorded during the chunk application.
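
A minimal sketch of this collection step is shown below. The names TrieUpdate, ContractStorage, ContractUpdates, record_contract_call, and record_contract_deploy come from the description above, but every signature, field, and the `finalize` helper are assumptions made for illustration. In particular, the witness is reduced to a byte counter, and recording deployments only when the feature is enabled is a guess.

```rust
use std::collections::BTreeSet;

/// Stand-in for `CryptoHash` (assumption: any 32-byte value).
pub type CodeHash = [u8; 32];

/// Hypothetical shape of the collected data.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ContractUpdates {
    pub contract_accesses: BTreeSet<CodeHash>,
    pub contract_deploys: Vec<Vec<u8>>,
}

#[derive(Default)]
pub struct ContractStorage {
    updates: ContractUpdates,
}

#[derive(Default)]
pub struct TrieUpdate {
    feature_enabled: bool,
    /// Bytes of contract code that would be recorded into the witness.
    recorded_code_bytes: usize,
    contracts: ContractStorage,
}

impl TrieUpdate {
    pub fn new(feature_enabled: bool) -> Self {
        TrieUpdate { feature_enabled, ..Default::default() }
    }

    /// Called for a function call: with the feature on, only the hash is kept;
    /// with it off, the code read is charged to the witness.
    pub fn record_contract_call(&mut self, code_hash: CodeHash, code: &[u8]) {
        if self.feature_enabled {
            self.contracts.updates.contract_accesses.insert(code_hash);
        } else {
            self.recorded_code_bytes += code.len();
        }
    }

    /// Called for a deployment: keeps the deployed code for later distribution.
    pub fn record_contract_deploy(&mut self, code: Vec<u8>) {
        if self.feature_enabled {
            self.contracts.updates.contract_deploys.push(code);
        }
    }

    /// Hands the witness byte count and the collected updates to the caller.
    pub fn finalize(self) -> (usize, ContractUpdates) {
        (self.recorded_code_bytes, self.contracts.updates)
    }
}
```

Using a set for the accesses means that repeated calls to the same contract within one chunk contribute a single hash.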

Sending contract updates to validators

Upon finishing producing the new chunk, the chunk producer reconstructs ContractUpdates from the database (by reading StoredChunkStateTransitionData) along with the other state-transition data, generates the state witness, and sends (in the same DistributeStateWitnessRequest) both the state witness and ContractUpdates to the PartialWitnessActor.

NOTE: All the operations described in the rest of this document are performed in the PartialWitnessActor.

PartialWitnessActor distributes the state witness and the contract updates in the following order (see code here):

  1. It first sends the hashes of the contract code accessed to the chunk validators (except for the validators that track the same shard). This allows validators to check their compiled-contract cache and request code for the missing contracts, while waiting for the witness parts. This is sent in a message called ChunkContractAccesses.
  2. It then sends the state witness parts to witness-part owners.
  3. It finally sends the new contracts deployed to the validators that do not validate the witness in the current turn. This allows the other validators to update their compiled-contract cache for the later turns when they become a chunk validator for the respective shard. The parts are sent in a message called PartialEncodedContractDeploys. The code for deployed contracts is distributed to validators in parts after compressing and encoding in Reed-Solomon code, similarly to how the state witness is sent in parts.
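
The ordering of the three steps can be illustrated with a small planning function. This is a sketch only: `Outgoing` and `plan_distribution` are hypothetical names, validators are plain strings, and message payloads, Reed-Solomon encoding, and the skipping of empty messages are not modeled.

```rust
/// Hypothetical description of one outgoing message and its recipient.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Outgoing {
    ChunkContractAccesses { to: String },
    PartialWitness { to: String },
    PartialEncodedContractDeploys { to: String },
}

/// Produces the messages in the order described above.
pub fn plan_distribution(
    chunk_validators: &[&str],
    same_shard_trackers: &[&str],
    witness_part_owners: &[&str],
    all_validators: &[&str],
) -> Vec<Outgoing> {
    let mut plan = Vec::new();
    // Step 1: accessed code hashes go to chunk validators that do not
    // already track the shard.
    for &v in chunk_validators {
        if !same_shard_trackers.contains(&v) {
            plan.push(Outgoing::ChunkContractAccesses { to: v.to_string() });
        }
    }
    // Step 2: witness parts go to their owners.
    for &v in witness_part_owners {
        plan.push(Outgoing::PartialWitness { to: v.to_string() });
    }
    // Step 3: newly deployed code goes to validators that are not
    // validating this witness.
    for &v in all_validators {
        if !chunk_validators.contains(&v) {
            plan.push(Outgoing::PartialEncodedContractDeploys { to: v.to_string() });
        }
    }
    plan
}
```

Sending the hashes first gives validators the longest possible window to fetch missing code before the witness parts arrive.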

Handling contract accesses messages in chunk validators

When a chunk validator receives ChunkContractAccesses, it checks its local cache to see if it is missing any contracts needed to validate the incoming witness. If any contract is missing, it sends a message called ContractCodeRequest to a random chunk producer that tracks the same shard (which balances the requests across the chunk producers that can provide the missing contract code).
While waiting for the parts of the state witness, the validator also waits for the response to the contract code request (if any). If this request is not fulfilled before enough state-witness parts are collected, the validation fails.
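
The validator-side logic can be sketched in three small functions. All names here (`missing_contracts`, `pick_producer`, `on_progress`, `Outcome`) are hypothetical, the cache is modeled as a set of hashes, and the random choice takes a caller-supplied number so that the sketch needs no RNG crate.

```rust
use std::collections::BTreeSet;

/// Stand-in for `CryptoHash` (assumption: any 32-byte value).
pub type CodeHash = [u8; 32];

/// Hashes listed in ChunkContractAccesses that the local cache lacks.
pub fn missing_contracts(
    accesses: &BTreeSet<CodeHash>,
    compiled_cache: &BTreeSet<CodeHash>,
) -> Vec<CodeHash> {
    accesses.difference(compiled_cache).copied().collect()
}

/// Chooses the chunk producer to ask. `random` would come from an RNG in
/// real code; here the caller supplies it.
pub fn pick_producer<'a>(producers: &[&'a str], random: u64) -> Option<&'a str> {
    if producers.is_empty() {
        return None;
    }
    Some(producers[(random % producers.len() as u64) as usize])
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Wait,
    Validate,
    Fail,
}

/// Decides what to do as witness parts and code responses arrive.
pub fn on_progress(
    parts_collected: usize,
    parts_needed: usize,
    pending_code_requests: usize,
) -> Outcome {
    if parts_collected < parts_needed {
        Outcome::Wait
    } else if pending_code_requests == 0 {
        Outcome::Validate
    } else {
        // Enough witness parts arrived before the code did.
        Outcome::Fail
    }
}
```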

A chunk producer receiving the ContractCodeRequest validates the request against the saved state transition data (note that all chunk producers tracking the same shard save the ContractUpdates in the StoredChunkStateTransitionData in the database) and responds with the ContractCodeResponse containing the (compressed) contract code.
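
One plausible reading of "validates the request against the saved state transition data" is that the producer only serves hashes that were recorded as accessed for that chunk. The sketch below assumes exactly that; `StoredTransition` and `handle_contract_code_request` are hypothetical names, and compression of the response is omitted.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for `CryptoHash` (assumption: any 32-byte value).
pub type CodeHash = [u8; 32];

/// Hypothetical slice of the stored state-transition data: only the
/// recorded contract accesses are modeled.
pub struct StoredTransition {
    pub contract_accesses: BTreeSet<CodeHash>,
}

/// Returns the requested code, or an error if any requested hash was not
/// accessed in the chunk or is not available locally.
pub fn handle_contract_code_request(
    stored: &StoredTransition,
    code_by_hash: &HashMap<CodeHash, Vec<u8>>,
    requested: &[CodeHash],
) -> Result<Vec<Vec<u8>>, String> {
    let mut response = Vec::with_capacity(requested.len());
    for hash in requested {
        if !stored.contract_accesses.contains(hash) {
            return Err("requested contract was not accessed in this chunk".to_string());
        }
        let code = code_by_hash
            .get(hash)
            .ok_or_else(|| "contract code not found locally".to_string())?;
        response.push(code.clone());
    }
    Ok(response)
}
```

Rejecting hashes outside the recorded accesses keeps a validator from using the request path to fetch arbitrary contracts.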

Handling deployed code messages in other validators

When a validator receives a PartialEncodedContractDeploys message for a newly deployed contract, it starts collecting the parts using its PartialEncodedContractDeploysTracker. The tracker waits until a sufficient number of parts has been collected, then decodes and decompresses the original contracts and compiles them in parallel. The compilation internally persists the compiled contract code in the local cache, so that the validator can use the compiled code later when validating upcoming chunk witnesses.
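
The part-collection step can be sketched as a small tracker. `DeploysTracker` below is a hypothetical simplification of PartialEncodedContractDeploysTracker: it only counts distinct, in-range parts and reports when decoding could start. Reed-Solomon decoding, decompression, and compilation are not modeled.

```rust
use std::collections::BTreeMap;

/// Hypothetical tracker for the parts of one encoded deploy message.
pub struct DeploysTracker {
    total_parts: usize,
    required_parts: usize,
    parts: BTreeMap<usize, Vec<u8>>,
}

impl DeploysTracker {
    pub fn new(total_parts: usize, required_parts: usize) -> Self {
        DeploysTracker { total_parts, required_parts, parts: BTreeMap::new() }
    }

    /// Stores one part and returns `true` once enough distinct parts are
    /// held to attempt decoding. Duplicate and out-of-range parts are ignored.
    pub fn insert_part(&mut self, part_ord: usize, part: Vec<u8>) -> bool {
        if part_ord < self.total_parts {
            self.parts.entry(part_ord).or_insert(part);
        }
        self.parts.len() >= self.required_parts
    }
}
```

For example, with 4 total parts and 2 required, receiving part 0 twice is not enough, while receiving parts 0 and 3 is.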
