Perf idea: better runtime support for contracts that use less memory #11033
Comments
What we could do would be to cap the initial memory size to 64 MiB, and to use the contract-provided initial memory size if it’s below that. The good thing is that, so long as the contract is properly compiled, this should not break any contract. The main drawback is implementation complexity (we currently basically do not read the value we replace, meaning we can’t hack around to figure out the performance impact). Also, due to the huge slowdown from zeroing out large memories, we would need some way to unmap/remap the memory pages when the memory actually used is too large, instead of zeroing it out, or we’d be vulnerable to huge undercharging attacks. So, to keep in mind if/when we come back to this:
Current plan:
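For concreteness, here is a minimal sketch of the clamping rule described in the comment above. The constants and function name are illustrative, not nearcore API; the real change would live in the contract-preparation step, as the patch further down in this thread shows.

/// Wasm pages are 64 KiB, so 1024 pages correspond to the runtime's current fixed 64 MiB.
const CONFIG_INITIAL_PAGES: u32 = 1024;

/// Picks the initial memory to map for a contract: cap at the configured 64 MiB,
/// but honor a smaller contract-declared initial size when one is present.
/// `declared_pages` is `None` when no usable memory declaration was found.
fn initial_pages_to_map(declared_pages: Option<u32>) -> u32 {
    match declared_pages {
        Some(requested) => requested.min(CONFIG_INITIAL_PAGES),
        None => CONFIG_INITIAL_PAGES, // fall back to today's behaviour
    }
}

fn main() {
    // A typical Rust contract requesting 17 pages (~1.06 MiB) gets exactly 17 pages.
    assert_eq!(initial_pages_to_map(Some(17)), 17);
    // A contract declaring more than the cap is still clamped to 64 MiB.
    assert_eq!(initial_pages_to_map(Some(4096)), CONFIG_INITIAL_PAGES);
    // No declaration: keep the current 64 MiB behaviour.
    assert_eq!(initial_pages_to_map(None), CONFIG_INITIAL_PAGES);
}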
It looks like all the contracts from our top 10 most gas-using accounts request exactly 17 pages of initial wasm memory. My guess is that this is the current minimum implied by rustc, assuming the contracts are all written in Rust. So I reran the experiment from #10851, except with a hardcoded initial memory of 20 pages at the 3 places where the #10851 protocol-change patch had a hardcoded 1. This is still very much a hack, but it should be enough to at least get a rough answer as to whether there are actually gains to be hoped for by implementing the full feature. The results are as follows:
So there do seem to be some pretty significant gains to be had here. However, we will need a very good plan for testing the changes, because the results from shards 2 and 3 prove that there are lots of risks if the change is ill-implemented (like the hackish change used to test the potential gains). Current plan:
This is basically just moving code around, to be able to reduce visibility into NearVmMemory’s internals and thus derisk modifying them. Part of #11033
TL;DR

Benchmark results of the full implementation on our current workload

I finished implementing the full change (minus the "protocol version bump" part of it) and benchmarked what the speedup would be if it were fully activated. The results are, unfortunately, as I initially feared: the time spent in mmap is actually mostly spent zeroing out memory, and thus we cannot optimize it out. In fact, this change results in a 20–30% slowdown even in the best-case scenario where we know ahead of time how much memory our contracts request. The hope for a speedup that came from the hackish experiment was actually due to it being hackish: it likely did not properly reset the whole memory, or simply caused contract crashes.

Further work that could happen on this topic

We could maybe hope for some improvements if contracts did active work to reduce their initial memory requirements, but they currently all request ~1 MiB, due to rustc defaults. We could ask them to compile with a lower initial memory (via the relevant rustc flag). Then we could maybe hope for the ~10% performance win that could be seen. Or maybe we would still see the 20–30% performance slowdown that we saw with 17 initial pages. In order to test this, we’d need a reproducible workload on a contract that we control and that we could compile with varying numbers of memory pages. In order to properly implement this, we would also need some incentive for contract authors to care about this rustc flag, and thus a new gas cost that would depend on the size of the initially requested memory. This would, again, be quite a lot of additional effort.

I had been suggesting we drop the idea before we even started working on it, because the Linux kernel is usually well-implemented and has access to more information than we do: it knows which pages have actually been touched. So I’m once more suggesting that we drop the idea, because mmap should have minimal overhead besides its memory zeroing, and we cannot simply skip the memory zeroing. If we ever have such a reproducible benchmark ready, it might make sense to at least check whether the patch brings benefits. But until then I don’t think it makes sense to spend more time on this optimization idea.

Detailed results

Control group

The control group,

Patch

commit d91e1d9bd71fc3f544e2d3b3c2dcb0a65542fc24
Author: Léo Gaspard <leo@near.org>
Date: Wed Apr 24 10:34:23 2024 +0000
wip
diff --git a/Cargo.lock b/Cargo.lock
index 9b7f3bf9d..674bcf206 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4929,11 +4929,13 @@ dependencies = [
"borsh 1.0.0",
"bytesize",
"cov-mark",
+ "crossbeam",
"ed25519-dalek",
"enum-map",
"expect-test",
"finite-wasm",
"hex",
+ "lazy_static",
"lru 0.12.3",
"memoffset 0.8.0",
"near-crypto",
diff --git a/core/parameters/src/parameter_table.rs b/core/parameters/src/parameter_table.rs
index b4d68efb2..0b9a8fb5c 100644
--- a/core/parameters/src/parameter_table.rs
+++ b/core/parameters/src/parameter_table.rs
@@ -326,6 +326,7 @@ impl TryFrom<&ParameterTable> for RuntimeConfig {
function_call_weight: params.get(Parameter::FunctionCallWeight)?,
eth_implicit_accounts: params.get(Parameter::EthImplicitAccounts)?,
yield_resume_host_functions: params.get(Parameter::YieldResume)?,
+ lower_initial_contract_memory: true, // TODO: protocol change
},
account_creation_config: AccountCreationConfig {
min_allowed_top_level_account_length: params
diff --git a/core/parameters/src/view.rs b/core/parameters/src/view.rs
index 39487c249..06a6d656c 100644
--- a/core/parameters/src/view.rs
+++ b/core/parameters/src/view.rs
@@ -227,6 +227,8 @@ pub struct VMConfigView {
pub eth_implicit_accounts: bool,
/// See [`VMConfig::yield_resume_host_functions`].
pub yield_resume_host_functions: bool,
+ /// See [`VMConfig::lower_initial_contract_memory`].
+ pub lower_initial_contract_memory: bool,
/// Describes limits for VM and Runtime.
///
@@ -253,6 +255,7 @@ impl From<crate::vm::Config> for VMConfigView {
vm_kind: config.vm_kind,
eth_implicit_accounts: config.eth_implicit_accounts,
yield_resume_host_functions: config.yield_resume_host_functions,
+ lower_initial_contract_memory: config.lower_initial_contract_memory,
}
}
}
@@ -275,6 +278,7 @@ impl From<VMConfigView> for crate::vm::Config {
vm_kind: view.vm_kind,
eth_implicit_accounts: view.eth_implicit_accounts,
yield_resume_host_functions: view.yield_resume_host_functions,
+ lower_initial_contract_memory: view.lower_initial_contract_memory,
}
}
}
diff --git a/core/parameters/src/vm.rs b/core/parameters/src/vm.rs
index f2f7a46af..7c721e60f 100644
--- a/core/parameters/src/vm.rs
+++ b/core/parameters/src/vm.rs
@@ -192,6 +192,9 @@ pub struct Config {
/// Enable the `promise_yield_create` and `promise_yield_resume` host functions.
pub yield_resume_host_functions: bool,
+ /// Enable the `LowerInitialContractMemory` protocol feature.
+ pub lower_initial_contract_memory: bool,
+
/// Describes limits for VM and Runtime.
pub limit_config: LimitConfig,
}
@@ -224,6 +227,7 @@ impl Config {
self.ed25519_verify = true;
self.math_extension = true;
self.implicit_account_creation = true;
+ self.lower_initial_contract_memory = true;
}
}
diff --git a/runtime/near-vm-runner/Cargo.toml b/runtime/near-vm-runner/Cargo.toml
index 042bd0bab..2e5910369 100644
--- a/runtime/near-vm-runner/Cargo.toml
+++ b/runtime/near-vm-runner/Cargo.toml
@@ -17,9 +17,11 @@ anyhow = { workspace = true, optional = true }
base64.workspace = true
bn.workspace = true
borsh.workspace = true
+crossbeam.workspace = true
ed25519-dalek.workspace = true
enum-map.workspace = true
finite-wasm = { workspace = true, features = ["instrument"], optional = true }
+lazy_static.workspace = true
lru = "0.12.3"
memoffset.workspace = true
num-rational.workspace = true
diff --git a/runtime/near-vm-runner/src/cache.rs b/runtime/near-vm-runner/src/cache.rs
index 11b238a66..ae4c19c25 100644
--- a/runtime/near-vm-runner/src/cache.rs
+++ b/runtime/near-vm-runner/src/cache.rs
@@ -80,6 +80,7 @@ impl CompiledContract {
#[derive(Debug, Clone, PartialEq, BorshDeserialize, BorshSerialize)]
pub struct CompiledContractInfo {
pub wasm_bytes: u64,
+ pub initial_memory_pages: u32,
pub compiled: CompiledContract,
}
@@ -314,6 +315,7 @@ impl ContractRuntimeCache for FilesystemContractRuntimeCache {
}
}
temp_file.write_all(&value.wasm_bytes.to_le_bytes())?;
+ temp_file.write_all(&value.initial_memory_pages.to_le_bytes())?;
let temp_filename = temp_file.into_temp_path();
// This is atomic, so there wouldn't be instances where getters see an intermediate state.
rustix::fs::renameat(&self.state.dir, &*temp_filename, &self.state.dir, final_filename)?;
@@ -351,15 +353,17 @@ impl ContractRuntimeCache for FilesystemContractRuntimeCache {
// The file turns out to be empty/truncated? Treat as if there's no cached file.
return Ok(None);
}
- let wasm_bytes = u64::from_le_bytes(buffer[buffer.len() - 8..].try_into().unwrap());
- let tag = buffer[buffer.len() - 9];
- buffer.truncate(buffer.len() - 9);
+ let initial_memory_pages = u32::from_le_bytes(buffer[buffer.len() - 4..].try_into().unwrap());
+ let wasm_bytes = u64::from_le_bytes(buffer[buffer.len() - 12..buffer.len() - 4].try_into().unwrap());
+ let tag = buffer[buffer.len() - 13];
+ buffer.truncate(buffer.len() - 13);
Ok(match tag {
CODE_TAG => {
- Some(CompiledContractInfo { wasm_bytes, compiled: CompiledContract::Code(buffer) })
+ Some(CompiledContractInfo { wasm_bytes, initial_memory_pages, compiled: CompiledContract::Code(buffer) })
}
ERROR_TAG => Some(CompiledContractInfo {
wasm_bytes,
+ initial_memory_pages,
compiled: CompiledContract::CompileModuleError(borsh::from_slice(&buffer)?),
}),
// File is malformed? For this code, since we're talking about a cache lets just treat
diff --git a/runtime/near-vm-runner/src/near_vm_runner/memory.rs b/runtime/near-vm-runner/src/near_vm_runner/memory.rs
index 7d8b7f570..37432de63 100644
--- a/runtime/near-vm-runner/src/near_vm_runner/memory.rs
+++ b/runtime/near-vm-runner/src/near_vm_runner/memory.rs
@@ -26,7 +26,6 @@ impl NearVmMemory {
)?)))
}
- #[cfg(unused)] // TODO: this will be used once we reuse the memories
pub fn into_preallocated(self) -> Result<PreallocatedMemory, String> {
Ok(PreallocatedMemory(
Arc::into_inner(self.0)
diff --git a/runtime/near-vm-runner/src/near_vm_runner/mod.rs b/runtime/near-vm-runner/src/near_vm_runner/mod.rs
index f2c7b48b5..331d2e407 100644
--- a/runtime/near-vm-runner/src/near_vm_runner/mod.rs
+++ b/runtime/near-vm-runner/src/near_vm_runner/mod.rs
@@ -38,7 +38,7 @@ enum NearVmCompiler {
// major version << 6
// minor version
const VM_CONFIG: NearVmConfig = NearVmConfig {
- seed: (2 << 29) | (2 << 6) | 2,
+ seed: (2 << 29) | (2 << 6) | 15,
engine: NearVmEngine::Universal,
compiler: NearVmCompiler::Singlepass,
};
diff --git a/runtime/near-vm-runner/src/near_vm_runner/runner.rs b/runtime/near-vm-runner/src/near_vm_runner/runner.rs
index eb0cdf12d..2e6f89e35 100644
--- a/runtime/near-vm-runner/src/near_vm_runner/runner.rs
+++ b/runtime/near-vm-runner/src/near_vm_runner/runner.rs
@@ -8,6 +8,7 @@ use crate::logic::errors::{
use crate::logic::gas_counter::FastGasCounter;
use crate::logic::types::PromiseResult;
use crate::logic::{Config, External, VMContext, VMLogic, VMOutcome};
+use crate::near_vm_runner::memory::PreallocatedMemory;
use crate::near_vm_runner::{NearVmCompiler, NearVmEngine};
use crate::runner::VMResult;
use crate::{
@@ -160,7 +161,7 @@ impl NearVM {
pub(crate) fn compile_uncached(
&self,
code: &ContractCode,
- ) -> Result<UniversalExecutable, CompilationError> {
+ ) -> Result<(u32, UniversalExecutable), CompilationError> {
let _span = tracing::debug_span!(target: "vm", "NearVM::compile_uncached").entered();
let prepared_code = prepare::prepare_contract(code.code(), &self.config, VMKind::NearVm)
.map_err(CompilationError::PrepareError)?;
@@ -169,6 +170,7 @@ impl NearVM {
matches!(self.engine.validate(&prepared_code), Ok(_)),
"near_vm failed to validate the prepared code"
);
+ let initial_memory_pages = self.initial_memory_pages_for(&prepared_code)?;
let executable = self
.engine
.compile_universal(&prepared_code, &self)
@@ -176,19 +178,20 @@ impl NearVM {
tracing::error!(?err, "near_vm failed to compile the prepared code (this is defense-in-depth, the error was recovered from but should be reported to pagoda)");
CompilationError::WasmerCompileError { msg: err.to_string() }
})?;
- Ok(executable)
+ Ok((initial_memory_pages, executable))
}
fn compile_and_cache(
&self,
code: &ContractCode,
cache: &dyn ContractRuntimeCache,
- ) -> Result<Result<UniversalExecutable, CompilationError>, CacheError> {
+ ) -> Result<Result<(u32, UniversalExecutable), CompilationError>, CacheError> {
let executable_or_error = self.compile_uncached(code);
let key = get_contract_cache_key(*code.hash(), &self.config);
let record = CompiledContractInfo {
wasm_bytes: code.code().len() as u64,
- compiled: match &executable_or_error {
+ initial_memory_pages: executable_or_error.as_ref().map(|r| r.0).unwrap_or(self.config.limit_config.initial_memory_pages),
+ compiled: match executable_or_error.as_ref().map(|r| &r.1) {
Ok(executable) => {
let code = executable
.serialize()
@@ -220,10 +223,10 @@ impl NearVM {
method_name: &str,
closure: impl FnOnce(VMMemory, VMLogic<'_>, &VMArtifact) -> Result<VMOutcome, VMRunnerError>,
) -> VMResult<VMOutcome> {
- // (wasm code size, compilation result)
- type MemoryCacheType = (u64, Result<VMArtifact, CompilationError>);
+ // (wasm code size, initial memory pages, compilation result)
+ type MemoryCacheType = (u64, u32, Result<VMArtifact, CompilationError>);
let to_any = |v: MemoryCacheType| -> Box<dyn std::any::Any + Send> { Box::new(v) };
- let (wasm_bytes, artifact_result) = cache.memory_cache().try_lookup(
+ let (wasm_bytes, initial_memory_pages, artifact_result) = cache.memory_cache().try_lookup(
code_hash,
|| match code {
None => {
@@ -243,9 +246,9 @@ impl NearVM {
};
match &code.compiled {
- CompiledContract::CompileModuleError(err) => {
- Ok::<_, VMRunnerError>(to_any((code.wasm_bytes, Err(err.clone()))))
- }
+ CompiledContract::CompileModuleError(err) => Ok::<_, VMRunnerError>(
+ to_any((code.wasm_bytes, code.initial_memory_pages, Err(err.clone()))),
+ ),
CompiledContract::Code(serialized_module) => {
let _span =
tracing::debug_span!(target: "vm", "NearVM::load_from_fs_cache")
@@ -269,7 +272,11 @@ impl NearVM {
.load_universal_executable_ref(&executable)
.map(Arc::new)
.map_err(|err| VMRunnerError::LoadingError(err.to_string()))?;
- Ok(to_any((code.wasm_bytes, Ok(artifact))))
+ Ok(to_any((
+ code.wasm_bytes,
+ code.initial_memory_pages,
+ Ok(artifact),
+ )))
}
}
}
@@ -277,34 +284,39 @@ impl NearVM {
Some(code) => {
let _span =
tracing::debug_span!(target: "vm", "NearVM::build_from_source").entered();
- Ok(to_any((
- code.code().len() as u64,
+ let (initial_memory_pages, compiled) =
match self.compile_and_cache(code, cache)? {
- Ok(executable) => Ok(self
- .engine
- .load_universal_executable(&executable)
- .map(Arc::new)
- .map_err(|err| VMRunnerError::LoadingError(err.to_string()))?),
- Err(err) => Err(err),
- },
- )))
+ Ok((initial_memory_pages, executable)) => (
+ initial_memory_pages,
+ Ok(self
+ .engine
+ .load_universal_executable(&executable)
+ .map(Arc::new)
+ .map_err(|err| VMRunnerError::LoadingError(err.to_string()))?),
+ ),
+ Err(err) => (0, Err(err)),
+ };
+ Ok(to_any((code.code().len() as u64, initial_memory_pages, compiled)))
}
},
move |value| {
let _span =
tracing::debug_span!(target: "vm", "NearVM::load_from_mem_cache").entered();
- let &(wasm_bytes, ref downcast) = value
+ let &(wasm_bytes, initial_memory_pages, ref downcast) = value
.downcast_ref::<MemoryCacheType>()
.expect("downcast should always succeed");
- (wasm_bytes, downcast.clone())
+ (wasm_bytes, initial_memory_pages, downcast.clone())
},
)?;
+ lazy_static::lazy_static! {
+ static ref MEMORIES: crossbeam::queue::ArrayQueue<PreallocatedMemory> = crossbeam::queue::ArrayQueue::new(16);
+ }
let mut memory = NearVmMemory::new(
- self.config.limit_config.initial_memory_pages,
+ initial_memory_pages,
self.config.limit_config.max_memory_pages,
- None, // TODO: this should actually reuse the memories
+ MEMORIES.pop(),
)
.expect("Cannot create memory for a contract call");
// FIXME: this mostly duplicates the `run_module` method.
@@ -323,12 +335,38 @@ impl NearVM {
if let Err(e) = result {
return Ok(VMOutcome::abort(logic, e));
}
- closure(vmmemory, logic, &artifact)
+ let res = closure(vmmemory, logic, &artifact);
+ if let Ok(mmap) = memory.into_preallocated() {
+ tracing::info!("Reusing a memory");
+ let _ = MEMORIES.push(mmap);
+ } else {
+ tracing::error!("Not reusing a memory")
+ }
+ res
}
Err(e) => Ok(VMOutcome::abort(logic, FunctionCallError::CompilationError(e))),
}
}
+ fn initial_memory_pages_for(&self, code: &[u8]) -> Result<u32, CompilationError> {
+ let parser = wasmparser::Parser::new(0);
+ for payload in parser.parse_all(code) {
+ if let Ok(wasmparser::Payload::ImportSection(reader)) = payload {
+ for mem in reader {
+ if let Ok(mem) = mem {
+ if let wasmparser::ImportSectionEntryType::Memory(
+ wasmparser::MemoryType::M32 { limits, .. },
+ ) = mem.ty
+ {
+ return Ok(limits.initial);
+ }
+ }
+ }
+ }
+ }
+ panic!("Tried running a contract that was not prepared with a memory import");
+ }
+
fn run_method(
&self,
artifact: &VMArtifact,
diff --git a/runtime/near-vm-runner/src/prepare/prepare_v2.rs b/runtime/near-vm-runner/src/prepare/prepare_v2.rs
index 894dea6fd..ada52e31b 100644
--- a/runtime/near-vm-runner/src/prepare/prepare_v2.rs
+++ b/runtime/near-vm-runner/src/prepare/prepare_v2.rs
@@ -240,8 +240,37 @@ impl<'a> PrepareContext<'a> {
}
fn memory_import(&self) -> wasm_encoder::EntityType {
+ // First, figure out the requested initial memory
+ // This parsing should be fast enough, as it can skip over whole sections
+ // And considering the memory section is after the import section, we will not
+ // have parsed it yet in the "regular" parse when calling this function.
+ let mut requested_initial_memory = None;
+ let parser = wp::Parser::new(0);
+ 'outer: for payload in parser.parse_all(self.code) {
+ if let Ok(wp::Payload::MemorySection(reader)) = payload {
+ for mem in reader {
+ if let Ok(mem) = mem {
+ if requested_initial_memory.is_some() {
+ requested_initial_memory = None;
+ break 'outer;
+ }
+ requested_initial_memory = Some(mem.initial);
+ }
+ }
+ }
+ }
+
+ // Then, generate a memory import, that has at most the limit-configured initial memory,
+ // but tries to get that number down by using the contract-provided data.
+ let max_initial_memory = requested_initial_memory.unwrap_or(u64::MAX);
+ let config_initial_memory = u64::from(self.config.limit_config.initial_memory_pages);
+ let initial_memory = if self.config.lower_initial_contract_memory {
+ std::cmp::min(max_initial_memory, config_initial_memory)
+ } else {
+ config_initial_memory
+ };
wasm_encoder::EntityType::Memory(wasm_encoder::MemoryType {
- minimum: u64::from(self.config.limit_config.initial_memory_pages),
+ minimum: initial_memory,
maximum: Some(u64::from(self.config.limit_config.max_memory_pages)),
memory64: false,
shared: false,
diff --git a/runtime/near-vm-runner/src/wasmer2_runner.rs b/runtime/near-vm-runner/src/wasmer2_runner.rs
index d79e6b1cd..c7f952f56 100644
--- a/runtime/near-vm-runner/src/wasmer2_runner.rs
+++ b/runtime/near-vm-runner/src/wasmer2_runner.rs
@@ -300,6 +300,7 @@ impl Wasmer2VM {
if let Some(cache) = cache {
let record = CompiledContractInfo {
wasm_bytes: code.code().len() as u64,
+ initial_memory_pages: self.config.limit_config.initial_memory_pages,
compiled: match &executable_or_error {
Ok(executable) => {
let code = executable
diff --git a/runtime/near-vm-runner/src/wasmer_runner.rs b/runtime/near-vm-runner/src/wasmer_runner.rs
index 59bf3bf81..f2ea9f06d 100644
--- a/runtime/near-vm-runner/src/wasmer_runner.rs
+++ b/runtime/near-vm-runner/src/wasmer_runner.rs
@@ -268,6 +268,7 @@ impl Wasmer0VM {
if let Some(cache) = cache {
let record = CompiledContractInfo {
wasm_bytes: code.code().len() as u64,
+ initial_memory_pages: self.config.limit_config.initial_memory_pages,
compiled: match &module_or_error {
Ok(module) => {
let code = module
diff --git a/runtime/near-vm/vm/src/memory/linear_memory.rs b/runtime/near-vm/vm/src/memory/linear_memory.rs
index 6e786e00c..43f84b8bc 100644
--- a/runtime/near-vm/vm/src/memory/linear_memory.rs
+++ b/runtime/near-vm/vm/src/memory/linear_memory.rs
@@ -141,7 +141,7 @@ impl LinearMemory {
let mapped_pages = memory.minimum;
let mapped_bytes = mapped_pages.bytes();
- let alloc = if let Some(alloc) = from_mmap {
+ let alloc = if let Some(mut alloc) = from_mmap {
// For now we always request the same size, because our prepare step hardcodes a maximum size
// of 64 MiB. This could change in the future, at which point this assert will start triggering
// and we’ll need to think of a better way to handle things.
@@ -150,6 +150,7 @@ impl LinearMemory {
request_bytes,
"Multiple data memory mmap's had different maximal lengths"
);
+ alloc.make_accessible(mapped_bytes.0).map_err(MemoryError::Region)?;
alloc
} else {
Mmap::accessible_reserved(mapped_bytes.0, request_bytes).map_err(MemoryError::Region)?
diff --git a/runtime/near-vm/vm/src/mmap.rs b/runtime/near-vm/vm/src/mmap.rs
index 9df6e1e65..b5985454b 100644
--- a/runtime/near-vm/vm/src/mmap.rs
+++ b/runtime/near-vm/vm/src/mmap.rs
@@ -227,6 +227,9 @@ impl Mmap {
pub fn reset(&mut self) -> Result<(), String> {
unsafe {
if self.accessible_len > 0 {
+ if self.accessible_len > 18 * 64 * 1024 {
+ return Err(String::from("too big memories are not worth resetting"));
+ }
self.as_mut_ptr().write_bytes(0, self.accessible_len);
region::protect(self.as_ptr(), self.accessible_len, region::Protection::NONE)
.map_err(|e| e.to_string())?;

Testing

Running the patched neard manually confirmed that the new tracing logs do show that memories are properly reused most of the time. In the benchmark results, 17% of the time is also spent in the
Running with the
Considering this testing, the benchmark results are representative of exactly the performance changes caused by not relying on the Linux kernel’s zeroing out of memory in mmap, and instead doing it ourselves.

Benchmark results

Shard 0

Shard 0 failed to reproduce blocks, thus preventing a proper benchmark comparison. This is likely due to the missing protocol version bump, and it does indicate that finishing the change properly would likely result in real-world contract breakage somewhere on the shard.

Shard 1

On shard 1, performance is basically the same with and without the patch:
Shard 2

Shard 2 sees a 20% slowdown with the patch:
Shard 3

Shard 3 sees the same 20–25% slowdown with the patch:
Shard 4

Shard 4 was like shard 0, with no results due to outcome changes that break the benchmarking setup.

Shard 5

Shard 5 even saw the patched binary being 35% slower than the baseline:
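As an aside, the "zeroing dominates" conclusion can be illustrated with a rough, stand-alone micro-benchmark sketch. This is not part of the issue or the patch; it assumes the libc crate, Linux, and illustrative sizes and iteration counts. It contrasts the two strategies discussed above: taking a fresh anonymous mapping per contract call (the kernel zeroes pages lazily, only when they are first touched) versus reusing one mapping and zeroing all of it in userspace, which is what Mmap::reset has to do.

use std::time::Instant;

const LEN: usize = 64 * 1024 * 1024; // 64 MiB, matching the runtime's default initial memory

unsafe fn fresh_anon_map() -> *mut u8 {
    // A new private anonymous mapping: the kernel only zeroes pages when first touched.
    let ptr = libc::mmap(
        std::ptr::null_mut(),
        LEN,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
        -1,
        0,
    );
    assert_ne!(ptr, libc::MAP_FAILED);
    ptr as *mut u8
}

fn main() {
    unsafe {
        // Strategy A: a fresh mapping per "contract call", touching only one page.
        let t = Instant::now();
        for _ in 0..32 {
            let p = fresh_anon_map();
            p.write_volatile(1);
            libc::munmap(p as *mut libc::c_void, LEN);
        }
        println!("fresh mmap per call:    {:?}", t.elapsed());

        // Strategy B: reuse a single mapping and zero all of it in userspace each time.
        let p = fresh_anon_map();
        let t = Instant::now();
        for _ in 0..32 {
            std::ptr::write_bytes(p, 0, LEN); // touches and dirties every page
            p.write_volatile(1);
        }
        println!("reuse + manual zeroing: {:?}", t.elapsed());
        libc::munmap(p as *mut libc::c_void, LEN);
    }
}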
Closing as per the above comment.
Our runtime today starts contracts with a memory of 64 MiB regardless of what the contract declares. The contract can then bring that number up to 128 MiB. We suspect that many contracts are actually using much less memory than this 64 MiB. If someone were to write a contract that uses as little memory as possible (by specifying a small initial memory and using memory.grow as appropriate), then maybe we can optimise our runtime to execute such contracts more efficiently.

See also #10851 for a preliminary investigation on the matter.
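As a hedged illustration of the contract side of this idea (not from this issue): a Rust contract targeting wasm32 can grow its linear memory on demand through core::arch::wasm32, and the ~1 MiB initial memory coming from rustc's linker defaults can typically be lowered with a wasm-ld flag passed through rustc. The helper name, the page math, and the exact flag value below are assumptions for illustration only.

// Build command sketch (assumption): --initial-memory is the wasm-ld option for the
// initial linear memory, in bytes and a multiple of the 64 KiB page size; 1114112
// bytes would be the 17 pages observed for today's contracts.
//   RUSTFLAGS="-C link-arg=--initial-memory=1114112" \
//       cargo build --target wasm32-unknown-unknown --release

/// Grows the wasm linear memory by at least `extra_bytes`, relying on memory.grow
/// instead of a large up-front allocation. Returns Err(()) if the runtime refuses
/// (e.g. because the 128 MiB maximum would be exceeded).
#[cfg(target_arch = "wasm32")]
fn grow_heap_by(extra_bytes: usize) -> Result<(), ()> {
    use core::arch::wasm32::{memory_grow, memory_size};
    const WASM_PAGE: usize = 64 * 1024;
    let pages = (extra_bytes + WASM_PAGE - 1) / WASM_PAGE;
    // memory_grow returns the previous size in pages, or usize::MAX on failure.
    if memory_grow::<0>(pages) == usize::MAX {
        return Err(());
    }
    debug_assert!(memory_size::<0>() * WASM_PAGE >= extra_bytes);
    Ok(())
}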