Merge pull request #1289 from filecoin-project/opt/sdr-phase1

Optimize Phase1.

porcuquine authored Sep 25, 2020
2 parents d77986e + 0313c66 commit 24cd455
Showing 19 changed files with 1,292 additions and 251 deletions.
31 changes: 31 additions & 0 deletions .circleci/config.yml
@@ -154,6 +154,33 @@ jobs:
RUST_TEST_THREADS: 1
FIL_PROOFS_USE_FIL_BLST: true

# Running with `use_multicore_sdr=true` should be integrated directly into the test code. For now we
# just re-run the lifecycle tests with that setting enabled to exercise the use_multicore_sdr code path.
test_multicore_sdr:
docker:
- image: filecoin/rust:latest
working_directory: /mnt/crate
resource_class: 2xlarge+
steps:
- configure_environment_variables
- checkout
- attach_workspace:
at: "."
- restore_cache:
keys:
- cargo-v28-{{ checksum "rust-toolchain" }}-{{ checksum "Cargo.toml" }}-{{ checksum "Cargo.lock" }}-{{ arch }}
- restore_parameter_cache
- run:
name: Test with use_multicore_sdr enabled
command: |
ulimit -n 20000
ulimit -u 20000
cargo +$(cat rust-toolchain) test --verbose --release -- --ignored lifecycle
environment:
RUST_TEST_THREADS: 1
FIL_PROOFS_USE_MULTICORE_SDR: true

bench:
docker:
- image: filecoin/rust:latest
@@ -382,6 +409,10 @@ workflows:
requires:
- cargo_fetch
- ensure_groth_parameters_and_keys_linux
- test_multicore_sdr:
requires:
- cargo_fetch
- ensure_groth_parameters_and_keys_linux
- test:
requires:
- cargo_fetch
34 changes: 21 additions & 13 deletions README.md
@@ -141,24 +141,14 @@ While replicating and generating the Merkle Trees (MT) for the proof at the same

### Speed

One of the most computationally expensive operations during replication (besides the encoding itself) is the generation of the indexes of the (expansion) parents in the Stacked graph, implemented through a Feistel cipher (used as a pseudorandom permutation). To reduce that time we provide a caching mechanism to generate them only once and reuse them throughout replication (across the different layers). Already built into the system, it can be activated with the environment variable

```
FIL_PROOFS_MAXIMIZE_CACHING=1
```

To check that it's working you can inspect the replication log to find `using parents cache of unlimited size`. As the log indicates, we don't have fine-grained control at the moment, so it either stores all parents or none. This cache will add about 1.5x the entire sector size to the disk cache used during replication, and a configurable sliding window of cached data is used as memory overhead. This setting is _strongly recommended_ as it has a considerable impact on replication time.

You can also verify that the cache is working by inspecting how long each layer takes to encode (`encoding, layer:` in the log): the first two layers, forward and reverse, will take longer than the rest while they populate the cache, and the remaining 8 should see a considerable time drop.

Note that this setting is enabled by default. It can be disabled by setting the value to 0.

A related setting that can also be tuned is the SDR parents cache size. This value defaults to 2048 nodes, which is the equivalent of 112KiB of resident memory (where each cached node consists of DEGREE (base + exp = 6 + 8) x 4-byte elements = 56 bytes in length). Given that the cache is now located on disk, it is memory mapped when accessed, in window sizes related to this variable. This default was chosen to minimize memory while still allowing efficient access to the cache. If you would like to experiment with alternate sizes, you can modify the environment variable
One of the most computationally expensive operations during replication (besides the encoding itself) is the generation of the indexes of the (expansion) parents in the Stacked graph, implemented through a Feistel cipher (used as a pseudorandom permutation). To reduce that time we provide a caching mechanism to generate them only once and reuse them throughout replication (across the different layers).
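For intuition, here is a minimal sketch of a Feistel network used as a pseudorandom permutation over node indexes. The round function, round count, and keys are illustrative assumptions, not the crate's actual `feistel` implementation, which additionally has to map outputs back into the valid index range (e.g. by cycle walking):

```
/// Minimal Feistel-network sketch over a u64 index; round function and
/// keys here are illustrative assumptions only.
fn feistel_permute(index: u64, keys: &[u32; 4]) -> u64 {
    let (mut left, mut right) = ((index >> 32) as u32, index as u32);
    for &key in keys {
        // Hypothetical round function; any deterministic mixing works,
        // since the Feistel structure is what guarantees invertibility.
        let f = (right ^ key).wrapping_mul(0x9e37_79b9);
        let new_right = left ^ f;
        left = right;
        right = new_right;
    }
    ((left as u64) << 32) | (right as u64)
}
```

Because each round only XORs one half with a function of the other half, the permutation is invertible regardless of the round function, which is what makes a Feistel network a cheap pseudorandom permutation here.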

```
FIL_PROOFS_SDR_PARENTS_CACHE_SIZE=2048
```

This value defaults to 2048 nodes, which is the equivalent of 112KiB of resident memory (where each cached node consists of DEGREE (base + exp = 6 + 8) x 4-byte elements = 56 bytes in length). Given that the cache is now located on disk, it is memory mapped when accessed, in window sizes related to this variable. This default was chosen to minimize memory while still allowing efficient access to the cache. If you would like to experiment with alternate sizes, you can modify the environment variable above.

Increasing this value will increase the amount of resident RAM used.
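Spelled out as Rust constants for concreteness (a sketch; the names mirror those used in the cache code, such as `DEGREE` and `NODE_BYTES`), the arithmetic behind the default is:

```
const BASE_DEGREE: usize = 6; // base (DRG) parents
const EXP_DEGREE: usize = 8;  // expansion parents
const DEGREE: usize = BASE_DEGREE + EXP_DEGREE; // 14 parents per node
const NODE_BYTES: usize = 4;  // one 4-byte (u32) index per parent
const CACHED_NODE_BYTES: usize = DEGREE * NODE_BYTES; // 56 bytes per node
const DEFAULT_CACHE_NODES: usize = 2048; // FIL_PROOFS_SDR_PARENTS_CACHE_SIZE
// 2048 * 56 = 114_688 bytes = 112 KiB of resident memory
const RESIDENT_BYTES: usize = DEFAULT_CACHE_NODES * CACHED_NODE_BYTES;
```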

Lastly, the parent cache data is located on disk, by default in `/var/tmp/filecoin-parents`. To modify this location, use the environment variable
@@ -171,6 +161,24 @@ Using the above, the cache data would be located at `/path/to/parent/cache/filecoin-parents`.

Alternatively, use `FIL_PROOFS_CACHE_DIR=/path/to/parent/cache`, in which the parent cache will be located in `$FIL_PROOFS_CACHE_DIR/filecoin-parents`. Note that if you're using `FIL_PROOFS_CACHE_DIR`, it must be set through the environment and cannot be set using the configuration file. This setting has no effect if `FIL_PROOFS_PARENT_CACHE` is also specified.

```
FIL_PROOFS_USE_MULTICORE_SDR
```

When performing SDR replication (Precommit Phase 1) using only a single core, memory access to fetch a node's parents is
a bottleneck. Multicore SDR uses multiple cores (which should be restricted to a single core complex so they share a
cache) to assemble each node's parents and perform some prehashing. This setting is not enabled by default but can be
activated by setting `FIL_PROOFS_USE_MULTICORE_SDR=1`.

To take advantage of the shared cache, the process should be restricted to the cores of a single complex. For example,
on an AMD Threadripper 3970x (the CPU this was tested on), this can be accomplished with `taskset -c 4,5,6,7`, which
ensures four 'adjacent' cores are used and avoids spanning a complex boundary.
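Within a process, a similar effect can be approximated by pinning worker threads; here is a sketch using the third-party `core_affinity` crate, where both the crate choice and the core range 4..8 (mirroring the `taskset` example) are assumptions for illustration:

```
use std::thread;

/// Pin four worker threads to cores 4..8 so they share one core complex's
/// cache (illustrative; real code must discover the CPU topology first).
fn spawn_pinned_workers() {
    let core_ids = core_affinity::get_core_ids().expect("failed to enumerate cores");
    for id in core_ids.into_iter().skip(4).take(4) {
        thread::spawn(move || {
            core_affinity::set_for_current(id);
            // ... run an SDR worker (fetch parents / prehash) on this core ...
        });
    }
}
```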

Best performance will also be achieved when it is possible to lock pages which have been memory-mapped. This can be
accomplished either by running the process as root, or by raising the system limit for maximum locked memory with
`ulimit -l`. Two sector sizes' worth of data (for the current and previous layers) must be locked, along with
56 * `FIL_PROOFS_SDR_PARENTS_CACHE_SIZE` bytes for the parent cache.
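A minimal sketch of the locking step itself, via `mlock(2)` from the `libc` crate (the mapping and error handling around it are elided):

```
use std::io;

/// Try to pin an already-mapped region into physical memory. Fails with
/// EPERM/ENOMEM if the locked-memory limit (`ulimit -l`) is too low and
/// the process is not privileged.
unsafe fn lock_region(ptr: *const u8, len: usize) -> io::Result<()> {
    if libc::mlock(ptr as *const libc::c_void, len) != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(())
}
```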

### GPU Usage

We can now optionally build the column hashed tree 'tree_c' using the GPU with noticeable speed-up over the CPU. To activate the GPU for this, use the environment variable
47 changes: 2 additions & 45 deletions fil-proofs-tooling/src/bin/settings/main.rs
@@ -1,51 +1,8 @@
use anyhow::Result;

use storage_proofs::settings::{Settings, SETTINGS};
use storage_proofs::settings::SETTINGS;

fn main() -> Result<()> {
let Settings {
parameter_cache,
maximize_caching,
verify_cache,
verify_production_params,
parent_cache,
sdr_parents_cache_size,
use_gpu_column_builder,
max_gpu_column_batch_size,
column_write_batch_size,
use_gpu_tree_builder,
max_gpu_tree_batch_size,
rows_to_discard,
window_post_synthesis_num_cpus,
pedersen_hash_exp_window_size,
use_fil_blst,
} = &*SETTINGS.lock().unwrap();

println!("parameter_cache: {}", parameter_cache);
println!("maximize_caching: {}", maximize_caching);
println!("verify_cache: {}", verify_cache);
println!("verify_production_params: {}", verify_production_params);
println!("parent_cache: {}", parent_cache);
println!("sdr_parents_cache_size: {}", sdr_parents_cache_size);

println!("use_gpu_column_builder: {}", use_gpu_column_builder);
println!("max_gpu_column_batch_size: {}", max_gpu_column_batch_size);
println!("column_write_batch_size: {}", column_write_batch_size);

println!("use_gpu_tree_builder: {}", use_gpu_tree_builder);
println!("max_gpu_tree_batch_size: {}", max_gpu_tree_batch_size);

println!("rows_to_discard: {}", rows_to_discard);
println!(
"window_post_synthesis_num_cpus: {}",
window_post_synthesis_num_cpus
);
println!(
"pedersen_hash_exp_window_size: {}",
pedersen_hash_exp_window_size
);

println!("use_fil_blst: {}", use_fil_blst);

println!("{:#?}", *SETTINGS.lock().unwrap());
Ok(())
}
5 changes: 3 additions & 2 deletions rust-fil-proofs.config.toml.sample
@@ -3,8 +3,6 @@
# The location to store downloaded parameter files required for proofs.
parameter_cache = "/var/tmp/filecoin-proofs-parameters/"

# This enables the use of the parent cache to help speed-up runtime.
maximize_caching = true
# The location to store the on-disk parents cache.
parent_cache = "/var/tmp/filecoin-parents"
# The max number of parent cache elements to have mapped in RAM at a time.
@@ -38,3 +36,6 @@ pedersen_hash_exp_window_size = 16

# This enables accelerated SNARK verification
use_fil_blst = false

# This enables multicore SDR replication
use_multicore_sdr = false
4 changes: 2 additions & 2 deletions storage-proofs/core/src/settings.rs
@@ -16,7 +16,6 @@ const PREFIX: &str = "FIL_PROOFS";
#[derive(Debug, Serialize, Deserialize)]
#[serde(default)]
pub struct Settings {
pub maximize_caching: bool,
pub verify_cache: bool,
pub verify_production_params: bool,
pub pedersen_hash_exp_window_size: u32,
@@ -31,12 +30,12 @@ pub struct Settings {
pub parameter_cache: String,
pub parent_cache: String,
pub use_fil_blst: bool,
pub use_multicore_sdr: bool,
}

impl Default for Settings {
fn default() -> Self {
Settings {
maximize_caching: true,
verify_cache: false,
verify_production_params: false,
pedersen_hash_exp_window_size: 16,
@@ -54,6 +53,7 @@ impl Default for Settings {
parameter_cache: "/var/tmp/filecoin-proof-parameters/".to_string(),
parent_cache: cache("filecoin-parents"),
use_fil_blst: false,
use_multicore_sdr: false,
}
}
}
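Given the `FIL_PROOFS` prefix above, the new field can be toggled from the environment roughly as follows. This is a sketch of the semantics only; the hypothetical helper below is not how the crate actually wires settings, which go through its config/serde machinery:

```
use std::env;

/// Hypothetical helper showing how FIL_PROOFS_USE_MULTICORE_SDR maps onto
/// the `use_multicore_sdr` setting; "1" (or "true") enables it.
fn use_multicore_sdr_from_env() -> bool {
    env::var("FIL_PROOFS_USE_MULTICORE_SDR")
        .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}
```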
7 changes: 5 additions & 2 deletions storage-proofs/porep/Cargo.toml
@@ -9,14 +9,16 @@ repository = "https://github.com/filecoin-project/rust-fil-proofs"
readme = "README.md"

[dependencies]
crossbeam = "0.7.3"
digest = "0.9"
storage-proofs-core = { path = "../core", version = "^5.0.0"}
sha2raw = { path = "../../sha2raw", version = "^2.0.0"}
rand = "0.7"
merkletree = "0.21.0"
memmap = "0.7"
mapr = "0.8.0"
num-bigint = "0.2"
num-traits = "0.2"
sha2 = "0.9.1"
sha2 = { version = "0.9.1", features = ["compress"] }
rayon = "1.0.0"
serde = { version = "1.0", features = ["derive"]}
serde_json = "1.0"
@@ -34,6 +36,7 @@ hex = "0.4.2"
bincode = "1.1.2"
byteorder = "1.3.4"
lazy_static = "1.2"
byte-slice-cast = "0.3.5"

[dev-dependencies]
tempfile = "3"
5 changes: 4 additions & 1 deletion storage-proofs/porep/benches/encode.rs
@@ -5,7 +5,10 @@ use rand::thread_rng;
use storage_proofs_core::fr32::fr_into_bytes;
use storage_proofs_core::hasher::sha256::Sha256Hasher;
use storage_proofs_core::hasher::{Domain, Hasher};
use storage_proofs_porep::stacked::{create_label, create_label_exp, StackedBucketGraph};
use storage_proofs_porep::stacked::{
create_label::single::{create_label, create_label_exp},
StackedBucketGraph,
};

struct Pregenerated<H: 'static + Hasher> {
data: Vec<u8>,
16 changes: 12 additions & 4 deletions storage-proofs/porep/src/stacked/circuit/create_label.rs
@@ -82,9 +82,7 @@ mod tests {
util::{data_at_node, NODE_SIZE},
};

use crate::stacked::vanilla::{
create_label_exp, StackedBucketGraph, EXP_DEGREE, TOTAL_PARENTS,
};
use crate::stacked::vanilla::{create_label, StackedBucketGraph, EXP_DEGREE, TOTAL_PARENTS};

#[test]
fn test_create_label() {
@@ -167,7 +165,17 @@ mod tests {
assert_eq!(cs.num_constraints(), 532_025);

let (l1, l2) = data.split_at_mut(size * NODE_SIZE);
create_label_exp(&graph, None, &id_fr.into(), &*l2, l1, layer, node).unwrap();
create_label::single::create_label_exp(
&graph,
None,
fr_into_bytes(&id_fr),
&*l2,
l1,
layer,
node,
)
.unwrap();

let expected_raw = data_at_node(&l1, node).unwrap();
let expected = bytes_into_fr(expected_raw).unwrap();

11 changes: 6 additions & 5 deletions storage-proofs/porep/src/stacked/vanilla/cache.rs
@@ -6,6 +6,7 @@ use anyhow::{bail, ensure, Context};
use byteorder::{ByteOrder, LittleEndian};
use lazy_static::lazy_static;
use log::{info, trace};
use mapr::{Mmap, MmapOptions};
use rayon::prelude::*;
use serde::{Deserialize, Serialize};
use sha2::{Digest, Sha256};
@@ -55,7 +56,7 @@ pub struct ParentCache {
#[derive(Debug)]
struct CacheData {
/// This is a large list of fixed (parent) sized arrays.
data: memmap::Mmap,
data: Mmap,
/// Offset in nodes.
offset: u32,
/// Len in nodes.
@@ -78,7 +79,7 @@ impl CacheData {
let len = self.len as usize * DEGREE * NODE_BYTES;

self.data = unsafe {
memmap::MmapOptions::new()
MmapOptions::new()
.offset(offset as u64)
.len(len)
.map(self.file.as_ref())
@@ -132,7 +133,7 @@ impl CacheData {
}

let data = unsafe {
memmap::MmapOptions::new()
MmapOptions::new()
.offset((offset as usize * DEGREE * NODE_BYTES) as u64)
.len(len as usize * DEGREE * NODE_BYTES)
.map(file.as_ref())
@@ -208,7 +209,7 @@ impl ParentCache {
info!("[open] parent cache: calculating consistency digest");
let file = File::open(&path)?;
let data = unsafe {
memmap::MmapOptions::new()
MmapOptions::new()
.map(&file)
.with_context(|| format!("could not mmap path={}", path.display()))?
};
@@ -274,7 +275,7 @@ impl ParentCache {
.with_context(|| format!("failed to set length: {}", cache_size))?;

let mut data = unsafe {
memmap::MmapOptions::new()
MmapOptions::new()
.map_mut(file.as_ref())
.with_context(|| format!("could not mmap path={}", path.display()))?
};
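The windowing pattern above, condensed into a standalone sketch: the `DEGREE` and `NODE_BYTES` constants match the cache code, while the helper itself is illustrative rather than the crate's actual API.

```
use std::fs::File;

use anyhow::Context;
use mapr::{Mmap, MmapOptions};

const DEGREE: usize = 14;    // 6 base + 8 expansion parents per node
const NODE_BYTES: usize = 4; // each parent index is a u32

/// Map a window of `len` nodes starting at node `offset` from the on-disk
/// parent cache, as CacheData does when it shifts its mmap'd window.
fn map_window(file: &File, offset: usize, len: usize) -> anyhow::Result<Mmap> {
    let data = unsafe {
        MmapOptions::new()
            .offset((offset * DEGREE * NODE_BYTES) as u64)
            .len(len * DEGREE * NODE_BYTES)
            .map(file)
            .context("could not mmap parent cache window")?
    };
    Ok(data)
}
```

Mapping only a window keeps resident memory bounded by the cache-size setting while the full parent list lives on disk.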