Scheduling prover on another thread #808

elcritch · 2024-05-14T18:32:15Z

The prover currently blocks the main Chronos thread. It should be scheduled on thread similar to the erasure coding as done in #716.

elcritch · 2024-05-14T19:19:56Z

Notes on the circom-compat prover and Nim / Rust data ownership needed to do the threading properly:

the only current backend CircomCompat uses a Nim library nim-circom-compat
nim-circom-compat uses a C API exposed in a small Rust wrapper library circom-compat-ffi
circom-compat-ffi wraps the Rust library circom-compat, which is the core Circom / Ark library
the circom-compat-ffi wrapper creates a context pointer which owns the Rust memory for everything

elcritch · 2024-05-14T19:53:58Z

Notes on CircomBn254Cfg:

CircomBn254Cfg contains all the needed context for CircomCompat on the Rust side
- This is fairly opaque, but on the Rust side includes reading and running the wasm code
- WASM code gen might take some time, so probably want to avoid creating one per prover run

TODO: figure out a simple way to make CircomBn254Cfg for each worker thread?

there's no simple way in taskpools to run setup code on each worker thread AFAIK
the simplest would then be to pass the needed args to create CircomBn254Cfg on prover work submission
each worker thread would have a thread local CircomBn254Cfg which it'd check if nil, and if so use the args to create a CircomBn254Cfg

Next handling the CircomCompatCtx is relatively simple:

the Nim ffi boils down to calling Circom's builder.push_input through the C API
I checked that each call to push_input_xx copies each bigint into memory owned by the CircomCompatCtx Rust pointer
I'm not 100% certain about the string, but it uses to_string which I believe creates a copy to the original string
calling builder.push_input copies the BigInt's into the circom-compat's builder which holds the data in a HashMap<String, Vec<BigInt>>
CircomCompatCtx is manually created and released on the Nim side so it should be easy to create it on the Chronos thread and then free it once the work is done on the taskpool worker thread

elcritch · 2024-05-29T13:14:13Z

Updates: Adding a thread using taskpools results in memory corruption and crashes in either the Rust library code or on the Nim side after calling into the Rust library.

I've tried lots of variations on the problem. Initially my thinking was that there was something fairly simple like using sharing a piece of mutable memory between threads. However, I'm fairly competent I've tracked down all the places on the inputs where this occurs.

Things I've tried which haven't resolved issue:

updating circom-compat-ffi to create a new rng for each thread
duplicating CircomBn254Cfg on the main thread
explicitly duplicating CircomBn254Cfg
create a new CircomBn254Cfg from circuit files (wasm, zkey, etc)
updating circom-compat-ffi to explicitly clone builder keys using to_owned
not freeing / releasing memory from CircomBn254Cfg or CircomCompatCtx to see if Nim side is calling something wrong

The Nim side appears to be working fine running the builder and creating the CircomBn254Cfg or CircomCompatCtx objects, however, as soon as prove is called it begins corrupting memory . Printing Nim inputs to the builder shows corruption in the builder keys, despite using to_owned to ensure the Rust will make it's own copy of the strings.

elcritch · 2024-05-29T13:16:36Z

Also, the proof returns appear to be working fine as are sending the arguments to the taskpool workers. Even running only a single thread worker which would prevent race conditions results in corrupted memory.

My current best guess is that something in the CircomCompat or Wasmer libraries assume they're running on the main thread, or have a thread local resource that is instantiated internally.

elcritch · 2024-05-29T13:49:21Z

I'm going to setup a docker image so I can run valgrind on my mac and try to see what's going on.

Example parameter corruption below:

TASK: task: proof: success((a: (x: [252, 125, 205, 30, 91, 169, 208, 144, 232, 29, 242, 124, 78, 116, 186, 167, 118, 99, 221, 210, 215, 182, 33, 41, 207, 198, 113, 40, 80, 145, 99, 18], y: [173, 31, 56, 59, 77, 216, 231, 124, 23, 188, 79, 117, 200, 10, 199, 13, 116, 69, 134, 172, 56, 116, 225, 203, 14, 136, 127, 175, 192, 153, 96, 35]), b: (x: [[94, 109, 223, 96, 64, 52, 178, 201, 60, 203, 105, 119, 66, 155, 205, 209, 8, 234, 163, 208, 173, 164, 91, 83, 220, 48, 137, 72, 79, 234, 199, 29], [250, 128, 38, 89, 24, 148, 228, 232, 88, 216, 131, 10, 178, 203, 128, 242, 4, 210, 81, 200, 180, 31, 163, 127, 214, 163, 178, 66, 67, 50, 63, 45]], y: [[194, 156, 184, 152, 149, 70, 210, 6, 167, 99, 182, 182, 40, 60, 98, 158, 117, 67, 86, 45, 125, 183, 118, 193, 254, 255, 4, 210, 31, 119, 169, 6], [81, 217, 255, 215, 196, 250, 117, 10, 119, 9, 141, 91, 234, 103, 221, 104, 135, 113, 136, 254, 9, 91, 56, 152, 110, 99, 15, 199, 245, 82, 124, 40]]), c: (x: [55, 23, 205, 200, 58, 113, 165, 31, 166, 57, 161, 90, 103, 18, 239, 162, 36, 18, 25, 165, 218, 114, 234, 182, 137, 191, 139, 72, 65, 74, 51, 34], y: [108, 100, 140, 89, 122, 70, 78, 164, 197, 68, 107, 10, 252, 80, 14, 236, 73, 215, 139, 193, 93, 129, 50, 85, 108, 117, 18, 170, 79, 172, 153, 10])))
TASK: task: params POST: (slotDepth: 32, datasetDepth: 8, blkDepth: 5, cellElms: 67, numSamples: 5, r1csPath: "tests/circuits/fixtures/proof_main.r1cs", wasmPath: "tests/circuits/fixtures/proof_main.wasm", zkeyPath: "")
TASK: task: 
TASK: task: params: 0x114f0f380
TASK: task: params: (slotDepth: 32, datasetDepth: 8, blkDepth: 5, cellElms: 67, numSamples: 5, r1csPath: "tests/circuits/fixtures/proof_main.r1cs", wasmPath: "tests/circuits/fixtures/proof_main.wasm", zkeyPath: "")
TASK: task: -2001899808422488719
TASK: task spawn: params: 0x104c0e5b0
PROVE: 17
TASK: task spawn: params: 0x104c0edd0
PROVE: 18
TASK: task spawn: params: 0x104d37790
PROVE: 19
TASK: task spawn: params: 0x104c42ec0
PROVE: 20
TASK: task spawn: params: 0x104dda650
PROVE: 21
TASK: task spawn: params: 0x104c42a60
PROVE: 22
TASK: task spawn: params: 0x104c9bce0
PROVE: 23
TASK: task: proof: success((a: (x: [81, 35, 236, 68, 189, 99, 225, 10, 196, 226, 243, 68, 50, 229, 30, 152, 240, 87, 72, 199, 57, 148, 7, 211, 54, 247, 46, 101, 246, 118, 200, 42], y: [110, 79, 55, 98, 72, 199, 47, 7, 192, 80, 38, 5, 219, 105, 202, 90, 17, 243, 80, 71, 209, 92, 132, 24, 102, 191, 198, 233, 72, 220, 203, 40]), b: (x: [[228, 140, 96, 208, 2, 19, 115, 179, 216, 137, 145, 45, 135, 211, 132, 61, 237, 36, 11, 213, 234, 52, 143, 255, 111, 204, 50, 10, 76, 208, 244, 38], [99, 220, 252, 121, 17, 173, 194, 225, 147, 17, 36, 174, 113, 93, 227, 65, 147, 197, 162, 64, 105, 66, 150, 147, 132, 239, 81, 95, 24, 132, 126, 35]], y: [[82, 132, 204, 152, 211, 196, 92, 195, 40, 252, 126, 184, 31, 50, 41, 169, 69, 240, 230, 19, 95, 96, 114, 95, 41, 207, 78, 137, 21, 202, 152, 3], [153, 197, 8, 14, 251, 110, 63, 127, 55, 161, 111, 251, 170, 19, 163, 79, 113, 103, 209, 110, 35, 245, 203, 20, 84, 47, 247, 117, 184, 102, 251, 25]]), c: (x: [219, 122, 190, 26, 79, 118, 204, 95, 179, 74, 2, 48, 128, 94, 191, 208, 113, 163, 252, 107, 148, 228, 24, 230, 39, 188, 160, 187, 125, 197, 176, 33], y: [237, 41, 43, 42, 44, 61, 81, 206, 218, 6, 61, 247, 33, 229, 6, 105, 12, 216, 125, 145, 128, 144, 243, 200, 103, 100, 13, 66, 19, 154, 157, 27])))
TASK: task: params POST: (slotDepth: 32, datasetDepth: 8, blkDepth: 5, cellElms: 67, numSamples: 5, r1csPath: "tests/circuits/fixtures/proof_main.r1cs", wasmPath: "tests/circuits/fixtures/proof_main.wasm", zkeyPath: "")
TASK: task: 
TASK: task: params: 0x114f0f4c0
TASK: task: params: (slotDepth: 32, datasetDepth: 8, blkDepth: 5, cellElms: 67, numSamples: 5, r1csPath: "\x12 ��B��\x1C\x14���șo�$\'�A�d��L���\exR�U", wasmPath: "\x12 R߆;��AKV,:�D�\x13=\v���\e\x11��Q���\x15��", zkeyPath: "")

elcritch assigned elcritch and tbekas May 14, 2024

elcritch linked a pull request May 16, 2024 that will close this issue

Scheduling prover on another thread #813

Draft

gmega mentioned this issue Sep 17, 2024

[Epic] Multithreading codex-storage/codex-pm#83

Open

2 tasks

gmega added the Client See https://miro.com/app/board/uXjVNZ03E-c=/ for details label Sep 18, 2024

gmega mentioned this issue Oct 2, 2024

[BUG] Codex hangs when creating a storage request for a big file using the REST api #877

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scheduling prover on another thread #808

Scheduling prover on another thread #808

elcritch commented May 14, 2024 •

edited

Loading

elcritch commented May 14, 2024

elcritch commented May 14, 2024 •

edited

Loading

elcritch commented May 29, 2024 •

edited

Loading

elcritch commented May 29, 2024

elcritch commented May 29, 2024

Scheduling prover on another thread #808

Scheduling prover on another thread #808

Comments

elcritch commented May 14, 2024 • edited Loading

elcritch commented May 14, 2024

elcritch commented May 14, 2024 • edited Loading

elcritch commented May 29, 2024 • edited Loading

elcritch commented May 29, 2024

elcritch commented May 29, 2024

elcritch commented May 14, 2024 •

edited

Loading

elcritch commented May 14, 2024 •

edited

Loading

elcritch commented May 29, 2024 •

edited

Loading