Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduling prover on another thread #808

Open
Tracked by #83
elcritch opened this issue May 14, 2024 · 5 comments · May be fixed by #813
Open
Tracked by #83

Scheduling prover on another thread #808

elcritch opened this issue May 14, 2024 · 5 comments · May be fixed by #813
Assignees
Labels
Client See https://miro.com/app/board/uXjVNZ03E-c=/ for details

Comments

@elcritch
Copy link
Contributor

elcritch commented May 14, 2024

The prover currently blocks the main Chronos thread. It should be scheduled on thread similar to the erasure coding as done in #716.

@elcritch
Copy link
Contributor Author

Notes on the circom-compat prover and Nim / Rust data ownership needed to do the threading properly:

  • the only current backend CircomCompat uses a Nim library nim-circom-compat
  • nim-circom-compat uses a C API exposed in a small Rust wrapper library circom-compat-ffi
  • circom-compat-ffi wraps the Rust library circom-compat, which is the core Circom / Ark library
  • the circom-compat-ffi wrapper creates a context pointer which owns the Rust memory for everything

@elcritch
Copy link
Contributor Author

elcritch commented May 14, 2024

Notes on CircomBn254Cfg:

  • CircomBn254Cfg contains all the needed context for CircomCompat on the Rust side
    • This is fairly opaque, but on the Rust side includes reading and running the wasm code
    • WASM code gen might take some time, so probably want to avoid creating one per prover run

TODO: figure out a simple way to make CircomBn254Cfg for each worker thread?

  • there's no simple way in taskpools to run setup code on each worker thread AFAIK
  • the simplest would then be to pass the needed args to create CircomBn254Cfg on prover work submission
  • each worker thread would have a thread local CircomBn254Cfg which it'd check if nil, and if so use the args to create a CircomBn254Cfg

Next handling the CircomCompatCtx is relatively simple:

  • the Nim ffi boils down to calling Circom's builder.push_input through the C API
  • I checked that each call to push_input_xx copies each bigint into memory owned by the CircomCompatCtx Rust pointer
  • I'm not 100% certain about the string, but it uses to_string which I believe creates a copy to the original string
  • calling builder.push_input copies the BigInt's into the circom-compat's builder which holds the data in a HashMap<String, Vec<BigInt>>
  • CircomCompatCtx is manually created and released on the Nim side so it should be easy to create it on the Chronos thread and then free it once the work is done on the taskpool worker thread

@elcritch elcritch linked a pull request May 16, 2024 that will close this issue
@elcritch elcritch linked a pull request May 16, 2024 that will close this issue
@elcritch
Copy link
Contributor Author

elcritch commented May 29, 2024

Updates: Adding a thread using taskpools results in memory corruption and crashes in either the Rust library code or on the Nim side after calling into the Rust library.

I've tried lots of variations on the problem. Initially my thinking was that there was something fairly simple like using sharing a piece of mutable memory between threads. However, I'm fairly competent I've tracked down all the places on the inputs where this occurs.

Things I've tried which haven't resolved issue:

  • updating circom-compat-ffi to create a new rng for each thread
  • duplicating CircomBn254Cfg on the main thread
  • explicitly duplicating CircomBn254Cfg
  • create a new CircomBn254Cfg from circuit files (wasm, zkey, etc)
  • updating circom-compat-ffi to explicitly clone builder keys using to_owned
  • not freeing / releasing memory from CircomBn254Cfg or CircomCompatCtx to see if Nim side is calling something wrong

The Nim side appears to be working fine running the builder and creating the CircomBn254Cfg or CircomCompatCtx objects, however, as soon as prove is called it begins corrupting memory . Printing Nim inputs to the builder shows corruption in the builder keys, despite using to_owned to ensure the Rust will make it's own copy of the strings.

@elcritch
Copy link
Contributor Author

Also, the proof returns appear to be working fine as are sending the arguments to the taskpool workers. Even running only a single thread worker which would prevent race conditions results in corrupted memory.

My current best guess is that something in the CircomCompat or Wasmer libraries assume they're running on the main thread, or have a thread local resource that is instantiated internally.

@elcritch
Copy link
Contributor Author

I'm going to setup a docker image so I can run valgrind on my mac and try to see what's going on.

Example parameter corruption below:

TASK: task: proof: success((a: (x: [252, 125, 205, 30, 91, 169, 208, 144, 232, 29, 242, 124, 78, 116, 186, 167, 118, 99, 221, 210, 215, 182, 33, 41, 207, 198, 113, 40, 80, 145, 99, 18], y: [173, 31, 56, 59, 77, 216, 231, 124, 23, 188, 79, 117, 200, 10, 199, 13, 116, 69, 134, 172, 56, 116, 225, 203, 14, 136, 127, 175, 192, 153, 96, 35]), b: (x: [[94, 109, 223, 96, 64, 52, 178, 201, 60, 203, 105, 119, 66, 155, 205, 209, 8, 234, 163, 208, 173, 164, 91, 83, 220, 48, 137, 72, 79, 234, 199, 29], [250, 128, 38, 89, 24, 148, 228, 232, 88, 216, 131, 10, 178, 203, 128, 242, 4, 210, 81, 200, 180, 31, 163, 127, 214, 163, 178, 66, 67, 50, 63, 45]], y: [[194, 156, 184, 152, 149, 70, 210, 6, 167, 99, 182, 182, 40, 60, 98, 158, 117, 67, 86, 45, 125, 183, 118, 193, 254, 255, 4, 210, 31, 119, 169, 6], [81, 217, 255, 215, 196, 250, 117, 10, 119, 9, 141, 91, 234, 103, 221, 104, 135, 113, 136, 254, 9, 91, 56, 152, 110, 99, 15, 199, 245, 82, 124, 40]]), c: (x: [55, 23, 205, 200, 58, 113, 165, 31, 166, 57, 161, 90, 103, 18, 239, 162, 36, 18, 25, 165, 218, 114, 234, 182, 137, 191, 139, 72, 65, 74, 51, 34], y: [108, 100, 140, 89, 122, 70, 78, 164, 197, 68, 107, 10, 252, 80, 14, 236, 73, 215, 139, 193, 93, 129, 50, 85, 108, 117, 18, 170, 79, 172, 153, 10])))
TASK: task: params POST: (slotDepth: 32, datasetDepth: 8, blkDepth: 5, cellElms: 67, numSamples: 5, r1csPath: "tests/circuits/fixtures/proof_main.r1cs", wasmPath: "tests/circuits/fixtures/proof_main.wasm", zkeyPath: "")
TASK: task: 
TASK: task: params: 0x114f0f380
TASK: task: params: (slotDepth: 32, datasetDepth: 8, blkDepth: 5, cellElms: 67, numSamples: 5, r1csPath: "tests/circuits/fixtures/proof_main.r1cs", wasmPath: "tests/circuits/fixtures/proof_main.wasm", zkeyPath: "")
TASK: task: -2001899808422488719
TASK: task spawn: params: 0x104c0e5b0
PROVE: 17
TASK: task spawn: params: 0x104c0edd0
PROVE: 18
TASK: task spawn: params: 0x104d37790
PROVE: 19
TASK: task spawn: params: 0x104c42ec0
PROVE: 20
TASK: task spawn: params: 0x104dda650
PROVE: 21
TASK: task spawn: params: 0x104c42a60
PROVE: 22
TASK: task spawn: params: 0x104c9bce0
PROVE: 23
TASK: task: proof: success((a: (x: [81, 35, 236, 68, 189, 99, 225, 10, 196, 226, 243, 68, 50, 229, 30, 152, 240, 87, 72, 199, 57, 148, 7, 211, 54, 247, 46, 101, 246, 118, 200, 42], y: [110, 79, 55, 98, 72, 199, 47, 7, 192, 80, 38, 5, 219, 105, 202, 90, 17, 243, 80, 71, 209, 92, 132, 24, 102, 191, 198, 233, 72, 220, 203, 40]), b: (x: [[228, 140, 96, 208, 2, 19, 115, 179, 216, 137, 145, 45, 135, 211, 132, 61, 237, 36, 11, 213, 234, 52, 143, 255, 111, 204, 50, 10, 76, 208, 244, 38], [99, 220, 252, 121, 17, 173, 194, 225, 147, 17, 36, 174, 113, 93, 227, 65, 147, 197, 162, 64, 105, 66, 150, 147, 132, 239, 81, 95, 24, 132, 126, 35]], y: [[82, 132, 204, 152, 211, 196, 92, 195, 40, 252, 126, 184, 31, 50, 41, 169, 69, 240, 230, 19, 95, 96, 114, 95, 41, 207, 78, 137, 21, 202, 152, 3], [153, 197, 8, 14, 251, 110, 63, 127, 55, 161, 111, 251, 170, 19, 163, 79, 113, 103, 209, 110, 35, 245, 203, 20, 84, 47, 247, 117, 184, 102, 251, 25]]), c: (x: [219, 122, 190, 26, 79, 118, 204, 95, 179, 74, 2, 48, 128, 94, 191, 208, 113, 163, 252, 107, 148, 228, 24, 230, 39, 188, 160, 187, 125, 197, 176, 33], y: [237, 41, 43, 42, 44, 61, 81, 206, 218, 6, 61, 247, 33, 229, 6, 105, 12, 216, 125, 145, 128, 144, 243, 200, 103, 100, 13, 66, 19, 154, 157, 27])))
TASK: task: params POST: (slotDepth: 32, datasetDepth: 8, blkDepth: 5, cellElms: 67, numSamples: 5, r1csPath: "tests/circuits/fixtures/proof_main.r1cs", wasmPath: "tests/circuits/fixtures/proof_main.wasm", zkeyPath: "")
TASK: task: 
TASK: task: params: 0x114f0f4c0
TASK: task: params: (slotDepth: 32, datasetDepth: 8, blkDepth: 5, cellElms: 67, numSamples: 5, r1csPath: "\x12 ��B��\x1C\x14���șo�$\'�A�d��L���\exR�U", wasmPath: "\x12 R߆;��AKV,:�D�\x13=\v���\e\x11��Q���\x15��", zkeyPath: "")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Client See https://miro.com/app/board/uXjVNZ03E-c=/ for details
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants