
[Design Discussion] Zk Accel API via trait object [skip ci - Don't merge] #13

Draft · wants to merge 2 commits into base: taiko/unstable

Conversation

@mratsim commented Dec 18, 2023

This draft PR should not be merged; it is for laying out the design tradeoffs of ZAL in Halo2.
See #12 for the generics approach.

This uses:

  • An extra dyn trait object to add any backend supporting MsmAccel<C: CurveAffine> to Halo2
  • An extra lifetime parameter to represent the ZalEngine lifetime.

Zal: https://github.com/taikoxyz/halo2curves/blob/zal-on-0.3.2/src/zal.rs

// The ZK Accel Layer API
// ---------------------------------------------------

pub trait ZalEngine: Debug {}

pub trait MsmAccel<C: CurveAffine>: ZalEngine {
    fn msm(&self, coeffs: &[C::Scalar], base: &[C]) -> C::Curve;
}

// ZAL using Halo2curves as a backend
// ---------------------------------------------------

#[derive(Debug)]
pub struct H2cEngine;

impl H2cEngine {
    pub fn new() -> Self {
        Self {}
    }
}

impl ZalEngine for H2cEngine {}

impl<C: CurveAffine> MsmAccel<C> for H2cEngine {
    fn msm(&self, coeffs: &[C::Scalar], bases: &[C]) -> C::Curve {
        #[allow(deprecated)]
        best_multiexp(coeffs, bases)
    }
}

impl Default for H2cEngine {
    fn default() -> Self {
        Self::new()
    }
}
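
For reference, a minimal usage sketch; the bn256 types, the commit_with_engine helper, and the coeffs/bases inputs are illustrative and not part of the PR:

use halo2curves::bn256::{Fr, G1, G1Affine};

// Hypothetical usage sketch: the caller owns the engine and hands Halo2 a
// trait object, so any backend implementing MsmAccel<C> can be swapped in.
fn commit_with_engine(engine: &dyn MsmAccel<G1Affine>, coeffs: &[Fr], bases: &[G1Affine]) -> G1 {
    // Halo2 only sees the trait object; the backend decides how the MSM runs.
    engine.msm(coeffs, bases)
}

// let engine = H2cEngine::new();
// let commitment = commit_with_engine(&engine, &coeffs, &bases);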

Issues:

  • Reasonably noisy: there is an extra 'zal lifetime parameter on some traits, though at a high level it gets merged into the 'params lifetime.
  • ZalEngine does not implement Send + Sync, and some high-level code in shplonk requires it for Plonk permutations (one possible direction is sketched below).
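
One possible direction for the second issue, not implemented in this PR, would be to require thread safety at the trait level, at the cost of forcing every backend to be thread-safe:

use std::fmt::Debug;

// Hypothetical variant: with Send + Sync as supertraits, a &dyn ZalEngine
// can be shared across rayon's parallel iterators.
pub trait ZalEngine: Debug + Send + Sync {}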

cc @einar-taiko

einar-taiko and others added 2 commits on December 13, 2023:

  • Deprecate pre-ZAL API
  • Insert patch in `Cargo.toml` for `../halo2curves`
     .map(|commitment| {
         let evals: Vec<F> = rotations_vec
-            .par_iter()
+            // .par_iter()
+            .iter()
mratsim (Author): This requires Send + Sync, but having the ZalEngine be part of Commitment prevents that.

@@ -192,7 +192,8 @@ where
     let v: ChallengeV<_> = transcript.squeeze_challenge_scalar();

     let quotient_polynomials = rotation_sets
-        .par_iter()
+        // .par_iter()
mratsim (Author): Ditto.

@@ -62,7 +62,7 @@ where
         mut msm_accumulator: DualMSM<'params, E>,
     ) -> Result<Self::Guard, Error>
     where
-        I: IntoIterator<Item = VerifierQuery<'com, E::G1Affine, MSMKZG<E>>> + Clone,
+        I: IntoIterator<Item = VerifierQuery<'com, E::G1Affine, MSMKZG<'params, E>>> + Clone,
mratsim (Author): The compiler complains here (screenshot of the lifetime error omitted), but changing to 'com or adding a 'params: 'com constraint then yields an insufficiently constrained lifetime.

@@ -58,7 +58,7 @@ where
         mut msm_accumulator: DualMSM<'params, E>,
     ) -> Result<Self::Guard, Error>
     where
-        I: IntoIterator<Item = VerifierQuery<'com, E::G1Affine, MSMKZG<E>>> + Clone,
+        I: IntoIterator<Item = VerifierQuery<'com, E::G1Affine, MSMKZG<'params, E>>> + Clone,
mratsim (Author): Same issue as in shplonk (screenshot omitted).

@mratsim (Author) commented Dec 18, 2023

While the 'com and 'params / 'zal lifetime issue can probably be solved, solving the Send + Sync issue for parallelizing rotations/permutations in Plonk would likely require a large refactoring to put the engine in a different data structure.

Stuffing the engine into those data structures was done in the first place to avoid changing function signatures everywhere, but in the end we have to change them anyway.

Hence we should probably make the engine an input only in the functions that require MSM evaluation and pass it as an extra input (see the sketch after this list):

  • we would have had to change the function signatures anyway;
  • no lifetimes to deal with in the MSM, MSMKZG, DualMSM, MSMIPA, CommitmentScheme, ... data structures;
  • relatively easy to maintain and understand for further refactoring.
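
A rough sketch of that shape; the function name is illustrative (not the actual halo2 API) and the CurveAffine import path is assumed:

use halo2curves::CurveAffine;

// Hypothetical sketch: only functions that actually evaluate an MSM take the
// engine, so the commitment-scheme structs carry no engine reference or lifetime.
fn msm_with_engine<C: CurveAffine>(
    engine: &dyn MsmAccel<C>,
    coeffs: &[C::Scalar],
    bases: &[C],
) -> C::Curve {
    engine.msm(coeffs, bases)
}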

@einar-taiko commented Dec 20, 2023

While the 'com and 'params / 'zal lifetime issue can probably be solved

Regarding the lifetime issue: I think it could make a lot of sense to use the 'static lifetime for the engine reference, implying that the engine is available for the full running time of the program.
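
For illustration, one way to obtain such a 'static reference would be to leak a one-time heap allocation at startup; the helper name is hypothetical:

// Hypothetical sketch: leak a single heap allocation at program start so the
// engine reference is 'static and no 'zal lifetime parameter is needed.
fn static_engine() -> &'static dyn ZalEngine {
    let engine: &'static H2cEngine = Box::leak(Box::new(H2cEngine::new()));
    engine
}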

@einar-taiko commented:

I think I need to better understand the nature of the ZAL engine object. Here is my current understanding:

We have a caller, e.g. the zkevm-circuits crate.
We have a callee, the halo2 proof system crate.

  1. Caller creates a single engine object (unknown size so dyn trait object) on the stack and keeps ownership.
  2. Caller passes a reference (immutable borrow) to this object to the callee who may lend multiple borrows of it internally.
  3. Callee uses these references to conduct the computations.
  4. Callee returns to caller who may or may not deallocate the engine.

If this is the way, I don't think we can avoid heavy lifetime annotation. My intuition is that if we do not annotate everywhere we pass something that contains a reference, how can the borrow checker know which lifetimes to link? This seems to be a heavy drawback of the &'zal dyn ZalEngine approach.

An alternative could be to put it in an Arc<dyn ZalEngine> on the heap and pass reference-counted smart pointers around at runtime. Since we already allocate the engine dynamically, keeping track of the references at runtime should not incur any significant overhead.
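
A sketch of that alternative, assuming the ZAL traits from this PR; note that sharing across threads would additionally require the trait object to be Send + Sync:

use std::sync::Arc;

// Hypothetical sketch: the engine lives on the heap and reference counting
// replaces lifetime annotations; each component clones the Arc it needs.
fn make_shared_engine() -> Arc<dyn ZalEngine> {
    Arc::new(H2cEngine::new())
}

// let engine = make_shared_engine();
// let for_prover = Arc::clone(&engine); // cheap pointer copy plus refcount bump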

The last approach, i.e.

make the engine an input only in the functions that require MSM evaluation and pass it as an extra input:

raises some questions I need input on:

  1. Can the engine object contain data?
  2. Can methods on the engine object be called in parallel by different threads at the same time?
  3. Should it contain a synchronized work queue?
  4. (or) Should the second thread to invoke an operation panic?
  5. Is the engine Send?
  6. Is the engine Sync?
  7. Can you copy/clone the engine object?
  8. If yes, what happens, if two engines schedule work on a single GPU?

I think it boils down to this: the API is clear, but the semantics are still in flux.

@mratsim (Author) commented Dec 20, 2023

I think I need to better understand the nature of the ZAL engine object. Here is my current understanding:

We have a caller, e.g. the zkevm-circuits crate. We have a callee, the halo2 proof system crate.

1. Caller creates a single engine object (unknown size so `dyn` trait object) on the stack and keeps ownership.

Yes for the engine object.
The caller actually knows the size, and within Halo2 it's just a pointer; dyn was there for type erasure and to avoid generics.

2. Caller passes a reference (immutable borrow) to this object to the callee who may lend multiple borrows of it internally.

3. Callee uses these references to conduct the computations.

4. Callee returns to caller who may or may not deallocate the engine.

Yes, exactly.

If this is the way, I don't think we can avoid heavy lifetime annotation. My intuition is that if we do not annotate everywhere we pass something that contains a reference, how can the borrow checker know which lifetimes to link? This seems to be a heavy drawback of the &'zal dyn ZalEngine approach.

Adding lifetimes within Halo2 is OK; having new lifetimes leak into end users like zkevm-circuits or Powdr is API breakage. If we break the API anyway, I think the smarter way is to simply not store the engine in the low-level data structures of Halo2 and just pass it around: no lifetimes, no Send + Sync issues, easy to understand, maintain, and refactor.

An alternative could be to put it in an Arc<dyn ZalEngine> on the heap and pass reference-counted smart pointers around at runtime. Since we already allocate the engine dynamically, keeping track of the references at runtime should not incur any significant overhead.

This is possible, though as Arc<&dyn ZalEngine> or Arc<Box<dyn ZalEngine>>, I think.

Regarding overhead, the whole point of Send + Sync is to allow the following section to run in parallel:

let rotation_sets = rotation_set_commitment_map
    .into_par_iter()
    .map(|(rotations, commitments)| {
        let rotations_vec = rotations.iter().collect::<Vec<_>>();
        let commitments: Vec<Commitment<F, Q::Commitment>> = commitments
            .into_par_iter()
            .map(|commitment| {
                let evals: Vec<F> = rotations_vec
                    .par_iter()
                    .map(|&&rotation| get_eval(commitment, rotation))
                    .collect();
                Commitment((commitment, evals))
            })
            .collect();

with three nested parallel loops that would all increment and decrement the Arc reference count, leading to cache flushes, for a resource that is not even used in those loops.

So we would be using Arc to allow this section to be parallel, but introducing overhead in the process.

Hence:

  1. Either this section is a bottleneck, parallelism helps, and the Arc should go (and the engine should not be stored there);
  2. or this section is not a bottleneck, we don't care about making it parallel, so we don't require Send + Sync and therefore don't require Arc.

The last approach, i.e.

make the engine an input only in the functions that require MSM evaluation and pass it as an extra input:

raises some questions I need input on:

1. Can the engine object contain data?

Halo2 doesn't need to know; it only has a pointer to it. The engine may use thread-local resources.

2. Can methods on the engine object be called in parallel by different threads at the same time?

In Halo2, there is no part of the code that calls two MSMs in parallel.

In general, I think that if we use an "accelerator", Halo2 should express concurrency to it (with the async API) and let it handle parallelism as it sees fit.

Calling the engine from arbitrary threads would also mess with thread-local storage.

For example, assuming CUDA, you would really want those concurrent MSMs to be issued on different CUDA streams; similarly for OpenCL event queues; or via approaches using a supervisor thread with a load-balancing queue to reduce contention. Only the accel runtime knows about those, not rayon.

Also, the async API is out of scope for this PR. In order of increasing implementation complexity:

  1. (Proposed) Engine always called from the same thread; the function returns when the result is there.
  2. (Async API) Engine always called from the same thread; the function returns immediately and leaves a handle. Halo2 can issue another computation (from the same thread) and receive a second handle, then wait/synchronize on those handles for the computations to be ready (a rough sketch follows below).
  3. (Send + Sync engine) Any thread can call the engine, meaning it internally requires queues for handling incoming jobs/tasks.
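
A rough, hypothetical sketch of the option-2 shape; none of these names exist in the PR:

// Hypothetical async-style ZAL sketch: issuing an MSM returns a handle
// immediately; the caller later blocks on the handle to get the result.
pub trait MsmAccelAsync<C: CurveAffine>: ZalEngine {
    type Handle;
    fn msm_begin(&self, coeffs: &[C::Scalar], bases: &[C]) -> Self::Handle;
    fn msm_wait(&self, handle: Self::Handle) -> C::Curve;
}
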
3. Should it contain a synchronized work queue?

This is a per-accelerator design decision.

4. (or) Should the second thread to invoke an operation panic?

If two MSMs can be issued in parallel, then we can use the async API, but that is not in scope here. And Halo2 doesn't issue multiple MSMs in parallel, as far as I've seen.

5. Is the engine Send?

"Cannot be moved to another thread, cannot be copied" is the least restrictive constraint for engine design. But we only use a reference to it anyway.

It is possible to allow an engine to be called from any thread using locks or lock-free queues, but debugging concurrent data structures to remove deadlocks, livelocks, and race conditions is extremely time-consuming, when it doesn't outright require formal verification.

6. Is the engine Sync?

Same as above.

7. Can you copy/clone the engine object?

No, just like you can't copy a database handle, an allocated memory region, a network connection, or a GPU. It's an uncopyable and unmovable resource, and you interact with it through a reference.

8. If yes, what happens, if two engines schedule work on a single GPU?

N/A

I think it boils down to this: the API is clear, but the semantics are still in flux.
