Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Faster randomness sampling for field elements #9627

Merged
merged 4 commits into from
Nov 1, 2024

Conversation

Rumata888
Copy link
Contributor

@Rumata888 Rumata888 commented Oct 31, 2024

Changes the algorithm for converting uint512_ts to fields. The result is equivalent (we can make an even faster version, but without the equivalence). 1.5x faster on the mainframe, 5x in wasm
Before. Native
image
After. Native:
image

Before. Wasm:
image

After. Wasm:
image

numeric::RNG& engine = numeric::get_randomness();
for (auto _ : state) {
state.PauseTiming();
size_t num_cycles = 1 << static_cast<size_t>(state.range(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: numeric literals like 1 are usually interpreted as int types. Might make the code ever so slightly more readable by making this 1UL (32 bit unsigned long). Up to you though, this isa very petty comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@@ -485,4 +499,5 @@ BENCHMARK(sequential_copy)->Unit(kMicrosecond)->DenseRange(20, 25);
BENCHMARK(uint_multiplication)->Unit(kMicrosecond)->DenseRange(12, 27);
BENCHMARK(uint_extended_multiplication)->Unit(kMicrosecond)->DenseRange(12, 27);
BENCHMARK(pippenger)->Unit(kMicrosecond)->DenseRange(16, 20)->Setup(DoPippengerSetup)->Iterations(5);
BENCHMARK(bn254fr_random)->Unit(kMicrosecond)->DenseRange(10, 16);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when making a zk proof do we not need randomness equal to the circuit size? It might be useful to extend this to 20 to validate something odd doesn't happen for larger circuits

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we'll do it in parallel. And right now 2^20 is very slow

}

// This test shows that ((lo|hi)% modulus) in uint512_t is equivalent to (lo + 2^256 * hi) in field elements so we
// don't have to use the slow API
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the "slow API"? might be useful to make this more explicit so that this comment does not have implied context that future readers might not have

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

uint512_t q(modulus);
uint512_t reduced = source % q;
return field(reduced.lo);
constexpr field pow_2_256 = field(uint256_t(1) << 128).sqr();
Copy link
Contributor

@zac-williamson zac-williamson Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How sensitive are we to performance here? If we wanted to maximize performance, I think the following algorithm would be fastest:

  1. generate random u512 t
  2. feed t directly into a Barrett reduction algorithm, producing output t*2^{-256} mod p
  3. multiply the output by 2^{512} mod p to return t * 2^{256} mod p (which is t mod p converted into Montgomery form, which is what the underlying field type expects)

The above requires 2 modular reductions. The current method performs 3, due to converting two random u256 values into field elements (along with additional arithmetic operations)

Very minor point, I don't know how performance sensitive we are to this algorithm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our old modulo algorithm was super slow, so this is a significant improvement. I tried checking what kind of an improvement would not converting to montgomery for 256-bit chunks make and it was minimal. The problem is that now that the division algorithm is not extremely slow, the bottleneck is sampling the uint512_t.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we don't do 2^256 montgomery in wasm. It's different(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the context about the wasm. Replace 2^{256} with whatever the montgomery R parameter is and I think the above all applies.

But yes I appreciate it doesn't affect performance too much, it's the difference between 3 reductions and 2. It makes sense that generating the random bytes would be the limiting factor after your changes.

Copy link
Contributor

@zac-williamson zac-williamson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments about minor details, as well as a potential improvement. But any changes are optional so I have approved. good work!

@Rumata888 Rumata888 merged commit b98e93f into master Nov 1, 2024
46 checks passed
@Rumata888 Rumata888 deleted the is/faster_random branch November 1, 2024 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants