
Add bignum syscalls #17393

Closed
wants to merge 5 commits into from

Conversation

@FrankC01 (Contributor)

Problem

Solana does not provide big-number (bignum) support for on-chain Solana programs.

Summary of Changes

Add syscalls to cover fundamental bignum operations at a fair price relative to execution cost.
This PR is the successor to #17082
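For illustration only: the discussion below notes that the syscall implementations are backed by OpenSSL's BigNum, so the class of operation being priced looks roughly like the following native call (using the Rust openssl crate). This is a hedged sketch, not the program-side API defined in sdk/program/src/bignum.rs.

use openssl::bn::{BigNum, BigNumContext};
use openssl::error::ErrorStack;

// Native modular exponentiation via OpenSSL BigNum: the kind of primitive the
// proposed syscalls expose to on-chain programs at a metered compute cost.
fn modexp_example() -> Result<BigNum, ErrorStack> {
    let mut ctx = BigNumContext::new()?;
    let base = BigNum::from_u32(2)?;
    let exp = BigNum::from_u32(127)?;
    let modulus = BigNum::from_u32(1_000_003)?;
    let mut result = BigNum::new()?;
    // result = base^exp mod modulus
    result.mod_exp(&base, &exp, &modulus, &mut ctx)?;
    Ok(result)
}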

Fixes #

Draft reviewers:
@jackcmay
@arthurgreef
@seanyoung

codecov bot commented May 21, 2021

Codecov Report

Merging #17393 (c731ff2) into master (3b1738c) will decrease coverage by 0.1%.
The diff coverage is 0.7%.

@@            Coverage Diff            @@
##           master   #17393     +/-   ##
=========================================
- Coverage    82.3%    82.1%   -0.2%     
=========================================
  Files         433      434      +1     
  Lines      120912   122197   +1285     
=========================================
+ Hits        99586   100420    +834     
- Misses      21326    21777    +451     

FrankC01 force-pushed the bignum branch 2 times, most recently from e203cf3 to 504b55c (May 23, 2021 16:46)
@FrankC01 (Contributor, Author)

@jackcmay In addition to your insight into why this fails in stable-perf:

@arthurgreef and I recently changed the execution code of sol_bignum_mod_exp, which now uses the Ethereum formula from https://eips.ethereum.org/EIPS/eip-198

However, that formula produces a cost in gas units. My question is: what is the unit ratio? Does 1 execution unit equal 1 gas (1:1)? Or something else?

Thanks
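For reference, the EIP-198 pricing mentioned above reduces to the multiplication-complexity formula sketched below (GQUADDIVISOR = 20). This is a compact Rust rendering of the published spec for discussion, not code from this PR; lengths are in bytes and assumed small enough not to overflow u64.

// mult_complexity(x) as defined by EIP-198
fn mult_complexity(x: u64) -> u64 {
    if x <= 64 {
        x * x
    } else if x <= 1024 {
        x * x / 4 + 96 * x - 3072
    } else {
        x * x / 16 + 480 * x - 199_680
    }
}

// Adjusted exponent length; exp_head is the first (up to) 32 bytes of the
// exponent, big-endian.
fn adjusted_exponent_length(exp_len: u64, exp_head: &[u8]) -> u64 {
    // bit index of the highest set bit in the head, or 0 if the head is zero
    let mut head_bits: u64 = 0;
    for (i, &b) in exp_head.iter().enumerate() {
        if b != 0 {
            head_bits = (exp_head.len() - 1 - i) as u64 * 8 + (7 - b.leading_zeros() as u64);
            break;
        }
    }
    if exp_len <= 32 {
        head_bits
    } else {
        8 * (exp_len - 32) + head_bits
    }
}

// gas = mult_complexity(max(base_len, mod_len)) * max(adjusted_exp_len, 1) / 20
fn eip198_gas(base_len: u64, exp_len: u64, mod_len: u64, exp_head: &[u8]) -> u64 {
    mult_complexity(base_len.max(mod_len)) * adjusted_exponent_length(exp_len, exp_head).max(1) / 20
}

The open question in the comment above is then just the scaling factor between this gas figure and Solana compute units.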

(8 review threads on sdk/program/src/bignum.rs, now outdated and resolved)
@jackcmay (Contributor)

@FrankC01 I left review comments in one of the syscall implementations that I think will apply to the rest; once we resolve those and the changes are applied to the rest, I can review. Also, are you expecting any failure in the bn calcs to abort the program?

@FrankC01 (Contributor, Author)

@FrankC01 I left review comments in one of the syscall implementations that I think will apply to the rest; once we resolve those and the changes are applied to the rest, I can review. Also, are you expecting any failure in the bn calcs to abort the program?

Yes

(4 more review threads on sdk/program/src/bignum.rs, now outdated and resolved)
@mvines (Member) commented Jun 23, 2021

JIT isn't going to fix this, with jit we have the added overhead of memory translation not to mention that the C versions are faster.

Is the JIT/BPF version fast enough though? I don't see any discussion about how much slower the BPF version would be. It seems like we jumped right into "make a syscall because it'll be fastest".

The C version could be compiled into the BPF program as well, so I think we'd just be dealing with the memory translation overhead of the JIT: something that if we can optimize further will benefit all programs.

@jackcmay (Contributor)

Add syscalls to cover fundamental bignum operations at a fair price relative to execution cost

From the opening comment, this instead feels like a BPF cost model problem. Syscalls are a burden on the platform forever, IMO they should only be used to access runtime information.

I definitely see that point of view; it can be a slippery slope to continue adding syscalls for computation-intensive operations. We've been adding syscalls for some of the general and commonly used but expensive operations (hashing, memory movement, etc.), and I think bignum and ecrecover fall into that category, especially in terms of pure performance of the chain vs. charging more to go slower.

@mvines (Member) commented Jun 23, 2021

Ya, ecrecover I'm not going to push back on because EVM. Here though I don't see the full trade-off yet. The part that's missing for me is the X in "Hi, we tried to do X on Solana and failed because we don't have bignum syscalls"

@jackcmay (Contributor)

JIT isn't going to fix this, with jit we have the added overhead of memory translation not to mention that the C versions are faster.

Is the JIT/BPF version fast enough though? I don't see any discussion about how much slower the BPF version would be. It seems like we jumped right into "make a syscall because it'll be fastest".

The C version could be compiled into the BPF program as well, so I think we'd just be dealing with the memory translation overhead of the JIT: something that if we can optimize further will benefit all programs.

I think @FrankC01 originally tried to use the OpenSSL BigNum library in-program and it exceeded the instruction count. Not sure if there were execution-time measurements done. @FrankC01 have you measured the time it took to do it in-program (given an expanded compute budget) vs doing it as a syscall?

@FrankC01 (Contributor, Author) commented Jun 23, 2021

@mvines @jackcmay - We originally started with our program, grant application attached, to achieve constant time/space NFT Mint, Transfer and Burn. For this we needed RSA accumulator verification on-chain. The underlying capabilities needed are:

  1. Big Number math operations
  2. Crypto hashes (hash-to-prime, hash-generator)

We first tried num-bigint-dig, a fork of num-bigint with crypto operations, but we exceeded the compute-unit budget by a very large amount (we hacked the execution-units budget just to make it run). This is exemplified by our program's Mint instruction, which uses functions that verify (in constant time) the membership/non-membership of an NFT token in a tracking account as well as in a token-account state accumulator.

membership/non-membership verification:

This consists of:
hash_to_prime, which finds the next prime-number hash; with that result and other BigNumbers we then compute:
let left = f.mul(&f.mul(&f.exp(&self.q, &l), &f.exp(&self.u, &self.r)), &f.exp(g, &(&alpha.mul(&self.r))));
This consumed 121,000 units for each proof, of which there are 4.
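Read literally, the quoted expression computes left = q^l * u^r * g^(alpha * r) mod N. A plain num-bigint rendering, assuming f performs modular arithmetic over the accumulator modulus n and all values are non-negative (names mirror the snippet, not the project's actual types):

use num_bigint::BigUint;

// left = q^l * u^r * g^(alpha * r)  (mod n), reducing after each step
fn membership_lhs(
    q: &BigUint, l: &BigUint,
    u: &BigUint, r: &BigUint,
    g: &BigUint, alpha: &BigUint,
    n: &BigUint,
) -> BigUint {
    let t1 = q.modpow(l, n);
    let t2 = u.modpow(r, n);
    let t3 = g.modpow(&(alpha * r), n);
    t1 * t2 % n * t3 % n
}

Each modpow here is what blows through the compute budget when executed as BPF, and what the proposed syscalls would perform natively.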

At this point we decided to add the underlying big-number syscalls locally, and took the opportunity to use OpenSSL because it has much better performance, an is_prime test, and better support (more active development and testing than num-bigint-dig). For the compute-unit costs we used two algorithms from Ethereum that charge gas for both simple and complex math.
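For context, a minimal hash-to-prime search using OpenSSL's primality test looks like the sketch below. This is a generic illustration of the technique, not the project's hash_to_prime; the Miller-Rabin round count of 20 is arbitrary.

use openssl::bn::{BigNum, BigNumContext};
use openssl::error::ErrorStack;
use openssl::sha::sha256;

// Hash the input, then walk forward (odd candidates only) until a probable
// prime is found.
fn hash_to_prime(input: &[u8]) -> Result<BigNum, ErrorStack> {
    let mut ctx = BigNumContext::new()?;
    let mut candidate = BigNum::from_slice(&sha256(input))?;
    candidate.set_bit(0)?; // force the candidate odd
    let two = BigNum::from_u32(2)?;
    loop {
        if candidate.is_prime(20, &mut ctx)? {
            return Ok(candidate);
        }
        let mut next = BigNum::new()?;
        next.checked_add(&candidate, &two)?;
        candidate = next;
    }
}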

Having completed that, we also recognized the potential demand from other Solana programs that could benefit from consistent and performant BigNumber and crypto operations (e.g. is_prime), and approached @jackcmay about creating this PR.

Once the BigNumber syscalls were approved, we planned to quickly follow up with syscalls for crypto hashes.

Grant application

@FrankC01 (Contributor, Author) commented Jun 25, 2021

@jackcmay @aeyakovenko @t-nelson @sakridge @mvines

Team: here is more information on why we believe we need syscalls for crypto; the PR under analysis is just the start.

We developed a PoC for verifying GM17/BN128 proofs generated by ZoKrates. The proof was verified using arkworks-rs, a pure-Rust crypto library that uses num-bigint.

This capability is another contribution we were hoping to add to the Solana platform. However, we hit a number of limits right off the bat:

  1. We had to increase the BpfComputeBudget to over 4.5 M max_units (a sketch of this override appears after this list).
  2. This ran further but ultimately failed by exhausting the memory limit during proof verification[1]:
fn process_zokrates(_program_id: &Pubkey, _accounts: &[AccountInfo]) -> ProgramResult {
    msg!("Processing ZoKrates");
    // Deserialize the proof and verification key embedded in the program as JSON
    let proof: Proof<ProofPoints<G1Affine, G2Affine>> = serde_json::from_str(PROOF).unwrap();
    let vk: VerificationKey<G1Affine, G2Affine> = serde_json::from_str(VKEY).unwrap();
    // Pairing-based GM17 verification over BN128 via the arkworks backend
    let is_good = verify::<Bn128Field, GM17, Ark>(vk, proof);
    msg!("zokrates verify = {}", is_good);

    Ok(())
}
  3. When we narrowed the code down by commenting out the actual verification step, we still consumed 60 K units just in creating the proof and verification-key objects. However, that errored out as well with Max frame depth reached: 18.
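For context, the budget override in item 1 (when running under solana-program-test) would have looked roughly like the following; treat set_bpf_compute_max_units as an assumption about that era's test API rather than a statement about what the PoC actually called.

use solana_program_test::ProgramTest;

// Raise the per-instruction compute budget far beyond the mainnet limit just
// to let the verification run at all in a local test.
fn zokrates_poc_test() -> ProgramTest {
    let mut pt = ProgramTest::default();
    pt.set_bpf_compute_max_units(4_500_000);
    pt
}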

We see this approach to using syscalls as enhancing Solana so that developers have more tools to create scalable applications with privacy.

[1] - The elliptic curve pairing operations are implemented in Ethereum precompiles to reduce GAS costs.

@mvines (Member) commented Jun 28, 2021

@Lichtso / @dmakarov - What do you think? Is it a fool's errand to try to build all this in BPF, and instead we just need to accept the syscalls into native?

@dmakarov (Contributor)

@Lichtso / @dmakarov - What do you think? Is it a fool's errand to try to build all this in BPF, and instead we just need to accept the syscalls into native?

A syscall into native is easier to implement, and it is probably faster at run-time.

@mvines (Member) commented Jun 28, 2021

Certainly, but it also adds to the ongoing maintenance burden of the runtime and appears to bring unsafe C external dependencies into the consensus path.

@Lichtso (Contributor) commented Jun 28, 2021

From a JIT perspective, syscalls are easy to compile but the most expensive thing at runtime because of the two context switches: saving the registers, switching the stack, rebasing the instruction meter, etc.

But again, what are the alternatives? Fused ops, static analysis?

@arthurgreef (Contributor)

@mvines. Just thought I'd add that Solana would not be the first to add an external C library into a consensus path. Cardano uses gmp as the backing C library for arbitrary length integers. They do mention that the implementation is in assembly and that may result in different cost functions per processing architecture.

@mvines (Member) commented Jun 29, 2021

@mvines. Just thought I'd add that Solana would not be the first to add an external C library into a consensus path. Cardano uses gmp as the backing C library for arbitrary length integers.

This doesn't mean it's ok but this is an interesting data point, thanks!

They do mention that the implementation is in assembly and that may result in different cost functions per processing architecture.

This seems less than ideal. Perhaps BPF assembly is the way to go here. It seems like with frequent use in a program, the syscall overhead might come into play in the actual wallclock execution time.

I'm still generally of the opinion that it would be better to make these libraries work within the BPF environment. It certainly stresses the compute model and perhaps the VM/JIT, which is a much harder problem to solve than escaping to a syscall but ultimately is probably a more useful problem to solve for the platform as a whole.

@FrankC01 (Contributor, Author)

@mvines. Just thought I'd add that Solana would not be the first to add an external C library into a consensus path. Cardano uses gmp as the backing C library for arbitrary length integers.

This doesn't mean it's ok but this is an interesting data point, thanks!

They do mention that the implementation is in assembly and that may result in different cost functions per processing architecture.

This seems less than ideal. Perhaps BPF assembly is the way to go here. It seems like with frequent use in a program, the syscall overhead might come into play in the actual wallclock execution time.

I'm still generally of the opinion that it would be better to make these libraries work within the BPF environment. It certainly stresses the compute model and perhaps the VM/JIT, which is a much harder problem to solve than escaping to a syscall but ultimately is probably a more useful problem to solve for the platform as a whole.

Just to be sure, we are not recommending GMP (with its large amount of assembly code) instead of OpenSSL; we don't want to conflate the primary discussion.

@jon-chuang (Contributor) commented Jul 6, 2021

Rather than maintaining so many syscalls directly in the bpf_loader, which is a huge burden and has high context-switch overhead, a cryptographic/bignum VM might be an alternative.

Basically this is @Lichtso's fused-ops comment taken to its conclusion, together with an extension of the idea of batching invokes (#18428).

There are two approaches here:

  1. The VM's opcodes are serialized into a &[u8] buffer.
  2. The fused operations are serialized into the instruction data.

For 2., for instance, a single fused operation and its operands (in terms of data in keyed accounts) can be serialized accordingly; the syscall can then be handled by a single entrypoint whose feature set can be extended separately from the bpf_loader and the BPF stack.

This separates concerns between the bpf VM stack and native optimised bignum operations.
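As a hedged sketch of approach 2, a fused big-number operation could be serialized into instruction data with something like the enum below (the names and the use of borsh are illustrative assumptions, not part of this PR):

use borsh::{BorshDeserialize, BorshSerialize};

// One fused operation per instruction; operands are big-endian byte vectors.
#[derive(BorshSerialize, BorshDeserialize)]
enum FusedBigNumOp {
    // (base ^ exp) mod modulus
    ModExp { base: Vec<u8>, exp: Vec<u8>, modulus: Vec<u8> },
    // (a * b) mod modulus
    ModMul { a: Vec<u8>, b: Vec<u8>, modulus: Vec<u8> },
}

fn encode(op: &FusedBigNumOp) -> Vec<u8> {
    op.try_to_vec().expect("in-memory serialization cannot fail")
}

The single native entrypoint would then deserialize the op, execute it with an optimized bignum backend, and charge a per-op compute cost.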

Still, a cryptographic or bignum VM is on the consensus path.

So keeping it lightweight to ensure correctness while achieving the efficiency goals might be challenging.

Extra maintenance costs include getting the compute costs right for the VM's primitive ops, but per-syscall compute costs are already a comparable maintenance burden for the approach in this PR.

If we are talking about long-term maintainability, in terms of supporting more cryptographic operations beyond those in this PR, this is the direction I see the most promise in. But getting there takes some work.


Taking things further: an extensible NativeVM that handles native operation fusion. #18465

@jackcmay (Contributor) commented Jul 6, 2021

@jon-chuang Would you please open a new issue for the fused-op VM idea? We should debate that on its own.

@jackcmay (Contributor) commented Jul 6, 2021

@mvines. Just thought I'd add that Solana would not be the first to add an external C library into a consensus path. Cardano uses gmp as the backing C library for arbitrary length integers.

This doesn't mean it's ok but this is an interesting data point, thanks!

They do mention that the implementation is in assembly and that may result in different cost functions per processing architecture.

This seems less than ideal. Perhaps BPF assembly is the way to go here. It seems like with frequent use in a program, the syscall overhead might come into play in the actual wallclock execution time.

I'm still generally of the opinion that it would be better to make these libraries work within the BPF environment. It certainly stresses the compute model and perhaps the VM/JIT, which is a much harder problem to solve than escaping to a syscall but ultimately is probably a more useful problem to solve for the platform as a whole.

I like the direction but it looked like earlier attempts at using these libraries with BPF resulted in ~4M instructions, well above the 200k we now allow and probably clock-time a lot slower than native. I have my doubts that even with some heavy optimizations that we could get the BPF solution down enough to work, even in the mid-term. We already have mem-syscalls (more syscalls) that can help with memory movement which BPF has been notoriously bad at. @Lichtso @dmakarov do you have a sense of what kinds of optimizations we could do in the mid-term to help? We've also talked about expanding the BPF instruction set, maybe adding some more advanced but primitive instructions would help?

@dmakarov (Contributor) commented Jul 6, 2021

I like the direction but it looked like earlier attempts at using these libraries with BPF resulted in ~4M instructions, well above the 200k we now allow and probably clock-time a lot slower than native. I have my doubts that even with some heavy optimizations that we could get the BPF solution down enough to work, even in the mid-term. We already have mem-syscalls (more syscalls) that can help with memory movement which BPF has been notoriously bad at. @Lichtso @dmakarov do you have a sense of what kinds of optimizations we could do in the mid-term to help? We've also talked about expanding the BPF instruction set, maybe adding some more advanced but primitive instructions would help?

Adding more instructions is similar to adding intrinsic functions. These functions would be implemented in native code by the VM.

@jon-chuang (Contributor) commented Jul 7, 2021

Edit: actually, according to #17720 (comment), BPF program wall-clock time is 73-300x slower for ecrecover compared to several native implementations. But I'm not sure whether that was JITed or interpreted.

Edit: it's almost certainly interpreted, since ProgramTest sets use_bpf_jit to false by default. Let me bench locally; data is needed.


At this point I have to say I'm not convinced the issue is with the efficiency of bpf.

So far, regarding cryptographic operations, which are ALU-intensive rather than memory-intensive, I have yet to see an actual benchmark showing that the wall-clock time of JITed BPF is significantly slower (although one expects it to be somewhat slower).

As suggested in solana-labs/rbpf#185, it could be that the BPF uniform op-cost model is unfairly tuned towards ALU-heavy computations.

The right solution might be to increase the opcosts for memory ops, and increase the overall compute budget proportionally.

To determine if this is the right action, we need benchmarks of JITed BPF versions of cryptographic operations like bignum and elliptic curves.

@FrankC01 , would you be able to provide this data to help move things forward?

@jon-chuang (Contributor) commented Jul 7, 2021

Btw @jackcmay, if we're talking about the performance of bytecode, it may be instructive to consider that a WASM secp256k1 in this bitcoin lib is shown to achieve performance comparable to the native Rust libsecp256k1:
0.00025195263s / op

recall:

Library        Time (sec)
libsecp256k1   0.0001925
k256           0.0001561
secp256k1      0.00004772

By contrast the bpf code ran in 0.014s: #17720 (comment)

If it turns out WASM is strictly faster than eBPF, the correct thing to do might be to support WASM in a cross-program invocation, as a separate VM, just as was proposed for MoveVM and, I think, is proposed for NeonEVM (?), instead of intrinsics and syscalls.

Programs needing near-native performance can then leverage the WASM VM rather than compiling to eBPF.


Here are some resources:
eBPF and WASM comparison - (interpreted slowdown is 30-60x compared to native)
WASM slowdown vs. Native - shows about 1.5-2x slowdown, which seems acceptable.


I decided to bench samkim's bpf-ristretto.

Here's the time taken for Edwards curve scalar muls:

Operation                               Time (s)
Edwards curve scalar mul (interpreted)  1.02
Edwards curve scalar mul (JIT)          0.036218524
Edwards curve scalar mul (native)       0.00005913

This is pretty much insanely slow - a more than 500x slowdown even when JITed...!

Here are more detailed benchmarks
The code for performing the benchmarks is available here: https://github.com/jon-chuang/ristretto-bpf-count/tree/jon-chuang/bench-bpf-jit

Operation                 Native          JITed              Interpreted
test_field_add            2 ns/op         2012 ns/op         40267 ns/op
test_field_mul            20 ns/op        13273 ns/op        422687 ns/op
test_scalar_add           66 ns/op        44262 ns/op        1549212 ns/op
test_scalar_mul           90 ns/op        57151 ns/op        1810822 ns/op
test_edwards_decompress   8039 ns/op      3621537 ns/op      100665402 ns/op
test_edwards_add          14085 ns/op     7440229 ns/op      221401182 ns/op
test_edwards_mul          62335 ns/op     38426995 ns/op     1005777127 ns/op

Here are the results for test_edwards_mul when compiling with a reduced-efficiency mul using only 32, 32 -> 64 bit muls:
106546ns / op. This is still a 350x performance degradation...
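To make concrete what "32, 32 -> 64 bit muls" means here: BPF has no 64x64 -> 128-bit widening multiply, so limb products have to be assembled from 32-bit halves, whereas an optimized native implementation relies on the full widening form. A minimal illustration:

// Widening multiply available to the BPF target: 32 x 32 -> 64 bits.
fn mul_32x32(a: u32, b: u32) -> u64 {
    (a as u64) * (b as u64)
}

// Widening multiply an optimized native field implementation uses:
// 64 x 64 -> 128 bits.
fn mul_64x64(a: u64, b: u64) -> u128 {
    (a as u128) * (b as u128)
}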


https://eprint.iacr.org/2019/542.pdf shows that libsodium when compiled down to WASM (emscripten) has a 0.00024s/op for verify. That's merely 5x slower than an optimised implementation (e.g. https://github.com/dalek-cryptography/ed25519-dalek). This suggests that WASM is a lot faster than eBPF.

Here, WASM is benchmarked against the EVM.

Edit: Here is some information about how much slower WASM is compared to native for EC crypto operations:
https://www.mdpi.com/2079-9292/9/11/1839/pdf
Table 4. shows a 30-75x degradation in performance compared with libsodium.

@Lichtso (Contributor) commented Jul 7, 2021

@jon-chuang Can you send me (alexander solana com) the exact ELF files of the benchmark you ran?
I would like to disassemble, profile, and analyze them; if I compile them myself they might turn out slightly different.

@Lichtso (Contributor) commented Jul 8, 2021

@jon-chuang Thanks, I received the ELF and analyzed it statically (no profiling so far) like this:

First I stripped all functions which are not directly involved in the ec_math, like deserialization, formatting, allocation, dropping, memcpy, etc. Then I ran:

rbpf_cli -e ec_math.so -u disassembler > ristretto-bpf-count.s
awk '{if ($1 !~ /:/) arr[$1] += 1} END {for (i in arr) print i, arr[i]}' ristretto-bpf-count.s

Which already reveals the problem:

mov64   1922
ldxdw   1577
stxdw   1195
add64   1102
call    330
rsh64   251
lddw    219
and64   197
stxb    196
jgt     173
lsh64   96
xor64   89
sub64   76
or64    62
ldxb    41
mul64   33
exit    24
arsh64  23
jeq     16
ja      11
jne     10
neg64   5
syscall 3


In other words, one third of all instructions in the arithmetic functions are memory accesses. And we know that these are extremely slow, so they are most likely responsible for the poor performance.

@jon-chuang (Contributor) commented Jul 8, 2021

@Lichtso, one thing to note is that the compute cost for the edwards_mul operation is 3.4M and the execution time is 34,000us, which works out to 100 CU/us. The cybercore secp256k1 benchmark showed 200 CU/us. These are already 2x and 4x better than the mainnet median, respectively. However, they are 50x slower than an unoptimised (reference) Rust implementation.

This makes sense, since cryptography is typically more ALU-intensive than memory-intensive; it suggests that the performance impact of even a relatively small number of loads/stores dwarfs that of the ALU ops, and that the impact on mainnet programs is even more pronounced.

While the static instruction count is informative, it's not perfect. It would be better to profile the opcode-type counts as executed (for instance for test_edwards_mul), if that is possible, rather than count their occurrence in the ELF.

stale bot commented Jul 30, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the stale label Jul 30, 2021
stale bot commented Aug 10, 2021

This stale pull request has been automatically closed. Thank you for your contributions.

stale bot closed this Aug 10, 2021