-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explore speedup techniques #20
Comments
Another thing: provided that batching drives cost of point add from ~30M (4+~20+4) to 11M (3+4+4), the 4+4M now become significant cost instead. Luckily, we need to compute those in separate lanes for the batch, before and after. Meaning prime target for SIMD. |
That batch invert seems pretty simple to implement. Then I guess we'd have to do something like batch p3_tobytes. For every worker have array we ge_add stuff to, then batch_p3_tobytes, then compare each. How big these arrays/batches should be? I guess I could pick some random number like 64 first and after implementing try tweaking and checking what could work better. This kind of batching has assumption that hits for correct key are rather rare, which is very probably true for the most use cases of this tool. |
Damn this just got a whole lot faster. |
Well done. Forget about what I said about SIMD, i tried few prototypes, they all suck. 128bit mulx hogs the ALU completely, and SIMD brings next to no benefit, only massive complexity. Only real speedup is switching from ge_add / p3p3 to ge_dbl / p2. Doing that would involve massive refactor though, and can only speed up things about twice from the current batch algo. |
L1 size to fit groupe elements and thread stacks vs diminishing results. If you want to get super fancy, make autotuning option which grows the batch until best timing is obtained per one mega-batch (guesstimate anywhere between 64 to 256). |
Explain. We need p3 version converted to bytes for public key check. But we can have whatever representation for computing as long as we convert to p3 later. |
Actually it'd effectively be shift of secret key (power of 2) so uhhhh only way to do it fast would be starting from lower-entropy key with upper bits zeroed. |
Wait uh. We can't do that. Shifting wouldn't work because we need certain upper bits pattern in secret key. So yeah without deeper understanding of ed25519 I'm not sure if it's good idea to touch these either. |
ge_dbl has p2 input, and p1 -> p2 is one mul cheaper than p1 -> p3. Then ge_dbl is significantly cheaper than on its own than ge_add, as it's mostly squares, not muls. As for the scalar, it's run of the mill double-exponentiation. Say you get Then you do Edit: The formula should be mul/exp, not expexp. |
As for what the scalar actually is in ed25519. The bit twiddling djb does simply ensures that it is a multiple of 8 (cofactor, super important) and that it is less than M (group order). Note that scalar which is raised to power may no longer be "multiple of 8" at first sight, but is perfectly valid (still lies in group). This is why this is checked only on generation, but later operations check only the restricted 2^252 bound. As for that, on rare occasion it falls into between M and 2^252. Most of ed25519 code can't cope with such scalar because they use bit twiddle as quick range check which can't comprehend the real M prime, just something a bit less than that. Chance of that happening is rare, but must be checked after raising the scalar to power of counter to remain interoperable with other 25519 code. |
My main question is whether we can do it.
which zeros 2 topmost bits and sets 3rd topmost bit so we essentially have only 2 bits to shift, anything more and we're truncating. |
I'm not sure why you want to shift the scalar at all. That would be horribly inefficient (you'd have to do that every round AND you'd have to do it mod M. Bit twiddles are ONLY for generation, MUST NOT BE APPLIED after further operations, such as addition or multiplication of scalar. Later on, you can only check for overflows.
Can be done if you don't wan't to bother with modular reductions - you simply detect overflow and bail. |
The reason why you can't use "bail on overflow" with ge_dbl is as you noted - you'd shift out of range pretty quickly, so you'd end up constantly reseeding. |
Heh, my lack of understanding of primitives involved shows.
I think I'm starting to understand, fe doubling is modular operation, therefore we need to do scalar exponentation in modular way too? |
Indeed. The formula I posted earlier turned to be incorrect (my confusion, for a change, deleted my misinformative posts). |
Uhh, it's Well, I'll explain my understanding.
I wasn't going to do that. I'd shift left n amount of times, where n is number of times I did |
This sorta has issue that we're going to reseed much more often, especially if we wanna more entropy. |
You have M points on the curve, and scalar is simply "index" for each point. All prime modulus math which applies on curve, applies to scalar. Except for scalar its ordinary counts, on curve you have only + and *2 operations from which you can build scalar_multiply. https://github.com/floodyberry/ed25519-donna/blob/master/modm-donna-64bit.h Note that ref10 is exceedingly clever wrt scalars, it handles only sc_muladd, which does what it says using primitive level reductions (very slow, but it hardly matters). |
Uh so M isn't power of 2 and that makes my assumptions invalid, thanks for clarification. |
|
Yes, M is a prime, and all the math in there sucks (see the barret reduction horrors in donna for reasonably fast code). The way the formula would go in practice is is compute |
Thanks for patience explaining, understood. |
@cathugger The amount of refactoring to go into this not sure if its worth it indeed. The batching is the most crucial thing, as it has speed up most dramatic. I'm considering making a general-purpose ed25519 vanitygen solely for opencl, so this is more of an excercise of "wishlist" for things to have in there. |
~16x speedup with 256 batch at 9972a83 |
Speedup is a little bit smaller (~14x) when actual filter is specified but still pretty good, doing more batching beyond 1024 helps but not as sharply. I have feeling CPU caching benefits aren't as big as gain we get avoiding multiplications. |
Ok seems at 4096 it's not faster anymore, kinda even slower so I guess CPU caching benefits win at that point. |
Things work reasonably well so far though I wanna make donna impl work before enabling this, and I'm going away from this for a while, so whoever wanna test on their own using ref10 or amd64* impls can just edit |
Oh and pass |
Ok, here is ge_dbl prototype to formalize what it is all about: package main
import (
"crypto/sha512"
"encoding/base32"
"github.com/agl/ed25519/edwards25519"
"log"
"math/big"
)
func GetScalar(seed []byte) *[32]byte {
h := sha512.New()
h.Write(seed)
digest := h.Sum(nil)
digest[0] &= 248
digest[31] &= 127
digest[31] |= 64
var hBytes [32]byte
copy(hBytes[:], digest)
return &hBytes
}
// SetBytes uses BE, but ed25519 is LE
func Flip(buf []byte) (res [32]byte) {
// Bytes() from big can return less than 32bytes. Ensure that we start from right index.
top := len(buf)-1
for i,v := range buf {
res[top-i] = v
}
return
}
func main() {
// The counter we'd find wanted onion at
counter := int64(12345)
M, _ := new(big.Int).SetString("7237005577332262213973186563042994240857116359379907606001950938285454250989", 10)
onion := base32.NewEncoding("abcdefghijklmnopqrstuvwxyz234567").WithPadding(base32.NoPadding)
var ge edwards25519.ExtendedGroupElement
// This is just standard address seeded by all-zero (for reference)
var seed [32]byte
sc := GetScalar(seed[:])
edwards25519.GeScalarMultBase(&ge, sc)
// Print reference
var buf [32]byte
ge.ToBytes(&buf)
log.Println(onion.EncodeToString(buf[:]))
// Now double it 12345 times
for i := int64(0); i < counter; i++ {
var c edwards25519.CompletedGroupElement
ge.Double(&c)
c.ToExtended(&ge)
}
// Print after 12345 times
ge.ToBytes(&buf)
log.Println(onion.EncodeToString(buf[:]))
// Now take the starting scalar, and multiply it by 2^12345 mod M
result := new(big.Int)
result.Exp(big.NewInt(2), big.NewInt(counter), M)
flipped := Flip(sc[:])
result.Mul(result, new(big.Int).SetBytes(flipped[:]))
result.Mod(result, M)
// Which should be same as public key doubled 12345 times
sc2 := Flip(result.Bytes())
var ge2 edwards25519.ExtendedGroupElement
edwards25519.GeScalarMultBase(&ge2, &sc2)
ge2.ToBytes(&buf)
log.Println(onion.EncodeToString(buf[:]))
} |
0befa41 regarding point doubling, I don't think current primitives in ed25519 implementations support proper exponent thingie, will probably need to look elsewhere (later), otherwise I don't see much other issues implementing it yet. |
Is |
@programagor yeah, they're currently not compatible (because pass x batch mainloop func isn't written). I should fix that tbh. |
@programagor added to master, will be part of next release |
Can I just ask, wouldnt the easiest way to speed this up be to split the load? if possible? |
@joshjames You can simply run mkp224o on multiple machines with a different starting seed. Individual instances don't need to communicate with each other, they will explore their own key space based on the seed. If you run two instances with the same seed, they will generate the same keys. |
additionally, yaml output can be combined with something like netcat to deliver generated keys to single server |
https://lib25519.cr.yp.to seems to be a new thing with possibly improved fe25519_mul impl in |
Vast majority of computation cost lies in ge_add. mkp uses same technique as horse, that is adding 8-multiplies to a work public point, and bumping the secret scalar by 8 in lockstep.
One step costs 4 field multiplications (ge_add), a very costly field inversion (p3_tobytes) and finally another 4 multiplications (back to projective for next add).
The heaviest thing now is the field inversion, so time for some moonmath. Turns there's this thing called Montgomery trick,:
https://github.com/dalek-cryptography/curve25519-dalek/blob/d58b663ea8eab24c3232cc92781e1ff3e24a866d/src/field.rs#L139
Where you multiply all the numbers in a batch into tower of products, invert the single last (upper most) one, and then walk the tower backwards multiplying the past scratch products with that inverse to get inverses for individual numbers in the batch. This reduces cost of inversion from 20+ muls to just 3 with a batch large enough.
Further speed up would be getting rid of ge_add, and go with point doubling instead. Which could possibly provide at least 2x more speed up, but I'm not well versed in explicit ed25519 formulas of the point formats involved yet.
The text was updated successfully, but these errors were encountered: