Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compile time SHA256 override? #702

Open
gmaxwell opened this issue Dec 18, 2019 · 19 comments
Open

Compile time SHA256 override? #702

gmaxwell opened this issue Dec 18, 2019 · 19 comments

Comments

@gmaxwell
Copy link
Contributor

gmaxwell commented Dec 18, 2019

It's a bit silly that the library packs in its own sha256 even when the user has their own and on small devices it's a costly waste of space. In places like Bitcoin Core the environment has a fast SIMD or SHA-NI sha256, ... it actually makes a measurable difference in signing time to use SHA-NI.

It would be nice if there were some -Dlibsecp2561k_sha256init=x -Dlibsecp2561k_sha256update=y -Dlibsecp2561k_sha256final=z settings that could be used to compile time substitute in another library and leave the internal one out.

@elichai
Copy link
Contributor

elichai commented Dec 18, 2019

Last time we discussed this on IRC some people said that providing 3 functions might be a bit much and having a single sha256 function instead is better.
I think that a single one might be a bit awkward because it requires allocating a big buffer on the stack that can contain everything (and then more things you need to clean).

What are your thoughts on this?

@real-or-random
Copy link
Contributor

I think you'll want this feature only if you seriously care about performance. And in this case, you probably have the more complex API with 3 functions anyway.

@apoelstra
Copy link
Contributor

Also, in practice a single function is often harder to use - if you want to hash a series of things it's nice to just throw them all into a hash engine ... if you need to create a buffer first then (depending on your language) you may need to think about buffer allocation and indexing.

Another alternate might be to just conditionally compile out the sha256 functions, and require the user provide replacements with exactly the correct names in order for linking to succeed. This is what we do for the default error callbacks for example.

@gmaxwell
Copy link
Contributor Author

only if you seriously care about performance.

I think the vast majority of users would be ones that don't want to have an extra 19kilobytes in their binary.

Another alternate might be to just conditionally compile out the sha256 functions, and require the user provide replacements with exactly the correct names in order for linking to succeed.

That would be ducky too.

@elichai
Copy link
Contributor

elichai commented Dec 20, 2019

Unless anyone already works on this I'd like to give it a try :)

@real-or-random
Copy link
Contributor

Unless anyone already works on this I'd like to give it a try :)

Could you look into this?

@elichai
Copy link
Contributor

elichai commented Apr 24, 2020

Could you look into this?

IIRC the problem I encountered was the SHA2 context.
I remember a few solutions but none of them was great:

  1. Decide that the context needs to be size X, provide an opaque struct with unsigned array size X, and the implementer can memcpy this into his own Context back and forth (or type pun it? need to look into what the standard says on that).

  2. The implementer also needs to provide a header with the name sha256.h that defines the struct+functions using the same names/symbols we use.

  3. Provide a c_void pointer to the context in the function's API. (it then requires a reset function to reset the context, and it limits us because we can't hash two things "in parallel")

Any feedback on which is best or maybe a different solution is appreciated.

@sipa
Copy link
Contributor

sipa commented Apr 24, 2020

One possibility is keeping the SHA256 padding/chopping into blocks on the secp256k1 side, and only require external implementations to provide a SHA256 transformation function.

The API would look like:

/** Update the state (array of 8 uint32_t) pointed to by state with 64*blocks bytes of input pointed to by data. */
void sha256_transform(uint32_t *state, size_t blocks, unsigned char *data);

This means the external implementation may need to copy data from/to "array of 8 uint32_t" to their own internal representation - but I suspect almost any C-like implementation already uses that representation anyway.

@elichai
Copy link
Contributor

elichai commented Apr 26, 2020

That's an interesting idea.
I do think it will probably require the user to slightly modify his implementation to match this (I've looked at a few sha2 implementations I've found on google, and they all have their own quirks when it comes to the "transform" function)

And is there any real advantage with passing a bunch of blocks at the same time? won't it just call the transform function in a loop? or is there some SIMD/SHA-NI magic this enables?

@elichai
Copy link
Contributor

elichai commented Apr 26, 2020

I played with this, it works pretty nice, but then I realized that one of the reasons were binary size, so I measured using counting asm lines in godbolt(I'm not sure if it's a good way to compare),
and exposing RFC6979+HMAC+SHA2+SHA2_transform.
with -Os (for minimal size):
As-is: 3,523 lines of asm.
Omitting the transform function: 429 lines.
Omitting all of SHA2: 242 lines.

with -O2:
As-is: 3,844 lines.
Omitting transform: 511 lines.
Omitting all of SHA2: 236 lines.

with -O3:
As-is: 6,852 lines.
Omitting transform: 3,470 lines.
Omitting all of SHA2: 565 lines.
(-O3 is crazy lol)

Anyone who wants to play with it: https://godbolt.org/z/MPF3w9
add -DUSE_EXTERNAL_SHA2_TRANSFORM to omit the transform function.
add -DUSE_EXTERNAL_SHA2 to omit all of sha2 functions and leave only HMAC+RFC6979

@real-or-random
Copy link
Contributor

For binary size you can really just look at the file size of the binary, see #700 (comment).

add -DUSE_EXTERNAL_SHA2 to omit all of sha2 functions and leave only HMAC+RFC6979

If at all, I think we would want the opposite. We'll need SHA256 for Schnorr sigs and ECDH, but HMAC/RFC6979 is only used for deriving nonces, and you can use SHA256, too.

@elichai
Copy link
Contributor

elichai commented Apr 27, 2020

If at all, I think we would want the opposite. We'll need SHA256 for Schnorr sigs and ECDH, but HMAC/RFC6979 is only used for deriving nonces, and you can use SHA256, too.

I just had to expose some end result through godbolt to get asm, I used RFC6979 because it doesn't have any complex logic and it uses SHA256 in a complex way that prevent inlining everything and optimizing write together with finalize and initialize (so it's an example to show the asm code of SHA2 not of RFC6979)

@real-or-random
Copy link
Contributor

This really depends on whether we have a compile-time override or a runtime override but we should think about a self-test in the spirit of https://github.com/bitcoin/bitcoin/blob/99813a9745fe10a58bedd7a4cb721faf14f907a4/src/crypto/sha256.cpp#L465 .

@elichai
Copy link
Contributor

elichai commented Apr 28, 2020

I think the main advantage is actually code size, because my tests don't show big advantages for optimized sha2 implementations, although I haven't tested with SHA-NI but I don't think it's going to be a huge improvement in terms of performance, as SHA2 is really fast compared to anything EC related

@switck
Copy link

switck commented Feb 1, 2021

Im making embedded library containing libsecp256k1, see https://github.com/switck/libngu

thoughts:

  • already have a few sha256's functions in my codebase, do not want again in my binary (especially a vanilla version like this one)
  • would like to use the hmac-sha256 being carried in your library, but it's declared static
  • definitely want your deterministic nonce, since that impacts private key security
  • this is a code-size problem, not a performance concern

RE: sharing SHA256 code

  • a single-shot function is best if you plan to share code.
  • all s/w implementations use init/update/finalize but the context is never predictable
  • hardware SHA256 has context that isn't just a memory pointer+length (ie. chip resources)
  • you are hashing short messages, not megabytes, so stop/restart not needed
  • allocating space on the stack is instant; copying/appending messages onto stack is quick
  • a single-shot function can optimize for specific message lengths, since complete message size is known at start.

@real-or-random
Copy link
Contributor

@elichai Any update here? Maybe, if you currently don't have a lot of time, it would still be good to share your WIP branch, so someone else could adopt the PR.

@elichai
Copy link
Contributor

elichai commented Dec 23, 2021

@elichai Any update here? Maybe, if you currently don't have a lot of time, it would still be good to share your WIP branch, so someone else could adopt the PR.

Sadly I can't find it :(

One possibility is keeping the SHA256 padding/chopping into blocks on the secp256k1 side, and only require external implementations to provide a SHA256 transformation function.

The API would look like:

/** Update the state (array of 8 uint32_t) pointed to by state with 64*blocks bytes of input pointed to by data. */
void sha256_transform(uint32_t *state, size_t blocks, unsigned char *data);

This means the external implementation may need to copy data from/to "array of 8 uint32_t" to their own internal representation - but I suspect almost any C-like implementation already uses that representation anyway.

Do you know if implementations would want to keep the state as a __m128i instead of uint32_t? (I see that bitcoin's implementations load it to vector registers each time, but wondering if that's optimal)

@real-or-random
Copy link
Contributor

It's a bit silly that the library packs in its own sha256 even when the user has their own and on small devices it's a costly waste of space.

In places like Bitcoin Core the environment has a fast SIMD or SHA-NI sha256, ... it actually makes a measurable difference in signing time to use SHA-NI.

Going a step back, I believe that these are separate concerns. As per the discussion here, an override helps mostly to save space.

If we want to profit from hardware optimizations, this is somewhat orthogonal. An override would help here but only if the caller controls the compilation. This is true for Bitcoin Core but in general it's rather an exception. Also, if you look at Core's SHA256 implementation (https://github.com/bitcoin/bitcoin/blob/master/src/crypto/sha256.cpp), it's not clear how the override would look like. Depending on the available hardware, the essential function is either Transform, Transform_2way, Transform_4way, Transform_8way.

This is an argument in favor of optimizing our SHA256 implementation (independent of the possibility to override), and I tend to believe that this is desirable. We didn't want to do this in the past because SHA256 was an implementation detail. But as it's an integral part of Schnorr verification and signing now, things have changed. It's just somewhat messy: We'd need to duplicate code from Core (and worse, change it to C). Or move SHA256 (and maybe other stuff such as ChaCha20 for #760) to a separate library that's linked to secp256k1 and Core. Or expose the SHA256 here and make Core use them. None of this sounds great.

Thoughts?

@benma
Copy link
Contributor

benma commented Jul 14, 2024

I'd very much like being able to plug in a different sha256 implementation to save binary space.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants
@sipa @gmaxwell @real-or-random @benma @apoelstra @elichai @switck and others