
RFC: How should a digest_size > hash function's output be handled? #16

Open
moreati opened this issue Oct 16, 2015 · 18 comments

@moreati

moreati commented Oct 16, 2015

How should a multihash implementation handle a call where the digest is longer than the specified output of that hash function? E.g. (pseudocode)

// The provided digest is 26 bytes long,
// but SHA1 can't produce a digest longer than 20 bytes
mh = multihash.encode(digest='abcdefghijklmnopqrstuvwxyz', code=Codes['sha1'])

Conversely, is the following multihash valid? (hex encoded, spaces added for legibility)

11 1a 6162636465666768696a6b6c6d6e6f707172737475767778797a

The code field declares the hash function is SHA-1 (0x11). The length field declares the digest is 26 bytes long, and the received digest field is 26 bytes long. However, SHA-1 can't produce a 26-byte digest.

When a multihash library is called to encode/decode such an input it could:

a. Signal an error, and stop encoding/decoding?
b. Set the length field to len(digest), then continue with processing?
c. Truncate the digest field, before continuing with processing?

The behaviour is currently unspecified, i.e. implementations can do whatever they want. AFAICT go-multihash does b. I'd like to propose a. as the required behaviour for standards-conforming implementations.
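
For illustration, a decoder implementing option a. might look like the following (a minimal sketch in Go; the digestSizes table, package name, and error messages are hypothetical, not go-multihash's actual API):

package multihash // hypothetical package, for illustration only

import "errors"

// Hypothetical maximum digest sizes for a few hash function codes.
var digestSizes = map[byte]int{
	0x11: 20, // sha1
	0x12: 32, // sha2-256
}

// decode unpacks <code><length><digest> and rejects digests longer
// than the named hash function can produce (option a).
func decode(buf []byte) (code byte, digest []byte, err error) {
	if len(buf) < 2 {
		return 0, nil, errors.New("multihash too short")
	}
	code = buf[0]
	length := int(buf[1])
	if len(buf)-2 != length {
		return 0, nil, errors.New("length field does not match digest length")
	}
	if max, ok := digestSizes[code]; ok && length > max {
		return 0, nil, errors.New("digest longer than hash function output")
	}
	return code, buf[2:], nil
}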

What do you think? Have I missed a specification document? Would you prefer I sent this to a mailing list (e.g. ipfs-users)?

@jbenet
Member

jbenet commented Oct 18, 2015

Thanks for this, great bug catch. I think this should probably be an error. Though we could define all extra bytes to always be zeros; that way one could compare sha1 and sha256 at the same length, or something.

@moreati
Author

moreati commented Oct 18, 2015

we could define all extra bytes to always be zeros; that way one could compare sha1 and sha256 at the same length, or something

I don't follow. What would be the use case for comparing digests from two different algorithms?

@jbenet
Member

jbenet commented Oct 18, 2015

@moreati Oh, for example in a Kademlia DHT, for XOR distance between digests.
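
(For concreteness, XOR distance over two equal-length digests is just a byte-wise XOR; a minimal sketch, assuming the digests have already been unpacked from their multihash framing:)

// xorDistance returns the Kademlia XOR distance between two
// equal-length digests, as a big-endian byte string.
func xorDistance(a, b []byte) []byte {
	d := make([]byte, len(a))
	for i := range a {
		d[i] = a[i] ^ b[i]
	}
	return d
}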

@moreati
Author

moreati commented Oct 18, 2015

Kademlia DHT for XOR distance between digests

How niche (or otherwise) a use case are those? Would you really use a multihash encoded digest directly in a DHT, given the biasing that occurs in the first two bytes?

I'm worried that if padding bytes are introduced then it will lead to some vulnerability or error based on a false sense of security. I would very much prefer that behaviour a (i.e. abort/signal an error) be the only conforming behaviour, or at least the strongly encouraged default behaviour.

If padding is to be allowed by the standard it should be an explicit choice, such as passing a flag padding=true, or invoking encode_padded() instead of encode().

@jbenet
Member

jbenet commented Oct 19, 2015

How niche (or otherwise) a use case are those?

It's not uncommon.

I'm not sure either way, yet. But I hear you re: errors and not creating a misleading level of bit security. That's likely the right answer.

@candeira
Contributor

+1 with @moreati for returning an error when length > length_of_digest_algorithm, both when computing a Sum() (as the utility function is called in the go-multihash implementation) and when packing a digest into a multihash value.

Padding should be handled by the application that requires comparison of hashes of different lengths.

If indeed padding is what's needed. For instance, Kademlia implementations could decide to:

  • rehash all multihashes in order to get a uniform hash length for keying and computing distances.
  • pad a hash with itself instead of with zeroes (see the sketch after this list). This would not add entropy, but would avoid clustering of shorter hashes.
  • etc.
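
A minimal sketch of the self-padding option (the helper name is illustrative, not part of any implementation):

// padWithSelf extends digest to size bytes by repeating the digest,
// rather than appending zeros, so short hashes don't all cluster
// near the zero end of the keyspace.
func padWithSelf(digest []byte, size int) []byte {
	out := make([]byte, size)
	for i := range out {
		out[i] = digest[i%len(digest)]
	}
	return out
}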

@jbenet
Member

jbenet commented Nov 20, 2015

Another option is to actually produce that many bytes of hash. For example, the "111-byte (888-bit) sha1 hash of 'foobar'":

x := "foobar"
buf := make([]byte, 111) // 888 bits.
copy(sha1(x), buf[0:20]) // 160 bits of sha1(x)
copy(sha1(x + "1"), buf[20:40]) // 160 bits of sha1(x + "1")
copy(sha1(x + "2"), buf[40:60]) // 160 bits of sha1(x + "2")
copy(sha1(x + "3"), buf[60:80]) // 160 bits of sha1(x + "3")
copy(sha1(x + "4"), buf[80:100]) // 160 bits of sha1(x + "4")
copy(sha1(x + "5"), buf[100:111]) // 88 bits of sha1(x + "5")

@candeira
Contributor

@jbenet But you can only do that when you have the blob (in this case, "foobar"), so it's useless, for instance, for doing Kademlia XOR comparisons when looking for the content corresponding to a sha1 multihash (your original example, if I understand correctly).

That's what I mean when I say that "padding (of a multihash) should be done by the application (that receives the multihash and has to operate on multihashes only, in absence of the corresponding blob)". Sorry for not being clearer earlier.

@jbenet
Member

jbenet commented Nov 20, 2015

Well, usually one ends up with a hash by finding it: it was always generated by someone with the data. And most times the hash used for DHTs and so on is a hash of some data structure (the dag, not "the file"), which needs to be found as well. I'm not sure that use case precludes the thing I mentioned.

But I'm not sure the thing I mentioned is useful in any case.

@candeira
Contributor

I feel like a fool for arguing with you about this. I'm going to put a pin in it as "maybe I need to understand it better".

@jbenet
Member

jbenet commented Nov 20, 2015

Not at all, all of this is useful discussion.

@jbenet
Member

jbenet commented Jan 24, 2016

cc @eternaleye for feedback

@eternaleye

Honestly, if you need variable-length output longer than the hash, you fundamentally don't want a hash. You want a KDF. @jbenet, your Nov 19 post was basically an attempt at constructing a KDF from scratch, which I strongly recommend against.

Once that's recognized, the sensible options for multihash collapse down to HKDF, the HMAC-based Key Derivation Function. It's efficient (does only one pass over the IKM), supports both "context" and "salt" parameters, can produce arbitrary-length output, and can be instantiated with any hash.

In addition, I consider it essential that multihash include the target length in the input to the KDF - the reason for this is that, without doing so, an attacker can silently truncate a multihash in order to permit incorrect data to pass undetected.

Even so, a strict minimum hash length is necessary in order to prevent an attacker from cutting it really short and then just brute-forcing.

Honestly, anything that is part of the multihash syntax (and thus out-of-band) should be included in the calculation (in HKDF, as part of the context) to avoid truncation attacks, algorithm substitution attacks, etc.

Another benefit of using HKDF: It uses a staged API, in which one calls two functions:

  1. Extract(IKM, salt) -> PRK // Extract a fixed-size pseudorandom key from a variable-sized input keying material and a salt
  2. Expand(PRK, context) -> OKM // Expand the PRK into arbitrary-length output keying material, bound to a specific context

This allows the expensive part (hashing the data) to be done once, and the cheap part (generating relatively short outputs) to reuse the result of that.
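
For instance, Go's golang.org/x/crypto/hkdf package exposes this staged API directly; a minimal sketch (the IKM, salt, and context values here are placeholders):

package main

import (
	"crypto/sha256"
	"fmt"
	"io"

	"golang.org/x/crypto/hkdf"
)

func main() {
	ikm := []byte("the data being hashed") // input keying material
	salt := []byte("example-salt")         // placeholder salt

	// 1. Extract: one pass over the IKM yields a fixed-size PRK.
	prk := hkdf.Extract(sha256.New, ikm, salt)

	// 2. Expand: cheap, arbitrary-length output bound to a context.
	okm := make([]byte, 64) // e.g. 512 bits
	r := hkdf.Expand(sha256.New, prk, []byte("example-context"))
	if _, err := io.ReadFull(r, okm); err != nil {
		panic(err)
	}
	fmt.Printf("%x\n", okm)
}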

@jbenet
Member

jbenet commented Aug 14, 2016

@eternaleye is there a way to do what you described with the following constraints:

  • all current multihashes remain valid as they are
  • functions that define how to "get more bits" use that as a default
  • functions that dont define how to "get more bits" use a well-defined HKDF

Sounds to me like no, given that you think we should add lengths to the input to avoid silent truncation.

BTW, silent truncation still has to run up against the expected length value. If the attacker can both modify the expected length value and truncate the digest, then they could also modify the hash digest too, so I'm not sure what's won. (In practice there may be attacks that depend on being able to modify only one byte (make the len tiny) instead of many (replace the whole digest with an attacker-chosen one).)

@jbenet
Member

jbenet commented Aug 14, 2016

Maybe TRTTD here is to define a different function, "Multihash HKDF", give it a multihash code and all, which, given some other secure multihash function, constructs an HKDF. The function can embed the multihash code of the sub-function it's using in its digest. For example:

<fn-code-of-mhkdf><length-k><fn-code-of-subfn><digest-of-len-k-minus-1>
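
A sketch of packing that layout (assuming single-byte codes and lengths for simplicity; real multihash fields may need varint encoding):

// packMHKDF lays out <fn-code-of-mhkdf><length-k><fn-code-of-subfn><digest>,
// where the length field k covers the sub-function code plus the digest,
// so the digest itself is k-1 bytes.
func packMHKDF(mhkdfCode, subFnCode byte, digest []byte) []byte {
	k := 1 + len(digest)
	out := make([]byte, 0, 2+k)
	out = append(out, mhkdfCode, byte(k), subFnCode)
	return append(out, digest...)
}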

@jbenet
Member

jbenet commented Aug 14, 2016

As for the original question, "how should a digest_size > hash function's output be handled?", it looks like an error for now.

@eternaleye

eternaleye commented Aug 14, 2016

Maybe TRTTD here is to define a different function "Multihash HKDF"

I would agree with that assessment. However, I would not recommend using the expanded output as the multihash value.

Specifically, I'd suggest the multihash value be the PRK, with the information of the underlying multihash passed in the salt.

Then, use cases which require a specific length use Expand(PRK, context = (length, use_case)) in order to derive what they need. A DHT using 256-bit IDs might use Expand(PRK, context = (256, "locator_dht_v1")), and so forth.
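
That derivation might look like this (a sketch using golang.org/x/crypto/hkdf; the package name and context encoding are made up for illustration):

package dht // hypothetical package, for illustration

import (
	"crypto/sha256"
	"io"

	"golang.org/x/crypto/hkdf"
)

// deriveDHTID expands a stored PRK into a 256-bit DHT identifier,
// binding the output to its intended length and use case.
func deriveDHTID(prk []byte) ([]byte, error) {
	context := []byte("256:locator_dht_v1") // illustrative (length, use_case) encoding
	id := make([]byte, 32)                  // 256 bits
	if _, err := io.ReadFull(hkdf.Expand(sha256.New, prk, context), id); err != nil {
		return nil, err
	}
	return id, nil
}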

@nichtich

nichtich commented Apr 9, 2019

By now, the "hash function's output length" is not explicitly known, so implementations can't do anything about this case (see #114 for a fix). Given this information I'd say:

  • either multihashes MUST fill the digest-length bytes that exceed the hash function's output length with 0x00 bytes, with an explicit warning added to the spec,
  • or make such multihashes illegal by definition,
  • or explicitly allow any values in the additional bytes, again with an explicit warning.
