Hardcoded assumptions that the hash will be encoded in hex #31
This is a known limitation of the package. We actually considered parameterizing the encoding with the algorithm itself, but found the variation not to be conducive to wider distribution.

> @stevvooe confirmed this interpretation in recent comments as well. The idea is that a future algorithm may choose a non-hex encoding like base 64.

Please don't confuse my desire to reserve something for future implementation as an endorsement of running out and implementing something now.

> Updating go-descriptor to use those instead of the currently-hard-coded hex assumptions.

What is this?

IMHO, the implementation in #30 goes a little further than it needs to (not to mention the massive compatibility breaks). In fact, there was early code (not sure if I checked it in) that attempted this.

The biggest invariant that the PR misses is that the format, encoding and actual hash algorithm are all linked. For example, `sha256` means hex-encoded sha256, prefixed with `sha256`. For example, `sha256` with base64 encoding might be `sha256+b64`, however, the utility of this is really uncertain, in practice.

When you look at digest as an API format, variation isn't a feature. If we look at other approaches to this problem, such as multihash, and actual uses, the most important feature is that they are easy to calculate and easy to compare. When you vary the encoding for things that are "equivalent", you break those features. Anything more complicated than "put the bytes in, get the hash out" means people won't verify and leverage the properties of the system to make it more secure; their eyes glaze over and the exploit surface area gets bigger.

Typically, the need for choosing the encoding for a digest is centered around reducing the storage size. However, they all come with trade-offs. The most obvious is base64, which can achieve about a 30% savings over hex, but you sacrifice case invariance. base32 will give you back case invariance but doesn't quite get the same savings. I have experimented with base36, as well, which is compact, case insensitive and easy on the eye. However, none of these provide even close to the benefit of using the same hash. This is powerful, simple, and overlooking that value is only for the unwise.

Under storage, this necessity isn't the case. If you are storing a large number of hashes (non-zero percentage of total data), I'd recommend encoding directly to binary. Conversion back to the hex encoding is cheap and efficient. I have considered adding binary unmarshaler/marshalers but have balked on the complexity and expense of maintaining format compatibility over different package versions.
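To put numbers on those trade-offs, here's a quick standard-library sketch comparing the encoded lengths of a single 32-byte sha256 digest (the byte counts are the point, not the API):

```go
package main

import (
    "crypto/sha256"
    "encoding/base32"
    "encoding/base64"
    "encoding/hex"
    "fmt"
)

func main() {
    sum := sha256.Sum256([]byte("foo")) // 32 raw bytes

    fmt.Println(len(hex.EncodeToString(sum[:])))                   // 64 chars, case insensitive as a format
    fmt.Println(len(base32.StdEncoding.EncodeToString(sum[:])))    // 56 chars (with padding), case insensitive
    fmt.Println(len(base64.RawStdEncoding.EncodeToString(sum[:]))) // 43 chars, ~33% smaller than hex, case sensitive
}
```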
TL;DR: This is a feature, not a bug.
On Tue, Mar 07, 2017 at 04:26:21PM -0800, Stephen Day wrote:
> This is a known limitation of the package. We actually considered parameterizing the encoding with the algorithm itself, but found the variation not to be conducive to wider distribution.
Can you link to the discussion? I didn't turn up anything that looked
like it with [1].
> > @stevvooe confirmed this interpretation in recent comments as well. The idea is that a future algorithm may choose a non-hex encoding like base 64.
> Please don't confuse my desire to reserve something for future implementation as an endorsement of running out and implementing something now.
There is a distinction there, but I don't see *how* you'd do that
unless you make the encoding part of the algorithm. In #30, I wasn't
adding any non-hex encodings, I was just adjusting the existing API so
that you could gracefully add non-hex encodings if/when we want to.
> > Updating go-descriptor to use those instead of the currently-hard-coded hex assumptions.
> What is this?
Changes like [2], where we replace hardcoded hex assumptions with
calls to Algorithm.Encoding and similar.
> IMHO, the implementation in #30 goes a little further than it needs to (not to mention the massive compatibility breaks).
As I said in #30, let me know if you want me to break it into smaller
chunks. But adding an Encoding interface isn't much use unless we
have some sort of agreement on the final target. #30 (and now this
issue) is my attempt to show that target, but obviously you could make
different decisions and still end up with something that works.
> The biggest invariant that the PR misses is that the format, encoding and actual hash algorithm are all linked. For example, `sha256` means hex-encoded sha256, prefixed with `sha256`.
In [3]:

```go
SHA256 = algorithm{
    name:     "sha256",
    hash:     crypto.SHA256,
    encoding: Hex,
}
```
which is exactly what you say. So I'm not sure what I'm missing.
> For example, `sha256` with base64 encoding might be `sha256+b64`, however, the utility of this is really uncertain, in practice.
Right, which is why “Alternative solutions include giving up on alternatives and just requiring all hashes to be hex-encoded” [4]. But I don't think you can have it both ways. If the intention is to keep the door open to alternative hash encodings, I think we want to make whatever API adjustments are needed to keep that possible now, or at least as early as possible, because any unavoidable backwards-compat issues only get more painful with time.
> When you look at digest as an API format, variation isn't a feature. If we look at other approaches to this problem, such as multihash, and actual uses, the most important feature is that they are easy to calculate and easy to compare. When you vary the encoding for things that are "equivalent", you break those features.
I'm not proposing varying an encoding for a given algorithm
identifier. I'm proposing an API that supports new algorithm
identifiers which use alternative encodings (which is an option that
you're currently holding open [5]). And again, I'm also fine if we
stop holding that option open and just require all hashes to be
hex-encoded forever.
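For concreteness, here's what such an alternative identifier might look like (the `sha256+b64` form is hypothetical; nothing like it exists in the package today). The same hash bytes render as two strings that can never be compared byte-for-byte:

```go
package main

import (
    "crypto/sha256"
    "encoding/base64"
    "encoding/hex"
    "fmt"
)

func main() {
    sum := sha256.Sum256([]byte("foo"))

    // Today's canonical form: algorithm identifier plus hex.
    fmt.Println("sha256:" + hex.EncodeToString(sum[:]))

    // A hypothetical base64-flavored identifier. Identical hash bytes,
    // but a string comparison with the hex form always fails.
    fmt.Println("sha256+b64:" + base64.RawStdEncoding.EncodeToString(sum[:]))
}
```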
> Typically, the need for choosing the encoding for a digest is centered around reducing the storage size. However, they all come with trade-offs. The most obvious is base64, which can achieve about a 30% savings over hex, but you sacrifice case invariance.
We have *theoretical* case invariance now. But the equality check in
[6] means that nobody is actually using it. For example:
```console
$ cat a.go
package main

import (
    _ "crypto/sha256"
    "fmt"

    "github.com/opencontainers/go-digest"
)

func main() {
    digestA := digest.Digest("sha256:B5BB9D8014A0F9B1D61E21E796D78DCCDF1352F23CD32812F4850B878AE4944C")
    verifier := digestA.Verifier()
    verifier.Write([]byte("foo\n"))
    fmt.Printf("verified? %v\n", verifier.Verified())
}
$ go run a.go
verified? false
```
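(The check fails because the verifier in [6] compares the digest strings byte-for-byte, and Go's hex encoder emits lowercase, so the uppercase digest above can never match.)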
> However, none of these provide even close to the benefit of using the same hash. This is powerful, simple, and overlooking that value is only for the unwise.
I think you mean “using the same algorithm identifier”, since
encodings are completely orthogonal to which *hash* you use. And I
agree that “we'll always use hex forever” is simpler. And in that
case, we should close the door on other encodings.
> Under storage, this necessity isn't the case. If you are storing a large number of hashes (non-zero percentage of total data), I'd recommend encoding directly to binary. Conversion back to the hex encoding is cheap and efficient. I have considered adding binary unmarshaler/marshalers but have balked on the complexity and expense of maintaining format compatibility over different package versions.
It's not really all that complicated or expensive to maintain this
[7]. Although I'm not clear on the wrinkle you're adding with “over
different package versions”. Maybe you're just saying “but I don't
want to write the data migration for a registry based on the
hex-encoded hashes”. And that seems fair, but I don't think it means
there are any problems with supporting decoding to the raw hash bytes
in go-digest. Registries based on go-digest can use that decoding
option or not as they see fit.
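For what it's worth, the round trip really is small. A sketch using the package's existing hex helpers (`Hex` and `NewDigestFromHex`), assuming no new marshalers at all:

```go
package main

import (
    "encoding/hex"
    "fmt"

    "github.com/opencontainers/go-digest"
)

func main() {
    d := digest.Digest("sha256:b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c")

    // Store 32 raw bytes instead of the 64-character hex string.
    raw, err := hex.DecodeString(d.Hex())
    if err != nil {
        panic(err)
    }
    fmt.Println(len(raw)) // 32

    // Converting back for display or comparison is cheap.
    round := digest.NewDigestFromHex(string(d.Algorithm()), hex.EncodeToString(raw))
    fmt.Println(round == d) // true
}
```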
[1]: https://github.com/docker/distribution/issues?utf8=%E2%9C%93&q=digest%20encoding
[2]: https://github.com/opencontainers/go-digest/pull/30/files#diff-f6da2430393ce391e4517013b6d3a95bL48
[3]: https://github.com/opencontainers/go-digest/pull/30/files#diff-eef025f7a17623416f835581c12f0a3cR71
[4]: #31 (comment)
[5]: #3 (comment)
[6]: https://github.com/opencontainers/go-digest/blob/v1.0.0-rc0/verifiers.go#L44
[7]: https://github.com/opencontainers/go-digest/pull/30/files#diff-c1e0b73f3e74e3389ee4dab001672f51R43
If this is a feature, can we at least make it an explicit feature?
What do you propose? It is already explicit and part of the format and design of this package. It has been validated in use shipping millions of container images.
Also fixes a typo and adds one clarifying link in the README. Fixes opencontainers#31
Uh, I have no idea what the relevance of that is. Posted my proposal here: #32
The docs for `Algorithm` (now `algorithm`) made it clear that the algorithm identifier was intended to cover both the hash and encoding algorithms. @stevvooe confirmed this interpretation in recent comments as well. The idea is that a future algorithm may choose a non-hex encoding like base 64.

The current implementation, on the other hand, bakes the hex encoding into key locations (e.g. in `NewDigestFromBytes` and `Digest.Validate`). I suggest:

- Adding an `Encoding` interface.
- Adding an `Algorithm.Encoding() Encoding` method.
- Adding an `Algorithm.HashSize() int` method.

I've floated an implementation in #30 if folks want to see how that works out. A sketch of one possible shape for these additions follows below.

Alternative solutions include giving up on alternatives and just requiring all hashes to be hex-encoded. Or some other API for pushing encoding information into the `Algorithm` instances.

Thoughts?
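For concreteness, one possible shape for those additions. This is a sketch only: the method signatures beyond the names `Encoding` and `HashSize` are guesses, not necessarily what #30 implements.

```go
package digest

import "encoding/hex"

// Encoding converts raw hash bytes to and from their string form.
type Encoding interface {
    EncodeToString(src []byte) string
    DecodeString(s string) ([]byte, error)
}

// hexEncoding wraps the standard library and would back every
// currently-defined algorithm identifier.
type hexEncoding struct{}

func (hexEncoding) EncodeToString(src []byte) string      { return hex.EncodeToString(src) }
func (hexEncoding) DecodeString(s string) ([]byte, error) { return hex.DecodeString(s) }

// Hex is the Encoding used by sha256, sha384 and sha512 today.
var Hex Encoding = hexEncoding{}

// Algorithm identifies a digest algorithm, as in the package today.
type Algorithm string

// Encoding returns the string encoding tied to this identifier.
func (a Algorithm) Encoding() Encoding { return Hex }

// HashSize returns the raw hash length in bytes, so validation no
// longer has to assume two hex characters per byte.
func (a Algorithm) HashSize() int {
    switch a {
    case "sha256":
        return 32
    case "sha384":
        return 48
    case "sha512":
        return 64
    }
    return 0
}
```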