Hardcoded assumptions that the hash will be encoded in hex #31
This is a known limitation of the package. We actually considered parameterizing the encoding with the algorithm itself, but found the variation not to be conducive to wider distribution.

> @stevvooe confirmed this interpretation in recent comments as well. The idea is that a future algorithm may choose a non-hex encoding like base 64.

Please don't confuse my desire to reserve something for future implementation as an endorsement of running out and implementing something now.

> Updating go-descriptor to use those instead of the currently-hard-coded hex assumptions.

What is this?

IMHO, the implementation in #30 goes a little further than it needs to (not to mention the massive compatibility breaks). In fact, there was early code (not sure if I checked it in) that attempted this.

The biggest invariant that the PR misses is that the format, encoding and actual hash algorithm are all linked. For example, `sha256` means hex-encoded sha256, prefixed with `sha256`. For example, `sha256` with base64 encoding might be `sha256+b64`, however, the utility of this is really uncertain, in practice.

When you look at digest as an API format, variation isn't a feature. If we look at other approaches to this problem, such as multihash, and actual uses, the most important feature is that they are easy to calculate and easy to compare. When you vary the encoding for things that are "equivalent", you break those features. Anything more complicated than "put the bytes in, get the hash out" means people won't verify and leverage the properties of the system to make it more secure; their eyes glaze over and the exploit surface area gets bigger.

Typically, the need for choosing the encoding for a digest is centered around reducing the storage size. However, they all come with trade-offs. The most obvious is base64, which can achieve about a 30% savings over hex, but you sacrifice case invariance. base32 will give you back case invariance but doesn't quite get the same savings. I have experimented with base36, as well, which is compact, case insensitive and easy on the eye. However, none of these provide even close to the benefit of using the same hash. This is powerful, simple, and overlooking that value is only for the unwise.

Under storage, this necessity isn't the case. If you are storing a large number of hashes (non-zero percentage of total data), I'd recommend encoding directly to binary. Conversion back to the hex encoding is cheap and efficient. I have considered adding binary unmarshaler/marshalers but have balked on the complexity and expense of maintaining format compatibility over different package versions.
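To put numbers on those trade-offs, here's a quick standard-library sketch comparing the encoded lengths of a single 32-byte sha256 digest (the byte counts are the point, not the API):

```go
package main

import (
    "crypto/sha256"
    "encoding/base32"
    "encoding/base64"
    "encoding/hex"
    "fmt"
)

func main() {
    sum := sha256.Sum256([]byte("foo")) // 32 raw bytes

    fmt.Println(len(hex.EncodeToString(sum[:])))                   // 64 chars, case insensitive as a format
    fmt.Println(len(base32.StdEncoding.EncodeToString(sum[:])))    // 56 chars (with padding), case insensitive
    fmt.Println(len(base64.RawStdEncoding.EncodeToString(sum[:]))) // 43 chars, ~33% smaller than hex, case sensitive
}
```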
TL;DR: This is a feature, not a bug.
On Tue, Mar 07, 2017 at 04:26:21PM -0800, Stephen Day wrote:
> This is a known limitation of the package. We actually considered parameterizing the encoding with the algorithm itself, but found the variation not to be conducive to wider distribution.
Can you link to the discussion? I didn't turn up anything that looked
like it with [1].
> > @stevvooe confirmed this interpretation in recent comments as well. The idea is that a future algorithm may choose a non-hex encoding like base 64.
> Please don't confuse my desire to reserve something for future implementation as an endorsement of running out and implementing something now.
There is a distinction there, but I don't see *how* you'd do that
unless you make the encoding part of the algorithm. In #30, I wasn't
adding any non-hex encodings, I was just adjusting the existing API so
that you could gracefully add non-hex encodings if/when we want to.
> > Updating go-descriptor to use those instead of the currently-hard-coded hex assumptions.
> What is this?
Changes like [2], where we replace hardcoded hex assumptions with
calls to Algorithm.Encoding and similar.
> IMHO, the implementation in #30 goes a little further than it needs to (not to mention the massive compatibility breaks).
As I said in #30, let me know if you want me to break it into smaller
chunks. But adding an Encoding interface isn't much use unless we
have some sort of agreement on the final target. #30 (and now this
issue) is my attempt to show that target, but obviously you could make
different decisions and still end up with something that works.
> The biggest invariant that the PR misses is that the format, encoding and actual hash algorithm are all linked. For example, `sha256` means hex-encoded sha256, prefixed with `sha256`.
In [3]:

```go
SHA256 = algorithm{
    name:     "sha256",
    hash:     crypto.SHA256,
    encoding: Hex,
}
```
which is exactly what you say. So I'm not sure what I'm missing.
> For example, `sha256` with base64 encoding might be `sha256+b64`, however, the utility of this is really uncertain, in practice.
Right, which is why “Alternative solutions include giving up on alternatives and just requiring all hashes to be hex-encoded” [4]. But I don't think you can have it both ways. If the intention is to keep the door open to alternative hash encodings, I think we want to make whatever API adjustments are needed to keep that possible now, or at least as early as possible, because any unavoidable backwards-compat issues only get more painful with time.
> When you look at digest as an API format, variation isn't a feature. If we look at other approaches to this problem, such as multihash, and actual uses, the most important feature is that they are easy to calculate and easy to compare. When you vary the encoding for things that are "equivalent", you break those features.
I'm not proposing varying an encoding for a given algorithm
identifier. I'm proposing an API that supports new algorithm
identifiers which use alternative encodings (which is an option that
you're currently holding open [5]). And again, I'm also fine if we
stop holding that option open and just require all hashes to be
hex-encoded forever.
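For concreteness, here's what such an alternative identifier might look like (the `sha256+b64` form is hypothetical; nothing like it exists in the package today). The same hash bytes render as two strings that can never be compared byte-for-byte:

```go
package main

import (
    "crypto/sha256"
    "encoding/base64"
    "encoding/hex"
    "fmt"
)

func main() {
    sum := sha256.Sum256([]byte("foo"))

    // Today's canonical form: algorithm identifier plus hex.
    fmt.Println("sha256:" + hex.EncodeToString(sum[:]))

    // A hypothetical base64-flavored identifier. Identical hash bytes,
    // but a string comparison with the hex form always fails.
    fmt.Println("sha256+b64:" + base64.RawStdEncoding.EncodeToString(sum[:]))
}
```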
> Typically, the need for choosing the encoding for a digest is centered around reducing the storage size. However, they all come with trade-offs. The most obvious is base64, which can achieve about a 30% savings over hex, but you sacrifice case invariance.
We have *theoretical* case invariance now. But the equality check in
[6] means that nobody is actually using it. For example:
```console
$ cat a.go
package main

import (
    _ "crypto/sha256"
    "fmt"

    "github.com/opencontainers/go-digest"
)

func main() {
    digestA := digest.Digest("sha256:B5BB9D8014A0F9B1D61E21E796D78DCCDF1352F23CD32812F4850B878AE4944C")
    verifier := digestA.Verifier()
    verifier.Write([]byte("foo\n"))
    fmt.Printf("verified? %v\n", verifier.Verified())
}
$ go run a.go
verified? false
```
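(The check fails because the verifier in [6] compares the digest strings byte-for-byte, and Go's hex encoder emits lowercase, so the uppercase digest above can never match.)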
> However, none of these provide even close to the benefit of using the same hash. This is powerful, simple, and overlooking that value is only for the unwise.
I think you mean “using the same algorithm identifier”, since
encodings are completely orthogonal to which *hash* you use. And I
agree that “we'll always use hex forever” is simpler. And in that
case, we should close the door on other encodings.
> Under storage, this necessity isn't the case. If you are storing a large number of hashes (non-zero percentage of total data), I'd recommend encoding directly to binary. Conversion back to the hex encoding is cheap and efficient. I have considered adding binary unmarshaler/marshalers but have balked on the complexity and expense of maintaining format compatibility over different package versions.
It's not really all that complicated or expensive to maintain this
[7]. Although I'm not clear on the wrinkle you're adding with “over
different package versions”. Maybe you're just saying “but I don't
want to write the data migration for a registry based on the
hex-encoded hashes”. And that seems fair, but I don't think it means
there are any problems with supporting decoding to the raw hash bytes
in go-digest. Registries based on go-digest can use that decoding
option or not as they see fit.
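For what it's worth, the round trip really is small. A sketch using the package's existing hex helpers (`Hex` and `NewDigestFromHex`), assuming no new marshalers at all:

```go
package main

import (
    "encoding/hex"
    "fmt"

    "github.com/opencontainers/go-digest"
)

func main() {
    d := digest.Digest("sha256:b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c")

    // Store 32 raw bytes instead of the 64-character hex string.
    raw, err := hex.DecodeString(d.Hex())
    if err != nil {
        panic(err)
    }
    fmt.Println(len(raw)) // 32

    // Converting back for display or comparison is cheap.
    round := digest.NewDigestFromHex(string(d.Algorithm()), hex.EncodeToString(raw))
    fmt.Println(round == d) // true
}
```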
[1]: https://github.com/docker/distribution/issues?utf8=%E2%9C%93&q=digest%20encoding
[2]: https://github.com/opencontainers/go-digest/pull/30/files#diff-f6da2430393ce391e4517013b6d3a95bL48
[3]: https://github.com/opencontainers/go-digest/pull/30/files#diff-eef025f7a17623416f835581c12f0a3cR71
[4]: #31 (comment)
[5]: #3 (comment)
[6]: https://github.com/opencontainers/go-digest/blob/v1.0.0-rc0/verifiers.go#L44
[7]: https://github.com/opencontainers/go-digest/pull/30/files#diff-c1e0b73f3e74e3389ee4dab001672f51R43
If this is a feature, can we at least make it an explicit feature?
What do you propose? It is already explicit and part of the format and design of this package. It has been validated in use shipping millions of container images.
Also fixes a typo and adds one clarifying link in the README. Fixes opencontainers#31
Uh, I have no idea what the relevance of that is. Posted my proposal here: #32
The docs for `Algorithm` (now `algorithm`) made it clear that the algorithm identifier was intended to cover both the hash and encoding algorithms. @stevvooe confirmed this interpretation in recent comments as well. The idea is that a future algorithm may choose a non-hex encoding like base 64.

The current implementation, on the other hand, bakes the hex encoding into key locations (e.g. in `NewDigestFromBytes` and `Digest.Validate`). I suggest:

- Adding an `Encoding` interface.
- Adding an `Algorithm.Encoding() Encoding` method.
- Adding an `Algorithm.HashSize() int` method.

I've floated an implementation in #30 if folks want to see how that works out. A sketch of one possible shape for these additions follows below.

Alternative solutions include giving up on alternatives and just requiring all hashes to be hex-encoded. Or some other API for pushing encoding information into the `Algorithm` instances.

Thoughts?
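For concreteness, one possible shape for those additions. This is a sketch only: the method signatures beyond the names `Encoding` and `HashSize` are guesses, not necessarily what #30 implements.

```go
package digest

import "encoding/hex"

// Encoding converts raw hash bytes to and from their string form.
type Encoding interface {
    EncodeToString(src []byte) string
    DecodeString(s string) ([]byte, error)
}

// hexEncoding wraps the standard library and would back every
// currently-defined algorithm identifier.
type hexEncoding struct{}

func (hexEncoding) EncodeToString(src []byte) string      { return hex.EncodeToString(src) }
func (hexEncoding) DecodeString(s string) ([]byte, error) { return hex.DecodeString(s) }

// Hex is the Encoding used by sha256, sha384 and sha512 today.
var Hex Encoding = hexEncoding{}

// Algorithm identifies a digest algorithm, as in the package today.
type Algorithm string

// Encoding returns the string encoding tied to this identifier.
func (a Algorithm) Encoding() Encoding { return Hex }

// HashSize returns the raw hash length in bytes, so validation no
// longer has to assume two hex characters per byte.
func (a Algorithm) HashSize() int {
    switch a {
    case "sha256":
        return 32
    case "sha384":
        return 48
    case "sha512":
        return 64
    }
    return 0
}
```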