Standardize algorithm for directory hashing #100

Merged
merged 11 commits into from
Dec 19, 2024
104 changes: 104 additions & 0 deletions cep-00??.md
@@ -0,0 +1,104 @@
# CEP XX - Computing the hash of the contents in a directory

<table>
<tr><td> Title </td><td> Computing the hash of the contents in a directory </td></tr>
<tr><td> Status </td><td> Draft </td></tr>
<tr><td> Author(s) </td><td> Jaime Rodríguez-Guerra &lt;jaime.rogue@gmail.com&gt;</td></tr>
<tr><td> Created </td><td> Nov 19, 2024</td></tr>
<tr><td> Updated </td><td> Nov 19, 2024</td></tr>
<tr><td> Discussion </td><td> https://github.com/conda/ceps/pull/100 </td></tr>
<tr><td> Implementation </td><td> https://github.com/conda/conda-build/pull/5277 </td></tr>
</table>

## Abstract

Given a directory, this CEP proposes an algorithm to compute the aggregated hash of its contents in a cross-platform way. This is useful to check the integrity of remote sources regardless of the compression method used.

> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT",
"RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as
described in [RFC2119][RFC2119] when, and only when, they appear in all capitals, as shown here.

## Specification

Given a directory, recursively scan all its contents (without following symlinks) and sort them by their full path. For each entry, in sorted order, update the aggregated hash with the concatenation of:

We probably need to better specify what "sort" means, and in particular, what collation ordering we want to use. Because:

$ echo -e "aa\na-a\nab\na-b" | LC_ALL=en_US.UTF-8 sort
a-a
aa
a-b
ab

$ echo -e "aa\na-a\nab\na-b" | LC_ALL=C.UTF-8 sort
a-a
a-b
aa
ab

Contributor Author

Like Python's sort([str, ...]), it's locale agnostic. See 378d1fd (#100)
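
To illustrate the locale-agnostic ordering mentioned above: Python compares `str` values by Unicode code point, so sorting the same four names should reproduce the `LC_ALL=C.UTF-8` order from the shell example, whatever the active locale is. A minimal sketch (the variable names are only for illustration):

```python
names = ["aa", "a-a", "ab", "a-b"]

# str comparison is by code point; "-" (U+002D) sorts before "a" (U+0061)
by_code_point = sorted(names)
print(by_code_point)  # ['a-a', 'a-b', 'aa', 'ab']

# Sorting by the UTF-8 encoded bytes gives the same order
by_bytes = sorted(names, key=lambda s: s.encode("utf-8"))
assert by_bytes == by_code_point
```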

- UTF-8 encoded bytes of the path, relative to the input directory. Backslashes MUST be normalized to forward slashes before encoding.
- Then, depending on the type:
  - For text files, the UTF-8 encoded bytes of an `F` separator, followed by the UTF-8 encoded bytes of its line-ending-normalized contents (`\r\n` replaced with `\n`). A file is considered a text file if all the contents can be UTF-8 decoded. Otherwise it's considered binary.
  - For binary files, the UTF-8 encoded bytes of an `F` separator, followed by the bytes of its contents.
  - For a directory, the UTF-8 encoded bytes of a `D` separator, and nothing else.
  - For a symlink, the UTF-8 encoded bytes of an `L` separator, followed by the UTF-8 encoded bytes of the path it points to. Backslashes MUST be normalized to forward slashes before encoding.
- UTF-8 encoded bytes of the string `-`.
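
The text/binary rule in the list above can be looked at in isolation. The sketch below assumes the whole file fits in memory, and `normalized_payload` is a hypothetical helper name, not part of this CEP. It shows that a CRLF file and its LF counterpart contribute the same bytes, while data that is not valid UTF-8 is passed through untouched:

```python
def normalized_payload(raw: bytes) -> bytes:
    """Return the bytes to hash for a regular file (hypothetical helper)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: treat as binary and hash the bytes untouched
        return raw
    # Valid UTF-8: treat as text and normalize CRLF line endings to LF
    return text.replace("\r\n", "\n").encode("utf-8")


crlf = b"line one\r\nline two\r\n"
lf = b"line one\nline two\n"
binary = b"\xff\xfe\r\n\x00"  # 0xff can never appear in valid UTF-8

assert normalized_payload(crlf) == normalized_payload(lf) == lf
assert normalized_payload(binary) == binary
```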

Example implementation in Python:

```python
import hashlib
from functools import partial
from pathlib import Path


def contents_hash(directory: str, algorithm: str) -> str:
    hasher = hashlib.new(algorithm)
    for path in sorted(Path(directory).rglob("*")):
        # Relative path with forward slashes, regardless of platform
        relative = str(path.relative_to(directory)).replace("\\", "/")
        hasher.update(relative.encode("utf-8"))
        if path.is_symlink():
            hasher.update(b"L")
            # Path.readlink() requires Python 3.9+
            hasher.update(str(path.readlink()).replace("\\", "/").encode("utf-8"))
        elif path.is_dir():
            hasher.update(b"D")
        elif path.is_file():
            hasher.update(b"F")
            try:
                # assume it's text; collect all lines first so a decoding
                # error does not leave the hasher partially updated
                lines = []
                # newline="" keeps the line endings exactly as stored on disk
                with open(path, encoding="utf-8", newline="") as fh:
                    for line in fh:
                        lines.append(line.replace("\r\n", "\n"))
                for line in lines:
                    hasher.update(line.encode("utf-8"))
            except UnicodeDecodeError:
                # file must be binary
                with open(path, "rb") as fh:
                    for chunk in iter(partial(fh.read, 8192), b""):
                        hasher.update(chunk)
        hasher.update(b"-")
    return hasher.hexdigest()
```

## Motivation

Build tools like `conda-build` and `rattler-build` need to fetch the source of the project being packaged. The integrity of the download is checked by comparing its known hash (usually SHA256) against the obtained file. If they don't match, an error is raised.
Member

> If they don't match, an error is raised.

This doesn't say what build tools should do when the files can't be read or an error is raised during computation, and the implementation in https://github.com/conda/conda-build/pull/5277/files#r1861269213 shows that it ignores file read errors (which can have many reasons) and only creates the content hash with the name of the file. I'd think that's a failure state for a content hash algorithm that aims to be consistent since it would expose it to file permission attacks etc.

Contributor Author

I'm not sure I follow how this could be weaponized as an attack vector. Let's say we have a directory with three files with (name, content):

  • ('file1.txt', '123')
  • ('file2.txt', '456')
  • ('file3.txt', '789')

We end up computing the hash of these UTF-8 bytes:

file1.txtF123-file2.txtF456-file3.txtF789-
>>> import hashlib
>>> hashlib.md5(b"file1.txtF123-file2.txtF456-file3.txtF789-").hexdigest()
'54866bc311f08b2e082466b090cbe560'

If, say, file1.txt changes permissions and becomes unreadable, then the string would be:

file1.txtF-file2.txtF456-file3.txtF789-

This string has a different hash.

If file1.txt becomes a directory it would then be:

file1.txtD-file2.txtF456-file3.txtF789-

Same for an unknown file type (but ? instead of D).
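
The three cases above can be checked quickly. This is only a sketch: it hashes the literal strings from this comment rather than real files, and SHA256 is an arbitrary choice of algorithm here.

```python
import hashlib

streams = {
    "readable": b"file1.txtF123-file2.txtF456-file3.txtF789-",
    "unreadable (contents skipped)": b"file1.txtF-file2.txtF456-file3.txtF789-",
    "directory": b"file1.txtD-file2.txtF456-file3.txtF789-",
}

digests = {label: hashlib.sha256(data).hexdigest() for label, data in streams.items()}
for label, digest in digests.items():
    print(f"{label}: {digest}")

# All three byte strings differ, so all three digests differ
assert len(set(digests.values())) == 3
```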

Contributor Author

Now, let's say that we had an empty file, and then a new source adds some malicious content there, but makes it unreadable. They would have the same hash because we are not marking errors... The build script would just need to make it readable with chmod later and then it can be weaponized. Ok, now I get it: we need to mark errors with a different separator: maybe E. Right?

Thank you for reading along!

Contributor Author

Or maybe we do need to error out, because an unreadable file might change contents arbitrarily and its hash would never change.
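
A small sketch of the problem discussed in this thread. `entry_bytes` and its `ignore_errors` flag are hypothetical names used only for illustration, and an unreadable file is modelled as `contents=None`. When read errors are swallowed, an unreadable file is indistinguishable from an empty one; raising instead makes the failure visible:

```python
import hashlib
from typing import Optional


def entry_bytes(name: str, contents: Optional[bytes], ignore_errors: bool) -> bytes:
    """Bytes hashed for one file entry; contents=None models an unreadable file."""
    if contents is None:
        if not ignore_errors:
            raise PermissionError(f"cannot read {name}")
        contents = b""  # read error swallowed: only the name and marker remain
    return name.encode("utf-8") + b"F" + contents + b"-"


empty = entry_bytes("file1.txt", b"", ignore_errors=True)
hidden = entry_bytes("file1.txt", None, ignore_errors=True)

# Same bytes, hence the same hash, although the unreadable file could hold anything
assert hashlib.sha256(empty).digest() == hashlib.sha256(hidden).digest()

try:
    entry_bytes("file1.txt", None, ignore_errors=False)
except PermissionError as exc:
    print(f"hashing aborted: {exc}")
```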

Contributor Author

See 77e35c0


However, the hash of the compressed archive is sensitive to incidental details such as the compression method or the version of the archiving tool. These details have nothing to do with the contents of the archive, which is what a build tool actually cares about.
This happens often with archives fetched live from GitHub repository references, for example.
A content hash is also useful to verify the integrity of a `git clone` operation on a dynamic reference like a branch name.

With this proposal, build tools could add a new family of hash checks that are more robust for content reproducibility.
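
As a small illustration of that sensitivity, the sketch below compresses the same payload twice with Python's `gzip` module, changing only the timestamp stored in the gzip header. The archive hashes differ while the decompressed contents are identical. The payload and variable names are made up for the example.

```python
import gzip
import hashlib

payload = b"print('hello')\n"

# Same payload, different archive metadata (here: the gzip header timestamp)
archive_a = gzip.compress(payload, mtime=0)
archive_b = gzip.compress(payload, mtime=1)

# The archives differ byte-for-byte, so their hashes differ...
assert hashlib.sha256(archive_a).digest() != hashlib.sha256(archive_b).digest()
# ...but what a build tool cares about, the contents, is unchanged
assert gzip.decompress(archive_a) == gzip.decompress(archive_b) == payload
```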

## Rationale

The proposed algorithm could have simply concatenated all the bytes together once the directory contents had been sorted. Instead, it also encodes relative paths and separators to prevent [preimage attacks][preimage].
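
One way to see the ambiguity that plain concatenation would allow: two different trees whose file contents join into the same byte string. The sketch below models a tree as a dict of path to contents (regular files only, a simplification for illustration) and compares a contents-only hash with one that frames each entry with its path, the `F` marker and the trailing `-`.

```python
import hashlib

tree_a = {"x": b"ab", "y": b"c"}
tree_b = {"x": b"a", "y": b"bc"}


def naive(tree):
    # Concatenate file contents only, in sorted path order
    return hashlib.sha256(b"".join(tree[p] for p in sorted(tree))).hexdigest()


def framed(tree):
    # Path, type marker and trailing "-" around each file's contents
    hasher = hashlib.sha256()
    for path in sorted(tree):
        hasher.update(path.encode("utf-8") + b"F" + tree[path] + b"-")
    return hasher.hexdigest()


# Both trees concatenate to b"abc", so the naive hash cannot tell them apart
assert naive(tree_a) == naive(tree_b)
# With framing the streams are b"xFab-yFc-" and b"xFa-yFbc-", which differ
assert framed(tree_a) != framed(tree_b)
```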

For simplicity, Merkle trees were not used: there is no need to update the hash often or to point out which file is responsible for a hash change.

The implementation of this algorithm as specific options in build tools is a non-goal of this CEP. That goal is deferred to further CEPs, which could simply say something like:

> The `source` section is a list of objects, with keys [...] `contents_sha256` and `contents_md5` (which implement CEP XX for SHA256 and MD5, respectively).

## References

- The original issue suggesting this idea is [`conda-build#4762`][conda-build-issue].
- The Nix ecosystem has a similar feature called [`fetchzip`][fetchzip].
- There are several [Rust crates][crates] and [Python projects][pymerkletools] implementing similar strategies using Merkle trees. Some of the details here were inspired by [`dasher`][dasher].

## Copyright

All CEPs are explicitly [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/).

<!-- links -->

[fetchzip]: https://nixos.org/manual/nixpkgs/stable/#fetchurl
[preimage]: https://flawed.net.nz/2018/02/21/attacking-merkle-trees-with-a-second-preimage-attack/
[dasher]: https://github.com/DrSLDR/dasher#hashing-scheme
[pymerkletools]: https://github.com/Tierion/pymerkletools
[crates]: https://crates.io/search?q=content%20hash
[conda-build-issue]: https://github.com/conda/conda-build/issues/4762
[RFC2119]: https://www.ietf.org/rfc/rfc2119.txt