From c90049b757a83390bc607c8fb4e6473155d433e4 Mon Sep 17 00:00:00 2001
From: jaimergp
Date: Wed, 20 Nov 2024 01:39:58 +0100
Subject: [PATCH 01/11] Standardize algorithm for directory hashing

---
 cep-00??.md | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)
 create mode 100644 cep-00??.md

diff --git a/cep-00??.md b/cep-00??.md
new file mode 100644
index 0000000..66b10b6
--- /dev/null
+++ b/cep-00??.md
@@ -0,0 +1,90 @@
+# CEP XX - Computing the hash of the contents in a directory
+
+<table>
+<tr><td> Title </td><td> Computing the hash of the contents in a directory </td>
+<tr><td> Status </td><td> Draft </td>
+<tr><td> Author(s) </td><td> Jaime Rodríguez-Guerra &lt;jaime.rogue@gmail.com&gt; </td>
+<tr><td> Created </td><td> Nov 19, 2024 </td>
+<tr><td> Updated </td><td> Nov 19, 2024 </td>
+<tr><td> Discussion </td><td> link to the PR where the CEP is being discussed, NA is circulated initially </td>
+<tr><td> Implementation </td><td> https://github.com/conda/conda-build/pull/5277 </td>
+</table>
+
+## Abstract
+
+Given a directory, propose an algorithm to compute the aggregated hash of its contents in a cross-platform way. This is useful to check the integrity of remote sources regardless of the compression method used.
+
+> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT",
+  "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as
+  described in [RFC2119][RFC2119] when, and only when, they appear in all capitals, as shown here.
+
+## Specification
+
+Given a directory, recursively scan all its contents (without following symlinks) and sort them by their full path. For each entry in the contents table, compute the hash for the concatenation of:
+- UTF-8 encoded bytes of the path, relative to the input directory. Backslashes MUST be normalized to forward slashes before encoding.
+- Then, depending on the type:
+  - For regular files, the UTF-8 encoded bytes of an `F` separator, followed by the bytes of its contents.
+  - For a directory, the UTF-8 encoded bytes of a `D` separator, and nothing else.
+  - For a symlink, the UTF-8 encoded bytes of an `L` separator, followed by the UTF-8 encoded bytes of the path it points to. Backslashes MUST be normalized to forward slashes before encoding.
+- UTF-8 encoded bytes of the string `-`.
+
+Example implementation in Python:
+
+```python
+import hashlib
+from functools import partial
+from pathlib import Path
+
+def contents_hash(directory: str, algorithm: str) -> str:
+    hasher = hashlib.new(algorithm)
+    for path in sorted(Path(directory).rglob("*")):
+        hasher.update(str(path.relative_to(directory)).replace("\\", "/").encode("utf-8"))
+        if path.is_symlink():
+            hasher.update(b"L")
+            hasher.update(str(path.readlink()).replace("\\", "/").encode("utf-8"))
+        elif path.is_dir():
+            hasher.update(b"D")
+        elif path.is_file():
+            hasher.update(b"F")
+            with open(path, "rb") as fh:
+                for chunk in iter(partial(fh.read, 8192), b""):
+                    hasher.update(chunk)
+        hasher.update(b"-")
+    return hasher.hexdigest()
+```
+
+## Motivation
+
+Build tools like `conda-build` and `rattler-build` need to fetch the source of the project being packaged. The integrity of the download is checked by comparing its known hash (usually SHA256) against the obtained file. If they don't match, an error is raised.
+
+However, the hash of the compressed archive is sensitive to superfluous changes like which compression method was used, the version of the archiving tool and other details that are not concerned with the contents of the archive, which is what a build tool actually cares about.
+This happens often with archives fetched live from GitHub repository references, for example.
+It is also useful to verify the integrity of a `git clone` operation on a dynamic reference like a branch name.
+
+With this proposal, build tools could add a new family of hash checks that are more robust for content reproducibility.
+
+## Rationale
+
+The proposed algorithm could simply concatenate all the bytes together, once the directory contents had been sorted. Instead, it also encodes relative paths and separators to prevent [preimage attacks][preimage].
+
+Merkle trees were not used for simplicity, since it's not necessary to update the hash often or to point out which file is responsible for the hash change.
+
+The implementation of this algorithm as specific options in build tools is a non-goal of this CEP. That goal is deferred to further CEPs, which could simply say something like:
+
+> The `source` section is a list of objects, with keys [...] `contents_sha256` and `contents_md5` (which implement CEP XX for SHA256 and MD5, respectively).
+
+## References
+
+- The Nix ecosystem has a similar feature called [`fetchzip`][fetchzip].
+- There are several [Rust crates][crates] and [Python projects][pymerkletools] implementing similar strategies using Merkle trees. Some of the details here were inspired by [`dasher`][dasher]
+
+## Copyright
+
+All CEPs are explicitly [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/).
+
+<!-- links -->
+
+[fetchzip]: https://nixos.org/manual/nixpkgs/stable/#fetchurl
+[preimage]: https://flawed.net.nz/2018/02/21/attacking-merkle-trees-with-a-second-preimage-attack/
+[dasher]: https://github.com/DrSLDR/dasher#hashing-scheme
+[pymerkletools]: https://github.com/Tierion/pymerkletools
+[crates]: https://crates.io/search?q=content%20hash

From 6290e91b928982324a3c185e53ac7f9246d61a76 Mon Sep 17 00:00:00 2001
From: jaimergp
Date: Wed, 20 Nov 2024 01:41:10 +0100
Subject: [PATCH 02/11] add link to CEP PR

---
 cep-00??.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cep-00??.md b/cep-00??.md
index 66b10b6..9a08d13 100644
--- a/cep-00??.md
+++ b/cep-00??.md
@@ -6,7 +6,7 @@
 <tr><td> Author(s) </td><td> Jaime Rodríguez-Guerra &lt;jaime.rogue@gmail.com&gt; </td>
 <tr><td> Created </td><td> Nov 19, 2024 </td>
 <tr><td> Updated </td><td> Nov 19, 2024 </td>
-<tr><td> Discussion </td><td> link to the PR where the CEP is being discussed, NA is circulated initially </td>
+<tr><td> Discussion </td><td> https://github.com/conda/ceps/pull/100 </td>
 <tr><td> Implementation </td><td> https://github.com/conda/conda-build/pull/5277 </td>
 </table>
 

From b403f7123a1f0162afe139412fb27ad9508bda76 Mon Sep 17 00:00:00 2001
From: jaimergp
Date: Wed, 20 Nov 2024 01:42:37 +0100
Subject: [PATCH 03/11] add link to og issue

---
 cep-00??.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/cep-00??.md b/cep-00??.md
index 9a08d13..de43e1d 100644
--- a/cep-00??.md
+++ b/cep-00??.md
@@ -74,8 +74,9 @@ The implementation of this algorithm as specific options in build tools is a non
 
 ## References
 
+- The original issue suggesting this idea is [`conda-build#4762`][conda-build-issue].
 - The Nix ecosystem has a similar feature called [`fetchzip`][fetchzip].
-- There are several [Rust crates][crates] and [Python projects][pymerkletools] implementing similar strategies using Merkle trees. Some of the details here were inspired by [`dasher`][dasher]
+- There are several [Rust crates][crates] and [Python projects][pymerkletools] implementing similar strategies using Merkle trees. Some of the details here were inspired by [`dasher`][dasher].
 
 ## Copyright
 
@@ -88,3 +89,4 @@ All CEPs are explicitly [CC0 1.0 Universal](https://creativecommons.org/publicdo
 [dasher]: https://github.com/DrSLDR/dasher#hashing-scheme
 [pymerkletools]: https://github.com/Tierion/pymerkletools
 [crates]: https://crates.io/search?q=content%20hash
+[conda-build-issue]: https://github.com/conda/conda-build/issues/4762

From 21f66faac959fb33040411e09f3e481c90bc7a0e Mon Sep 17 00:00:00 2001
From: jaimergp
Date: Wed, 20 Nov 2024 15:59:41 +0100
Subject: [PATCH 04/11] normalize line endings in text files

---
 cep-00??.md | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/cep-00??.md b/cep-00??.md
index de43e1d..a247a94 100644
--- a/cep-00??.md
+++ b/cep-00??.md
@@ -23,7 +23,8 @@ Given a directory, propose an algorithm to compute the aggregated hash of its co
 Given a directory, recursively scan all its contents (without following symlinks) and sort them by their full path. For each entry in the contents table, compute the hash for the concatenation of:
 - UTF-8 encoded bytes of the path, relative to the input directory. Backslashes MUST be normalized to forward slashes before encoding.
 - Then, depending on the type:
-  - For regular files, the UTF-8 encoded bytes of an `F` separator, followed by the bytes of its contents.
+  - For regular binary files, the UTF-8 encoded bytes of an `F` separator, followed by the bytes of its contents.
+  - For regular text files, the UTF-8 encoded bytes of an `F` separator, followed by the UTF-8 encoded bytes of its line-ending-normalized contents (`\r\n` replaced with `\n`).
   - For a directory, the UTF-8 encoded bytes of a `D` separator, and nothing else.
   - For a symlink, the UTF-8 encoded bytes of an `L` separator, followed by the UTF-8 encoded bytes of the path it points to. Backslashes MUST be normalized to forward slashes before encoding.
 - UTF-8 encoded bytes of the string `-`.
@@ -44,10 +45,17 @@ def contents_hash(directory: str, algorithm: str) -> str:
         elif path.is_dir():
             hasher.update(b"D")
         elif path.is_file():
-            hasher.update(b"F")
-            with open(path, "rb") as fh:
-                for chunk in iter(partial(fh.read, 8192), b""):
-                    hasher.update(chunk)
+            hasher.update(b"F")
+            try:
+                # assume it's text
+                with open(path) as fh:
+                    for line in fh:
+                        hasher.update(line.replace("\r\n", "\n").encode("utf-8"))
+            except UnicodeDecodeError:
+                # file must be binary
+                with open(path, "rb") as fh:
+                    for chunk in iter(partial(fh.read, 8192), b""):
+                        hasher.update(chunk)
         hasher.update(b"-")
     return hasher.hexdigest()
 ```

From 6742d7ef50c3ab583a2501fbcfca0727dc462d80 Mon Sep 17 00:00:00 2001
From: jaimergp
Date: Wed, 20 Nov 2024 16:14:04 +0100
Subject: [PATCH 05/11] simpler sentence

---
 cep-00??.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cep-00??.md b/cep-00??.md
index a247a94..57e1ee4 100644
--- a/cep-00??.md
+++ b/cep-00??.md
@@ -23,8 +23,8 @@ Given a directory, propose an algorithm to compute the aggregated hash of its co
 Given a directory, recursively scan all its contents (without following symlinks) and sort them by their full path. For each entry in the contents table, compute the hash for the concatenation of:
 - UTF-8 encoded bytes of the path, relative to the input directory. Backslashes MUST be normalized to forward slashes before encoding.
 - Then, depending on the type:
-  - For regular binary files, the UTF-8 encoded bytes of an `F` separator, followed by the bytes of its contents.
-  - For regular text files, the UTF-8 encoded bytes of an `F` separator, followed by the UTF-8 encoded bytes of its line-ending-normalized contents (`\r\n` replaced with `\n`).
+  - For binary files, the UTF-8 encoded bytes of an `F` separator, followed by the bytes of its contents.
+  - For text files, the UTF-8 encoded bytes of an `F` separator, followed by the UTF-8 encoded bytes of its line-ending-normalized contents (`\r\n` replaced with `\n`).
   - For a directory, the UTF-8 encoded bytes of a `D` separator, and nothing else.
   - For a symlink, the UTF-8 encoded bytes of an `L` separator, followed by the UTF-8 encoded bytes of the path it points to. Backslashes MUST be normalized to forward slashes before encoding.
 - UTF-8 encoded bytes of the string `-`.
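As a standalone sketch (not taken from the CEP text or the patches; the sample byte strings and the helper name are made up), the text/binary rule introduced in the two patches above behaves like this: a file counts as text only if its bytes decode as UTF-8, text gets `\r\n` normalized to `\n`, and anything else is hashed as raw bytes.

```python
# Illustrative sketch of the text/binary rule: UTF-8 decodable -> normalize
# line endings; otherwise hash the raw bytes unchanged.
import hashlib


def file_payload(data: bytes) -> bytes:
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return data  # binary: raw bytes, untouched
    return text.replace("\r\n", "\n").encode("utf-8")


crlf = b"line one\r\nline two\r\n"   # arbitrary sample contents
lf = b"line one\nline two\n"
blob = b"\x00\xff\xfe\x00"           # not valid UTF-8, so treated as binary

assert file_payload(crlf) == file_payload(lf)   # CRLF and LF hash identically
assert file_payload(blob) == blob
print(hashlib.sha256(file_payload(crlf)).hexdigest())
```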
From 917e7f7bca59a6270ebe9807e57c727030f8d9b5 Mon Sep 17 00:00:00 2001
From: jaimergp
Date: Wed, 20 Nov 2024 16:32:44 +0100
Subject: [PATCH 06/11] do not allow partial updates

---
 cep-00??.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/cep-00??.md b/cep-00??.md
index 57e1ee4..0c2ec55 100644
--- a/cep-00??.md
+++ b/cep-00??.md
@@ -48,9 +48,12 @@ def contents_hash(directory: str, algorithm: str) -> str:
             hasher.update(b"F")
             try:
                 # assume it's text
+                lines = []
                 with open(path) as fh:
                     for line in fh:
-                        hasher.update(line.replace("\r\n", "\n").encode("utf-8"))
+                        lines.append(line.replace("\r\n", "\n"))
+                for line in lines:
+                    hasher.update(line.encode("utf-8"))
             except UnicodeDecodeError:
                 # file must be binary
                 with open(path, "rb") as fh:

From ffd70168e52460034ca136f784622e5cb140db2a Mon Sep 17 00:00:00 2001
From: jaimergp
Date: Wed, 20 Nov 2024 16:33:46 +0100
Subject: [PATCH 07/11] clarify text/binary

---
 cep-00??.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/cep-00??.md b/cep-00??.md
index 0c2ec55..03c051a 100644
--- a/cep-00??.md
+++ b/cep-00??.md
@@ -23,8 +23,9 @@ Given a directory, propose an algorithm to compute the aggregated hash of its co
 Given a directory, recursively scan all its contents (without following symlinks) and sort them by their full path. For each entry in the contents table, compute the hash for the concatenation of:
 - UTF-8 encoded bytes of the path, relative to the input directory. Backslashes MUST be normalized to forward slashes before encoding.
 - Then, depending on the type:
+  - For text files, the UTF-8 encoded bytes of an `F` separator, followed by the UTF-8 encoded bytes of its line-ending-normalized contents (`\r\n` replaced with `\n`). A file is considered
+    a text file if all the contents can be UTF-8 decoded. Otherwise it's considered binary.
   - For binary files, the UTF-8 encoded bytes of an `F` separator, followed by the bytes of its contents.
-  - For text files, the UTF-8 encoded bytes of an `F` separator, followed by the UTF-8 encoded bytes of its line-ending-normalized contents (`\r\n` replaced with `\n`).
   - For a directory, the UTF-8 encoded bytes of a `D` separator, and nothing else.
   - For a symlink, the UTF-8 encoded bytes of an `L` separator, followed by the UTF-8 encoded bytes of the path it points to. Backslashes MUST be normalized to forward slashes before encoding.
 - UTF-8 encoded bytes of the string `-`.

From 378d1fd90b2e6bfc467c69d940b1a7c52ac9d35b Mon Sep 17 00:00:00 2001
From: jaimergp
Date: Wed, 20 Nov 2024 16:55:57 +0100
Subject: [PATCH 08/11] clarify sort

---
 cep-00??.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/cep-00??.md b/cep-00??.md
index 03c051a..ff8c7eb 100644
--- a/cep-00??.md
+++ b/cep-00??.md
@@ -20,7 +20,9 @@ Given a directory, propose an algorithm to compute the aggregated hash of its co
 
 ## Specification
 
-Given a directory, recursively scan all its contents (without following symlinks) and sort them by their full path. For each entry in the contents table, compute the hash for the concatenation of:
+Given a directory, recursively scan all its contents (without following symlinks) and sort them by their full path as a Unicode string. More specifically, it MUST follow an ascending lexicographical comparison using the numerical Unicode code points (i.e. the result of Python's built-in function `ord()`) of their characters [^1]. 
+
+For each entry in the contents table, compute the hash for the concatenation of:
 - UTF-8 encoded bytes of the path, relative to the input directory. Backslashes MUST be normalized to forward slashes before encoding.
 - Then, depending on the type:
   - For text files, the UTF-8 encoded bytes of an `F` separator, followed by the UTF-8 encoded bytes of its line-ending-normalized contents (`\r\n` replaced with `\n`). A file is considered
@@ -102,3 +104,4 @@ All CEPs are explicitly [CC0 1.0 Universal](https://creativecommons.org/publicdo
 [pymerkletools]: https://github.com/Tierion/pymerkletools
 [crates]: https://crates.io/search?q=content%20hash
 [conda-build-issue]: https://github.com/conda/conda-build/issues/4762
+[^1]: This is what Python does. See "strings" in [Value comparisons](https://docs.python.org/3/reference/expressions.html#value-comparisons).

From 54e2ba4c52c25063686b55f0f8aa5d14afdfeb7a Mon Sep 17 00:00:00 2001
From: jaimergp
Date: Tue, 26 Nov 2024 13:05:21 +0100
Subject: [PATCH 09/11] deal with unknown types

---
 cep-00??.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/cep-00??.md b/cep-00??.md
index ff8c7eb..f7d077b 100644
--- a/cep-00??.md
+++ b/cep-00??.md
@@ -26,10 +26,12 @@ For each entry in the contents table, compute the hash for the concatenation of:
 - UTF-8 encoded bytes of the path, relative to the input directory. Backslashes MUST be normalized to forward slashes before encoding.
 - Then, depending on the type:
   - For text files, the UTF-8 encoded bytes of an `F` separator, followed by the UTF-8 encoded bytes of its line-ending-normalized contents (`\r\n` replaced with `\n`). A file is considered
-    a text file if all the contents can be UTF-8 decoded. Otherwise it's considered binary.
+    a text file if all the contents can be UTF-8 decoded. Otherwise it's considered binary. If the
+    file can't be opened, it's handled as if it were empty.
   - For binary files, the UTF-8 encoded bytes of an `F` separator, followed by the bytes of its contents.
   - For a directory, the UTF-8 encoded bytes of a `D` separator, and nothing else.
   - For a symlink, the UTF-8 encoded bytes of an `L` separator, followed by the UTF-8 encoded bytes of the path it points to. Backslashes MUST be normalized to forward slashes before encoding.
+  - For any other paths, the UTF-8 encoded bytes of a `?` separator, and nothing else.
 - UTF-8 encoded bytes of the string `-`.
 
 Example implementation in Python:
@@ -62,6 +64,8 @@ def contents_hash(directory: str, algorithm: str) -> str:
                 with open(path, "rb") as fh:
                     for chunk in iter(partial(fh.read, 8192), b""):
                         hasher.update(chunk)
+        else:
+            hasher.update(b"?")
         hasher.update(b"-")
     return hasher.hexdigest()
 ```

From 77e35c0766d90f38f2a51f11f78b700ccb7f5004 Mon Sep 17 00:00:00 2001
From: jaimergp
Date: Wed, 27 Nov 2024 23:41:05 +0100
Subject: [PATCH 10/11] update details on error handling

---
 cep-00??.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/cep-00??.md b/cep-00??.md
index f7d077b..ea44e1a 100644
--- a/cep-00??.md
+++ b/cep-00??.md
@@ -25,15 +25,17 @@ Given a directory, recursively scan all its contents (without following symlinks
 For each entry in the contents table, compute the hash for the concatenation of:
 - UTF-8 encoded bytes of the path, relative to the input directory. Backslashes MUST be normalized to forward slashes before encoding.
 - Then, depending on the type:
-  - For text files, the UTF-8 encoded bytes of an `F` separator, followed by the UTF-8 encoded bytes of its line-ending-normalized contents (`\r\n` replaced with `\n`). A file is considered
-    a text file if all the contents can be UTF-8 decoded. Otherwise it's considered binary. If the
-    file can't be opened, it's handled as if it were empty.
-  - For binary files, the UTF-8 encoded bytes of an `F` separator, followed by the bytes of its contents.
+  - For regular files:
+    - If text, the UTF-8 encoded bytes of an `F` separator, followed by the UTF-8 encoded bytes of its line-ending-normalized contents (`\r\n` replaced with `\n`). A file is considered a text file if all the contents can be UTF-8 decoded. Otherwise it's considered binary. If the file can't be opened, it's handled as if it were empty.
+    - If binary, the UTF-8 encoded bytes of an `F` separator, followed by the bytes of its contents.
+    - If it can't be read, error out.
   - For a directory, the UTF-8 encoded bytes of a `D` separator, and nothing else.
   - For a symlink, the UTF-8 encoded bytes of an `L` separator, followed by the UTF-8 encoded bytes of the path it points to. Backslashes MUST be normalized to forward slashes before encoding.
-  - For any other paths, the UTF-8 encoded bytes of a `?` separator, and nothing else.
+  - For any other types, error out.
 - UTF-8 encoded bytes of the string `-`.
 
+Note that the algorithm MUST error out on unreadable files and unknown file types because we can't verify their contents. An attacker could hide malicious content in those paths known to be "unhashable" and later reveal them again in the build script (e.g. by `chmod`ing them as readable).
+
 Example implementation in Python:
 
 ```python
@@ -65,7 +67,7 @@ def contents_hash(directory: str, algorithm: str) -> str:
                 for chunk in iter(partial(fh.read, 8192), b""):
                     hasher.update(chunk)
         else:
-            hasher.update(b"?")
+            raise RuntimeError(f"Unknown file type: {path}")
         hasher.update(b"-")
     return hasher.hexdigest()
 ```

From 24a9f8f8db720f6cf539885df0db40b94f77f0c5 Mon Sep 17 00:00:00 2001
From: jaimergp
Date: Wed, 27 Nov 2024 23:42:05 +0100
Subject: [PATCH 11/11] trim

---
 cep-00??.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cep-00??.md b/cep-00??.md
index ea44e1a..73a5d6b 100644
--- a/cep-00??.md
+++ b/cep-00??.md
@@ -20,7 +20,7 @@ Given a directory, propose an algorithm to compute the aggregated hash of its co
 
 ## Specification
 
-Given a directory, recursively scan all its contents (without following symlinks) and sort them by their full path as a Unicode string. More specifically, it MUST follow an ascending lexicographical comparison using the numerical Unicode code points (i.e. the result of Python's built-in function `ord()`) of their characters [^1]. 
+Given a directory, recursively scan all its contents (without following symlinks) and sort them by their full path as a Unicode string. More specifically, it MUST follow an ascending lexicographical comparison using the numerical Unicode code points (i.e. the result of Python's built-in function `ord()`) of their characters [^1].
 
 For each entry in the contents table, compute the hash for the concatenation of:
 - UTF-8 encoded bytes of the path, relative to the input directory. Backslashes MUST be normalized to forward slashes before encoding.
@@ -52,7 +52,7 @@ def contents_hash(directory: str, algorithm: str) -> str: elif path.is_dir(): hasher.update(b"D") elif path.is_file(): - hasher.update(b"F") + hasher.update(b"F") try: # assume it's text lines = []