-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Standardize algorithm for directory hashing #100
Conversation
cc @wolfv |
In pixi we often compute these hashes in a similar way. We also normalize line endings in text files because on windows git often normalizes these as well. |
Nice that you mentioned this because I think I just run into that very issue: https://github.com/conda/conda-build/actions/runs/11924575237/job/33235172020?pr=5277#step:11:424 |
cep-00??.md
Outdated
|
||
## Specification | ||
|
||
Given a directory, recursively scan all its contents (without following symlinks) and sort them by their full path. For each entry in the contents table, compute the hash for the concatenation of: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need to better specify what "sort" means, and in particular, what collation ordering we want to use. Because:
$ echo -e "aa\na-a\nab\na-b" | LC_ALL=en_US.UTF-8 sort
a-a
aa
a-b
ab
$ echo -e "aa\na-a\nab\na-b" | LC_ALL=C.UTF-8 sort
a-a
a-b
aa
ab
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like Python's sort([str, ...])
, it's locale agnostic. See 378d1fd
(#100)
I'm happy with how this is looking now. The implementation in conda-build is passing the necessary tests. Any thoughts about this @wolfv @baszalmstra? Otherwise I'll probably start a vote soon once the @conda/builds-tools team is happy with the PR. |
We have not started an implementation yet but it seems straightforward enough. Id be happy to vote on this now. |
@conda/steering-council, I'd like to start a vote starting December the 4th (so we can pair it up with the ABI3 one). So, if you'd like to make any CEP-altering comments, this is your chance. Once the vote starts, we should only make editorial corrections. Thanks! |
|
||
## Motivation | ||
|
||
Build tools like `conda-build` and `rattler-build` need to fetch the source of the project being packaged. The integrity of the download is checked by comparing its known hash (usually SHA256) against the obtained file. If they don't match, an error is raised. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they don't match, an error is raised.
This doesn't say what build tools should do when the files can't be read or an error is raised during computation, and the implementation in https://github.com/conda/conda-build/pull/5277/files#r1861269213 shows that it ignores file read errors (which can have many reasons) and only creates the content hash with the name of the file. I'd think that's a failure state for a content hash algorithm that aims to be consistent since it would expose it to file permission attacks etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow how this could be weaponized as an attack vector. Let's say we have a directory with three files with (name, content)
:
('file1.txt', '123')
('file2.txt', '456')
('file3.txt', '789')
We end up computing the hash of these Utf-8 bytes:
file1.txtF123-file2.txtF456-file3.txtF789-
>>> import hashlib
>>> hashlib.md5(b"file1.txtF123-file2.txtF456-file3.txtF789-").hexdigest()
'54866bc311f08b2e082466b090cbe560'
If let's say file1.txt
changes permissions and becomes unreadable, then the string would be:
file1.txtF-file2.txtF456-file3.txtF789-
, which has a different hash.
If file1.txt
becomes a directory it would then be:
file1.txtD-file2.txtF456-file3.txtF789-
Same for an unknown file type (but ?
instead of D
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, let's say that we had an empty file, and then a new source adds some malicious content there, but makes it unreadable. They would have the same hash because we are not marking errors... The build script would just need to make it readable with chmod
later and then it can be weaponized. Ok, now I get it: we need to mark errors with a different separator: maybe E
. Right?
Thank you for reading along!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe we do need to error out, because an unreadable file might change contents arbitrarily and its hash would never change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See 77e35c0
Hello @conda/steering-council. Requesting a vote for this CEP. This vote falls under the "Enhancement Proposal Approval" policy of the conda governance policy, It needs 60% of the Steering Council to vote To vote, please use the following form to vote. This vote will end on 2024-12-18, End of Day, Anywhere on Earth (AoE). @xhochy (Uwe Korn)
@CJ-Wright (Christopher J. 'CJ' Wright)
@mariusvniekerk (Marius van Niekerk)
@chenghlee (Cheng H. Lee)
@ocefpaf (Filipe Fernandes)
@marcelotrevisani (Marcelo Duarte Trevisani)
@msarahan (Michael Sarahan)
@mbargull (Marcel Bargull)
@jakirkham (John Kirkham)
@jezdez (Jannis Leidel)
@wolfv (Wolf Vollprecht)
@jaimergp (Jaime Rodríguez-Guerra)
@kkraus14 (Keith Kraus)
@baszalmstra (Bas Zalmstra)
@beckermr (Matthew R. Becker)
@Hind-M (Hind Montassif)
@trallard (Tania Allard)
|
How much value is there in the utf8 and text normalization pieces. Presumable the hash of a prefix for different platforms will basically never be the same due to differences in packages. |
This is mostly directed at hashing the sources we fetch from remotes. It's usually the same source for all platforms. Not sure how useful this could be for hashing |
The vote is closed, and we have the following result:
Yes votes (10): No votes (0). Abstain votes (0). Not voted (7): Invalid votes (0). 10/17 votes = 58.8% does not meet the minimum quorum (60%), but timeout rules can be invoked because:
Thus, the affirmative votes are computed against the valid votes, which results in 10/10 YES votes. This is 100%, which is more than the required 60%. As per the governance, this means the vote passed the moment another @conda/steering-council member confirms that timeout quorum rules were met. |
Standardizes findings in conda/conda-build#5277.
🔎 Preview in Markdown