Initial implementation for Hybrid Hash Functions #91

Open

oluiscabral wants to merge 1 commit into main
Conversation

oluiscabral

Hello!

This pull request introduces a new implementation of the ResourceId trait using two different hash approaches:

  1. Blake3 for files bigger than a hardcoded threshold. This approach is used to improve performance and security when dealing with larger files.
  2. Fowler-Noll-Vo (FNV) hash for files smaller than the hardcoded threshold. For smaller files, this approach provides a faster hashing algorithm while still maintaining acceptable collision resistance.

The dev-hash/benches/hybrid.rs file is a modification of dev-hash/benches/blake3.rs, with the new Hybrid struct being used instead of the Blake3 struct. This allows us to compare and analyze the performance differences between the two approaches.

Note: I have not implemented tests for files larger than the threshold yet. This will be added in a future update. Please let me know if you have any suggestions or concerns regarding this approach.
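For reference, a minimal 64-bit FNV-1a sketch in the rough shape of the `fnv_hash_bytes` helper called later in the diff; the actual helper in this PR may use a different FNV variant or width:

```rust
/// Minimal 64-bit FNV-1a sketch; the PR's real `fnv_hash_bytes` may differ.
fn fnv_hash_bytes(bytes: &[u8]) -> u64 {
    const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

    let mut hash = FNV_OFFSET_BASIS;
    for &byte in bytes {
        hash ^= u64::from(byte); // FNV-1a: xor the byte first, then multiply
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}
```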

kirillt (Member) commented Oct 6, 2024

Hi @oluiscabral, I'm really glad to see this PR!

  1. First of all, big thanks for the good description of this PR!
  2. It's also great that you've included changes to the benchmark. Could you also add preliminary info about the performance boost for small files as a comment here?
  3. The original idea of hybrid functions was a bit different, but we can include both ideas. My idea, described in Asana, was that we could optimize hashing of huge files by using non-cryptographic hash functions for them. The particular hash function is not decided yet: it could be Blake3 on a subset of bytes (e.g. reading file contents with some step and then hashing only the scanned chunks; a rough sketch follows this comment), or it could simply be a faster hash function like CRC32 or FNV.
  4. But your idea addresses a valid and realistic usage scenario, too. Indeed, we also want to optimize hashing of small files. Our hybrid hash function could be composed of three different functions for different file sizes.
  5. However, for each size category, the chosen hash function should have as low a collision ratio as possible. We have ideas on how to let the end user work with collisions, but a proper implementation of such means will take time, so we'll ship initial prototypes with Blake3 enabled by default, and the hybrid function available somewhere in settings as an experimental mode. To reduce the collision ratio of non-cryptographic hash functions we can simply concatenate the file size and the hash, but this still gives around 0.1% collisions. It's also probably a good idea to create additional benchmarks measuring collision ratios for our hash functions on collections of files of different sizes.

In general, I've recently discovered the class of software we actually target with our framework: DAM (digital asset management), which could be used for categorizing various assets like photos, videos, and 3D models. So ideally it would be great to cover all file sizes. That doesn't mean the framework will be used only for DAM, but it's a pretty good reference because it requires meticulous work with every single file and its metadata like tags, scores, attributes, etc.
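As an illustration of the sampling idea from point 3, here is a rough sketch assuming the blake3 crate; the chunk size and step are placeholder values, not decided ones:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::Path;

const CHUNK: usize = 64 * 1024;    // bytes hashed per sample (illustrative)
const STEP: u64 = 8 * 1024 * 1024; // distance between samples (illustrative)

/// "Skip-chunk" sketch: hash only every STEP-th chunk of the file with
/// Blake3, mixing in the file size to lower the collision ratio.
fn skip_chunk_hash(path: &Path) -> std::io::Result<String> {
    let mut file = File::open(path)?;
    let size = file.metadata()?.len();

    let mut hasher = blake3::Hasher::new();
    hasher.update(&size.to_le_bytes());

    let mut buf = vec![0u8; CHUNK];
    let mut offset = 0u64;
    while offset < size {
        file.seek(SeekFrom::Start(offset))?;
        let n = file.read(&mut buf)?;
        hasher.update(&buf[..n]);
        offset += STEP;
    }

    Ok(hasher.finalize().to_hex().to_string())
}
```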


const THRESHOLD: u64 = 1024 * 1024 * 1024;
kirillt (Member)

A wild idea: is it difficult to make this constant a type parameter, so we could instantiate the same type with different thresholds? It would be really great to have benchmarks of the optimized "skip-chunks" hash function for different sizes. The goal of such benchmarks is not only to see the speed improvement, but also the collision ratio.

oluiscabral (Author)

Nope, it is not difficult. I just haven't done it yet, because I wanted to keep the implementation as similar as possible to the other implementations (Blake3 and CRC32) in this PoC.
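For illustration, one possible shape of that using a Rust const generic parameter; the names and derives here are illustrative, not the PR's actual definitions:

```rust
/// Hypothetical sketch: lift THRESHOLD into a const generic parameter so the
/// same type can be instantiated (and benchmarked) with different thresholds.
#[derive(Debug, PartialEq, Eq, Hash, Clone)]
pub struct Hybrid<const THRESHOLD: u64>(pub String);

// One gibibyte, matching the hardcoded constant in this PR.
type HybridDefault = Hybrid<{ 1024 * 1024 * 1024 }>;

// A smaller threshold, e.g. for a dedicated benchmark run.
type HybridSmall = Hybrid<{ 64 * 1024 * 1024 }>;

impl<const THRESHOLD: u64> Hybrid<THRESHOLD> {
    /// Which underlying algorithm a given input size would use.
    fn pick_algorithm(size: u64) -> &'static str {
        if size < THRESHOLD { "blake3" } else { "fnv" }
    }
}
```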

Comment on lines +96 to +109
if size < THRESHOLD {
    // Use Blake3 for small files
    log::debug!("Computing BLAKE3 hash for bytes");

    let mut hasher = Blake3Hasher::new();
    hasher.update(bytes);
    let hash = hasher.finalize();
    Ok(Hybrid(encode(hash.as_bytes())))
} else {
    // Use fnv hashing for large files
    log::debug!("Computing simple hash for bytes");

    let hash = fnv_hash_bytes(bytes);
    Ok(Hybrid(format!("{}_{}", size, hash)))
kirillt (Member)

  1. The original idea is the opposite: use Blake3 for small and medium files, and use a faster function for large files, where the size of the contents is large enough to make the collision ratio low enough.
  2. FNV hashing can be added separately as a dedicated hash function, same as the "skip-chunk" hash function.
  3. A wild idea: can we parameterize this hybrid hash function with other hash functions? Then we could compose two "dedicated" hash functions into a threshold-based hash function.

oluiscabral (Author)

  1. Yes, any file with a size below the THRESHOLD is already being hashed by Blake3.
  2. 100%
  3. Yes, totally. I'm not sure if there are higher-priority things to do before it, but we could even create a fully parameterized implementation that allows an arbitrary number of pairs, each composed of a hash function and its related threshold (a rough sketch follows below). I've done something similar to this in JavaScript once.
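For what it's worth, a hedged sketch of that fully parameterized idea: an ordered list of (size bound, hash function) pairs plus a fallback. Plain closures returning `String` stand in for the crate's `ResourceId` implementations (whose real signature isn't reproduced here), and `fnv_hash_bytes` is assumed to be the helper used elsewhere in this PR:

```rust
/// Illustrative only: compose several hash functions by size threshold.
type HashFn = Box<dyn Fn(&[u8]) -> String>;

struct ComposedHash {
    /// (upper size bound, hash function), sorted ascending by bound.
    tiers: Vec<(u64, HashFn)>,
    /// Used when the input is at least as large as every bound.
    fallback: HashFn,
}

impl ComposedHash {
    fn hash(&self, bytes: &[u8]) -> String {
        let size = bytes.len() as u64;
        for (bound, f) in &self.tiers {
            if size < *bound {
                return f(bytes);
            }
        }
        (self.fallback)(bytes)
    }
}

// Usage: Blake3 below 1 GiB, FNV (prefixed with the size) for anything larger.
fn hybrid_example(bytes: &[u8]) -> String {
    let hybrid = ComposedHash {
        tiers: vec![(
            1024 * 1024 * 1024,
            Box::new(|b: &[u8]| blake3::hash(b).to_hex().to_string()) as HashFn,
        )],
        // `fnv_hash_bytes` as used elsewhere in this PR.
        fallback: Box::new(|b: &[u8]| format!("{}_{}", b.len(), fnv_hash_bytes(b))),
    };
    hybrid.hash(bytes)
}
```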
