Initial implementation for Hybrid Hash Functions #91

Open

oluiscabral wants to merge 1 commit into main
Conversation

oluiscabral

Hello!

This pull request introduces a new implementation of the ResourceId trait using two different hash approaches:

  1. Blake3 for files bigger than a hardcoded threshold. This approach is used to improve performance and security when dealing with larger files.
  2. Fowler-Noll-Vo (FNV) hash for files smaller than the hardcoded threshold. For smaller files, this approach provides a faster hashing algorithm while still maintaining acceptable collision resistance.

The dev-hash/benches/hybrid.rs file is a modification of dev-hash/benches/blake3.rs, with the new Hybrid struct being used instead of the Blake3 struct. This allows us to compare and analyze the performance differences between the two approaches.

Note: I have not implemented tests for files larger than the threshold yet. This will be added in a future update. Please let me know if you have any suggestions or concerns regarding this approach.
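For reference, a minimal 64-bit FNV-1a sketch in the rough shape of the `fnv_hash_bytes` helper called later in the diff; the actual helper in this PR may use a different FNV variant or width:

```rust
/// Minimal 64-bit FNV-1a sketch; the PR's real `fnv_hash_bytes` may differ.
fn fnv_hash_bytes(bytes: &[u8]) -> u64 {
    const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

    let mut hash = FNV_OFFSET_BASIS;
    for &byte in bytes {
        hash ^= u64::from(byte); // FNV-1a: xor the byte first, then multiply
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}
```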

kirillt (Member) commented Oct 6, 2024

Hi @oluiscabral, I'm really glad to see this PR!

  1. First of all, big thanks for the good description of this PR!
  2. It's also great that you've included changes to the benchmark. Could you also add preliminary info about the performance boost for small files as a comment here?
  3. The original idea of hybrid functions was a bit different, but we can include both ideas. My idea, described in Asana, was that we could optimize hashing of huge files by using non-cryptographic hash functions for them. The particular hash function is not decided yet: it could be Blake3 on a subset of bytes (e.g. reading file contents with some step and then hashing only the scanned chunks; a rough sketch follows this comment), or it could simply be a faster hash function like CRC32 or FNV.
  4. But your idea addresses a valid and realistic usage scenario, too. Indeed, we also want to optimize hashing of small files. Our hybrid hash function could be composed of three different functions for different file sizes.
  5. However, for each size category, the chosen hash function should have as low a collision ratio as possible. We have ideas on how to let the end user work with collisions, but a proper implementation of such means will take time, so we'll ship initial prototypes with Blake3 enabled by default, and the hybrid function available somewhere in settings as an experimental mode. To reduce the collision ratio of non-cryptographic hash functions we can simply concatenate the file size and the hash, but this still gives around 0.1% collisions. It's also probably a good idea to create additional benchmarks measuring collision ratios for our hash functions on collections of files of different sizes.

In general, I've recently discovered the class of software we actually target with our framework: DAM (digital asset management), which could be used for categorizing various assets like photos, videos, and 3D models. So ideally it would be great to cover all file sizes. That doesn't mean the framework will be used only for DAM, but it's a pretty good reference because it requires meticulous work with every single file and its metadata like tags, scores, attributes, etc.
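As an illustration of the sampling idea from point 3, here is a rough sketch assuming the blake3 crate; the chunk size and step are placeholder values, not decided ones:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::Path;

const CHUNK: usize = 64 * 1024;    // bytes hashed per sample (illustrative)
const STEP: u64 = 8 * 1024 * 1024; // distance between samples (illustrative)

/// "Skip-chunk" sketch: hash only every STEP-th chunk of the file with
/// Blake3, mixing in the file size to lower the collision ratio.
fn skip_chunk_hash(path: &Path) -> std::io::Result<String> {
    let mut file = File::open(path)?;
    let size = file.metadata()?.len();

    let mut hasher = blake3::Hasher::new();
    hasher.update(&size.to_le_bytes());

    let mut buf = vec![0u8; CHUNK];
    let mut offset = 0u64;
    while offset < size {
        file.seek(SeekFrom::Start(offset))?;
        let n = file.read(&mut buf)?;
        hasher.update(&buf[..n]);
        offset += STEP;
    }

    Ok(hasher.finalize().to_hex().to_string())
}
```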


const THRESHOLD: u64 = 1024 * 1024 * 1024;
kirillt (Member)

A wild idea: is it difficult to make this constant a type parameter, so we could instantiate the same type with different thresholds? It would be really great to have benchmarks of the optimized "skip-chunks" hash function for different sizes. The goal of such benchmarks is not only to see the speed improvement, but also the collision ratio.

oluiscabral (Author)

Nope, it is not difficult. I just haven't done it yet, because I wanted to keep the implementation as similar as possible to the other implementations (Blake3 and CRC32) in this PoC.
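For illustration, one possible shape of that using a Rust const generic parameter; the names and derives here are illustrative, not the PR's actual definitions:

```rust
/// Hypothetical sketch: lift THRESHOLD into a const generic parameter so the
/// same type can be instantiated (and benchmarked) with different thresholds.
#[derive(Debug, PartialEq, Eq, Hash, Clone)]
pub struct Hybrid<const THRESHOLD: u64>(pub String);

// One gibibyte, matching the hardcoded constant in this PR.
type HybridDefault = Hybrid<{ 1024 * 1024 * 1024 }>;

// A smaller threshold, e.g. for a dedicated benchmark run.
type HybridSmall = Hybrid<{ 64 * 1024 * 1024 }>;

impl<const THRESHOLD: u64> Hybrid<THRESHOLD> {
    /// Which underlying algorithm a given input size would use.
    fn pick_algorithm(size: u64) -> &'static str {
        if size < THRESHOLD { "blake3" } else { "fnv" }
    }
}
```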

Comment on lines +96 to +109
if size < THRESHOLD {
    // Use Blake3 for small files
    log::debug!("Computing BLAKE3 hash for bytes");

    let mut hasher = Blake3Hasher::new();
    hasher.update(bytes);
    let hash = hasher.finalize();
    Ok(Hybrid(encode(hash.as_bytes())))
} else {
    // Use fnv hashing for large files
    log::debug!("Computing simple hash for bytes");

    let hash = fnv_hash_bytes(bytes);
    Ok(Hybrid(format!("{}_{}", size, hash)))
kirillt (Member)

  1. The original idea is the opposite: use Blake3 for small and medium files, and use a faster function for large files, where the size of the contents is large enough to make the collision ratio low enough.
  2. FNV hashing can be added separately as a dedicated hash function, same as the "skip-chunk" hash function.
  3. A wild idea: can we parameterize this hybrid hash function with other hash functions? Then we could compose two "dedicated" hash functions into a threshold-based hash function.

oluiscabral (Author)

  1. Yes, any file with a size below the THRESHOLD is already being hashed by Blake3.
  2. 100%
  3. Yes, totally. I'm not sure if there are higher-priority things to do before it, but we could even create a fully parameterized implementation that allows an arbitrary number of pairs, each composed of a hash function and its related threshold (a rough sketch follows below). I've done something similar to this in JavaScript once.
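For what it's worth, a hedged sketch of that fully parameterized idea: an ordered list of (size bound, hash function) pairs plus a fallback. Plain closures returning `String` stand in for the crate's `ResourceId` implementations (whose real signature isn't reproduced here), and `fnv_hash_bytes` is assumed to be the helper used elsewhere in this PR:

```rust
/// Illustrative only: compose several hash functions by size threshold.
type HashFn = Box<dyn Fn(&[u8]) -> String>;

struct ComposedHash {
    /// (upper size bound, hash function), sorted ascending by bound.
    tiers: Vec<(u64, HashFn)>,
    /// Used when the input is at least as large as every bound.
    fallback: HashFn,
}

impl ComposedHash {
    fn hash(&self, bytes: &[u8]) -> String {
        let size = bytes.len() as u64;
        for (bound, f) in &self.tiers {
            if size < *bound {
                return f(bytes);
            }
        }
        (self.fallback)(bytes)
    }
}

// Usage: Blake3 below 1 GiB, FNV (prefixed with the size) for anything larger.
fn hybrid_example(bytes: &[u8]) -> String {
    let hybrid = ComposedHash {
        tiers: vec![(
            1024 * 1024 * 1024,
            Box::new(|b: &[u8]| blake3::hash(b).to_hex().to_string()) as HashFn,
        )],
        // `fnv_hash_bytes` as used elsewhere in this PR.
        fallback: Box::new(|b: &[u8]| format!("{}_{}", b.len(), fnv_hash_bytes(b))),
    };
    hybrid.hash(bytes)
}
```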
