Releases: ekzhu/datasketch
Fix bug in storage
Fix a bug with UnorderedStorage.get_many (#56)
Fix bug in LSH Forest for Weighted MinHash
- Fix issue #35
- Test cases for checking consistency of hash value length in LSH.
Optional Redis storage requirement.
Thanks @vmarkovtsev
Redis storage layer for MinHash LSH
- Introduced a Redis storage layer for MinHash LSH. Thanks to @ae-foster
- Added __hash__ method for Lean MinHash.
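As a rough illustration of why a __hash__ method matters, the sketch below shows a hashable sketch-like object being deduplicated in a set. The class name TinySketch and its fields are hypothetical stand-ins, not datasketch's LeanMinHash, and the real implementation may hash different fields.

```python
class TinySketch:
    """Hypothetical stand-in for a hashable sketch; not datasketch's LeanMinHash."""

    def __init__(self, seed, hashvalues):
        self.seed = seed
        # A tuple is immutable, so it can safely take part in hashing.
        self.hashvalues = tuple(hashvalues)

    def __eq__(self, other):
        return (
            isinstance(other, TinySketch)
            and self.seed == other.seed
            and self.hashvalues == other.hashvalues
        )

    def __hash__(self):
        # Equal objects must hash equally, so hash the same fields __eq__ compares.
        return hash((self.seed, self.hashvalues))


a = TinySketch(1, [3, 7, 11])
b = TinySketch(1, [3, 7, 11])
c = TinySketch(1, [3, 7, 12])

# a and b are equal, so the set keeps only one of them.
unique = {a, b, c}
```

With __eq__ and __hash__ defined together, sketches can serve as dictionary keys or set members, which is convenient for deduplicating identical signatures.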
LSH Ensemble
- Added a slightly simplified version of LSH Ensemble that supports containment search with MinHash data sketches.
- Added an introduction to containment.
- Updated documentation.
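To make the difference between containment and Jaccard similarity concrete, here is a minimal sketch using exact Python sets rather than MinHash estimates. The function names are illustrative and are not part of the datasketch API.

```python
def jaccard(a, b):
    """Exact Jaccard similarity: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, candidate):
    """Exact containment of query in candidate: |Q & X| / |Q|."""
    query, candidate = set(query), set(candidate)
    return len(query & candidate) / len(query) if query else 0.0


query = {"a", "b", "c"}
candidate = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"}

j = jaccard(query, candidate)      # 3 shared / 12 in the union
c = containment(query, candidate)  # 3 shared / 3 in the query
```

The query is fully contained in the candidate, yet its Jaccard similarity is low because the candidate is much larger. Containment search is aimed at this kind of size mismatch.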
Consistent MinHash hash values across Python versions
MinHash now uses NumPy's random number generator instead of Python's built-in random. This makes MinHash generate consistent hash values across different Python versions.
The side effect is that MinHash objects created before version 1.1.3 will not work correctly (i.e., jaccard, merge, and union) with those created afterward.
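The incompatibility follows from how MinHash signatures are compared: two signatures are only comparable when they were built from the same permutation parameters. The sketch below is a simplified, standard-library-only illustration and not datasketch's implementation. It uses two different seeds of Python's random module as a stand-in for "parameters drawn by two different generators", and all function names are hypothetical.

```python
import hashlib
import random
import struct

MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1


def stable_hash(token):
    """32-bit hash taken from SHA-1, independent of Python's randomized str hash."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return struct.unpack("<I", digest[:4])[0]


def make_permutations(seed, num_perm):
    """Draw (a, b) pairs for h(x) = (a*x + b) mod p from a seeded generator."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]


def signature(tokens, perms):
    """Keep the minimum permuted hash value for each (a, b) pair."""
    hashed = [stable_hash(t) for t in tokens]
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashed)
        for a, b in perms
    ]


def estimate_jaccard(sig1, sig2):
    """Fraction of positions where the two signatures agree."""
    return sum(1 for x, y in zip(sig1, sig2) if x == y) / len(sig1)


tokens = ["minhash", "lsh", "forest", "ensemble"]
old_perms = make_permutations(seed=1, num_perm=64)
new_perms = make_permutations(seed=2, num_perm=64)  # stand-in for a different generator

# Same set, same parameters: the signatures agree at every position.
same = estimate_jaccard(signature(tokens, old_perms), signature(tokens, old_perms))
# Same set, different parameters: the estimate is no longer meaningful.
mixed = estimate_jaccard(signature(tokens, old_perms), signature(tokens, new_perms))
```

Even though both signatures describe the identical set, the mixed comparison does not recover a similarity of 1. This is the failure mode to expect when combining sketches from before and after the change.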
Introduce Lean MinHash and better documentation
- LeanMinHash is a subclass of MinHash. It uses less memory and allows faster (de)serialization. See the documentation for details.
- Removed serialize, deserialize, and bytesize methods from MinHash. These are supported in LeanMinHash instead.
- MinHash objects serialized before this version will not be deserialized properly. To migrate, see the linked migration notes.
- Documentation now has its own website!
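To give a feel for compact (de)serialization of a sketch, the following is a minimal sketch using the standard struct module. The byte layout (an 8-byte seed, a 4-byte count, then 32-bit hash values) is an assumption chosen for illustration and should not be read as datasketch's actual format. The function names are likewise hypothetical.

```python
import struct

HEADER = "<qi"  # assumed layout: little-endian 8-byte seed, 4-byte value count


def serialize(seed, hashvalues):
    """Pack the seed, the count, and 32-bit hash values into a bytes object."""
    fmt = "%s%dI" % (HEADER, len(hashvalues))
    return struct.pack(fmt, seed, len(hashvalues), *hashvalues)


def deserialize(buf):
    """Inverse of serialize: read the header, then that many 32-bit values."""
    seed, count = struct.unpack_from(HEADER, buf, 0)
    offset = struct.calcsize(HEADER)
    hashvalues = list(struct.unpack_from("<%dI" % count, buf, offset))
    return seed, hashvalues


buf = serialize(1, [5, 42, 4294967295])
restored = deserialize(buf)
```

A fixed binary layout like this stores only the seed and the hash values, which is why a lean representation can be smaller and faster to load than a general-purpose object dump.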
First stable release
After nearly 2 years of working on this project on and off, the API is now stable, and the features of the MinHash-related sketches are complete.
I will continue to add more data sketches and indexes.
MinHash LSH Forest
- MinHash LSH Forest implementation and benchmark using synthetic data
- Improve existing MinHash LSH benchmark using synthetic data for more tunable data distributions
- Improve MinHash and LSH performance
Windows compatibility
- Fixed Issue #4 - int overflow error on Windows platform
- Use Python's built-in random number generator for better MinHash accuracy