aka Probabilistic data structures for mining in data streams, in pure Python.
python setup.py install
HyperLogLog
Original paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
More on: http://research.neustar.biz/tag/hyperloglog/
Usage:
from sketches import HyperLogLog
h = HyperLogLog(10)
for i in range(100000):
    h.add(i)
print(h.estimate())
> 99860.5333365
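To make the estimate above less of a black box, here is a minimal sketch of the HyperLogLog idea from the paper, not the code shipped in `sketches`. The class name `TinyHLL`, the SHA-1 based 64-bit hash, and the reading of the constructor argument as "number of index bits" are all assumptions made for illustration.

```python
import hashlib
import math


class TinyHLL:
    """Illustrative HyperLogLog (hypothetical; not the `sketches` class)."""

    def __init__(self, p=10):
        self.p = p                      # index bits (assumed meaning of the argument)
        self.m = 1 << p                 # number of registers, 2**p
        self.registers = [0] * self.m
        # Bias-correction constant from the paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Assumption: a 64-bit hash taken from SHA-1 of the item's string form.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> (64 - self.p)                  # top p bits select a register
        rest = x & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        # Harmonic mean of 2**register, scaled by alpha * m**2.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

Under that reading of the argument, `TinyHLL(10)` keeps 1024 registers, and the paper's standard error of roughly 1.04 / sqrt(m) comes to about 3%. Adding an item that was already seen never changes a register, which is why duplicates do not inflate the estimate.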
Count-Min sketch
Original paper: here
More on: https://sites.google.com/site/countminsketch/
Usage:
import numpy as np
from sketches import CountMin
s = CountMin(10, 10)
data = np.random.zipf(2, 10000)
for v in data:
    s.add(v)
print(s.estimate(1))
> 6130.0
print(len([x for x in data if x == 1]))
> 6110
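The small overshoot above (6130 estimated against 6110 actual) is typical of a Count-Min sketch: hash collisions can only add to a counter, so the estimate is never below the true count. A minimal sketch of the structure follows, not the code shipped in `sketches`. The class name `TinyCountMin`, the `width` and `depth` parameter names, and the SHA-1 based row hashes are assumptions for illustration; the README does not say what the two arguments of `CountMin(10, 10)` mean.

```python
import hashlib


class TinyCountMin:
    """Illustrative Count-Min sketch (hypothetical; not the `sketches` class)."""

    def __init__(self, width, depth):
        self.width = width              # counters per row
        self.depth = depth              # number of rows, one hash function each
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # Assumption: one hash per row, derived by salting SHA-1 with the row index.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The smallest of the item's counters is the least-polluted one.
        return min(self.table[row][col] for row, col in self._cells(item))
```

A wider table reduces collisions within a row, and more rows make it less likely that every counter for an item is polluted, at the cost of more memory and more hashing per update.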
- HLL improvements:
  - HLL++
  - Sliding window HLL
- Count-Mean-Min
- Stream-Summary
- Min-Hash
- Bloom filter
- Frugal sketches