As datasets become larger and more high-dimensional, it becomes increasingly important to find data representations that allow compact storage and efficient distance computation and retrieval. Improved Consistent Weighted Sampling(ICWS, ICDM, 2010) is the state-of-the-art of the sampling methods in weighted sets. There are some variants of ICWS which decreases the time of sampling or increase the quality of samples.
This repository implements ICWS and its variants.
$ git clone git@github.com:yuuun/Consistent-Weighted-Sampling.git
$ cd Consistent-Weighted-Sampling
$ pip3 install -r requirements.txt
$ mkdir dataset
After Downloading datasets to 'dataset/' file, you can run the file by the following command
$ python3 main.py
- Datasets can be downloaded in LIBSVM
[label1] [idx1_1]:[weight1_1] [idx1_2]:[weight1_2] ...
[label2] [idx2_1]:[weight2_1] [idx2_2]:[weight2_2] ...
[label3] [idx3_1]:[weight3_1] [idx3_2]:[weight3_2] ...
1 1:5 3:7 10:9
2 2:6 3:1 8:10
- ICWS
- state of the CWS method that improves the effectiveness and efficiency
- 0-bit CWS
- improves the space efficiency and GJS estimation time
- CCWS
- improves the time efficiency of ICWS by removing the cacluation of logarithm
- PCWS
- replace one of the two gamma distributions with uniform distribution which improves both time and space complexities in ICWS
- I2CWS
- hashes yk and zk separately which makes k' and yk' dependent
- BCWS
- applies OPH on CWS and increased the time efficiency dramatically
Classification Accuracy | |
---|---|
Mnist | 93.6% |
-
Hashing Time
-
Classification Accuracy
-
Precision(the criteria of Jaccard Similarity)
-
The experiments will be depicted as follows,
[Name of Method] | [Dataset]([Number of Samples])
Method | Mnist(100) | Mnist(500) |
---|---|---|
ICWS | 175.4s | 869.6s |
0-bit CWS | 169.3s | 842.1s |
CCWS | 129.2s | 647.4s |
PCWS | 154.4s | 767.0s |
I2CWS | 162.3s | 805.7s |
BCWS | 6.3s | 17.9s |
Method | Mnist(100) | Mnist(500) |
---|---|---|
ICWS | 88.7% | 92.6% |
0-bit CWS | 90.7% | 94.1% |
CCWS | 79.3% | 81.8% |
PCWS | 90.6% | 93.2% |
I2CWS | 89.7% | 92.5% |
BCWS | 84.3% | 84.4% |
Method | Mnist(100) | Mnist(500) |
---|---|---|
ICWS | 59.9% | 80.4% |
0-bit CWS | 61.6% | 80.5% |
CCWS | 35.43% | 40.8% |
PCWS | 60.9% | 79.5% |
I2CWS | 59.2% | 78.8% |
BCWS | 46.4% | 46.4% |
Method | Mnist(100) | Mnist(500) |
---|---|---|
ICWS | 71.1% | 86.8% |
0-bit CWS | 71.8% | 86.3% |
CCWS | 48.7% | 54.8% |
PCWS | 70.5% | 85.8% |
I2CWS | 69.9% | 84.8% |
BCWS | 57.5% | 52.8% |
- ICWS: Improved consistent sampling, weighted minhash and l1 sketching, Sergey Ioffe, ICDM, 2010
- 0-bit CWS: 0-bit consistent weighted sampling, Ping Li, KDD, 2015
- CCWS: Canonical consistent weighted sampling for real-value weighted min-hash, Wei Wu, NIPS, 2016
- PCWS: Consistent weighted sampling mad more practical, Wei Wu, WWW, 2016
- I2CWS: Improved consistent weighted sampling revisited, Wei Wu, WWW, 2017
- BCWS: Re-randomized densification for one permutation hashing and bin-wise consistent weighted sampling, Ping Li, NIPS, 2019