N2 Benchmark explanation

Benchmark Focus

First, the new approximate nearest neighborhoods algorithm should run faster for large datasets. Second, online content services like news portal, where dataset is frequently updated(e.g. create/update/delete), building the index file should be done in minimum time. Therefore, these are our main criteria:

How long does it take to build the index file?
How long does it take to get results from the large dataset?
How large memory does it take to run large dataset?

Test dataset

Dataset description

The dataset consists of two files. youtube.txt contains 14520986 samples, each sample has 40 data points.

feature1(float32)	feature2(float32)	……	feature2(float32)	feature40(float32)
-0.167898	0.160478	……	0.104421	0.0503584

youtube.txt.vids is a metadata informations of the dataset. Each line is the metadata corresponding to each sample of youtube.txt.

DSID	VID	Youtube link
34XnPr4YKpo8wE_mEl	Z1Jilm0TZHY	http://www.youtube.com/watch?v=Z1Jilm0TZHY

How to get it

We share our dataset through google drive.

Index build times

Generally N2 is faster than the nmslib to build index file. Compared to the annoy, N2 begins to show the similar performance of the annoy when N2 uses 5 threads, and from then on it shows a faster build performance than the annoy.

Library	1 thread	5 threads	10 threads	20 threads
N2 (3.1Gb)	4505.3669540 9 sec	1002.475105 05 sec	591.6419599 06 sec	478.1210601 33 sec
nmslib (3.4Gb)	7130.7202520 4 sec	1453.570172 07 sec	826.9151070 12 sec	602.1200079 92 sec
annoy (4.4Gb)	915.41107487 7 sec	915.4110748 77 sec	915.4110748 77 sec	915.4110748 77 sec

Search speed

Library	Search time	Accuracy
Linear search (numpy based)	0.358749273825 sec	1.0
N2 (efCon = 100, efSearch = 10)	2.98758983612e-05 sec	0.054243
N2 (efCon = 100, efSearch = 100)	0.000128486037254 sec	0.48313
N2 (efCon = 100, efSearch = 1000)	0.000824773144722 sec	0.840634
N2 (efCon = 100, efSearch = 10000)	0.00720949418545 sec	0.926739
N2 (efCon = 100, efSearch = 100000)	0.0763142487288 sec	0.940606
Nmslib (efCon = 100, efSearch = 10)	9.8201584816e-05 sec	0.226192
Nmslib (efCon = 100, efSearch = 100)	0.000225761222839 sec	0.672228
Nmslib (efCon = 100, efSearch = 1000)	0.00140970699787 sec	0.882695
Nmslib (efCon = 100, efSearch = 10000)	0.0143689704418 sec	0.935395
Nmslib (efCon = 100, efSearch = 100000)	0.159999159241 sec	0.94283
Annoy(n_trees=10, search_k=7)	4.04834747314e-05 sec	0.05471
Annoy(n_trees=10, search_k=3000)	0.00096510682106 sec	0.481099
Annoy(n_trees=10, search_k=50000)	0.0144059297085 sec	0.835895
Annoy(n_trees=10, search_k=200000)	0.053891249156 sec	0.918569
Annoy(n_trees=10, search_k=500000)	0.108285815144 sec	0.940851

Overall, we can see that N2 has a much higher accuracy than the annoy, and N2 has better performance than the other two libraries at high precision points.

Memory usage

Library	Peak memory usage	Search time peak memory usage
N2	5360.93750Mb	3441.13281Mb
annoy	5360.89844Mb	3441.09375Mb
nmslib	5360.97656Mb	3441.17188 Mb

The three libraries do not show much difference in memory usage.

Conclusion

In short, on multi-core CPU, N2 performs best. annoy is a good choice for small datasets that can be handled by a single thread. However, when dataset is large, where high indexing performance is critical, N2 is where to go. N2 runs almost 2x faster than annoy. When high precision is required, both nmslib and N2 are good.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

benchmark.rst

benchmark.rst

N2 Benchmark explanation

Benchmark Focus

Test dataset

Dataset description

How to get it

Index build times

Search speed

Memory usage

Conclusion

Files

benchmark.rst

Latest commit

History

benchmark.rst

File metadata and controls

N2 Benchmark explanation

Benchmark Focus

Test dataset

Dataset description

How to get it

Index build times

Search speed

Memory usage

Conclusion