This repo hosts the code used in the paper "Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?", to be presented at ACM SIGMOD 2019.
We have implemented various search methods and a benchmarking framework that measures their performance. We measure the time to search for a record in a dataset. Each record consists of a key and a payload; the keys are 8-byte integers, while the payload size can vary across datasets. The benchmarking framework verifies that the results returned by each search algorithm are correct.
Using the benchmarking framework we can easily control the following parameters:
- size of the dataset (minimum size: 1,000 records)
- distribution of the keys
- parameter of the distribution (described in detail later)
- search algorithm (Interpolation Search, SIP, TIP, Binary Search, ...)
- record size: each record contains an 8-byte key and a payload; the record size parameter specifies the size of the entire record. Valid options: 8, 32, 128 (see the layout sketch below)
- number of threads; the search for a single record is handled by one thread
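For example, RecordSizeBytes = 32 means an 8-byte key followed by a 24-byte payload. A minimal sketch of that layout (illustrative only; the repo's actual C++ record types may differ):

```python
import numpy as np

# Illustrative layout for RecordSizeBytes == 32: an 8-byte integer key
# followed by a payload that pads the record to 32 bytes in total.
record32 = np.dtype([("key", np.uint64), ("payload", np.uint8, 24)])
assert record32.itemsize == 32
```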
To measure the performance of an algorithm on a given dataset, we randomly permute the keys contained in the dataset and search for all of them; this is called a run. Within each run we measure the execution time to search for each subset of 1,000 keys. To compare algorithms we use the time required to search for one record, calculated as the average over the per-subset times.
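For instance, a run over a dataset of 2,000 keys consists of two 1,000-key subsets, and the reported per-record time is the mean of the per-subset measurements (the numbers below are purely illustrative):

```python
# Per-record search time for one run, averaged over its 1,000-key subsets.
subset_times_ns = [130.0, 120.0]
time_per_record_ns = sum(subset_times_ns) / len(subset_times_ns)  # 125.0
```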
In our paper we used clang 5 to compile our code; to replicate our experiments we provide the option to download and use the same version of clang. Alternatively, you can use gcc.
- The installation scripts are validated on Ubuntu 16.04 and 18.04.

To build with clang:
- Run the install script "install_plus_clang.sh", which downloads clang 5, OpenMP, python3 and pandas. clang 5 is downloaded inside this repo.
- Run "make clang5_searchbench" to produce the "searchbench" executable.

To build with gcc:
- Run the install script "install.sh", which downloads OpenMP, python3 and pandas.
- Run "make" to produce the "searchbench" executable.
The "searchbench" execution expects one argument, a tsv file of experiments. Each line of the tsv specifies the parameters of one experiment, as described in Introduction.
Example "experiments.tsv" :
$ cat experiments.tsv
DatasetSize Distribution Parameter SearchAlgorithm RecordSizeBytes #threads
2000 uniform 42 bs 8 1
2000 uniform 42 sip 8 1
"searchbench" runs each experiment and reports the time required to search each subset of 1000 records, in nanoseconds.
$ ./searchbench experiments.tsv
Loading Dataset size:2000, distribution: uniform, distribution parameter: 42
Running experiment: 2000 uniform 42 8 bs 1
Run DatasetSize Distribution Parameter #threads SearchAlgorithm RecordSizeBytes TimeNS
0 2000 uniform 42 1 bs 8 130.34
0 2000 uniform 42 1 bs 8 120.16
Running experiment: 2000 uniform 42 8 sip 1
Run DatasetSize Distribution Parameter #threads SearchAlgorithm RecordSizeBytes TimeNS
1 2000 uniform 42 1 sip 8 77.27
1 2000 uniform 42 1 sip 8 65.243
We provide a helper script implemented in Python, "getTimes.py", which runs "searchbench" with the file named "experiments.tsv" as input and reports, for each run, the time to search for one record, calculated as described in the Performance Evaluation section. This is the easiest way to benchmark different search methods. "getTimes.py" can be easily modified to report more statistics for each experiment.
$ python3 getTimes.py
make: 'searchbench' is up to date.
Time to search one record:
Run DatasetSize Distribution Parameter #threads SearchAlgorithm RecordSizeBytes
0 2000 uniform 42 1 bs 8 112.695
1 2000 uniform 42 1 sip 8 64.251
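A minimal sketch of this kind of post-processing is shown below; it assumes the result rows are whitespace-separated, as in the sample output above, and is not the actual getTimes.py:

```python
import io
import subprocess
import pandas as pd

# Run searchbench and capture its stdout (assumes "searchbench" and
# "experiments.tsv" are in the current directory).
out = subprocess.run(["./searchbench", "experiments.tsv"],
                     capture_output=True, text=True, check=True).stdout

# Keep only the result rows (they start with the run number), re-attach a
# single header, then average the per-subset times for each run.
header = "Run DatasetSize Distribution Parameter #threads SearchAlgorithm RecordSizeBytes TimeNS"
rows = [line for line in out.splitlines() if line and line[0].isdigit()]
df = pd.read_csv(io.StringIO("\n".join([header] + rows)), sep=r"\s+")
print(df.groupby(["Run", "SearchAlgorithm"])["TimeNS"].mean())
```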
The experiments tsv file must include the following header:
DatasetSize Distribution Parameter SearchAlgorithm RecordSizeBytes #threads
All values must be tab-separated.
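If you generate the file programmatically, make sure the fields are joined with real tab characters; for example:

```python
# Write a minimal experiments.tsv; every field must be separated by a tab.
rows = [
    ["DatasetSize", "Distribution", "Parameter", "SearchAlgorithm",
     "RecordSizeBytes", "#threads"],
    ["2000", "uniform", "42", "bs", "8", "1"],
    ["2000", "uniform", "42", "sip", "8", "1"],
]
with open("experiments.tsv", "w") as f:
    for row in rows:
        f.write("\t".join(row) + "\n")
```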
A dataset is identified by its name and its parameter(s), as described in the following table:

Dataset | Parameter |
---|---|
uniform | seed for the random generator (integer) |
gap | seed for the random generator (integer), gap parameter (double from 0.1 to 1.0) |
fal | shape parameter (double) |
cfal | shape parameter (double) |
file | path of file |
When the dataset is "file" then the file identified by "path of file" specifies the keys that will be used in the dataset. When a dataset from a file is used the DatasetSize parameter does not affect the size of the dataset.The file should contain one key per line. Examples of dataset file can be found in the src/datasets folder.
For an explanation of the parameters and datasets, please refer to our paper: "Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?".
All datasets that are not loaded from a file are generated by the benchmarking framework. We have built a utility that outputs the keys used in the datasets. Build it with "make dump" and run it with "dump experiments.tsv"; the output will be the keys of the records of the datasets specified in "experiments.tsv".
The algorithms we have implemented are:
SearchAlgorithm | Description |
---|---|
is | Interpolation Search |
bs | Binary Search |
sip | SIP - Slope Reuse Interpolation Search |
tip | TIP - Three Point Interpolation Search |
isseq | Interpolation Sequential Search |
b-eyt | Search using the "Eytzinger" format |
b-eyt-p | Search using the "Eytzinger" format with prefetch |
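To illustrate the idea behind the interpolation-based methods, here is a conceptual Python sketch of plain interpolation search ("is"); the repo's C++ implementations, and SIP's slope reuse and TIP's three-point interpolation in particular, are considerably more optimized:

```python
def interpolation_search(keys, target):
    """Conceptual interpolation search over a sorted list of integer keys.

    Unlike binary search, which always probes the middle, interpolation
    search probes where the target would be if the keys were spread
    uniformly between the current bounds.
    """
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[lo] == keys[hi]:
            return lo if keys[lo] == target else -1
        # Linearly interpolate the expected position of the target.
        pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

On uniformly distributed keys, interpolation search needs O(log log n) probes on average, versus O(log n) for binary search; SIP and TIP refine how, and how often, the interpolated position is computed.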
- If searchbench does not produce output for an experiment, the most probable cause is that a parameter is not tab-separated in "experiments.tsv".
- The results of experiments with small datasets can have significant variance due to uncontrollable hardware factors (e.g., caching). We suggest repeating such experiments multiple times to get stable results.
Please consult the README in the src folder for more information on how the code is structured.