This repository contains all material used for the experiments of the following paper and is designed to enable anyone to reproduce them:
S. Irimescu, C. Berker Cikis, I. Müller, G. Fourny, G. Alonso. "Rumble: Data Independence for Large Messy Data Sets." In: PVLDB 14(4), 2020. DOI: 10.14778/3436905.3436910.
Each system used in the comparison has its own directory, all with the same structure: a subfolder with `singlecore` experiments and one with `cluster` experiments.
```
rumble/
    cluster/
        queries/
            ...
        deploy.sh
        upload.sh
        run.sh
        terminate.sh
    singlecore/
        queries/
            ...
        deploy.sh
        ...
        run_experiments.sh
zorba/
    ...
```
The flow for running the experiments is roughly the following (a sketch of the manual path follows the list):

- Configure the scripts, your local machine, and some cloud resources.
  - In particular, this includes the capitalized constants in the first few lines of the scripts, setting up the AWS CLI such that it has permissions and uses the correct region without additional flags, and creating some buckets and instance profiles.
- Generate the data (see below).
- Deploy the resources for the `singlecore` and/or `cluster` experiment of one system using the corresponding `deploy.sh`.
- Do one of the following:
  - Modify `run_experiments.sh` to run the desired subset of the configurations. Run `run_experiments.sh`.
  - Manually run the desired queries:
    - Run `echo "$filelist" | upload.sh` to upload the files stored in `$filelist`.
    - Run `cat queries/some-query.jq | run.sh` to run an individual query.
- Terminate the resources by running the corresponding `terminate.sh`.
- Run `make -f path/to/common/make.mk -C results/results_date-of-experiment/` to parse the log files and produce `result.jsonl` with the statistics of all runs.
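To make the manual path concrete, here is a minimal sketch of one cluster run of the Rumble system. It assumes the scripts are invoked from inside `rumble/cluster/` and that the capitalized constants in the scripts have already been configured; the file list, query name, and data paths below are placeholders, and the exact input format expected by `upload.sh` is an assumption.

```bash
cd rumble/cluster/

./deploy.sh                            # bring up the cloud resources for this system

# $filelist holds the data files to upload; local paths are assumed here,
# check upload.sh for the exact format it expects.
filelist="data/part-0001.json
data/part-0002.json"
echo "$filelist" | ./upload.sh

# Run a single query (some-query.jq is a placeholder for any file in queries/).
cat queries/some-query.jq | ./run.sh

./terminate.sh                         # tear the cloud resources down again

# Parse the logs of the run into result.jsonl with the per-run statistics.
make -f path/to/common/make.mk -C results/results_date-of-experiment/
```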
We use a "prefix" of a sample for the single-core experiments and a "prefix" of the full data set for the cluster experiments. To download the sample or the full data set, use `datasets/github/download-{sample,full}.sh`.
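For example (assuming the scripts are run from the repository root and write their output to locations defined inside the scripts):

```bash
# Download the sample used for the single-core experiments ...
./datasets/github/download-sample.sh
# ... or the full data set used for the cluster experiments.
./datasets/github/download-full.sh
```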
We use the scripts of the original authors to download the data set and convert it to XML. Then we use `datasets/vxquery-weather/convert.sh` (which is based on a query by the original authors) to convert it to JSON.
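A rough sketch of that pipeline is shown below; whether `convert.sh` takes input/output paths as arguments or reads constants defined inside the script is an assumption, so check the script before running it.

```bash
# Step 1: download the raw data and convert it to XML with the original authors' scripts.
# Step 2: convert the XML to JSON. The argument-less invocation is an assumption;
#         convert.sh may instead expect explicit input/output paths.
./datasets/vxquery-weather/convert.sh
```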
The `extract_prefix.sh` and `extract_prefix_s3.sh` scripts help in producing a subset of each data set and uploading it to S3.
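For illustration, an invocation might look as follows; the arguments (input file, prefix size, S3 target) and the bucket name are hypothetical, so consult the scripts for their actual interface.

```bash
# Hypothetical arguments -- check extract_prefix.sh and extract_prefix_s3.sh for the real interface.
./extract_prefix.sh    datasets/github/sample.json 1000000 > github-prefix.json
./extract_prefix_s3.sh datasets/github/sample.json 1000000 s3://my-bucket/github-prefix/
```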