This repository contains all material used for the experiments of the following paper and is designed to enable anyone to reproduce them:
S. Irimescu, C. Berker Cikis, I. Müller, G. Fourny, G. Alonso. "Rumble: Data Independence for Large Messy Data Sets." In: PVLDB 14(4), 2020. DOI: 10.14778/3436905.3436910.
Each system used in the comparison has its own directory, all with the same structure: a subfolder with `singlecore` experiments and one with `cluster` experiments.
```
rumble/
    cluster/
        queries/
            ...
        deploy.sh
        upload.sh
        run.sh
        terminate.sh
    singlecore/
        queries/
            ...
        deploy.sh
        ...
        run_experiments.sh
zorba/
    ...
```
The flow for running the experiments is roughly the following (a sketch of the manual path follows the list):

- Configure the scripts, your local machine, and some cloud resources.
  - In particular, this includes the capitalized constants in the first few lines of the scripts, setting up the AWS CLI such that it has permissions and uses the correct region without additional flags, and creating some buckets and instance profiles.
- Generate the data (see below).
- Deploy the resources for the `singlecore` and/or `cluster` experiment of one system using the corresponding `deploy.sh`.
- Do one of the following:
  - Modify `run_experiments.sh` to run the desired subset of the configurations. Run `run_experiments.sh`.
  - Manually run the desired queries:
    - Run `echo "$filelist" | upload.sh` to upload the files stored in `$filelist`.
    - Run `cat queries/some-query.jq | run.sh` to run an individual query.
- Terminate the resources by running the corresponding `terminate.sh`.
- Run `make -f path/to/common/make.mk -C results/results_date-of-experiment/` to parse the log files and produce `result.jsonl` with the statistics of all runs.
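To make the manual path concrete, here is a minimal sketch of one cluster run of the Rumble system. It assumes the scripts are invoked from inside `rumble/cluster/` and that the capitalized constants in the scripts have already been configured; the file list, query name, and data paths below are placeholders, and the exact input format expected by `upload.sh` is an assumption.

```bash
cd rumble/cluster/

./deploy.sh                            # bring up the cloud resources for this system

# $filelist holds the data files to upload; local paths are assumed here,
# check upload.sh for the exact format it expects.
filelist="data/part-0001.json
data/part-0002.json"
echo "$filelist" | ./upload.sh

# Run a single query (some-query.jq is a placeholder for any file in queries/).
cat queries/some-query.jq | ./run.sh

./terminate.sh                         # tear the cloud resources down again

# Parse the logs of the run into result.jsonl with the per-run statistics.
make -f path/to/common/make.mk -C results/results_date-of-experiment/
```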
We use a "prefix" of a sample for the single-core experiments and a "prefix" of the full data set for the cluster experiments. To download the sample or the full data set, use `datasets/github/download-{sample,full}.sh`.
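For example (assuming the scripts are run from the repository root and write their output to locations defined inside the scripts):

```bash
# Download the sample used for the single-core experiments ...
./datasets/github/download-sample.sh
# ... or the full data set used for the cluster experiments.
./datasets/github/download-full.sh
```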
We use the scripts of the original authors to download the data set and convert it to XML. Then we use `datasets/vxquery-weather/convert.sh` (which is based on a query by the original authors) to convert it to JSON.
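A rough sketch of that pipeline is shown below; whether `convert.sh` takes input/output paths as arguments or reads constants defined inside the script is an assumption, so check the script before running it.

```bash
# Step 1: download the raw data and convert it to XML with the original authors' scripts.
# Step 2: convert the XML to JSON. The argument-less invocation is an assumption;
#         convert.sh may instead expect explicit input/output paths.
./datasets/vxquery-weather/convert.sh
```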
The `extract_prefix.sh` and `extract_prefix_s3.sh` scripts help in producing a subset of each data set and uploading it to S3.
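For illustration, an invocation might look as follows; the arguments (input file, prefix size, S3 target) and the bucket name are hypothetical, so consult the scripts for their actual interface.

```bash
# Hypothetical arguments -- check extract_prefix.sh and extract_prefix_s3.sh for the real interface.
./extract_prefix.sh    datasets/github/sample.json 1000000 > github-prefix.json
./extract_prefix_s3.sh datasets/github/sample.json 1000000 s3://my-bucket/github-prefix/
```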