This repository contains implementations of High-Energy Physics (HEP) analysis queries from the IRIS-HEP benchmark, written in SQL to be run on PrestoDB.
The purpose of this repository is to study the suitability of SQL for HEP analyses and to serve as a use case for improving database technologies. While SQL has often been considered unsuited for HEP analyses, we believe that the support for arrays and structured types introduced in SQL:1999 makes SQL actually a rather good fit. As a high-level declarative language, it has the potential to bring many of its benefits to HEP, including transparent query optimization, transparent caching, and a separation of the logical level (in the form of data model and query) from the physical level (in the form of storage formats and performance optimizations).
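As an illustration of how arrays and structured types map onto event data, the following sketch shows the general flavor of such a query. The schema is hypothetical (a table `events` with a `Muon` column of array-of-struct type); it is not the schema used by the queries in this repository.

```sql
-- Sketch only, with a made-up schema: a table "events" with a column
-- "Muon" of type ARRAY(ROW(pt REAL, eta REAL, phi REAL, mass REAL)).
-- Select events with at least two muons and bin all their muons by pT.
SELECT
    floor(m.pt / 10) * 10 AS pt_bin_low,   -- 10 GeV bins
    count(*)              AS n_muons
FROM events
CROSS JOIN UNNEST(Muon) AS m (pt, eta, phi, mass)
WHERE cardinality(Muon) >= 2               -- per-event cut on the array
GROUP BY 1
ORDER BY 1;
```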
- Install Python 3 with pip.
- Install the Python requirements:

  ```bash
  pip3 install -r requirements.txt
  ```
- Install Docker and Docker Compose.
- Clone this repository and bring up the services with Docker Compose.
- Optionally download the Presto CLI client matching the version in the Docker image.
- Set up `scripts/presto.sh`, either based on `scripts/presto.local.sh` by modifying it to point to the Presto CLI executable from the previous step, or with the following command (which uses a Presto CLI in one of the Docker images):

  ```bash
  cp scripts/presto.docker.sh scripts/presto.sh
  ```
The benchmark defines a data set in the ROOT format, which is not supported by Presto. However, the Rumble implementation of the benchmark provides scripts to convert the data to Parquet, which Presto can query in place.
You can run the queries against "external tables" consisting of Parquet files on HDFS. A basic HDFS installation is part of the services brought up by Docker Compose. Read the instructions in that repository for details. The main steps are as follows:
- Copy `Run2012B_SingleMu-restructured-1000.parquet` from this repository to the `data/` directory of the Docker Compose project.
- Upload it to HDFS:

  ```bash
  docker compose exec -it namenode hadoop fs -mkdir /Run2012B_SingleMu-restructured-1000/
  docker compose exec -it namenode hadoop fs -put /data/Run2012B_SingleMu-restructured-1000.parquet /Run2012B_SingleMu-restructured-1000/
  ```
- Create an external table with the provided script:

  ```bash
  scripts/create_table.py \
      --table-name Run2012B_SingleMu_1000 \
      --location hdfs://namenode/Run2012B_SingleMu-restructured-1000/ \
      --variant native \
      --view-name Run2012B_SingleMu_1000_view  # ignored for "native" variant
  ```

  Check out the help of that script in case you want to connect to Presto with non-default parameters.
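Conceptually, registering Parquet files as an external table in Presto's Hive connector corresponds to a `CREATE TABLE` statement with an `external_location`. The sketch below is illustrative only: the catalogue/schema prefix and the column list are assumptions, not the schema actually emitted by `scripts/create_table.py`.

```sql
-- Illustrative only: an external Hive table over the Parquet files.
-- Catalogue/schema names and columns are placeholders; the real schema
-- is defined by scripts/create_table.py.
CREATE TABLE hive.default.Run2012B_SingleMu_1000 (
    run    BIGINT,
    event  BIGINT,
    MET    ROW(pt REAL, phi REAL),
    Muon   ARRAY(ROW(pt REAL, eta REAL, phi REAL, mass REAL, charge INTEGER))
)
WITH (
    external_location = 'hdfs://namenode/Run2012B_SingleMu-restructured-1000/',
    format = 'PARQUET'
);
```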
You can also read the data from an S3 bucket if you run Presto on an EC2 instance with a properly configured instance profile. Assuming the data has been uploaded to S3, use the following command to create an external table based on those files:
```bash
scripts/create_table.py \
    --table-name Run2012B_SingleMu_1000 \
    --location s3a://my_bucket/Run2012B_SingleMu_1000/ \
    --variant native \
    --view-name Run2012B_SingleMu_1000_view  # ignored for "native" variant
```
It is also possible to read Parquet files where all structs are "shredded" into columns (see the Rumble implementation for details). Use a command along the following lines for that purpose:
```bash
scripts/create_table.py \
    --table-name Run2012B_SingleMu_1000 \
    --location s3a://my_bucket/Run2012B_SingleMu_shredded_1000/ \
    --variant shredded \
    --view-name Run2012B_SingleMu_shredded_1000_view
```
The queries should then be run against `Run2012B_SingleMu_shredded_1000_view`, which exposes the data in the same format as the non-shredded "native" Parquet files.
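One way such a view can present shredded columns in the nested format is by reassembling structs and arrays of structs in SQL. The following is only a sketch with made-up column names (e.g. `MET_pt`, `Muon_pt`), not the view generated by the script:

```sql
-- Sketch only: re-nesting shredded columns into the struct/array-of-struct
-- layout of the "native" files. Column names are illustrative; the actual
-- view is created by scripts/create_table.py with --variant shredded.
CREATE VIEW Run2012B_SingleMu_shredded_1000_view AS
SELECT
    run,
    event,
    CAST(ROW(MET_pt, MET_phi) AS ROW(pt REAL, phi REAL))  AS MET,
    CAST(zip(Muon_pt, Muon_eta, Muon_phi)
         AS ARRAY(ROW(pt REAL, eta REAL, phi REAL)))      AS Muon
FROM Run2012B_SingleMu_shredded_1000;
```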
Queries are run through `test_queries.py`. Run the following command to see its options:
```
$ ./test_queries.py --help
usage: test_queries.py [options] [file_or_dir] [file_or_dir] [...]
...
custom options:
  -Q QUERY_ID, --query-id=QUERY_ID
                        Folder name of query to run.
  -F FREEZE_RESULT, --freeze-result=FREEZE_RESULT
                        Whether the results of the query should be persisted to disk.
  -N NUM_EVENTS, --num-events=NUM_EVENTS
                        Number of events taken from the input file. This influences which reference file should be taken.
  -I INPUT_TABLE, --input-table=INPUT_TABLE
                        Name of input table or view.
  -P PRESTO_CMD, --presto-cmd=PRESTO_CMD
                        Path to the script that runs the Presto CLI.
  -S PRESTO_SERVER, --presto-server=PRESTO_SERVER
                        URL as <host>:<port> of the Presto server.
  -C PRESTO_CATALOGUE, --presto-catalogue=PRESTO_CATALOGUE
                        Default catalogue to use in Presto.
  --presto-schema=PRESTO_SCHEMA
                        Default schema to use in Presto.
  --plot-histogram      Plot resulting histogram as PNG file.
```
For example, the following command runs queries `6-1` and `6-2` against the table created above:

```bash
./test_queries.py -vs --num-events 1000 --query-id query-6-1 --query-id query-6-2
```
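Most of the benchmark queries produce a histogram, which in SQL amounts to binning a quantity and counting events per bin. As a rough sketch (not one of the numbered queries, and assuming the table exposes a struct column `MET` with a field `pt`), such a query has the following shape:

```sql
-- Sketch only: a histogram of missing transverse energy with 1 GeV bins.
-- Assumes a struct column MET with a field "pt"; adjust names to the
-- actual schema of the table or view you created.
SELECT
    floor(MET.pt) AS bin_low,
    count(*)      AS n_events
FROM Run2012B_SingleMu_1000
GROUP BY 1
ORDER BY 1;
```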