Inline sub-modules.
ingomueller-net committed Apr 27, 2022
1 parent f716eb7 commit d563eba
Showing 361 changed files with 13,539 additions and 32 deletions.
24 changes: 0 additions & 24 deletions .gitmodules
@@ -1,24 +0,0 @@
[submodule "presto/docker-presto"]
path = presto/docker-presto
url = ../docker-presto.git
[submodule "presto/queries"]
path = presto/queries
url = ../iris-hep-benchmark-presto.git
[submodule "rumble/queries"]
path = rumble/queries
url = ../iris-hep-benchmark-rumble.git
[submodule "athena/queries"]
path = athena/queries
url = ../iris-hep-benchmark-athena.git
[submodule "bigquery/queries"]
path = bigquery/queries
url = ../iris-hep-benchmark-bigquery.git
[submodule "rdataframes/queries"]
path = rdataframes/queries
url = git@github.com:masonproffitt/opendata-benchmarks.git
[submodule "asterixdb-sqlpp/queries"]
path = asterixdb-sqlpp/queries
url = ../iris-hep-benchmark-sqlpp.git
[submodule "postgresql/queries"]
path = postgresql/queries
url = ../iris-hep-benchmark-postgresql.git
1 change: 0 additions & 1 deletion asterixdb-sqlpp/queries
Submodule queries deleted from 8800f8
1 change: 1 addition & 0 deletions asterixdb-sqlpp/queries/.gitignore
@@ -0,0 +1 @@
__pycache__/
88 changes: 88 additions & 0 deletions asterixdb-sqlpp/queries/README.md
@@ -0,0 +1,88 @@
# High-energy Physics Analysis Queries using SQL++ (AsterixDB)

This repository contains implementations of High-energy Physics (HEP) analysis queries from [the IRIS HEP benchmark](https://github.com/iris-hep/adl-benchmarks-index) written in [SQL++](https://asterixdb.apache.org/docs/0.9.6/sqlpp/manual.html) to be run on [AsterixDB](https://asterixdb.apache.org/).

## Motivation

The purpose of this repository is to study the suitability of SQL++ for HEP analyses and to serve as a use case for improving database technologies. Since SQL++ was designed to deal with semi-structured data such as JSON documents, which are often nested as deeply as HEP data or more so, it seems like a promising candidate for this benchmark.

## Prerequisites and Setup

1. Install Python 3 with pip.
1. Install the Python requirements:
   ```bash
   pip3 install -r requirements.txt
   ```
1. Install [Docker](https://docs.docker.com/get-docker/) and [docker-compose](https://docs.docker.com/compose/install/).
1. Clone the [docker-asterixdb repository](https://github.com/ingomueller-net/docker-asterixdb) and bring up its services with `docker-compose`.

## Data

The benchmark defines a data set in the ROOT format, which is not supported by AsterixDB. However, the [Rumble implementation](https://github.com/RumbleDB/hep-iris-benchmark-jsoniq) of the benchmark provides scripts to convert the data to Parquet, which AsterixDB can load or query in-place.

### HDFS

You can run the queries against "external tables" consisting of files on HDFS. A basic HDFS installation is part of the services brought up by `docker-compose`. Read the instructions of that repository for details. The main steps are as follows:

1. Copy [`Run2012B_SingleMu-restructured-1000.parquet`](/data/Run2012B_SingleMu-restructured-1000.parquet) from this repository to the `data/` directory of the Docker compose project.
1. Upload it to HDFS:
   ```bash
   docker exec -it docker-asterixdb_namenode_1 hadoop fs -mkdir /Run2012B_SingleMu-restructured-1000/
   docker exec -it docker-asterixdb_namenode_1 hadoop fs -put /data/Run2012B_SingleMu-restructured-1000.parquet /Run2012B_SingleMu-restructured-1000/
   ```
1. Create an external table with the provided [script](/scripts/create_table.py):
   ```bash
   scripts/create_table.py \
       --asterixdb-server localhost:19002 \
       --external-server hdfs://namenode:8020 \
       --external-path "/Run2012B_SingleMu-restructured-1000/*.parquet" \
       --dataset-name Run2012B_SingleMu_1000_typed_external_parquet \
       --datatype eventType \
       --file-format parquet \
       --storage-location external \
       --log-level INFO
   ```
Other configurations of that script are discussed below.

#### External vs Internal Tables

Instead of querying files on HDFS, you can also load the data into the internal storage of AsterixDB. To do so, use `--storage-location internal` (and adapt the name of the dataset).

#### Parquet vs JSON Files

You can also query (or load) data in the JSON format (converted with [`scripts/parquet2json.py`](/scripts/parquet2json.py)). To do so, use `--file-format json` (and adapt the name of the dataset and the files).

#### Typed vs Untyped Dataset

You can create the tables either without or with a schema, i.e., either with an empty open type or with a closed type that lists all possible attributes. Use `--datatype anyType` or `--datatype eventType`, respectively (and adapt the name of the dataset).
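
The three choices above (typed or untyped, external or internal, Parquet or JSON) are independent, which gives eight possible configurations. The following is a minimal sketch that enumerates them together with matching `create_table.py` flags. The naming pattern `<base>_<typed|untyped>_<external|internal>_<parquet|json>` is inferred from the single dataset name shown above; the label `untyped` and the helper names `dataset_name` and `configurations` are assumptions made for illustration and are not part of this repository.

```python
from itertools import product

# Assumption: "typed" maps to eventType and "untyped" to anyType; the
# label "untyped" is a guess, only "typed" appears in this README.
DATATYPES = {"typed": "eventType", "untyped": "anyType"}
STORAGE_LOCATIONS = ("external", "internal")
FILE_FORMATS = ("parquet", "json")


def dataset_name(base, typing, storage, file_format):
    """Build a name such as Run2012B_SingleMu_1000_typed_external_parquet."""
    return "_".join([base, typing, storage, file_format])


def configurations(base="Run2012B_SingleMu_1000"):
    """Yield (dataset name, create_table.py flags) for every combination."""
    for typing, storage, file_format in product(
            DATATYPES, STORAGE_LOCATIONS, FILE_FORMATS):
        name = dataset_name(base, typing, storage, file_format)
        flags = [
            "--dataset-name", name,
            "--datatype", DATATYPES[typing],
            "--file-format", file_format,
            "--storage-location", storage,
        ]
        yield name, flags


if __name__ == "__main__":
    for name, flags in configurations():
        print(" ".join(flags))
```

The remaining flags (`--asterixdb-server`, `--external-server`, `--external-path`) depend on the deployment and on the input files and are therefore left out of the sketch.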

## Running Queries

Queries are run through [`test_queries.py`](/test_queries.py). Run the following command to see its options:

```
$ ./test_queries.py --help
usage: test_queries.py [options] [file_or_dir] [file_or_dir] [...]
...
custom options:
  -Q QUERY_ID, --query-id=QUERY_ID
                        Folder name of query to run.
  -F FREEZE_RESULT, --freeze-result=FREEZE_RESULT
                        Whether the results of the query should be persisted to disk.
  -N NUM_EVENTS, --num-events=NUM_EVENTS
                        Number of events taken from the input file. This influences which reference file should be taken.
  -I INPUT_TABLE, --input-table=INPUT_TABLE
                        Name of input table or view.
  -S ASTERIXDB_SERVER, --asterixdb-server=ASTERIXDB_SERVER
                        URL as <host>:<port> of the AsterixDB REST interface.
  -C ASTERIXDB_DATAVERSE, --asterixdb-dataverse=ASTERIXDB_DATAVERSE
                        Default dataverse to use.
  --plot-histogram      Plot resulting histogram as PNG file.
```

For example, the following command runs queries `6-1` and `6-2` against the table created above:

```bash
./test_queries.py -vs --num-events 1000 \
    --input-table Run2012B_SingleMu_1000_typed_external_parquet \
    --query-id query-6-1 --query-id query-6-2
```
37 changes: 37 additions & 0 deletions asterixdb-sqlpp/queries/conftest.py
@@ -0,0 +1,37 @@
import glob
from os.path import dirname, join
from socket import getfqdn

def pytest_addoption(parser):
    parser.addoption('-Q', '--query-id', action='append', default=[],
                     help='Folder name of query to run.')
    parser.addoption('-F', '--freeze-result', action='store', default=False,
                     help='Whether the results of the query should be '
                          'persisted to disk.')
    parser.addoption('-N', '--num-events', action='store', default=1000,
                     help='Number of events taken from the input file. '
                          'This influences which reference file should be '
                          'taken.')
    parser.addoption('-I', '--input-table', action='store',
                     help='Name of input table or view.')
    parser.addoption('-S', '--asterixdb-server', action='store',
                     default=getfqdn() + ':19002',
                     help='URL as <host>:<port> of the AsterixDB REST '
                          'interface.')
    parser.addoption('-C', '--asterixdb-dataverse', action='store',
                     help='Default dataverse to use.')
    parser.addoption('--plot-histogram', action='store_true', default=False,
                     help='Plot resulting histogram as PNG file.')


def find_queries():
    basedir = join(dirname(__file__), 'queries')
    queryfiles = glob.glob(join(basedir, '**/query.sqlpp'), recursive=True)
    # Lexicographically sort the queries based on their TLD name
    return sorted([s[len(basedir)+1:-len('/query.sqlpp')] for s in queryfiles])


def pytest_generate_tests(metafunc):
    if 'query_id' in metafunc.fixturenames:
        queries = metafunc.config.getoption('query_id') or find_queries()
        metafunc.parametrize('query_id', queries)
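
The query discovery in `find_queries` can be tried out in isolation. The sketch below repeats its globbing and slicing logic with the base directory as a parameter and runs it against a throwaway directory tree; the `demo` helper and the directory names in it are made up for illustration.

```python
import glob
import tempfile
from os import makedirs
from os.path import join


def find_queries(basedir):
    """Return the sorted IDs of all directories that contain a query.sqlpp."""
    queryfiles = glob.glob(join(basedir, '**/query.sqlpp'), recursive=True)
    # Strip the leading "<basedir>/" and the trailing "/query.sqlpp" so that
    # only the path of the query directory relative to basedir remains.
    return sorted(s[len(basedir) + 1:-len('/query.sqlpp')] for s in queryfiles)


def demo():
    """Create a temporary directory tree and list the query IDs found in it."""
    with tempfile.TemporaryDirectory() as basedir:
        for query_id in ('query-6-2', 'query-1', 'query-6-1'):
            makedirs(join(basedir, query_id))
            open(join(basedir, query_id, 'query.sqlpp'), 'w').close()
        # A directory without a query.sqlpp file is not reported.
        makedirs(join(basedir, 'not-a-query'))
        return find_queries(basedir)


if __name__ == '__main__':
    print(demo())  # ['query-1', 'query-6-1', 'query-6-2']
```

Because the sort is lexicographic, a hypothetical `query-10` would be listed before `query-2`. When `--query-id` is given on the command line, `pytest_generate_tests` uses those IDs instead and skips the discovery.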
Binary file not shown.