
Update dev guide for benchmarks #13988

Merged · 4 commits · Oct 26, 2022

Changes from all commits
2 changes: 1 addition & 1 deletion pom.xml

@@ -656,7 +656,7 @@
       <dependency>
         <groupId>io.trino.benchto</groupId>
         <artifactId>benchto-driver</artifactId>
-        <version>0.19</version>
+        <version>0.20</version>
       </dependency>

       <dependency>
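This version bump pairs with the README changes below: the rewritten "Running benchto-driver" section launches the driver from the 0.20 `-exec` jar in the local Maven repository.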
42 changes: 25 additions & 17 deletions testing/trino-benchto-benchmarks/README.md

@@ -1,14 +1,14 @@
-# Presto Benchto benchmarks
+# Trino Benchto benchmarks

 The Benchto benchmarks utilize [Benchto](https://github.com/trinodb/benchto) benchmarking
-utility to do macro benchmarking of Presto. As opposed to micro benchmarking which exercises
-a class or a small, coherent set of classes, macro benchmarks done with Benchto use Presto
-end-to-end, by accessing it through its API (usually with `presto-jdbc`), executing queries,
+utility to do macro benchmarking of Trino. As opposed to micro benchmarking which exercises
+a class or a small, coherent set of classes, macro benchmarks done with Benchto use Trino
+end-to-end, by accessing it through its API (usually with `trino-jdbc`), executing queries,
 measuring time and gathering various metrics.

 ## Benchmarking suites

-Even though benchmarks exercise Presto end-to-end, a single benchmark cannot use all Presto
+Even though benchmarks exercise Trino end-to-end, a single benchmark cannot use all Trino
 features. Therefore benchmarks are organized in suites, like:

 * *tpch* - queries closely following the [TPC-H](http://www.tpc.org/tpch/) benchmark
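To make the macro/micro distinction concrete, the kind of end-to-end loop Benchto automates looks roughly like the sketch below. This is illustrative only: it uses the `trino` PyPI client rather than the `trino-jdbc` driver the README mentions, and the host, user, and query are placeholders, not values from this PR.

```python
# Illustrative sketch only -- Benchto automates this loop, plus warmup runs,
# metric collection, and reporting to benchto-service.
# Assumes `pip install trino` (the Trino Python client); the Benchto driver
# itself connects via trino-jdbc. Host/port/user/query are placeholders.
import time

import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="benchmark")
cur = conn.cursor()

durations = []
for run in range(3):  # Benchto's `runs` setting plays this role
    start = time.monotonic()
    cur.execute("SELECT count(*) FROM tpch.sf1.lineitem")
    cur.fetchall()  # drain results so the whole query is measured end-to-end
    durations.append(time.monotonic() - start)

print("runs:", ["%.2fs" % d for d in durations])
```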
@@ -18,7 +18,7 @@ features. Therefore benchmarks are organized in suites, like:

 ### Requirements

-* Presto already installed on the target environment
+* Trino already installed on the target environment
 * Basic understanding of Benchto [components and architecture](https://github.com/trinodb/benchto)
 * Benchto service [configured and running](https://github.com/trinodb/benchto/tree/master/benchto-service)
 * An environment [defined in Benchto service](https://github.com/trinodb/benchto/tree/master/benchto-service#creating-environment)
@@ -27,29 +27,33 @@ features. Therefore benchmarks are organized in suites, like:

 Benchto driver needs to know two things: what benchmark is to be run and what environment
 it is to be run on. For the purpose of the following example, we will use `tpch` benchmark
-and Presto server running at `localhost:8080`, with Benchto service running at `localhost:8081`.
+and Trino server running at `localhost:8080`, with Benchto service running at `localhost:8081`.

 Benchto driver uses Spring Boot to locate the environment configuration file and read its
-settings. To continue with our example, one needs to place an `application-presto-devenv.yaml`
+settings. To continue with our example, one needs to place an `application.yaml`
 file in the current directory (i.e. the directory from which the benchmark will be invoked),
 with the following content:

 ```yaml
+benchmarks: src/main/resources/benchmarks
+sql: src/main/resources/sql
+query-results-dir: target/results
+
 benchmark-service:
   url: http://localhost:8081

+data-sources:
+  trino:
+    url: jdbc:trino://localhost:8080
+    username: na
+    password: na
+    driver-class-name: io.trino.jdbc.TrinoDriver
+
 environment:
   name: TRINO-DEV

-presto:
-  url: http://localhost:8080
-  username: na
-
 benchmark:
   feature:
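A note on the keys above, inferred from the Benchto documentation linked in the requirements: `data-sources.trino` defines the JDBC connection the driver opens against the Trino coordinator, while `environment.name` must match an environment already created in benchto-service, since benchmark results are recorded under that environment.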
@@ -63,10 +67,10 @@ macros:

 ### Bootstrapping benchmark data

-* Make sure you have configured [Presto TPC-H connector](https://trino.io/docs/current/connector/tpch.html).
+* Make sure you have configured [Trino TPC-H connector](https://trino.io/docs/current/connector/tpch.html).
 * Bootstrap benchmark data:
 ```bash
-python presto-benchto-benchmarks/generate_schemas/generate-tpch.py | presto-cli-[version]-executable.jar --server [presto_coordinator-url]:[port]
+testing/trino-benchto-benchmarks/generate_schemas/generate-tpch.py --factors sf1 --formats orc | trino-cli-[version]-executable.jar --server [trino_coordinator-url]:[port]
 ```
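For reference, with `--factors sf1 --formats orc` the rewritten generator emits DDL of the following shape (read directly off the script shown later in this diff); the Trino CLI then executes it statement by statement:

```sql
CREATE SCHEMA IF NOT EXISTS hive.tpch_sf1_orc;
CREATE TABLE IF NOT EXISTS "hive"."tpch_sf1_orc"."customer" WITH (format = 'orc') AS SELECT * FROM tpch.sf1."customer";
CREATE TABLE IF NOT EXISTS "hive"."tpch_sf1_orc"."lineitem" WITH (format = 'orc') AS SELECT * FROM tpch.sf1."lineitem";
-- ...and likewise for the remaining TPC-H tables
```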

### Configuring overrides file

@@ -77,17 +81,21 @@ runs or different underlying schemas. Create a simple `overrides.yaml` file:

 ```yaml
 runs: 10
-tpch_medium: tpcds_10gb_txt
+tpch_300: tpch_sf1_orc
+scale_300: 1
+tpch_1000: tpch_sf1_orc
+scale_1000: 1
+tpch_3000: tpch_sf1_orc
+scale_3000: 1
+prefix: ""
 ```
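As the file name suggests, these keys override variables referenced by the benchmark descriptors: `runs` controls how many times each query is executed, and pairs like `tpch_300: tpch_sf1_orc` / `scale_300: 1` remap the nominal 300/1000/3000 scale-factor schemas onto the small `tpch_sf1_orc` schema bootstrapped above, which suits a smoke-test run. (This reading is inferred from the names; see the Benchto docs for the authoritative variable semantics.)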

### Running benchto-driver

 With the scene set up as in the previous section, the benchmark can be run with:
 ```bash
-./mvnw clean package -pl :trino-benchto-benchmarks
-java -Xmx1g -jar trino-benchto-benchmarks/target/trino-benchto-benchmarks-*-executable.jar \
-    --sql trino-benchto-benchmarks/src/main/resources/sql \
-    --benchmarks trino-benchto-benchmarks/src/main/resources/benchmarks \
-    --activeBenchmarks=presto/tpch --profile=presto-devenv \
-    --overrides overrides.yaml
+java -jar "$HOME/.m2/repository/io/trino/benchto/benchto-driver/0.20/benchto-driver-0.20-exec.jar" \
+    --activeBenchmarks=trino/tpch \
+    --overrides "overrides.yaml"
 ```
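The new command assumes the driver jar already sits in the local Maven repository (any build that resolves the `benchto-driver` dependency puts it there). If it is missing, an invocation along these lines should fetch it; this is a suggestion, not part of the PR, and assumes network access to the artifact repository:

```bash
./mvnw dependency:get -Dartifact=io.trino.benchto:benchto-driver:0.20:jar:exec
```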
134 changes: 83 additions & 51 deletions testing/trino-benchto-benchmarks/generate_schemas/generate-tpcds.py

Review thread on this file:

> **Member:** did you use the tool to generate schemas manually as a test?
>
> **@MiguelWeezardo** (Oct 25, 2022): Thanks, running this unveiled a trivial bug. It took a while to set up Trino with a fully functional Hive connector, but I finally found a product test environment with both hive and tpcds.

@@ -1,53 +1,85 @@
 #!/usr/bin/env python

-schemas = [
-    # (new_schema, source_schema)
-    ('tpcds_sf10_orc', 'tpcds.sf10'),
-    ('tpcds_sf30_orc', 'tpcds.sf30'),
-    ('tpcds_sf100_orc', 'tpcds.sf100'),
-    ('tpcds_sf300_orc', 'tpcds.sf300'),
-    ('tpcds_sf1000_orc', 'tpcds.sf1000'),
-    ('tpcds_sf3000_orc', 'tpcds.sf3000'),
-    ('tpcds_sf10000_orc', 'tpcds.sf10000'),
-]
-
-tables = [
-    'call_center',
-    'catalog_page',
-    'catalog_returns',
-    'catalog_sales',
-    'customer',
-    'customer_address',
-    'customer_demographics',
-    'date_dim',
-    'household_demographics',
-    'income_band',
-    'inventory',
-    'item',
-    'promotion',
-    'reason',
-    'ship_mode',
-    'store',
-    'store_returns',
-    'store_sales',
-    'time_dim',
-    'warehouse',
-    'web_page',
-    'web_returns',
-    'web_sales',
-    'web_site',
-]
-
-for (new_schema, source_schema) in schemas:
-
-    if new_schema.endswith('_orc'):
-        format = 'ORC'
-    elif new_schema.endswith('_text'):
-        format = 'TEXTFILE'
-    else:
-        raise ValueError(new_schema)
-
-    print('CREATE SCHEMA hive.{};'.format(new_schema,))
-    for table in tables:
-        print('CREATE TABLE "hive"."{}"."{}" WITH (format = \'{}\') AS SELECT * FROM {}."{}";'.format(
-            new_schema, table, format, source_schema, table))
+import argparse
+
+
+def generate(factors, formats, tables):
+    for format in formats:
+        for factor in factors:
+            new_schema = "tpcds_" + factor + "_" + format
+            source_schema = "tpcds." + factor
+            print(
+                "CREATE SCHEMA IF NOT EXISTS hive.{};".format(
+                    new_schema,
+                )
+            )
+            for table in tables:
+                print(
+                    'CREATE TABLE IF NOT EXISTS "hive"."{}"."{}" WITH (format = \'{}\') AS SELECT * FROM {}."{}";'.format(
+                        new_schema, table, format, source_schema, table
+                    )
+                )
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Generate test data.")
+    parser.add_argument(
+        "--factors",
+        type=csvtype(
+            ["tiny", "sf1", "sf10", "sf30", "sf100", "sf300", "sf1000", "sf3000", "sf10000"]
+        ),
+        default=["sf10", "sf30", "sf100", "sf300", "sf1000", "sf3000", "sf10000"],
+    )
+    parser.add_argument("--formats", type=csvtype(["orc", "text"]), default=["orc"])
+    default_tables = [
+        "call_center",
+        "catalog_page",
+        "catalog_returns",
+        "catalog_sales",
+        "customer",
+        "customer_address",
+        "customer_demographics",
+        "date_dim",
+        "household_demographics",
+        "income_band",
+        "inventory",
+        "item",
+        "promotion",
+        "reason",
+        "ship_mode",
+        "store",
+        "store_returns",
+        "store_sales",
+        "time_dim",
+        "warehouse",
+        "web_page",
+        "web_returns",
+        "web_sales",
+        "web_site",
+    ]
+    parser.add_argument(
+        "--tables", type=csvtype(default_tables), default=default_tables
+    )
+    args = parser.parse_args()
+    generate(args.factors, args.formats, args.tables)
+
+
+def csvtype(choices):
+    """Return a function that splits and checks comma-separated values."""
+
+    def splitarg(arg):
+        values = arg.split(",")
+        for value in values:
+            if value not in choices:
+                raise argparse.ArgumentTypeError(
+                    "invalid choice: {!r} (choose from {})".format(
+                        value, ", ".join(map(repr, choices))
+                    )
+                )
+        return values
+
+    return splitarg
+
+
+if __name__ == "__main__":
+    main()
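The TPC-DS variant is invoked the same way as the TPC-H one shown in the README; for example, to print the DDL for a small smoke-test schema (pipe it to the Trino CLI as in the README's bootstrap step):

```bash
python testing/trino-benchto-benchmarks/generate_schemas/generate-tpcds.py --factors sf1 --formats orc
```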
103 changes: 68 additions & 35 deletions testing/trino-benchto-benchmarks/generate_schemas/generate-tpch.py

@@ -1,37 +1,70 @@
 #!/usr/bin/env python

-schemas = [
-    # (new_schema, source_schema)
-    ('tpch_sf300_orc', 'tpch.sf300'),
-    ('tpch_sf1000_orc', 'tpch.sf1000'),
-    ('tpch_sf3000_orc', 'tpch.sf3000'),
-
-    ('tpch_sf300_text', 'hive.tpch_sf300_orc'),
-    ('tpch_sf1000_text', 'hive.tpch_sf1000_orc'),
-    ('tpch_sf3000_text', 'hive.tpch_sf3000_orc'),
-]
-
-tables = [
-    'customer',
-    'lineitem',
-    'nation',
-    'orders',
-    'part',
-    'partsupp',
-    'region',
-    'supplier',
-]
-
-for (new_schema, source_schema) in schemas:
-
-    if new_schema.endswith('_orc'):
-        format = 'ORC'
-    elif new_schema.endswith('_text'):
-        format = 'TEXTFILE'
-    else:
-        raise ValueError(new_schema)
-
-    print('CREATE SCHEMA hive.{};'.format(new_schema,))
-    for table in tables:
-        print('CREATE TABLE "hive"."{}"."{}" WITH (format = \'{}\') AS SELECT * FROM {}."{}";'.format(
-            new_schema, table, format, source_schema, table))
+import argparse
+
+
+def generate(factors, formats, tables):
+    for format in formats:
+        for factor in factors:
+            new_schema = "tpch_" + factor + "_" + format
+            source_schema = "tpch." + factor
+            print(
+                "CREATE SCHEMA IF NOT EXISTS hive.{};".format(
+                    new_schema,
+                )
+            )
+            for table in tables:
+                print(
+                    'CREATE TABLE IF NOT EXISTS "hive"."{}"."{}" WITH (format = \'{}\') AS SELECT * FROM {}."{}";'.format(
+                        new_schema, table, format, source_schema, table
+                    )
+                )
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Generate test data.")
+    parser.add_argument(
+        "--factors",
+        type=csvtype(["tiny", "sf1", "sf100", "sf300", "sf1000", "sf3000"]),
+        default=["sf300", "sf1000", "sf3000"],
+    )
+    parser.add_argument(
+        "--formats", type=csvtype(["orc", "text"]), default=["orc", "text"]
+    )
+    default_tables = [
+        "customer",
+        "lineitem",
+        "nation",
+        "orders",
+        "part",
+        "partsupp",
+        "region",
+        "supplier",
+    ]
+    parser.add_argument(
+        "--tables", type=csvtype(default_tables), default=default_tables
+    )
+
+    args = parser.parse_args()
+    generate(args.factors, args.formats, args.tables)
+
+
+def csvtype(choices):
+    """Return a function that splits and checks comma-separated values."""
+
+    def splitarg(arg):
+        values = arg.split(",")
+        for value in values:
+            if value not in choices:
+                raise argparse.ArgumentTypeError(
+                    "invalid choice: {!r} (choose from {})".format(
+                        value, ", ".join(map(repr, choices))
+                    )
+                )
+        return values
+
+    return splitarg
+
+
+if __name__ == "__main__":
+    main()
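Both scripts share the `csvtype` helper, which validates a single comma-separated argument against a whitelist; this keeps invocations like `--factors sf300,sf1000` compact instead of repeating the flag per value. A quick illustrative check of its contract, assuming `csvtype` from the script above is in scope:

```python
# Hypothetical sanity check; `csvtype` as defined in the script above.
parse_formats = csvtype(["orc", "text"])
assert parse_formats("orc,text") == ["orc", "text"]
# parse_formats("parquet") would raise argparse.ArgumentTypeError
```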