Scripts and AWS results for perf section of super command doc (#5506)
Showing 17 changed files with 1,562 additions and 840 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
# Query Performance From `super` Command Doc | ||
|
||
These scripts were used to generate the results in the | ||
[Performance](https://zed.brimdata.io/docs/next/commands/super#performance) | ||
section of the [`super` command doc](https://zed.brimdata.io/docs/next/commands/super). | ||
The scripts have been made available to allow for easy reproduction of the | ||
results under different conditions and/or as tested systems evolve. | ||
|
||
# Environments | ||
|
||
The scripts were written to be easily run in two different environments. | ||
|
||
## AWS | ||
|
||
Because it's an environment that's available to everyone, the scripts were developed | ||
primarily for use on a "scratch" EC2 instance in [AWS](https://aws.amazon.com/). | ||
Specifically, we chose the [`m6idn.2xlarge`](https://aws.amazon.com/ec2/instance-types/m6i/) | ||
instance that has the following specifications: | ||
|
||
* 8x vCPU | ||
* 32 GB of RAM | ||
* 474 GB NVMe instance SSD | ||
|
||
The instance SSD in particular was seen as important to ensure consistent I/O | ||
performance. | ||
|
||
Assuming a freshly-created `m6idn.2xlarge` instance running Ubuntu 24.04, to | ||
start the run: | ||
|
||
``` | ||
curl -s https://raw.githubusercontent.com/brimdata/super/main/scripts/super-cmd-perf/benchmark.sh | bash -xv 2>&1 | tee runlog.txt | ||
``` | ||
|
||
The run proceeds in three phases: | ||
|
||
1. **(AWS only)** The instance SSD is formatted and the required tools and data platforms are downloaded and installed | ||
2. Test data is downloaded and loaded into needed storage formats | ||
3. Queries are executed on all data platforms | ||
|
||
As the benchmarks may take a long time to run, the use of [`screen`](https://www.gnu.org/software/screen/) | ||
or a similar "detachable" terminal tool is recommended in case your remote | ||
network connection drops during a run. | ||
|
||
## macOS/other | ||
|
||
Whereas on [AWS](#aws) the scripts assume they're in a "scratch" environment | ||
where they may format the instance SSD for optimal storage and install required | ||
software, on other systems such as macOS they assume the required data | ||
platforms are already installed and skip ahead to | ||
downloading/loading test data and then running queries. | ||
|
||
For instance, on macOS the needed software can first be installed via: | ||
|
||
``` | ||
brew install hyperfine datafusion duckdb clickhouse go | ||
go install github.com/brimdata/super/cmd/super@main | ||
``` | ||
|
||
Then clone the [super repo](https://github.com/brimdata/super.git) and run the | ||
benchmarks. | ||
|
||
``` | ||
git clone https://github.com/brimdata/super.git | ||
cd super/scripts/super-cmd-perf | ||
./benchmark.sh | ||
``` | ||
|
||
All test data will remain in this directory. | ||
|
||
# Results | ||
|
||
Results from the run will accumulate in a subdirectory named for the date/time | ||
when the run started, e.g., `2024-11-19_01:10:30/`. In this directory, summary | ||
reports will be created in files ending in `.md` and `.csv` extensions, and | ||
details from each individual step in generating the results will be in files | ||
ending in `.out`. If run on AWS using the [`curl` command line shown above](#aws), | ||
a `runlog.txt` file holding the full console output of the entire run will | ||
also be present. | ||
|
||
An archive of results from our most recent run of the benchmarks on November | ||
26, 2024 can be downloaded [here](https://super-cmd-perf.s3.us-east-2.amazonaws.com/2024-11-26_03-17-25.tgz). | ||
|
||
# Debugging | ||
|
||
The scripts are configured to exit immediately if failures occur during the | ||
run. If you encounter a failure, look in the results directory for the `.out` | ||
file mentioned last in the console output as this will contain any detailed | ||
error message from the operation that experienced the failure. | ||
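
If the console output has already scrolled away, the failing step's file can often be found by modification time instead. The following is a minimal sketch, not part of the benchmark scripts: `latest_out_file` is a hypothetical helper, and it assumes the failing step was the last one to write its `.out` file.

```shell
#!/bin/bash
# Hypothetical helper (not part of the benchmark scripts): print the most
# recently modified .out file in a results directory.
latest_out_file() {
  local rundir="$1"
  # "ls -t" sorts by modification time, newest first; errors are suppressed
  # so that a directory without any .out files simply yields no output.
  ls -t "$rundir"/*.out 2>/dev/null | head -n 1
}
```

For example, `tail -n 20 "$(latest_out_file 2024-11-19_01:10:30)"` would show the end of the newest `.out` file in that run directory.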
|
||
A problem that we encountered when developing the scripts, and that you may | ||
also encounter, is DuckDB running out of memory. Specifically, this happened when | ||
we tried to run the scripts on an Intel-based MacBook with only 16 GB of | ||
RAM, and this is part of why we used an AWS instance with 32 GB of RAM as the | ||
reference platform. On the MacBook, we found we could work around the memory | ||
problem by telling DuckDB it could use more memory than its default | ||
[80% heuristic for `memory_limit`](https://duckdb.org/docs/configuration/overview.html). | ||
The scripts support an environment variable to make it easy to increase this | ||
value, e.g., we found the scripts ran successfully at 16 GB: | ||
|
||
``` | ||
$ DUCKDB_MEMORY_LIMIT="16GB" ./benchmark.sh | ||
``` | ||
|
||
Of course, this ultimately caused swapping on our MacBook and a significant | ||
hit to performance, but it at least allowed the scripts to run without | ||
failure. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
#!/bin/bash -xv | ||
set -euo pipefail | ||
export RUNNING_ON_AWS_EC2="" | ||
|
||
# If we can detect we're running on an AWS EC2 m6idn.2xlarge instance, we'll | ||
# treat it as a scratch host, installing all needed software and using the | ||
# local SSD for best I/O performance. | ||
if command -v dmidecode && [ "$(sudo dmidecode --string system-uuid | cut -c1-3)" == "ec2" ] && [ "$(TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600") && curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type)" == "m6idn.2xlarge" ]; then | ||
|
||
export RUNNING_ON_AWS_EC2=true | ||
|
||
sudo apt-get -y update | ||
sudo apt-get -y upgrade | ||
sudo apt-get -y install make gcc unzip hyperfine | ||
|
||
# Prepare local SSD for best I/O performance | ||
sudo fdisk -l /dev/nvme1n1 | ||
sudo mkfs.ext4 -E discard -F /dev/nvme1n1 | ||
sudo mount /dev/nvme1n1 /mnt | ||
sudo chown ubuntu:ubuntu /mnt | ||
sudo chmod 777 /mnt | ||
echo 'export TMPDIR="/mnt/tmpdir"' >> "$HOME"/.profile | ||
mkdir /mnt/tmpdir | ||
|
||
# Install ClickHouse | ||
if ! command -v clickhouse-client > /dev/null 2>&1; then | ||
sudo apt-get install -y apt-transport-https ca-certificates curl gnupg | ||
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg | ||
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee \ | ||
/etc/apt/sources.list.d/clickhouse.list | ||
sudo apt-get update | ||
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y clickhouse-client | ||
fi | ||
|
||
# Install DuckDB | ||
if ! command -v duckdb > /dev/null 2>&1; then | ||
curl -L -O https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zip | ||
unzip duckdb_cli-linux-amd64.zip | ||
sudo mv duckdb /usr/local/bin | ||
fi | ||
|
||
# Install Rust | ||
curl -L -O https://static.rust-lang.org/dist/rust-1.82.0-x86_64-unknown-linux-gnu.tar.xz | ||
tar xf rust-1.82.0-x86_64-unknown-linux-gnu.tar.xz | ||
sudo rust-1.82.0-x86_64-unknown-linux-gnu/install.sh | ||
# shellcheck disable=SC2016 | ||
echo 'export PATH="$PATH:$HOME/.cargo/bin"' >> "$HOME"/.profile | ||
|
||
# Install DataFusion CLI | ||
if ! command -v datafusion-cli > /dev/null 2>&1; then | ||
cargo install datafusion-cli | ||
fi | ||
|
||
# Install Go | ||
if ! command -v go > /dev/null 2>&1; then | ||
curl -L -O https://go.dev/dl/go1.23.3.linux-amd64.tar.gz | ||
sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.23.3.linux-amd64.tar.gz | ||
# shellcheck disable=SC2016 | ||
echo 'export PATH="$PATH:/usr/local/go/bin:$HOME/go/bin"' >> "$HOME"/.profile | ||
source "$HOME"/.profile | ||
fi | ||
|
||
# Install SuperDB | ||
if ! command -v super > /dev/null 2>&1; then | ||
git clone https://github.com/brimdata/super.git | ||
cd super | ||
make install | ||
fi | ||
|
||
cd scripts/super-cmd-perf | ||
|
||
fi | ||
|
||
rundir="$(date +%F_%T)" | ||
mkdir "$rundir" | ||
report="$rundir/report_$rundir.md" | ||
|
||
echo -e "|**Software**|**Version**|\n|-|-|" | tee -a "$report" | ||
for software in super duckdb datafusion-cli clickhouse | ||
do | ||
  if ! command -v "$software" > /dev/null; then | ||
    echo "error: \"$software\" not found in PATH" | ||
    exit 1 | ||
  fi | ||
  echo "|$software|$($software --version)|" | tee -a "$report" | ||
done | ||
echo >> "$report" | ||
|
||
# Prepare the test data | ||
./prep-data.sh "$rundir" | ||
|
||
# Run the queries and generate the summary report | ||
./run-queries.sh "$rundir" | ||
|
||
if [ -n "$RUNNING_ON_AWS_EC2" ]; then | ||
mv "$HOME/runlog.txt" "$rundir" | ||
fi |
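
The run directory above is named with `date +%F_%T`, which yields colons in the time portion (e.g., `2024-11-19_01:10:30`), while the archive linked from the README uses hyphens (`2024-11-26_03-17-25`). Colons in file names can be awkward for some tools; GNU `tar`, for instance, treats a colon in an archive name as a remote host separator unless `--force-local` is given. The following is a sketch of a colon-free alternative, where `rundir_name` is a hypothetical function that is not part of the committed script:

```shell
#!/bin/bash
# Hypothetical variant (not in benchmark.sh): same date/time information as
# "date +%F_%T", but with hyphens instead of colons in the time portion.
rundir_name() {
  date +%F_%H-%M-%S
}
```

It could be used as `rundir="$(rundir_name)"` in place of the `date +%F_%T` call.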
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
#!/bin/bash -xv | ||
set -euo pipefail | ||
pushd "$(cd "$(dirname "$0")" && pwd)" | ||
|
||
if [ "$#" -ne 1 ]; then | ||
  echo "Specify results directory string" | ||
  exit 1 | ||
fi | ||
rundir="$(pwd)/$1" | ||
mkdir -p "$rundir" | ||
|
||
RUNNING_ON_AWS_EC2="${RUNNING_ON_AWS_EC2:-}" | ||
if [ -n "$RUNNING_ON_AWS_EC2" ]; then | ||
  cd /mnt | ||
fi | ||
|
||
function run_cmd { | ||
  outputfile="$1" | ||
  shift | ||
  { hyperfine \ | ||
      --show-output \ | ||
      --warmup 0 \ | ||
      --runs 1 \ | ||
      --time-unit second \ | ||
      "$@" ; | ||
  } \ | ||
    > "$outputfile" \ | ||
    2>&1 | ||
} | ||
|
||
mkdir gharchive_gz | ||
cd gharchive_gz | ||
for num in $(seq 0 23) | ||
do | ||
  curl -L -O "https://data.gharchive.org/2023-02-08-${num}.json.gz" | ||
done | ||
cd .. | ||
|
||
DUCKDB_MEMORY_LIMIT="${DUCKDB_MEMORY_LIMIT:-}" | ||
if [ -n "$DUCKDB_MEMORY_LIMIT" ]; then | ||
  increase_duckdb_memory_limit='SET memory_limit = '\'"${DUCKDB_MEMORY_LIMIT}"\''; ' | ||
else | ||
  increase_duckdb_memory_limit="" | ||
fi | ||
|
||
run_cmd \ | ||
"$rundir/duckdb-table-create.out" \ | ||
"duckdb gha.db -c \"${increase_duckdb_memory_limit}CREATE TABLE gha AS FROM read_json('gharchive_gz/*.json.gz', union_by_name=true)\"" | ||
|
||
run_cmd \ | ||
"$rundir/duckdb-parquet-create.out" \ | ||
"duckdb gha.db -c \"${increase_duckdb_memory_limit}COPY (from gha) TO 'gha.parquet'\"" | ||
|
||
run_cmd \ | ||
"$rundir/super-bsup-create.out" \ | ||
"super -o gha.bsup gharchive_gz/*.json.gz" | ||
|
||
du -h gha.db gha.parquet gha.bsup gharchive_gz |
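
The nested quoting used above to build the `SET memory_limit` prefix is hard to read. As a sketch only, the same string could be produced with `printf`; `duckdb_memory_prefix` is a hypothetical function name that does not appear in the committed script.

```shell
#!/bin/bash
# Hypothetical helper (not in prep-data.sh): print a "SET memory_limit"
# statement for the given limit, or nothing if the limit is empty.
duckdb_memory_prefix() {
  local limit="${1:-}"
  if [ -n "$limit" ]; then
    # The trailing space separates the prefix from the statement that follows.
    printf "SET memory_limit = '%s'; " "$limit"
  fi
}
```

The result can be placed directly in front of a statement passed to `duckdb -c`, e.g., `"$(duckdb_memory_prefix "${DUCKDB_MEMORY_LIMIT:-}")COPY (from gha) TO 'gha.parquet'"`. Command substitution strips trailing newlines only, so the trailing space is preserved.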
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
SELECT count(),type | ||
FROM '__SOURCE__' | ||
WHERE repo.name='duckdb/duckdb' | ||
GROUP BY type |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
SELECT count() | ||
FROM '__SOURCE__' | ||
WHERE actor.login='johnbieren' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
SELECT count() | ||
FROM '__SOURCE__' | ||
WHERE grep('in case you have any feedback 😊') |
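
The query files above use `__SOURCE__` as a placeholder for the data source. The script that fills it in is not rendered in this diff, so the following is only an assumption about how such a substitution could be done; `render_query` is a hypothetical function, not the committed implementation.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not taken from run-queries.sh): print a
# query file with every __SOURCE__ placeholder replaced by the given source.
render_query() {
  local query_file="$1" source="$2"
  # '|' is used as the sed delimiter so that paths containing '/' need no
  # escaping. A source containing '|', '&' or '\' would need extra handling.
  sed "s|__SOURCE__|${source}|g" "$query_file"
}
```

For example, `render_query count.sql gha.parquet` (with `count.sql` standing in for any of the query files) would print the query with `FROM 'gha.parquet'`.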