Redset is a dataset containing three months' worth of query metadata for user queries that ran on a selected sample of instances in the Amazon Redshift fleet. We provide query metadata for 200 provisioned and 200 serverless instances.
As stated in the paper, Redset is not intended to be representative of Redshift as a whole. Instead, Redset provides a biased data sample to support the development of new benchmarks for these specific workloads. For the fleet analysis and the sampling methodology, please refer to the paper.
See CONTRIBUTING for more information.
Redset © 2024 by Amazon is licensed under Creative Commons Attribution-NonCommercial 4.0 International.
Does the paper study Redset? No. The paper studies metadata generated by the entire Amazon Redshift fleet. Redset is a non-representative, biased random sample of this metadata. The dataset is released as a standalone contribution alongside the paper with the express purpose of aiding the development or augmentation of future benchmarks, as well as enabling exploration of ML-based techniques, e.g., for workload forecasting.
Is Redset a representative sample? No. As stated in Section 6 of the paper, the Redset workloads are a biased sample based on a "busy-ness" score (see the paper for the exact definition). This ensures a diverse set of example workloads that can be studied individually. However, aggregating over the workloads in the dataset will not yield a representative view of the overall Amazon Redshift fleet.
Does each Redset cluster equal one Customer? No. A customer/organization often has several clusters of various sizes and purposes. Redset does not disclose whom a cluster belongs to. It is therefore impossible to draw conclusions about the number of customers in Redset and, for the same reason, impossible to make statements like “X% of customers do/have Y”.
Can I use scanned bytes to approximate table sizes? No. There are various factors in the execution engine and Amazon Redshift's overall design that make it impossible to derive, or even approximate, table sizes from scanned bytes. To give two examples: 1) Redshift uses columnar storage and thus accesses only the data that is actually queried; 2) Redshift implements block skipping (and other execution techniques) to drastically limit the amount of data it needs to process.
Folder structure:
- s3://redshift-downloads/redset
  - README
  - LICENSE
  - provisioned/
    - full.parquet
    - sample_0.01.parquet (1% uniform random sample)
    - sample_0.001.parquet (0.1% uniform random sample)
    - parts/
      - One individual `<id>.parquet` file per cluster
  - serverless/
    - full.parquet
    - sample_0.01.parquet (1% uniform random sample)
    - sample_0.001.parquet (0.1% uniform random sample)
    - parts/
      - One individual `<id>.parquet` file per cluster
You can either download files using their HTTP link, e.g., https://s3.amazonaws.com/redshift-downloads/redset/LICENSE, or interact with the S3 bucket using the AWS CLI. For example, to download the full serverless dataset you can run:
aws s3 cp --no-sign-request s3://redshift-downloads/redset/serverless/full.parquet .
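Once downloaded, the parquet files can be inspected with any parquet reader. The snippet below is a minimal, unofficial sketch assuming pandas and pyarrow are installed and that the 0.1% serverless sample was downloaded to the working directory:

```python
# Minimal sketch: inspect a downloaded Redset parquet file with pandas.
# Assumes pandas + pyarrow are installed and that sample_0.001.parquet was
# downloaded, e.g., via:
#   aws s3 cp --no-sign-request s3://redshift-downloads/redset/serverless/sample_0.001.parquet .
import pandas as pd

df = pd.read_parquet("sample_0.001.parquet")
print(df.shape)   # (rows, columns)
print(df.dtypes)  # column names and types, matching the schema below
print(df.head())
```

Alternatively, if s3fs is installed, the files can be read directly from the public bucket with anonymous access, e.g., `pd.read_parquet("s3://redshift-downloads/redset/serverless/sample_0.001.parquet", storage_options={"anon": True})`.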
Column Name | Description |
---|---|
instance_id | Uniquely identifies a Redshift cluster |
cluster_size | Size of the cluster (only available for provisioned) |
user_id | Identifies the user that issued the query |
database_id | Identifies the database that was queried |
query_id | Unique per instance |
arrival_timestamp | Timestamp when the query arrived on the system |
compile_duration_ms | Time the query spent compiling in milliseconds |
queue_duration_ms | Time the query spent queueing in milliseconds |
execution_duration_ms | Time the query spent executing in milliseconds |
feature_fingerprint | Hash value of the query fingerprint. A proxy for query-likeness, though not based on text. Will overestimate repetition. |
was_aborted | Whether the query was aborted during its lifetime |
was_cached | Whether the query was answered from result cache |
cache_source_query_id | If query was answered from result cache, this is the query id for the query which populated the cache |
query_type | Type of query, e.g., select, copy, ... |
num_permanent_tables_accessed | Number of permanent tables (regular database tables) accessed by the query |
num_external_tables_accessed | Number of external tables accessed by the query |
num_system_tables_accessed | Number of system tables accessed by the query |
read_table_ids | Comma separated list of unique permanent table ids read by the query |
write_table_ids | Comma separated list of unique table ids written to by the query |
mbytes_scanned | Total number of megabytes scanned by the query |
mbytes_spilled | Total number of megabytes spilled by the query |
num_joins | Number of joins in the query plan |
num_scans | Number of scans in the query plan |
num_aggregations | Number of aggregations in the query plan |
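As an illustrative (unofficial) sketch of working with this schema, the snippet below computes a few simple aggregates. It assumes a pandas DataFrame loaded from one of the parquet files as shown above, that was_aborted is a boolean/0-1 column, and that read_table_ids is stored as the comma-separated string described in the table:

```python
# Illustrative sketch of simple aggregates over the Redset schema.
# Assumes pandas is installed and a sample parquet file was downloaded as shown above.
import pandas as pd

df = pd.read_parquet("sample_0.001.parquet")

# Query count and abort rate per query_type (abort rate assumes was_aborted is boolean/0-1).
print(df.groupby("query_type")["was_aborted"].agg(["count", "mean"]))

# Runtime breakdown in milliseconds: compile, queue, and execution time.
print(df[["compile_duration_ms", "queue_duration_ms", "execution_duration_ms"]].describe())

# read_table_ids is a comma-separated string of table ids; count tables read per query.
tables_read_per_query = df["read_table_ids"].dropna().str.split(",").str.len()
print(tables_read_per_query.describe())
```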
You may find the paper here.
@Inproceedings{Renen2024,
author = {Alexander van Renen and Dominik Horn and Pascal Pfeil and Kapil Eknath Vaidya and Wenjian Dong and Murali Narayanaswamy and Zhengchun Liu and Gaurav Saxena and Andreas Kipf and Tim Kraska},
title = {Why TPC is not enough: An analysis of the Amazon Redshift fleet},
year = {2024},
url = {https://www.amazon.science/publications/why-tpc-is-not-enough-an-analysis-of-the-amazon-redshift-fleet},
booktitle = {VLDB 2024},
}