Redset is a dataset containing three months' worth of query metadata for user queries that ran on a selected sample of instances in the Amazon Redshift fleet. We provide query metadata for 200 provisioned and 200 serverless instances.
As stated in the paper, Redset is not intended to be representative of Redshift as a whole. Instead, Redset provides a biased data sample to support the development of new benchmarks for these specific workloads. For the fleet analysis and the sampling methodology, please refer to the paper.
See CONTRIBUTING for more information.
Redset © 2024 by Amazon is licensed under Creative Commons Attribution-NonCommercial 4.0 International.
Does the paper study Redset? No. The paper studies metadata generated by the entire Amazon Redshift fleet. Redset is a non-representative, biased random sample of this metadata. The dataset is released as a standalone contribution alongside the paper with the express purpose of aiding the development or augmentation of future benchmarks, as well as enabling exploration of ML-based techniques, e.g., for workload forecasting.
Is Redset a representative sample? No. As stated in Section 6 of the paper, the Redset workloads are a biased sample based on a "busy-ness" score (see the paper for the exact definition). This ensures a diverse set of example workloads that can be studied individually. However, aggregating over the workloads in the dataset will not yield a representative view of the overall Amazon Redshift fleet.
Does each Redset cluster equal one Customer? No. A customer/organization often has several clusters of various sizes and purposes. Redset does not disclose whom a cluster belongs to. It is therefore impossible to draw conclusions about the number of customers in Redset and, for the same reason, impossible to make statements like “X% of customers do/have Y”.
Can I use scanned bytes to approximate table sizes? No. There are various factors in the execution engine and Amazon Redshift's overall design that make it impossible to derive, or even approximate, table sizes from scanned bytes. To give two examples: 1) Redshift uses columnar storage and thus accesses only the data that is actually queried; 2) Redshift implements block skipping (and other execution techniques) to drastically limit the amount of data it needs to process.
Folder structure:
- s3://redshift-downloads/redset
  - README
  - LICENSE
  - provisioned/
    - full.parquet
    - sample_0.01.parquet (1% uniform random sample)
    - sample_0.001.parquet (0.1% uniform random sample)
    - parts/
      - One individual `<id>.parquet` file per cluster
  - serverless/
    - full.parquet
    - sample_0.01.parquet (1% uniform random sample)
    - sample_0.001.parquet (0.1% uniform random sample)
    - parts/
      - One individual `<id>.parquet` file per cluster
You can either download files using their HTTP link, e.g., https://s3.amazonaws.com/redshift-downloads/redset/LICENSE, or interact with the S3 bucket using the AWS CLI. For example, to download the full serverless dataset you can run:
aws s3 cp --no-sign-request s3://redshift-downloads/redset/serverless/full.parquet .
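Once downloaded, the parquet files can be inspected with any parquet reader. The snippet below is a minimal, unofficial sketch assuming pandas and pyarrow are installed and that the 0.1% serverless sample was downloaded to the working directory:

```python
# Minimal sketch: inspect a downloaded Redset parquet file with pandas.
# Assumes pandas + pyarrow are installed and that sample_0.001.parquet was
# downloaded, e.g., via:
#   aws s3 cp --no-sign-request s3://redshift-downloads/redset/serverless/sample_0.001.parquet .
import pandas as pd

df = pd.read_parquet("sample_0.001.parquet")
print(df.shape)   # (rows, columns)
print(df.dtypes)  # column names and types, matching the schema below
print(df.head())
```

Alternatively, if s3fs is installed, the files can be read directly from the public bucket with anonymous access, e.g., `pd.read_parquet("s3://redshift-downloads/redset/serverless/sample_0.001.parquet", storage_options={"anon": True})`.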
Column Name | Description |
---|---|
instance_id | Uniquely identifies a Redshift cluster |
cluster_size | Size of the cluster (only available for provisioned) |
user_id | Identifies the user that issued the query |
database_id | Identifies the database that was queried |
query_id | Unique per instance |
arrival_timestamp | Timestamp when the query arrived on the system |
compile_duration_ms | Time the query spent compiling in milliseconds |
queue_duration_ms | Time the query spent queueing in milliseconds |
execution_duration_ms | Time the query spent executing in milliseconds |
feature_fingerprint | Hash value of the query fingerprint. A proxy for query-likeness, though not based on text. Will overestimate repetition. |
was_aborted | Whether the query was aborted during its lifetime |
was_cached | Whether the query was answered from result cache |
cache_source_query_id | If query was answered from result cache, this is the query id for the query which populated the cache |
query_type | Type of query, e.g., select, copy, ... |
num_permanent_tables_accessed | Number of permanent tables (regular database tables) accessed by the query |
num_external_tables_accessed | Number of external tables accessed by the query |
num_system_tables_accessed | Number of system tables accessed by the query |
read_table_ids | Comma separated list of unique permanent table ids read by the query |
write_table_ids | Comma separated list of unique table ids written to by the query |
mbytes_scanned | Total number of megabytes scanned by the query |
mbytes_spilled | Total number of megabytes spilled by the query |
num_joins | Number of joins in the query plan |
num_scans | Number of scans in the query plan |
num_aggregations | Number of aggregations in the query plan |
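As an illustrative (unofficial) sketch of working with this schema, the snippet below computes a few simple aggregates. It assumes a pandas DataFrame loaded from one of the parquet files as shown above, that was_aborted is a boolean/0-1 column, and that read_table_ids is stored as the comma-separated string described in the table:

```python
# Illustrative sketch of simple aggregates over the Redset schema.
# Assumes pandas is installed and a sample parquet file was downloaded as shown above.
import pandas as pd

df = pd.read_parquet("sample_0.001.parquet")

# Query count and abort rate per query_type (abort rate assumes was_aborted is boolean/0-1).
print(df.groupby("query_type")["was_aborted"].agg(["count", "mean"]))

# Runtime breakdown in milliseconds: compile, queue, and execution time.
print(df[["compile_duration_ms", "queue_duration_ms", "execution_duration_ms"]].describe())

# read_table_ids is a comma-separated string of table ids; count tables read per query.
tables_read_per_query = df["read_table_ids"].dropna().str.split(",").str.len()
print(tables_read_per_query.describe())
```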
You may find the paper here.
@Inproceedings{Renen2024,
author = {Alexander van Renen and Dominik Horn and Pascal Pfeil and Kapil Eknath Vaidya and Wenjian Dong and Murali Narayanaswamy and Zhengchun Liu and Gaurav Saxena and Andreas Kipf and Tim Kraska},
title = {Why TPC is not enough: An analysis of the Amazon Redshift fleet},
year = {2024},
url = {https://www.amazon.science/publications/why-tpc-is-not-enough-an-analysis-of-the-amazon-redshift-fleet},
booktitle = {VLDB 2024},
}