Skip to content

Latest commit

 

History

History
89 lines (77 loc) · 6.13 KB

README.md

File metadata and controls

89 lines (77 loc) · 6.13 KB

Redset

Redset is a dataset containing three months worth of user query metadata that ran on a selected sample of instances in the Amazon Redshift fleet. We provide query metadata for 200 provisioned and serverless instances each.

As stated in the paper, Redset is not intended to be representative of Redshift as a whole. Instead, Redset provides biased sample data to support the development of new benchmarks for these specific workloads. For fleet analysis and sampling methodology please take a look at the paper.

Security

See CONTRIBUTING for more information.

License

Redset © 2024 by Amazon is licensed under Creative Commons Attribution-NonCommercial 4.0 International.

FAQ

Does the Paper study Redset? No. The paper studies metadata generated by the entire Amazon Redshift fleet. Redset is a non-representative biased random sample of this metadata. The dataset is released as a standalone contribution from the paper with the express purpose of aiding the development or augmentation of future benchmarks, as well as enabling exploration of ML based techniques, e.g., for workload forecasting.

Is Redset a representative sample? No. As stated in section 6 in the paper, the Redset workloads is a biased sample based on a “busy-ness“ score (see the paper for the exact definition). This ensures a diverse set of example workloads that can be studied individually. However, aggregating over the workloads in the dataset will not yield a representative view of the overall Amazon Redshift fleet.

Does each Redset cluster equal one Customer? No. A customer/organization often has several clusters of various sizes and purposes. Redset does not disclose whom a cluster belongs to. It is therefore impossible to draw conclusions about the number of customers in Redset and, for the same reason, impossible to make statements like “X% of customers do/have Y”.

Can I use scanned bytes to approximate table sizes? No. There a various factors in the execution engine and Amazon Redshift’s overall design that simply make it impossible to derive or even approximate table sizes from scanned bytes. To give two examples: 1) Redshift uses a columnar storage and thus accesses only data that is actually queried 2) Redshift implements block skipping (and other execution techniques) to tremendously limit the amount of data it needs to process.

Download

Folder structure:

  • s3://redshift-downloads/redset
    • README
    • LICENSE
    • provisioned/
      • full.parquet
      • sample_0.01.parquet (1% uniform random data sample)
      • sample_0.001.parquet (0.1% uniform random data sample)
      • parts/
        • One individual <id>.parquet file per cluster
    • serverless/
      • full.parquet
      • sample_0.01.parquet (1% uniform random sample)
      • sample_0.001.parquet (0.1% uniform random data sample)
      • parts/
        • One individual <id>.parquet file per cluster

You can either download files using their http link, e.g., https://s3.amazonaws.com/redshift-downloads/redset/LICENSE Or interact with the s3 bucket using the AWS CLI. For example, to download the full serverless dataset you can run:

aws s3 cp --no-sign-request s3://redshift-downloads/redset/serverless/full.parquet .

Schema

Column Name Description
instance_id Uniquely identifies a redshift cluster
cluster_size Size of the cluster (only available for provisioned)
user_id Identifies the user that issued the query
database_id Identifies the database that was queried
query_id Unique per instance
arrival_timestamp Timestamp when the query arrived on the system
compile_duration_ms Time the query spent compiling in milliseconds
queue_duration_ms Time the query spent queueing in milliseconds
execution_duration_ms Time the query spent executing in milliseconds
feature_fingerprint Hash value of the query fingerprint. A proxy for query-likeness, though not based on text. Will overestimate repetition.
was_aborted Whether the query was aborted during its lifetime
was_cached Whether the query was answered from result cache
cache_source_query_id If query was answered from result cache, this is the query id for the query which populated the cache
query_type Type of query, e.g.., select, copy, ...
num_permanent_tables_accessed Number of permanent table accesses by the query (regular database table)
num_external_tables_accessed Number of external tables accessed by the query
num_system_tables_accessed Number of system tables accessed by the query
read_table_ids Comma separated list of unique permanent table ids read by the query
write_table_ids Comma separated list of unique table ids written to by the query
mbytes_scanned Total number of megabytes scanned by the query
mbytes_spilled Total number of megabytes spilled by the query
num_joins Number of joins in the query plan
num_scans Number of scans in the query plan
num_aggregations Number of aggregations in the query plan

Citation

You may find the paper here.

@Inproceedings{Renen2024,
  author = {Alexander van Renen and Dominik Horn and Pascal Pfeil and Kapil Eknath Vaidya and Wenjian Dong and Murali Narayanaswamy and Zhengchun Liu and Gaurav Saxena and Andreas Kipf and Tim Kraska},
  title = {Why TPC is not enough: An analysis of the Amazon Redshift fleet},
  year = {2024},
  url = {https://www.amazon.science/publications/why-tpc-is-not-enough-an-analysis-of-the-amazon-redshift-fleet},
  booktitle = {VLDB 2024},
}