Skip to content

Latest commit

 

History

History
512 lines (390 loc) · 17.5 KB

README.md

File metadata and controls

512 lines (390 loc) · 17.5 KB

Optimization Scripts

This folder contains scripts that (when executed) create a dataset named, optimization_workshop, with several tables inside the dataset.
These tables are populated with information to help you optimize your BigQuery tables, views, and queries.

Run all the .sql scripts within this folder using the following commands:

gcloud auth login &&
bash run_all_scripts.sh

The anti-pattern-recognittion-tool-scripts subfolder contains additional scripts that need to be executed separetly to run the Anti-pattern Recognition Tool:

bash ./anti-pattern-recognittion-tool-scripts/run_anti_pattern_tool.sh \
--input_table_name="optimization_workshop.viewable_queries_grouped_by_hash" \
--input_table_id_col_name="Query_Hash" \
--input_table_query_text_col_name="Query_Raw_Sample" \
--input_table_slots_col_name="Total_Slot_Hours"

bash ./anti-pattern-recognittion-tool-scripts/run_anti_pattern_tool.sh \
--input_table_name="optimization_workshop.queries_grouped_by_hash_project" \
--input_table_id_col_name="query_hash" \
--input_table_query_text_col_name="top_10_jobs[SAFE_OFFSET(0)].query_text" \
--input_table_slots_col_name="avg_total_slots"

Te above command tales the <dataset>.<tablename> as input. In can be executed on any table with a schema similar to the one generated by viewable_queries_grouped_by_hash.

The scripts are described in more detail in the following sections.


Project Analysis

Project level analysis enables us to understand key metrics such as slot_time, bytes_scanned, bytes_shuffled and bytes_spilled on a daily basis within a project. The metrics are examined as averages, medians and p80s. This enables us to understand at a high level what jobs within a project consume 80% of the time and 50% of the time daily.

🔍 Daily project metrics

Daily project metrics

The daily_project_analysis.sql script creates a table called, daily_project_analysis of daily slot consumption metrics (for a 30day period) for all your projects.

Examples of querying script results

  • Top 100 tables with the highest slot consumption

    SELECT *
    FROM optimization_workshop.daily_project_analysis
    ORDER BY total_slot_ms DESC
    LIMIT 100

Table Analysis

🔍 BigQuery Clustering/Partitioning Recommender Tool

BigQuery Clustering/Partitioning Recommender Tool

The BigQuery partitioning and clustering recommender analyzes workflows on your BigQuery tables and offers recommendations to better optimize your workflows and query costs using either table partitioning or table clustering. The recommender uses your BigQuery's workload execution data from the past 30 days to analyze each BigQuery table for suboptimal partitioning and clustering configurations.

[!IMPORTANT] Before you can view partition and cluster recommendations, you need to enable the Recommender API as shown in the following sections.

Enable using gcloud

# The following script retrieves all distinct projects from the JOBS_BY_ORGANIZATION view
# and then enables the recommender API for each project.
projects=$(
  bq query \
    --nouse_legacy_sql \
    --format=csv \
    "SELECT DISTINCT project_id FROM \`region-us\`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION" \
    | sed 1d
);
for proj in $projects; do
  gcloud services --project="${proj}" enable recommender.googleapis.com &
done

Enable using Terraform

resource "google_project_service" "recommender_service" {
  project = "your-project"
  service = "recommender.googleapis.com"
}

View your partition and cluster recommendations

Once you've enabled the Recommender API, you can view your partition and cluster recommendations in the Cloud Console or via the gcloud command-line tool.

[!NOTE] The most scalable method for viewing your partition and cluster recommendations is to export your recommendations to BigQuery. You can do this by creating a Data Transfer Service (DTS) job to export your recommendations to BigQuery. See Exporting recommendations to BigQuery for more information.

Once you set up the DTS job to export your recommendations to BigQuery, you can run the following query to get the most recent recommendations for partitioning and clustering your tables.

CREATE TEMP FUNCTION extract_table(target_resources ARRAY<STRING>) AS((
  SELECT ARRAY_AGG(
    REGEXP_REPLACE(REGEXP_EXTRACT(target_resource, r'\/projects\/(.*?\/datasets\/.*?\/tables\/.*)'), "(/datasets/|/tables/)", ".")
  ) 
  FROM UNNEST(target_resources) target_resource)[OFFSET(0)]
);
SELECT
  MAX(last_refresh_time) AS latest_recommendation_time,
  recommender_subtype AS recommendation,
  MAX_BY(JSON_VALUE_ARRAY(COALESCE(
    PARSE_JSON(recommendation_details).overview.partitionColumns,
    PARSE_JSON(recommendation_details).overview.clusterColumns)),
  last_refresh_time) AS columns_to_cluster_or_partition,
  extract_table(target_resources) AS table,
  bqutil.fn.table_url(extract_table(target_resources)) AS table_url,
  JSON_VALUE(PARSE_JSON(recommendation_details).overview.partitionTimeUnit) AS partition_time_unit,
-- Replace the table below with your own table that you created when you 
-- set up the DTS job to export your recommendations to BigQuery.
FROM YOUR_PROJECT.YOUR_DATASET.recommendations_export
WHERE recommender = "google.bigquery.table.PartitionClusterRecommender"
GROUP BY recommendation, table, partition_time_unit, table_url
🔍 Tables with query read patterns

Tables with query read patterns

The table_read_patterns.sql script creates a table named, table_read_patterns, that contains usage data to help you determine:

  • Which tables (when queried) are resulting in high slot consumption.
  • Which tables are most frequently queried.

Examples of querying script results

  • Tables grouped by similar filter predicates

    SELECT
      table_id,
      bqutil.fn.table_url(table_id) AS table_url,
      (SELECT STRING_AGG(column ORDER BY COLUMN) FROM UNNEST(predicates)) column_list,
      (SELECT STRING_AGG(operator ORDER BY COLUMN) FROM UNNEST(predicates)) operator_list,
      (SELECT STRING_AGG(value ORDER BY COLUMN) FROM UNNEST(predicates)) value_list,
      SUM(stage_slot_ms) AS total_slot_ms,
      COUNT(DISTINCT DATE(creation_time)) as num_days_queried,
      COUNT(*) AS num_occurrences,
      COUNT(DISTINCT job_id) as job_count,
      ARRAY_AGG(CONCAT(project_id,':us.',job_id) ORDER BY total_slot_ms LIMIT 10) AS job_id_array,
      ARRAY_AGG(bqutil.fn.job_url(project_id || ':us.' || job_id)) AS job_url_array,
    FROM optimization_workshop.table_read_patterns
    GROUP BY 1,2,3,4,5;
  • Top 100 tables with the highest slot consumption

    SELECT
      table_id,
      bqutil.fn.table_url(table_id) AS table_url,
      SUM(stage_slot_ms) AS total_slot_ms,
      COUNT(DISTINCT DATE(creation_time)) as num_days_queried,
      COUNT(*) AS num_occurrences,
      COUNT(DISTINCT job_id) as job_count,
    FROM optimization_workshop.table_read_patterns
    GROUP BY 1,2
    ORDER BY total_slot_ms DESC
    LIMIT 100
  • Top 100 most frequently queried tables

    SELECT
      table_id,
      bqutil.fn.table_url(table_id) AS table_url,
      SUM(stage_slot_ms) AS total_slot_ms,
      COUNT(DISTINCT DATE(creation_time)) as num_days_queried,
      COUNT(*) AS num_occurrences,
      COUNT(DISTINCT job_id) as job_count,
    FROM optimization_workshop.table_read_patterns
    GROUP BY 1,2
    ORDER BY num_occurrences DESC
    LIMIT 100
🔍 Tables without partitioning or clustering

Tables without partitioning or clustering

The tables_without_partitioning_or_clustering.sql script creates a table named, tables_without_part_clust, that contains a list of tables which meet any of the following conditions:

  • not partitioned
  • not clustered
  • neither partitioned nor clustered

Examples of querying script results

  • Top 100 largest tables without partitioning or clustering

    SELECT *
    FROM optimization_workshop.tables_without_part_clust
    ORDER BY logical_gigabytes DESC
    LIMIT 100
🔍 Actively read tables with partitioning and clustering information

Actively read tables with partitioning and clustering information

[!IMPORTANT] The actively_read_tables_with_partitioning_clustering_info.sql script depends on the table_read_patterns table so you must first run the table_read_patterns.sql script.

The actively_read_tables_with_partitioning_clustering_info.sql script creates a table named, actively_read_tables_with_part_clust_info that contains a list of actively read tables along with their partitioning and clustering information.

Examples of querying script results

  • Top 100 largest actively read tables without partitioning or clustering

    SELECT *
    FROM optimization_workshop.actively_read_tables_with_part_clust_info
    WHERE clustering_columns IS NULL OR partitioning_column IS NULL
    ORDER BY logical_gigabytes DESC
    LIMIT 100
🔍 Tables receiving high quantity of daily DML statements

Tables receiving high quantity of daily DML statements

The frequent_daily_table_dml.sql script creates a table named, frequent_daily_table_dml, that contains tables that have had more than 24 daily DML statements run against them in the past 30 days.

Examples of querying script results

  • Top 100 tables with the most DML statements per table in a day

    SELECT
      table_id,
      table_url,
      ANY_VALUE(dml_execution_date HAVING MAX daily_dml_per_table) AS sample_dml_execution_date,
      ANY_VALUE(job_urls[OFFSET(0)] HAVING MAX daily_dml_per_table) AS sample_dml_job_url,
      MAX(daily_dml_per_table) max_daily_table_dml,
    FROM optimization_workshop.frequent_daily_table_dml
    GROUP BY table_id, table_url
    ORDER BY max_daily_table_dml DESC
    LIMIT 100;
🔍 Views with non-optimal JOIN conditions

Views with non-optimal JOIN conditions

The views_with_nonoptimal_join_condition.sql script creates a table named, views_with_nonoptimal_join_condition, that contains views with JOINs where the JOIN condition is potentially non-optimal.

Query Analysis

🔍 Queries grouped by hash

Queries grouped by hash

The queries_grouped_by_hash.sql script creates a table named, queries_grouped_by_hash. This table groups queries by their normalized query pattern, which ignores comments, parameter values, UDFs, and literals in the query text. This allows us to group queries that are logically the same, but have different literals. The queries_grouped_by_hash table does not expose the raw SQL text of the queries.

The viewable_queries_grouped_by_hash.sql script creates a table named, viewable_queries_grouped_by_hash. This table is similar to the queries_grouped_by_hash table, but it exposes the raw SQL text of the queries. The viewable_queries_grouped_by_hash.sql script runs much slower in execution than the queries_grouped_by_hash.sql script because it has to loop over all projects and for each project query the INFORMATION_SCHEMA.JOBS_BY_PROJECT view.

For example, the following queries would be grouped together because the date literal filters are ignored:

SELECT * FROM my_table WHERE date = '2020-01-01';
SELECT * FROM my_table WHERE date = '2020-01-02';
SELECT * FROM my_table WHERE date = '2020-01-03';

Running the run_anti_pattern_tool.sh bash script will build and run the Anti-Pattern Recognition tool and output the results to the viewable_queries_grouped_by_hash table in the recommendation column. The tool will identify syntaxes that are know to frequently cause performance issues.

Examples of querying script results

  • Top 100 queries with the highest bytes processed

    SELECT *
    FROM optimization_workshop.queries_grouped_by_hash
    ORDER BY total_gigabytes_processed DESC
    LIMIT 100
  • Top 100 recurring queries with the highest slot hours consumed

    SELECT *
    FROM optimization_workshop.queries_grouped_by_hash
    ORDER BY total_slot_hours * days_active * job_count DESC 
    LIMIT 100
🔍 Queries grouped by script

Queries grouped by script

The queries_grouped_by_script.sql script creates a table named, queries_grouped_by_script. This table groups queries by their parent job id which means that any queries that were executed as part of a script (multi-statement query) will be grouped together. This table is helpful if you want to see which particular scripts are most inefficient.

Examples of querying script results

  • Top 100 scripts with the highest bytes processed

    SELECT *
    FROM optimization_workshop.queries_grouped_by_script
    ORDER BY total_gigabytes_processed DESC
    LIMIT 100
  • Top 100 scripts with the highest slot hours consumed

    SELECT *
    FROM optimization_workshop.queries_grouped_by_script
    ORDER BY total_slot_hours DESC
    LIMIT 100
  • Top 100 scripts with the highest slot hours consumed that doesn't include INFO_SCHEMA views

    SELECT *
    FROM optimization_workshop.queries_grouped_by_script
    WHERE NOT EXISTS(
      SELECT 1
      FROM UNNEST(referenced_tables) table
      WHERE table LIKE "INFORMATION_SCHEMA%"
    )
    ORDER BY total_slots DESC
    LIMIT 100
🔍 Queries grouped by labels

Queries grouped by labels

The queries_grouped_by_labels.sql script creates a table named, queries_grouped_by_labels. This table groups queries by their labels (after formatting all labels as a STRUCT string) which allows you to analyze different groups of queries based on their labels.

🔍 Queries with performance insights

Queries with performance insights

The query_performance_insights.sql script creates a table named, query_performance_insights retrieves all queries that have had performance insights generated for them in the past 30 days.

Examples of querying script results

  • Top 100 queries with most # of performance insights

    SELECT
      job_url,
      (SELECT COUNT(1)
      FROM UNNEST(performance_insights.stage_performance_standalone_insights) perf_insight
      WHERE perf_insight.slot_contention
      ) AS num_stages_with_slot_contention,
      (SELECT COUNT(1)
      FROM UNNEST(performance_insights.stage_performance_standalone_insights) perf_insight
      WHERE perf_insight.insufficient_shuffle_quota
      ) AS num_stages_with_insufficient_shuffle_quota,
      (SELECT ARRAY_AGG(perf_insight.input_data_change.records_read_diff_percentage IGNORE NULLS)
      FROM UNNEST(performance_insights.stage_performance_change_insights) perf_insight
      ) AS records_read_diff_percentages
    FROM optimization_workshop.query_performance_insights
    ORDER BY (
      num_stages_with_slot_contention 
      + num_stages_with_insufficient_shuffle_quota
      + ARRAY_LENGTH(records_read_diff_percentages)
    ) DESC
    LIMIT 100

Workload Analysis

🔍 Hourly slot consumption by query hash

Hourly slot consumption by query hash

The hourly_slot_consumption_by_query_hash script creates a table named, hourly_slot_consumption_by_query_hash. This table groups queries by their query hash and then shows for every hour timeslice how much of that hour's slots each query hash consumed.

Examples of querying script results

  • Top 10 hours which had the highest slot consumption along with the top 10 query hashes/patterns that consumed the most slots during those hours.

    SELECT *
    FROM optimization_workshop.hourly_slot_consumption_by_query_hash
    ORDER BY period_total_slot_hours DESC
    LIMIT 10
🔍 Hourly slot consumption by labels

Hourly slot consumption by labels

The hourly_slot_consumption_by_labels script creates a table named, hourly_slot_consumption_by_labels. This table groups queries by their labels and then shows for every hour timeslice how much of that hour's slots each grouping of labels consumed.

Examples of querying script results

  • Top 10 hours which had the highest slot consumption and the top 10 query labels that consumed the most slots during those hours.

    SELECT *
    FROM optimization_workshop.hourly_slot_consumption_by_labels
    ORDER BY period_total_slot_hours DESC
    LIMIT 10