This utility scans a BigQuery SQL in search for several possible anti-patterns. Anti-patterns are specific SQL syntaxes that in some cases might cause performance impact.
We recommend using this tool to scan the top 10% slot consuming jobs of your workload. Addressing these anti-patterns in most cases will provide performance significant benefits.
If you are migrating from another platform (e.g. Teradata, Snowflake, Redshift) into BigQuery, you can use optimizations available in the BigQuery SQL Translator following these instructions.
The following dialects are supported by the SQL translator:
- Amazon Redshift SQL
- Apache HiveQL and Beeline CLI
- IBM Netezza SQL and NZPLSQL
- Teradata and Teradata Vantage
- Apache Spark SQL
- Azure Synapse T-SQL
- Greenplum SQL
- IBM DB2 SQL
- MySQL SQL
- Oracle SQL, PL/SQL, Exadata
- PostgreSQL SQL
- Trino or PrestoSQL
- Snowflake SQL
- SQL Server T-SQL
- SQLite
- Vertica SQL
- Any other listed here
To run the tool use the cloud shell terminal. It has all the pre-requisites.
Build utility
# in cloud shell terminal
gcloud auth application-default login
git clone https://github.com/GoogleCloudPlatform/bigquery-antipattern-recognition.git
cd bigquery-antipattern-recognition
mvn clean package jib:dockerBuild -DskipTests
Run tool for simple inline query
# in cloud shell terminal
docker run \
-i bigquery-antipattern-recognition \
--query "SELECT * FROM \`project.dataset.table1\`"
In the BigQuery console, run the DDL bellow to create th output table
-- in BQ console
CREATE OR REPLACE TABLE <my-project>.<my-dataset>.antipattern_output_table (
job_id STRING,
user_email STRING,
query STRING,
recommendation ARRAY<STRUCT<name STRING, description STRING>>,
slot_hours FLOAT64,
optimized_sql STRING,
process_timestamp TIMESTAMP
);
To read from INFORMATION_SCHEMA and write to the output table, run the following in the command line:
# in cloud shell terminal
gcloud auth login
docker run \
-v ~/.config:/root/.config \
-i bigquery-antipattern-recognition \
--read_from_info_schema \
--info_schema_project <project-name-with-info-schema-to-be-read> \
--info_schema_region us \
--read_from_info_schema_days 1 \
--processing_project_id <my-processing-project> \
--output_table "<my-project>.<my-dataset>.antipattern_output_table" \
--info_schema_top_n_percentage_of_jobs 0.1
Read output in BigQuery Console
-- in BQ console
SELECT
job_id, user_email, query,
recommendation, slot_hours
FROM
`<my-project>.<my-dataset>.antipattern_output_table`
ORDER BY
process_timestamp DESC
LIMIT 10000;
For easier deployment and management, Docker is the recommended approach for running the tool. However, if Docker is unavailable, running with a JAR file is also an option.
To run the tool use the cloud shell terminal. It has all the pre-requisites.
# in cloud shell terminal
gcloud auth application-default login
git clone https://github.com/GoogleCloudPlatform/bigquery-antipattern-recognition.git
cd bigquery-antipattern-recognition
wget https://github.com/GoogleCloudPlatform/bigquery-antipattern-recognition/releases/download/v1.0.0/bigquery-antipattern-recognition.jar
Run tool for simple inline query
# in cloud shell terminal
java -jar bigquery-antipattern-recognition.jar \
--query "SELECT * FROM \`project.dataset.table1\`"
This tool has an AI based feature that will output the rewritten query.
This functionality is activated using the --rewrite_sql
flag.
(This feature is available in all outputs: logs, files, BigQuery table).
Prerequisite: enable Vertex AI API
Input
docker run \
-v ~/.config:/root/.config \
-i bigquery-antipattern-recognition \
--query "select col1 from table1 where col2 like '%abc%' and col3=1" \
--rewrite_sql \
--processing_project_id <my-project>
Output
--------------------------------------------------
Recommendations for query: query provided by cli:
* WhereOrder: LIKE filter in line 1 precedes a more selective filter.
* Optimized query:
select col1 from table1 where col3=1 and col2 like '%abc%'
--------------------------------------------------
Deploying the anti-pattern recognition tool as a remote function UDF allows you to easily call the Antipattern tool within SQL.
For example:
SELECT fns.get_antipatterns("SELECT * from dataset.table ORDER BY 1")
The function returns a JSON string for each query representing the antipatterns found in each query, if any. For example the function would return the following response for the query above:
{
"antipatterns": [
{
"name": "SimpleSelectStar",
"result": "SELECT * on table: dataset.table. Check that all columns are needed."
},
{
"name": "OrderByWithoutLimit",
"result": "ORDER BY clause without LIMIT at line 1."
}
]
}
The remote function is built using Cloud Build, Artifact Registry, and Cloud Run. It can be deployed using terraform or a bash scipt. See more details in the README.
Deploying the anti-pattern recognition tool to a Cloud Run job provides an easy mechanism to periodically scan INFORMATION_SCHEMA for occurrences of anti-patterns in top slot consuming queries.
Refer to this step-by-step on how to deploy this tool using Cloud Run.
See the examples page for information on extracting SQL from Dataform and DBT
--query="SELECT ... FROM ..."
-
To parse SQL string provided via CLI.
--read_from_info_schema
-
To read input queries from INFORMATION_SCHEMA.JOBS.
--read_from_info_schema_days n
-
Specifies how many days of INFORMATION_SCHEMA to read
Must be set along with `--read_from_info_schema`.
Defaults to 1.
--info_schema_project <project-name>
-
Name of the project for which information schema will be read.
This is the project with the queries that you want to optimize.
--info_schema_region us
-
Region from which to read information schema
--read_from_info_schema_start_time "start-timestamp"
--read_from_info_schema_end_time "end-timestamp"
-
Alternative to `read_from_info_schema_days` option,
to specify start and end date or timestamp of INFORMATION_SCHEMA to read.
Must be set along with `--read_from_info_schema`.
Defaults to `--read_from_info_schema_days` option.
--read_from_info_schema_timeout_in_secs n
-
Specifies timeout, in secs, to query INFORMATION SCHEMA
Must be set along with `--read_from_info_schema`.
Defaults to 60.
--info_schema_table_name "\`<my-project>.region-us\`.INFORMATION_SCHEMA.JOBS" \
-
Specifies what variant of INFORMATION_SCHEMA.JOBS to read from.
--info_schema_top_n_percentage_of_jobs n
-
Number between 0 and 1. Uses to specify what fraction of top slot consuming jobs
the tool should consider, e.g. if equal to 0.1 only top 10% slot consuming jobs
will be checked por anti patterns.
--input_bq_table project_id.dataset_id.table_name
-
Specifies a BigQuery table as input, each row is a SQL string to be parsed.
Columns must be "id (String), query (String)"
--input_file_path /path/to/file.sql
-
Specifies path to file with SQL string to be parsed. Can be local file or GCS file.
--input_folder_path /path/to/folder/with/sql/files
-
Specifies path to folder with SQL files to be parsed. Will parse all .sql in directory.
Can be a local path or a GCS path
--input_csv_file_path /path/to/input/file.csv
-
Specifies a CSV file as input, each row is a SQL string to be parsed.
Columns must be ""id,query"
--output_file_path /path/to/output/file.csv
-
Specifies a CSV file as output, each row is a SQL string to be parsed.
Columns are "id,recommendation"
--output_table "my-project.dataset.antipattern_output_table"
-
Specifies table to which write results to. Assumes that the table already exits.
--processing_project_id <my-processing-project>
-
Specifies what project provides the compute used to read from INFORMATION_SCHEMA
and/or to write to output table (i.e. project where BQ jobs will execute)
Needed if the input is INFORMATION_SCHEMA or if the output is a BQ table.
Needed if using sql rewrite.
--rewrite_sql
-
If used a rewritted SQL will be provided. The rewrite will be performed using an LLM.
This is an experimental feature. Requires processing_project_id to be specified.
--service_account_keyfile_path
-
Path to service account json keyfile.
Example:
SELECT
*
FROM
`project.dataset.table1`
Output:
All columns on table: project.dataset.table1 are being selected. Please be sure that all columns are needed
Example:
SELECT
t1.col1
FROM
`project.dataset.table1` t1
WHERE
t1.col2 not in (select col2 from `project.dataset.table2`);
Output:
You are using an IN filter with a subquery without a DISTINCT on the following columns: project.dataset.table1.col2
Example:
WITH
a AS (
SELECT col1,col2 FROM test WHERE col1='abc'
),
b AS (
SELECT col2 FROM a
),
c AS (
SELECT col1 FROM a
)
SELECT
b.col2,
c.col1
FROM
b,c;
Output:
CTE with multiple references: alias a defined at line 2 is referenced 2 times
Example:
SELECT
t.dim1,
t.dim2,
t.metric1
FROM
`dataset.table` t
ORDER BY
t.metric1 DESC;
Output:
ORDER BY clause without LIMIT at line 8.
Example:
SELECT
dim1
FROM
`dataset.table`
WHERE
REGEXP_CONTAINS(dim1, ‘.*test.*’)
Output:
REGEXP_CONTAINS at line 6. Prefer LIKE when the full power of regex is not needed (e.g. wildcard matching).";
Example:
SELECT
taxi_id, trip_seconds, fare
FROM
(
SELECT
taxi_id, trip_seconds, fare,
row_number() over(partition by taxi_id order by fare desc) rn
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
)
WHERE
rn = 1;
Output:
LatestRecordWithAnalyticFun: Seems like you might be using analytical function row_number in line 7 to filter the latest record in line 12.
Example:
SELECT
*
FROM
comments c
JOIN
users u ON c.user_id = u.id
WHERE
u.id IN (
SELECT id
FROM users
WHERE location LIKE '%New York'
GROUP BY id
ORDER BY SUM(up_votes) DESC
LIMIT 10
)
;
Output:
Dynamic Predicate: Using subquery in filter at line 10. Converting this dynamic predicate to static might provide better performance.
Example:
SELECT
repo_name,
id,
ref
FROM
`bigquery-public-data.github_repos.files`
WHERE
ref like '%master%'
and repo_name = 'cdnjs/cdnjs'
;
Output:
WhereOrder: LIKE filter in line 8 precedes a more selective filter.
As a best practice the table with the largest number of rows should be placed first in a JOIN.
This anti-pattern checks the join order based on the number of rows of each
table. To do so this tool must fetch table metadata, for which the advanced_analysis
flag must be used.
Details can be found here.
Example:
SELECT
t1.station_id,
COUNT(1) num_trips_started
FROM
`bigquery-public-data.austin_bikeshare.bikeshare_stations` t1
JOIN
`bigquery-public-data.austin_bikeshare.bikeshare_trips` t2 ON t1.station_id = t2.start_station_id
GROUP BY
t1.station_id
;
Output:
JoinOrder: JOIN on tables: [bikeshare_stations, bikeshare_trips] might perform
better if tables where joined in the following order:
[bikeshare_trips, bikeshare_stations]
Drop temporary tables at the end of your script to avoid unnecesary storage billing.
Example:
CREATE TEMP TABLE `my_dataset.temp_table` (id INT64, name STRING);
Output:
TEMP table created without DROP statement: TEMP table my_dataset.temp_table defined at line 1 is created and not dropped.
Convert persistent tables that are dropped at the end of a script to TEMP tables to save on time travel + fail safe storage.
Example:
CREATE TABLE `my_dataset.table` (id INT64, name STRING);
DROP TABLE `my_dataset.table`;
Output:
Persistent table dropped: Table my_dataset.table defined at line 1 is dropped. Consider converting to temporary.
Copyright 2024 Google Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.