This codelab was made in collaboration with Google Genomics. New Standard SQL versions of many of these queries can be found here.
In Part 3 of the codelab, we perform some quality control analyses that could help to identify any problematic variants which should be excluded from further analysis. The appropriate cut off thresholds will depend upon the input dataset and/or other factors.
- Setup
- Ti/Tv by Genomic Window
- Ti/Tv by Alternate Allele Counts
- Ti/Tv by Depth
- Missingness Rate
- Hardy-Weinberg Equilibrium
- Heterozygous Haplotype
- Blacklisted Variants
queryReplacements <- list("_THE_TABLE_"="va_aaa_pilot_data.all_genomes_gvcfs",
"_THE_EXPANDED_TABLE_"="va_aaa_pilot_data.all_genomes_expanded_vcfs_java2",
"_BLACKLISTED_TABLE_"="resources.blacklisted_positions",
"_PATIENT_INFO_"="va_aaa_pilot_data.patient_info")
sampleData <- read.csv("./data/patient_info.csv")
sampleInfo <- select(sampleData, call_call_set_name=Catalog.ID, gender=Gender)
ggplot2 themes to use throughout this page.
plot_theme = theme_minimal(base_size = 14, base_family = "Helvetica") +
theme(axis.line = element_line(colour = "black"),
panel.grid = element_blank())
boxPlotTheme = theme_minimal(base_size=14, base_family = "Helvetica") +
theme(panel.grid = element_blank())
We want to check whether either variants are occuring more frequently than random chance at regions across the genome. One way we can check this is to check whether the ratio of transitions to transversions is within an expected range. In this analysis we group variants together within 100kb windows accross the entire genome and calculate the ti/tv ratio for each region. Variants within windows that have a ti/tv ratio outside the expected range will be flagged.
result <- DisplayAndDispatchQuery("./sql/ti-tv-ratio.sql",
project=project,
replacements=c("#_WHERE_"="WHERE reference_name = 'chr1'",
"_WINDOW_SIZE_"="100000",
queryReplacements))
# Compute the Ti/Tv ratio for variants within genomic region windows.
SELECT
reference_name,
window * 100000 AS window_start,
transitions,
transversions,
transitions/transversions AS titv,
num_variants_in_window,
FROM (
SELECT
reference_name,
window,
SUM(mutation IN ('A->G', 'G->A', 'C->T', 'T->C')) AS transitions,
SUM(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
'A->T', 'T->A', 'C->G', 'G->C')) AS transversions,
COUNT(mutation) AS num_variants_in_window
FROM (
SELECT
reference_name,
reference_bases,
alternate_bases,
INTEGER(FLOOR(start / 100000)) AS window,
CONCAT(reference_bases, CONCAT(STRING('->'), alternate_bases)) AS mutation,
COUNT(alternate_bases) WITHIN RECORD AS num_alts,
FROM
[va_aaa_pilot_data.all_genomes_gvcfs]
# Optionally add clause here to limit the query to a particular
# region of the genome.
WHERE reference_name = 'chr1'
HAVING
# Skip 1/2 genotypes _and non-SNP variants
num_alts = 1
AND reference_bases IN ('A','C','G','T')
AND alternate_bases IN ('A','C','G','T'))
GROUP BY
reference_name,
window)
ORDER BY
window_start
Number of rows returned by this query: 2279.
Displaying the first few rows of the dataframe of results:
reference_name | window_start | transitions | transversions | titv | num_variants_in_window |
---|---|---|---|---|---|
chr1 | 0 | 501 | 335 | 1.50 | 836 |
chr1 | 100000 | 253 | 121 | 2.09 | 374 |
chr1 | 200000 | 137 | 91 | 1.51 | 228 |
chr1 | 300000 | 19 | 18 | 1.06 | 37 |
chr1 | 400000 | 30 | 31 | 0.97 | 61 |
chr1 | 500000 | 481 | 159 | 3.03 | 640 |
Visualizing the results:
ggplot(result, aes(x=window_start, y=titv)) +
geom_point() +
stat_smooth() +
scale_x_continuous(labels=comma) +
xlab("Genomic Position") +
ylab("Ti/Tv") +
ggtitle("Ti/Tv by 100,000 base pair windows on Chromosome 1") +
plot_theme
Check whether the ratio of transitions vs. transversions in SNPs appears to be resonable across the range of rare variants to common variants. This query may help to identify problems with rare or common variants.
result <- DisplayAndDispatchQuery("./sql/ti-tv-by-alternate-allele-count.sql",
project=project,
replacements=c(queryReplacements))
# Compute the Ti/Tv ratio for variants binned by alternate allele count.
SELECT
transitions,
transversions,
transitions/transversions AS titv,
alternate_allele_count
FROM (
SELECT
SUM(mutation IN ('A->G', 'G->A', 'C->T', 'T->C')) AS transitions,
SUM(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
'A->T', 'T->A', 'C->G', 'G->C')) AS transversions,
alternate_allele_count
FROM (
SELECT
alternate_bases,
reference_bases,
CONCAT(reference_bases, CONCAT(STRING('->'), alternate_bases)) AS mutation,
COUNT(alternate_bases) WITHIN RECORD AS num_alts,
SUM(call.genotype = 1) WITHIN RECORD AS alternate_allele_count,
FROM
[va_aaa_pilot_data.all_genomes_gvcfs]
# Optionally add clause here to limit the query to a particular
# region of the genome.
#_WHERE_
HAVING
# Skip 1/2 genotypes _and non-SNP variants
num_alts = 1
AND reference_bases IN ('A','C','G','T')
AND alternate_bases IN ('A','C','G','T'))
GROUP BY
alternate_allele_count)
ORDER BY
alternate_allele_count DESC
Number of rows returned by this query: 961.
Displaying the first few rows of the dataframe of results:
transitions | transversions | titv | alternate_allele_count |
---|---|---|---|
39938 | 19562 | 2.04 | 960 |
19010 | 8344 | 2.28 | 959 |
38832 | 20295 | 1.91 | 958 |
17355 | 8160 | 2.13 | 957 |
11650 | 5755 | 2.02 | 956 |
7414 | 3446 | 2.15 | 955 |
Visualizing the results:
ggplot(result, aes(x=alternate_allele_count, y=titv)) +
geom_point() +
stat_smooth() +
scale_x_continuous() +
xlab("Total Number of Sample Alleles with the Variant") +
ylab("Ti/Tv") +
ggtitle("Ti/Tv by Alternate Allele Count") +
plot_theme
We next want to determine whether mutations at a specific depth seem to be occuring at various coverage depths are occuring more often than would be expected by random chance. We do this by grouping variants from each genome together by coverage depth, e.g. if variant A and variant B both occur in the same sample with 30x coverage they will be grouped for this analysis. Variants at coverage depths that have ti/tv ratios outside the expected range will be flagged.
query <- "./sql/ti-tv-by-depth.sql"
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=c(tableReplacement))
# Calculate ti/tv ratio for all SNPs for each sample grouped by depth of coverage.
SELECT
call.call_set_name,
(transitions/transversions) AS titv_ratio,
average_depth,
FROM (
SELECT
call.call_set_name,
SUM(mutation IN ('A->G', 'G->A', 'C->T', 'T->C')) AS transitions,
SUM(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
'A->T', 'T->A', 'C->G', 'G->C')) AS transversions,
ROUND(AVG(call.DP)) AS average_depth,
FROM (
SELECT
call.call_set_name,
reference_bases,
alternate_bases,
CONCAT(reference_bases, CONCAT(STRING('->'), alternate_bases)) AS mutation,
COUNT(alternate_bases) WITHIN RECORD AS num_alts,
call.DP
FROM (
SELECT
call.call_set_name,
reference_bases,
GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
call.genotype,
call.DP,
FROM
[va_aaa_pilot_data.multi_sample_variants_seq_qc]
# Optionally add clause here to limit the query to a particular
# region of the genome.
#_WHERE_
)
WHERE
call.DP is not null
HAVING
# Skip 1/2 genotypes _and non-SNP variants
num_alts = 1
AND reference_bases IN ('A','C','G','T')
AND alternate_bases IN ('A','C','G','T'))
GROUP BY
call.call_set_name,
call.DP,)
WHERE
transversions > 0
GROUP BY
call.call_set_name,
titv_ratio,
average_depth,
Retrieving data: 3.4s
Retrieving data: 4.9s
Retrieving data: 7.7s
Retrieving data: 9.2s
Retrieving data: 10.9s
Retrieving data: 12.4s
Retrieving data: 14.1s
Retrieving data: 15.6s
Retrieving data: 17.4s
Retrieving data: 19.5s
call_call_set_name | titv_ratio | average_depth |
---|---|---|
LP6005692-DNA_G03 | 2.12 | 29.00 |
LP6005692-DNA_C05 | 2.13 | 36.00 |
LP6005144-DNA_F02 | 2.16 | 24.00 |
LP6005144-DNA_H12 | 2.16 | 34.00 |
LP6005038-DNA_D03 | 2.14 | 34.00 |
LP6005051-DNA_F09 | 1.96 | 45.00 |
ggplot(result, aes(x=average_depth, y=titv_ratio, color=call_call_set_name)) +
geom_point() +
ggtitle("Ti/Tv Ratio By Depth") +
xlab("Coverage Depth") +
ylab("Ti/Tv") +
plot_theme +
theme(legend.position="none")
We next want to check how frequenctly or infrequenctly a variant is called across all samples. If a variant has a very low call rate this may mean it is difficult to sequence and even high confidence calls at that position may be suspect.
sortAndLimit <- "ORDER BY missingness_rate DESC, reference_name, start, reference_bases, alternate_bases LIMIT 1000"
cutoff <- list("_CUTOFF_"="0.9")
result <- DisplayAndDispatchQuery("./sql/variant-level-missingness-fail.sql",
project=project,
replacements=c("#_ORDER_BY_"=sortAndLimit,
queryReplacements,
cutoff))
# Select positions where the missingness rate across all samples is above a defined cutoff.
SELECT
variant_id,
"variant_missingness" AS failure_reason,
missingness_rate,
FROM (
SELECT
variant_id,
reference_name,
start,
end,
reference_bases,
alternate_bases,
called_allele_count,
1 - (called_allele_count)/sample_count AS missingness_rate,
sample_count
FROM (
SELECT
variant_id,
reference_name,
start,
end,
reference_bases,
GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
SUM(call.genotype >= 0) WITHIN RECORD AS called_allele_count,
FROM
[va_aaa_pilot_data.all_genomes_expanded_vcfs_java2]
# Optionally add clause here to limit the query to a particular
# region of the genome.
#_WHERE_
) AS g
CROSS JOIN (
SELECT
COUNT(call.call_set_name) AS sample_count
FROM (
SELECT
call.call_set_name
FROM
[va_aaa_pilot_data.all_genomes_expanded_vcfs_java2]
GROUP BY
call.call_set_name)) AS count )
WHERE
missingness_rate > 0.9
# Optionally add a clause here to sort and limit the results.
ORDER BY missingness_rate DESC, reference_name, start, reference_bases, alternate_bases LIMIT 1000
Number of rows returned by this query: 1000.
Displaying the first few rows of the dataframe of results:
variant_id | failure_reason | missingness_rate |
---|---|---|
CJ_JqPj1p5_yIRIEY2hyMRj9UiCG4PHHiLuQ8O8B | variant_missingness | 1.00 |
CJ_JqPj1p5_yIRIEY2hyMRir8QEg8NHvoIrojceRAQ | variant_missingness | 1.00 |
CJ_JqPj1p5_yIRIEY2hyMRjK8QEgyqWChcvz7JIK | variant_missingness | 1.00 |
CJ_JqPj1p5_yIRIEY2hyMRi-9AEgt66dj4yx7a9D | variant_missingness | 1.00 |
CJ_JqPj1p5_yIRIEY2hyMRi-9AEgudWm_IbA20k | variant_missingness | 1.00 |
CJ_JqPj1p5_yIRIEY2hyMRjalAMgmpr66o-86pW0AQ | variant_missingness | 1.00 |
For each variant, compute the expected versus observed relationship between allele frequencies and genotype frequencies per the Hardy-Weinberg Equilibrium.
sortAndLimit <- "ORDER BY ChiSq DESC, reference_name, start, alternate_bases LIMIT 1000"
result <- DisplayAndDispatchQuery("./sql/hardy-weinberg.sql",
project=project,
replacements=c("#_ORDER_BY_"=sortAndLimit,
queryReplacements))
# The following query computes the Hardy-Weinberg equilibrium for variants.
SELECT
reference_name,
start,
reference_bases,
alternate_bases,
OBS_HOM1,
OBS_HET,
OBS_HOM2,
E_HOM1,
E_HET,
E_HOM2,
# Chi Squared Calculation
# SUM(((Observed - Expected)^2) / Expected )
ROUND((POW(OBS_HOM1 - E_HOM1, 2) / E_HOM1)
+ (POW(OBS_HET - E_HET, 2) / E_HET)
+ (POW(OBS_HOM2 - E_HOM2, 2) / E_HOM2), 6)
AS ChiSq,
# Determine if Chi Sq value is significant
IF((POW(OBS_HOM1 - E_HOM1, 2) / E_HOM1)
+ (POW(OBS_HET - E_HET, 2) / E_HET)
+ (POW(OBS_HOM2 - E_HOM2, 2) / E_HOM2)
> 5.991, "TRUE", "FALSE") AS PVALUE_SIG
FROM (
SELECT
reference_name,
start,
reference_bases,
alternate_bases,
OBS_HOM1,
OBS_HET,
OBS_HOM2,
# Expected AA
# p^2
# ((COUNT(AA) + (COUNT(Aa)/2) /
# SAMPLE_COUNT) ^ 2) * SAMPLE_COUNT
ROUND(POW((OBS_HOM1 + (OBS_HET/2)) /
SAMPLE_COUNT, 2) * SAMPLE_COUNT, 2)
AS E_HOM1,
# Expected Aa
# 2pq
# 2 * (COUNT(AA) + (COUNT(Aa)/2) / SAMPLE_COUNT) *
# (COUNT(aa) + (COUNT(Aa)/2) / SAMPLE_COUNT)
# * SAMPLE_COUNT
ROUND(2 * ((OBS_HOM1 + (OBS_HET/2)) / SAMPLE_COUNT) *
((OBS_HOM2 + (OBS_HET/2)) / SAMPLE_COUNT)
* SAMPLE_COUNT, 2)
AS E_HET,
# Expected aa
# q^2
# (COUNT(aa) + (COUNT(Aa)/2) /
# SAMPLE_COUNT) ^ 2 * SAMPLE_COUNT
ROUND(POW((OBS_HOM2 + (OBS_HET/2)) /
SAMPLE_COUNT, 2) * SAMPLE_COUNT, 2)
AS E_HOM2,
FROM (
SELECT
reference_name,
start,
reference_bases,
alternate_bases,
HOM_REF AS OBS_HOM1,
HET AS OBS_HET,
HOM_ALT AS OBS_HOM2,
HOM_REF + HET + HOM_ALT AS SAMPLE_COUNT,
FROM (
SELECT
reference_name,
start,
END,
reference_bases,
GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
COUNT(alternate_bases) WITHIN RECORD AS num_alts,
SUM(EVERY(0 = call.genotype)) WITHIN call AS HOM_REF,
SUM(EVERY(1 = call.genotype)) WITHIN call AS HOM_ALT,
SUM(SOME(0 = call.genotype)
AND SOME(1 = call.genotype)) WITHIN call AS HET,
FROM
[va_aaa_pilot_data.all_genomes_expanded_vcfs_java2]
# Optionally add a clause here to limit the query to a particular
# region of the genome.
#_WHERE_
HAVING
# Skip 1/2 genotypes
num_alts = 1
)))
# Optionally add a clause here to sort and limit the results.
ORDER BY ChiSq DESC, reference_name, start, alternate_bases LIMIT 1000
Number of rows returned by this query: 1000.
Displaying the first few rows of the dataframe of results:
reference_name | start | reference_bases | alternate_bases | OBS_HOM1 | OBS_HET | OBS_HOM2 | E_HOM1 | E_HET | E_HOM2 | ChiSq | PVALUE_SIG |
---|---|---|---|---|---|---|---|---|---|---|---|
chr10 | 1461139 | C | T | 454 | 0 | 4 | 450.03 | 7.93 | 0.03 | 533.33 | TRUE |
chr11 | 101415618 | T | A | 454 | 0 | 4 | 450.03 | 7.93 | 0.03 | 533.33 | TRUE |
chr11 | 120928708 | T | G | 454 | 0 | 4 | 450.03 | 7.93 | 0.03 | 533.33 | TRUE |
chr13 | 59168960 | A | G | 454 | 0 | 4 | 450.03 | 7.93 | 0.03 | 533.33 | TRUE |
chr15 | 98813562 | A | G | 454 | 0 | 4 | 450.03 | 7.93 | 0.03 | 533.33 | TRUE |
chr2 | 217343776 | G | C | 454 | 0 | 4 | 450.03 | 7.93 | 0.03 | 533.33 | TRUE |
For each variant within the X and Y chromosome, identify heterozygous variants in male genomes. Males should not have any heterozygous calls on the X-chromosome outside of the pseudo-autosomal regions. If any are found they will be flagged.
First we use our sample information to determine which genomes are male.
maleSampleIds <- paste("'", filter(sampleInfo, gender == "Male")$call_call_set_name, "'", sep="", collapse=",")
sortAndLimit <- "ORDER BY reference_name, start, alternate_bases, call.call_set_name LIMIT 1000"
result <- DisplayAndDispatchQuery("./sql/sex-chromosome-heterozygous-haplotypes.sql",
project=project,
replacements=c("_MALE_SAMPLE_IDS_"=maleSampleIds,
"#_ORDER_BY_"=sortAndLimit,
queryReplacements))
# Identify heterozygous variants on the sex chromosomes for reportedly male samples.
SELECT
variant_id,
sample_id,
"heterozygous_haplotype" AS failure_reason,
FROM (
SELECT
variant_id,
reference_name,
start,
end,
reference_bases,
call.call_set_name AS sample_id,
GROUP_CONCAT(STRING(call.genotype)) WITHIN call AS genotype,
FROM(FLATTEN((
[va_aaa_pilot_data.all_genomes_expanded_vcfs_java2]), call.call_set_name))
WHERE
reference_name IN ('chrX', 'chrY')
OMIT
call IF (2 > COUNT(call.genotype))
OR EVERY(call.genotype <= 0)
OR EVERY(call.genotype = 1)
# Pseudoautosomal Region 1
OR (reference_name = 'chrX'
AND start > 60001
AND end < 2699520)
OR (reference_name = 'chrY'
AND start > 10001
AND end < 2649520)
# Pseudoautosomal Region 2
OR (reference_name = 'chrX'
AND start > 155260560
AND end < 155270560)
OR (reference_name = 'chrY'
AND start > 59363566
AND end < 59373566)) AS seq
JOIN (
SELECT
IlluminaID,
SEX
FROM
[va_aaa_pilot_data.patient_info] ) AS info
ON
seq.sample_id = info.IlluminaID
WHERE
SEX = 'M'
# Optionally add a clause here to sort and limit the results.
ORDER BY reference_name, start, alternate_bases, call.call_set_name LIMIT 1000
Error: Field 'alternate_bases' not found on either side of the JOIN
query invalidQuery. Field 'alternate_bases' not found on either side of the JOIN
Number of rows returned by this query: 1000.
Displaying the first few rows of the dataframe of results:
reference_name | start | reference_bases | alternate_bases | OBS_HOM1 | OBS_HET | OBS_HOM2 | E_HOM1 | E_HET | E_HOM2 | ChiSq | PVALUE_SIG |
---|---|---|---|---|---|---|---|---|---|---|---|
chr10 | 1461139 | C | T | 454 | 0 | 4 | 450.03 | 7.93 | 0.03 | 533.33 | TRUE |
chr11 | 101415618 | T | A | 454 | 0 | 4 | 450.03 | 7.93 | 0.03 | 533.33 | TRUE |
chr11 | 120928708 | T | G | 454 | 0 | 4 | 450.03 | 7.93 | 0.03 | 533.33 | TRUE |
chr13 | 59168960 | A | G | 454 | 0 | 4 | 450.03 | 7.93 | 0.03 | 533.33 | TRUE |
chr15 | 98813562 | A | G | 454 | 0 | 4 | 450.03 | 7.93 | 0.03 | 533.33 | TRUE |
chr2 | 217343776 | G | C | 454 | 0 | 4 | 450.03 | 7.93 | 0.03 | 533.33 | TRUE |
Identify all variants within our cohort that have been blacklisted. For more information on what variants are blacklisted and why see here. All variants lccuring within blacklisted regions will be flagged.
result <- DisplayAndDispatchQuery("./sql/blacklisted-variants.sql",
project=project,
replacements=c(queryReplacements))
# Select variants that are found within blacklisted genes.
# https://sites.google.com/site/anshulkundaje/projects/blacklists
SELECT
seq.variant_id AS variant_id,
"blacklisted" AS failure_reason,
bl.Artifact_Type AS Artifact_Type
FROM (
SELECT
variant_id,
reference_name,
start,
end,
FROM
[va_aaa_pilot_data.all_genomes_expanded_vcfs_java2]) as seq
JOIN (
SELECT
reference_name,
start - 1 AS start,
end,
Artifact_Type
FROM
[resources.blacklisted_positions]) AS bl
ON
seq.reference_name = bl.reference_name
WHERE
seq.start >= bl.start AND
seq.end <= bl.end
#_ORDER_BY_
Retrieving data: 3.8s
Retrieving data: 6.3s
Retrieving data: 9.9s
Retrieving data: 12.5s
Retrieving data: 15.0s
Retrieving data: 17.6s
Retrieving data: 19.9s
Retrieving data: 22.8s
Retrieving data: 26.4s
Retrieving data: 29.0s
Retrieving data: 32.2s
Retrieving data: 36.9s
Retrieving data: 39.5s
Retrieving data: 42.9s
Retrieving data: 46.5s
Retrieving data: 49.7s
Retrieving data: 52.0s
Retrieving data: 54.7s
Retrieving data: 57.4s
Retrieving data: 60.2s
Retrieving data: 62.4s
Retrieving data: 65.2s
Retrieving data: 67.7s
Retrieving data: 70.3s
Retrieving data: 72.5s
Retrieving data: 75.4s
Retrieving data: 77.2s
Retrieving data: 79.0s
Retrieving data: 81.8s
Retrieving data: 84.1s
Retrieving data: 86.1s
Retrieving data: 88.3s
Retrieving data: 89.9s
Retrieving data: 92.3s
Retrieving data: 94.6s
Retrieving data: 96.5s
Retrieving data: 98.4s
Retrieving data: 100.4s
Retrieving data: 102.1s
Retrieving data: 104.1s
Retrieving data: 106.2s
Retrieving data: 109.8s
Retrieving data: 111.6s
Retrieving data: 113.5s
Retrieving data: 115.3s
Retrieving data: 117.3s
Retrieving data: 119.1s
Retrieving data: 120.9s
Retrieving data: 123.1s
Number of rows returned by this query: 509065.
Displaying the first few rows of the dataframe of results:
variant_id | failure_reason | Artifact_Type |
---|---|---|
CJ_JqPj1p5_yIRIEY2hyNRiv_4QWIKeCtZ3phP3_hwE | blacklisted | centromeric_repeat |
CJ_JqPj1p5_yIRIEY2hyNRjT_4QWINuMj6To5JHvAw | blacklisted | centromeric_repeat |
CJ_JqPj1p5_yIRIEY2hyNRjbgIUWIKinoLyFt4GxzwE | blacklisted | centromeric_repeat |
CJ_JqPj1p5_yIRIEY2hyNRjhgoUWII2H6sHWt4XFuAE | blacklisted | centromeric_repeat |
CJ_JqPj1p5_yIRIEY2hyNRj7goUWIOG13aLxo5u_Gw | blacklisted | centromeric_repeat |
CJ_JqPj1p5_yIRIEY2hyNRjeg4UWIPqdwOHH34HXbA | blacklisted | centromeric_repeat |