This codelab was made in collaboration with Google Genomics. New Standard SQL versions of many of these queries can be found here.

Part 3: Variant-Level QC

In Part 3 of the codelab, we perform some quality control analyses that could help to identify any problematic variants which should be excluded from further analysis. The appropriate cut off thresholds will depend upon the input dataset and/or other factors.

Setup
Ti/Tv by Genomic Window
Ti/Tv by Alternate Allele Counts
Ti/Tv by Depth
Missingness Rate
Hardy-Weinberg Equilibrium
Heterozygous Haplotype
Blacklisted Variants

Setup

queryReplacements <- list("_THE_TABLE_"="va_aaa_pilot_data.all_genomes_gvcfs",
                          "_THE_EXPANDED_TABLE_"="va_aaa_pilot_data.all_genomes_expanded_vcfs_java2",
                          "_BLACKLISTED_TABLE_"="resources.blacklisted_positions",
                          "_PATIENT_INFO_"="va_aaa_pilot_data.patient_info")
sampleData <- read.csv("./data/patient_info.csv")
sampleInfo <- select(sampleData, call_call_set_name=Catalog.ID, gender=Gender)

ggplot2 themes to use throughout this page.

plot_theme = theme_minimal(base_size = 14, base_family = "Helvetica") + 
  theme(axis.line = element_line(colour = "black"),
        panel.grid = element_blank())

boxPlotTheme = theme_minimal(base_size=14, base_family = "Helvetica") +
  theme(panel.grid = element_blank())

Ti/Tv by Genomic Window

We want to check whether either variants are occuring more frequently than random chance at regions across the genome. One way we can check this is to check whether the ratio of transitions to transversions is within an expected range. In this analysis we group variants together within 100kb windows accross the entire genome and calculate the ti/tv ratio for each region. Variants within windows that have a ti/tv ratio outside the expected range will be flagged.

result <- DisplayAndDispatchQuery("./sql/ti-tv-ratio.sql",
                                  project=project,
                                  replacements=c("#_WHERE_"="WHERE reference_name = 'chr1'",
                                                 "_WINDOW_SIZE_"="100000",
                                                 queryReplacements))

# Compute the Ti/Tv ratio for variants within genomic region windows.
SELECT
  reference_name,
  window * 100000 AS window_start,
  transitions,
  transversions,
  transitions/transversions AS titv,
  num_variants_in_window,
FROM (
  SELECT
    reference_name,
    window,
    SUM(mutation IN ('A->G', 'G->A', 'C->T', 'T->C')) AS transitions,
    SUM(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
                     'A->T', 'T->A', 'C->G', 'G->C')) AS transversions,
    COUNT(mutation) AS num_variants_in_window
  FROM (
    SELECT
      reference_name,
      reference_bases,
      alternate_bases,
      INTEGER(FLOOR(start / 100000)) AS window,
      CONCAT(reference_bases, CONCAT(STRING('->'), alternate_bases)) AS mutation,
      COUNT(alternate_bases) WITHIN RECORD AS num_alts,
    FROM
      [va_aaa_pilot_data.all_genomes_gvcfs]
    # Optionally add clause here to limit the query to a particular
    # region of the genome.
    WHERE reference_name = 'chr1'
    HAVING
      # Skip 1/2 genotypes _and non-SNP variants
      num_alts = 1
      AND reference_bases IN ('A','C','G','T')
      AND alternate_bases IN ('A','C','G','T'))
  GROUP BY
    reference_name,
    window)
ORDER BY
  window_start

Number of rows returned by this query: 2279.

Displaying the first few rows of the dataframe of results:

reference_name	window_start	transitions	transversions	titv	num_variants_in_window
chr1	0	501	335	1.50	836
chr1	100000	253	121	2.09	374
chr1	200000	137	91	1.51	228
chr1	300000	19	18	1.06	37
chr1	400000	30	31	0.97	61
chr1	500000	481	159	3.03	640

Visualizing the results:

ggplot(result, aes(x=window_start, y=titv)) +
  geom_point() +
  stat_smooth() +
  scale_x_continuous(labels=comma) +
  xlab("Genomic Position") +
  ylab("Ti/Tv") +
  ggtitle("Ti/Tv by 100,000 base pair windows on Chromosome 1") +
  plot_theme

Ti/Tv by Alternate Allele Counts

Check whether the ratio of transitions vs. transversions in SNPs appears to be resonable across the range of rare variants to common variants. This query may help to identify problems with rare or common variants.

result <- DisplayAndDispatchQuery("./sql/ti-tv-by-alternate-allele-count.sql",
                                  project=project,
                                  replacements=c(queryReplacements))

# Compute the Ti/Tv ratio for variants binned by alternate allele count.
SELECT
  transitions,
  transversions,
  transitions/transversions AS titv,
  alternate_allele_count
FROM (
  SELECT
    SUM(mutation IN ('A->G', 'G->A', 'C->T', 'T->C')) AS transitions,
    SUM(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
                     'A->T', 'T->A', 'C->G', 'G->C')) AS transversions,
    alternate_allele_count
  FROM (
    SELECT
      alternate_bases,
      reference_bases,
      CONCAT(reference_bases, CONCAT(STRING('->'), alternate_bases)) AS mutation,
      COUNT(alternate_bases) WITHIN RECORD AS num_alts,
      SUM(call.genotype = 1) WITHIN RECORD AS alternate_allele_count,
    FROM
      [va_aaa_pilot_data.all_genomes_gvcfs]
    # Optionally add clause here to limit the query to a particular
    # region of the genome.
    #_WHERE_
    HAVING
      # Skip 1/2 genotypes _and non-SNP variants
      num_alts = 1
      AND reference_bases IN ('A','C','G','T')
      AND alternate_bases IN ('A','C','G','T'))
  GROUP BY
    alternate_allele_count)
ORDER BY
  alternate_allele_count DESC

Number of rows returned by this query: 961.

Displaying the first few rows of the dataframe of results:

transitions	transversions	titv	alternate_allele_count
39938	19562	2.04	960
19010	8344	2.28	959
38832	20295	1.91	958
17355	8160	2.13	957
11650	5755	2.02	956
7414	3446	2.15	955

Visualizing the results:

ggplot(result, aes(x=alternate_allele_count, y=titv)) +
  geom_point() +
  stat_smooth() +
  scale_x_continuous() +
  xlab("Total Number of Sample Alleles with the Variant") +
  ylab("Ti/Tv") +
  ggtitle("Ti/Tv by Alternate Allele Count") +
  plot_theme

Ti/Tv by Depth

We next want to determine whether mutations at a specific depth seem to be occuring at various coverage depths are occuring more often than would be expected by random chance. We do this by grouping variants from each genome together by coverage depth, e.g. if variant A and variant B both occur in the same sample with 30x coverage they will be grouped for this analysis. Variants at coverage depths that have ti/tv ratios outside the expected range will be flagged.

query <- "./sql/ti-tv-by-depth.sql"
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=c(tableReplacement))

# Calculate ti/tv ratio for all SNPs for each sample grouped by depth of coverage.

SELECT
  call.call_set_name,
  (transitions/transversions) AS titv_ratio,
  average_depth,
FROM (
  SELECT
    call.call_set_name,
    SUM(mutation IN ('A->G', 'G->A', 'C->T', 'T->C')) AS transitions,
    SUM(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
                     'A->T', 'T->A', 'C->G', 'G->C')) AS transversions,
    ROUND(AVG(call.DP)) AS average_depth,
  FROM (

    SELECT
      call.call_set_name,
      reference_bases,
      alternate_bases,
      CONCAT(reference_bases, CONCAT(STRING('->'), alternate_bases)) AS mutation,
      COUNT(alternate_bases) WITHIN RECORD AS num_alts,
      call.DP
    FROM (
      SELECT
        call.call_set_name,
        reference_bases,
        GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
        call.genotype,
        call.DP,
      FROM
        [va_aaa_pilot_data.multi_sample_variants_seq_qc]
      # Optionally add clause here to limit the query to a particular
      # region of the genome.
      #_WHERE_  
      )
    WHERE
      call.DP is not null
    HAVING
      # Skip 1/2 genotypes _and non-SNP variants
      num_alts = 1
      AND reference_bases IN ('A','C','G','T')
      AND alternate_bases IN ('A','C','G','T'))
    GROUP BY 
      call.call_set_name,
      call.DP,)
WHERE
  transversions > 0
GROUP BY
  call.call_set_name,
  titv_ratio,
  average_depth,

Retrieving data:  3.4s
Retrieving data:  4.9s
Retrieving data:  7.7s
Retrieving data:  9.2s
Retrieving data: 10.9s
Retrieving data: 12.4s
Retrieving data: 14.1s
Retrieving data: 15.6s
Retrieving data: 17.4s
Retrieving data: 19.5s

call_call_set_name	titv_ratio	average_depth
LP6005692-DNA_G03	2.12	29.00
LP6005692-DNA_C05	2.13	36.00
LP6005144-DNA_F02	2.16	24.00
LP6005144-DNA_H12	2.16	34.00
LP6005038-DNA_D03	2.14	34.00
LP6005051-DNA_F09	1.96	45.00

ggplot(result, aes(x=average_depth, y=titv_ratio, color=call_call_set_name)) + 
  geom_point() +
  ggtitle("Ti/Tv Ratio By Depth") +
  xlab("Coverage Depth") + 
  ylab("Ti/Tv") + 
  plot_theme +
  theme(legend.position="none")

Missingness Rate

We next want to check how frequenctly or infrequenctly a variant is called across all samples. If a variant has a very low call rate this may mean it is difficult to sequence and even high confidence calls at that position may be suspect.

sortAndLimit <- "ORDER BY missingness_rate DESC, reference_name, start, reference_bases, alternate_bases LIMIT 1000"
cutoff <- list("_CUTOFF_"="0.9")
result <- DisplayAndDispatchQuery("./sql/variant-level-missingness-fail.sql",
                                  project=project,
                                  replacements=c("#_ORDER_BY_"=sortAndLimit,
                                                 queryReplacements,
                                                 cutoff))

# Select positions where the missingness rate across all samples is above a defined cutoff.

SELECT 
variant_id,
"variant_missingness" AS failure_reason,
missingness_rate,
FROM (
  SELECT
  variant_id,
  reference_name,
  start,
  end,
  reference_bases,
  alternate_bases,
  called_allele_count,
  1 - (called_allele_count)/sample_count AS missingness_rate,
  sample_count
  FROM (
    SELECT
    variant_id,
    reference_name,
    start,
    end,
    reference_bases,
    GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
    SUM(call.genotype >= 0) WITHIN RECORD AS called_allele_count,
    FROM
    [va_aaa_pilot_data.all_genomes_expanded_vcfs_java2]
    # Optionally add clause here to limit the query to a particular
    # region of the genome.
    #_WHERE_
  ) AS g
  CROSS JOIN (
    SELECT
    COUNT(call.call_set_name) AS sample_count
    FROM (
      SELECT 
      call.call_set_name
      FROM
      [va_aaa_pilot_data.all_genomes_expanded_vcfs_java2]
      GROUP BY 
      call.call_set_name)) AS count )
WHERE
missingness_rate > 0.9
# Optionally add a clause here to sort and limit the results.
ORDER BY missingness_rate DESC, reference_name, start, reference_bases, alternate_bases LIMIT 1000

Number of rows returned by this query: 1000.

Displaying the first few rows of the dataframe of results:

variant_id	failure_reason	missingness_rate
CJ_JqPj1p5_yIRIEY2hyMRj9UiCG4PHHiLuQ8O8B	variant_missingness	1.00
CJ_JqPj1p5_yIRIEY2hyMRir8QEg8NHvoIrojceRAQ	variant_missingness	1.00
CJ_JqPj1p5_yIRIEY2hyMRjK8QEgyqWChcvz7JIK	variant_missingness	1.00
CJ_JqPj1p5_yIRIEY2hyMRi-9AEgt66dj4yx7a9D	variant_missingness	1.00
CJ_JqPj1p5_yIRIEY2hyMRi-9AEgudWm_IbA20k	variant_missingness	1.00
CJ_JqPj1p5_yIRIEY2hyMRjalAMgmpr66o-86pW0AQ	variant_missingness	1.00

Hardy-Weinberg Equilibrium

For each variant, compute the expected versus observed relationship between allele frequencies and genotype frequencies per the Hardy-Weinberg Equilibrium.

sortAndLimit <- "ORDER BY ChiSq DESC, reference_name, start, alternate_bases LIMIT 1000"
result <- DisplayAndDispatchQuery("./sql/hardy-weinberg.sql",
                                  project=project,
                                  replacements=c("#_ORDER_BY_"=sortAndLimit,
                                                 queryReplacements))

# The following query computes the Hardy-Weinberg equilibrium for variants.
SELECT
  reference_name,
  start,
  reference_bases,
  alternate_bases,
  OBS_HOM1,
  OBS_HET,
  OBS_HOM2,
  E_HOM1,
  E_HET,
  E_HOM2,

  # Chi Squared Calculation
  # SUM(((Observed - Expected)^2) / Expected )
  ROUND((POW(OBS_HOM1 - E_HOM1, 2) / E_HOM1)
  + (POW(OBS_HET - E_HET, 2) / E_HET)
  + (POW(OBS_HOM2 - E_HOM2, 2) / E_HOM2), 6)
  AS ChiSq,

  # Determine if Chi Sq value is significant
  IF((POW(OBS_HOM1 - E_HOM1, 2) / E_HOM1)
  + (POW(OBS_HET - E_HET, 2) / E_HET)
  + (POW(OBS_HOM2 - E_HOM2, 2) / E_HOM2)
  > 5.991, "TRUE", "FALSE") AS PVALUE_SIG

FROM (
    SELECT
      reference_name,
      start,
      reference_bases,
      alternate_bases,
      OBS_HOM1,
      OBS_HET,
      OBS_HOM2,

      # Expected AA
      # p^2
      # ((COUNT(AA) + (COUNT(Aa)/2) /
      #  SAMPLE_COUNT) ^ 2) * SAMPLE_COUNT
      ROUND(POW((OBS_HOM1 + (OBS_HET/2)) /
        SAMPLE_COUNT, 2) * SAMPLE_COUNT, 2)
        AS E_HOM1,

      # Expected Aa
      # 2pq
      # 2 * (COUNT(AA) + (COUNT(Aa)/2) / SAMPLE_COUNT) *
      # (COUNT(aa) + (COUNT(Aa)/2) / SAMPLE_COUNT)
      # * SAMPLE_COUNT
      ROUND(2 * ((OBS_HOM1 + (OBS_HET/2)) / SAMPLE_COUNT) *
        ((OBS_HOM2 + (OBS_HET/2)) / SAMPLE_COUNT)
        * SAMPLE_COUNT, 2)
        AS E_HET,

      # Expected aa
      # q^2
      # (COUNT(aa) + (COUNT(Aa)/2) /
      #  SAMPLE_COUNT) ^ 2 * SAMPLE_COUNT
      ROUND(POW((OBS_HOM2 + (OBS_HET/2)) /
        SAMPLE_COUNT, 2) * SAMPLE_COUNT, 2)
        AS E_HOM2,

  FROM (
    SELECT
      reference_name,
      start,
      reference_bases,
      alternate_bases,
      HOM_REF AS OBS_HOM1,
      HET AS OBS_HET,
      HOM_ALT AS OBS_HOM2,
      HOM_REF + HET + HOM_ALT AS SAMPLE_COUNT,
    FROM (
      SELECT
        reference_name,
        start,
        END,
        reference_bases,
        GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
        COUNT(alternate_bases) WITHIN RECORD AS num_alts,
        SUM(EVERY(0 = call.genotype)) WITHIN call AS HOM_REF,
        SUM(EVERY(1 = call.genotype)) WITHIN call AS HOM_ALT,
        SUM(SOME(0 = call.genotype)
          AND SOME(1 = call.genotype)) WITHIN call AS HET,
      FROM
        [va_aaa_pilot_data.all_genomes_expanded_vcfs_java2]
      # Optionally add a clause here to limit the query to a particular
      # region of the genome.
      #_WHERE_
      HAVING
        # Skip 1/2 genotypes
        num_alts = 1
        )))
# Optionally add a clause here to sort and limit the results.
ORDER BY ChiSq DESC, reference_name, start, alternate_bases LIMIT 1000

Number of rows returned by this query: 1000.

Displaying the first few rows of the dataframe of results:

reference_name	start	reference_bases	alternate_bases	OBS_HOM1	OBS_HOM2	E_HOM1	E_HET	E_HOM2	ChiSq	PVALUE_SIG
chr10	1461139	C	T	454	4	450.03	7.93	0.03	533.33	TRUE
chr11	101415618	T	A	454	4	450.03	7.93	0.03	533.33	TRUE
chr11	120928708	T	G	454	4	450.03	7.93	0.03	533.33	TRUE
chr13	59168960	A	G	454	4	450.03	7.93	0.03	533.33	TRUE
chr15	98813562	A	G	454	4	450.03	7.93	0.03	533.33	TRUE
chr2	217343776	G	C	454	4	450.03	7.93	0.03	533.33	TRUE

Heterozygous Haplotype

For each variant within the X and Y chromosome, identify heterozygous variants in male genomes. Males should not have any heterozygous calls on the X-chromosome outside of the pseudo-autosomal regions. If any are found they will be flagged.

First we use our sample information to determine which genomes are male.

maleSampleIds <- paste("'", filter(sampleInfo, gender == "Male")$call_call_set_name, "'", sep="", collapse=",")

sortAndLimit <- "ORDER BY reference_name, start, alternate_bases, call.call_set_name LIMIT 1000"
result <- DisplayAndDispatchQuery("./sql/sex-chromosome-heterozygous-haplotypes.sql",
                                  project=project,
                                  replacements=c("_MALE_SAMPLE_IDS_"=maleSampleIds,
                                                 "#_ORDER_BY_"=sortAndLimit,
                                                 queryReplacements))

# Identify heterozygous variants on the sex chromosomes for reportedly male samples. 

SELECT
  variant_id,
  sample_id,
  "heterozygous_haplotype" AS failure_reason,
FROM (
SELECT
  variant_id,
  reference_name,
  start,
  end,
  reference_bases,
  call.call_set_name AS sample_id,
  GROUP_CONCAT(STRING(call.genotype)) WITHIN call AS genotype,
FROM(FLATTEN((
  [va_aaa_pilot_data.all_genomes_expanded_vcfs_java2]), call.call_set_name))
WHERE
  reference_name IN ('chrX', 'chrY')
OMIT
  call IF (2 > COUNT(call.genotype))
  OR EVERY(call.genotype <= 0)
  OR EVERY(call.genotype = 1)
  # Pseudoautosomal Region 1
  OR (reference_name = 'chrX'
    AND start > 60001
    AND end < 2699520)
  OR (reference_name = 'chrY'
    AND start > 10001
    AND end < 2649520)
  # Pseudoautosomal Region 2
  OR (reference_name = 'chrX'
    AND start > 155260560
    AND end < 155270560)
  OR (reference_name = 'chrY' 
    AND start > 59363566
    AND end < 59373566)) AS seq
JOIN (
  SELECT
    IlluminaID,
    SEX
  FROM
    [va_aaa_pilot_data.patient_info] ) AS info
ON
  seq.sample_id = info.IlluminaID
WHERE
  SEX = 'M'
# Optionally add a clause here to sort and limit the results.
ORDER BY reference_name, start, alternate_bases, call.call_set_name LIMIT 1000

Error: Field 'alternate_bases' not found on either side of the JOIN

query invalidQuery. Field 'alternate_bases' not found on either side of the JOIN

Number of rows returned by this query: 1000.

Displaying the first few rows of the dataframe of results:

reference_name	start	reference_bases	alternate_bases	OBS_HOM1	OBS_HOM2	E_HOM1	E_HET	E_HOM2	ChiSq	PVALUE_SIG
chr10	1461139	C	T	454	4	450.03	7.93	0.03	533.33	TRUE
chr11	101415618	T	A	454	4	450.03	7.93	0.03	533.33	TRUE
chr11	120928708	T	G	454	4	450.03	7.93	0.03	533.33	TRUE
chr13	59168960	A	G	454	4	450.03	7.93	0.03	533.33	TRUE
chr15	98813562	A	G	454	4	450.03	7.93	0.03	533.33	TRUE
chr2	217343776	G	C	454	4	450.03	7.93	0.03	533.33	TRUE

Blacklisted Variants

Identify all variants within our cohort that have been blacklisted. For more information on what variants are blacklisted and why see here. All variants lccuring within blacklisted regions will be flagged.

result <- DisplayAndDispatchQuery("./sql/blacklisted-variants.sql",
                                  project=project,
                                  replacements=c(queryReplacements))

# Select variants that are found within blacklisted genes.
# https://sites.google.com/site/anshulkundaje/projects/blacklists

SELECT
  seq.variant_id AS variant_id,
  "blacklisted" AS failure_reason,
  bl.Artifact_Type AS Artifact_Type 
FROM (
  SELECT
    variant_id,
    reference_name,
    start,
    end,
  FROM
    [va_aaa_pilot_data.all_genomes_expanded_vcfs_java2]) as seq
JOIN (
  SELECT 
    reference_name,
    start - 1 AS start,
    end,
    Artifact_Type
  FROM 
    [resources.blacklisted_positions]) AS bl
ON
  seq.reference_name = bl.reference_name
WHERE 
  seq.start >= bl.start AND
  seq.end <= bl.end
#_ORDER_BY_


Retrieving data:  3.8s
Retrieving data:  6.3s
Retrieving data:  9.9s
Retrieving data: 12.5s
Retrieving data: 15.0s
Retrieving data: 17.6s
Retrieving data: 19.9s
Retrieving data: 22.8s
Retrieving data: 26.4s
Retrieving data: 29.0s
Retrieving data: 32.2s
Retrieving data: 36.9s
Retrieving data: 39.5s
Retrieving data: 42.9s
Retrieving data: 46.5s
Retrieving data: 49.7s
Retrieving data: 52.0s
Retrieving data: 54.7s
Retrieving data: 57.4s
Retrieving data: 60.2s
Retrieving data: 62.4s
Retrieving data: 65.2s
Retrieving data: 67.7s
Retrieving data: 70.3s
Retrieving data: 72.5s
Retrieving data: 75.4s
Retrieving data: 77.2s
Retrieving data: 79.0s
Retrieving data: 81.8s
Retrieving data: 84.1s
Retrieving data: 86.1s
Retrieving data: 88.3s
Retrieving data: 89.9s
Retrieving data: 92.3s
Retrieving data: 94.6s
Retrieving data: 96.5s
Retrieving data: 98.4s
Retrieving data: 100.4s
Retrieving data: 102.1s
Retrieving data: 104.1s
Retrieving data: 106.2s
Retrieving data: 109.8s
Retrieving data: 111.6s
Retrieving data: 113.5s
Retrieving data: 115.3s
Retrieving data: 117.3s
Retrieving data: 119.1s
Retrieving data: 120.9s
Retrieving data: 123.1s

Number of rows returned by this query: 509065.

Displaying the first few rows of the dataframe of results:

variant_id	failure_reason	Artifact_Type
CJ_JqPj1p5_yIRIEY2hyNRiv_4QWIKeCtZ3phP3_hwE	blacklisted	centromeric_repeat
CJ_JqPj1p5_yIRIEY2hyNRjT_4QWINuMj6To5JHvAw	blacklisted	centromeric_repeat
CJ_JqPj1p5_yIRIEY2hyNRjbgIUWIKinoLyFt4GxzwE	blacklisted	centromeric_repeat
CJ_JqPj1p5_yIRIEY2hyNRjhgoUWII2H6sHWt4XFuAE	blacklisted	centromeric_repeat
CJ_JqPj1p5_yIRIEY2hyNRj7goUWIOG13aLxo5u_Gw	blacklisted	centromeric_repeat
CJ_JqPj1p5_yIRIEY2hyNRjeg4UWIPqdwOHH34HXbA	blacklisted	centromeric_repeat

Next: Part 4: QC Implementation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Variant-Level-QC.md

Variant-Level-QC.md

Part 3: Variant-Level QC

Setup

Ti/Tv by Genomic Window

Ti/Tv by Alternate Allele Counts

Ti/Tv by Depth

Missingness Rate

Hardy-Weinberg Equilibrium

Heterozygous Haplotype

Blacklisted Variants

Files

Variant-Level-QC.md

Latest commit

History

Variant-Level-QC.md

File metadata and controls

Part 3: Variant-Level QC

Setup

Ti/Tv by Genomic Window

Ti/Tv by Alternate Allele Counts

Ti/Tv by Depth

Missingness Rate

Hardy-Weinberg Equilibrium

Heterozygous Haplotype

Blacklisted Variants