Comparison of Google Genomics tools vs Academic Tools

The purpose of this codelab is to compare BigQuery queries to academic command line tools for genomic analysis.

Setup
Singleton Rate
Homozygosity Rate and Inbreeding Coefficient
Hardy-Weinberg Equilibrium
Ti/Tv by Alternate Allele Counts
Ti/Tv by Depth
Sample-Level Missingness
Variant-Level Missingness
Sex Inference
Heterozygous Haplotype
Ethnicity Inference
Genome Similarity
Genotype Concordance

Setup

tableReplacement <- list("_THE_TABLE_"="gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2",
                          "_THE_EXPANDED_TABLE_"="gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_2",
                         "_REF_TABLE_"="gbsc-gcp-project-mvp:qc_tables.5_genomes_ref_calls_brca1",
                         "_VARIANT_TABLE_"="gbsc-gcp-project-mvp:qc_tables.5_genomes_variants_brca1",
                         "_GENOTYPING_TABLE_"="gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_genotyping_vcfs")
sampleData <- read.csv("./data/patient_info.csv")
sampleInfo <- select(sampleData, call_call_set_name=Catalog.ID, gender=Gender)

constraints <- list("#_WHERE_"="WHERE 
                    reference_name = 'chr17'
                    AND start BETWEEN 41196311
                    AND 41277499")

googleRepository = "https://raw.githubusercontent.com/googlegenomics/codelabs/master/R/PlatinumGenomes-QC/sql/"

Top

Singleton Rate

Singletone Rate in Google Genomics

For each genome, count the number of variants shared by no other member of the cohort. Too many private calls for a particular individual may indicate a problem.

For our 5 genome sample set the singleton rate will be very high due to the small number of genomes.

query <- paste(googleRepository, "private-variants-brca1.sql", sep = "")
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=tableReplacement)

# Private variants within BRCA1.
SELECT
  reference_name AS CHROM,
  start AS POS,
  GROUP_CONCAT(CASE WHEN cnt = 1 THEN 'S'
    WHEN cnt = 2 THEN 'D'
    ELSE STRING(cnt) END) AS SINGLETON_DOUBLETON,
  reference_bases AS REF,
  alternate_bases AS ALT,
  GROUP_CONCAT(call.call_set_name) AS INDV,
  GROUP_CONCAT(genotype) AS genotype,
  COUNT(call.call_set_name) AS num_samples_with_variant
FROM (
  SELECT
    reference_name,
    start,
    reference_bases,
    alternate_bases,
    alt_num,
    call.call_set_name,
    GROUP_CONCAT(STRING(call.genotype)) WITHIN call AS genotype,
    SUM(call.genotype == alt_num) WITHIN call AS cnt,
  FROM (
      FLATTEN((
        SELECT
          reference_name,
          start,
          reference_bases,
          alternate_bases,
          POSITION(alternate_bases) AS alt_num,
          call.call_set_name,
          call.genotype,
        FROM
          [gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2]
        WHERE
          reference_name = 'chr17'
          AND start BETWEEN 41196311
          AND 41277499
        OMIT call IF EVERY(call.genotype = -1)
          ),
        alternate_bases)
      )
  OMIT RECORD IF alternate_bases IS NULL
  HAVING
    cnt > 0
    )
GROUP BY
  chrom,
  pos,
  ref,
  alt
HAVING
  num_samples_with_variant = 1
ORDER BY
  POS,
  INDV

Number of rows returned by this query: 33.

Displaying the first few results:

CHROM	POS	SINGLETON_DOUBLETON	REF	ALT	INDV	genotype	num_samples_with_variant
chr17	41196944	S	T	C	LP6005038-DNA_B02	"0,1"	1
chr17	41197938	S	A	AT	LP6005038-DNA_B02	"0,1"	1
chr17	41200536	S	T	C	LP6005038-DNA_A02	"0,1"	1
chr17	41202633	S	AG	A	LP6005038-DNA_A01	"0,1"	1
chr17	41204835	S	ACACACT	A	LP6005038-DNA_A01	"0,1"	1
chr17	41208691	S	G	GA	LP6005038-DNA_A03	"0,1"	1

Compare to output from vcftools, which has 85 some of which are for 0/0 genotypes from reference matching blocks (see the vcftools command line used to create this file).

Command line execution

vcftools --gzvcf brca1.merged.sample_set.vcf.gz --singletons --out singletons

expectedResult <- read.table("./data/brca1/singletons.singletons", header=TRUE)
# Convert to zero-based coordinates
expectedResult <- mutate(expectedResult, POS = POS - 1)
# Clean colnames to match
colnames(expectedResult) <- gsub('\\.+', '_', colnames(expectedResult))

How many singletons do the two results have in common?

nrow(inner_join(result, expectedResult))

## Joining by: c("CHROM", "POS", "SINGLETON_DOUBLETON", "INDV")

## [1] 54

How many singletons were only identified by BigQuery?

onlyBQ <- anti_join(result, expectedResult)

## Joining by: c("CHROM", "POS", "SINGLETON_DOUBLETON", "INDV")

nrow(onlyBQ)

## [1] 0

## Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, : length of 'dimnames' [2] not equal to array extent

Which singletons were only identified by vcftools?

onlyVcftools <- anti_join(expectedResult, result)

## Joining by: c("CHROM", "POS", "SINGLETON_DOUBLETON", "INDV")

print(xtable(onlyVcftools), type="html", include.rownames=F)

CHROM	POS	SINGLETON_DOUBLETON	ALLELE	INDV
chr17	41271293.00	D	A	LP6005038-DNA_B02
chr17	41267518.00	D	A	LP6005038-DNA_B02
chr17	41256102.00	D	G	LP6005038-DNA_B02
chr17	41256101.00	D	A	LP6005038-DNA_B02
chr17	41256100.00	D	A	LP6005038-DNA_B02
chr17	41256086.00	D	G	LP6005038-DNA_B02
chr17	41256085.00	D	A	LP6005038-DNA_B02
chr17	41256084.00	D	A	LP6005038-DNA_B02
chr17	41256083.00	D	A	LP6005038-DNA_B02
chr17	41256082.00	D	A	LP6005038-DNA_B02
chr17	41256081.00	D	A	LP6005038-DNA_B02
chr17	41256080.00	D	A	LP6005038-DNA_B02
chr17	41256079.00	D	A	LP6005038-DNA_B02
chr17	41256078.00	D	A	LP6005038-DNA_B02
chr17	41256077.00	D	A	LP6005038-DNA_B02
chr17	41256076.00	D	A	LP6005038-DNA_B02
chr17	41256075.00	D	A	LP6005038-DNA_B02
chr17	41252696.00	D	T	LP6005038-DNA_B01
chr17	41252647.00	D	G	LP6005038-DNA_B02
chr17	41252634.00	D	A	LP6005038-DNA_B02
chr17	41249363.00	D	A	LP6005038-DNA_B02
chr17	41242077.00	D	G	LP6005038-DNA_B02
chr17	41242076.00	D	A	LP6005038-DNA_B02
chr17	41230105.00	D	T	LP6005038-DNA_B02
chr17	41229760.00	D	T	LP6005038-DNA_B02
chr17	41226740.00	D	T	LP6005038-DNA_B02
chr17	41226739.00	D	T	LP6005038-DNA_B02
chr17	41226738.00	D	G	LP6005038-DNA_B02
chr17	41226737.00	D	G	LP6005038-DNA_B02
chr17	41226736.00	D	G	LP6005038-DNA_B02
chr17	41214210.00	D	A	LP6005038-DNA_B02
chr17	41242075.00	D	A	LP6005038-DNA_B02
chr17	41214209.00	D	A	LP6005038-DNA_B02
chr17	41202634.00	D	G	LP6005038-DNA_B02

Retrieving the gVCF data for the singletons identified only by vcftools:

having <- paste("start = ", onlyVcftools$POS,
                sep="", collapse=" OR ")
query <- paste(googleRepository, "examine-data.sql", sep="")
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=c(tableReplacement,
                                                 "_HAVING_"=having))

# Examine the data for particular calls.
SELECT
  call.call_set_name,
  GROUP_CONCAT(STRING(call.genotype)) WITHIN call AS genotype,
  reference_name,
  start,
  end,
  reference_bases,
  GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
  quality,
  GROUP_CONCAT(filter) WITHIN RECORD AS filter,
FROM
  [gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2]
WHERE
  reference_name = 'chr17'
HAVING
  start = 41271293 OR start = 41267518 OR start = 41256102 OR start = 41256101 OR start = 41256100 OR start = 41256086 OR start = 41256085 OR start = 41256084 OR start = 41256083 OR start = 41256082 OR start = 41256081 OR start = 41256080 OR start = 41256079 OR start = 41256078 OR start = 41256077 OR start = 41256076 OR start = 41256075 OR start = 41252696 OR start = 41252647 OR start = 41252634 OR start = 41249363 OR start = 41242077 OR start = 41242076 OR start = 41230105 OR start = 41229760 OR start = 41226740 OR start = 41226739 OR start = 41226738 OR start = 41226737 OR start = 41226736 OR start = 41214210 OR start = 41242075 OR start = 41214209 OR start = 41202634
ORDER BY
  start,
  end,
  call.call_set_name

call_call_set_name	genotype	reference_name	start	end	reference_bases	alternate_bases	quality	filter
LP6005038-DNA_B01	0,0	chr17	41252696	41252755	T		31.21

It appears that they correspond either to:

A reference-matching block, so not actually a singleton and just perhaps violating an assumption in the vcftools code. Most of the start sites queried return nothing because they are withing reference matching blocks.
Or a non-singleon variant, perhaps due to a problem in converting the gVCF data to all-positions VCF via gvcftools?

Top

Homozygosity Rate and Inbreeding Coefficient

Homozygosity Rate and Inbreeding Coefficient in Google Genomics

For each genome, compare the expected and observed rates of homozygosity.

query <- paste(googleRepository, "homozygous-variants.sql", sep="") 
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=c(tableReplacement, constraints))

# Compute the expected and observed homozygosity rate for each individual.
SELECT
  call.call_set_name,
  O_HOM,
  ROUND(E_HOM, 2) as E_HOM,
  N_SITES,
  ROUND((O_HOM - E_HOM) / (N_SITES - E_HOM), 5) AS F
FROM (
  SELECT
    call.call_set_name,
    SUM(first_allele = second_allele) AS O_HOM,
    SUM(1.0 - (2.0 * freq * (1.0 - freq) * (called_allele_count / (called_allele_count - 1.0)))) AS E_HOM,
    COUNT(call.call_set_name) AS N_SITES,
  FROM (
    SELECT
      reference_name,
      start,
      reference_bases,
      GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
      call.call_set_name,
      NTH(1, call.genotype) WITHIN call AS first_allele,
      NTH(2, call.genotype) WITHIN call AS second_allele,
      COUNT(alternate_bases) WITHIN RECORD AS num_alts,
      SUM(call.genotype >= 0) WITHIN RECORD AS called_allele_count,
      IF((SUM(1 = call.genotype) > 0),
        SUM(call.genotype = 1)/SUM(call.genotype >= 0),
        -1)  WITHIN RECORD AS freq
    FROM
      [gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_2]
    # Optionally add a clause here to limit the query to a particular
    # region of the genome.
    WHERE 
                    reference_name = 'chr17'
                    AND start BETWEEN 41196311
                    AND 41277499
    # Skip no calls and haploid sites
    OMIT call IF SOME(call.genotype < 0) OR (2 > COUNT(call.genotype))
    HAVING
      # Skip 1/2 genotypes _and non-SNP variants
      num_alts = 1
      AND reference_bases IN ('A','C','G','T')
      AND alternate_bases IN ('A','C','G','T')
      )
  GROUP BY
    call.call_set_name
    )
ORDER BY
  call.call_set_name

Number of rows returned by this query: 5.

Displaying the first few results:

call_call_set_name	O_HOM	E_HOM	N_SITES	F
LP6005038-DNA_A01	16	79.87	136	-1.14
LP6005038-DNA_A02	14	68.87	135	-0.83
LP6005038-DNA_A03	15	68.23	134	-0.81
LP6005038-DNA_B01	18	80.00	136	-1.11
LP6005038-DNA_B02	137	81.97	140	0.95

Let's compare to what was found using vcftools.

Command line execution.

vcftools --gzvcf brca1.merged.sample_set.vcf.gz --het --out het_calls

expectedResult <- read.table("./data/brca1/het_calls.het", header=TRUE)

E_HOM.x	E_HOM.y	F.x	F.y	INDV	N_SITES.x	N_SITES.y	O_HOM.x	O_HOM.y
65.50	79.87	-0.85	-1.14	LP6005038-DNA_A01	143	136	0	16
66.40	68.87	-0.70	-0.83	LP6005038-DNA_A02	150	135	8	14
67.60	68.23	-0.72	-0.81	LP6005038-DNA_A03	149	134	9	15
64.00	80.00	-0.87	-1.11	LP6005038-DNA_B01	135	136	2	18
61.80	81.97	0.92	0.95	LP6005038-DNA_B02	133	140	127	137

The logic in the query looks similar to vcftools output_het method but there is clearly a difference.

TODO: investigate the difference further.

Top

Hardy-Weinberg Equilibrium

Hardy-Weinberg Equiibrium in Google Genomics

query <- paste(googleRepository, "hardy-weinberg.sql", sep="")
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=c(tableReplacement, constraints))

# The following query computes the Hardy-Weinberg equilibrium for variants.
SELECT
  reference_name,
  start,
  reference_bases,
  alternate_bases,
  OBS_HOM1,
  OBS_HET,
  OBS_HOM2,
  E_HOM1,
  E_HET,
  E_HOM2,

  # Chi Squared Calculation
  # SUM(((Observed - Expected)^2) / Expected )
  ROUND((POW(OBS_HOM1 - E_HOM1, 2) / E_HOM1)
  + (POW(OBS_HET - E_HET, 2) / E_HET)
  + (POW(OBS_HOM2 - E_HOM2, 2) / E_HOM2), 6)
  AS ChiSq,

  # Determine if Chi Sq value is significant
  IF((POW(OBS_HOM1 - E_HOM1, 2) / E_HOM1)
  + (POW(OBS_HET - E_HET, 2) / E_HET)
  + (POW(OBS_HOM2 - E_HOM2, 2) / E_HOM2)
  > 5.991, "TRUE", "FALSE") AS PVALUE_SIG

FROM (
    SELECT
      reference_name,
      start,
      reference_bases,
      alternate_bases,
      OBS_HOM1,
      OBS_HET,
      OBS_HOM2,

      # Expected AA
      # p^2
      # ((COUNT(AA) + (COUNT(Aa)/2) /
      #  SAMPLE_COUNT) ^ 2) * SAMPLE_COUNT
      ROUND(POW((OBS_HOM1 + (OBS_HET/2)) /
        SAMPLE_COUNT, 2) * SAMPLE_COUNT, 2)
        AS E_HOM1,

      # Expected Aa
      # 2pq
      # 2 * (COUNT(AA) + (COUNT(Aa)/2) / SAMPLE_COUNT) *
      # (COUNT(aa) + (COUNT(Aa)/2) / SAMPLE_COUNT)
      # * SAMPLE_COUNT
      ROUND(2 * ((OBS_HOM1 + (OBS_HET/2)) / SAMPLE_COUNT) *
        ((OBS_HOM2 + (OBS_HET/2)) / SAMPLE_COUNT)
        * SAMPLE_COUNT, 2)
        AS E_HET,

      # Expected aa
      # q^2
      # (COUNT(aa) + (COUNT(Aa)/2) /
      #  SAMPLE_COUNT) ^ 2 * SAMPLE_COUNT
      ROUND(POW((OBS_HOM2 + (OBS_HET/2)) /
        SAMPLE_COUNT, 2) * SAMPLE_COUNT, 2)
        AS E_HOM2,

  FROM (
    SELECT
      reference_name,
      start,
      reference_bases,
      alternate_bases,
      HOM_REF AS OBS_HOM1,
      HET AS OBS_HET,
      HOM_ALT AS OBS_HOM2,
      HOM_REF + HET + HOM_ALT AS SAMPLE_COUNT,
    FROM (
      SELECT
        reference_name,
        start,
        END,
        reference_bases,
        GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
        COUNT(alternate_bases) WITHIN RECORD AS num_alts,
        SUM(EVERY(0 = call.genotype)) WITHIN call AS HOM_REF,
        SUM(EVERY(1 = call.genotype)) WITHIN call AS HOM_ALT,
        SUM(SOME(0 = call.genotype)
          AND SOME(1 = call.genotype)) WITHIN call AS HET,
      FROM
        [gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_2]
      # Optionally add a clause here to limit the query to a particular
      # region of the genome.
      WHERE 
                    reference_name = 'chr17'
                    AND start BETWEEN 41196311
                    AND 41277499
      HAVING
        # Skip 1/2 genotypes
        num_alts = 1
        )))
# Optionally add a clause here to sort and limit the results.
#_ORDER_BY_

Number of rows returned by this query: 185.

Displaying the first few results:

reference_name	start	reference_bases	alternate_bases	OBS_HOM1	OBS_HET	E_HOM1	E_HET	E_HOM2	ChiSq	PVALUE_SIG
chr17	41277186	G	C	1	4	1.80	2.40	0.80	2.22	FALSE
chr17	41230104	CT	C	0	4	1.00	2.00	1.00	4.00	FALSE
chr17	41230227	G	A	1	4	1.80	2.40	0.80	2.22	FALSE
chr17	41230335	A	G	1	4	1.80	2.40	0.80	2.22	FALSE
chr17	41230375	A	G	1	4	1.80	2.40	0.80	2.22	FALSE
chr17	41230523	T	G	3	2	3.20	1.60	0.20	0.31	FALSE

Compare to brca1.hwe (see the vcftools command line used to create this file).

Command line

vcftools --gzvcf brca1.merged.sample_set.vcf.gz --hardy --out hardy

require(dplyr)
df <- read.table("./data/brca1/hardy.hwe", header=TRUE)
obsSplitCol <- "OBS.HOM1.HET.HOM2."
obsTemp <- read.table(text=as.character(df[, obsSplitCol]), sep = "/")
names(obsTemp) <- c("OBS_HOM1", "OBS_HET", "OBS_HOM2")
eSplitCol <- "E.HOM1.HET.HOM2."
eTemp <- read.table(text=as.character(df[, eSplitCol]), sep = "/")
names(eTemp) <- c("E_HOM1", "E_HET", "E_HOM2")
expectedResult <- cbind(cbind(df[setdiff(names(df), c(obsSplitCol,eSplitCol))], obsTemp), eTemp)
# Convert to zero-based coordinates
expectedResult <- mutate(expectedResult, POS = POS - 1)
names(expectedResult)[1] = 'reference_name'
names(expectedResult)[2] = 'start'

How many results do the two results have in common?

nrow(inner_join(result, expectedResult, by=c("reference_name", "start", "OBS_HOM1", "OBS_HET", "OBS_HOM2")))

## [1] 160

How many result were only identified by BigQuery?

onlyBQ <- anti_join(result, expectedResult, , by=c("reference_name", "start", "OBS_HOM1", "OBS_HET", "OBS_HOM2"))
nrow(onlyBQ)

## [1] 25

Let's take a look at the first few.

reference_name	start	reference_bases	alternate_bases	OBS_HOM1	OBS_HET	E_HOM1	E_HET	E_HOM2	ChiSq	PVALUE_SIG
chr17	41196944	T	C	4	1	4.05	0.90	0.05	0.06	FALSE
chr17	41201363	T	C	3	2	3.20	1.60	0.20	0.31	FALSE
chr17	41211485	C	CACA	0	1	0.25	0.50	0.25	1.00	FALSE
chr17	41211485	C	CACAACA	0	1	0.25	0.50	0.25	1.00	FALSE
chr17	41213759	C	T	3	2	3.20	1.60	0.20	0.31	FALSE
chr17	41216205	T	C	3	2	3.20	1.60	0.20	0.31	FALSE

How many results were only identified by vcftools?

onlyVcftools <- anti_join(expectedResult, result, , by=c("reference_name", "start", "OBS_HOM1", "OBS_HET", "OBS_HOM2"))
nrow(onlyVcftools)

## [1] 16

Let's take a look at the first few.

reference_name	start	ChiSq	P	OBS_HET	E_HOM1	E_HET	E_HOM2
chr17	41196944.00	1.00	1.00	1	0.25	0.50	0.25
chr17	41201363.00	2.00	1.00	2	0.50	1.00	0.50
chr17	41213759.00	2.00	1.00	2	0.50	1.00	0.50
chr17	41216205.00	2.00	1.00	2	0.50	1.00	0.50
chr17	41220771.00	1.00	1.00	1	0.25	0.50	0.25
chr17	41222098.00	2.00	1.00	2	0.50	1.00	0.50

Retrieving the gVCF data for the results identified only by vcftools:

having <- list("_HAVING_" = paste("start <= ", onlyVcftools$start,
                "AND",
                "end >= ", onlyVcftools$start+1,
                collapse=" OR "))
query <- paste(googleRepository, "examine-data.sql", sep="")
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=c(tableReplacement,
                                                 having))

# Examine the data for particular calls.
SELECT
  call.call_set_name,
  GROUP_CONCAT(STRING(call.genotype)) WITHIN call AS genotype,
  reference_name,
  start,
  end,
  reference_bases,
  GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
  quality,
  GROUP_CONCAT(filter) WITHIN RECORD AS filter,
FROM
  [gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2]
WHERE
  reference_name = 'chr17'
HAVING
  start <=  41258790 AND end >=  41258791 OR start <=  41252694 AND end >=  41252695 OR start <=  41249362 AND end >=  41249363 OR start <=  41248903 AND end >=  41248904 OR start <=  41247623 AND end >=  41247624 OR start <=  41247054 AND end >=  41247055 OR start <=  41245470 AND end >=  41245471 OR start <=  41241567 AND end >=  41241568 OR start <=  41239979 AND end >=  41239980 OR start <=  41251930 AND end >=  41251931 OR start <=  41222098 AND end >=  41222099 OR start <=  41220771 AND end >=  41220772 OR start <=  41216205 AND end >=  41216206 OR start <=  41213759 AND end >=  41213760 OR start <=  41201363 AND end >=  41201364 OR start <=  41196944 AND end >=  41196945
ORDER BY
  start,
  end,
  call.call_set_name

The first few:

call_call_set_name	genotype	reference_name	start	end	reference_bases	alternate_bases	quality	filter
LP6005038-DNA_A01	0,0	chr17	41196408	41197273	G		61.23
LP6005038-DNA_A02	0,0	chr17	41196408	41197273	G		61.23
LP6005038-DNA_A03	0,0	chr17	41196408	41197273	G		61.23
LP6005038-DNA_B01	0,0	chr17	41196408	41197273	G		61.23
LP6005038-DNA_B02	0,1	chr17	41196944	41196945	T	C	315.77	PASS
LP6005038-DNA_B01	0,0	chr17	41200109	41201701	A		40.23

Top

Ti/Tv by Alternate Allele Counts

Ti/Tv by Alternate Allele Counts in Google Genomics

query <- paste(googleRepository, "ti-tv-by-alternate-allele-count.sql", sep="")
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=c(tableReplacement, constraints))

# Compute the Ti/Tv ratio for variants binned by alternate allele count.
SELECT
  transitions,
  transversions,
  transitions/transversions AS titv,
  alternate_allele_count
FROM (
  SELECT
    SUM(mutation IN ('A->G', 'G->A', 'C->T', 'T->C')) AS transitions,
    SUM(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
                     'A->T', 'T->A', 'C->G', 'G->C')) AS transversions,
    alternate_allele_count
  FROM (
    SELECT
      CONCAT(reference_bases, CONCAT(STRING('->'), alternate_bases)) AS mutation,
      COUNT(alternate_bases) WITHIN RECORD AS num_alts,
      SUM(call.genotype = 1) WITHIN RECORD AS alternate_allele_count,
    FROM
      [gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2]
    # Optionally add clause here to limit the query to a particular
    # region of the genome.
    WHERE 
                    reference_name = 'chr17'
                    AND start BETWEEN 41196311
                    AND 41277499
    HAVING
      # Skip 1/2 genotypes _and non-SNP variants
      num_alts = 1
      AND reference_bases IN ('A','C','G','T')
      AND alternate_bases IN ('A','C','G','T'))
  GROUP BY
    alternate_allele_count)
ORDER BY
  alternate_allele_count DESC

transitions	transversions	titv	alternate_allele_count
0	1	0.00	10
81	29	2.79	4
11	3	3.67	2
8	5	1.60	1
3	0		0

Compare to output from vcftools.

Command line

vcftools --gzvcf brca1.merged.sample_set.vcf.gz --TsTv-by-count --out TsTv-by-coun

expectedResult <- read.table("./data/brca1/TsTv-by-count.TsTv.count", header=TRUE)

ALT_ALLELE_COUNT	N_Ts	N_Tv	Ts.Tv
0	0	0
1	8	5	1.60
2	11	3	3.67
3	0	0
4	81	29	2.79
5	0	0
6	0	0
7	0	0
8	0	0
9	0	0

The results for BigQuery and vcftools are identical except for the case where there are 10 alternate alleles. With only 5 genomes, 10 sites with an alternate is the maximum. vcftools may ignore sites where all calls are the alternate allele.

Top

Ti/Tv by Depth

query <- "./sql/ti-tv-by-depth.sql"
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=c(tableReplacement))

SELECT
  call.call_set_name,
  (transitions/transversions) AS titv_ratio,
  average_depth,
FROM (
  SELECT
    call.call_set_name,
    SUM(mutation IN ('A->G', 'G->A', 'C->T', 'T->C')) AS transitions,
    SUM(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
                     'A->T', 'T->A', 'C->G', 'G->C')) AS transversions,
    ROUND(AVG(call.DP)) AS average_depth,
  FROM (

    SELECT
      call.call_set_name,
      CONCAT(reference_bases, CONCAT(STRING('->'), alternate_bases)) AS mutation,
      COUNT(alternate_bases) WITHIN RECORD AS num_alts,
      call.DP
    FROM (
      SELECT
        call.call_set_name,
        reference_bases,
        GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
        call.genotype,
        call.DP,
      FROM
        [gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_2]
      # Optionally add clause here to limit the query to a particular
      # region of the genome.
      #_WHERE_  
      )
    WHERE
      call.DP is not null
    HAVING
      # Skip 1/2 genotypes _and non-SNP variants
      num_alts = 1
      AND reference_bases IN ('A','C','G','T')
      AND alternate_bases IN ('A','C','G','T'))
    GROUP BY 
      call.call_set_name,
      call.DP,)
WHERE
  transversions > 0
GROUP BY
  call.call_set_name,
  titv_ratio,
  average_depth,

call_call_set_name	titv_ratio	average_depth
LP6005038-DNA_B01	2.09	39.00
LP6005038-DNA_B02	2.20	28.00
LP6005038-DNA_A02	2.20	30.00
LP6005038-DNA_A01	2.12	30.00
LP6005038-DNA_A03	2.15	35.00
LP6005038-DNA_B01	2.16	33.00

ggplot(result, aes(x=average_depth, y=titv_ratio, color=call_call_set_name)) + 
  geom_point() +
  #geom_smooth(se=FALSE,    # Don't add shaded confidence region
  #              fullrange=T) +
  ggtitle("Ti/Tv Ratio By Depth") +
  xlab("Coverage Depth") + 
  ylab("Ti/Tv")

TODO: Find academic tool to compare against.

Top

Sample-Level Missingness

Sample-Level Missingness in Google Genomics

See this document for details on how tables were created for this section.

query <- "./sql/sample-missingness-missing-calls.sql"
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=c(tableReplacement, constraints))

SELECT 
  bc.sample_name AS sample_name,
  nc.no_call_count as no_call_count,
  nc.no_call_count/bc.base_count AS missingness,
FROM (
  SELECT 
    sample_name,
    COUNT(no_call) as no_call_count,
  FROM(
    SELECT
      ref.sample_name AS sample_name,
      ref.chromosome AS chromosome,
      ref.start AS start,
      ref.end AS end,
      ref.reference_bases AS reference_bases,
      True AS no_call
    FROM
      (SELECT 
        *
       FROM
        [gbsc-gcp-project-mvp:qc_tables.5_genomes_ref_calls_brca1] ) AS ref
    LEFT OUTER JOIN (
      SELECT
        *
      FROM
        [gbsc-gcp-project-mvp:qc_tables.5_genomes_variants_brca1] ) AS var
    ON
      ref.sample_name = var.sample_name
      AND ref.chromosome = var.chromosome
      AND ref.start = var.start
    WHERE
      ref.is_ref is False
      AND var.is_variant_call is NULL )
  GROUP BY
    sample_name ) AS nc
JOIN (
  SELECT
    sample_name,
    COUNT(start) AS base_count
  FROM 
    [gbsc-gcp-project-mvp:qc_tables.5_genomes_ref_calls_brca1]
  GROUP BY
  sample_name ) as bc
ON
  bc.sample_name = nc.sample_name;

sample_name	no_call_count	missingness
LP6005038-DNA_A01	286	0.00
LP6005038-DNA_A02	115	0.00
LP6005038-DNA_A03	332	0.00
LP6005038-DNA_B01	228	0.00
LP6005038-DNA_B02	639	0.01

Compare to output from vcftools.

Command line

vcftools --gzvcf brca1.merged.sample_set.vcf.gz --missing-indv --out missingness

expectedResult <- read.table("./data/brca1/missingness.imiss", header=TRUE)

INDV	N_DATA	N_MISS
LP6005038-DNA_A01	81239	274
LP6005038-DNA_A02	81239	107
LP6005038-DNA_A03	81239	311
LP6005038-DNA_B01	81239	205
LP6005038-DNA_B02	81239	73

These results from BigQuery and vcftools are expected to be different. Sample-level missingness in BigQuery is calculated by joining all positions in the genome against the reference genome to identify all sites that were not called by sequencing. Vcftools only looks at sites that were called by at least one sample in the group.

Top

Variant-Level Missingness

Variant-Level Missingness in Google Genomics

query <- "./sql/variant-level-missingness-missing-calls.sql"
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=c(tableReplacement, constraints))

#SELECT  FROM [va_aaa_pilot_data.5_genome_test_vcfs_no_calls] LIMIT 1000
# Compute the ratio no-calls for each variant.
SELECT
  reference_name,
  start,
  END,
  reference_bases,
  alternate_bases,
  no_calls,
  all_calls,
  (sample_count*2 - all_calls) AS missing_calls,
  ((no_calls + (sample_count*2 - all_calls))/(all_calls + (sample_count*2 - all_calls))) AS missingness_rate,
  
FROM (
  SELECT
    reference_name,
    start,
    END,
    reference_bases,
    GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
    SUM(call.genotype == -1) WITHIN RECORD AS no_calls,
    COUNT(call.genotype) WITHIN RECORD AS all_calls,

  FROM
      [gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_2]
    # Optionally add clause here to limit the query to a particular
    # region of the genome.
    WHERE 
                    reference_name = 'chr17'
                    AND start BETWEEN 41196311
                    AND 41277499
  ) as calls
CROSS JOIN (
  SELECT 
    COUNT(sample_name) AS sample_count
  FROM(
    SELECT
      call.call_set_name AS sample_name, 
    FROM [va_aaa_pilot_data.5_genome_test_vcfs_no_calls] 
    GROUP BY sample_name)) as count
# Optionally add a clause here to sort and limit the results.
#_ORDER_BY_

Error: Not found: Table gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_no_calls

 notFound. Not found: Table gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_no_calls

sample_name	no_call_count	missingness
LP6005038-DNA_A01	286	0.00
LP6005038-DNA_A02	115	0.00
LP6005038-DNA_A03	332	0.00
LP6005038-DNA_B01	228	0.00
LP6005038-DNA_B02	639	0.01

Compare to output from vcftools.

Command line

vcftools --gzvcf brca1.merged.sample_set.vcf.gz --missing_site --out missing_sites

expectedResult <- read.table("./data/brca1/missing_sites.lmiss", header=TRUE)

expectedResult <- mutate(expectedResult, POS = POS - 1)
n = c("reference_name", "start", "expected_count", "filtered", "expected_missing_count", "expected_rate")
names(expectedResult) = n

print(xtable(head(expectedResult)), type="html", include.rownames=F)

reference_name	start	expected_count
chr17	41196311.00	10
chr17	41196312.00	10
chr17	41196313.00	10
chr17	41196314.00	10
chr17	41196315.00	10
chr17	41196316.00	10

Let's join them and look at the differences

joinedResult = data.table(inner_join(result, expectedResult, by=c("reference_name", "start")))

## Error in eval(expr, envir, enclos): could not find function "data.table"

matches = joinedResult[missingness_rate == expected_rate]

## Error in `[.data.frame`(joinedResult, missingness_rate == expected_rate): object 'missingness_rate' not found

differences = joinedResult[missingness_rate != expected_rate]

## Error in `[.data.frame`(joinedResult, missingness_rate != expected_rate): object 'missingness_rate' not found

How many do they have in common?

nrow(matches)

## Error in nrow(matches): object 'matches' not found

How many are different?

nrow(differences)

## Error in nrow(differences): object 'differences' not found

Here we can see which rows where different

## Error in xtable(differences): object 'differences' not found

The differences appear to be mostly form positions with insertions or deletions.

TODO: Look at the differences more closely.

Top

Sex Inference

Sex Inference in Google Genomics

query <- paste(googleRepository, "gender-check.sql", sep="")
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=c(tableReplacement, constraints))

# Compute the the homozygous and heterozygous variant counts for each individual
# within chromosome X to help determine whether the gender phenotype value is
# correct for each individual.
SELECT
  call.call_set_name,
  ROUND(SUM(het_RA)/(SUM(hom_AA) + SUM(het_RA)), 3) AS perct_het_alt_in_snvs,
  ROUND(SUM(hom_AA)/(SUM(hom_AA) + SUM(het_RA)), 3) AS perct_hom_alt_in_snvs,
  SUM(hom_AA) AS hom_AA_count,
  SUM(het_RA) AS het_RA_count,
  SUM(hom_RR) AS hom_RR_count,
FROM (
  SELECT
    reference_bases,
    GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
    COUNT(alternate_bases) WITHIN RECORD AS num_alts,
    call.call_set_name,
    SOME(call.genotype = 0) AND NOT SOME(call.genotype > 0) WITHIN call AS hom_RR,
    SOME(call.genotype > 0) AND NOT SOME(call.genotype = 0) WITHIN call AS hom_AA,
    SOME(call.genotype > 0) AND SOME(call.genotype = 0) WITHIN call AS het_RA
  FROM
    [gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_2]
  WHERE
    reference_name = 'chrX'
    AND start NOT BETWEEN 59999 AND 2699519
    AND start NOT BETWEEN 154931042 AND 155260559
  HAVING
    # Skip 1/2 genotypes _and non-SNP variants
    num_alts = 1
    AND reference_bases IN ('A','C','G','T')
    AND alternate_bases IN ('A','C','G','T')
  )
GROUP BY
  call.call_set_name
ORDER BY
  call.call_set_name

Number of rows returned by this query: 5.

Displaying the first few results:

call_call_set_name	perct_het_alt_in_snvs	perct_hom_alt_in_snvs	hom_AA_count	het_RA_count	hom_RR_count
LP6005038-DNA_A01	0.03	0.97	71719	2090	91043
LP6005038-DNA_A02	0.03	0.97	72908	1950	89501
LP6005038-DNA_A03	0.59	0.41	44798	64482	56682
LP6005038-DNA_B01	0.03	0.97	74881	2165	87670
LP6005038-DNA_B02	0.03	0.97	73620	2592	88003

Let's join this with the sample information:

joinedResult <- inner_join(result, sampleInfo)

And visualize the results:

ggplot(joinedResult) +
  geom_point(aes(x=call_call_set_name, y=perct_het_alt_in_snvs, color=gender)) +
  theme(axis.text.x=if(nrow(result) <= 20)
    {element_text(angle = 90, hjust = 1)} else {element_blank()}) +
  xlab("Sample") +
  ylab("Heterozygosity Rate ") +
  ggtitle("Heterozygosity Rate on the X Chromosome")

TODO: Compare to Plink

Command line

plink --vcf ~/scratch/chrX.vcf.gz --make-bed --split-x 'hg19' --out ~/scratch/chrX-split
plink --bfile ~/scratch/chrX-split --check-sex

Top

Heterozygous Haplotype

Heterozygous Haplotype in Google Genomics

query <- paste(googleRepository, "sex-chromosome-heterozygous-haplotypes.sql", sep="")
male_sample_ids = paste("'", paste(sampleInfo[sampleInfo$gender == 'Male',]$call_call_set_name, collapse=",'", "'", sep=''))
male_sample_ids = gsub(" ", "", male_sample_ids, fixed = TRUE)
male_sub = list("_MALE_SAMPLE_IDS_" = male_sample_ids)
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=c(tableReplacement, constraints, male_sub))

# Retrieve heterozygous haplotype calls on chromosomes X and Y.
SELECT
  call.call_set_name,
  GROUP_CONCAT(STRING(call.genotype)) WITHIN call AS genotype,
  reference_name,
  start,
  end,
  reference_bases,
  GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
FROM
  [gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2]
WHERE
  reference_name IN ('chrX', 'chrY')
OMIT
  call if (2 > COUNT(call.genotype))
  OR EVERY(call.genotype <= 0)
  OR EVERY(call.genotype = 1)
HAVING call.call_set_name IN ('')
# Optionally add a clause here to sort and limit the results.
#_ORDER_BY_

Error in UseMethod("xtable"): no applicable method for 'xtable' applied to an object of class "NULL"

TODO Compare to command line

Top

Ethnicity Inference

Ethnicity Inference in Google Genomics

Status: Waiting for PCA release from Elmer. See here for status.

Top

Genome Similarity

Genome Similarity in Google Genomics

Status: Installed apache maven and built Google Genomics dataflow package. Ran into errors when trying to run dataflow job.

java -cp target/google-genomics-dataflow-v1beta2-0.2-SNAPSHOT.jar   com/google/cloud/genomics/dataflow/pipelines/IdentityByState   --project=my-project-id   --output=gs://my-bucket/localtest.txt   --genomicsSecretsFile=client_secrets.json
Error: Could not find or load main class com.google.cloud.genomics.dataflow.pipelines.IdentityByState

Command line

plink --genome --vcf chr22-merged.vcf.gz

Top

Genotype Concordance

Calculate the concordance between sequencing data and genotyping data.

query <- "./sql/genotyping-concordance.sql"
result <- DisplayAndDispatchQuery(query,
                                  project=project,
                                  replacements=c(tableReplacement))

SELECT
  sample_id,
  calls_in_common,
  identical_calls,
  (identical_calls/calls_in_common) AS concordance
FROM (
  SELECT 
    sample_id,
    COUNT(seq_genotype) AS calls_in_common,
    SUM(IF(seq_genotype = gen_genotype, 1, 0)) AS identical_calls,
  FROM (
    SELECT
      seq.sample_id AS sample_id,
      seq.reference_name AS reference_name,
      seq.start AS start,
      seq.end AS end,
      seq.genotype AS seq_genotype,
      gen.genotype AS gen_genotype,
    FROM (
      SELECT
        sample_id,
        reference_name,
        start,
        end,
        genotype,
        bin,
      FROM js(
        (SELECT
          call.call_set_name,
          reference_name,
          start,
          end,
          call.genotype,
          reference_bases,
          GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alts,
          COUNT(alternate_bases) WITHIN RECORD AS num_alts,
        FROM
          [gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2]
         #_WHERE_
        OMIT 
          call IF EVERY (call.genotype < 0)
        HAVING 
          num_alts <= 1
          AND reference_bases IN ('A','C','G','T')
          AND (alts IS null
            OR LENGTH(alts) <= 1)
        ),
        // Start javascript function
        // Input Columns
        call.call_set_name, reference_name, start, end, call.genotype,
        // Output Schema
        "[{name: 'sample_id', type: 'string'},
        {name: 'reference_name', type: 'string'},
        {name: 'start', type: 'integer'},
        {name: 'end', type: 'integer'},
        {name: 'genotype', type: 'string'},
        {name: 'bin', type: 'integer'}]",
        // Function
        "function(r, emit) {
          for (c of r.call) {
            var binSize = 5000;
            var startBin = Math.floor(r.start/binSize);
            var endBin = Math.floor(r.end/binSize);
            var genotype = JSON.stringify(c.genotype.sort());
            for (var bin = startBin; bin <= endBin; bin++){
              emit({
                sample_id: c.call_set_name,
                reference_name: r.reference_name,
                start: r.start,
                end: r.end,
                genotype: genotype,
                bin: bin,
              })
            }
          }
        }")) AS seq
JOIN EACH (
  SELECT
    sample_id,
    reference_name,
    start,
    end,
    genotype,
    bin,
  FROM js(
    (SELECT
      call.call_set_name,
      reference_name,
      start,
      end,
      call.genotype,
    FROM
      [gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_genotyping_vcfs]
      OMIT call IF EVERY (call.genotype < 0)       
    ),
    // Start javascript function
    // Input Columns
    call.call_set_name, reference_name, start, end, call.genotype,
    // Output Schema
    "[{name: 'sample_id', type: 'string'},
    {name: 'reference_name', type: 'string'},
    {name: 'start', type: 'integer'},
    {name: 'end', type: 'integer'},
    {name: 'genotype', type: 'string'},
    {name: 'bin', type: 'integer'}]",
    // Function
    "function(r, emit) {
      for (c of r.call) {
        var binSize = 5000;
        var bin = Math.floor(r.start/binSize);
        var genotype = JSON.stringify(c.genotype.sort());
        emit({
          sample_id: c.call_set_name,
          reference_name: r.reference_name,
          start: r.start,
          end: r.end,
          genotype: genotype,
          bin: bin,
        })
      }
    }")) AS gen
ON
  seq.sample_id = gen.sample_id
  AND seq.reference_name = gen.reference_name
  AND seq.bin = gen.bin
WHERE
  seq.start <= gen.start
  AND seq.end >= gen.end )
GROUP BY 
  sample_id)
Running query:   RUNNING  2.6s
Running query:   RUNNING  3.2s
Running query:   RUNNING  3.8s
Running query:   RUNNING  4.4s
Running query:   RUNNING  5.1s
Running query:   RUNNING  5.7s
Running query:   RUNNING  6.3s
Running query:   RUNNING  6.9s
Running query:   RUNNING  7.5s
Running query:   RUNNING  8.1s
Running query:   RUNNING  8.8s
Running query:   RUNNING  9.4s
Running query:   RUNNING 10.0s
Running query:   RUNNING 10.6s
Running query:   RUNNING 11.2s
Running query:   RUNNING 11.9s
Running query:   RUNNING 12.5s
Running query:   RUNNING 13.1s
Running query:   RUNNING 13.7s
Running query:   RUNNING 14.3s
Running query:   RUNNING 14.9s
Running query:   RUNNING 15.6s
Running query:   RUNNING 16.2s
Running query:   RUNNING 16.8s
Running query:   RUNNING 17.4s
Running query:   RUNNING 18.0s
Running query:   RUNNING 18.7s
Running query:   RUNNING 19.3s
Running query:   RUNNING 19.9s
Running query:   RUNNING 20.5s
Running query:   RUNNING 21.1s
Running query:   RUNNING 21.8s
Running query:   RUNNING 22.4s
Running query:   RUNNING 23.0s
Running query:   RUNNING 23.6s
Running query:   RUNNING 24.2s
Running query:   RUNNING 24.8s
Running query:   RUNNING 25.4s
Running query:   RUNNING 26.1s
Running query:   RUNNING 26.7s
Running query:   RUNNING 27.3s
Running query:   RUNNING 27.9s
Running query:   RUNNING 28.5s
Running query:   RUNNING 29.2s
Running query:   RUNNING 29.8s
Running query:   RUNNING 30.4s
Running query:   RUNNING 31.0s
Running query:   RUNNING 31.6s
Running query:   RUNNING 32.3s
Running query:   RUNNING 32.9s
Running query:   RUNNING 33.5s
Running query:   RUNNING 34.1s
Running query:   RUNNING 34.8s
Running query:   RUNNING 35.4s
Running query:   RUNNING 36.0s
Running query:   RUNNING 36.6s
Running query:   RUNNING 37.2s
Running query:   RUNNING 37.8s
Running query:   RUNNING 38.5s
Running query:   RUNNING 39.1s
Running query:   RUNNING 39.7s
Running query:   RUNNING 40.3s
Running query:   RUNNING 40.9s
Running query:   RUNNING 41.6s
Running query:   RUNNING 42.2s
Running query:   RUNNING 42.8s
Running query:   RUNNING 43.4s

sample_id	calls_in_common	identical_calls	concordance
LP6005038-DNA_A03	2199423	2177458	0.99
LP6005038-DNA_B01	2200890	2178871	0.99
LP6005038-DNA_A01	2200420	2178471	0.99
LP6005038-DNA_A02	2200848	2178903	0.99
LP6005038-DNA_B02	2201742	2179131	0.99

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comparison.md

Comparison.md

Comparison of Google Genomics tools vs Academic Tools

Setup

Singleton Rate

Homozygosity Rate and Inbreeding Coefficient

Hardy-Weinberg Equilibrium

Ti/Tv by Alternate Allele Counts

Ti/Tv by Depth

Sample-Level Missingness

Variant-Level Missingness

Sex Inference

Heterozygous Haplotype

Ethnicity Inference

Genome Similarity

Genotype Concordance

Files

Comparison.md

Latest commit

History

Comparison.md

File metadata and controls

Comparison of Google Genomics tools vs Academic Tools

Setup

Singleton Rate

Homozygosity Rate and Inbreeding Coefficient

Hardy-Weinberg Equilibrium

Ti/Tv by Alternate Allele Counts

Ti/Tv by Depth

Sample-Level Missingness

Variant-Level Missingness

Sex Inference

Heterozygous Haplotype

Ethnicity Inference

Genome Similarity

Genotype Concordance