The purpose of this codelab is to compare BigQuery queries to academic command line tools for genomic analysis.
- Setup
- Singleton Rate
- Homozygosity Rate and Inbreeding Coefficient
- Hardy-Weinberg Equilibrium
- Ti/Tv by Alternate Allele Counts
- Ti/Tv by Depth
- Sample-Level Missingness
- Variant-Level Missingness
- Sex Inference
- Heterozygous Haplotype
- Ethnicity Inference
- Genome Similarity
- Genotype Concordance
tableReplacement <- list("_THE_TABLE_"="gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2",
"_THE_EXPANDED_TABLE_"="gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_2",
"_REF_TABLE_"="gbsc-gcp-project-mvp:qc_tables.5_genomes_ref_calls_brca1",
"_VARIANT_TABLE_"="gbsc-gcp-project-mvp:qc_tables.5_genomes_variants_brca1",
"_GENOTYPING_TABLE_"="gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_genotyping_vcfs")
sampleData <- read.csv("./data/patient_info.csv")
sampleInfo <- select(sampleData, call_call_set_name=Catalog.ID, gender=Gender)
constraints <- list("#_WHERE_"="WHERE
reference_name = 'chr17'
AND start BETWEEN 41196311
AND 41277499")
googleRepository = "https://raw.githubusercontent.com/googlegenomics/codelabs/master/R/PlatinumGenomes-QC/sql/"
Singletone Rate in Google Genomics
For each genome, count the number of variants shared by no other member of the cohort. Too many private calls for a particular individual may indicate a problem.
For our 5 genome sample set the singleton rate will be very high due to the small number of genomes.
query <- paste(googleRepository, "private-variants-brca1.sql", sep = "")
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=tableReplacement)
# Private variants within BRCA1.
SELECT
reference_name AS CHROM,
start AS POS,
GROUP_CONCAT(CASE WHEN cnt = 1 THEN 'S'
WHEN cnt = 2 THEN 'D'
ELSE STRING(cnt) END) AS SINGLETON_DOUBLETON,
reference_bases AS REF,
alternate_bases AS ALT,
GROUP_CONCAT(call.call_set_name) AS INDV,
GROUP_CONCAT(genotype) AS genotype,
COUNT(call.call_set_name) AS num_samples_with_variant
FROM (
SELECT
reference_name,
start,
reference_bases,
alternate_bases,
alt_num,
call.call_set_name,
GROUP_CONCAT(STRING(call.genotype)) WITHIN call AS genotype,
SUM(call.genotype == alt_num) WITHIN call AS cnt,
FROM (
FLATTEN((
SELECT
reference_name,
start,
reference_bases,
alternate_bases,
POSITION(alternate_bases) AS alt_num,
call.call_set_name,
call.genotype,
FROM
[gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2]
WHERE
reference_name = 'chr17'
AND start BETWEEN 41196311
AND 41277499
OMIT call IF EVERY(call.genotype = -1)
),
alternate_bases)
)
OMIT RECORD IF alternate_bases IS NULL
HAVING
cnt > 0
)
GROUP BY
chrom,
pos,
ref,
alt
HAVING
num_samples_with_variant = 1
ORDER BY
POS,
INDV
Number of rows returned by this query: 33.
Displaying the first few results:
CHROM | POS | SINGLETON_DOUBLETON | REF | ALT | INDV | genotype | num_samples_with_variant |
---|---|---|---|---|---|---|---|
chr17 | 41196944 | S | T | C | LP6005038-DNA_B02 | "0,1" | 1 |
chr17 | 41197938 | S | A | AT | LP6005038-DNA_B02 | "0,1" | 1 |
chr17 | 41200536 | S | T | C | LP6005038-DNA_A02 | "0,1" | 1 |
chr17 | 41202633 | S | AG | A | LP6005038-DNA_A01 | "0,1" | 1 |
chr17 | 41204835 | S | ACACACT | A | LP6005038-DNA_A01 | "0,1" | 1 |
chr17 | 41208691 | S | G | GA | LP6005038-DNA_A03 | "0,1" | 1 |
Compare to output from vcftools, which has 85 some of which are for 0/0 genotypes from reference matching blocks (see the vcftools command line used to create this file).
Command line execution
vcftools --gzvcf brca1.merged.sample_set.vcf.gz --singletons --out singletons
expectedResult <- read.table("./data/brca1/singletons.singletons", header=TRUE)
# Convert to zero-based coordinates
expectedResult <- mutate(expectedResult, POS = POS - 1)
# Clean colnames to match
colnames(expectedResult) <- gsub('\\.+', '_', colnames(expectedResult))
How many singletons do the two results have in common?
nrow(inner_join(result, expectedResult))
## Joining by: c("CHROM", "POS", "SINGLETON_DOUBLETON", "INDV")
## [1] 54
How many singletons were only identified by BigQuery?
onlyBQ <- anti_join(result, expectedResult)
## Joining by: c("CHROM", "POS", "SINGLETON_DOUBLETON", "INDV")
nrow(onlyBQ)
## [1] 0
## Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, : length of 'dimnames' [2] not equal to array extent
Which singletons were only identified by vcftools?
onlyVcftools <- anti_join(expectedResult, result)
## Joining by: c("CHROM", "POS", "SINGLETON_DOUBLETON", "INDV")
print(xtable(onlyVcftools), type="html", include.rownames=F)
CHROM | POS | SINGLETON_DOUBLETON | ALLELE | INDV |
---|---|---|---|---|
chr17 | 41271293.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41267518.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256102.00 | D | G | LP6005038-DNA_B02 |
chr17 | 41256101.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256100.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256086.00 | D | G | LP6005038-DNA_B02 |
chr17 | 41256085.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256084.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256083.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256082.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256081.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256080.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256079.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256078.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256077.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256076.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41256075.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41252696.00 | D | T | LP6005038-DNA_B01 |
chr17 | 41252647.00 | D | G | LP6005038-DNA_B02 |
chr17 | 41252634.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41249363.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41242077.00 | D | G | LP6005038-DNA_B02 |
chr17 | 41242076.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41230105.00 | D | T | LP6005038-DNA_B02 |
chr17 | 41229760.00 | D | T | LP6005038-DNA_B02 |
chr17 | 41226740.00 | D | T | LP6005038-DNA_B02 |
chr17 | 41226739.00 | D | T | LP6005038-DNA_B02 |
chr17 | 41226738.00 | D | G | LP6005038-DNA_B02 |
chr17 | 41226737.00 | D | G | LP6005038-DNA_B02 |
chr17 | 41226736.00 | D | G | LP6005038-DNA_B02 |
chr17 | 41214210.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41242075.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41214209.00 | D | A | LP6005038-DNA_B02 |
chr17 | 41202634.00 | D | G | LP6005038-DNA_B02 |
Retrieving the gVCF data for the singletons identified only by vcftools:
having <- paste("start = ", onlyVcftools$POS,
sep="", collapse=" OR ")
query <- paste(googleRepository, "examine-data.sql", sep="")
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=c(tableReplacement,
"_HAVING_"=having))
# Examine the data for particular calls.
SELECT
call.call_set_name,
GROUP_CONCAT(STRING(call.genotype)) WITHIN call AS genotype,
reference_name,
start,
end,
reference_bases,
GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
quality,
GROUP_CONCAT(filter) WITHIN RECORD AS filter,
FROM
[gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2]
WHERE
reference_name = 'chr17'
HAVING
start = 41271293 OR start = 41267518 OR start = 41256102 OR start = 41256101 OR start = 41256100 OR start = 41256086 OR start = 41256085 OR start = 41256084 OR start = 41256083 OR start = 41256082 OR start = 41256081 OR start = 41256080 OR start = 41256079 OR start = 41256078 OR start = 41256077 OR start = 41256076 OR start = 41256075 OR start = 41252696 OR start = 41252647 OR start = 41252634 OR start = 41249363 OR start = 41242077 OR start = 41242076 OR start = 41230105 OR start = 41229760 OR start = 41226740 OR start = 41226739 OR start = 41226738 OR start = 41226737 OR start = 41226736 OR start = 41214210 OR start = 41242075 OR start = 41214209 OR start = 41202634
ORDER BY
start,
end,
call.call_set_name
call_call_set_name | genotype | reference_name | start | end | reference_bases | alternate_bases | quality | filter |
---|---|---|---|---|---|---|---|---|
LP6005038-DNA_B01 | 0,0 | chr17 | 41252696 | 41252755 | T | 31.21 |
It appears that they correspond either to:
- A reference-matching block, so not actually a singleton and just perhaps violating an assumption in the vcftools code. Most of the start sites queried return nothing because they are withing reference matching blocks.
- Or a non-singleon variant, perhaps due to a problem in converting the gVCF data to all-positions VCF via gvcftools?
Homozygosity Rate and Inbreeding Coefficient in Google Genomics
For each genome, compare the expected and observed rates of homozygosity.
query <- paste(googleRepository, "homozygous-variants.sql", sep="")
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=c(tableReplacement, constraints))
# Compute the expected and observed homozygosity rate for each individual.
SELECT
call.call_set_name,
O_HOM,
ROUND(E_HOM, 2) as E_HOM,
N_SITES,
ROUND((O_HOM - E_HOM) / (N_SITES - E_HOM), 5) AS F
FROM (
SELECT
call.call_set_name,
SUM(first_allele = second_allele) AS O_HOM,
SUM(1.0 - (2.0 * freq * (1.0 - freq) * (called_allele_count / (called_allele_count - 1.0)))) AS E_HOM,
COUNT(call.call_set_name) AS N_SITES,
FROM (
SELECT
reference_name,
start,
reference_bases,
GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
call.call_set_name,
NTH(1, call.genotype) WITHIN call AS first_allele,
NTH(2, call.genotype) WITHIN call AS second_allele,
COUNT(alternate_bases) WITHIN RECORD AS num_alts,
SUM(call.genotype >= 0) WITHIN RECORD AS called_allele_count,
IF((SUM(1 = call.genotype) > 0),
SUM(call.genotype = 1)/SUM(call.genotype >= 0),
-1) WITHIN RECORD AS freq
FROM
[gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_2]
# Optionally add a clause here to limit the query to a particular
# region of the genome.
WHERE
reference_name = 'chr17'
AND start BETWEEN 41196311
AND 41277499
# Skip no calls and haploid sites
OMIT call IF SOME(call.genotype < 0) OR (2 > COUNT(call.genotype))
HAVING
# Skip 1/2 genotypes _and non-SNP variants
num_alts = 1
AND reference_bases IN ('A','C','G','T')
AND alternate_bases IN ('A','C','G','T')
)
GROUP BY
call.call_set_name
)
ORDER BY
call.call_set_name
Number of rows returned by this query: 5.
Displaying the first few results:
call_call_set_name | O_HOM | E_HOM | N_SITES | F |
---|---|---|---|---|
LP6005038-DNA_A01 | 16 | 79.87 | 136 | -1.14 |
LP6005038-DNA_A02 | 14 | 68.87 | 135 | -0.83 |
LP6005038-DNA_A03 | 15 | 68.23 | 134 | -0.81 |
LP6005038-DNA_B01 | 18 | 80.00 | 136 | -1.11 |
LP6005038-DNA_B02 | 137 | 81.97 | 140 | 0.95 |
Let's compare to what was found using vcftools.
Command line execution.
vcftools --gzvcf brca1.merged.sample_set.vcf.gz --het --out het_calls
expectedResult <- read.table("./data/brca1/het_calls.het", header=TRUE)
E_HOM.x | E_HOM.y | F.x | F.y | INDV | N_SITES.x | N_SITES.y | O_HOM.x | O_HOM.y |
---|---|---|---|---|---|---|---|---|
65.50 | 79.87 | -0.85 | -1.14 | LP6005038-DNA_A01 | 143 | 136 | 0 | 16 |
66.40 | 68.87 | -0.70 | -0.83 | LP6005038-DNA_A02 | 150 | 135 | 8 | 14 |
67.60 | 68.23 | -0.72 | -0.81 | LP6005038-DNA_A03 | 149 | 134 | 9 | 15 |
64.00 | 80.00 | -0.87 | -1.11 | LP6005038-DNA_B01 | 135 | 136 | 2 | 18 |
61.80 | 81.97 | 0.92 | 0.95 | LP6005038-DNA_B02 | 133 | 140 | 127 | 137 |
The logic in the query looks similar to vcftools output_het method but there is clearly a difference.
TODO: investigate the difference further.
Hardy-Weinberg Equiibrium in Google Genomics
query <- paste(googleRepository, "hardy-weinberg.sql", sep="")
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=c(tableReplacement, constraints))
# The following query computes the Hardy-Weinberg equilibrium for variants.
SELECT
reference_name,
start,
reference_bases,
alternate_bases,
OBS_HOM1,
OBS_HET,
OBS_HOM2,
E_HOM1,
E_HET,
E_HOM2,
# Chi Squared Calculation
# SUM(((Observed - Expected)^2) / Expected )
ROUND((POW(OBS_HOM1 - E_HOM1, 2) / E_HOM1)
+ (POW(OBS_HET - E_HET, 2) / E_HET)
+ (POW(OBS_HOM2 - E_HOM2, 2) / E_HOM2), 6)
AS ChiSq,
# Determine if Chi Sq value is significant
IF((POW(OBS_HOM1 - E_HOM1, 2) / E_HOM1)
+ (POW(OBS_HET - E_HET, 2) / E_HET)
+ (POW(OBS_HOM2 - E_HOM2, 2) / E_HOM2)
> 5.991, "TRUE", "FALSE") AS PVALUE_SIG
FROM (
SELECT
reference_name,
start,
reference_bases,
alternate_bases,
OBS_HOM1,
OBS_HET,
OBS_HOM2,
# Expected AA
# p^2
# ((COUNT(AA) + (COUNT(Aa)/2) /
# SAMPLE_COUNT) ^ 2) * SAMPLE_COUNT
ROUND(POW((OBS_HOM1 + (OBS_HET/2)) /
SAMPLE_COUNT, 2) * SAMPLE_COUNT, 2)
AS E_HOM1,
# Expected Aa
# 2pq
# 2 * (COUNT(AA) + (COUNT(Aa)/2) / SAMPLE_COUNT) *
# (COUNT(aa) + (COUNT(Aa)/2) / SAMPLE_COUNT)
# * SAMPLE_COUNT
ROUND(2 * ((OBS_HOM1 + (OBS_HET/2)) / SAMPLE_COUNT) *
((OBS_HOM2 + (OBS_HET/2)) / SAMPLE_COUNT)
* SAMPLE_COUNT, 2)
AS E_HET,
# Expected aa
# q^2
# (COUNT(aa) + (COUNT(Aa)/2) /
# SAMPLE_COUNT) ^ 2 * SAMPLE_COUNT
ROUND(POW((OBS_HOM2 + (OBS_HET/2)) /
SAMPLE_COUNT, 2) * SAMPLE_COUNT, 2)
AS E_HOM2,
FROM (
SELECT
reference_name,
start,
reference_bases,
alternate_bases,
HOM_REF AS OBS_HOM1,
HET AS OBS_HET,
HOM_ALT AS OBS_HOM2,
HOM_REF + HET + HOM_ALT AS SAMPLE_COUNT,
FROM (
SELECT
reference_name,
start,
END,
reference_bases,
GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
COUNT(alternate_bases) WITHIN RECORD AS num_alts,
SUM(EVERY(0 = call.genotype)) WITHIN call AS HOM_REF,
SUM(EVERY(1 = call.genotype)) WITHIN call AS HOM_ALT,
SUM(SOME(0 = call.genotype)
AND SOME(1 = call.genotype)) WITHIN call AS HET,
FROM
[gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_2]
# Optionally add a clause here to limit the query to a particular
# region of the genome.
WHERE
reference_name = 'chr17'
AND start BETWEEN 41196311
AND 41277499
HAVING
# Skip 1/2 genotypes
num_alts = 1
)))
# Optionally add a clause here to sort and limit the results.
#_ORDER_BY_
Number of rows returned by this query: 185.
Displaying the first few results:
reference_name | start | reference_bases | alternate_bases | OBS_HOM1 | OBS_HET | OBS_HOM2 | E_HOM1 | E_HET | E_HOM2 | ChiSq | PVALUE_SIG |
---|---|---|---|---|---|---|---|---|---|---|---|
chr17 | 41277186 | G | C | 1 | 4 | 0 | 1.80 | 2.40 | 0.80 | 2.22 | FALSE |
chr17 | 41230104 | CT | C | 0 | 4 | 0 | 1.00 | 2.00 | 1.00 | 4.00 | FALSE |
chr17 | 41230227 | G | A | 1 | 4 | 0 | 1.80 | 2.40 | 0.80 | 2.22 | FALSE |
chr17 | 41230335 | A | G | 1 | 4 | 0 | 1.80 | 2.40 | 0.80 | 2.22 | FALSE |
chr17 | 41230375 | A | G | 1 | 4 | 0 | 1.80 | 2.40 | 0.80 | 2.22 | FALSE |
chr17 | 41230523 | T | G | 3 | 2 | 0 | 3.20 | 1.60 | 0.20 | 0.31 | FALSE |
Compare to brca1.hwe (see the vcftools command line used to create this file).
Command line
vcftools --gzvcf brca1.merged.sample_set.vcf.gz --hardy --out hardy
require(dplyr)
df <- read.table("./data/brca1/hardy.hwe", header=TRUE)
obsSplitCol <- "OBS.HOM1.HET.HOM2."
obsTemp <- read.table(text=as.character(df[, obsSplitCol]), sep = "/")
names(obsTemp) <- c("OBS_HOM1", "OBS_HET", "OBS_HOM2")
eSplitCol <- "E.HOM1.HET.HOM2."
eTemp <- read.table(text=as.character(df[, eSplitCol]), sep = "/")
names(eTemp) <- c("E_HOM1", "E_HET", "E_HOM2")
expectedResult <- cbind(cbind(df[setdiff(names(df), c(obsSplitCol,eSplitCol))], obsTemp), eTemp)
# Convert to zero-based coordinates
expectedResult <- mutate(expectedResult, POS = POS - 1)
names(expectedResult)[1] = 'reference_name'
names(expectedResult)[2] = 'start'
How many results do the two results have in common?
nrow(inner_join(result, expectedResult, by=c("reference_name", "start", "OBS_HOM1", "OBS_HET", "OBS_HOM2")))
## [1] 160
How many result were only identified by BigQuery?
onlyBQ <- anti_join(result, expectedResult, , by=c("reference_name", "start", "OBS_HOM1", "OBS_HET", "OBS_HOM2"))
nrow(onlyBQ)
## [1] 25
Let's take a look at the first few.
reference_name | start | reference_bases | alternate_bases | OBS_HOM1 | OBS_HET | OBS_HOM2 | E_HOM1 | E_HET | E_HOM2 | ChiSq | PVALUE_SIG |
---|---|---|---|---|---|---|---|---|---|---|---|
chr17 | 41196944 | T | C | 4 | 1 | 0 | 4.05 | 0.90 | 0.05 | 0.06 | FALSE |
chr17 | 41201363 | T | C | 3 | 2 | 0 | 3.20 | 1.60 | 0.20 | 0.31 | FALSE |
chr17 | 41211485 | C | CACA | 0 | 1 | 0 | 0.25 | 0.50 | 0.25 | 1.00 | FALSE |
chr17 | 41211485 | C | CACAACA | 0 | 1 | 0 | 0.25 | 0.50 | 0.25 | 1.00 | FALSE |
chr17 | 41213759 | C | T | 3 | 2 | 0 | 3.20 | 1.60 | 0.20 | 0.31 | FALSE |
chr17 | 41216205 | T | C | 3 | 2 | 0 | 3.20 | 1.60 | 0.20 | 0.31 | FALSE |
How many results were only identified by vcftools?
onlyVcftools <- anti_join(expectedResult, result, , by=c("reference_name", "start", "OBS_HOM1", "OBS_HET", "OBS_HOM2"))
nrow(onlyVcftools)
## [1] 16
Let's take a look at the first few.
reference_name | start | ChiSq | P | OBS_HOM1 | OBS_HET | OBS_HOM2 | E_HOM1 | E_HET | E_HOM2 |
---|---|---|---|---|---|---|---|---|---|
chr17 | 41196944.00 | 1.00 | 1.00 | 0 | 1 | 0 | 0.25 | 0.50 | 0.25 |
chr17 | 41201363.00 | 2.00 | 1.00 | 0 | 2 | 0 | 0.50 | 1.00 | 0.50 |
chr17 | 41213759.00 | 2.00 | 1.00 | 0 | 2 | 0 | 0.50 | 1.00 | 0.50 |
chr17 | 41216205.00 | 2.00 | 1.00 | 0 | 2 | 0 | 0.50 | 1.00 | 0.50 |
chr17 | 41220771.00 | 1.00 | 1.00 | 0 | 1 | 0 | 0.25 | 0.50 | 0.25 |
chr17 | 41222098.00 | 2.00 | 1.00 | 0 | 2 | 0 | 0.50 | 1.00 | 0.50 |
Retrieving the gVCF data for the results identified only by vcftools:
having <- list("_HAVING_" = paste("start <= ", onlyVcftools$start,
"AND",
"end >= ", onlyVcftools$start+1,
collapse=" OR "))
query <- paste(googleRepository, "examine-data.sql", sep="")
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=c(tableReplacement,
having))
# Examine the data for particular calls.
SELECT
call.call_set_name,
GROUP_CONCAT(STRING(call.genotype)) WITHIN call AS genotype,
reference_name,
start,
end,
reference_bases,
GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
quality,
GROUP_CONCAT(filter) WITHIN RECORD AS filter,
FROM
[gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2]
WHERE
reference_name = 'chr17'
HAVING
start <= 41258790 AND end >= 41258791 OR start <= 41252694 AND end >= 41252695 OR start <= 41249362 AND end >= 41249363 OR start <= 41248903 AND end >= 41248904 OR start <= 41247623 AND end >= 41247624 OR start <= 41247054 AND end >= 41247055 OR start <= 41245470 AND end >= 41245471 OR start <= 41241567 AND end >= 41241568 OR start <= 41239979 AND end >= 41239980 OR start <= 41251930 AND end >= 41251931 OR start <= 41222098 AND end >= 41222099 OR start <= 41220771 AND end >= 41220772 OR start <= 41216205 AND end >= 41216206 OR start <= 41213759 AND end >= 41213760 OR start <= 41201363 AND end >= 41201364 OR start <= 41196944 AND end >= 41196945
ORDER BY
start,
end,
call.call_set_name
The first few:
call_call_set_name | genotype | reference_name | start | end | reference_bases | alternate_bases | quality | filter |
---|---|---|---|---|---|---|---|---|
LP6005038-DNA_A01 | 0,0 | chr17 | 41196408 | 41197273 | G | 61.23 | ||
LP6005038-DNA_A02 | 0,0 | chr17 | 41196408 | 41197273 | G | 61.23 | ||
LP6005038-DNA_A03 | 0,0 | chr17 | 41196408 | 41197273 | G | 61.23 | ||
LP6005038-DNA_B01 | 0,0 | chr17 | 41196408 | 41197273 | G | 61.23 | ||
LP6005038-DNA_B02 | 0,1 | chr17 | 41196944 | 41196945 | T | C | 315.77 | PASS |
LP6005038-DNA_B01 | 0,0 | chr17 | 41200109 | 41201701 | A | 40.23 |
Ti/Tv by Alternate Allele Counts in Google Genomics
query <- paste(googleRepository, "ti-tv-by-alternate-allele-count.sql", sep="")
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=c(tableReplacement, constraints))
# Compute the Ti/Tv ratio for variants binned by alternate allele count.
SELECT
transitions,
transversions,
transitions/transversions AS titv,
alternate_allele_count
FROM (
SELECT
SUM(mutation IN ('A->G', 'G->A', 'C->T', 'T->C')) AS transitions,
SUM(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
'A->T', 'T->A', 'C->G', 'G->C')) AS transversions,
alternate_allele_count
FROM (
SELECT
CONCAT(reference_bases, CONCAT(STRING('->'), alternate_bases)) AS mutation,
COUNT(alternate_bases) WITHIN RECORD AS num_alts,
SUM(call.genotype = 1) WITHIN RECORD AS alternate_allele_count,
FROM
[gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2]
# Optionally add clause here to limit the query to a particular
# region of the genome.
WHERE
reference_name = 'chr17'
AND start BETWEEN 41196311
AND 41277499
HAVING
# Skip 1/2 genotypes _and non-SNP variants
num_alts = 1
AND reference_bases IN ('A','C','G','T')
AND alternate_bases IN ('A','C','G','T'))
GROUP BY
alternate_allele_count)
ORDER BY
alternate_allele_count DESC
transitions | transversions | titv | alternate_allele_count |
---|---|---|---|
0 | 1 | 0.00 | 10 |
81 | 29 | 2.79 | 4 |
11 | 3 | 3.67 | 2 |
8 | 5 | 1.60 | 1 |
3 | 0 | 0 |
Compare to output from vcftools.
Command line
vcftools --gzvcf brca1.merged.sample_set.vcf.gz --TsTv-by-count --out TsTv-by-coun
expectedResult <- read.table("./data/brca1/TsTv-by-count.TsTv.count", header=TRUE)
ALT_ALLELE_COUNT | N_Ts | N_Tv | Ts.Tv |
---|---|---|---|
0 | 0 | 0 | |
1 | 8 | 5 | 1.60 |
2 | 11 | 3 | 3.67 |
3 | 0 | 0 | |
4 | 81 | 29 | 2.79 |
5 | 0 | 0 | |
6 | 0 | 0 | |
7 | 0 | 0 | |
8 | 0 | 0 | |
9 | 0 | 0 |
The results for BigQuery and vcftools are identical except for the case where there are 10 alternate alleles. With only 5 genomes, 10 sites with an alternate is the maximum. vcftools may ignore sites where all calls are the alternate allele.
query <- "./sql/ti-tv-by-depth.sql"
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=c(tableReplacement))
SELECT
call.call_set_name,
(transitions/transversions) AS titv_ratio,
average_depth,
FROM (
SELECT
call.call_set_name,
SUM(mutation IN ('A->G', 'G->A', 'C->T', 'T->C')) AS transitions,
SUM(mutation IN ('A->C', 'C->A', 'G->T', 'T->G',
'A->T', 'T->A', 'C->G', 'G->C')) AS transversions,
ROUND(AVG(call.DP)) AS average_depth,
FROM (
SELECT
call.call_set_name,
CONCAT(reference_bases, CONCAT(STRING('->'), alternate_bases)) AS mutation,
COUNT(alternate_bases) WITHIN RECORD AS num_alts,
call.DP
FROM (
SELECT
call.call_set_name,
reference_bases,
GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
call.genotype,
call.DP,
FROM
[gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_2]
# Optionally add clause here to limit the query to a particular
# region of the genome.
#_WHERE_
)
WHERE
call.DP is not null
HAVING
# Skip 1/2 genotypes _and non-SNP variants
num_alts = 1
AND reference_bases IN ('A','C','G','T')
AND alternate_bases IN ('A','C','G','T'))
GROUP BY
call.call_set_name,
call.DP,)
WHERE
transversions > 0
GROUP BY
call.call_set_name,
titv_ratio,
average_depth,
call_call_set_name | titv_ratio | average_depth |
---|---|---|
LP6005038-DNA_B01 | 2.09 | 39.00 |
LP6005038-DNA_B02 | 2.20 | 28.00 |
LP6005038-DNA_A02 | 2.20 | 30.00 |
LP6005038-DNA_A01 | 2.12 | 30.00 |
LP6005038-DNA_A03 | 2.15 | 35.00 |
LP6005038-DNA_B01 | 2.16 | 33.00 |
ggplot(result, aes(x=average_depth, y=titv_ratio, color=call_call_set_name)) +
geom_point() +
#geom_smooth(se=FALSE, # Don't add shaded confidence region
# fullrange=T) +
ggtitle("Ti/Tv Ratio By Depth") +
xlab("Coverage Depth") +
ylab("Ti/Tv")
TODO: Find academic tool to compare against.
Sample-Level Missingness in Google Genomics
See this document for details on how tables were created for this section.
query <- "./sql/sample-missingness-missing-calls.sql"
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=c(tableReplacement, constraints))
SELECT
bc.sample_name AS sample_name,
nc.no_call_count as no_call_count,
nc.no_call_count/bc.base_count AS missingness,
FROM (
SELECT
sample_name,
COUNT(no_call) as no_call_count,
FROM(
SELECT
ref.sample_name AS sample_name,
ref.chromosome AS chromosome,
ref.start AS start,
ref.end AS end,
ref.reference_bases AS reference_bases,
True AS no_call
FROM
(SELECT
*
FROM
[gbsc-gcp-project-mvp:qc_tables.5_genomes_ref_calls_brca1] ) AS ref
LEFT OUTER JOIN (
SELECT
*
FROM
[gbsc-gcp-project-mvp:qc_tables.5_genomes_variants_brca1] ) AS var
ON
ref.sample_name = var.sample_name
AND ref.chromosome = var.chromosome
AND ref.start = var.start
WHERE
ref.is_ref is False
AND var.is_variant_call is NULL )
GROUP BY
sample_name ) AS nc
JOIN (
SELECT
sample_name,
COUNT(start) AS base_count
FROM
[gbsc-gcp-project-mvp:qc_tables.5_genomes_ref_calls_brca1]
GROUP BY
sample_name ) as bc
ON
bc.sample_name = nc.sample_name;
sample_name | no_call_count | missingness |
---|---|---|
LP6005038-DNA_A01 | 286 | 0.00 |
LP6005038-DNA_A02 | 115 | 0.00 |
LP6005038-DNA_A03 | 332 | 0.00 |
LP6005038-DNA_B01 | 228 | 0.00 |
LP6005038-DNA_B02 | 639 | 0.01 |
Compare to output from vcftools.
Command line
vcftools --gzvcf brca1.merged.sample_set.vcf.gz --missing-indv --out missingness
expectedResult <- read.table("./data/brca1/missingness.imiss", header=TRUE)
INDV | N_DATA | N_GENOTYPES_FILTERED | N_MISS | F_MISS |
---|---|---|---|---|
LP6005038-DNA_A01 | 81239 | 0 | 274 | 0.00 |
LP6005038-DNA_A02 | 81239 | 0 | 107 | 0.00 |
LP6005038-DNA_A03 | 81239 | 0 | 311 | 0.00 |
LP6005038-DNA_B01 | 81239 | 0 | 205 | 0.00 |
LP6005038-DNA_B02 | 81239 | 0 | 73 | 0.00 |
These results from BigQuery and vcftools are expected to be different. Sample-level missingness in BigQuery is calculated by joining all positions in the genome against the reference genome to identify all sites that were not called by sequencing. Vcftools only looks at sites that were called by at least one sample in the group.
Variant-Level Missingness in Google Genomics
query <- "./sql/variant-level-missingness-missing-calls.sql"
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=c(tableReplacement, constraints))
#SELECT FROM [va_aaa_pilot_data.5_genome_test_vcfs_no_calls] LIMIT 1000
# Compute the ratio no-calls for each variant.
SELECT
reference_name,
start,
END,
reference_bases,
alternate_bases,
no_calls,
all_calls,
(sample_count*2 - all_calls) AS missing_calls,
((no_calls + (sample_count*2 - all_calls))/(all_calls + (sample_count*2 - all_calls))) AS missingness_rate,
FROM (
SELECT
reference_name,
start,
END,
reference_bases,
GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
SUM(call.genotype == -1) WITHIN RECORD AS no_calls,
COUNT(call.genotype) WITHIN RECORD AS all_calls,
FROM
[gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_2]
# Optionally add clause here to limit the query to a particular
# region of the genome.
WHERE
reference_name = 'chr17'
AND start BETWEEN 41196311
AND 41277499
) as calls
CROSS JOIN (
SELECT
COUNT(sample_name) AS sample_count
FROM(
SELECT
call.call_set_name AS sample_name,
FROM [va_aaa_pilot_data.5_genome_test_vcfs_no_calls]
GROUP BY sample_name)) as count
# Optionally add a clause here to sort and limit the results.
#_ORDER_BY_
Error: Not found: Table gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_no_calls
notFound. Not found: Table gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_no_calls
sample_name | no_call_count | missingness |
---|---|---|
LP6005038-DNA_A01 | 286 | 0.00 |
LP6005038-DNA_A02 | 115 | 0.00 |
LP6005038-DNA_A03 | 332 | 0.00 |
LP6005038-DNA_B01 | 228 | 0.00 |
LP6005038-DNA_B02 | 639 | 0.01 |
Compare to output from vcftools.
Command line
vcftools --gzvcf brca1.merged.sample_set.vcf.gz --missing_site --out missing_sites
expectedResult <- read.table("./data/brca1/missing_sites.lmiss", header=TRUE)
expectedResult <- mutate(expectedResult, POS = POS - 1)
n = c("reference_name", "start", "expected_count", "filtered", "expected_missing_count", "expected_rate")
names(expectedResult) = n
print(xtable(head(expectedResult)), type="html", include.rownames=F)
reference_name | start | expected_count | filtered | expected_missing_count | expected_rate |
---|---|---|---|---|---|
chr17 | 41196311.00 | 10 | 0 | 0 | 0.00 |
chr17 | 41196312.00 | 10 | 0 | 0 | 0.00 |
chr17 | 41196313.00 | 10 | 0 | 0 | 0.00 |
chr17 | 41196314.00 | 10 | 0 | 0 | 0.00 |
chr17 | 41196315.00 | 10 | 0 | 0 | 0.00 |
chr17 | 41196316.00 | 10 | 0 | 0 | 0.00 |
Let's join them and look at the differences
joinedResult = data.table(inner_join(result, expectedResult, by=c("reference_name", "start")))
## Error in eval(expr, envir, enclos): could not find function "data.table"
matches = joinedResult[missingness_rate == expected_rate]
## Error in `[.data.frame`(joinedResult, missingness_rate == expected_rate): object 'missingness_rate' not found
differences = joinedResult[missingness_rate != expected_rate]
## Error in `[.data.frame`(joinedResult, missingness_rate != expected_rate): object 'missingness_rate' not found
How many do they have in common?
nrow(matches)
## Error in nrow(matches): object 'matches' not found
How many are different?
nrow(differences)
## Error in nrow(differences): object 'differences' not found
Here we can see which rows where different
## Error in xtable(differences): object 'differences' not found
The differences appear to be mostly form positions with insertions or deletions.
TODO: Look at the differences more closely.
Sex Inference in Google Genomics
query <- paste(googleRepository, "gender-check.sql", sep="")
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=c(tableReplacement, constraints))
# Compute the the homozygous and heterozygous variant counts for each individual
# within chromosome X to help determine whether the gender phenotype value is
# correct for each individual.
SELECT
call.call_set_name,
ROUND(SUM(het_RA)/(SUM(hom_AA) + SUM(het_RA)), 3) AS perct_het_alt_in_snvs,
ROUND(SUM(hom_AA)/(SUM(hom_AA) + SUM(het_RA)), 3) AS perct_hom_alt_in_snvs,
SUM(hom_AA) AS hom_AA_count,
SUM(het_RA) AS het_RA_count,
SUM(hom_RR) AS hom_RR_count,
FROM (
SELECT
reference_bases,
GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
COUNT(alternate_bases) WITHIN RECORD AS num_alts,
call.call_set_name,
SOME(call.genotype = 0) AND NOT SOME(call.genotype > 0) WITHIN call AS hom_RR,
SOME(call.genotype > 0) AND NOT SOME(call.genotype = 0) WITHIN call AS hom_AA,
SOME(call.genotype > 0) AND SOME(call.genotype = 0) WITHIN call AS het_RA
FROM
[gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_vcfs_2]
WHERE
reference_name = 'chrX'
AND start NOT BETWEEN 59999 AND 2699519
AND start NOT BETWEEN 154931042 AND 155260559
HAVING
# Skip 1/2 genotypes _and non-SNP variants
num_alts = 1
AND reference_bases IN ('A','C','G','T')
AND alternate_bases IN ('A','C','G','T')
)
GROUP BY
call.call_set_name
ORDER BY
call.call_set_name
Number of rows returned by this query: 5.
Displaying the first few results:
call_call_set_name | perct_het_alt_in_snvs | perct_hom_alt_in_snvs | hom_AA_count | het_RA_count | hom_RR_count |
---|---|---|---|---|---|
LP6005038-DNA_A01 | 0.03 | 0.97 | 71719 | 2090 | 91043 |
LP6005038-DNA_A02 | 0.03 | 0.97 | 72908 | 1950 | 89501 |
LP6005038-DNA_A03 | 0.59 | 0.41 | 44798 | 64482 | 56682 |
LP6005038-DNA_B01 | 0.03 | 0.97 | 74881 | 2165 | 87670 |
LP6005038-DNA_B02 | 0.03 | 0.97 | 73620 | 2592 | 88003 |
Let's join this with the sample information:
joinedResult <- inner_join(result, sampleInfo)
And visualize the results:
ggplot(joinedResult) +
geom_point(aes(x=call_call_set_name, y=perct_het_alt_in_snvs, color=gender)) +
theme(axis.text.x=if(nrow(result) <= 20)
{element_text(angle = 90, hjust = 1)} else {element_blank()}) +
xlab("Sample") +
ylab("Heterozygosity Rate ") +
ggtitle("Heterozygosity Rate on the X Chromosome")
TODO: Compare to Plink
Command line
plink --vcf ~/scratch/chrX.vcf.gz --make-bed --split-x 'hg19' --out ~/scratch/chrX-split
plink --bfile ~/scratch/chrX-split --check-sex
Heterozygous Haplotype in Google Genomics
query <- paste(googleRepository, "sex-chromosome-heterozygous-haplotypes.sql", sep="")
male_sample_ids = paste("'", paste(sampleInfo[sampleInfo$gender == 'Male',]$call_call_set_name, collapse=",'", "'", sep=''))
male_sample_ids = gsub(" ", "", male_sample_ids, fixed = TRUE)
male_sub = list("_MALE_SAMPLE_IDS_" = male_sample_ids)
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=c(tableReplacement, constraints, male_sub))
# Retrieve heterozygous haplotype calls on chromosomes X and Y.
SELECT
call.call_set_name,
GROUP_CONCAT(STRING(call.genotype)) WITHIN call AS genotype,
reference_name,
start,
end,
reference_bases,
GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alternate_bases,
FROM
[gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2]
WHERE
reference_name IN ('chrX', 'chrY')
OMIT
call if (2 > COUNT(call.genotype))
OR EVERY(call.genotype <= 0)
OR EVERY(call.genotype = 1)
HAVING call.call_set_name IN ('')
# Optionally add a clause here to sort and limit the results.
#_ORDER_BY_
Error in UseMethod("xtable"): no applicable method for 'xtable' applied to an object of class "NULL"
TODO Compare to command line
Ethnicity Inference in Google Genomics
Status: Waiting for PCA release from Elmer. See here for status.
Genome Similarity in Google Genomics
Status: Installed apache maven and built Google Genomics dataflow package. Ran into errors when trying to run dataflow job.
java -cp target/google-genomics-dataflow-v1beta2-0.2-SNAPSHOT.jar com/google/cloud/genomics/dataflow/pipelines/IdentityByState --project=my-project-id --output=gs://my-bucket/localtest.txt --genomicsSecretsFile=client_secrets.json
Error: Could not find or load main class com.google.cloud.genomics.dataflow.pipelines.IdentityByState
Command line
plink --genome --vcf chr22-merged.vcf.gz
Calculate the concordance between sequencing data and genotyping data.
query <- "./sql/genotyping-concordance.sql"
result <- DisplayAndDispatchQuery(query,
project=project,
replacements=c(tableReplacement))
SELECT
sample_id,
calls_in_common,
identical_calls,
(identical_calls/calls_in_common) AS concordance
FROM (
SELECT
sample_id,
COUNT(seq_genotype) AS calls_in_common,
SUM(IF(seq_genotype = gen_genotype, 1, 0)) AS identical_calls,
FROM (
SELECT
seq.sample_id AS sample_id,
seq.reference_name AS reference_name,
seq.start AS start,
seq.end AS end,
seq.genotype AS seq_genotype,
gen.genotype AS gen_genotype,
FROM (
SELECT
sample_id,
reference_name,
start,
end,
genotype,
bin,
FROM js(
(SELECT
call.call_set_name,
reference_name,
start,
end,
call.genotype,
reference_bases,
GROUP_CONCAT(alternate_bases) WITHIN RECORD AS alts,
COUNT(alternate_bases) WITHIN RECORD AS num_alts,
FROM
[gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_gvcfs_2]
#_WHERE_
OMIT
call IF EVERY (call.genotype < 0)
HAVING
num_alts <= 1
AND reference_bases IN ('A','C','G','T')
AND (alts IS null
OR LENGTH(alts) <= 1)
),
// Start javascript function
// Input Columns
call.call_set_name, reference_name, start, end, call.genotype,
// Output Schema
"[{name: 'sample_id', type: 'string'},
{name: 'reference_name', type: 'string'},
{name: 'start', type: 'integer'},
{name: 'end', type: 'integer'},
{name: 'genotype', type: 'string'},
{name: 'bin', type: 'integer'}]",
// Function
"function(r, emit) {
for (c of r.call) {
var binSize = 5000;
var startBin = Math.floor(r.start/binSize);
var endBin = Math.floor(r.end/binSize);
var genotype = JSON.stringify(c.genotype.sort());
for (var bin = startBin; bin <= endBin; bin++){
emit({
sample_id: c.call_set_name,
reference_name: r.reference_name,
start: r.start,
end: r.end,
genotype: genotype,
bin: bin,
})
}
}
}")) AS seq
JOIN EACH (
SELECT
sample_id,
reference_name,
start,
end,
genotype,
bin,
FROM js(
(SELECT
call.call_set_name,
reference_name,
start,
end,
call.genotype,
FROM
[gbsc-gcp-project-mvp:va_aaa_pilot_data.5_genome_test_genotyping_vcfs]
OMIT call IF EVERY (call.genotype < 0)
),
// Start javascript function
// Input Columns
call.call_set_name, reference_name, start, end, call.genotype,
// Output Schema
"[{name: 'sample_id', type: 'string'},
{name: 'reference_name', type: 'string'},
{name: 'start', type: 'integer'},
{name: 'end', type: 'integer'},
{name: 'genotype', type: 'string'},
{name: 'bin', type: 'integer'}]",
// Function
"function(r, emit) {
for (c of r.call) {
var binSize = 5000;
var bin = Math.floor(r.start/binSize);
var genotype = JSON.stringify(c.genotype.sort());
emit({
sample_id: c.call_set_name,
reference_name: r.reference_name,
start: r.start,
end: r.end,
genotype: genotype,
bin: bin,
})
}
}")) AS gen
ON
seq.sample_id = gen.sample_id
AND seq.reference_name = gen.reference_name
AND seq.bin = gen.bin
WHERE
seq.start <= gen.start
AND seq.end >= gen.end )
GROUP BY
sample_id)
Running query: RUNNING 2.6s
Running query: RUNNING 3.2s
Running query: RUNNING 3.8s
Running query: RUNNING 4.4s
Running query: RUNNING 5.1s
Running query: RUNNING 5.7s
Running query: RUNNING 6.3s
Running query: RUNNING 6.9s
Running query: RUNNING 7.5s
Running query: RUNNING 8.1s
Running query: RUNNING 8.8s
Running query: RUNNING 9.4s
Running query: RUNNING 10.0s
Running query: RUNNING 10.6s
Running query: RUNNING 11.2s
Running query: RUNNING 11.9s
Running query: RUNNING 12.5s
Running query: RUNNING 13.1s
Running query: RUNNING 13.7s
Running query: RUNNING 14.3s
Running query: RUNNING 14.9s
Running query: RUNNING 15.6s
Running query: RUNNING 16.2s
Running query: RUNNING 16.8s
Running query: RUNNING 17.4s
Running query: RUNNING 18.0s
Running query: RUNNING 18.7s
Running query: RUNNING 19.3s
Running query: RUNNING 19.9s
Running query: RUNNING 20.5s
Running query: RUNNING 21.1s
Running query: RUNNING 21.8s
Running query: RUNNING 22.4s
Running query: RUNNING 23.0s
Running query: RUNNING 23.6s
Running query: RUNNING 24.2s
Running query: RUNNING 24.8s
Running query: RUNNING 25.4s
Running query: RUNNING 26.1s
Running query: RUNNING 26.7s
Running query: RUNNING 27.3s
Running query: RUNNING 27.9s
Running query: RUNNING 28.5s
Running query: RUNNING 29.2s
Running query: RUNNING 29.8s
Running query: RUNNING 30.4s
Running query: RUNNING 31.0s
Running query: RUNNING 31.6s
Running query: RUNNING 32.3s
Running query: RUNNING 32.9s
Running query: RUNNING 33.5s
Running query: RUNNING 34.1s
Running query: RUNNING 34.8s
Running query: RUNNING 35.4s
Running query: RUNNING 36.0s
Running query: RUNNING 36.6s
Running query: RUNNING 37.2s
Running query: RUNNING 37.8s
Running query: RUNNING 38.5s
Running query: RUNNING 39.1s
Running query: RUNNING 39.7s
Running query: RUNNING 40.3s
Running query: RUNNING 40.9s
Running query: RUNNING 41.6s
Running query: RUNNING 42.2s
Running query: RUNNING 42.8s
Running query: RUNNING 43.4s
sample_id | calls_in_common | identical_calls | concordance |
---|---|---|---|
LP6005038-DNA_A03 | 2199423 | 2177458 | 0.99 |
LP6005038-DNA_B01 | 2200890 | 2178871 | 0.99 |
LP6005038-DNA_A01 | 2200420 | 2178471 | 0.99 |
LP6005038-DNA_A02 | 2200848 | 2178903 | 0.99 |
LP6005038-DNA_B02 | 2201742 | 2179131 | 0.99 |