Merge pull request #11 from mklarqvist/oop_update

release 0.3.0
mklarqvist · Feb 15, 2018 · 8015750 · 8015750
2 parents 850e04d + 370090e
commit 8015750
Show file tree

Hide file tree

Showing 153 changed files with 8,160 additions and 6,059 deletions.
diff --git a/.settings/language.settings.xml b/.settings/language.settings.xml
@@ -5,7 +5,7 @@
 			<provider copy-of="extension" id="org.eclipse.cdt.ui.UserLanguageSettingsProvider"/>
 			<provider-reference id="org.eclipse.cdt.core.ReferencedProjectsLanguageSettingsProvider" ref="shared-provider"/>
 			<provider class="org.eclipse.cdt.managedbuilder.language.settings.providers.GCCBuildCommandParser" id="org.eclipse.cdt.managedbuilder.core.GCCBuildCommandParser" keep-relative-paths="false" name="CDT GCC Build Output Parser" parameter="(g?cc)|([gc]\+\+)|(clang)" prefer-non-shared="true"/>
-			<provider class="org.eclipse.cdt.internal.build.crossgcc.CrossGCCBuiltinSpecsDetector" console="false" env-hash="-1884500886375924948" id="org.eclipse.cdt.build.crossgcc.CrossGCCBuiltinSpecsDetector" keep-relative-paths="false" name="CDT Cross GCC Built-in Compiler Settings" parameter="${COMMAND} ${FLAGS} -E -P -v -dD &quot;${INPUTS}&quot;" prefer-non-shared="true">
+			<provider class="org.eclipse.cdt.internal.build.crossgcc.CrossGCCBuiltinSpecsDetector" console="false" env-hash="-923999277560746690" id="org.eclipse.cdt.build.crossgcc.CrossGCCBuiltinSpecsDetector" keep-relative-paths="false" name="CDT Cross GCC Built-in Compiler Settings" parameter="${COMMAND} ${FLAGS} -E -P -v -dD &quot;${INPUTS}&quot;" prefer-non-shared="true">
 				<language-scope id="org.eclipse.cdt.core.gcc"/>
 				<language-scope id="org.eclipse.cdt.core.g++"/>
 			</provider>

diff --git a/.travis.yml b/.travis.yml
@@ -38,25 +38,7 @@ matrix:
       env:
         - MATRIX_EVAL="CC=gcc-6 && CXX=g++-6"
       compiler: gcc
-
-    - os: osx
-      osx_image: xcode8
-      env:
-        - MATRIX_EVAL="CC=gcc-4.9 && CXX=g++-4.9"
-      compiler: gcc
-
-    - os: osx
-      osx_image: xcode8
-      env:
-        - MATRIX_EVAL="brew install gcc5 && CC=gcc-5 && CXX=g++-5"
-      compiler: gcc
-
-    - os: osx
-      osx_image: xcode8
-      env:
-        - MATRIX_EVAL="brew install gcc && CC=gcc-6 && CXX=g++-6"
-      compiler: gcc
-
+
     # works on Precise and Trusty
     - os: linux
       addons:

diff --git a/R/1kgp3_chr20_105_1.jpeg b/R/1kgp3_chr20_105_1.jpeg
diff --git a/R/1kgp3_chr20_105_1_symmetric.jpeg b/R/1kgp3_chr20_105_1_symmetric.jpeg
diff --git a/R/1kgp3_chr20_105_1_triangular.jpeg b/R/1kgp3_chr20_105_1_triangular.jpeg
diff --git a/R/1kgp3_chr20_45_part1_10.jpeg b/R/1kgp3_chr20_45_part1_10.jpeg
diff --git a/R/1kgp3_chr20_large_region.jpeg b/R/1kgp3_chr20_large_region.jpeg
diff --git a/R/example_region.R b/R/example_region.R
@@ -0,0 +1,25 @@
+# Specify colour scheme
+colors<-paste0(colorRampPalette(c("blue","red"))(10),seq(0,100,length.out = 11))
+colors[1]<-paste0(colors[1],"0")
+colors[length(colors)]<- substr(colors[length(colors)],1,7)
+
+# Define support functions
+plotLDRegion<-function(dataSource, from, to, ...){
+  # B is A but sorted for plotting reasons (Z-stack)
+  b<-dataSource[dataSource$V3>=from & dataSource$V3 <= to & dataSource$V5 >= from & dataSource$V5 <= to,]
+  b<-b[order(b$V12,decreasing = F),]
+  plot(b$V3,b$V5,pch=20,cex=.1,col=colors[cut(b$V12,breaks=seq(0,1,length.out = 11),include.lowest = T)],xlim=c(from,to),ylim=c(from,to),xaxs="i",yaxs="i", ...)
+}
+
+plotLDRegionTriangular<-function(dataSource, from, to, ...){
+  # B is A but sorted for plotting reasons (Z-stack)
+  b<-dataSource[dataSource$V3>=from & dataSource$V5<=to & dataSource$V3>=from & dataSource$V5<=to,]
+  b<-b[b$V3<b$V5,] # upper triangular only
+  b<-b[order(b$V12,decreasing = F),]
+  plot(b$V3 + ((b$V5-b$V3)/2),b$V5-b$V3,pch=20,cex=.2,col=colors[cut(b$V12,breaks=seq(0,1,length.out = 11),include.lowest = T)],xaxs="i",yaxs="i", ...)
+}
+
+# Load some LD data from Tomahawk
+ld<-read.delim("1kgp3_chr2_105_1.ld",h=F)
+plotLDRegion(ld, 2e6, 5e6, xlab="Coordinates",ylab="Coordinates",main="1KGP3 chr20 2e6-5e6", las=2)
+plotLDRegionTriangular(ld, 2e6, 5e6, xlab="Coordinates",ylab="Coordinates",main="1KGP3 chr20 2e6-5e6", las=2)
diff --git a/R/import_plots.R b/R/import_plots.R
diff --git a/R/plot_functions.R b/R/plot_functions.R
diff --git a/README.md b/README.md
@@ -1,23 +1,32 @@
-[![Build Status](https://travis-ci.org/mklarqvist/Tomahawk.svg?branch=master)](https://travis-ci.org/mklarqvist/Tomahawk)
-[![Release](https://img.shields.io/badge/Release-beta_0.1-blue.svg)](https://github.com/mklarqvist/Tomahawk/releases)
+[![Build Status](https://travis-ci.org/mklarqvist/tomahawk.svg?branch=master)](https://travis-ci.org/mklarqvist/tomahawk)
+[![Release](https://img.shields.io/badge/Release-beta_0.3-blue.svg)](https://github.com/mklarqvist/tomahawk/releases)
 [![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 
 ![screenshot](tomahawk.png)
 ## Fast calculation of LD in large-scale cohorts
-Tomahawk efficiently represents genotypic data by exploiting basic genetic properties and we directly query this compressed representation to calculate linkage disequilibrium for all pairwise alleles/genotypes in large-scale cohorts. In order to achieve speed, Tomahawk combines primarily two efficient algorithms exploiting different concepts: 1) low genetic diversity, and 2) the large memory registers on modern processors. The first algorithm directly compares run-length encoded representation of genotypes from two vectors. The other precomputes the run-length encodings as 1-bit encodings and use SIMD-instructions to directly compare two bit-vectors. This algorithm also exploits the relatively low genetic diversity within species. Both algorithms are embarrassingly parallel.
+Tomahawk efficiently compress genotypic data by exploiting intrinsic genetic properties and we describe algorithms to directly query, manipulate, and explore this jointly compressed representation in-place. We represent genotypic vectors as fixed-width run-length encoded (RLE) objects with the five highest bits encoding for phasing, allele A, allele B, and the remainder as the run-length. This encoding scheme is superior to dynamic-width encoding appro  aches in terms of iteration speed but inferior in terms of compressibility. The word size (`uint8_t`, `uint16_t`, `uint32_t`, or `uint64_t`) of RLE entries is fixed across a file and is determined contextually contingent on the number of samples. Tomahawk has two primary internal functions: 
 
-The current format specifications (v.0) for `TWK`,`TWI`,`TWO`,`TOI`, and `TGZF`
+1) iterate over sites and RLE entries; 
+2) computing the inner product of compressed genotypic vectors;
+
+We describe efficient algorithms to calculate genome-wide linkage disequilibrium for all pairwise alleles/genotypes in large-scale cohorts. In order to achieve speed, Tomahawk primarily combines two efficient algorithms exploiting different concepts: 1) low genetic diversity and 2) the large memory registers on modern processors. The first algorithm directly compares RLE entries from two vectors. The other transforms RLE entries to bit-vectors and use SIMD-instructions to directly compare two such bit-vectors. This second algorithm also exploits the relatively low genetic diversity within species using implicit heuristics. Both algorithms are embarrassingly parallel.
+
+The current format specifications (v.0) for `TWK`,`TWO`, and `TGZF`
 are available [TWKv0](spec/TWKv0.pdf)
 
-Marcus D. R. Klarqvist (<mk21@sanger.ac.uk>)
+### Author
+Marcus D. R. Klarqvist (<mk819@cam.ac.uk>)  
+Department of Genetics, University of Cambridge  
+Wellcome Trust Sanger Institute
+
 
 ### Installation instructions
 For modern x86-64 CPUs with `SSE4.2` or later, just type `make` in the `build`
 directory. If you see compilation errors, you most likely do not have `SSE4.2`.
 At the present time, we do not support non-x86 CPUs or old CPU architecture.
 ```bash
-git clone --recursive https://github.com/mklarqvist/Tomahawk
-cd Tomahawk
+git clone --recursive https://github.com/mklarqvist/tomahawk
+cd tomahawk
 cd build
 make
 ```
@@ -31,8 +40,6 @@ compiled target.
 
 ### Brief usage instructions
 Tomahawk comprises five primary commands: `import`, `calc`, `view`, `sort`, and `concat`.
-The function `stats` have partial support: currently limited to basics for `two` files.
-The function `index` is disabled at the moment.
 Executing `tomahawk` gives a list of commands with brief descriptions and `tomahawk <command>`
 gives detailed details for that command.
 
@@ -54,14 +61,11 @@ from Hardy-Weinberg equilibrium with a probability < 0.001
 tomahawk import -i file.vcf -o outPrefix -m 0.2 -H 1e-3
 ```
 
-### Import-extend
-If you have split up your `vcf`/`bcf` files into multiple disjoint files
-(such as one per chromosome) it is possible to iteratively import and extend a `twk` file:
-```bash
-tomahawk import -i file.bcf -e extend.twk -m 0.2 -H 1e-3
-```
-
-### Calculating linkage disequilibrium
+### Calculating all-vs-all linkage disequilibrium
+In this example we force computations to use phased math (`-p`) and show a live progressbar
+(`-d`). Generated data is filtered for minimum genotype frequency (`-a`), squared correlation
+coefficient (`-r`) and by test statistics P-value (`-p`). Total computation is partitioned into 990 psuedo-balanced blocks (`-c`)
+and select the first partition (`-C`) to compute using 28 threads (`-t`);
 ```bash
 tomahawk calc -pdi file.twk -o output_prefix -a 5 -r 0.1 -P 0.1 -c 990 -C 1 -t 28
 ```
@@ -98,5 +102,70 @@ Perform k-way merge of partially sorted blocks
 tomahawk sort -i partial.two -o sorted.two -M
 ```
 
-### License
+## Plotting
+Plotting `two` data converted into `ld` format using the supplied `R` scripts (in the `R` directory).
+First transform a `two` file into human-readable `ld` format:
+```bash
+tomahawk view -hi 1kgp3_chr2_105_1.two > 1kgp3_chr2_105_1.ld
+```
+
+Either `source` the [R/example_region.R](R/example_region.R) file or copy-paste this code into `R`:
+```R
+# Specify colour scheme
+colors<-paste0(colorRampPalette(c("blue","red"))(10),seq(0,100,length.out = 11))
+colors[1]<-paste0(colors[1],"0")
+colors[length(colors)]<- substr(colors[length(colors)],1,7)
+
+# Define support functions
+plotLDRegion<-function(dataSource, from, to, ...){
+  # Assumes all the data is from the same chromosome
+  b<-dataSource[dataSource$V3>=from & dataSource$V3 <= to & dataSource$V5 >= from & dataSource$V5 <= to,]
+  b<-b[order(b$V13,decreasing = F),] # sort for Z-stack
+  plot(b$V3,b$V5,pch=20,cex=.2,col=colors[cut(b$V13,breaks=seq(0,1,length.out = 11),include.lowest = T)],xlim=c(from,to),ylim=c(from,to),xaxs="i",yaxs="i", ...)
+}
+
+plotLDRegionTriangular<-function(dataSource, from, to, ...){
+  # Assumes all the data is from the same chromosome
+  b<-dataSource[dataSource$V3>=from & dataSource$V5<=to & dataSource$V3>=from & dataSource$V5<=to,]
+  b<-b[b$V3<b$V5,] # upper triangular only
+  b<-b[order(b$V12,decreasing = F),] # sort for Z-stack
+  plot(b$V3 + ((b$V5-b$V3)/2),b$V5-b$V3,pch=20,cex=.2,col=colors[cut(b$V12,breaks=seq(0,1,length.out = 11),include.lowest = T)],xaxs="i",yaxs="i", ...)
+}
+```
+
+Load the `ld` data we generated:
+```R
+# Load some LD data from Tomahawk
+ld<-read.delim("1kgp3_chr2_105_1.ld",h=F)
+```
+and then plot it using either of the two support functions. First plotting the data as is (upper-triangular)
+```R
+plotLDRegion(ld, 2e6, 5e6, xlab="Coordinates",ylab="Coordinates",main="1KGP3 chr20 2e6-5e6", las=2)
+```
+![screenshot](R/1kgp3_chr20_105_1.jpeg)
+
+or plotting the upper-triangular rotated 45 degrees
+```R
+plotLDRegionTriangular(ld, 2e6, 5e6, xlab="Coordinates",ylab="Coordinates",main="1KGP3 chr20 2e6-5e6", las=2)
+```
+![screenshot](R/1kgp3_chr20_105_1_triangular.jpeg)  
+
+Plotting large regions can be achieved by truncating the Y-axis:
+```R
+plotLDRegionTriangular(ld, min(ld$V3), max(ld$V5), xlab="Coordinates",ylab="Coordinates",main="1KGP3 chr20 6e6-12e6", las=2,ylim=c(0,0.5e6))
+```
+![screenshot](R/1kgp3_chr20_large_region.jpeg)
+
+If your data has been sorted and expanded (symmetric) for rapid queries each data point is represented twice ([A,B], and [B,A]):
+```R
+# Load some symmetric LD data from Tomahawk
+ld<-read.delim("1kgp3_chr2_105_1_symmetric.ld",h=F)
+plotLDRegion(ld, 1e6, 4e6, xlab="Coordinates",ylab="Coordinates",main="1KGP3 chr20 1e6-4e6", las=2)
+```
+![screenshot](R/1kgp3_chr20_105_1_symmetric.jpeg)  
+
+This figure demonstrates how Tomahawk partitions the workload in order to maximize data locality. Shown here is part 1 and 10 out of 45 for the 1000 Genomes data for chromosome 20. This data locality can have profound impact on runtime: in many cases it is faster to run many smaller partitions of the data instead of several larger ones.   
+![screenshot](R/1kgp3_chr20_45_part1_10.jpeg)
+
+## License
 [MIT](LICENSE)
diff --git a/RELEASE.md b/RELEASE.md
@@ -0,0 +1,17 @@
+# Release 0.3.0
+
+## Breaking Changes
+* All changes are breaking. There is no backwards compatibility from this release!
+
+## Major Features And Improvements
+* All major classes are now implemented as STL-like container and are as decoupled as possible
+* External sort file handles now buffer a small amount of data to reduce random access lookups
+* `two`/`twk` entries are no longer forced to be accessed by unaligned memory addresses
+* Both `two` and `twk` are now self-indexing. All external indices are now invalid
+
+## Bug Fixes and Other Changes
+* Bug fixes
+  * Too many to list here
+* Examples updates:
+  * Added R script to demonstrate simple plotting using `base`
+  * Added several figures demonstrating some keys concepts of Tomahawk