Skip to content

Commit

Permalink
Merge pull request #11 from mklarqvist/oop_update
Browse files Browse the repository at this point in the history
release 0.3.0
  • Loading branch information
mklarqvist authored Feb 15, 2018
2 parents 850e04d + 370090e commit 8015750
Show file tree
Hide file tree
Showing 153 changed files with 8,160 additions and 6,059 deletions.
2 changes: 1 addition & 1 deletion .settings/language.settings.xml
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
<provider copy-of="extension" id="org.eclipse.cdt.ui.UserLanguageSettingsProvider"/>
<provider-reference id="org.eclipse.cdt.core.ReferencedProjectsLanguageSettingsProvider" ref="shared-provider"/>
<provider class="org.eclipse.cdt.managedbuilder.language.settings.providers.GCCBuildCommandParser" id="org.eclipse.cdt.managedbuilder.core.GCCBuildCommandParser" keep-relative-paths="false" name="CDT GCC Build Output Parser" parameter="(g?cc)|([gc]\+\+)|(clang)" prefer-non-shared="true"/>
<provider class="org.eclipse.cdt.internal.build.crossgcc.CrossGCCBuiltinSpecsDetector" console="false" env-hash="-1884500886375924948" id="org.eclipse.cdt.build.crossgcc.CrossGCCBuiltinSpecsDetector" keep-relative-paths="false" name="CDT Cross GCC Built-in Compiler Settings" parameter="${COMMAND} ${FLAGS} -E -P -v -dD &quot;${INPUTS}&quot;" prefer-non-shared="true">
<provider class="org.eclipse.cdt.internal.build.crossgcc.CrossGCCBuiltinSpecsDetector" console="false" env-hash="-923999277560746690" id="org.eclipse.cdt.build.crossgcc.CrossGCCBuiltinSpecsDetector" keep-relative-paths="false" name="CDT Cross GCC Built-in Compiler Settings" parameter="${COMMAND} ${FLAGS} -E -P -v -dD &quot;${INPUTS}&quot;" prefer-non-shared="true">
<language-scope id="org.eclipse.cdt.core.gcc"/>
<language-scope id="org.eclipse.cdt.core.g++"/>
</provider>
Expand Down
20 changes: 1 addition & 19 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -38,25 +38,7 @@ matrix:
env:
- MATRIX_EVAL="CC=gcc-6 && CXX=g++-6"
compiler: gcc

- os: osx
osx_image: xcode8
env:
- MATRIX_EVAL="CC=gcc-4.9 && CXX=g++-4.9"
compiler: gcc

- os: osx
osx_image: xcode8
env:
- MATRIX_EVAL="brew install gcc5 && CC=gcc-5 && CXX=g++-5"
compiler: gcc

- os: osx
osx_image: xcode8
env:
- MATRIX_EVAL="brew install gcc && CC=gcc-6 && CXX=g++-6"
compiler: gcc


# works on Precise and Trusty
- os: linux
addons:
Expand Down
Binary file added R/1kgp3_chr20_105_1.jpeg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added R/1kgp3_chr20_105_1_symmetric.jpeg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added R/1kgp3_chr20_105_1_triangular.jpeg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added R/1kgp3_chr20_45_part1_10.jpeg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added R/1kgp3_chr20_large_region.jpeg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
25 changes: 25 additions & 0 deletions R/example_region.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Specify colour scheme
colors<-paste0(colorRampPalette(c("blue","red"))(10),seq(0,100,length.out = 11))
colors[1]<-paste0(colors[1],"0")
colors[length(colors)]<- substr(colors[length(colors)],1,7)

# Define support functions
plotLDRegion<-function(dataSource, from, to, ...){
# B is A but sorted for plotting reasons (Z-stack)
b<-dataSource[dataSource$V3>=from & dataSource$V3 <= to & dataSource$V5 >= from & dataSource$V5 <= to,]
b<-b[order(b$V12,decreasing = F),]
plot(b$V3,b$V5,pch=20,cex=.1,col=colors[cut(b$V12,breaks=seq(0,1,length.out = 11),include.lowest = T)],xlim=c(from,to),ylim=c(from,to),xaxs="i",yaxs="i", ...)
}

plotLDRegionTriangular<-function(dataSource, from, to, ...){
# B is A but sorted for plotting reasons (Z-stack)
b<-dataSource[dataSource$V3>=from & dataSource$V5<=to & dataSource$V3>=from & dataSource$V5<=to,]
b<-b[b$V3<b$V5,] # upper triangular only
b<-b[order(b$V12,decreasing = F),]
plot(b$V3 + ((b$V5-b$V3)/2),b$V5-b$V3,pch=20,cex=.2,col=colors[cut(b$V12,breaks=seq(0,1,length.out = 11),include.lowest = T)],xaxs="i",yaxs="i", ...)
}

# Load some LD data from Tomahawk
ld<-read.delim("1kgp3_chr2_105_1.ld",h=F)
plotLDRegion(ld, 2e6, 5e6, xlab="Coordinates",ylab="Coordinates",main="1KGP3 chr20 2e6-5e6", las=2)
plotLDRegionTriangular(ld, 2e6, 5e6, xlab="Coordinates",ylab="Coordinates",main="1KGP3 chr20 2e6-5e6", las=2)
11 changes: 0 additions & 11 deletions R/import_plots.R

This file was deleted.

156 changes: 0 additions & 156 deletions R/plot_functions.R

This file was deleted.

105 changes: 87 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,23 +1,32 @@
[![Build Status](https://travis-ci.org/mklarqvist/Tomahawk.svg?branch=master)](https://travis-ci.org/mklarqvist/Tomahawk)
[![Release](https://img.shields.io/badge/Release-beta_0.1-blue.svg)](https://github.com/mklarqvist/Tomahawk/releases)
[![Build Status](https://travis-ci.org/mklarqvist/tomahawk.svg?branch=master)](https://travis-ci.org/mklarqvist/tomahawk)
[![Release](https://img.shields.io/badge/Release-beta_0.3-blue.svg)](https://github.com/mklarqvist/tomahawk/releases)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

![screenshot](tomahawk.png)
## Fast calculation of LD in large-scale cohorts
Tomahawk efficiently represents genotypic data by exploiting basic genetic properties and we directly query this compressed representation to calculate linkage disequilibrium for all pairwise alleles/genotypes in large-scale cohorts. In order to achieve speed, Tomahawk combines primarily two efficient algorithms exploiting different concepts: 1) low genetic diversity, and 2) the large memory registers on modern processors. The first algorithm directly compares run-length encoded representation of genotypes from two vectors. The other precomputes the run-length encodings as 1-bit encodings and use SIMD-instructions to directly compare two bit-vectors. This algorithm also exploits the relatively low genetic diversity within species. Both algorithms are embarrassingly parallel.
Tomahawk efficiently compress genotypic data by exploiting intrinsic genetic properties and we describe algorithms to directly query, manipulate, and explore this jointly compressed representation in-place. We represent genotypic vectors as fixed-width run-length encoded (RLE) objects with the five highest bits encoding for phasing, allele A, allele B, and the remainder as the run-length. This encoding scheme is superior to dynamic-width encoding appro aches in terms of iteration speed but inferior in terms of compressibility. The word size (`uint8_t`, `uint16_t`, `uint32_t`, or `uint64_t`) of RLE entries is fixed across a file and is determined contextually contingent on the number of samples. Tomahawk has two primary internal functions:

The current format specifications (v.0) for `TWK`,`TWI`,`TWO`,`TOI`, and `TGZF`
1) iterate over sites and RLE entries;
2) computing the inner product of compressed genotypic vectors;

We describe efficient algorithms to calculate genome-wide linkage disequilibrium for all pairwise alleles/genotypes in large-scale cohorts. In order to achieve speed, Tomahawk primarily combines two efficient algorithms exploiting different concepts: 1) low genetic diversity and 2) the large memory registers on modern processors. The first algorithm directly compares RLE entries from two vectors. The other transforms RLE entries to bit-vectors and use SIMD-instructions to directly compare two such bit-vectors. This second algorithm also exploits the relatively low genetic diversity within species using implicit heuristics. Both algorithms are embarrassingly parallel.

The current format specifications (v.0) for `TWK`,`TWO`, and `TGZF`
are available [TWKv0](spec/TWKv0.pdf)

Marcus D. R. Klarqvist (<mk21@sanger.ac.uk>)
### Author
Marcus D. R. Klarqvist (<mk819@cam.ac.uk>)
Department of Genetics, University of Cambridge
Wellcome Trust Sanger Institute


### Installation instructions
For modern x86-64 CPUs with `SSE4.2` or later, just type `make` in the `build`
directory. If you see compilation errors, you most likely do not have `SSE4.2`.
At the present time, we do not support non-x86 CPUs or old CPU architecture.
```bash
git clone --recursive https://github.com/mklarqvist/Tomahawk
cd Tomahawk
git clone --recursive https://github.com/mklarqvist/tomahawk
cd tomahawk
cd build
make
```
Expand All @@ -31,8 +40,6 @@ compiled target.

### Brief usage instructions
Tomahawk comprises five primary commands: `import`, `calc`, `view`, `sort`, and `concat`.
The function `stats` have partial support: currently limited to basics for `two` files.
The function `index` is disabled at the moment.
Executing `tomahawk` gives a list of commands with brief descriptions and `tomahawk <command>`
gives detailed details for that command.

Expand All @@ -54,14 +61,11 @@ from Hardy-Weinberg equilibrium with a probability < 0.001
tomahawk import -i file.vcf -o outPrefix -m 0.2 -H 1e-3
```

### Import-extend
If you have split up your `vcf`/`bcf` files into multiple disjoint files
(such as one per chromosome) it is possible to iteratively import and extend a `twk` file:
```bash
tomahawk import -i file.bcf -e extend.twk -m 0.2 -H 1e-3
```

### Calculating linkage disequilibrium
### Calculating all-vs-all linkage disequilibrium
In this example we force computations to use phased math (`-p`) and show a live progressbar
(`-d`). Generated data is filtered for minimum genotype frequency (`-a`), squared correlation
coefficient (`-r`) and by test statistics P-value (`-p`). Total computation is partitioned into 990 psuedo-balanced blocks (`-c`)
and select the first partition (`-C`) to compute using 28 threads (`-t`);
```bash
tomahawk calc -pdi file.twk -o output_prefix -a 5 -r 0.1 -P 0.1 -c 990 -C 1 -t 28
```
Expand Down Expand Up @@ -98,5 +102,70 @@ Perform k-way merge of partially sorted blocks
tomahawk sort -i partial.two -o sorted.two -M
```

### License
## Plotting
Plotting `two` data converted into `ld` format using the supplied `R` scripts (in the `R` directory).
First transform a `two` file into human-readable `ld` format:
```bash
tomahawk view -hi 1kgp3_chr2_105_1.two > 1kgp3_chr2_105_1.ld
```

Either `source` the [R/example_region.R](R/example_region.R) file or copy-paste this code into `R`:
```R
# Specify colour scheme
colors<-paste0(colorRampPalette(c("blue","red"))(10),seq(0,100,length.out = 11))
colors[1]<-paste0(colors[1],"0")
colors[length(colors)]<- substr(colors[length(colors)],1,7)

# Define support functions
plotLDRegion<-function(dataSource, from, to, ...){
# Assumes all the data is from the same chromosome
b<-dataSource[dataSource$V3>=from & dataSource$V3 <= to & dataSource$V5 >= from & dataSource$V5 <= to,]
b<-b[order(b$V13,decreasing = F),] # sort for Z-stack
plot(b$V3,b$V5,pch=20,cex=.2,col=colors[cut(b$V13,breaks=seq(0,1,length.out = 11),include.lowest = T)],xlim=c(from,to),ylim=c(from,to),xaxs="i",yaxs="i", ...)
}

plotLDRegionTriangular<-function(dataSource, from, to, ...){
# Assumes all the data is from the same chromosome
b<-dataSource[dataSource$V3>=from & dataSource$V5<=to & dataSource$V3>=from & dataSource$V5<=to,]
b<-b[b$V3<b$V5,] # upper triangular only
b<-b[order(b$V12,decreasing = F),] # sort for Z-stack
plot(b$V3 + ((b$V5-b$V3)/2),b$V5-b$V3,pch=20,cex=.2,col=colors[cut(b$V12,breaks=seq(0,1,length.out = 11),include.lowest = T)],xaxs="i",yaxs="i", ...)
}
```

Load the `ld` data we generated:
```R
# Load some LD data from Tomahawk
ld<-read.delim("1kgp3_chr2_105_1.ld",h=F)
```
and then plot it using either of the two support functions. First plotting the data as is (upper-triangular)
```R
plotLDRegion(ld, 2e6, 5e6, xlab="Coordinates",ylab="Coordinates",main="1KGP3 chr20 2e6-5e6", las=2)
```
![screenshot](R/1kgp3_chr20_105_1.jpeg)

or plotting the upper-triangular rotated 45 degrees
```R
plotLDRegionTriangular(ld, 2e6, 5e6, xlab="Coordinates",ylab="Coordinates",main="1KGP3 chr20 2e6-5e6", las=2)
```
![screenshot](R/1kgp3_chr20_105_1_triangular.jpeg)

Plotting large regions can be achieved by truncating the Y-axis:
```R
plotLDRegionTriangular(ld, min(ld$V3), max(ld$V5), xlab="Coordinates",ylab="Coordinates",main="1KGP3 chr20 6e6-12e6", las=2,ylim=c(0,0.5e6))
```
![screenshot](R/1kgp3_chr20_large_region.jpeg)

If your data has been sorted and expanded (symmetric) for rapid queries each data point is represented twice ([A,B], and [B,A]):
```R
# Load some symmetric LD data from Tomahawk
ld<-read.delim("1kgp3_chr2_105_1_symmetric.ld",h=F)
plotLDRegion(ld, 1e6, 4e6, xlab="Coordinates",ylab="Coordinates",main="1KGP3 chr20 1e6-4e6", las=2)
```
![screenshot](R/1kgp3_chr20_105_1_symmetric.jpeg)

This figure demonstrates how Tomahawk partitions the workload in order to maximize data locality. Shown here is part 1 and 10 out of 45 for the 1000 Genomes data for chromosome 20. This data locality can have profound impact on runtime: in many cases it is faster to run many smaller partitions of the data instead of several larger ones.
![screenshot](R/1kgp3_chr20_45_part1_10.jpeg)

## License
[MIT](LICENSE)
17 changes: 17 additions & 0 deletions RELEASE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Release 0.3.0

## Breaking Changes
* All changes are breaking. There is no backwards compatibility from this release!

## Major Features And Improvements
* All major classes are now implemented as STL-like container and are as decoupled as possible
* External sort file handles now buffer a small amount of data to reduce random access lookups
* `two`/`twk` entries are no longer forced to be accessed by unaligned memory addresses
* Both `two` and `twk` are now self-indexing. All external indices are now invalid

## Bug Fixes and Other Changes
* Bug fixes
* Too many to list here
* Examples updates:
* Added R script to demonstrate simple plotting using `base`
* Added several figures demonstrating some keys concepts of Tomahawk
Loading

0 comments on commit 8015750

Please sign in to comment.