Dev #12

Merged
merged 11 commits into from
Apr 25, 2019
2 changes: 2 additions & 0 deletions .gitignore
@@ -13,7 +13,9 @@

# Compiled Dynamic libraries
*.so
*.so.*
*.dylib
*.dylib.*
*.dll

# Fortran module files
105 changes: 97 additions & 8 deletions .travis.yml
@@ -68,16 +68,15 @@ matrix:
env:
- MATRIX_EVAL="CC=gcc-6 && CXX=g++-6"
compiler: gcc

# works on Precise and Trusty
- os: linux
addons:
apt:
sources:
- ubuntu-toolchain-r-test
- llvm-toolchain-precise-3.6
packages:
- clang-3.6
- g++-7
- pkg-config
- libncurses5-dev
- libncursesw5-dev
@@ -89,8 +88,30 @@ matrix:
- libbz2-dev
- liblzma-dev
env:
- MATRIX_EVAL="CC=clang-3.6 && CXX=clang++-3.6"
compiler: clang
- MATRIX_EVAL="CC=gcc-7 && CXX=g++-7"
compiler: gcc

# works on Precise and Trusty
- os: linux
addons:
apt:
sources:
- ubuntu-toolchain-r-test
packages:
- g++-8
- pkg-config
- libncurses5-dev
- libncursesw5-dev
- zlib1g-dev
- libssl-dev
- liblz4-dev
- libcurl4-openssl-dev
- liblz-dev
- libbz2-dev
- liblzma-dev
env:
- MATRIX_EVAL="CC=gcc-8 && CXX=g++-8"
compiler: gcc

# works on Precise and Trusty
- os: linux
@@ -204,15 +225,83 @@ matrix:
- MATRIX_EVAL="CC=clang-5.0 && CXX=clang++-5.0"
compiler: clang

# works on Trusty
- os: linux
addons:
apt:
sources:
- ubuntu-toolchain-r-test
- llvm-toolchain-trusty-6.0
packages:
- clang-6.0
- libstdc++-6-dev
- pkg-config
- libncurses5-dev
- libncursesw5-dev
- zlib1g-dev
- libssl-dev
- liblz4-dev
- libcurl4-openssl-dev
- liblz-dev
- libbz2-dev
- liblzma-dev
env:
- MATRIX_EVAL="CC=clang-6.0 && CXX=clang++-6.0"
compiler: clang

# works on Trusty
- os: linux
addons:
apt:
sources:
- ubuntu-toolchain-r-test
- llvm-toolchain-trusty-7
packages:
- clang-7
- libstdc++-7-dev
- pkg-config
- libncurses5-dev
- libncursesw5-dev
- zlib1g-dev
- libssl-dev
- liblz4-dev
- libcurl4-openssl-dev
- liblz-dev
- libbz2-dev
- liblzma-dev
env:
- MATRIX_EVAL="CC=clang-7 && CXX=clang++-7"
compiler: clang

# OSX begin
- os: osx
osx_image: xcode10
env:
- MATRIX_EVAL="brew update && brew install gcc@5 && brew install openssl && CC=gcc-5 && CXX=g++-5"

- os: osx
osx_image: xcode10
env:
- MATRIX_EVAL="brew update && brew install gcc@6 && brew install openssl && CC=gcc-6 && CXX=g++-6"

- os: osx
osx_image: xcode10
env:
- MATRIX_EVAL="brew update && brew install gcc@7 && brew install openssl && CC=gcc-7 && CXX=g++-7"

- os: osx
osx_image: xcode10
env:
- MATRIX_EVAL="brew update && brew install gcc@8 && brew install openssl && CC=gcc-8 && CXX=g++-8"

before_install:
- eval "${MATRIX_EVAL}"
- git clone https://github.com/facebook/zstd
- cd zstd && make -j4 && sudo make install && cd ..
- git clone https://github.com/samtools/htslib
- cd htslib && git checkout 1832d3a1b75133e55fb6abffc3f50f8a6ed5ceae
- autoheader && autoconf && ./configure && make -j 4 && sudo make install && cd ..
- cd htslib && autoheader && autoconf && ./configure && make -j 4 && sudo make install && cd ..

script:
- git submodule update --recursive
- make -j 4
- if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then make -j 4 OPENSSL_PATH=/usr/local/opt/openssl/; else make -j 4; fi

61 changes: 22 additions & 39 deletions README.md
@@ -1,17 +1,28 @@
[![Build Status](https://travis-ci.org/mklarqvist/tachyon.svg?branch=master)](https://travis-ci.org/mklarqvist/tachyon)
[![Release](https://img.shields.io/badge/Release-beta_0.6.0-blue.svg)](https://github.com/mklarqvist/Tachyon/releases)
[![Release](https://img.shields.io/badge/Release-beta_0.6.1-blue.svg)](https://github.com/mklarqvist/Tachyon/releases)
[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

<div align="center">
<img src="https://github.com/mklarqvist/tachyon/blob/master/yon_logo.png"><br><br>
</div>

Tachyon, or `YON` for short, is an open source C++ software library for reading, writing, and manipulating sequence variant data in a lossless and bit-exact representation. It is completely compatible with BCF/VCF. It was developed with a focus on enabling fast experimentation and storage of population-scaled datasets.
Tachyon is an open source C++ software library for reading, writing, and manipulating sequence variant data in a lossless and bit-exact representation. It is completely compatible with BCF/VCF. It was developed with a focus on enabling fast experimentation and storage of population-scaled datasets.

## How does it work?

Tachyon stores data in a format that optimizes query execution (column store). Additionally, this data layout generally results in considerable gains in compression
as similar data are stored together separately. Tachyon can be considered the equivalent of what [CRAM](http://samtools.github.io/hts-specs/) is for SAM/BAM but for sequence variant data (VCF/BCF).

## Documentation

* Overview.
* [Building and installing](docs/building.md)
* [Getting started](docs/getting_started.md)
* [Performance benchmarks](docs/benchmarks.md)

## Performance
For reference, we compared yon to bcf on a server running Linux Ubuntu, with an Intel Xeon E5-2697 v3 processor, 64GB of DDR4-2133 RAM, and a pair of Intel SSD 750 NVMe drives running in RAID-0.

The following tests were run on the first release of [Haplotype Reference Consortium](http://www.haplotype-reference-consortium.org/) (HRC) data. There are ~39 million phased SNPs in 32,488 samples. Left panel: Filesizes for chromosomes 1-22. Right panel: We generated a yon archive for this dataset (left) and compared file sizes for both uncompressed (ubcf and uyon) and compressed data (bcf and yon) and next retrieved the site-specific information only (dropping all FORMAT fields; right).
The following tests were run on the first release of [Haplotype Reference Consortium](http://www.haplotype-reference-consortium.org/) (HRC) data. There are ~39 million phased SNPs in 32,488 samples. Left panel: Filesizes for chromosomes 1-22. Right panel: We generated a yon archive for this dataset (left) and compared file sizes for both uncompressed (ubcf and uyon) and compressed data (bcf and yon).

Compression Ratio / Chromosome | Compression Ratio
------------------|-------------------
@@ -25,49 +36,21 @@ Compression Ratio / Chromosome | Compression Ratio

ubcf: uncompressed bcf; uyon: uncompressed yon; 1 GB = 1000 * 1000 * 1000 b

The reference system used was a server running Linux Ubuntu, with an Intel Xeon E5-2697 v3 processor, 64GB of DDR4-2133 RAM, and a pair of Intel SSD 750 NVMe drives running in RAID-0.

### Evaluation performance
The following tests were run to benchmark the processing time of various `yon` archives. For these tests we use three distinct datasets: 1) [1000 Genomes Phase 3](http://www.internationalgenome.org/) (1KGP3) chromosome 11; 2) [Haplotype Reference Consortium](http://www.haplotype-reference-consortium.org/) (HRC) chromosome 11; and 3) [Human Genome Diversity Project](http://www.hagsc.org/hgdp/) (HGDP) chromosome 10.

| Dataset | Variants | #INFO | #FORMAT | ubcf | bcf | uyon | yon |
|-------------|----------|-------|---------|-----------|-----------|-----------|-----------|
| 1kgp3-chr11 | 4045628 | 24 | 1 | 20.60 GB | 633.70 MB | 670.29 MB | 157.28 MB |
| HRC-chr11 | 1936990 | 6 | 1 | 125.90 GB | 3.48 GB | 1.47 GB | 461.96 MB |
| HGDP-chr10 | 3766673 | 24 | 9 | 73.93 GB | 19.07 GB | 67.76 GB | 14.40 GB |
| 1kgp3-chr11 | 4,045,628 | 24 | 1 | 20.60 GB | 633.70 MB | 670.29 MB | 157.28 MB |
| HRC-chr11 | 1,936,990 | 6 | 1 | 125.90 GB | 3.48 GB | 1.47 GB | 461.96 MB |
| HGDP-chr10 | 3,766,673 | 24 | 9 | 73.93 GB | 19.07 GB | 67.76 GB | 14.40 GB |

Throughput is measured in megabytes per second. The test involves: 1) reading raw data (`read`); 2) decrypting and uncompressing raw data (`decompress`); 3) copying and constructing data containers from byte streams (`container eval`); 4) constructing lazy-evaluated `yon1_t` records from data containers (`lazy eval`). Which level of evaluation you would use in your application depends on the use case, but most applications should be able to operate directly from the byte-streams (`decompress`).

| Dataset | Threads | Read (MB/s) | Decompress (MB/s) | Container Eval (MB/s) | Lazy Eval (MB/s) |
|-------------|---------|-------------|-------------------|-----------------------|------------------|
| 1kgp3-chr11 | 1 | 799.7 | 231.6 | 63.3 | 14.5 |
| | 28 | 847.34 | 2242.9 | 614.5 | 135.9 |
| HRC-chr11 | 1 | 1717.4 | 192.7 | 131.1 | 14.9 |
| | 28 | 1702.2 | 2613.0 | 1940.1 | 157.7 |
| HGDP-chr10 | 1 | 1756.8 | 321.2 | 50.2 | 48.8 |
| | 28 | 1733.3 | 5451.9 | 442.5 | 449.5 |
\#INFO: number of INFO fields; \#FORMAT: number of FORMAT fields; ubcf: uncompressed bcf; uyon: uncompressed yon; 1 MB = 1000 * 1000 b

---

## Installation
For Ubuntu, Debian, and Mac systems, installation is easy: just run
```bash
git clone --recursive https://github.com/mklarqvist/tachyon
cd tachyon
./install.sh
```
Note the added `--recursive` flag to the clone request. This flag is required to additionally pull down the latest third-party dependencies. The install.sh file depends extensively on `apt-get`, so it is unlikely to run without extensive modifications on non-Debian-based systems.
If you do not have super-user (administrator) privileges required to install new packages on your system then run the local installation:
```bash
./install.sh local
```
When installing locally the required dependencies are downloaded and built in the root directory. This approach will require additional effort if you intend to move the compiled libraries to a different directory.

## Documentation

* Overview.
* [Building and installing](docs/building.md)
* [Getting started](docs/getting_started.md)
* [Performance benchmarks](docs/benchmarks.md)

### Contributing

Interested in contributing? Fork and submit a pull request and it will be reviewed.
@@ -76,7 +59,7 @@ Interested in contributing? Fork and submit a pull request and it will be review
We are actively developing Tachyon and are always interested in improving its quality. If you run into an issue, please report the problem on our Issue tracker. Be sure to add enough detail to your report that we can reproduce the problem and address it. We have not reached version 1.0 and as such the specification and/or the API interfaces may change.

### Version
This is Tachyon 0.6.0. Tachyon follows [semantic versioning](https://semver.org/).
This is Tachyon 0.6.1. Tachyon follows [semantic versioning](https://semver.org/).

### History
Tachyon grew out of the [Tomahawk][tomahawk] project for calculating genome-wide linkage-disequilibrium.
8 changes: 4 additions & 4 deletions docs/building.md
@@ -18,14 +18,14 @@ For Ubuntu, Debian, and Mac systems, installation is easy: just run
```bash
git clone --recursive https://github.com/mklarqvist/tachyon
cd tachyon
sudo ./install.sh
./install.sh
```
Note the added `--recursive` flag to the clone request. This flag is required to additionally pull down the latest third-party dependencies. The install.sh file depends extensively on apt-get, so it is unlikely to run without extensive modifications on non-Debian-based systems.
If you do not have super-user privileges required to install new packages on your system then run
Note the added `--recursive` flag to the clone request. This flag is required to additionally pull down the latest third-party dependencies. The install.sh file depends extensively on `apt-get`, so it is unlikely to run without extensive modifications on non-Debian-based systems.
If you do not have super-user (administrator) privileges required to install new packages on your system then run the local installation:
```bash
./install.sh local
```
In this situation, all required dependencies are downloaded and built in the current directory. This approach will require additional effort if you intend to move the compiled libraries to a new directory.
When installing locally the required dependencies are downloaded and built in the current directory. This approach will require additional effort if you intend to move the compiled libraries to a different directory.

[openssl]: https://www.openssl.org/
[zstd]: https://github.com/facebook/zstd
8 changes: 8 additions & 0 deletions include/buffer.h
@@ -110,6 +110,10 @@ struct yon_buffer_t {
self_type& operator+=(const int32_t& value);
self_type& operator+=(const double& value);
self_type& operator+=(const uint64_t& value);
// Apple LLVM requires the definition for unsigned long.
#ifdef __APPLE__
self_type& operator+=(const unsigned long& value);
#endif
self_type& operator+=(const int64_t& value);
self_type& operator+=(const std::string& value);

@@ -130,6 +134,10 @@ struct yon_buffer_t {
friend self_type& operator>>(self_type& data, uint16_t& target);
friend self_type& operator>>(self_type& data, uint32_t& target);
friend self_type& operator>>(self_type& data, uint64_t& target);
// Apple LLVM requires the definition for unsigned long.
#ifdef __APPLE__
friend self_type& operator>>(self_type& data, unsigned long& target);
#endif
friend self_type& operator>>(self_type& data, int8_t& target);
friend self_type& operator>>(self_type& data, int16_t& target);
friend self_type& operator>>(self_type& data, int32_t& target);
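
The `__APPLE__` guard in this file is presumably needed because `uint64_t` is `unsigned long long` on macOS, which makes `unsigned long` (and therefore `size_t`) a distinct type that would otherwise be ambiguous between the integer overloads. On 64-bit Linux `uint64_t` is typically `unsigned long` itself, so an unguarded second overload would be a redeclaration. The following is a minimal sketch of that pattern; `tiny_buffer` and `append_raw` are hypothetical stand-ins, not the `yon_buffer_t` API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a byte buffer; not the real yon_buffer_t.
struct tiny_buffer {
    std::vector<char> bytes;

    // Append the raw bytes of a trivially copyable value.
    template <class T>
    tiny_buffer& append_raw(const T& value) {
        const char* p = reinterpret_cast<const char*>(&value);
        bytes.insert(bytes.end(), p, p + sizeof(T));
        return *this;
    }

    tiny_buffer& operator+=(const int32_t& value)  { return append_raw(value); }
    tiny_buffer& operator+=(const uint64_t& value) { return append_raw(value); }
#ifdef __APPLE__
    // On macOS uint64_t is unsigned long long, so unsigned long is a separate
    // type; without this overload the call would be ambiguous.
    tiny_buffer& operator+=(const unsigned long& value) { return append_raw(value); }
#endif
};
```

With this guard, appending an `int32_t` followed by a `uint64_t` grows `bytes` by 12 bytes on either platform, and an `unsigned long` argument resolves to exactly one overload.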
6 changes: 6 additions & 0 deletions include/genotypes.h
@@ -338,6 +338,8 @@ bool yon_gt::EvaluateRecordsM1_(){
}
assert(n_total == this->n_s);
this->eval_cont |= YON_GT_UN_RCDS;

return true;
}

template <class T>
@@ -378,6 +380,8 @@ bool yon_gt::EvaluateRecordsM2_(){
}
assert(n_total == this->n_s);
this->eval_cont |= YON_GT_UN_RCDS;

return true;
}

template <class T>
@@ -412,6 +416,8 @@ bool yon_gt::EvaluateRecordsM4_(){
}
assert(n_total == this->n_s);
this->eval_cont |= YON_GT_UN_RCDS;

return true;
}

/****************************
2 changes: 1 addition & 1 deletion include/primitive_container.h
@@ -846,7 +846,7 @@ void PrimitiveContainer<return_type>::ExpandEmpty(const uint32_t to){
if(to > this->capacity())
this->resize(to + 10);

for(int i = 0; i < n_entries_; to)
for(int i = 0; i < n_entries_; ++i)
entries_[i] = std::numeric_limits<return_type>::min() + 1;
}
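
The old loop header used `to` as its increment expression, so `i` never advanced and the loop could not terminate once entered. A minimal sketch of the corrected fill, assuming the intent is to write the sentinel `min() + 1` into every slot; `fill_with_sentinel` is a hypothetical helper operating on a `std::vector`, not the `PrimitiveContainer` member:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper mirroring the corrected loop: the increment must
// advance i rather than merely evaluate an unrelated variable.
template <class T>
void fill_with_sentinel(std::vector<T>& entries) {
    const T sentinel = std::numeric_limits<T>::min() + 1;
    for (std::size_t i = 0; i < entries.size(); ++i)
        entries[i] = sentinel;
}
```

Calling `fill_with_sentinel` on a `std::vector<int>` leaves every element equal to `std::numeric_limits<int>::min() + 1`.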

4 changes: 2 additions & 2 deletions include/variant_block.h
@@ -517,15 +517,15 @@ class yon_vb_istats {
}

//
inline self_type& operator+=(const const_reference entry){
inline self_type& operator+=(const_reference entry){
if(this->size() + 1 == this->n_capacity_)
this->resize();


this->entries_[this->n_entries_++] = entry;
return(*this);
}
inline self_type& add(const const_reference entry){ return(*this += entry); }
inline self_type& add(const_reference entry){ return(*this += entry); }

void resize(const uint32_t new_size){
if(new_size == this->capacity()) return;
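
Dropping the extra `const` does not change the parameter type: when `const_reference` is a typedef for `const T&`, a further `const` applied through the typedef is ignored by the language, and some compilers warn about it. A minimal sketch, in which `tiny_container` is a hypothetical class that assumes the usual `const T&` typedef:

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical container with the conventional member typedefs.
template <class T>
struct tiny_container {
    typedef T        value_type;
    typedef const T& const_reference;

    std::vector<T> entries;

    // A const added to a reference type through a typedef is ignored,
    // so both spellings name the same type.
    static_assert(std::is_same<const const_reference, const_reference>::value,
                  "const on a reference typedef has no effect");

    tiny_container& operator+=(const_reference entry) {
        entries.push_back(entry);
        return *this;
    }
    tiny_container& add(const_reference entry) { return *this += entry; }
};
```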
2 changes: 0 additions & 2 deletions install.sh
@@ -72,8 +72,6 @@ git clone https://github.com/samtools/htslib.git
fi
cd htslib
if [ ! -f htslib.so ]; then
# Temporary fix for broken htslib compatibility with C++
git checkout 1832d3a1b75133e55fb6abffc3f50f8a6ed5ceae
autoheader && autoconf && ./configure CPPFLAGS="-I/usr/local/include/" LDFLAGS="-L/usr/local/lib/" && make -j$(nproc)
else
echo "htslib already built! Skipping..."
4 changes: 3 additions & 1 deletion lib/core/genotypes.cpp
@@ -193,7 +193,7 @@ yon_gt_rcd::yon_gt_rcd(yon_gt_rcd&& other) :

yon_gt_rcd& yon_gt_rcd::operator=(yon_gt_rcd&& other){
if(this == &other) return(*this);
delete allele; allele = nullptr;
delete[] allele; allele = nullptr;
std::swap(allele, other.allele);
run_length = other.run_length;
return(*this);
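
The switch to `delete[]` implies that `allele` is allocated with `new[]`; releasing such memory with plain `delete` is undefined behaviour. A minimal sketch of the same move-assignment shape under that assumption, where `rcd` is a hypothetical type and not the real `yon_gt_rcd`:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical record owning a new[]-allocated array.
struct rcd {
    uint8_t* allele = nullptr;
    uint32_t run_length = 0;

    rcd() = default;
    rcd(uint32_t n, uint32_t len) : allele(new uint8_t[n]()), run_length(len) {}
    rcd(const rcd&) = delete;
    rcd& operator=(const rcd&) = delete;

    rcd& operator=(rcd&& other) noexcept {
        if (this == &other) return *this;
        delete[] allele;                  // array form matches new[]
        allele = nullptr;
        std::swap(allele, other.allele);  // other is left holding nullptr
        run_length = other.run_length;
        return *this;
    }

    ~rcd() { delete[] allele; }
};
```

After `a = std::move(b)`, `a` owns the array that `b` held and `b.allele` is null, so both destructors run safely.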
@@ -1170,6 +1170,8 @@ bool yon_gt_summary::LazyEvaluate(void){
}
else this->d->f_pic = 0;
}

return true;
}

bool yon_occ::ReadTable(const std::string file_name,
4 changes: 2 additions & 2 deletions lib/index/variant_index_meta.h
@@ -65,15 +65,15 @@ class VariantIndexMeta{
inline const_iterator cbegin() const{ return const_iterator(&this->__entries[0]); }
inline const_iterator cend() const{ return const_iterator(&this->__entries[this->n_entries]); }

inline self_type& operator+=(const const_reference index_entry){
inline self_type& operator+=(const_reference index_entry){
if(this->size() + 1 == this->n_capacity)
this->resize();


this->__entries[this->n_entries++] = index_entry;
return(*this);
}
inline self_type& add(const const_reference index_entry){ return(*this += index_entry); }
inline self_type& add(const_reference index_entry){ return(*this += index_entry); }

void resize(void){
pointer temp = this->__entries;