This repository has been archived by the owner on Jan 3, 2023. It is now read-only.
Home
Karthik Gururaj edited this page Feb 25, 2016
TileDB is a system for efficiently storing, querying, and accessing sparse matrix/array data. TileDB is being developed by researchers at the [Intel Science and Technology Center for Big Data](http://istc-bigdata.org/#&panel1-1).
VariantDB is built on top of the TileDB system. Variant data is sparse by nature (sparse relative to the whole genome).
We store variant data in a 2D TileDB array where:
- Each column corresponds to a genomic position (chromosome + position)
- Each row corresponds to a sample in a VCF (or CallSet in the GA4GH terminology)
- Each cell contains data for a given sample/CallSet at a given position. Data is stored in the form of TileDB cell attributes.
- Variant interval/gVCF interval data is stored in a cell at the start of the interval. The END is stored as a cell attribute. When queried for a given genomic position, the query library performs an efficient left sweep to determine all intervals that intersect with the queried position.
- Cells are stored in column-major order, which makes accessing cells with the same column index (i.e. data for a given genomic position across all samples) fast.
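The layout described above can be illustrated with a small in-memory sketch. This is not the TileDB or VariantDB API: the class `ToyVariantArray` and its methods are hypothetical names invented for illustration. It assumes that END is inclusive and that intervals within one sample do not overlap, so the left sweep only needs the nearest cell at or to the left of the queried column in each row.

```python
from bisect import bisect_right
from collections import defaultdict


class ToyVariantArray:
    """Hypothetical stand-in for the 2D array: row = sample, column = position."""

    def __init__(self):
        # Per-row lists of (start_column, end_column, attributes), kept sorted by start.
        self._rows = defaultdict(list)

    def add_cell(self, row, start, end, attrs):
        """Store an interval in the cell at its start column; END is kept as an attribute."""
        self._rows[row].append((start, end, attrs))
        self._rows[row].sort(key=lambda cell: cell[0])

    def cells_column_major(self):
        """Return (column, row, end, attrs) tuples sorted by column first, then row."""
        flat = [
            (start, row, end, attrs)
            for row, cells in self._rows.items()
            for start, end, attrs in cells
        ]
        return sorted(flat, key=lambda cell: (cell[0], cell[1]))

    def query_position(self, column):
        """Return {row: (start, end, attrs)} for every interval covering `column`.

        For each row, look left from `column` for the nearest cell whose start is
        <= column, then keep it only if its END reaches the queried column.
        """
        hits = {}
        for row, cells in self._rows.items():
            starts = [cell[0] for cell in cells]
            i = bisect_right(starts, column) - 1  # nearest cell at or left of column
            if i >= 0 and cells[i][1] >= column:
                hits[row] = cells[i]
        return hits


arr = ToyVariantArray()
arr.add_cell(0, 100, 200, {"GT": "0/0"})  # gVCF reference block for sample 0
arr.add_cell(0, 201, 201, {"GT": "0/1"})  # single-position variant for sample 0
arr.add_cell(1, 150, 150, {"GT": "1/1"})  # single-position variant for sample 1
print(arr.query_position(150))
```

Querying column 150 returns the reference block of sample 0, which starts at column 100, because its END attribute reaches past the queried position. A real store would perform the sweep over tiles on disk rather than over Python lists.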
# Typical methodology for importing variant data into TileDB
- Assign a unique row id to each sample/CallSet. Sample/CallSet names must be unique.
- Assign a unique column range to each chromosome/contig in the “flattened” column space of the TileDB array. All chromosomes must come from the same reference genome; mixing reference genomes is untested and the outcome is unknown.
- Define the TileDB array schema with all the fields/attributes you wish to store in TileDB.
- Produce a CSV file with a list of cells and attributes for each sample/CallSet.
- Import the CSV files into TileDB.
- Workspace: A directory on the machine under which multiple TileDB arrays can be stored.
- Array: The name of the TileDB array.
- Given a workspace and array name, the TileDB framework will store its data in the directory `<workspace>/StorageManager/<array>`.
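The row-id assignment, column-range assignment, and storage path described above can be sketched as follows. All function names here are hypothetical helpers, not part of TileDB or VariantDB. The contig lengths are toy values rather than real chromosome lengths, and positions are assumed to be 0-based within a contig. The CSV layout expected by the importer is not shown because this page does not specify it.

```python
from pathlib import Path


def assign_row_ids(sample_names):
    """Give each sample/CallSet a unique row id; duplicate names are rejected."""
    if len(set(sample_names)) != len(sample_names):
        raise ValueError("Sample/CallSet names must be unique")
    return {name: row for row, name in enumerate(sample_names)}


def assign_column_offsets(contig_lengths):
    """Lay contigs end to end in the flattened column space.

    `contig_lengths` is an ordered list of (contig_name, length) pairs taken
    from a single reference genome. Returns {contig_name: first_column}.
    """
    offsets = {}
    next_offset = 0
    for name, length in contig_lengths:
        offsets[name] = next_offset
        next_offset += length
    return offsets


def flatten(offsets, contig, position):
    """Map (contig, position within contig) to a flattened column index."""
    return offsets[contig] + position


def array_directory(workspace, array):
    """Directory where the array data lives: <workspace>/StorageManager/<array>."""
    return Path(workspace) / "StorageManager" / array


rows = assign_row_ids(["sampleA", "sampleB"])
offsets = assign_column_offsets([("chr1", 1000), ("chr2", 500)])  # toy lengths
print(rows["sampleB"], flatten(offsets, "chr2", 10))
print(array_directory("/tmp/ws", "my_array"))
```

Because every contig owns a disjoint column range, a flattened column index maps back to exactly one (contig, position) pair as long as the same contig order and lengths are used for every import.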