Versioned Data System

The Galaxy and command line Versioned Data System manages the retrieval of current and past versions of selected reference sequence databases from local data stores.

Overview

This tool can be used on a server both via the command line and via the Galaxy bioinformatics workflow platform using the "Versioned Data" tool. Different kinds of content are suited to different archiving technologies, so the system provides a few storage system choices.

Fasta sequences - accession ids, descriptions and their sequences - are suited to storage as 1 line key-value pair records in a key-value store. Here we introduce a low-tech file-based database plugin for this kind of data called Kipper. It is suited entirely to the goal of producing complete versioned files. This covers much of the sequencing archiving problem for reference databases. Consult https://github.com/Public-Health-Bioinformatics/kipper for up-to-date information on Kipper.
A git archiving system plugin is also provided for software file tree archiving, with a particular file differential (diff) compression benefit for documents that have sentence-like lines added and deleted between versions.
Super-large files that are not suited to Kipper or git can be handled by a simple "folder" data store holds each version of file(s) in a separate compressed archive.
Biomaj (our reference database maintenance software) can be configured to download and store separate version files. A Biomaj plugin allows direct selection of versioned files within its "data bank" folders.

The Galaxy Versioned Data tool below, shows the interface for retrieving versions of reference database. The tool lets you select the fasta database to retrieve, and then one or more workflows. The system then generates and caches the versioned data in the data library; then links it into one's history; then runs the workflow(s) to get the derivative data (a Blast database say) and then caches that back into the data library. Future requests for that versioned data and derivatives (keyed by workflow id and input data version ids) will return the data already from cache rather than regenerating it, until the cache is deleted.

Project goals

To enable reproducible molecular biology research: To recreate a search result at a certain point in time we need versioning so that search and mapping tools can look at reference sequence databases corresponding to a particular past date or version identifier. This recall can also explain the difference between what was known in the past vs. currently.
To reduce hard drive space. Some databases are too big to keep N copies around, e.g. 5 years of 16S, updated monthly, is say, 670Mb + 668Mb + 665Mb + .... (Compressing each file individually is an option but even better we could store just the differences between subsequent versions.)
Maximize speed of archive recall. Understanding that the archived version files can be large, we'd ideally like a versioned file to be retrieved in the time it takes to write a file of that size to disk. Caching this data and its derivatives (makeblastdb databases for example) is important.
Improve sequence archive management. Provide an admin interface for managing regular scheduled import and log of reference sequence databases from our own and 3rd party sources like NCBI and cpndb.ca .
Integrate database versioning into the Galaxy workflow management software without adding a lot of complexity.
A bonus would be to enable the efficient sharing of versioned data between computers/servers.

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
data_stores		data_stores
doc		doc
LICENSE.md		LICENSE.md
README.md		README.md
__init__.py		__init__.py
bccdc_macros.xml		bccdc_macros.xml
data_store_utils.py		data_store_utils.py
vdb_common.py		vdb_common.py
vdb_retrieval.py		vdb_retrieval.py
versioned_data.py		versioned_data.py
versioned_data.xml		versioned_data.xml
versioned_data_cache_clear.py		versioned_data_cache_clear.py
versioned_data_form.py		versioned_data_form.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Versioned Data System

Overview

Project goals

About

Releases

Packages

Languages

License

Public-Health-Bioinformatics/versioned_data

Folders and files

Latest commit

History

Repository files navigation

Versioned Data System

Overview

Project goals

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages