A toolset for transforming ClinVar's XML release into JSON files.
The ClinVar database, maintained by the National Center for Biotechnology Information (NCBI), provides information about medically important variants, their associated phenotypes, and the effects of sequence changes. With help from the NIH Clinical Genome Resource (ClinGen) project, ClinVar is becoming an authoritative resource for the submission and interpretation of medical variants. ClinVar’s content can be accessed in several ways: through the website for live queries, through release file downloads for integrated data analysis, and through NCBI’s E-utilities application programming interface (API). However, ClinVar’s full content is available only in its Extensible Markup Language (XML) release. The XML release is structured as deeply nested nodes that require memory-efficient parsing algorithms. Consequently, many existing pipelines that extract data from ClinVar use the alternative Variant Call Format (VCF) release, which is easier to parse but contains incomplete data. Moreover, the data reported by ClinVar can be organized in many ways. As a result, multiple groups spend redundant effort on the same data warehousing work to extract the parts of the complete ClinVar release that interest them.
To address these issues and enable local warehousing of ClinVar data, here we report the design and implementation of a memory-efficient, flexible pipeline that extracts the full ClinVar dataset on a regular desktop with a typical hardware configuration. We devised a user-editable map file format that specifies which portions of ClinVar’s XML file are targeted for extraction. By editing this map file, users can extract various portions of ClinVar without modifying the pipeline code. The pipeline outputs ClinVar variants as JavaScript Object Notation (JSON) files for direct consumption by document-oriented databases. We implemented the map processor in Ruby and designed an example map file that models the official example release file. Using this map, the pipeline automatically extracted about 150,000 ClinVar variants (as of May 2015) from the XML release into JSON files. We then validated the pipeline using a series of different maps. The pipeline has been published on GitHub (https://github.com/clingendb/clinvar_xml_pipe).
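To illustrate the general idea of map-driven, memory-efficient extraction, here is a minimal sketch using Ruby's REXML stream parser. It is not the pipeline's actual code or map format: the MapListener class, the extract_records helper, the EXAMPLE_MAP layout (output field name => slash-separated element path), and the simplified sample XML are all assumptions made for illustration only.

```ruby
require "rexml/document"
require "rexml/streamlistener"
require "json"

# Hypothetical map: output field name => slash-separated element path.
# This is NOT the real map file format used by the pipeline.
EXAMPLE_MAP = {
  "title"        => "ReleaseSet/ClinVarSet/Title",
  "significance" => "ReleaseSet/ClinVarSet/ReferenceClinVarAssertion/ClinicalSignificance/Description"
}.freeze

# Stream listener that tracks only the current element path and the
# record being built, so memory use does not grow with file size.
class MapListener
  include REXML::StreamListener

  def initialize(map, record_path, &on_record)
    @wanted = map.invert          # element path => output field name
    @record_path = record_path    # path of the element that delimits one record
    @on_record = on_record
    @path = []
    @record = {}
  end

  def tag_start(name, _attrs)
    @path << name
  end

  def text(value)
    field = @wanted[@path.join("/")]
    value = value.strip
    return if field.nil? || value.empty?
    @record[field] = @record.fetch(field, "") + value
  end

  def tag_end(_name)
    # Emit the record when its enclosing element closes, then start fresh.
    if @path.join("/") == @record_path
      @on_record.call(@record)
      @record = {}
    end
    @path.pop
  end
end

# Hypothetical helper: source may be an XML String or an open File.
def extract_records(source, map, record_path, &on_record)
  REXML::Document.parse_stream(source, MapListener.new(map, record_path, &on_record))
end

# Simplified, made-up sample; real ClinVar records are far more deeply nested.
SAMPLE_XML = <<~XML
  <ReleaseSet>
    <ClinVarSet>
      <Title>Example variant A</Title>
      <ReferenceClinVarAssertion>
        <ClinicalSignificance><Description>Pathogenic</Description></ClinicalSignificance>
      </ReferenceClinVarAssertion>
    </ClinVarSet>
    <ClinVarSet>
      <Title>Example variant B</Title>
      <ReferenceClinVarAssertion>
        <ClinicalSignificance><Description>Benign</Description></ClinicalSignificance>
      </ReferenceClinVarAssertion>
    </ClinVarSet>
  </ReleaseSet>
XML

records = []
extract_records(SAMPLE_XML, EXAMPLE_MAP, "ReleaseSet/ClinVarSet") { |r| records << r }
records.each { |r| puts JSON.generate(r) }
```

In this sketch, changing which fields are extracted means editing only the map hash, not the listener, which mirrors the motivation for keeping the extraction targets in a separate map file.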
After you obtain the code, run the following commands:
export RUBYLIB=$RUBYLIB:YOUR_PATH/libs
gem install logging
export PATH=$PATH:INSTALLED_GEM_PATH
ruby parse_clinvar_xml.rb examples/RCV000077146.xml maps/example.tsv.json
See the maps folder for example map files. To generate a custom map file, first edit the example.tsv file and then run:
ruby parse_map.rb your.tsv 1> your.tsv.json