Merge pull request #185 from TheJacksonLaboratory/release-v1.0.0-RC2

Release v1.0.0 rc2
monarch-initiative · Jul 12, 2021 · 162e0f1
2 parents f03263b + 4f8530e
commit 162e0f1

Showing 149 changed files with 1,202 additions and 6,308 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -2,11 +2,24 @@
 Changelog
 =========
 
+
+----------
+v1.0.0-RC2
+----------
+
+- Implement VCF output format
+- Remove obsolete code from the repo
+- Improve documentation & test coverage
+- Bug fixes
+  - remove null pointer in ``GeneService``
+  - do not run coverage filter if the coverage data is missing for a variant
+
+
 ----------
 v1.0.0-RC1
 ----------
 
-- Rename `annotate` CLI command to `prioritize`
+- Rename ``annotate`` CLI command to ``prioritize``
 - Multiple minor adjustments
 
 

diff --git a/README.md b/README.md
@@ -1,83 +1,14 @@
-# SvAnna
+# SvAnna - Structural Variant Annotation and Analysis
+
+![Java CI with Maven](https://github.com/TheJacksonLaboratory/SvAnna/workflows/Java%20CI%20with%20Maven/badge.svg)
+[![Documentation Status](https://readthedocs.org/projects/squirls/badge/?version=latest)](https://svanna.readthedocs.io/en/latest/?badge=latest)
+
+![Java CI with Maven](https://github.com/TheJacksonLaboratory/SvAnna/workflows/Java%20CI%20with%20Maven/badge.svg)
+[![Documentation Status](https://readthedocs.org/projects/svanna/badge/?version=latest)](https://svanna.readthedocs.io/en/latest/?badge=latest)
 
 Efficient and accurate pathogenicity prediction for coding and regulatory structural variants in long-read genome sequencing
 
 Most users should download the latest SvAnna distribution ZIP file from
 the [Releases page](https://github.com/TheJacksonLaboratory/SvAnna/releases).
 
-Please consult the Read the docs site for detailed documentation - TODO - setup RTD.
-
-## Attic
-
-**The text below is out of sync, and the most useful parts of the text will be moved to *Read the docs*.**
-
-**The documentation needs to be completed.**
-
-### Creating the Jannovar transcript file
-[Jannovar](https://github.com/charite/jannovar) is a Java app/library for annotating
-VCF files. Its main use case is for small variants and their intersection with
-protein coding sequences. We will use it here to extract the positions of genes and
-SVs, but it may be easier just to start with a gencode GFF file in the future.
-
-Jannovar downloads various files and creates a transcript file that it uses for VCF annotation.
-At present, NCBI etc has changed the location of some files so that only the develop branch
-of Jannovar works. Enter the following commands to create the transcript file
-
-```
-git clone
-https://github.com/charite/jannovar.git
-cd jannovar
-git checkout develop
-mvn package
-java [-Xmx8g] -jar jannovar-cli-0.36-SNAPSHOT.jar download -d hg38/refseq_curated 
-```
-This command downloads various files and generates `data/hg38_refseq_curated.ser`. either move
-this to the data subdirectory in this project or softlink it (from 'data', enter `ln -s <path>`).
-Thus, for now, this project expects the path `data/data/refseq_curated.ser`.
-
-## Running svann
-
-Enter the following command to see options. The LIRICAL file is the 
-LIRICAL TSV output file. The enhancers file is created by the
-https://github.com/pnrobinson/tspec app. To use the enhancers file
-it is required to also use an HPO term with the major phenotypic abnormality, 
-e.g., [Abnormality of the immune system](https://hpo.jax.org/app/browse/term/HP:0002715).
-
-```
-$  java -jar target/svann.jar annotate -h
-  Usage: svann annotate [-hV] [-e=<enhancerFile>] [-g=<geneCodePath>]
-                        [-j=<jannovarPath>] [-t=<hpoTermIdList>] -v=<vcfFile>
-                        [-x=<outprefix>]
-  annotate VCF file
-    -e, --enhancer=<enhancerFile>
-                               tspec enhancer file
-    -g, --gencode=<geneCodePath>
-  
-    -h, --help                 Show this help message and exit.
-    -j, --jannovar=<jannovarPath>
-                               prefix for output files (default:
-                                 data/data/hg38_refseq_curated.ser )
-    -t, --term=<hpoTermIdList> HPO term IDs (comma-separated list)
-    -v, --vcf=<vcfFile>
-    -V, --version              Print version information and exit.
-    -x, --prefix=<outprefix>   prefix for output files (default: L2O )
-```
-
-
-
-
-# Documentation
-
-Generate the read the docs documentation locally by going to the ``docs`` subdirectory.
-First generate a virtual environment and install the required sphinx packages. ::
-
-    virtualenv p38
-    source p38/bin/activate
-    pip install sphinx sphinx-rtd-theme
-
-To create the documentation, ensure you are using the ``p38`` environment and enter the following command. ::
-
-    source p38/bin/activate
-    make html
-
-This will generate HTML pages under ``_build/html``.
+Please consult the Read the docs site for [detailed documentation](https://svanna.readthedocs.io/en/latest).
diff --git a/docs/Makefile b/docs/Makefile
@@ -3,11 +3,10 @@
 
 # You can set these variables from the command line.
 SPHINXOPTS    =
-SPHINXBUILD   = python -msphinx
-SPHINXPROJ    = svann
+SPHINXBUILD   = sphinx-build
+SPHINXPROJ    = SvAnna
 SOURCEDIR     = .
 BUILDDIR      = _build
-html_static_path = ['..']
 
 # Put it first so that "make" without argument is like "make help".
 help:

diff --git a/docs/conf.py b/docs/conf.py
@@ -46,7 +46,7 @@
 
 # General information about the project.
 project = u'SvAnna'
-copyright = u'2021'
+copyright = u'2021, Daniel Danis, Peter N Robinson'
 author = u'Daniel Danis, Peter Robinson'
 
 # The version info for the project you're documenting, acts as replacement for
@@ -56,7 +56,7 @@
 # The short X.Y version.
 version = u'1.0'
 # The full version, including alpha/beta/rc tags.
-release = u'1.0.0-RC1'
+release = u'1.0.0-RC2'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.
@@ -142,7 +142,7 @@
 #  author, documentclass [howto, manual, or own class]).
 latex_documents = [
     (master_doc, 'SvAnna.tex', u'svann Documentation',
-     u'Peter Robinson', 'manual'),
+     u'Daniel Danis, Peter N Robinson', 'manual'),
 ]
 
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -1,22 +1,22 @@
-SvAnna: Annotation of Structural Variants in VCF files
-=====================================================
+SvAnna:
+=======
 
+Efficient and accurate pathogenicity prediction for coding and regulatory structural variants in long-read genome sequencing
 
-SvAnna
-~~~~~
-
-This application annotates structural variants in VCF files, focussing specifically on long-read WGS analysis
+SvAnna performs phenotype-driven prioritization of structural variants in VCF files, focusing specifically on long-read WGS analysis
 of germline variants.
 
 
-
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
 
+   quickstart
    setup
-   enhancers
    running
-   BND<bndannotations>
-   structuralvariation
+   outputformats
+
 
+.. structuralvariation
+.. enhancers
+.. BND<bndannotations>
diff --git a/docs/outputformats.rst b/docs/outputformats.rst
@@ -0,0 +1,70 @@
+.. _rstoutputformats:
+
+==============
+Output formats
+==============
+
+SvAnna supports storing results in 4 output formats: *HTML*, *VCF*, *CSV*, and *TSV*. Use the ``--output-format`` option
+to select one or more of the desired output formats (e.g. ``--output-format html,vcf``).
+
+HTML output format
+^^^^^^^^^^^^^^^^^^
+
+SvAnna creates an *HTML* file with the analysis summary and with variants sorted by the :math:`TAD_{SV}` score
+in descending order.
+By default, the top 100 variants are included in the report. The number of reported variants can be adjusted with
+the ``--report-top-variants`` option.
+
+The report consists of several parts:
+
+* *Analysis summary* - Details of HPO terms of the proband, paths of the input files, and the analysis parameters.
+* *Variant counts* - Breakdown of the variant counts by variant type and category.
+* *Prioritized SVs* - Visualizations of the prioritized variants.
+
+.. TODO - write more about the HTML report
+
+.. note::
+  Only the variants that passed all the filters are visualized in the *Prioritized SVs* section.
+
+The ``--no-breakends`` option excludes breakend/translocation variants from the report.
+
+VCF output format
+^^^^^^^^^^^^^^^^^
+When ``vcf`` is included in the ``--output-format`` option, a VCF file with all input variants is created.
+The prioritization adds a new *INFO* field to each variant:
+
+* ``TADSV`` - an *INFO* field containing the :math:`TAD_{SV}` score for the variant.
+
+.. note::
+  * The ``--report-top-variants`` option has no effect on the *VCF* output format.
+  * Add the ``--uncompressed-output`` flag to get an uncompressed VCF file.
+
+
+CSV/TSV output format
+^^^^^^^^^^^^^^^^^^^^^
+To write the *n* most deleterious variants into a *CSV* (or *TSV*) file, use ``csv`` (``tsv``) in the ``--output-format`` option.
+
+The results are written into a tabular file with the following columns:
+
+* *contig* - name of the contig/chromosome (e.g. ``1``, ``2``, ``X``)
+* *start* - 0-based start coordinate (excluded) of the variant on positive strand
+* *end* - 0-based end coordinate (included) of the variant on positive strand
+* *id* - variant ID as it was present in the input VCF file
+* *vtype* - variant type, one of {``DEL``, ``DUP``, ``INV``, ``INS``, ``BND``, ``CNV``}
+* *failed_filters* - names of the filters that the variant failed, separated by semicolons (``;``)
+  * ``filter`` - the variant failed upstream VCF filters: at least one filter flag other than ``PASS`` is present in the variant VCF line.
+  * ``coverage`` - the variant is supported by fewer reads than specified by the ``--min-read-support`` option
+* *tadsv* - the :math:`TAD_{SV}` score value
+
+.. table:: Tabular output
+
+  ======== ========= ========== ====== ======= ================= =====================
+   contig    start      end       id    vtype   failed_filters         tadsv
+  ======== ========= ========== ====== ======= ================= =====================
+   11       31130456  31671718   abcd   DEL                       109.75766900764305
+   18       46962113  46969912   efgh   DUP     filter;coverage   3.2
+   ...      ...       ...        ...    ...     ...               ...
+  ======== ========= ========== ====== ======= ================= =====================
+
+.. note::
+  Add the ``--uncompressed-output`` flag to get an uncompressed tabular file.
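
A note on consuming the tabular output documented in `docs/outputformats.rst`: the column list maps onto a small parser. The Python sketch below is only an illustration under stated assumptions, namely that the file starts with a header row carrying the documented column names and that an empty `failed_filters` cell means the variant passed all filters. `SAMPLE_TSV`, `read_variants`, and `passing_by_score` are hypothetical names, not part of SvAnna; the two sample rows are copied from the example table.

```python
import csv
import io

# Hypothetical sample mirroring the documented columns; a real run would
# read the file written by SvAnna (use --uncompressed-output for plain text).
SAMPLE_TSV = (
    "contig\tstart\tend\tid\tvtype\tfailed_filters\ttadsv\n"
    "11\t31130456\t31671718\tabcd\tDEL\t\t109.75766900764305\n"
    "18\t46962113\t46969912\tefgh\tDUP\tfilter;coverage\t3.2\n"
)


def read_variants(handle, delimiter="\t"):
    """Parse the tabular output into a list of dicts with typed fields."""
    variants = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        start, end = int(row["start"]), int(row["end"])
        # Assumption: filter names are separated by ';' and an empty cell
        # means no filter failed.
        filters = [name for name in row["failed_filters"].split(";") if name]
        variants.append({
            "contig": row["contig"],
            "start": start,
            "end": end,
            # Start is excluded and end is included, so the number of
            # reference bases covered is simply end - start.
            "ref_span": end - start,
            "id": row["id"],
            "vtype": row["vtype"],
            "failed_filters": filters,
            "tadsv": float(row["tadsv"]),
        })
    return variants


def passing_by_score(variants):
    """Keep variants that failed no filter, highest score first."""
    passing = [v for v in variants if not v["failed_filters"]]
    return sorted(passing, key=lambda v: v["tadsv"], reverse=True)


variants = read_variants(io.StringIO(SAMPLE_TSV))
top = passing_by_score(variants)
print([v["id"] for v in top])
```

For a CSV file the same functions should work with `delimiter=","`, again assuming the header names match the documented columns.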
diff --git a/docs/quickstart.rst b/docs/quickstart.rst
@@ -0,0 +1,101 @@
+.. _rstquickstart:
+
+==========
+Quickstart
+==========
+
+This document is intended for impatient users who want to quickly set up SvAnna and prioritize variants.
+
+Prerequisites
+^^^^^^^^^^^^^
+
+SvAnna is written in Java 11 and needs Java 11+ to be present in the runtime environment. Please verify that you are
+using Java 11+ by running::
+
+  java -version
+
+
+SvAnna setup
+^^^^^^^^^^^^
+
+SvAnna is installed by running the following three steps.
+
+1. Download SvAnna distribution ZIP
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Download and extract the SvAnna distribution ZIP archive from `here <https://github.com/TheJacksonLaboratory/SvAnna/releases>`_.
+Expand the *Assets* menu and download ``svanna-cli-{version}-distribution.zip``. Choose the latest stable version,
+or a release candidate (RC).
+
+After unzipping the distribution archive, run the following command to display the help message::
+
+  java -jar svanna-cli-1.0.0-RC1.jar --help
+
+.. note::
+  If things went OK, the command above will print the following help message::
+
+    Structural variant prioritization
+    Usage: svanna-cli.jar [-hV] [COMMAND]
+      -h, --help      Show this help message and exit.
+      -V, --version   Print version information and exit.
+    Commands:
+      generate-config, G  Generate a configuration YAML file
+      prioritize, P       Prioritize the variants
+    See the full documentation at `https://github.com/TheJacksonLaboratory/SvAnna`
+
+2. Download SvAnna database files
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Run the following::
+
+  wget https://svanna.s3.amazonaws.com/svanna.zip && unzip svanna.zip
+  wget https://squirls.s3.amazonaws.com/jannovar_v0.35.zip && unzip jannovar_v0.35.zip
+
+
+3. Generate & fill the configuration file
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generate the configuration file::
+
+  java -jar `pwd`/svanna/svanna-cli-1.0.0-RC1.jar generate-config svanna-config.yml
+
+Now open the generated file in your favorite text editor and provide absolute paths to the following two resources:
+
+* ``dataDirectory`` - the absolute path to the folder where SvAnna database files were extracted
+* ``jannovarCachePath`` - the absolute path to the selected Jannovar ``*.ser`` file, e.g. ``/path/to/hg38_refseq.ser``
+
+.. tip::
+  YAML syntax requires a space after the colon in each *key*: *value* pair.
+
+Note the location of the configuration file, as the path to the configuration file must be provided for all SvAnna runs.
+Having completed the steps above, you are ready to prioritize variants in a VCF file.
+
+Prioritize structural variants in VCF file
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Let's annotate a toy VCF file containing eight SVs reported in the SvAnna manuscript.
+
+First, let's download the VCF file::
+
+  wget https://github.com/TheJacksonLaboratory/SvAnna/raw/master/svanna-cli/src/examples/example.vcf
+
+The variants were sourced from published clinical case reports and each variant led to a Mendelian disease.
+
+For the purpose of this test run, let's assume that the VCF file contains SVs identified in a short/long read
+sequencing run of a patient presenting with the following clinical symptoms:
+
+* *HP:0011890* - Prolonged bleeding following procedure
+* *HP:0000978* - Bruising susceptibility
+* *HP:0012147* - Reduced quantity of Von Willebrand factor
+
+Now, let's prioritize the variants::
+
+  java -jar svanna/svanna-cli-1.0.0-RC1.jar prioritize --config svanna-config.yml --output-format html,csv,vcf --vcf example.vcf --term HP:0011890 --term HP:0000978 --term HP:0012147
+
+The variant with ID ``Othman-2010-20696945-VWF-index-FigS7`` that disrupts a promoter of the *von Willebrand factor*
+(*VWF*) gene (`Othman et al., 2010 <https://pubmed.ncbi.nlm.nih.gov/20696945>`_)
+receives the highest :math:`TAD_{SV}` score of 25.61 and is ranked first.
+
+SvAnna stores prioritization results in *HTML*, *CSV*, and *VCF* output formats next to the input VCF file.
+
+Read the :ref:`rstsetup` and :ref:`rstrunning` sections to learn all details regarding setting up and running SvAnna.
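
A note on scripting the quickstart run from `docs/quickstart.rst`: when the HPO terms come from another tool, the `prioritize` command line can be assembled programmatically. The Python sketch below only builds the argument list and uses the flags shown in the quickstart (`--config`, `--output-format`, `--vcf`, `--term`). The helper name `build_prioritize_command` is hypothetical, and the JAR, config, and VCF paths are the quickstart's example values, so adjust them to your setup.

```python
import shlex


def build_prioritize_command(jar, config, vcf, hpo_terms, output_formats):
    """Assemble the argument list for the `prioritize` command."""
    command = [
        "java", "-jar", jar, "prioritize",
        "--config", config,
        "--output-format", ",".join(output_formats),
        "--vcf", vcf,
    ]
    # The quickstart passes one --term option per HPO term.
    for term in hpo_terms:
        command += ["--term", term]
    return command


command = build_prioritize_command(
    jar="svanna/svanna-cli-1.0.0-RC1.jar",
    config="svanna-config.yml",
    vcf="example.vcf",
    hpo_terms=["HP:0011890", "HP:0000978", "HP:0012147"],
    output_formats=["html", "csv", "vcf"],
)
# Print a shell-ready string; nothing is executed here.
print(shlex.join(command))
```

Passing the same list to `subprocess.run(command, check=True)` would launch the analysis, assuming Java 11+ and the files from the setup steps are in place.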