bjornnystedt edited this page Jun 13, 2016 · 8 revisions

WGSstructvar

Implementing a workflow for standardized structural variation calling and filtering

Background and description

Human WGS is expected to rapidly gain in popularity, and SciLifeLab has invested in a large sequencing capacity, approaching a rate of 10,000 samples per year during Q3/Q4 2016. Structural variation (SV) calling is not yet part of the standard variant calling at NGI. There is no global consensus on which programs should be used for SV calling, but over the last year we have gained experience and performed benchmarks, which (together with public benchmarks) allow us to provide valuable recommendations. A key observation is that filtering the SV calls in a sample against SV calls from an ethnically similar reference population appears to considerably reduce the number of false positives and/or common SVs. For this to work well, the same SV calling method needs to be applied both to the (disease) sample and to the reference cohort.

We want to produce a workflow (WF) for SV calling in Human WGS projects, to be run on the Swedish reference population as well as by research groups and NBIS staff. It should be possible to incorporate the WF into the production analysis pipelines at NGI and Clinical Diagnostics (although the actual implementation is outside the scope of the current project). The WF should include filtering against an SV population frequency table.
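The filtering step could look roughly like the sketch below. It is only an illustration, not the project's method: the `SV` record, the reciprocal-overlap matching rule, and the thresholds `max_freq=0.01` and `min_overlap=0.5` are all assumptions made for the example, since neither the matching criterion nor the cutoffs have been decided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    """Hypothetical minimal SV record used only in this sketch."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    svtype: str  # e.g. "DEL", "DUP", "INV"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Shared length divided by the longer interval; 0.0 if no match is possible."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return shared / max(a.end - a.start, b.end - b.start)


def filter_rare(sample_calls, freq_table, max_freq=0.01, min_overlap=0.5):
    """Keep calls that match no population SV with frequency above max_freq.

    Both thresholds are illustrative assumptions, not project decisions.
    """
    kept = []
    for call in sample_calls:
        common = any(
            freq > max_freq and reciprocal_overlap(call, pop_sv) >= min_overlap
            for pop_sv, freq in freq_table.items()
        )
        if not common:
            kept.append(call)
    return kept


# Toy population frequency table: SV -> fraction of reference samples carrying it.
freq_table = {
    SV("1", 1000, 2000, "DEL"): 0.25,
    SV("2", 5000, 9000, "DUP"): 0.002,
}
sample_calls = [
    SV("1", 1050, 2010, "DEL"),  # matches a common population deletion
    SV("2", 5000, 9000, "DUP"),  # matches a rare population duplication
    SV("3", 100, 400, "INV"),    # absent from the population table
]
rare_calls = filter_rare(sample_calls, freq_table)
```

In this toy data only the first call is removed, because it overlaps a population deletion carried by 25% of the reference samples. The linear scan over the whole table is kept for readability; a cohort-scale table would presumably need an interval index.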

Values

We want a stable and standardized workflow for SV calling in Human WGS projects. The workflow should enable scientific discovery by Swedish research groups by

  • running SV calling on Human WGS samples, including variant annotation
  • filtering sample SV calls against SV frequencies from the 1000 samples in the Swedish reference population.
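As a rough illustration of where such SV frequencies could come from, the sketch below counts, for each SV, the fraction of cohort samples that carry it. The function name `build_frequency_table`, the `(chrom, start, end, svtype)` key, and exact-coordinate matching are assumptions made for the example; real calls would need breakpoint-tolerant merging, and the result is a carrier frequency, not an allele frequency.

```python
from collections import Counter


def build_frequency_table(cohort_calls):
    """Map each SV key to the fraction of cohort samples that carry it.

    cohort_calls: dict of sample ID -> iterable of SV keys, where a key is a
    hypothetical (chrom, start, end, svtype) tuple. Exact key matching is an
    assumption of this sketch.
    """
    n_samples = len(cohort_calls)
    if n_samples == 0:
        return {}
    carriers = Counter()
    for calls in cohort_calls.values():
        # set() so that a sample is counted at most once per SV
        carriers.update(set(calls))
    return {sv: count / n_samples for sv, count in carriers.items()}


# Toy cohort of four samples (the real reference population has 1000).
cohort_calls = {
    "sample_A": [("1", 1000, 2000, "DEL"), ("2", 5000, 9000, "DUP")],
    "sample_B": [("1", 1000, 2000, "DEL")],
    "sample_C": [("1", 1000, 2000, "DEL"), ("1", 1000, 2000, "DEL")],
    "sample_D": [],
}
freq_table = build_frequency_table(cohort_calls)
```

Samples without any calls still count in the denominator, which is why the cohort is passed as a mapping of all samples and not as a flat list of calls.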

Requirements

Always: “Findable, Useful, Citeable”

  • Quality (sorry for being fluffy…)
    • Enable scientific discovery
    • Be largely accepted in the research community
  • Performance
    • Flexibility in software to avoid overuse of CPU-intense programs
    • Easy tracking and re-run of failed runs for large sample sets (1000+)
    • Available (through the module system?) at relevant Uppmax clusters
    • Efficient use of CPU at Uppmax
    • Possible to run at production hardware at NGI and Clinical Genomics
  • Tools
    • Implemented in Nextflow, to allow synergy between ongoing SciLifeLab development projects (RNASeq, Somatic variant calling)

Note that SV is an ill-defined concept that includes a number of redundant subclasses, e.g. insertions, deletions, copy-number variations (CNV), inversions, and translocations. The subclasses covered by this project will need to be specified.

Main stakeholders

  • The Swedish reference population project (running SV calling on 1000 individuals; the SV calls will be used to produce an SV frequency table as part of the SweFreq project).
  • The SciLifeLab Human WGS National Projects and other Human WGS projects, and NBIS staff supporting such projects.
  • NGI (Max Käller) and Clinical Genomics (Valtteri Wirta)

Organisation

  • Owner: Björn Nystedt (WGS ToolBox Coordinator)
  • Project leader/Scrum master: Pall Olason
  • Coders: Johan Viklund, Samuel Lampa

Project charter

First draft: https://docs.google.com/presentation/d/17jhDfkGswnnI6YiSeiQNAGaMERZZDmeSRx6HHgH0PI0/edit#slide=id.p4