Add UMI reads processing capability #145

lescai · 2020-03-05T11:42:25Z

nf-core/sarek pull request


PR checklist

This pull request introduces code to process reads containing UMIs. Unique Molecular Indices (UMIs) are particularly important for somatic workflows that aim to detect variants at very low allele fractions (MRD, liquid biopsy). The chosen workflow adopts the fgbio tools, which provide a robust method for identifying UMI groups and create a consensus read within each group. See blog and ref.
The approach ensures downstream compatibility with the workflow: the result of the UMI process is an unmapped BAM (uBAM), which can then be fed into MappingReads and, further downstream, into HaplotypeCaller and, more importantly, Mutect2 or Strelka.
Tests are a work in progress: datasets have been identified for 2 different UMI types (QIAseq and Illumina TSO), but I cannot complete the test runs on a laptop.
As indicated above, the reads will be uploaded to the nf-core/sarek branch of the nf-core/test-datasets repo.
The code has passed linting (nf-core lint .).
Documentation in docs has been updated.
CHANGELOG.md has not been updated yet.
README.md has not been updated yet (not sure if this is relevant).
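
For readers unfamiliar with the idea, here is a minimal Python sketch of what "a consensus read within the same UMI group" means. It is a toy illustration only, not the fgbio implementation: it groups reads by exact UMI match and takes a per-position majority vote, whereas fgbio also takes mapping position into account, tolerates errors in the UMI itself, and uses base qualities when calling the consensus. The function names and example reads are hypothetical.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Majority-vote base at each position across equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def collapse_by_umi(reads):
    """reads: iterable of (umi, sequence) pairs. Returns {umi: consensus sequence}."""
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)  # exact-match grouping (an assumption of this toy)
    return {umi: consensus(seqs) for umi, seqs in groups.items()}


# Hypothetical reads: three copies of one molecule, one carrying a sequencing error.
reads = [
    ("ACGT", "TTGACCA"),
    ("ACGT", "TTGACCA"),
    ("ACGT", "TTGTCCA"),  # error at position 3, outvoted by the other two reads
    ("GGCA", "TTGACGA"),
]
print(collapse_by_umi(reads))
```

The error in the third read is removed because the other two reads from the same UMI group outvote it, which is why this kind of collapsing helps with very low allele-fraction variants.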

main.nf

maxulysse

Looks amazing.
We just need the test data to do some CI.
Can you update the CHANGELOG as well?

docs/usage.md

maxulysse · 2020-03-05T11:54:37Z

I made a couple of suggestions; if you accept them, you can batch-commit them.

Co-Authored-By: Maxime Garcia <maxime.garcia@scilifelab.se>

main.nf

docs/usage.md

lescai

Agreed with the suggestions; I reviewed them and made the text more explicit.

chelauk · 2020-07-20T11:57:15Z

Hi, any updates on adding UMI to variant calling? Is it working? Otherwise I will build a new pipeline.

lescai · 2020-07-20T12:14:25Z

Hi @chelauk, I'm not sure what's holding up the pull request at this stage. I did test everything at the March hackathon using the test data here
https://github.com/nibscles/test-datasets/tree/sarek-umi
And it's working.
We are using our standalone code at NIBSC, if you don't want to build something new: it's undocumented, but it will give you an unmapped BAM, which you can then feed into Sarek if you like.
https://github.com/nibscbioinformatics/core/tree/master/workflows/umi
The important thing is to pass the correct UMI structure (see documentation for this pull request)
https://github.com/nibscles/sarek/blob/umi/docs/usage.md#--umi
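
To illustrate what a UMI structure describes, here is a small Python sketch. It assumes the structure is written in fgbio-style read-structure notation: a sequence of `<length><type>` segments, where `T` is template, `M` is the molecular barcode (UMI), `B` is a sample barcode, `S` is skipped bases, and `+` in place of a length means "the rest of the read" and is only allowed on the last segment. The helper names and the example structure `8M+T` are hypothetical; check the linked documentation for the structure that matches your kit.

```python
import re

# One segment is a length (digits, or "+" for "rest of read") and a type letter.
SEGMENT = re.compile(r"(\d+|\+)([TBMS])")


def parse_read_structure(structure):
    """Parse e.g. '8M+T' into [(8, 'M'), (None, 'T')]; None means 'rest of read'."""
    segments = SEGMENT.findall(structure)
    if not segments or "".join(l + k for l, k in segments) != structure:
        raise ValueError(f"invalid read structure: {structure!r}")
    if any(length == "+" for length, _ in segments[:-1]):
        raise ValueError("'+' is only allowed on the last segment")
    return [(None if l == "+" else int(l), k) for l, k in segments]


def split_read(seq, structure):
    """Split a read sequence into its template, barcode, UMI and skipped parts."""
    parts = {"T": "", "B": "", "M": "", "S": ""}
    pos = 0
    for length, kind in parse_read_structure(structure):
        end = len(seq) if length is None else pos + length
        parts[kind] += seq[pos:end]
        pos = end
    return parts


# Hypothetical read: 8 UMI bases followed by template.
parts = split_read("ACGTACGTTTGACCA", "8M+T")
print(parts["M"], parts["T"])
```

With the wrong structure the UMI bases end up in the template (or the reverse), so the grouping step works on the wrong sequence; that is why passing the correct structure matters.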

maxulysse · 2020-07-20T14:06:48Z

Hi @chelauk @nibscles
I wanted to close this one during the hackathon, but did not find the time.
I'll make sure to merge it by the end of the week.
We're planning a minor release soonish as well.

main.nf

maxulysse · 2020-07-22T20:03:44Z

@chelauk @nibscles
PR is finally merged.
I'll be making another PR regarding UMI, with some tiny code polishing and added tests.

Francesco added 12 commits March 4, 2020 14:17

added software to environment file

16df96b

added UMI options to param help

d55306f

split inputreads into an additional channel for UMI processing

138f217

added UMI code block

31b962d

connected UMI output to mapping process

2a46042

corrected missing tuple

587793c

fixed reference in umi file mapping

7dfe534

fixing channel used twice for fasta by sentieon

1446f42

wrongly duplicated channel with typo

e13c9c1

using ch_fasta as in mapping for umi mapping

494899a

reduced test config for laptop run

5c6e40a

added documentation for UMI input options

a3cd712

lescai requested a review from maxulysse as a code owner March 5, 2020 11:42