Merge pull request #1388 from egreenberg7/dev
New module: Kraken2/Bracken on Unaligned Sequences for Contamination Detection
Shaun-Regenbaum authored Sep 19, 2024
2 parents 0b4125d + 02f65ab commit da7b999
Showing 34 changed files with 1,430 additions and 201 deletions.
28 changes: 28 additions & 0 deletions CHANGELOG.md
@@ -7,8 +7,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Enhancements & fixes

- [PR #1388](https://github.com/nf-core/rnaseq/pull/1388) - Adding Kraken2/Bracken on unaligned reads as an additional quality control step to detect sample contamination
- [PR #1186](https://github.com/nf-core/rnaseq/pull/1186) - Bump pipeline version to 3.16.0dev

### Parameters

| Old parameter | New parameter |
| ------------- | --------------------------- |
| | `--contaminant_screening` |
| | `--kraken_db` |
| | `--save_kraken_assignments` |
| | `--save_kraken_unassigned` |
| | `--bracken_precision` |

> **NB:** Parameter has been **updated** if both old and new parameter information is present.
>
> **NB:** Parameter has been **added** if just the new parameter information is present.
>
> **NB:** Parameter has been **removed** if new parameter information isn't present.

### Software dependencies

| Dependency | Old version | New version |
| ---------- | ----------- | ----------- |
| `Kraken2` | ----------- | 2.1.3 |
| `Bracken` | ----------- | 2.9 |

> **NB:** Dependency has been **updated** if both old and new version information is present.
>
> **NB:** Dependency has been **added** if just the new version information is present.
>
> **NB:** Dependency has been **removed** if new version information isn't present.
## [[3.15.1](https://github.com/nf-core/rnaseq/releases/tag/3.15.1)] - 2024-09-16

### Enhancements & fixes
8 changes: 8 additions & 0 deletions CITATIONS.md
@@ -16,6 +16,10 @@

> Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.
- [Bracken](https://doi.org/10.7717/peerj-cs.104)

> Lu, J., Breitwieser, F. P., Thielen, P., & Salzberg, S. L. (2017). Bracken: estimating species abundance in metagenomics data. PeerJ. Computer Science, 3(e104), e104. https://doi.org/10.7717/peerj-cs.104
- [fastp](https://www.ncbi.nlm.nih.gov/pubmed/30423086/)

> Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086; PubMed Central PMCID: PMC6129281.
@@ -38,6 +42,10 @@

> Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019 Aug;37(8):907-915. doi: 10.1038/s41587-019-0201-4. Epub 2019 Aug 2. PubMed PMID: 31375807.
- [Kraken2](https://doi.org/10.1186/s13059-019-1891-0)

> Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1), 257. https://doi.org/10.1186/s13059-019-1891-0
- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
1 change: 1 addition & 0 deletions README.md
@@ -46,6 +46,7 @@
3. [`dupRadar`](https://bioconductor.org/packages/release/bioc/html/dupRadar.html)
4. [`Preseq`](http://smithlabresearch.org/software/preseq/)
5. [`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
6. [`Kraken2`](https://ccb.jhu.edu/software/kraken2/) -> [`Bracken`](https://ccb.jhu.edu/software/bracken/) on unaligned sequences; _optional_
15. Pseudoalignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/) or [`Kallisto`](https://pachterlab.github.io/kallisto/); _optional_)
16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))

Binary file added docs/images/bracken-top-n-plot.png
Binary file modified docs/images/nf-core-rnaseq_metro_map_grey.png
414 changes: 235 additions & 179 deletions docs/images/nf-core-rnaseq_metro_map_grey.svg
22 changes: 21 additions & 1 deletion docs/output.md
@@ -40,6 +40,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Preseq](#preseq) - Estimation of library complexity
- [featureCounts](#featurecounts) - Read counting relative to gene biotype
- [DESeq2](#deseq2) - PCA plot and sample pairwise distance heatmap and dendrogram
- [Kraken2/Bracken](#kraken2bracken) - Taxonomic classification of unaligned reads
- [MultiQC](#multiqc) - Present QC for raw reads, alignment, read counting and sample similarity
- [Pseudoalignment and quantification](#pseudoalignment-and-quantification)
- [Salmon](#pseudoalignment) - Wicked fast gene and isoform quantification relative to the transcriptome
@@ -656,6 +657,25 @@ The plot on the left hand side shows the standard PC plot - notice the variable

<p align="center"><img src="images/mqc_deseq2_clustering.png" alt="MultiQC - DESeq2 sample similarity plot" width="600"></p>

### Kraken2/Bracken

<details markdown="1">
<summary>Output files</summary>

- `<ALIGNER>/contaminants/kraken2/kraken_reports`
- `*.kraken2.report.txt`: Classification of unaligned reads in the Kraken report format. See the [kraken2 manual](https://github.com/DerrickWood/kraken2/wiki/Manual#output-formats) for more details.
- `*.classified*.fastq.gz`: If `--save_kraken_assignments` is set, a FASTQ file for each sample in which each classified read is annotated with its taxonomic identification from Kraken2.
- `*.unclassified*.fastq.gz`: If `--save_kraken_unassigned` is set, a FASTQ file containing all reads that were not classified by Kraken2.
- `<ALIGNER>/contaminants/bracken/`
- `*.kraken2.report_bracken.txt`: Kraken-style reports of the Bracken abundance estimate results. See the [kraken2 manual](https://github.com/DerrickWood/kraken2/wiki/Manual#output-formats) for more details.
- `*.tsv`: Summary of the estimated read count for each taxon at the given classification level, including the corrections Bracken made to the Kraken2 counts.

</details>

[Kraken2](https://ccb.jhu.edu/software/kraken2/) is a taxonomic classification tool that combines k-mer matching with a lowest common ancestor (LCA) algorithm to assign reads to taxa. [Bracken](https://ccb.jhu.edu/software/bracken/) is a statistical method that generates abundance estimates from the Kraken2 output. Both tools are run on unaligned sequences to detect potential sample contamination. MultiQC reports the top 5 taxa detected at the classification level used for Bracken, with toggles available for higher taxonomic levels. If Bracken is skipped, MultiQC reports the top 5 species detected by Kraken2.

![MultiQC - Bracken top species plot](images/bracken-top-n-plot.png)
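
As a rough illustration of how the Kraken-style reports can be inspected outside MultiQC, the following Python sketch parses report text and lists the most abundant taxa at one rank. It assumes the usual tab-separated layout (percentage, clade reads, direct reads, rank code, NCBI taxonomy ID, indented name). The helper names `parse_kraken_report` and `top_taxa` and the example counts are hypothetical and not part of the pipeline.

```python
from typing import NamedTuple


class TaxonCount(NamedTuple):
    percent: float     # percentage of reads in the clade rooted at this taxon
    clade_reads: int   # reads assigned to this taxon or any of its descendants
    direct_reads: int  # reads assigned directly to this taxon
    rank: str          # rank code, e.g. "S" for species, "G" for genus
    taxid: int         # NCBI taxonomy ID
    name: str          # scientific name with indentation stripped


def parse_kraken_report(text: str) -> list[TaxonCount]:
    """Parse Kraken-style report text into a list of TaxonCount rows."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        # Assumption: the first three and last three columns have fixed
        # meaning, so any extra columns in between are simply ignored.
        percent, clade, direct = fields[0], fields[1], fields[2]
        rank, taxid, name = fields[-3], fields[-2], fields[-1]
        rows.append(
            TaxonCount(float(percent), int(clade), int(direct),
                       rank, int(taxid), name.strip())
        )
    return rows


def top_taxa(rows: list[TaxonCount], rank: str = "S", n: int = 5) -> list[TaxonCount]:
    """Return the n taxa with the most clade reads at the given rank code."""
    at_rank = [row for row in rows if row.rank == rank]
    return sorted(at_rank, key=lambda row: row.clade_reads, reverse=True)[:n]


# Made-up example data, for illustration only.
EXAMPLE_REPORT = "\n".join([
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 90.00\t900\t0\tR\t1\troot",
    " 80.00\t800\t800\tS\t9606\t    Homo sapiens",
    "  8.00\t80\t80\tS\t562\t    Escherichia coli",
    "  2.00\t20\t20\tS\t4932\t    Saccharomyces cerevisiae",
])

if __name__ == "__main__":
    for taxon in top_taxa(parse_kraken_report(EXAMPLE_REPORT), rank="S", n=5):
        print(f"{taxon.name}\t{taxon.clade_reads}\t{taxon.percent:.2f}%")
```

For a real sample, the text would come from one of the `*.kraken2.report.txt` or `*.kraken2.report_bracken.txt` files listed above, and `rank` can be changed to look at a higher taxonomic level.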

### MultiQC

<details markdown="1">
@@ -675,7 +695,7 @@ Results generated by MultiQC collate pipeline QC from supported tools i.e. FastQ

### Pseudoalignment

The principal output files are the same between Salmon and Kallsto:
The principal output files are the same between Salmon and Kallisto:

<details markdown="1">
<summary>Output files</summary>
8 changes: 8 additions & 0 deletions docs/usage.md
@@ -296,6 +296,14 @@ Notes:

By default, the input GTF file will be filtered to ensure that sequence names correspond to those in the genome fasta file, and to remove rows with empty transcript identifiers. Filtering can be bypassed completely where you are confident it is not necessary, using the `--skip_gtf_filter` parameter. If you just want to skip the 'transcript_id' checking component of the GTF filtering script used in the pipeline this can be disabled specifically using the `--skip_gtf_transcript_filter` parameter.

## Contamination screening options

The pipeline provides the option to scan unaligned reads for contamination from other species using [Kraken2](https://ccb.jhu.edu/software/kraken2/), with the possibility of applying corrections from [Bracken](https://ccb.jhu.edu/software/bracken/). Since running Bracken is not computationally expensive, we recommend always using it to refine the abundance estimates generated by Kraken2.

The accuracy of Kraken2 is [highly dependent on the database](https://doi.org/10.1099/mgen.0.000949) used. In particular, it is [crucial](https://doi.org/10.1128/mbio.01607-23) that the host genome is included in the database. If you are concerned about specific contaminants, it may be beneficial to use a smaller, more focused database containing primarily those contaminants instead of the full standard database. Various pre-built databases [are available for download](https://benlangmead.github.io/aws-indexes/k2), and instructions for building a custom database can be found in the [Kraken2 documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown). Additionally, genomes of contaminants detected in previous sequencing experiments are available on the [OpenContami website](https://openlooper.hgc.jp/opencontami/help/help_oct.php).

While Kraken2 is capable of detecting low-abundance contaminants in a sample, false positives can occur. Therefore, if only a very small number of reads from a contaminating species are detected, these results should be interpreted with caution.
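
To make that caution concrete, here is a minimal Python sketch that reads a Bracken-style summary table and separates taxa with few estimated reads from better-supported ones. It assumes the table uses Bracken's usual header (`name`, `taxonomy_id`, `taxonomy_lvl`, `kraken_assigned_reads`, `added_reads`, `new_est_reads`, `fraction_total_reads`); check this against your own files. The function name `split_by_support`, the cutoff of 50 reads, and the example rows are hypothetical and should be adapted to the experiment.

```python
import csv
import io

# Hypothetical cutoff: taxa with fewer estimated reads are treated as
# weak evidence. This is not a pipeline default; tune it per experiment.
MIN_READS = 50


def split_by_support(tsv_text: str, min_reads: int = MIN_READS):
    """Split taxa into (well_supported, low_support) lists of (name, reads)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    well_supported, low_support = [], []
    for row in reader:
        reads = int(row["new_est_reads"])
        target = well_supported if reads >= min_reads else low_support
        target.append((row["name"], reads))
    return well_supported, low_support


# Made-up example data, for illustration only.
EXAMPLE_TSV = "\n".join([
    "name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\tadded_reads\tnew_est_reads\tfraction_total_reads",
    "Homo sapiens\t9606\tS\t800\t50\t850\t0.89474",
    "Escherichia coli\t562\tS\t80\t10\t90\t0.09474",
    "Saccharomyces cerevisiae\t4932\tS\t8\t2\t10\t0.01053",
])

if __name__ == "__main__":
    supported, uncertain = split_by_support(EXAMPLE_TSV)
    print("Well supported:", supported)
    print("Interpret with caution:", uncertain)
```

A fixed read cutoff is only one possible heuristic; a fraction-based cutoff on `fraction_total_reads` may suit samples with very different sequencing depths better.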

## Running the pipeline

The typical command for running the pipeline is as follows:
13 changes: 12 additions & 1 deletion modules.json
@@ -15,6 +15,11 @@
"git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
"installed_by": ["modules"]
},
"bracken/bracken": {
"branch": "master",
"git_sha": "c214fad97b328eb6d6233f779be9ba44814a9136",
"installed_by": ["modules"]
},
"cat/fastq": {
"branch": "master",
"git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
@@ -68,7 +73,8 @@
"hisat2/align": {
"branch": "master",
"git_sha": "ad30f90cfc383dfaa505771d24f9e292c53157ab",
"installed_by": ["fastq_align_hisat2"]
"installed_by": ["fastq_align_hisat2"],
"patch": "modules/nf-core/hisat2/align/hisat2-align.diff"
},
"hisat2/build": {
"branch": "master",
@@ -90,6 +96,11 @@
"git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
"installed_by": ["modules", "quantify_pseudo_alignment"]
},
"kraken2/kraken2": {
"branch": "master",
"git_sha": "a13d5d945742a60bbef6e5c177e81cda540f75dc",
"installed_by": ["modules"]
},
"multiqc": {
"branch": "master",
"git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
7 changes: 7 additions & 0 deletions modules/nf-core/bracken/bracken/environment.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

55 changes: 55 additions & 0 deletions modules/nf-core/bracken/bracken/main.nf

51 changes: 51 additions & 0 deletions modules/nf-core/bracken/bracken/meta.yml

13 changes: 13 additions & 0 deletions modules/nf-core/bracken/bracken/nextflow.config

5 changes: 5 additions & 0 deletions modules/nf-core/bracken/bracken/tests/genus_test.config
