Merge pull request #1106 from nf-core/kallisto_quant

Kallisto quantification
nf-core · Nov 15, 2023 · ab9df9a · ab9df9a
2 parents f189c95 + 5ad263a
commit ab9df9a
Show file tree

Hide file tree

Showing 41 changed files with 1,361 additions and 494 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -232,16 +232,19 @@ jobs:
         run: |
           nextflow run ${GITHUB_WORKSPACE} -profile test_cache,docker --aligner hisat2 ${{ matrix.parameters }} --outdir ./results --test_data_base ${{ github.workspace }}/test-datasets/
 
-  salmon:
-    name: Test Salmon with workflow parameters
+  pseudo:
+    name: Test Pseudoaligners with workflow parameters
     if: ${{ (github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/rnaseq')) && !contains(github.event.head_commit.message, '[ci fast]') }}
     runs-on: ubuntu-latest
     strategy:
       matrix:
         parameters:
-          - "--skip_qc"
-          - "--skip_alignment --skip_pseudo_alignment"
-          - "--salmon_index false --transcript_fasta false"
+          - "--pseudo_aligner salmon --skip_qc"
+          - "--pseudo_aligner salmon --skip_alignment --skip_pseudo_alignment"
+          - "--pseudo_aligner salmon --salmon_index false --transcript_fasta false"
+          - "--pseudo_aligner kallisto --skip_qc"
+          - "--pseudo_aligner kallisto --skip_alignment --skip_pseudo_alignment"
+          - "--pseudo_aligner kallisto --kallisto_index false --transcript_fasta false"
     steps:
       - name: Check out pipeline code
         uses: actions/checkout@v2
@@ -280,6 +283,6 @@ jobs:
           wget -qO- get.nextflow.io | bash
           sudo mv nextflow /usr/local/bin/
 
-      - name: Run pipeline with Salmon and various parameters
+      - name: Run pipeline with Salmon or Kallisto and various parameters
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test_cache,docker --pseudo_aligner salmon ${{ matrix.parameters }} --outdir ./results --test_data_base ${{ github.workspace }}/test-datasets/
+          nextflow run ${GITHUB_WORKSPACE} -profile test_cache,docker ${{ matrix.parameters }} --outdir ./results --test_data_base ${{ github.workspace }}/test-datasets/
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 Special thanks to the following for their contributions to the release:
 
 - [Adam Talbot](https://github.com/adamrtalbot)
+- [Jonathan Manning](https://github.com/pinin4fjords)
 - [Júlia Mir Pedrol](https://github.com/mirpedrol)
 - [Matthias Zepper](https://github.com/MatthiasZepper)
 - [Maxime Garcia](https://github.com/maxulysse)
@@ -28,13 +29,16 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
 - [PR #1083](https://github.com/nf-core/rnaseq/pull/1083) - Move local modules and subworkflows to subfolders
 - [PR #1088](https://github.com/nf-core/rnaseq/pull/1088) - Updates contributing and code of conduct documents with nf-core template 2.10
 - [PR #1091](https://github.com/nf-core/rnaseq/pull/1091) - Reorganise parameters in schema for better usability
+- [PR #1106](https://github.com/nf-core/rnaseq/pull/1106) - Kallisto quantification
+- [PR #1106](https://github.com/nf-core/rnaseq/pull/1106) - MultiQC [version bump](https://github.com/nf-core/rnaseq/pull/1106/commits/aebad067a10a45510a2b421da852cb436ae65fd8)
+- [#1050](https://github.com/nf-core/rnaseq/issues/1050) - Provide custom prefix/suffix for summary files to avoid overwriting
 
 ### Software dependencies
 
 | Dependency              | Old version | New version |
 | ----------------------- | ----------- | ----------- |
 | `fastqc`                | 0.11.9      | 0.12.1      |
-| `multiqc`               | 1.14        | 1.15        |
+| `multiqc`               | 1.14        | 1.17        |
 | `ucsc-bedgraphtobigwig` | 377         | 445         |
 
 > **NB:** Dependency has been **updated** if both old and new version information is present.
@@ -61,7 +65,7 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
 ### Enhancements & fixes
 
 - [[#1011](https://github.com/nf-core/rnaseq/issues/1011)] - FastQ files from UMI-tools not being passed to fastp
-- [[#1018](https://github.com/nf-core/rnaseq/issues/1018)] - Ability to skip both alignment and pseudo-alignment to only run pre-processing QC steps.
+- [[#1018](https://github.com/nf-core/rnaseq/issues/1018)] - Ability to skip both alignment and pseudoalignment to only run pre-processing QC steps.
 - [PR #1016](https://github.com/nf-core/rnaseq/pull/1016) - Updated pipeline template to [nf-core/tools 2.8](https://github.com/nf-core/tools/releases/tag/2.8)
 - [PR #1025](https://github.com/nf-core/fetchngs/pull/1025) - Add `public_aws_ecr.config` to source mulled containers when using `public.ecr.aws` Docker Biocontainer registry
 - [PR #1038](https://github.com/nf-core/rnaseq/pull/1038) - Updated error log for count values when supplying `--additional_fasta`
@@ -809,7 +813,7 @@ Major novel changes include:
 - Added options to skip several steps
   - Skip trimming using `--skipTrimming`
   - Skip BiotypeQC using `--skipBiotypeQC`
-  - Skip Alignment using `--skipAlignment` to only use pseudo-alignment using Salmon
+  - Skip Alignment using `--skipAlignment` to only use pseudoalignment using Salmon
 
 ### Documentation updates
 

diff --git a/README.md b/README.md
@@ -39,7 +39,7 @@
     3. [`dupRadar`](https://bioconductor.org/packages/release/bioc/html/dupRadar.html)
     4. [`Preseq`](http://smithlabresearch.org/software/preseq/)
     5. [`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
-15. Pseudo-alignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/); _optional_)
+15. Pseudoalignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/) or ['Kallisto'](https://pachterlab.github.io/kallisto/); _optional_)
 16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))
 
 > **Note**

diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
@@ -23,6 +23,7 @@ run_modules:
   - hisat2
   - rsem
   - salmon
+  - kallisto
   - samtools
   - picard
   - preseq
@@ -66,6 +67,7 @@ extra_fn_clean_exts:
   - ".umi_dedup"
   - "_val"
   - ".markdup"
+  - "_primary"
 
 # Customise the module search patterns to speed up execution time
 #  - Skip module sub-tools that we are not interested in

diff --git a/bin/salmon_tx2gene.py b/bin/salmon_tx2gene.py
diff --git a/bin/salmon_summarizedexperiment.r → bin/summarizedexperiment.r b/bin/salmon_summarizedexperiment.r → bin/summarizedexperiment.r
@@ -2,18 +2,18 @@
 
 library(SummarizedExperiment)
 
-## Create SummarizedExperiment (se) object from Salmon counts
+## Create SummarizedExperiment (se) object from counts
 
 args <- commandArgs(trailingOnly = TRUE)
-if (length(args) < 2) {
-    stop("Usage: salmon_se.r <coldata> <counts> <tpm>", call. = FALSE)
+if (length(args) < 3) {
+    stop("Usage: summarizedexperiment.r <coldata> <counts> <tpm> <tx2gene>", call. = FALSE)
 }
 
 coldata <- args[1]
 counts_fn <- args[2]
 tpm_fn <- args[3]
+tx2gene <- args[4]
 
-tx2gene <- "salmon_tx2gene.tsv"
 info <- file.info(tx2gene)
 if (info$size == 0) {
     tx2gene <- NULL

diff --git a/bin/tx2gene.py b/bin/tx2gene.py
@@ -0,0 +1,160 @@
+#!/usr/bin/env python
+import logging
+import argparse
+import glob
+import os
+from collections import Counter, defaultdict, OrderedDict
+from collections.abc import Set
+from typing import Dict
+
+# Configure logging
+logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.INFO)
+
+
+def read_top_transcripts(quant_dir: str, file_pattern: str) -> Set[str]:
+    """
+    Read the top 100 transcripts from the quantification file.
+
+    Parameters:
+    quant_dir (str): Directory where quantification files are located.
+    file_pattern (str): Pattern to match quantification files.
+
+    Returns:
+    set: A set containing the top 100 transcripts.
+    """
+    try:
+        # Find the quantification file within the directory
+        quant_file_path = glob.glob(os.path.join(quant_dir, "*", file_pattern))[0]
+        with open(quant_file_path, "r") as file_handle:
+            # Read the file and extract the top 100 transcripts
+            return {line.split()[0] for i, line in enumerate(file_handle) if i > 0 and i <= 100}
+    except IndexError:
+        # Log an error and raise a FileNotFoundError if the quant file does not exist
+        logger.error("No quantification files found.")
+        raise FileNotFoundError("Quantification file not found.")
+
+
+def discover_transcript_attribute(gtf_file: str, transcripts: Set[str]) -> str:
+    """
+    Discover the attribute in the GTF that corresponds to transcripts, prioritizing 'transcript_id'.
+
+    Parameters:
+    gtf_file (str): Path to the GTF file.
+    transcripts (Set[str]): A set of transcripts to match in the GTF file.
+
+    Returns:
+    str: The attribute name that corresponds to transcripts in the GTF file.
+    """
+    votes = Counter()
+    with open(gtf_file) as inh:
+        # Read GTF file, skipping header lines
+        for line in filter(lambda x: not x.startswith("#"), inh):
+            cols = line.split("\t")
+            # Parse attribute column and update votes for each attribute found
+            attributes = dict(item.strip().split(" ", 1) for item in cols[8].split(";") if item.strip())
+            votes.update(key for key, value in attributes.items() if value.strip('"') in transcripts)
+
+    if not votes:
+        # Log a warning if no matching attribute is found
+        logger.warning("No attribute in GTF matching transcripts")
+        return ""
+
+    # Check if 'transcript_id' is among the attributes with the highest votes
+    if "transcript_id" in votes and votes["transcript_id"] == max(votes.values()):
+        logger.info("Attribute 'transcript_id' corresponds to transcripts.")
+        return "transcript_id"
+
+    # If 'transcript_id' isn't the highest, determine the most common attribute that matches the transcripts
+    attribute, _ = votes.most_common(1)[0]
+    logger.info(f"Attribute '{attribute}' corresponds to transcripts.")
+    return attribute
+
+
+def parse_attributes(attributes_text: str) -> Dict[str, str]:
+    """
+    Parse the attributes column of a GTF file.
+
+    :param attributes_text: The attributes column as a string.
+    :return: A dictionary of the attributes.
+    """
+    # Split the attributes string by semicolon and strip whitespace
+    attributes = attributes_text.strip().split(";")
+    attr_dict = OrderedDict()
+
+    # Iterate over each attribute pair
+    for attribute in attributes:
+        # Split the attribute into key and value, ensuring there are two parts
+        parts = attribute.strip().split(" ", 1)
+        if len(parts) == 2:
+            key, value = parts
+            # Remove any double quotes from the value
+            value = value.replace('"', "")
+            attr_dict[key] = value
+
+    return attr_dict
+
+
+def map_transcripts_to_gene(
+    quant_type: str, gtf_file: str, quant_dir: str, gene_id: str, extra_id_field: str, output_file: str
+) -> bool:
+    """
+    Map transcripts to gene names and write the output to a file.
+
+    Parameters:
+    quant_type (str): The quantification method used (e.g., 'salmon').
+    gtf_file (str): Path to the GTF file.
+    quant_dir (str): Directory where quantification files are located.
+    gene_id (str): The gene ID attribute in the GTF file.
+    extra_id_field (str): Additional ID field in the GTF file.
+    output_file (str): The output file path.
+
+    Returns:
+    bool: True if the operation was successful, False otherwise.
+    """
+    # Read the top transcripts based on quantification type
+    transcripts = read_top_transcripts(quant_dir, "quant.sf" if quant_type == "salmon" else "abundance.tsv")
+    # Discover the attribute that corresponds to transcripts in the GTF
+    transcript_attribute = discover_transcript_attribute(gtf_file, transcripts)
+
+    if not transcript_attribute:
+        # If no attribute is found, return False
+        return False
+
+    # Open GTF and output file to write the mappings
+    # Initialize the set to track seen combinations
+    seen = set()
+
+    with open(gtf_file) as inh, open(output_file, "w") as output_handle:
+        # Parse each line of the GTF, mapping transcripts to genes
+        for line in filter(lambda x: not x.startswith("#"), inh):
+            cols = line.split("\t")
+            attr_dict = parse_attributes(cols[8])
+            if gene_id in attr_dict and transcript_attribute in attr_dict:
+                # Create a unique identifier for the transcript-gene combination
+                transcript_gene_pair = (attr_dict[transcript_attribute], attr_dict[gene_id])
+
+                # Check if the combination has already been seen
+                if transcript_gene_pair not in seen:
+                    # If it's a new combination, write it to the output and add to the seen set
+                    extra_id = attr_dict.get(extra_id_field, attr_dict[gene_id])
+                    output_handle.write(f"{attr_dict[transcript_attribute]}\t{attr_dict[gene_id]}\t{extra_id}\n")
+                    seen.add(transcript_gene_pair)
+
+    return True
+
+
+# Main function to parse arguments and call the mapping function
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Map transcripts to gene names for tximport.")
+    parser.add_argument("--quant_type", type=str, help="Quantification type", default="salmon")
+    parser.add_argument("--gtf", type=str, help="GTF file", required=True)
+    parser.add_argument("--quants", type=str, help="Output of quantification", required=True)
+    parser.add_argument("--id", type=str, help="Gene ID in the GTF file", required=True)
+    parser.add_argument("--extra", type=str, help="Extra ID in the GTF file")
+    parser.add_argument("-o", "--output", dest="output", default="tx2gene.tsv", type=str, help="File with output")
+
+    args = parser.parse_args()
+    if not map_transcripts_to_gene(args.quant_type, args.gtf, args.quants, args.id, args.extra, args.output):
+        logger.error("Failed to map transcripts to genes.")