Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kallisto quantification #1106

Merged
merged 66 commits into from
Nov 15, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
66 commits
Select commit Hold shift + click to select a range
0979aee
naive kallisto addition to prepare_genome, following salmon pattern (…
pinin4fjords Oct 31, 2023
882843b
Add kallisto modules
pinin4fjords Oct 31, 2023
952ab48
update schema
pinin4fjords Oct 31, 2023
ac4fa24
Pass kallisto index in workflow
pinin4fjords Oct 31, 2023
549cfef
Add kallisto quant subworkflow
pinin4fjords Oct 31, 2023
b32eba6
[skip ci] next batch of kallisto adaptations
pinin4fjords Oct 31, 2023
06f72dc
tximport is more generic than just salmon
pinin4fjords Nov 1, 2023
9e3d589
Fix kallisto indexing
pinin4fjords Nov 1, 2023
1a41fbf
[skip ci] Next phase of Kallisto integration
pinin4fjords Nov 1, 2023
c081697
[skip ci] Bump kallisto modules
pinin4fjords Nov 2, 2023
7eb755b
[skip ci] rewire for updated kallisto modules
pinin4fjords Nov 2, 2023
42cef1d
[skip ci] Make tx2gene.py generic
pinin4fjords Nov 2, 2023
4089972
[skip ci] Try chatgpt's tidied version of tx2gene.py
pinin4fjords Nov 2, 2023
d18c3f1
[skip ci] Tidied up and genericised tximport script
pinin4fjords Nov 2, 2023
2d3e08e
[skip ci] misc compatibility updates
pinin4fjords Nov 2, 2023
5be6a04
[skip ci] fix bin/tx2gene.py
pinin4fjords Nov 2, 2023
4d8911f
[skip ci] prioritise transcript_id
pinin4fjords Nov 2, 2023
a5b27f7
[skip ci] final kallisto results handling fixes
pinin4fjords Nov 2, 2023
73e1a54
more genericisation between kallisto and salmon
pinin4fjords Nov 3, 2023
2a24792
Remove debugging line
pinin4fjords Nov 3, 2023
aebad06
Fix multiqc to avoid imp module error
pinin4fjords Nov 3, 2023
2ada1e4
Fix process selector
pinin4fjords Nov 3, 2023
39589a6
add kallisto_index default
pinin4fjords Nov 3, 2023
839ac5c
Combine kallisto and salmon components
pinin4fjords Nov 3, 2023
aa6e8c0
appease eclint
pinin4fjords Nov 3, 2023
72766a5
[automated] Fix linting with Prettier
nf-core-bot Nov 3, 2023
7328c65
Update CHANGELOG
pinin4fjords Nov 3, 2023
3c56b68
Merge branch 'kallisto_quant' of https://github.com/nf-core/rnaseq in…
pinin4fjords Nov 3, 2023
6a5e485
blackify
pinin4fjords Nov 3, 2023
f40065d
Don't hardcode tx2gene_file
pinin4fjords Nov 3, 2023
943e6ab
syntax fix
pinin4fjords Nov 3, 2023
83217d3
Add some kallisto tests
pinin4fjords Nov 3, 2023
6f85da7
Update metro maps
pinin4fjords Nov 3, 2023
7bda41e
[skip ci] Fix metro maps
pinin4fjords Nov 3, 2023
6479792
Update strandedness params for kallisto and salmon
pinin4fjords Nov 3, 2023
92716b4
[skip ci] appease linter
pinin4fjords Nov 3, 2023
cadec09
Fix kallisto publishing
pinin4fjords Nov 3, 2023
908c6f2
[skip ci] fix usage.md for Kallisto
pinin4fjords Nov 3, 2023
334088e
Merge branch 'kallisto_quant' of github.com:nf-core/rnaseq into kalli…
pinin4fjords Nov 3, 2023
0fc5407
[automated] Fix linting with Prettier
nf-core-bot Nov 3, 2023
e9ed01d
Update fraglen defaults
pinin4fjords Nov 3, 2023
1467c81
MultiQC needs STDOUT from Kallisto
pinin4fjords Nov 6, 2023
7d04886
Fix Kallisto MultiQC
pinin4fjords Nov 6, 2023
8420e99
Update docs
pinin4fjords Nov 6, 2023
46866ae
[automated] Fix linting with Prettier
nf-core-bot Nov 6, 2023
b39e7e8
Correct kallisto out path
pinin4fjords Nov 6, 2023
36694f8
Merge branch 'kallisto_quant' of github.com:nf-core/rnaseq into kalli…
pinin4fjords Nov 6, 2023
58bbc27
Add prefix usage to local modules to facilitate batched runs
pinin4fjords Nov 7, 2023
ab4c662
[skip ci] Address easy review feedback
pinin4fjords Nov 7, 2023
c612dd0
[skip ci] Update config per latest
pinin4fjords Nov 9, 2023
4a81e64
[skip ci] fix font sizes in metro map file icons
pinin4fjords Nov 9, 2023
d5c6f7f
thank myself to poke ci
pinin4fjords Nov 9, 2023
d75d507
Reflect updated kallisto module
pinin4fjords Nov 9, 2023
b0cf295
[skip ci] test file not supposed to be there
pinin4fjords Nov 10, 2023
9dcaf0b
Merge branch 'dev' into kallisto_quant
pinin4fjords Nov 10, 2023
4f556ee
[skip ci] Add Kallisto outputs to Tower yml
pinin4fjords Nov 13, 2023
9ab9f7c
Post-review changes
drpatelh Nov 14, 2023
0c9c0aa
Fix Kallisto sample name rendering in MultiQC
drpatelh Nov 14, 2023
2decfdc
[skip ci] Alphabetically order enum fields for salmon strandedness
drpatelh Nov 14, 2023
5ea79ac
[skip ci] Bump module from branch
pinin4fjords Nov 14, 2023
9605b0f
Merge branch 'kallisto_quant' of github.com:nf-core/rnaseq into kalli…
pinin4fjords Nov 14, 2023
83d9fb4
[automated] Fix linting with Prettier
nf-core-bot Nov 14, 2023
6c75322
[skip ci] removed kallisto strandedness as a standalone parameter
pinin4fjords Nov 14, 2023
5458b5c
[skip ci] Fix kallisto log file ext
pinin4fjords Nov 14, 2023
68b8977
Fix tiny issue
pinin4fjords Nov 14, 2023
5ad263a
[automated] Fix linting with Prettier
nf-core-bot Nov 14, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 10 additions & 7 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -232,16 +232,19 @@ jobs:
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test_cache,docker --aligner hisat2 ${{ matrix.parameters }} --outdir ./results --test_data_base ${{ github.workspace }}/test-datasets/

salmon:
name: Test Salmon with workflow parameters
pseudo:
name: Test Pseudoaligners with workflow parameters
if: ${{ (github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/rnaseq')) && !contains(github.event.head_commit.message, '[ci fast]') }}
runs-on: ubuntu-latest
strategy:
matrix:
parameters:
- "--skip_qc"
- "--skip_alignment --skip_pseudo_alignment"
- "--salmon_index false --transcript_fasta false"
- "--pseudo_aligner salmon --skip_qc"
- "--pseudo_aligner salmon --skip_alignment --skip_pseudo_alignment"
- "--pseudo_aligner salmon --salmon_index false --transcript_fasta false"
- "--pseudo_aligner kallisto --skip_qc"
- "--pseudo_aligner kallisto --skip_alignment --skip_pseudo_alignment"
- "--pseudo_aligner kallisto --kallisto_index false --transcript_fasta false"
steps:
- name: Check out pipeline code
uses: actions/checkout@v2
Expand Down Expand Up @@ -280,6 +283,6 @@ jobs:
wget -qO- get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/

- name: Run pipeline with Salmon and various parameters
- name: Run pipeline with Salmon or Kallisto and various parameters
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test_cache,docker --pseudo_aligner salmon ${{ matrix.parameters }} --outdir ./results --test_data_base ${{ github.workspace }}/test-datasets/
nextflow run ${GITHUB_WORKSPACE} -profile test_cache,docker ${{ matrix.parameters }} --outdir ./results --test_data_base ${{ github.workspace }}/test-datasets/
10 changes: 7 additions & 3 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
Special thanks to the following for their contributions to the release:

- [Adam Talbot](https://github.com/adamrtalbot)
- [Jonathan Manning](https://github.com/pinin4fjords)
- [Júlia Mir Pedrol](https://github.com/mirpedrol)
- [Matthias Zepper](https://github.com/MatthiasZepper)
- [Maxime Garcia](https://github.com/maxulysse)
Expand All @@ -28,13 +29,16 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
- [PR #1083](https://github.com/nf-core/rnaseq/pull/1083) - Move local modules and subworkflows to subfolders
- [PR #1088](https://github.com/nf-core/rnaseq/pull/1088) - Updates contributing and code of conduct documents with nf-core template 2.10
- [PR #1091](https://github.com/nf-core/rnaseq/pull/1091) - Reorganise parameters in schema for better usability
- [PR #1106](https://github.com/nf-core/rnaseq/pull/1106) - Kallisto quantification
pinin4fjords marked this conversation as resolved.
Show resolved Hide resolved
- [PR #1106](https://github.com/nf-core/rnaseq/pull/1106) - MultiQC [version bump](https://github.com/nf-core/rnaseq/pull/1106/commits/aebad067a10a45510a2b421da852cb436ae65fd8)
- [#1050](https://github.com/nf-core/rnaseq/issues/1050) - Provide custom prefix/suffix for summary files to avoid overwriting

### Software dependencies

| Dependency | Old version | New version |
| ----------------------- | ----------- | ----------- |
| `fastqc` | 0.11.9 | 0.12.1 |
| `multiqc` | 1.14 | 1.15 |
| `multiqc` | 1.14 | 1.17 |
| `ucsc-bedgraphtobigwig` | 377 | 445 |

> **NB:** Dependency has been **updated** if both old and new version information is present.
Expand All @@ -61,7 +65,7 @@ Thank you to everyone else that has contributed by reporting bugs, enhancements
### Enhancements & fixes

- [[#1011](https://github.com/nf-core/rnaseq/issues/1011)] - FastQ files from UMI-tools not being passed to fastp
- [[#1018](https://github.com/nf-core/rnaseq/issues/1018)] - Ability to skip both alignment and pseudo-alignment to only run pre-processing QC steps.
- [[#1018](https://github.com/nf-core/rnaseq/issues/1018)] - Ability to skip both alignment and pseudoalignment to only run pre-processing QC steps.
- [PR #1016](https://github.com/nf-core/rnaseq/pull/1016) - Updated pipeline template to [nf-core/tools 2.8](https://github.com/nf-core/tools/releases/tag/2.8)
- [PR #1025](https://github.com/nf-core/fetchngs/pull/1025) - Add `public_aws_ecr.config` to source mulled containers when using `public.ecr.aws` Docker Biocontainer registry
- [PR #1038](https://github.com/nf-core/rnaseq/pull/1038) - Updated error log for count values when supplying `--additional_fasta`
Expand Down Expand Up @@ -809,7 +813,7 @@ Major novel changes include:
- Added options to skip several steps
- Skip trimming using `--skipTrimming`
- Skip BiotypeQC using `--skipBiotypeQC`
- Skip Alignment using `--skipAlignment` to only use pseudo-alignment using Salmon
- Skip Alignment using `--skipAlignment` to only use pseudoalignment using Salmon

### Documentation updates

Expand Down
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@
3. [`dupRadar`](https://bioconductor.org/packages/release/bioc/html/dupRadar.html)
4. [`Preseq`](http://smithlabresearch.org/software/preseq/)
5. [`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
15. Pseudo-alignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/); _optional_)
15. Pseudoalignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/) or ['Kallisto'](https://pachterlab.github.io/kallisto/); _optional_)
16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))

> **Note**
Expand Down
2 changes: 2 additions & 0 deletions assets/multiqc_config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ run_modules:
- hisat2
- rsem
- salmon
- kallisto
- samtools
- picard
- preseq
Expand Down Expand Up @@ -66,6 +67,7 @@ extra_fn_clean_exts:
- ".umi_dedup"
- "_val"
- ".markdup"
- "_primary"

# Customise the module search patterns to speed up execution time
# - Skip module sub-tools that we are not interested in
Expand Down
89 changes: 0 additions & 89 deletions bin/salmon_tx2gene.py

This file was deleted.

Original file line number Diff line number Diff line change
Expand Up @@ -2,18 +2,18 @@

library(SummarizedExperiment)

## Create SummarizedExperiment (se) object from Salmon counts
## Create SummarizedExperiment (se) object from counts

args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 2) {
stop("Usage: salmon_se.r <coldata> <counts> <tpm>", call. = FALSE)
if (length(args) < 3) {
stop("Usage: summarizedexperiment.r <coldata> <counts> <tpm> <tx2gene>", call. = FALSE)
}

coldata <- args[1]
counts_fn <- args[2]
tpm_fn <- args[3]
tx2gene <- args[4]

tx2gene <- "salmon_tx2gene.tsv"
info <- file.info(tx2gene)
if (info$size == 0) {
tx2gene <- NULL
Expand Down
160 changes: 160 additions & 0 deletions bin/tx2gene.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,160 @@
#!/usr/bin/env python
import logging
import argparse
import glob
import os
from collections import Counter, defaultdict, OrderedDict
from collections.abc import Set
from typing import Dict

# Configure logging
logging.basicConfig(format="%(name)s - %(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def read_top_transcripts(quant_dir: str, file_pattern: str) -> Set[str]:
"""
Read the top 100 transcripts from the quantification file.

Parameters:
quant_dir (str): Directory where quantification files are located.
file_pattern (str): Pattern to match quantification files.

Returns:
set: A set containing the top 100 transcripts.
"""
try:
# Find the quantification file within the directory
quant_file_path = glob.glob(os.path.join(quant_dir, "*", file_pattern))[0]
with open(quant_file_path, "r") as file_handle:
# Read the file and extract the top 100 transcripts
return {line.split()[0] for i, line in enumerate(file_handle) if i > 0 and i <= 100}
except IndexError:
# Log an error and raise a FileNotFoundError if the quant file does not exist
logger.error("No quantification files found.")
raise FileNotFoundError("Quantification file not found.")


def discover_transcript_attribute(gtf_file: str, transcripts: Set[str]) -> str:
"""
Discover the attribute in the GTF that corresponds to transcripts, prioritizing 'transcript_id'.

Parameters:
gtf_file (str): Path to the GTF file.
transcripts (Set[str]): A set of transcripts to match in the GTF file.

Returns:
str: The attribute name that corresponds to transcripts in the GTF file.
"""
votes = Counter()
with open(gtf_file) as inh:
# Read GTF file, skipping header lines
for line in filter(lambda x: not x.startswith("#"), inh):
cols = line.split("\t")
# Parse attribute column and update votes for each attribute found
attributes = dict(item.strip().split(" ", 1) for item in cols[8].split(";") if item.strip())
votes.update(key for key, value in attributes.items() if value.strip('"') in transcripts)

if not votes:
# Log a warning if no matching attribute is found
logger.warning("No attribute in GTF matching transcripts")
return ""

# Check if 'transcript_id' is among the attributes with the highest votes
if "transcript_id" in votes and votes["transcript_id"] == max(votes.values()):
logger.info("Attribute 'transcript_id' corresponds to transcripts.")
return "transcript_id"

# If 'transcript_id' isn't the highest, determine the most common attribute that matches the transcripts
attribute, _ = votes.most_common(1)[0]
logger.info(f"Attribute '{attribute}' corresponds to transcripts.")
return attribute


def parse_attributes(attributes_text: str) -> Dict[str, str]:
"""
Parse the attributes column of a GTF file.

:param attributes_text: The attributes column as a string.
:return: A dictionary of the attributes.
"""
# Split the attributes string by semicolon and strip whitespace
attributes = attributes_text.strip().split(";")
attr_dict = OrderedDict()

# Iterate over each attribute pair
for attribute in attributes:
# Split the attribute into key and value, ensuring there are two parts
parts = attribute.strip().split(" ", 1)
if len(parts) == 2:
key, value = parts
# Remove any double quotes from the value
value = value.replace('"', "")
attr_dict[key] = value

return attr_dict


def map_transcripts_to_gene(
quant_type: str, gtf_file: str, quant_dir: str, gene_id: str, extra_id_field: str, output_file: str
) -> bool:
"""
Map transcripts to gene names and write the output to a file.

Parameters:
quant_type (str): The quantification method used (e.g., 'salmon').
gtf_file (str): Path to the GTF file.
quant_dir (str): Directory where quantification files are located.
gene_id (str): The gene ID attribute in the GTF file.
extra_id_field (str): Additional ID field in the GTF file.
output_file (str): The output file path.

Returns:
bool: True if the operation was successful, False otherwise.
"""
# Read the top transcripts based on quantification type
transcripts = read_top_transcripts(quant_dir, "quant.sf" if quant_type == "salmon" else "abundance.tsv")
# Discover the attribute that corresponds to transcripts in the GTF
transcript_attribute = discover_transcript_attribute(gtf_file, transcripts)

if not transcript_attribute:
# If no attribute is found, return False
return False

# Open GTF and output file to write the mappings
# Initialize the set to track seen combinations
seen = set()

with open(gtf_file) as inh, open(output_file, "w") as output_handle:
# Parse each line of the GTF, mapping transcripts to genes
for line in filter(lambda x: not x.startswith("#"), inh):
cols = line.split("\t")
attr_dict = parse_attributes(cols[8])
if gene_id in attr_dict and transcript_attribute in attr_dict:
# Create a unique identifier for the transcript-gene combination
transcript_gene_pair = (attr_dict[transcript_attribute], attr_dict[gene_id])

# Check if the combination has already been seen
if transcript_gene_pair not in seen:
# If it's a new combination, write it to the output and add to the seen set
extra_id = attr_dict.get(extra_id_field, attr_dict[gene_id])
output_handle.write(f"{attr_dict[transcript_attribute]}\t{attr_dict[gene_id]}\t{extra_id}\n")
seen.add(transcript_gene_pair)

return True


# Main function to parse arguments and call the mapping function
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Map transcripts to gene names for tximport.")
parser.add_argument("--quant_type", type=str, help="Quantification type", default="salmon")
parser.add_argument("--gtf", type=str, help="GTF file", required=True)
parser.add_argument("--quants", type=str, help="Output of quantification", required=True)
parser.add_argument("--id", type=str, help="Gene ID in the GTF file", required=True)
parser.add_argument("--extra", type=str, help="Extra ID in the GTF file")
parser.add_argument("-o", "--output", dest="output", default="tx2gene.tsv", type=str, help="File with output")

args = parser.parse_args()
if not map_transcripts_to_gene(args.quant_type, args.gtf, args.quants, args.id, args.extra, args.output):
logger.error("Failed to map transcripts to genes.")
Loading
Loading