Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feat] Add UMI Handling to the pipeline #164

Merged
merged 30 commits into from
Jan 11, 2024

Conversation

CKComputomics
Copy link
Contributor

@CKComputomics CKComputomics commented Jun 21, 2022

Adds the option to use UMIs directly in the pipeline. This can be activated by setting --with_umi
For the extraction step the nf-core sub workflow has been imported. The deduplication step had to be implemented in a new subworkflow. It utilizes the existing bowtie modules to map the reads to a reference genome and deduplicates based on this mapping. The deduplicated reads are merged with the unmapped reads into one file. This behavior can be deactivated by setting --umi_merge_unmapped false.
Using UMIs can result in fast files with very little reads. To few reads can result in a fail of mirtop. this needs to be considered when using this feature.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
    • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
    • If necessary, also make a PR on the nf-core/smrnaseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Christian Kubica and others added 23 commits May 10, 2022 14:35
Reverting changes to a non-linted version and added the umitools modules.
Added the umitools workflow and integrated it into the smrnaseq workflow
Add additional documentation to use UMI tools as part of the pipeline.
Most of the documentation has been copied from nf-core/rnaseq.
The bam2fq module is neccessary to convert the deduplicated bam
files back into a fastq format to be fed into the existing
pipeline.
Added the umitools extract modules.config lines from nf-core/rnaseq
to this pipeline.
Added configurations for umi deduplication.
Initial comit of the umi dedup subworkflow. The workflow combines
already existing modules of the pipeline and nf-core module to
deduplicate the reads by mapping them to the species genome and
re-converting them to fastq after deduplication.
includes the optional umitools deduplication step after the read
QC.
Added additional configuration to change the output file name of
samtools sort.
Added the documentation detailing the output files of the UMI-tools
deduplication step.
After deduplication the reads that remained unaligned to the
provided reference genome are merged with the set of deduplicated
reads to enable the use of the full spectrum of reads, independent
of potential reference bias. This behaviour can be deactivated by
setting --umi_merge_unmapped false
Information on the new --umi_merge_unmapped command were added to
both the CHANGELOG, as well as the output markdown script.
@CKComputomics
Copy link
Contributor Author

@apeltzer there is still a problem with nf-core lint and prettier. The email_template.html file does either pass the knitting or prettier, but the changes made by each cause a fail of the other. Could you have a look at what's going on there.

@apeltzer
Copy link
Member

apeltzer commented Jun 21, 2022

Can you resolve the conflicts with dev before that?
@CKComputomics

@github-actions
Copy link

github-actions bot commented Jun 21, 2022

nf-core lint overall result: Failed ❌

Posted for pipeline commit fcc3ef0

+| ✅ 157 tests passed       |+
!| ❗   1 tests had warnings |!
-| ❌   6 tests failed       |-

❌ Test failures:

  • files_unchanged - .github/CONTRIBUTING.md does not match the template
  • files_unchanged - .github/PULL_REQUEST_TEMPLATE.md does not match the template
  • files_unchanged - .github/workflows/linting.yml does not match the template
  • files_unchanged - lib/NfcoreTemplate.groovy does not match the template
  • multiqc_config - 'assets/multiqc_config.yml' does not contain a matching 'report_comment'.
    The expected comment is:
    This report has been generated by the <a href="https://github.com/nf-core/smrnaseq/tree/dev" target="_blank">nf-core/smrnaseq</a> analysis pipeline. For information about how to interpret these results, please see the <a href="https://nf-co.re/smrnaseq/dev/docs/output" target="_blank">documentation</a>.
    The current comment is:
    This report has been generated by the <a href="https://github.com/nf-core/smrnaseq/releases/tag/dev" target="_blank">nf-core/smrnaseq</a> analysis pipeline. For information about how to interpret these results, please see the <a href="https://nf-co.re/smrnaseq/dev/docs/output" target="_blank">documentation</a>.
  • modules_structure - modules directory structure is outdated. Should be 'modules/nf-core/TOOL/SUBTOOL'

❗ Test warnings:

  • pipeline_todos - TODO string in WorkflowSmrnaseq.groovy: Optionally add in-text citation tools to this list.

✅ Tests passed:

Run details

  • nf-core/tools version 2.11.1
  • Run at 2024-01-11 12:54:06

@CKComputomics
Copy link
Contributor Author

Can you resolve the conflicts with dev before that?
@CKComputomics

Done! Fixed the prettier vs listing issue as well.

@lpantano
Copy link
Contributor

Hi,

has somebody run this in a real dataset? at least without UMI to make sure you get the same results? I don't quite follow whether the trimming will be exactly the same. Where is the params.protocol variable sync with the parameters for nf-core/trimgalore now? Before was in the local/trimgalore. I can try to run this in some of our samples next week to check we get the same results if we don't use the UMI option.

@apeltzer apeltzer modified the milestones: Release 2.1.0, Release 2.2.0 Oct 11, 2022
@apeltzer apeltzer modified the milestones: Release 2.2.0, Release 2.1.1 Patch Feb 17, 2023
@sean-at-tessera
Copy link

@apeltzer I'm currently working with miRNA-seq data using UMIs and would love for this feature to get merged into dev -- is there anything I can use to vet this functionality on the datasets I'm using?

@chaochungkuo
Copy link

Hi, I think this feature is really needed and useful. I just want to reactivate this thread again. I will run a test in the coming week.

@apeltzer
Copy link
Member

apeltzer commented Apr 3, 2023

You can run using the -r CKComputomics:umitools branch, @sean-at-tessera . @chaochungkuo if you want to test things, please let me know if things work fine. We also would have to resolve the merge conflicts and finally get this merged into the pipeline.

@chaochungkuo
Copy link

Hi @apeltzer Thanks. But I cannot run it as you suggested.

nextflow run nf-core/smrnaseq -r CKComputomics:umitools -profile docker --with_umi \
     --umitools_extract_method regex --umitools_bc_pattern '.+AACTGTAGGCACCATCAAT{s<=2}(?P<umi_1>.{12})(?P<discard_2>.*)' \
     --input samplesheet.csv --outdir results --mirtrace_species hsa --mirtrace_protocol qiaseq \
     --three_prime_adapter AACTGTAGGCACCATCAAT --protocol qiaseq \
     --genome GRCh38 \
     --mirna_gtf /data/genomes/hg38/miRNA/hsa.gff3 \
     --mature /data/genomes/spikein/QIASeq_miRNAseq_SpikeIn/mature_with_qiaseq_spikein.fa \
     --hairpin /data/genomes/spikein/QIASeq_miRNAseq_SpikeIn/hairpin_with_qiaseq_spikein.fa

And the error message I got:

N E X T F L O W  ~  version 23.04.0
Pulling nf-core/smrnaseq ...
WARN: Cannot read project manifest -- Cause: Remote resource not found: https://api.github.com/repos/nf-core/smrnaseq/contents/nextflow.config?ref=CKComputomics:umitools
Remote resource not found: https://api.github.com/repos/nf-core/smrnaseq/contents/main.nf?ref=CKComputomics:umitools

Any idea? We used QIAseq™ miRNA Library QC Spike-Ins.

@sean-at-tessera
Copy link

Should it be the below?

nextflow run CKComputomics/smrnaseq -r umitools
...

@chaochungkuo
Copy link

Thanks, @sean-at-tessera it works now. I will report after it is done.

@apeltzer
Copy link
Member

apeltzer commented Apr 4, 2023

Yes sorry, mistakenly thought the branch is already here in smrnaseq.

@sean-at-tessera
Copy link

@chaochungkuo I also tried to look at resolving the merge conflicts, but I think it would take your insight to do quickly. It looks like enough has changed since you implemented umitools that some things need to be renamed.

@sean-at-tessera
Copy link

Screenshot 2023-04-04 at 1 11 15 PM

@chaochungkuo can you document what --bc-pattern needs to be? Should that be pulled from the existing parameter list? I'm working with single-end data, which you may not have tested yet.

@chaochungkuo
Copy link

Hi @sean-at-tessera

I am not the one who implements this. It was done by @CKComputomics.

I also have single-end reads with QIAseq™ miRNA Library QC Spike-Ins. These parameters are specific for this kit:

--umitools_extract_method regex --umitools_bc_pattern '.+AACTGTAGGCACCATCAAT{s<=2}(?P<umi_1>.{12})(?P<discard_2>.*)' \

I got my results but it seems like didn't go through to the end. The processes of umitools takes too much memory and time.
newplot
newplot (1)

The error messages are:

[aa/a4040b] NOTE: Process `NFCORE_SMRNASEQ:SMRNASEQ:DEDUPLICATE_UMIS:UMITOOLS_DEDUP (Patientin_P009)` terminated with an error exit status (137) -- Execution is retried (1)
[35/7645bc] NOTE: Process `NFCORE_SMRNASEQ:SMRNASEQ:DEDUPLICATE_UMIS:UMITOOLS_DEDUP (Kontrolle_K023)` terminated with an error exit status (137) -- Execution is retried (1)
...

I am not sure the root of this issue. However, I will increase the limit and run it again. Any advice is appreciated.

@CKComputomics
Copy link
Contributor Author

@chaochungkuo can you document what --bc-pattern needs to be? Should that be pulled from the existing parameter list? I'm working with single-end data, which you may not have tested yet.

Hi, the --bc-pattern error seems to originate from UMItools directly. The corresponding parameter in the nextflow run would be --umitools_bc_pattern.
I have not used UMItools or this pipeline in a while and would refer you to the original UMI-tools documentation. It should be the UMI barcode pattern to use e.g. 'NNNNNN' indicates that the first 6 nucleotides of the read are from the UMI.

I hope this helps you to solve the issue. If not I will do some digging and see if I can figure out what is going wrong.

@chaochungkuo using more memory sounds like a reasonable option. I have only ever tested the pipeline with small datasets and thus have no idea how this scales on full sets.

@chaochungkuo
Copy link

When I run umitools directly outside of nfcore/smrnaseq, I use the following command:

umi_tools extract --stdin=$1 --stdout=${TRIMMED_FASTQ} --extract-method=regex --bc-pattern='.+AACTGTAGGCACCATCAAT{s<=2}(?P<umi_1>.{12})(?P<discard_2>.*)'

I works fine and I got the trimmed FASTQs I want.

When I pass these parameters into this branch now, I modified them as:

nextflow run nf-core/smrnaseq -r CKComputomics:umitools -profile docker --with_umi \
     --umitools_extract_method regex --umitools_bc_pattern '.+AACTGTAGGCACCATCAAT{s<=2}(?P<umi_1>.{12})(?P<discard_2>.*)' \
     --input samplesheet.csv --outdir results --mirtrace_species hsa --mirtrace_protocol qiaseq \
     --three_prime_adapter AACTGTAGGCACCATCAAT --protocol qiaseq

I thought the name of the parameter is changed from --bc-pattern to --umitools_bc_pattern now.

@sean-at-tessera
Copy link

sean-at-tessera commented Apr 5, 2023

@CKComputomics I have the same problem as @chaochungkuo; specifying --umitools_bc_pattern still yields ValueError: Must supply --bc-pattern for single-end from umitools in process NFCORE_SMRNASEQ:SMRNASEQ:FASTQC_UMITOOLS_TRIMGALORE:UMITOOLS_EXTRACT.

Looking at the file for the extract command, I'm guessing bc_params needs to appear in $args, but then, shouldn't the parameter just be --bc_params?

@chaochungkuo
Copy link

@sean-at-tessera
I just checked the log file of umitool extract and I see the parameter --umitools_bc_pattern I gave to nextflow is properly passed to umi_tools as --bc-pattern:

# UMI-tools version: 1.1.2
# output generated by extract -I Kontrolle_K013_S32_R1_001.fastq.gz -S Kontrolle_K013.umi_extract.fastq.gz --extract-method=regex --bc-pattern=.+AACTGTAGGCACCATCAAT{s<=2}(?P<umi_1>.{12})(?P<discard_2>.*)

However, I still get error message as

Command exit status:
  137

No idea yet...

@chaochungkuo
Copy link

Here is the exact error I received. Could someone help me to diagnose? Thanks.

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_SMRNASEQ:SMRNASEQ:DEDUPLICATE_UMIS:UMITOOLS_DEDUP":
      umitools: $(umi_tools --version 2>&1 | sed 's/^.*UMI-tools version://; s/ *$//')
  END_VERSIONS

Command exit status:
  137

Command output:
  # chrom                                   : None
  # compresslevel                           : 6
  # detection_method                        : None
  # filter_umi                              : None
  # gene_tag                                : None
  # gene_transcript_map                     : None
  # get_umi_method                          : read_id
  # ignore_tlen                             : False
  # ignore_umi                              : False
  # in_sam                                  : False
  # log2stderr                              : False
  # loglevel                                : 1
  # mapping_quality                         : 0
  # method                                  : directional
  # no_sort_output                          : False
  # out_sam                                 : False
  # output_unmapped                         : False
  # paired                                  : False
  # per_cell                                : False
  # per_contig                              : False
  # per_gene                                : False
  # random_seed                             : None
  # read_length                             : False
  # short_help                              : None
  # skip_regex                              : ^(__|Unassigned)
  # soft_clip_threshold                     : 4
  # spliced                                 : False
  # stats                                   : Patientin_P030.umi_dedup.sorted
  # stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
  # stdin                                   : <_io.TextIOWrapper name='Patientin_P030.sorted.bam' mode='r' encoding='UTF-8'>
  # stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
  # stdout                                  : <_io.TextIOWrapper name='Patientin_P030.umi_dedup.sorted.bam' mode='w' encoding='UTF-8'>
  # subset                                  : None
  # threshold                               : 1
  # timeit_file                             : None
  # timeit_header                           : None
  # timeit_name                             : all
  # tmpdir                                  : None
  # umi_sep                                 : _
  # umi_tag                                 : RX
  # umi_tag_delim                           : None
  # umi_tag_split                           : None
  # umi_whitelist                           : None
  # umi_whitelist_paired                    : None
  # unmapped_reads                          : discard
  # unpaired_reads                          : use
  # whole_contig                            : False
  2023-04-06 00:35:28,960 INFO command: dedup -I Patientin_P030.sorted.bam -S Patientin_P030.umi_dedup.sorted.bam --output-stats Patientin_P030.umi_dedup.sorted
  2023-04-06 00:35:52,840 INFO total_umis 13631184
  2023-04-06 00:35:52,840 INFO #umis 664445

@sean-at-tessera
Copy link

@chaochungkuo exit status 137 generally indicates a memory error. Could you allocate more memory and try again?

@sean-at-tessera
Copy link

I was able to almost run the CKComputomics:umitools on my data. umitools did require up to ~10Gb of memory at one point. This might be a similar upper bound for your testing, @chaochungkuo. One of the key issues in getting it to run was passing --umitools_extract_method regex, which I'd neglected to do before and was causing an error.

The pipeline only crashed on one stage in the edgeR step. This is because this PR doesn't include the fix incorporated in this pull request.

@CKComputomics , could you please resolve the merge conflicts with dev? I can run another test then and evaluate the resulting outputs.

@Dtdavidgit
Copy link

Hi guys, just want to activate this thread again, wondering when the UMI handling feature will be added to the repo?
Thanks

@apeltzer apeltzer removed this from the Release 2.2.0 milestone Sep 1, 2023
@apeltzer
Copy link
Member

Ok, will give this a go now that more people requrested it. My hope was that someone is quicker at this but that seems not to be the case ;-)

@apeltzer
Copy link
Member

I will pull in upstream changes, then try to resolve conflicts and merge it

@apeltzer apeltzer changed the base branch from dev to umi-handling January 11, 2024 12:39
@apeltzer apeltzer merged commit 069beb1 into nf-core:umi-handling Jan 11, 2024
2 of 8 checks passed
@apeltzer apeltzer mentioned this pull request Jan 24, 2024
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

7 participants