Sentieon haplotyper refactored #1074

Merged
merged 11 commits into from
Jun 7, 2023

Contributor

@asp8200 asp8200 commented Jun 5, 2023

Making the Sentieon-Haplotyper nf-module more general; in particular, enabling it to run with all possible emit-modes while still being able to output both a vcf and a gvcf in the same run.

Examples of nf-cmd's with the new option --sentieon_haplotyper_emit_mode:

nextflow run main.nf -profile test,targeted --input ./tests/csv/3.0/mapped_single_bam.csv --tools sentieon_haplotyper --step variant_calling --outdir results --sentieon_haplotyper_emit_mode confident

nextflow run main.nf -profile test,targeted --input ./tests/csv/3.0/mapped_single_bam.csv --tools sentieon_haplotyper --step variant_calling --outdir results --sentieon_haplotyper_emit_mode confident,gvcf

nextflow run main.nf -profile test,targeted --input ./tests/csv/3.0/mapped_single_bam.csv --tools sentieon_haplotyper --step variant_calling --outdir results --sentieon_haplotyper_emit_mode all,gvcf
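
As an illustration of how a comma-separated emit-mode value like the ones above could be interpreted, here is a minimal Python sketch. It is not the pipeline's actual Groovy code: the function name `parse_emit_mode` is hypothetical, and the rule that at most one of variant/confident/all may be combined with gvcf is an assumption. The sketch trims whitespace around each token, though on the command line the value still has to arrive as a single shell argument.

```python
# Hypothetical helper, not taken from the Sarek code base.
# Assumption: at most one of these modes may be combined with "gvcf".
VARIANT_MODES = {"variant", "confident", "all"}


def parse_emit_mode(value):
    """Split a comma-separated emit-mode string into (vcf_mode, want_gvcf)."""
    tokens = [t.strip() for t in value.split(",") if t.strip()]
    want_gvcf = "gvcf" in tokens
    vcf_modes = [t for t in tokens if t != "gvcf"]

    unknown = [t for t in vcf_modes if t not in VARIANT_MODES]
    if unknown:
        raise ValueError(f"unknown emit mode(s): {unknown}")
    if len(vcf_modes) > 1:
        raise ValueError("at most one of variant/confident/all may be given")
    if not vcf_modes and not want_gvcf:
        raise ValueError("no emit mode given")

    # vcf_mode is None when only a gvcf was requested.
    vcf_mode = vcf_modes[0] if vcf_modes else None
    return vcf_mode, want_gvcf
```

With this sketch, `parse_emit_mode("confident,gvcf")` yields one VCF mode plus a flag saying a gvcf is also wanted, which mirrors the "vcf and gvcf in the same run" case described above.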

Tests and documentation updated.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/sarek branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@asp8200 asp8200 requested a review from maxulysse as a code owner June 5, 2023 21:25
@asp8200 asp8200 removed the request for review from maxulysse June 5, 2023 21:25

github-actions bot commented Jun 5, 2023

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit c0cd6f0

+| ✅ 152 tests passed       |+
#| ❔   9 tests were ignored |#
!| ❗   1 tests had warnings |!

❗ Test warnings:

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 2.8
  • Run at 2023-06-07 08:42:33

Member

@maxulysse maxulysse left a comment

Looking good

nextflow_schema.json (outdated, resolved)
Contributor Author

asp8200 commented Jun 6, 2023

A slightly better output-section in the sentieon-haplotyper module might look like this:

    output:
    tuple val(meta), path("*.vcf.gz")     , optional:true, emit: vcf
    tuple val(meta), path("*.vcf.gz.tbi") , optional:true, emit: vcf_tbi
    tuple val(meta), path("*.gvcf.gz")    , optional:true, emit: gvcf
    tuple val(meta), path("*.gvcf.gz.tbi"), optional:true, emit: gvcf_tbi
    path "versions.yml"                   , emit: versions

but the problem with that is that in the Sarek-workflow we may subsequently call GATK's MergeVcfs on the gvcf-files, and GATK's MergeVcfs apparently requires the gvcf-files to have the extension .vcf.gz (or possibly .vcf, but that is beside the point).
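
To see how the glob patterns in the sketched output-section behave under different naming schemes, here is a small Python check. It uses `fnmatch` as a stand-in for Nextflow's own glob matching (an assumption that should hold for simple suffix patterns like these), and the file names are made up for illustration.

```python
from fnmatch import fnmatchcase

# Made-up file names covering the naming schemes under discussion.
NAMES = [
    "test.sentieon.vcf.gz",
    "test.sentieon.gvcf.gz",
    "test.sentieon.g.vcf.gz",
]


def matching(pattern, names=NAMES):
    """Return the names matched by a shell-style glob pattern."""
    return [n for n in names if fnmatchcase(n, pattern)]


for pattern in ("*.vcf.gz", "*.gvcf.gz", "*.g.vcf.gz"):
    print(pattern, "->", matching(pattern))
```

In this sketch `*.vcf.gz` does not pick up `.gvcf.gz` files, but it does pick up files ending in `.g.vcf.gz`. So if gvcf-files are given a name ending in `.vcf.gz` to keep MergeVcfs happy, a module emitting both kinds of file would need patterns that tell them apart.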

Error msg from MergeVcfs when feeding it gvcf-files with the extension .gvcf.gz:

$ gatk --java-options "-Xmx3276M" MergeVcfs --INPUT test.haplotyper.chr22_20001-40001.sentieon.gvcf.gz --INPUT test.haplotyper.chr22_2-15000.sentieon.gvcf.gz --OUTPUT foo.g.vcf.gz --SEQUENCE_DICTIONARY genome.dict --TMP_DIR .
Using GATK jar /home/ubuntu/miniconda3/envs/gatk4/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx3276M -jar /home/ubuntu/miniconda3/envs/gatk4/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar MergeVcfs --INPUT test.haplotyper.chr22_20001-40001.sentieon.gvcf.gz --INPUT test.haplotyper.chr22_2-15000.sentieon.gvcf.gz --OUTPUT foo.g.vcf.gz --SEQUENCE_DICTIONARY genome.dict --TMP_DIR .
07:55:25.957 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ubuntu/miniconda3/envs/gatk4/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Jun 06 07:55:26 UTC 2023] MergeVcfs --INPUT test.haplotyper.chr22_20001-40001.sentieon.gvcf.gz --INPUT test.haplotyper.chr22_2-15000.sentieon.gvcf.gz --OUTPUT foo.g.vcf.gz --SEQUENCE_DICTIONARY genome.dict --TMP_DIR . --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX true --CREATE_MD5_FILE false --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Tue Jun 06 07:55:26 UTC 2023] Executing as ubuntu@ip-172-31-6-196 on Linux 5.15.0-1026-aws amd64; OpenJDK 64-Bit Server VM 17.0.3-internal+0-adhoc..src; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.4.0.0
[Tue Jun 06 07:55:26 UTC 2023] picard.vcf.MergeVcfs done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=104857600
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
java.io.UncheckedIOException: java.nio.charset.MalformedInputException: Input length = 1
	at java.base/java.nio.file.FileChannelLinesSpliterator.readLine(FileChannelLinesSpliterator.java:192)
	at java.base/java.nio.file.FileChannelLinesSpliterator.forEachRemaining(FileChannelLinesSpliterator.java:132)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at htsjdk.samtools.util.IOUtil.unrollPaths(IOUtil.java:1188)
	at picard.vcf.MergeVcfs.doWork(MergeVcfs.java:171)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:289)
	at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
	at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: java.nio.charset.MalformedInputException: Input length = 1
	at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:188)
	at java.base/java.io.BufferedReader.fill(BufferedReader.java:162)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:329)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:396)
	at java.base/java.nio.file.FileChannelLinesSpliterator.readLine(FileChannelLinesSpliterator.java:190)
	... 14 more

Feeding it .g.vcf.gz files works without any problem.
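
The stack trace passes through htsjdk's IOUtil.unrollPaths and a line reader, which suggests that MergeVcfs treats an input without a recognized VCF extension as a text file listing further paths, and then fails to decode the gzip bytes as text. The Python sketch below mimics that behavior under stated assumptions: the extension list and the helper names `unroll_path` and `demo` are hypothetical, and this is not the htsjdk implementation.

```python
import gzip
import os
import tempfile

# Assumption: extensions that are accepted as VCF inputs directly.
KNOWN_VCF_EXTENSIONS = (".vcf", ".vcf.gz", ".bcf")


def unroll_path(path):
    """Return [path] for a known VCF extension, else read path as a list file."""
    if path.endswith(KNOWN_VCF_EXTENSIONS):
        return [path]
    # Unknown extension: treat the file as plain text with one path per line.
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def demo():
    """Write the same gzipped content under two names and try to unroll both."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("test.gvcf.gz", "test.g.vcf.gz"):
            path = os.path.join(tmp, name)
            with gzip.open(path, "wt") as out:
                out.write("##fileformat=VCFv4.2\n")
            try:
                unroll_path(path)
                results[name] = "ok"
            except UnicodeDecodeError:
                # Gzip bytes are not valid UTF-8 text.
                results[name] = "decode error"
    return results


print(demo())
```

In this sketch the `.gvcf.gz` file is read as text and fails to decode, while the identical content named `.g.vcf.gz` is passed through untouched, which is consistent with the behavior reported above.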

@FriederikeHanssen
Contributor

strong argument for me for g.vcf then. I don't really want to anger MergeVCF for a point placed at a different location :D

Contributor Author

asp8200 commented Jun 6, 2023

@maxulysse : On the current PR (a9e82ee) output vcf-files are named .unfiltered.vcf.gz if intervals are used and .sentieon.vcf.gz if no intervals are used 🤦‍♂️

The reason I changed .unfiltered.vcf.gz to .sentieon.vcf.gz in the sentieon-haplotyper-module is that one might say that the module does some filtering if the emit-mode is set to confident or variant 🤔

However, on second thought, I think I'll just go back to giving the vcf-files the extension .unfiltered.vcf.gz.

@asp8200 asp8200 changed the title DRAFT: Sentieon haplotyper refactored Sentieon haplotyper refactored Jun 7, 2023
@asp8200 asp8200 requested a review from maxulysse June 7, 2023 09:17
@maxulysse maxulysse merged commit 45ce0da into sentieon Jun 7, 2023
@maxulysse maxulysse deleted the sentieon_haplotyper_refactored branch June 7, 2023 17:50

4 participants