Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Develop branch: Error in rule vardict_tumor_normal #1286

Closed
ivadym opened this issue Oct 17, 2023 · 21 comments
Closed

Develop branch: Error in rule vardict_tumor_normal #1286

ivadym opened this issue Oct 17, 2023 · 21 comments
Labels
Bug Something isn't working Urgency Large
Milestone

Comments

@ivadym
Copy link
Contributor

ivadym commented Oct 17, 2023

Describe the bug
When processing variants in specific regions a critical exception occurred: java.lang.IndexOutOfBoundsException with a negative bitIndex value.

After restarting the analysis, it succeeds.

If workflow, which rules

  • vardict_tumor_normal (balsamic TGA-TN and balsamic-umi TN)

Screenshots

There was Exception while processing variant on 55271360 on region 7:55271084-55271374. The processing will be continued from the next variant.
java.lang.IndexOutOfBoundsException: bitIndex < 0: -2039770131
	at java.util.BitSet.get(BitSet.java:623)
	at htsjdk.samtools.AbstractBAMFileIndex.query(AbstractBAMFileIndex.java:281)
	at htsjdk.samtools.DiskBasedBAMFileIndex.getSpanOverlapping(DiskBasedBAMFileIndex.java:61)
	at htsjdk.samtools.BAMFileReader.getFileSpan(BAMFileReader.java:935)
	at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:952)
	at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:612)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:538)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:400)
	at com.astrazeneca.vardict.data.SamView.<init>(SamView.java:21)
	at com.astrazeneca.vardict.modules.VariationRealigner.noPassingReads(VariationRealigner.java:2060)
	at com.astrazeneca.vardict.modules.VariationRealigner.realigndel(VariationRealigner.java:617)
	at com.astrazeneca.vardict.modules.VariationRealigner.realignIndels(VariationRealigner.java:387)
	at com.astrazeneca.vardict.modules.VariationRealigner.process(VariationRealigner.java:88)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
	at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
	at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
	at com.astrazeneca.vardict.modes.AbstractMode.pipeline(AbstractMode.java:49)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.combineAnalysis(SomaticPostProcessModule.java:404)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.printVariationsFromFirstSample(SomaticPostProcessModule.java:198)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.callingForBothSamples(SomaticPostProcessModule.java:143)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.accept(SomaticPostProcessModule.java:87)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.accept(SomaticPostProcessModule.java:28)
	at java.util.concurrent.CompletableFuture.biAccept(CompletableFuture.java:1190)
	at java.util.concurrent.CompletableFuture.biAcceptStage(CompletableFuture.java:1206)
	at java.util.concurrent.CompletableFuture.thenAcceptBoth(CompletableFuture.java:2056)
	at com.astrazeneca.vardict.modes.SomaticMode.processBothBamsInPipeline(SomaticMode.java:125)
	at com.astrazeneca.vardict.modes.SomaticMode.access$000(SomaticMode.java:31)
	at com.astrazeneca.vardict.modes.SomaticMode$SomdictWorker.call(SomaticMode.java:96)
	at com.astrazeneca.vardict.modes.SomaticMode$SomdictWorker.call(SomaticMode.java:79)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
There was Exception while processing variant on 55271360 on region 7:55271084-55271374. The processing will be continued from the next variant.
java.lang.IndexOutOfBoundsException: bitIndex < 0: -2039770131
	at java.util.BitSet.get(BitSet.java:623)
	at htsjdk.samtools.AbstractBAMFileIndex.query(AbstractBAMFileIndex.java:281)
	at htsjdk.samtools.DiskBasedBAMFileIndex.getSpanOverlapping(DiskBasedBAMFileIndex.java:61)
	at htsjdk.samtools.BAMFileReader.getFileSpan(BAMFileReader.java:935)
	at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:952)
	at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:612)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:538)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:400)
	at com.astrazeneca.vardict.data.SamView.<init>(SamView.java:21)
	at com.astrazeneca.vardict.modules.VariationRealigner.noPassingReads(VariationRealigner.java:2060)
	at com.astrazeneca.vardict.modules.VariationRealigner.realigndel(VariationRealigner.java:617)
	at com.astrazeneca.vardict.modules.VariationRealigner.realignIndels(VariationRealigner.java:387)
	at com.astrazeneca.vardict.modules.VariationRealigner.process(VariationRealigner.java:88)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
	at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
	at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
	at com.astrazeneca.vardict.modes.AbstractMode.pipeline(AbstractMode.java:49)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.combineAnalysis(SomaticPostProcessModule.java:404)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.printVariationsFromFirstSample(SomaticPostProcessModule.java:198)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.callingForBothSamples(SomaticPostProcessModule.java:143)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.accept(SomaticPostProcessModule.java:87)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.accept(SomaticPostProcessModule.java:28)
	at java.util.concurrent.CompletableFuture.biAccept(CompletableFuture.java:1190)
	at java.util.concurrent.CompletableFuture.biAcceptStage(CompletableFuture.java:1206)
	at java.util.concurrent.CompletableFuture.thenAcceptBoth(CompletableFuture.java:2056)
	at com.astrazeneca.vardict.modes.SomaticMode.processBothBamsInPipeline(SomaticMode.java:125)
	at com.astrazeneca.vardict.modes.SomaticMode.access$000(SomaticMode.java:31)
	at com.astrazeneca.vardict.modes.SomaticMode$SomdictWorker.call(SomaticMode.java:96)
	at com.astrazeneca.vardict.modes.SomaticMode$SomdictWorker.call(SomaticMode.java:79)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Version (please complete the following information):
balsamic --version develop

Additional context
Previously we had a similar issue: #1271

@ivadym ivadym added Bug Something isn't working Urgency Large labels Oct 17, 2023
@ivadym ivadym added this to BALSAMIC Oct 17, 2023
@github-project-automation github-project-automation bot moved this to Todo in BALSAMIC Oct 17, 2023
@ivadym ivadym moved this from Todo to Planned in BALSAMIC Oct 17, 2023
@ivadym ivadym added this to the Release 13 milestone Oct 24, 2023
@ivadym
Copy link
Contributor Author

ivadym commented Nov 1, 2023

We have also vardict failing in tumor-only now:

Critical exception occurs on region: 19:6827911-6829086, program will be stopped.
java.util.concurrent.CompletionException: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:618)
	at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
	at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
	at com.astrazeneca.vardict.modes.AbstractMode.pipeline(AbstractMode.java:49)
	at com.astrazeneca.vardict.modes.SimpleMode.processBamInPipeline(SimpleMode.java:103)
	at com.astrazeneca.vardict.modes.SimpleMode.access$000(SimpleMode.java:29)
	at com.astrazeneca.vardict.modes.SimpleMode$VardictWorker.call(SimpleMode.java:86)
	at com.astrazeneca.vardict.modes.SimpleMode$VardictWorker.call(SimpleMode.java:72)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
	at htsjdk.samtools.AbstractBAMFileIndex.query(AbstractBAMFileIndex.java:297)
	at htsjdk.samtools.DiskBasedBAMFileIndex.getSpanOverlapping(DiskBasedBAMFileIndex.java:61)
	at htsjdk.samtools.BAMFileReader.getFileSpan(BAMFileReader.java:935)
	at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:952)
	at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:612)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:538)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:400)
	at com.astrazeneca.vardict.data.SamView.<init>(SamView.java:21)
	at com.astrazeneca.vardict.modules.VariationRealigner.noPassingReads(VariationRealigner.java:2060)
	at com.astrazeneca.vardict.modules.VariationRealigner.realigndel(VariationRealigner.java:617)
	at com.astrazeneca.vardict.modules.VariationRealigner.realignIndels(VariationRealigner.java:387)
	at com.astrazeneca.vardict.modules.VariationRealigner.process(VariationRealigner.java:88)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
	... 11 more
[Tue Oct 31 19:04:24 2023]
Error in rule vardict_tumor_only:
There was Exception while processing variant on 112154530 on region 5:112154487-112155216. The processing will be continued from the next variant.
java.lang.IllegalArgumentException
	at java.nio.Buffer.position(Buffer.java:244)
	at htsjdk.samtools.MemoryMappedFileBuffer.skipBytes(MemoryMappedFileBuffer.java:49)
	at htsjdk.samtools.AbstractBAMFileIndex.skipBytes(AbstractBAMFileIndex.java:451)
	at htsjdk.samtools.AbstractBAMFileIndex.query(AbstractBAMFileIndex.java:291)
	at htsjdk.samtools.DiskBasedBAMFileIndex.getSpanOverlapping(DiskBasedBAMFileIndex.java:61)
	at htsjdk.samtools.BAMFileReader.getFileSpan(BAMFileReader.java:935)
	at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:952)
	at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:612)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:538)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:400)
	at com.astrazeneca.vardict.data.SamView.<init>(SamView.java:21)
	at com.astrazeneca.vardict.modules.VariationRealigner.noPassingReads(VariationRealigner.java:2060)
	at com.astrazeneca.vardict.modules.VariationRealigner.realigndel(VariationRealigner.java:617)
	at com.astrazeneca.vardict.modules.VariationRealigner.realignIndels(VariationRealigner.java:387)
	at com.astrazeneca.vardict.modules.VariationRealigner.process(VariationRealigner.java:88)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
	at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
	at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
	at com.astrazeneca.vardict.modes.AbstractMode.pipeline(AbstractMode.java:49)
	at com.astrazeneca.vardict.modes.SimpleMode.processBamInPipeline(SimpleMode.java:103)
	at com.astrazeneca.vardict.modes.SimpleMode.access$000(SimpleMode.java:29)
	at com.astrazeneca.vardict.modes.SimpleMode$VardictWorker.call(SimpleMode.java:86)
	at com.astrazeneca.vardict.modes.SimpleMode$VardictWorker.call(SimpleMode.java:72)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

@ivadym
Copy link
Contributor Author

ivadym commented Nov 15, 2023

Errors appearing in the integration tests:

TN-WGS

Critical exception occurs on region: 10:43604378-43612279, program will be stopped.
java.util.concurrent.CompletionException: java.lang.IllegalArgumentException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
	at com.astrazeneca.vardict.collection.DirectThreadExecutor.execute(DirectThreadExecutor.java:8)
	at java.util.concurrent.CompletableFuture.asyncSupplyStage(CompletableFuture.java:1618)
	at java.util.concurrent.CompletableFuture.supplyAsync(CompletableFuture.java:1843)
	at com.astrazeneca.vardict.modes.AbstractMode.pipeline(AbstractMode.java:46)
	at com.astrazeneca.vardict.modes.SomaticMode.processBothBamsInPipeline(SomaticMode.java:122)
	at com.astrazeneca.vardict.modes.SomaticMode.access$000(SomaticMode.java:31)
	at com.astrazeneca.vardict.modes.SomaticMode$SomdictWorker.call(SomaticMode.java:96)
	at com.astrazeneca.vardict.modes.SomaticMode$SomdictWorker.call(SomaticMode.java:79)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException
	at java.nio.Buffer.position(Buffer.java:244)
	at htsjdk.samtools.MemoryMappedFileBuffer.skipBytes(MemoryMappedFileBuffer.java:49)
	at htsjdk.samtools.AbstractBAMFileIndex.skipBytes(AbstractBAMFileIndex.java:451)
	at htsjdk.samtools.AbstractBAMFileIndex.query(AbstractBAMFileIndex.java:291)
	at htsjdk.samtools.DiskBasedBAMFileIndex.getSpanOverlapping(DiskBasedBAMFileIndex.java:61)
	at htsjdk.samtools.BAMFileReader.getFileSpan(BAMFileReader.java:935)
	at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:952)
	at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:612)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:538)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:400)
	at com.astrazeneca.vardict.data.SamView.<init>(SamView.java:21)
	at com.astrazeneca.vardict.modules.RecordPreprocessor.nextReader(RecordPreprocessor.java:76)
	at com.astrazeneca.vardict.modules.RecordPreprocessor.<init>(RecordPreprocessor.java:50)
	at com.astrazeneca.vardict.modules.SAMFileParser.process(SAMFileParser.java:14)
	at com.astrazeneca.vardict.modes.AbstractMode.lambda$pipeline$0(AbstractMode.java:47)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
	... 12 more

UMI TN-WGS

java.util.concurrent.CompletionException: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:618)
	at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
	at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
	at com.astrazeneca.vardict.modes.AbstractMode.pipeline(AbstractMode.java:49)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.combineAnalysis(SomaticPostProcessModule.java:404)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.printVariationsFromFirstSample(SomaticPostProcessModule.java:198)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.callingForBothSamples(SomaticPostProcessModule.java:143)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.accept(SomaticPostProcessModule.java:87)
	at com.astrazeneca.vardict.postprocessmodules.SomaticPostProcessModule.accept(SomaticPostProcessModule.java:28)
	at java.util.concurrent.CompletableFuture.biAccept(CompletableFuture.java:1190)
	at java.util.concurrent.CompletableFuture.biAcceptStage(CompletableFuture.java:1206)
	at java.util.concurrent.CompletableFuture.thenAcceptBoth(CompletableFuture.java:2056)
	at com.astrazeneca.vardict.modes.SomaticMode.processBothBamsInPipeline(SomaticMode.java:125)
	at com.astrazeneca.vardict.modes.SomaticMode.access$000(SomaticMode.java:31)
	at com.astrazeneca.vardict.modes.SomaticMode$SomdictWorker.call(SomaticMode.java:96)
	at com.astrazeneca.vardict.modes.SomaticMode$SomdictWorker.call(SomaticMode.java:79)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
	at htsjdk.samtools.AbstractBAMFileIndex.query(AbstractBAMFileIndex.java:297)
	at htsjdk.samtools.DiskBasedBAMFileIndex.getSpanOverlapping(DiskBasedBAMFileIndex.java:61)
	at htsjdk.samtools.BAMFileReader.getFileSpan(BAMFileReader.java:935)
	at htsjdk.samtools.BAMFileReader.createIndexIterator(BAMFileReader.java:952)
	at htsjdk.samtools.BAMFileReader.query(BAMFileReader.java:612)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:533)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.query(SamReader.java:538)
	at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryOverlapping(SamReader.java:400)
	at com.astrazeneca.vardict.data.SamView.<init>(SamView.java:21)
	at com.astrazeneca.vardict.modules.VariationRealigner.noPassingReads(VariationRealigner.java:2060)
	at com.astrazeneca.vardict.modules.VariationRealigner.realigndel(VariationRealigner.java:617)
	at com.astrazeneca.vardict.modules.VariationRealigner.realignIndels(VariationRealigner.java:387)
	at com.astrazeneca.vardict.modules.VariationRealigner.process(VariationRealigner.java:88)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
	... 19 more

@mathiasbio
Copy link
Collaborator

In this issue which sounds similar AstraZeneca-NGS/VarDict#167 the author says resorting the bamfile would solve their issue. I'll make some tests!

@ivadym
Copy link
Contributor Author

ivadym commented Nov 15, 2023

I'm not sure how transferable are the issues in VarDict to VarDictJava though 🤔

@mathiasbio
Copy link
Collaborator

I don't know either, but it is one of the areas of the analysis where there may have been some changes between the previous version of balsamic and the current release, with the removal of concatenation and parallel alignment. Could be worth a try!

@mathiasbio
Copy link
Collaborator

I have tried 3 different things now, independently. TLDR: Fiddled with the bamfiles - still confused - still using a lot more vmem than in production

  1. I tried adding sort by position: Add resort bam #1331

Ran the same case 6 times, and once it failed vardict

Critical exception occurs on region: 1:120461884-120462174, program will be stopped.
java.util.concurrent.CompletionException: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)

Me and @ivadym looked at the benchmarks of the VarDict rule and saw that in some cases the virtual memory had increased by roughly 150X since previous validation. This seemed like a likely explanation behind this random error, that maybe sometimes the rules which took up an unexpected amount of memory was put on an already heavily loaded node and then crashed. (though one could have hoped for more informative errors)

Since the only thing that seemed to have changed since last validation was the bam-files I decided to try test some of the post-processing steps that exists in the production version:

  1. I tried removing the extra read groups that was added during parallel alignment using:

singularity exec --bind /home align_qc.sif picard -Xmx150g AddOrReplaceReadGroups -RGPU ILLUMINAi -RGID NORMAL -RGSM NORMAL -RGPL ILLUMINAi -RGLB ILLUMINAi -MAX_RECORDS_IN_RAM 1000000 -CREATE_INDEX true -CREATE_MD5_FILE true -INPUT normal.ACC.dedup_sorted.bam -OUTPUT normal.ACC.dedup_sorted_rg.bam

But the max_vms from the benchmarks looks similar...

  1. I tried adding FixMate step.

This apparently also does:

INFO    2023-11-23 14:04:38     FixMateInformation      Sorting input into queryname order.
INFO    2023-11-23 14:10:54     FixMateInformation      Sorting by queryname complete.
INFO    2023-11-23 14:10:54     FixMateInformation      Output will be sorted by coordinate
INFO    2023-11-23 14:10:54     FixMateInformation      Traversing query name sorted records and fixing up mate pair information.
INFO    2023-11-23 14:10:54     FixMateInformation      Seen many non-increasing record positions. Printing Read-names as well.

And Sorting input into queryname order. seemed significant, as maybe VarDict performs better if the reads are in queryname rather than position. (this is also interesting with regards to downstream tools in WGS if this is what we're doing in production, maybe it slows down analysis substantially as it's mostly standard to sort by position I think).

Anyway...I ran this (similar to what we run in production):
singularity exec --bind /home align_qc.sif picard -Xmx150g FixMateInformation -ADD_MATE_CIGAR true -CREATE_INDEX true -CREATE_MD5_FILE true -INPUT normal.ACC.dedup_sorted.bam -OUTPUT normal.ACC.dedup_sorted_matefix2.bam

But again...the max_vms from VarDict looks the same.

@fevac
Copy link
Contributor

fevac commented Nov 23, 2023

I guess there are no changes in numbers of cpus or tmp folder being used?

@fevac
Copy link
Contributor

fevac commented Nov 23, 2023

sorry, not sure I follow everything, is the conclusion that the issue is in the bam files? all the rest is the same?
but also, does it complete successfully after retrying? or is it failing consistently?

@mathiasbio
Copy link
Collaborator

I didn't check the CPUs, or the tmp folders. Basically just saw the greatly increased memory and it seemed like a likely culprit. Basically the error seems random, and it also completes when retrying, so not failing consistently. I still wonder if it has something to do with the bamfiles but I'm not sure how since I have tested most things that seemed relevant in production:

  • Removal of unmapped reads (not tested)
  • 1 read group in bam (tested, no effect)
  • sort bam by query in fixmate (tested, no effect)
  • add mate MQ info from fixmate (tested, no effect)
  • concatenate fastqs before...(not tested)

I think I'll try concatenating the fastqs just as a sanity check

@fevac
Copy link
Contributor

fevac commented Nov 24, 2023

but it's weird that is not consistent and finishes when retrying. Is it related to specific nodes?

@mathiasbio
Copy link
Collaborator

We checked that a little too, but maybe not as extensively as we should, but it seems to occur more frequently on node2 and 4, but we have seen other nodes as well.

@mathiasbio
Copy link
Collaborator

mathiasbio commented Nov 24, 2023

Tried concatenating fastq inputs before starting. Same increased vmem, and multiple vardict errors:

Critical exception occurs on region: 16:20237573-20237774, program will be stopped.
java.util.concurrent.CompletionException: java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)

on node 13, 21, 21 (again), and 13 again

@mathiasbio
Copy link
Collaborator

comparing benchmarks all these metrics is increased from recent validation to my implement parallel alignment branch:

max_rss: Maximum "Resident Set Size”, this is the non-swapped physical memory a process has used.
max_vms: Maximum “Virtual Memory Size”, this is the total amount of virtual memory used by the process.
max_uss: Unique Set Size”, this is the memory which is unique to a process and which would be freed if the process was terminated right now.
max_pss: “Proportional Set Size”, is the amount of memory shared with other processes, accounted in a way that the amount is divided evenly between the processes that share it (Linux only)

To give some values. Average values for all VarDict jobs from recent validation (it's run 1 time per chromosome)
max_rss = 8,7812
max_vms = 217,7804
max_uss = 7,3076
max_pss = 7,7916
Compared to same case run in my old implement parallel alignment branch:
max_rss = 9430,9336
max_vms = 36240,0828
max_uss = 9383,656
max_pss = 9386,9888

@ivadym
Copy link
Contributor Author

ivadym commented Nov 24, 2023

A production environment in Stage behaves as expected: vardict doesn't fail, and the memory is consistent with what we see in production.

I've also increased the Java heap space to 64G (Xmx64G) and the number of cores:

  • acetuna: completed succesfully
  • setamoeba: completed sucessfully
  • unitedbeagle: failed
    Critical exception occurs on region: 20:39794721-39795011, program will be stopped.
    java.util.concurrent.CompletionException: java.lang.InternalError: a fault occurred in a recent unsafe memory access 
    operation in compiled Java code
    
  • savinghorse: failed
    There was Exception while processing variant on 407630 on region 6:407355-407699. The processing will be continued 
    from the next variant.
    java.lang.IndexOutOfBoundsException: bitIndex < 0: -1588120723
    

We're getting two different errors, and the IndexOutOfBoundsException, which we though was solved in #1271

@mathiasbio
Copy link
Collaborator

Apparently there's no evidence that we have increased the memory usage in VarDict in the new release, instead the differences in numbers between current production and the release version is due to us updating Snakemake which solved this bug: snakemake/snakemake#1671 where benchmark stats were not accurately reported from jobs using singularity...

So back to square one with the VarDict issue.

@ivadym ivadym pinned this issue Nov 24, 2023
@ivadym
Copy link
Contributor Author

ivadym commented Nov 24, 2023

Excluding the container as the cause of this issue, vardict also fails when using a varcall_py3 container from production

@mathiasbio
Copy link
Collaborator

Yes I saw that too. I also tried:

  1. using an old bamfile from the previous validation, but I only ran it once, but that worked (however checking the benchmarks using this old file showed similar mem values as when using a bamfile from the release-version).
  2. I tried reverting the workflow back to the production, at least partially, by reverting to doing concatenation, using bwa-mem, and picard markduplicates. But the error remains there. I'll try next adding the final steps of the post-processing and run again.

@mathiasbio
Copy link
Collaborator

mathiasbio commented Nov 27, 2023

Summary:

Vardicts fails on bamfiles from:

  • Sentieon, Dedup, Sort (roughly 1 in 6)
  • Sentieon, Dedup, Sort + FixMate (1/1)
  • Sentieon, Dedup, Sort (on concatenated files) (1/1)
  • Sentieon, Dedup, Sort (using current production container for VarDict) (roughly 1 in 6)
  • Concatenate, Bwa mem, Picard MarkDups (roughly 2 / 6)
  • Concatenate, Bwa mem, Picard MarkDups, FixMate, remove unmapped, AddRG (roughly 1/6) (BASICALLY BACK TO PRODUCTION)
  • Sentieon, Dedup, Sort, VarDict with increased memory and threads (3/6)

Succeeds:

@mathiasbio
Copy link
Collaborator

At the moment there is no evidence that removing unmapped reads helps in this issue. So maybe this can be removed in a future cleanup PR when things are less hectic.

@fevac
Copy link
Contributor

fevac commented Nov 28, 2023

Agree. Amazing job @mathiasbio ! 🌟

@ivadym
Copy link
Contributor Author

ivadym commented Nov 30, 2023

❤️ 🎖️ @mathiasbio

Closing! #1332

@ivadym ivadym closed this as completed Nov 30, 2023
@github-project-automation github-project-automation bot moved this from Planned to Completed in BALSAMIC Nov 30, 2023
@ivadym ivadym unpinned this issue Dec 1, 2023
@ivadym ivadym moved this from Completed to Archived in BALSAMIC Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working Urgency Large
Projects
Status: Archived
Development

No branches or pull requests

3 participants