GenomicsDBImport: Modifications in order to address GATK issue #3269 #4645
Conversation
@kgururaj Thanks, this branch passes the test case that was failing for me before.
Thanks for the quick turnaround @kgururaj !
There is currently one failing integration test:
        : getFeatureReadersSerially(sampleNameToVcfPath, batchSize, index);
}

private List<GenomicsDBImportConfiguration.Partition> generatePartitionListFromIntervals(List<ChromosomeInterval> chromosomeIntevals) {
typo: intevals
Fixed
@droazen - that failure is related to the question I sent Louis and James - I'll forward you the same email
First pass review complete -- back to @francares and @kgururaj for changes. @lbergelson will also chime in with a separate review shortly.
As noted above, one of the GenomicsDBImport integration tests is currently failing. Do you know why?
Also, could you please provide a complete list of the features/changes introduced in this PR relative to the version in master, as it's a bit hard to reconstruct from the GitHub history. Thanks!
if (arrayFolder.exists()) {
    exportConfigurationBuilder.setArrayName(GenomicsDBConstants.DEFAULT_ARRAY_NAME);
} else {
    exportConfigurationBuilder.setGenerateArrayNameFromPartitionBounds(true);
Can you add a comment explaining what this setGenerateArrayNameFromPartitionBounds(true) call does?
Done
logger.info("Done importing batch " + batchCount + "/" + totalBatchCount);
try {
    importer = new GenomicsDBImporter(importConfig);
    importer.executeImport();
Is this batching done internally in executeImport() now? If so, are there regular status messages emitted to the logger before/after each batch?
Yes
Correction - progress isn't printed now. Will try to figure this out
Ok, thanks -- regular logger updates before/after each batch are crucial for us, as our GenomicsDBImport jobs tend to be very long-running.
In the latest version I'm seeing a lot of
16:26:48.128 INFO GenomicsDBImport - Starting batch input file preload
16:27:11.629 INFO GenomicsDBImport - Finished batch preload
but I prefer the old "batch X out of Y" format. We have import jobs that run for 24 hours and it's nice to know how much progress has been made.
Done
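A minimal sketch of the "batch X out of Y" progress messages being requested here; the loop and totalBatchCount are illustrative, and System.out.println stands in for the tool's logger.info calls so the sketch runs without logging configuration:

```java
public class BatchProgressSketch {
    public static void main(String[] args) {
        // Illustrative values, not the tool's actual code.
        int totalBatchCount = 3;
        for (int batchCount = 1; batchCount <= totalBatchCount; batchCount++) {
            System.out.println("Importing batch " + batchCount + "/" + totalBatchCount);
            // ... the per-batch import (e.g. executeImport()) would run here ...
            System.out.println("Done importing batch " + batchCount + "/" + totalBatchCount);
        }
    }
}
```

The point of the "X/Y" form is that a user watching a 24-hour job can estimate remaining time, which a bare "started/finished preload" pair does not allow.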
I'm not getting any logging from querying the GDB either, e.g. running GenotypeGVCFs.
I see lines:
14:46:01.216 INFO ProgressMeter - 1:69417 0.3 1000 3505.9
14:46:05.059 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
14:50:01.511 INFO ProgressMeter - 1:876592 4.3 7000 1631.6
14:50:13.315 INFO ProgressMeter - 1:892516 4.5 12000 2674.5
Is that adequate? The master branch also seems to print something similar.
How many samples was that? My problem is that I seem to be getting updates every X variants, but for 80,000 samples that takes a very long time.
//TODO: Fix this.
// BiConsumer<Map<String, FeatureReader<VariantContext>>, Integer> closeReaders = (readers, batchCount) -> {
//     progressMeter.update(intervals.get(0));
//     logger.info("Done importing batch " + batchCount + "/" + totalBatchCount);
Can you explain what this TODO is? Are the readers getting closed properly? (I notice that you deleted the closeReaders() method below...)
Spurious comment - yes, the readers are being closed inside GenomicsDBImporter
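As a rough sketch of what closing a batch of readers amounts to (the Map of Closeable readers and the closeAll helper are hypothetical stand-ins, not the actual GenomicsDBImporter API):

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class ReaderCleanupSketch {
    public static void main(String[] args) {
        // Two fake "readers" that count how many times close() is called.
        AtomicInteger closed = new AtomicInteger();
        Map<String, Closeable> readers = new LinkedHashMap<>();
        readers.put("sample1", closed::incrementAndGet);
        readers.put("sample2", closed::incrementAndGet);
        closeAll(readers);
        System.out.println(closed.get());
    }

    // Close every reader in the batch, wrapping the checked IOException.
    static void closeAll(Map<String, ? extends Closeable> readers) {
        readers.forEach((sample, reader) -> {
            try {
                reader.close();
            } catch (IOException e) {
                throw new RuntimeException("Failed to close reader for " + sample, e);
            }
        });
    }
}
```

Closing each reader promptly after its batch matters here because, as discussed later in this thread, open readers (and their indexes) dominate memory usage.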
logger.info("GenomicsDB consolidation started");
GenomicsDBImporter.consolidateTileDBArray(workspace, GenomicsDBConstants.DEFAULT_ARRAY_NAME);
logger.info("GenomicsDB consolidation completed");
}
I'm assuming that consolidation and writing the JSONs are handled internally now?
Yes - see this line
@@ -115,7 +120,7 @@
 * </ul>
You need to update the tool documentation to reflect the fact that multiple intervals are now supported. Also include any extra information the user might need to know (e.g., are the intervals required to be contiguous?).
Fixed doc - intervals need not be contiguous
Added a test for non-adjacent intervals.
private static final String COMBINED = largeFileTestDir + "gvcfs/combined.gatk3.7.g.vcf.gz";
private static final String COMBINED_MULTI_INTERVAL = largeFileTestDir + "gvcfs/combined_multi_interval.gatk3.7.g.vcf.gz";
Include a comment describing exactly how this file was created (including the command line used).
Done
private static final ArrayList<SimpleInterval> MULTIPLE_INTERVALS = new ArrayList<SimpleInterval>(Arrays.asList(
    new SimpleInterval("chr20", 17960187, 17970000),
    new SimpleInterval("chr20", 17970001, 17980000),
    new SimpleInterval("chr20", 17980001, 17981445)
When specifying multiple intervals, are they required to be contiguous? If not, could you please add a separate test case involving non-contiguous intervals as well?
Also add a third test case involving multiple intervals from different contigs, as that is an important use case for our users.
Intervals need not be contiguous
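A small illustration of the non-contiguous case, using a hypothetical Interval record as a stand-in for GATK's SimpleInterval; the check shown is only an ordering/non-overlap invariant assumed for this sketch, not the tool's actual validation:

```java
import java.util.Arrays;
import java.util.List;

public class IntervalOrderingDemo {
    // Minimal stand-in for SimpleInterval, for illustration only.
    record Interval(String contig, int start, int end) {}

    // Intervals need not be contiguous: on the same contig they only
    // have to be sorted and non-overlapping.
    static boolean sortedAndNonOverlapping(List<Interval> ivs) {
        for (int i = 1; i < ivs.size(); i++) {
            Interval prev = ivs.get(i - 1), cur = ivs.get(i);
            if (prev.contig().equals(cur.contig()) && cur.start() <= prev.end()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A gap of ~10kb between the two chr20 intervals is fine.
        List<Interval> nonContiguous = Arrays.asList(
            new Interval("chr20", 17960187, 17970000),
            new Interval("chr20", 17980001, 17981445));
        System.out.println(sortedAndNonOverlapping(nonContiguous));
    }
}
```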
checkGenomicsDBAgainstExpected(workspace, interval, expectedCombinedVCF.getAbsolutePath(), referenceFile, true);
for (SimpleInterval currInterval : intervals) {
    List<SimpleInterval> tmpList = new ArrayList<SimpleInterval>(Arrays.asList(currInterval));
    File expectedCombinedVCF = runCombineGVCFs(vcfInputs, tmpList, referenceFile, CombineGVCFArgs);
Shouldn't you be running CombineGVCFs with multiple intervals as well, instead of one interval at a time, so that it's more of an apples-to-apples comparison?
Related to the earlier question about CombineGVCF
}

private static GenomicsDBFeatureReader<VariantContext, PositionalBufferedStream> getGenomicsDBFeatureReader(final String workspace, final String reference) throws IOException {
private static GenomicsDBFeatureReader<VariantContext, PositionalBufferedStream> getGenomicsDBFeatureReader(
        final String workspace, final String reference) throws IOException {
    return getGenomicsDBFeatureReader(workspace, reference, false);
}
I don't see a test case covering the issue reported in #4716 -- can you add one?
Can you also add test cases covering the new sites-only query support (#3688) and support for retrieving the GT field?
private static void checkGenomicsDBAgainstExpected(final String workspace, final SimpleInterval interval, final String expectedCombinedVCF, final String referenceFile, final boolean testAll) throws IOException {
private static void checkGenomicsDBAgainstExpected(final String workspace, final List<SimpleInterval> intervals,
                                                   final String expectedCombinedVCF, final String referenceFile,
                                                   final boolean testAll) throws IOException {
    final GenomicsDBFeatureReader<VariantContext, PositionalBufferedStream> genomicsDBFeatureReader =
        getGenomicsDBFeatureReader(workspace, referenceFile, !testAll);
Why are we disabling retrieval of the GT field when testAll is true? This seems backwards... I'm not convinced there's currently good test coverage for retrieval of the GT field.
@ldgauthier Would it be possible to run this against 1 shard of the 20k and compare to the previous output to make sure nothing has gone weird?
private Map<String, FeatureReader<VariantContext>> createSampleToReaderMap(
        final Map<String, Path> sampleNameToVcfPath, final int batchSize, final int index) {
    //TODO: fix casting since it's really ugly
    return inputPreloadExecutorService != null ?
In the past we found that the sorting here was really important, or it changed the results. If it's still sensitive to the ordering of the map, maybe the signature of the ImportConfig constructor that this is fed to should be modified to use SortedMaps.
Yes. I think we have kept the same sort order based on the following:
- The tool sorts by sample name first
- Using the samples in the same order as provided in the sampleToVcfName TreeMap
- Passing the same TreeMap as argument to getFeatureReaders()
I did a couple of tests with arbitrary ordering of samples
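A minimal sketch of why keying on a TreeMap (as the reply above describes) gives a deterministic sample order regardless of how the samples were supplied; the sample and file names here are made up:

```java
import java.util.Map;
import java.util.TreeMap;

public class SortedSampleMapDemo {
    public static void main(String[] args) {
        // TreeMap keeps entries sorted by key (the sample name here), so
        // iteration order is stable no matter the insertion order. This is
        // why passing the same TreeMap everywhere preserves one sort order.
        Map<String, String> sampleToVcf = new TreeMap<>();
        sampleToVcf.put("sampleC", "c.g.vcf.gz");
        sampleToVcf.put("sampleA", "a.g.vcf.gz");
        sampleToVcf.put("sampleB", "b.g.vcf.gz");
        System.out.println(String.join(",", sampleToVcf.keySet()));
    }
}
```

Declaring the parameter as SortedMap (rather than Map) would encode this ordering requirement in the ImportConfig constructor's signature instead of relying on callers to pass a TreeMap.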
The changes look good to me. It looks like there was a lot of positive progress on the Java API. I'd like us to do a larger test to be sure nothing changed in ways we weren't expecting, but it looks good.
@francares We encountered a problem when @ldgauthier tried running this branch on one of the shards of our previous 20k calling computation. The current version of GATK is able to complete the shard in a given amount of memory (7.5G I think, but @ldgauthier would know for sure), but this version ran out of memory. Has something changed that would make the tool require more memory at runtime now? Have you seen increased memory usage after your changes? I think I may know what's causing the issue. It seems like you're probably starting multiple simultaneous batch imports, which is causing us to go over our memory limits. See
In the case of 0 threads you use @francares Does that seem plausible to you? We should try rerunning our job with threads set to 1, which should restrict it to a single batch at a time.
Are you testing with a single chromosome interval or multiple intervals? If a single interval, there should be no difference, since a single thread will be used.
It looks like it will run with some number of threads up to
@kgururaj It's definitely possible that this is a red herring and there's something else that's using more memory. It looks like we're using the common threadpool, though, which could potentially be a number of threads.
Yep, each chromosome interval is handled by a single thread.
@lbergelson sorry for my late response. I'm currently on vacation but I will try to respond (with some delay) to any questions.
That is the first difference from the previous implementation. If whatever you have in that function consumes lots of memory, that's an issue. Those are the two things I'm seeing right now without having the chance to debug :(.
@francares Ah, sorry to bother you on your vacation. We can address this when you get back. The creation of the vcf readers is incredibly memory intensive because each one needs a large index. The number of open readers is the limiting factor that determines how large a batch size we can use. It doesn't make sense in this case to use multiple importers in parallel if we're also restricting batch size. The only reason to run with a batch size smaller than the number of samples is to get around memory restrictions, so running two simultaneous imports will just force a smaller batch size, which is obviously sub-optimal.
@francares The "batch size" argument should represent an absolute upper limit on the number of simultaneous readers open at once, as its only purpose is to control memory usage, as @lbergelson noted above.
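A back-of-the-envelope sketch of the batching arithmetic implied above: with batching working as intended, at most batchSize readers are open at once, and the total number of batches follows by ceiling division (the helper name is ours, not the tool's):

```java
public class BatchPlanner {
    // Hypothetical helper: total batches = ceil(nSamples / batchSize).
    // With a single importer running, open readers never exceed batchSize;
    // two simultaneous imports would double that, defeating the limit.
    static int totalBatches(int nSamples, int batchSize) {
        return (nSamples + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        // 80,000 samples at batch size 50, as in the log discussed below.
        System.out.println(totalBatches(80_000, 50));
    }
}
```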
@kgururaj I ran with batch size 50, using a sample map, 5 reader threads:
That sample map has 80K genomes because that's the project I'm working on now. Log was as follows:
@ldgauthier You're running 80k? Does that run using the current master version of GATK? I assumed you were rerunning a 20k shard with the same settings we had used for the 20k.
Added a golden output for multi-interval CI tests
…allel in GenomicsDB
* Modified intervals in CI test for the Combine GVCF compare test
* Sites-only
* GT field check
* Spanning deletion in the input gvcf
👍 Looks like all comments have been addressed in this latest version -- merging!
@droazen Yay!
This PR addresses required changes in order to use the latest version of GenomicsDB, which exposes new functionality such as: