GenomicsDBImport: Modifications in order to address GATK issue #3269 #4645
Conversation
@kgururaj Thanks, this branch passes the test case that was failing for me before.
Thanks for the quick turnaround @kgururaj !
There is currently one failing integration test:
        : getFeatureReadersSerially(sampleNameToVcfPath, batchSize, index);
}

private List<GenomicsDBImportConfiguration.Partition> generatePartitionListFromIntervals(List<ChromosomeInterval> chromosomeIntevals) {
typo: intevals
Fixed
@droazen - that failure is related to the question I sent Louis and James - I'll forward you the same email
First pass review complete -- back to @francares and @kgururaj for changes. @lbergelson will also chime in with a separate review shortly.
As noted above, one of the GenomicsDBImport integration tests is currently failing. Do you know why?
Also, could you please provide a complete list of the features/changes introduced in this PR relative to the version in master, as it's a bit hard to reconstruct from the GitHub history. Thanks!
if (arrayFolder.exists()) {
    exportConfigurationBuilder.setArrayName(GenomicsDBConstants.DEFAULT_ARRAY_NAME);
} else {
    exportConfigurationBuilder.setGenerateArrayNameFromPartitionBounds(true);
Can you add a comment explaining what this setGenerateArrayNameFromPartitionBounds(true) call does?
Done
logger.info("Done importing batch " + batchCount + "/" + totalBatchCount);
try {
    importer = new GenomicsDBImporter(importConfig);
    importer.executeImport();
Is this batching done internally in executeImport() now? If so, are there regular status messages emitted to the logger before/after each batch?
Yes
Correction - progress isn't printed now. Will try to figure this out
Ok, thanks -- regular logger updates before/after each batch are crucial for us, as our GenomicsDBImport jobs tend to be very long-running.
In the latest version I'm seeing a lot of
16:26:48.128 INFO GenomicsDBImport - Starting batch input file preload
16:27:11.629 INFO GenomicsDBImport - Finished batch preload
but I prefer the old "batch X out of Y" format. We have import jobs that run for 24 hours and it's nice to know how much progress has been made.
Done
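A minimal sketch of the "batch X out of Y" progress messages being requested here; the loop and totalBatchCount are illustrative, and System.out.println stands in for the tool's logger.info calls so the sketch runs without logging configuration:

```java
public class BatchProgressSketch {
    public static void main(String[] args) {
        // Illustrative values, not the tool's actual code.
        int totalBatchCount = 3;
        for (int batchCount = 1; batchCount <= totalBatchCount; batchCount++) {
            System.out.println("Importing batch " + batchCount + "/" + totalBatchCount);
            // ... the per-batch import (e.g. executeImport()) would run here ...
            System.out.println("Done importing batch " + batchCount + "/" + totalBatchCount);
        }
    }
}
```

The point of the "X/Y" form is that a user watching a 24-hour job can estimate remaining time, which a bare "started/finished preload" pair does not allow.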
I'm not getting any logging from querying the GDB either, e.g. running GenotypeGVCFs.
I see lines:
14:46:01.216 INFO ProgressMeter - 1:69417 0.3 1000 3505.9
14:46:05.059 WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
14:50:01.511 INFO ProgressMeter - 1:876592 4.3 7000 1631.6
14:50:13.315 INFO ProgressMeter - 1:892516 4.5 12000 2674.5
Is that adequate? The master branch also seems to print something similar.
How many samples was that? My problem is that I seem to be getting updates every X variants, but for 80,000 samples that takes a very long time.
//TODO: Fix this.
// BiConsumer<Map<String, FeatureReader<VariantContext>>, Integer> closeReaders = (readers, batchCount) -> {
//     progressMeter.update(intervals.get(0));
//     logger.info("Done importing batch " + batchCount + "/" + totalBatchCount);
Can you explain what this TODO is? Are the readers getting closed properly? (I notice that you deleted the closeReaders() method below...)
Spurious comment - yes, the readers are being closed inside GenomicsDBImporter
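As a rough sketch of what closing a batch of readers amounts to (the Map of Closeable readers and the closeAll helper are hypothetical stand-ins, not the actual GenomicsDBImporter API):

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class ReaderCleanupSketch {
    public static void main(String[] args) {
        // Two fake "readers" that count how many times close() is called.
        AtomicInteger closed = new AtomicInteger();
        Map<String, Closeable> readers = new LinkedHashMap<>();
        readers.put("sample1", closed::incrementAndGet);
        readers.put("sample2", closed::incrementAndGet);
        closeAll(readers);
        System.out.println(closed.get());
    }

    // Close every reader in the batch, wrapping the checked IOException.
    static void closeAll(Map<String, ? extends Closeable> readers) {
        readers.forEach((sample, reader) -> {
            try {
                reader.close();
            } catch (IOException e) {
                throw new RuntimeException("Failed to close reader for " + sample, e);
            }
        });
    }
}
```

Closing each reader promptly after its batch matters here because, as discussed later in this thread, open readers (and their indexes) dominate memory usage.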
logger.info("GenomicsDB consolidation started");
GenomicsDBImporter.consolidateTileDBArray(workspace, GenomicsDBConstants.DEFAULT_ARRAY_NAME);
logger.info("GenomicsDB consolidation completed");
}
I'm assuming that consolidation and writing the JSONs are handled internally now?
Yes - see this line
@@ -115,7 +120,7 @@
 * </ul>
You need to update the tool documentation to reflect the fact that multiple intervals are now supported. Also include any extra information the user might need to know (e.g., are the intervals required to be contiguous?).
Fixed doc - intervals need not be contiguous
Added a test for non-adjacent intervals.
private static final String COMBINED = largeFileTestDir + "gvcfs/combined.gatk3.7.g.vcf.gz";
private static final String COMBINED_MULTI_INTERVAL = largeFileTestDir + "gvcfs/combined_multi_interval.gatk3.7.g.vcf.gz";
Include a comment describing exactly how this file was created (including the command line used).
Done
private static final ArrayList<SimpleInterval> MULTIPLE_INTERVALS = new ArrayList<SimpleInterval>(Arrays.asList(
    new SimpleInterval("chr20", 17960187, 17970000),
    new SimpleInterval("chr20", 17970001, 17980000),
    new SimpleInterval("chr20", 17980001, 17981445)
When specifying multiple intervals, are they required to be contiguous? If not, could you please add a separate test case involving non-contiguous intervals as well?
Also add a third test case involving multiple intervals from different contigs, as that is an important use case for our users.
Intervals need not be contiguous
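A small illustration of the non-contiguous case, using a hypothetical Interval record as a stand-in for GATK's SimpleInterval; the check shown is only an ordering/non-overlap invariant assumed for this sketch, not the tool's actual validation:

```java
import java.util.Arrays;
import java.util.List;

public class IntervalOrderingDemo {
    // Minimal stand-in for SimpleInterval, for illustration only.
    record Interval(String contig, int start, int end) {}

    // Intervals need not be contiguous: on the same contig they only
    // have to be sorted and non-overlapping.
    static boolean sortedAndNonOverlapping(List<Interval> ivs) {
        for (int i = 1; i < ivs.size(); i++) {
            Interval prev = ivs.get(i - 1), cur = ivs.get(i);
            if (prev.contig().equals(cur.contig()) && cur.start() <= prev.end()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A gap of ~10kb between the two chr20 intervals is fine.
        List<Interval> nonContiguous = Arrays.asList(
            new Interval("chr20", 17960187, 17970000),
            new Interval("chr20", 17980001, 17981445));
        System.out.println(sortedAndNonOverlapping(nonContiguous));
    }
}
```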
checkGenomicsDBAgainstExpected(workspace, interval, expectedCombinedVCF.getAbsolutePath(), referenceFile, true);
for (SimpleInterval currInterval : intervals) {
    List<SimpleInterval> tmpList = new ArrayList<SimpleInterval>(Arrays.asList(currInterval));
    File expectedCombinedVCF = runCombineGVCFs(vcfInputs, tmpList, referenceFile, CombineGVCFArgs);
Shouldn't you be running CombineGVCFs with multiple intervals as well, instead of one interval at a time, so that it's more of an apples-to-apples comparison?
Related to the earlier question about CombineGVCF
}

private static GenomicsDBFeatureReader<VariantContext, PositionalBufferedStream> getGenomicsDBFeatureReader(final String workspace, final String reference) throws IOException {
private static GenomicsDBFeatureReader<VariantContext, PositionalBufferedStream> getGenomicsDBFeatureReader(
        final String workspace, final String reference) throws IOException {
    return getGenomicsDBFeatureReader(workspace, reference, false);
}
I don't see a test case covering the issue reported in #4716 -- can you add one?
Can you also add test cases covering the new sites-only query support (#3688) and support for retrieving the GT field?
private static void checkGenomicsDBAgainstExpected(final String workspace, final SimpleInterval interval, final String expectedCombinedVCF, final String referenceFile, final boolean testAll) throws IOException {
private static void checkGenomicsDBAgainstExpected(final String workspace, final List<SimpleInterval> intervals,
                                                   final String expectedCombinedVCF, final String referenceFile,
                                                   final boolean testAll) throws IOException {
    final GenomicsDBFeatureReader<VariantContext, PositionalBufferedStream> genomicsDBFeatureReader =
        getGenomicsDBFeatureReader(workspace, referenceFile, !testAll);
Why are we disabling retrieval of the GT field when testAll is true? This seems backwards... I'm not convinced there's currently good test coverage for retrieval of the GT field.
@ldgauthier Would it be possible to run this against 1 shard of the 20k and compare to the previous output to make sure nothing has gone weird?
private Map<String, FeatureReader<VariantContext>> createSampleToReaderMap(
        final Map<String, Path> sampleNameToVcfPath, final int batchSize, final int index) {
    //TODO: fix casting since it's really ugly
    return inputPreloadExecutorService != null ?
In the past we found that the sorting here was really important, or it changed the results. If it's still sensitive to the ordering of the map, maybe the signature of the ImportConfig constructor that this is fed to should be modified to use SortedMaps.
Yes. I think we have kept the same sort order based on the following:
- The tool sorts by sample name first
- Using the samples in the same order as provided in the sampleToVcfName TreeMap
- Passing the same TreeMap as argument to getFeatureReaders()
I did a couple of tests with arbitrary ordering of samples
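A minimal sketch of why keying on a TreeMap (as the reply above describes) gives a deterministic sample order regardless of how the samples were supplied; the sample and file names here are made up:

```java
import java.util.Map;
import java.util.TreeMap;

public class SortedSampleMapDemo {
    public static void main(String[] args) {
        // TreeMap keeps entries sorted by key (the sample name here), so
        // iteration order is stable no matter the insertion order. This is
        // why passing the same TreeMap everywhere preserves one sort order.
        Map<String, String> sampleToVcf = new TreeMap<>();
        sampleToVcf.put("sampleC", "c.g.vcf.gz");
        sampleToVcf.put("sampleA", "a.g.vcf.gz");
        sampleToVcf.put("sampleB", "b.g.vcf.gz");
        System.out.println(String.join(",", sampleToVcf.keySet()));
    }
}
```

Declaring the parameter as SortedMap (rather than Map) would encode this ordering requirement in the ImportConfig constructor's signature instead of relying on callers to pass a TreeMap.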
The changes look good to me. It looks like there was a lot of positive progress on the Java API. I'd like us to do a larger test to be sure nothing changed in ways we weren't expecting, but it looks good.
@francares We encountered a problem when @ldgauthier tried running this branch on one of the shards of our previous 20k calling computation. The current version of GATK is able to complete the shard in a given amount of memory (7.5G I think, but @ldgauthier would know for sure), but this version ran out of memory. Has something changed that would make the tool require more memory at runtime now? Have you seen increased memory usage after your changes? I think I may know what's causing the issue. It seems like you're probably starting multiple simultaneous batch imports, which is causing us to go over our memory limits. See
In the case of 0 threads you use @francares Does that seem plausible to you? We should try rerunning our job with threads set to 1, which should restrict it to a single batch at a time.
Are you testing with a single chromosome interval or multiple intervals? If a single interval, there should be no difference, since a single thread will be used.
It looks like it will run with some number of threads up to
@kgururaj It's definitely possible that this is a red herring and there's something else that's using more memory. It looks like we're using the common threadpool, though, which could potentially be a number of threads.
Yep, each chromosome interval is handled by a single thread.
@lbergelson sorry for my late response. I'm currently on vacation but I will try to respond (with some delay) to any questions.
That is the first difference from the previous implementation. If whatever you have in that function consumes lots of memory, that's an issue. Those are the two things I'm seeing right now without having the chance to debug :(.
@francares Ah, sorry to bother you on your vacation. We can address this when you get back. The creation of the vcf readers is incredibly memory intensive because each one needs a large index. The number of open readers is the limiting factor that determines how large a batch size we can use. It doesn't make sense in this case to use multiple importers in parallel if we're also restricting batch size. The only reason to run with a batch size smaller than the number of samples is to get around memory restrictions, so running two simultaneous imports will just force a smaller batch size, which is obviously sub-optimal.
@francares The "batch size" argument should represent an absolute upper limit on the number of simultaneous readers open at once, as its only purpose is to control memory usage, as @lbergelson noted above.
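A back-of-the-envelope sketch of the batching arithmetic implied above: with batching working as intended, at most batchSize readers are open at once, and the total number of batches follows by ceiling division (the helper name is ours, not the tool's):

```java
public class BatchPlanner {
    // Hypothetical helper: total batches = ceil(nSamples / batchSize).
    // With a single importer running, open readers never exceed batchSize;
    // two simultaneous imports would double that, defeating the limit.
    static int totalBatches(int nSamples, int batchSize) {
        return (nSamples + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        // 80,000 samples at batch size 50, as in the log discussed below.
        System.out.println(totalBatches(80_000, 50));
    }
}
```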
@kgururaj I ran with batch size 50, using a sample map, 5 reader threads:
That sample map has 80K genomes because that's the project I'm working on now. Log was as follows:
@ldgauthier You're running 80k? Does that run using the current master version of GATK? I assumed you were rerunning a 20k shard with the same settings we had used for the 20k.
Added a golden output for multi-interval CI tests
…allel in GenomicsDB
* Modified intervals in CI test for the Combine GVCF compare test
* Sites-only
* GT field check
* Spanning deletion in the input gvcf
👍 Looks like all comments have been addressed in this latest version -- merging!
@droazen Yay!
This PR addresses required changes in order to use the latest version of GenomicsDB, which exposes new functionality such as: