Fix CRAMReferenceRegion updating. #1708

Merged

droazen merged 9 commits into master from cn_fix_reference_region

Jun 4, 2024

Collaborator

cmnbroad commented Apr 9, 2024

Fix for broadinstitute/gatk#8768.


          Fix CRAM reference region transitions.

2d5b20f

cmnbroad force-pushed the cn_fix_reference_region branch from d769bce to 2d5b20f Compare

April 10, 2024 13:30

cmnbroad changed the title ~~Test potential fix CRAM for reference region updating.~~ Fix CRAM for reference region updating.

cmnbroad marked this pull request as ready for review

April 10, 2024 14:03

cmnbroad changed the title ~~Fix CRAM for reference region updating.~~ Fix CRAMReferenceRegion updating.


          Add roundtrip tests that fail without this change.

969bdf1

cmnbroad force-pushed the cn_fix_reference_region branch from bf67f68 to 969bdf1 Compare

May 7, 2024 16:00

cmnbroad added 4 commits

May 13, 2024 13:28


          Add roundtrip tests with 2 and 3 containers aligned to position 1.

4f465ab


          Add more roundtrip tests that validate bases.

f276317


          Add a large CRAM roundtrip test and samtools roundtrip test.

64180f8


          Always remember your index files.

19be0c7

droazen requested changes

Contributor

droazen left a comment

@cmnbroad Back to you with my review comments

src/main/java/htsjdk/samtools/cram/build/CRAMReferenceRegion.java Outdated

+                      if (newSequenceRecord == null) {
+                          throw new IllegalArgumentException(
+                                  String.format("Requested reference sequence index %d not found", referenceIndex));
+                      }

Contributor

droazen May 31, 2024

Can you call the existing getSAMSequenceRecord() method here? It does exactly the same thing.

Collaborator Author

cmnbroad Jun 3, 2024

Oh yeah, done.

src/main/java/htsjdk/samtools/cram/build/CRAMReferenceRegion.java

                       if ((referenceIndex != this.referenceIndex) ||
                               regionStart != 0 ||
-                              (regionLength < referenceBases.length)) {
+                              (regionLength != newSequenceRecord.getSequenceLength())) {
                           setCurrentSequence(referenceIndex);

Contributor

droazen May 31, 2024

Add a comment that the setCurrentSequence() call needs to happen first in this if block (in particular, before the assignment to regionLength below). Or alternatively, you could have regionLength = newSequenceRecord.getSequenceLength(); below to eliminate the dependency on order of operations.

Collaborator Author

cmnbroad Jun 3, 2024

Latter is done.
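
To illustrate why the condition in this hunk matters, here is a minimal sketch of the state transition, under stated assumptions. `RegionModel`, `fetchRegion`, and `fetchFullSequence` are hypothetical names for a simplified model; this is not the htsjdk `CRAMReferenceRegion` implementation, and only the field names and the two conditions are taken from the diff above.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for the caching logic discussed above.
// It is NOT the htsjdk CRAMReferenceRegion class; field names only mirror the diff.
class RegionModel {
    private final byte[][] contigs;         // full bases for each reference index
    private final boolean useNewCondition;  // false = removed condition, true = added condition
    private int referenceIndex = -1;
    private int regionStart = 0;
    private int regionLength = 0;
    private byte[] referenceBases = new byte[0];

    RegionModel(final byte[][] contigs, final boolean useNewCondition) {
        this.contigs = contigs;
        this.useNewCondition = useNewCondition;
    }

    // Cache a fragment of a contig (assumed behavior for this sketch).
    void fetchRegion(final int index, final int start, final int length) {
        referenceIndex = index;
        referenceBases = Arrays.copyOfRange(contigs[index], start, start + length);
        regionStart = start;
        regionLength = length;
    }

    // Cache the whole contig, reloading only if the cached state looks stale.
    void fetchFullSequence(final int index) {
        final int fullLength = contigs[index].length;
        final boolean lengthMismatch = useNewCondition
                ? regionLength != fullLength
                : regionLength < referenceBases.length;
        if (index != referenceIndex || regionStart != 0 || lengthMismatch) {
            referenceIndex = index;
            referenceBases = contigs[index].clone();
            regionStart = 0;
            regionLength = fullLength;
        }
    }

    int getRegionLength() {
        return regionLength;
    }
}
```

In this model, with a 10-base contig, `fetchRegion(0, 0, 4)` followed by `fetchFullSequence(0)` leaves the region length at 4 under the removed condition, because the cached fragment starts at 0 and its length equals `referenceBases.length`. Under the added condition the length differs from the full sequence length, so the full contig is reloaded.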

src/test/java/htsjdk/samtools/CRAMAllEncodingStrategiesTest.java

+                              { new File(TEST_DATA_DIR, "CEUTrio.HiSeq.WGS.b37.NA12878.20.21.10m-10m100.cram"),
+                                      new File("src/test/resources/htsjdk/samtools/reference/human_g1k_v37.20.21.fasta.gz"),
+                                      false, false },
+                              { new File(TEST_DATA_DIR, "NA12878.unmapped.cram"),

Contributor

droazen May 31, 2024

Can you add one-line comments explaining the provenance of these three crams, and what they are testing?

Collaborator Author

cmnbroad Jun 3, 2024

Done with as much specificity as I have about these, since they've been in the repo for a while.

src/test/java/htsjdk/samtools/CRAMAllEncodingStrategiesTest.java

+                              // these tests use lenient equality to only validate read names, bases and qual scores
+                              { new File(TEST_DATA_DIR, "mitoAlignmentStartTestGATKGen.cram"),
+                                      new File(TEST_DATA_DIR, "mitoAlignmentStartTest.fa"), true, false },
+                              { new File(TEST_DATA_DIR, "mitoAlignmentStartTest.cram"),

Contributor

droazen May 31, 2024

Add comments explaining what these files are as well

Collaborator Author

cmnbroad Jun 3, 2024

Done with as much specificity as I have about these, since they've been in the repo for a while.

src/test/java/htsjdk/samtools/CRAMAllEncodingStrategiesTest.java

+                                      new File(TEST_DATA_DIR, "mitoAlignmentStartTest.fa"), true, false },
+                              // files created by rewriting the htsjdk test file src/test/resources/htsjdk/samtools/cram/mitoAlignmentStartTest.cram
+                              // using code that replicates the first read (which is aligned to position 1 of the mito contig) either
+                              // 10,000 or 20,000 times, to create a file with 2 or 3 containers, respectively, that have reads aligned to

Contributor

droazen May 31, 2024

Did you confirm via direct inspection of the file that it did in fact create 2 or 3 containers with reads mapped to position 1?

Collaborator Author

cmnbroad Jun 3, 2024

Yes. You can also see this in the GATK PR, which uses files that were created using the same code that created these files, to test the detector on files with multiple bad containers. Alternatively, you can run GATK PrintFileDiagnostics on these files and see the container alignments.

src/test/java/htsjdk/samtools/CRAMAllEncodingStrategiesTest.java

    
-                         assertRoundTripFidelity(cramSourceFile, tempOutCRAM, referenceFile, false);
-                         assertRoundtripFidelityWithSamtools(tempOutCRAM, referenceFile);
+                         assertRoundTripFidelity(cramSourceFile, tempOutCRAM, referenceFile, lenientEquality, emitDetail);
+                         assertRoundtripFidelityWithSamtools(tempOutCRAM, referenceFile, lenientEquality, emitDetail);

Contributor

droazen May 31, 2024

Add a comment explaining why we also do a test with samtools

Collaborator Author

cmnbroad Jun 3, 2024

It's basically testing interoperability with samtools for all of the different encodings, to make sure both tool sets agree on the data. Added a comment saying it's for interop testing.

src/test/java/htsjdk/samtools/CRAMAllEncodingStrategiesTest.java

-                  @Test(dataProvider = "roundTripTestFiles")
-                  public final void testRoundTripDefaultEncodingStrategy(final File sourceFile, final File referenceFile) throws IOException {
+                  @Test(dataProvider = "defaultStrategyRoundTripTestFiles")
+                  public final void testRoundTripDefaultEncodingStrategy(

Contributor

droazen May 31, 2024

Add a comment before this test method describing the purpose of the test (since we're not just testing "encoding strategies" here....)

Collaborator Author

cmnbroad Jun 3, 2024

This is testing the default encoding strategy. Comment added.

src/test/java/htsjdk/samtools/CRAMAllEncodingStrategiesTest.java

                   }
-                  @Test(dataProvider = "roundTripTestFiles")
-                  public final void testAllEncodingStrategyCombinations(final File cramSourceFile, final File referenceFile) throws IOException {
+                  @Test(dataProvider = "encodingStrategiesTestFiles")

Contributor

droazen May 31, 2024

Would there be value in having this testAllEncodingStrategyCombinations test use the defaultStrategyRoundTripTestFiles DataProvider as well, with its many additional test cases?

Collaborator Author

cmnbroad Jun 3, 2024

Perhaps, but there are currently 81 encoding strategies, so using all of those files would make this test take a really long time and, I would guess, dominate the CI test time, especially now that we have the large CRAM test file in that list. So instead I separated them this way.

src/test/java/htsjdk/samtools/cram/ref/CRAMReferenceRegionTest.java

@@ -167,6 +168,39 @@ public void testGetReferenceBasesByRegionPositive(
                       Assert.assertEquals(bases, Arrays.copyOfRange(fullContigBases, requestedOffset, requestedOffset + requestedLength));
                   }
+                  // simulate the state transitions that occur when writing a CRAM file
+                  @Test
+                  public void testSerialStateTransitions() {

Contributor

droazen May 31, 2024

Add a comment here explaining and referencing the bug that prompted us to write this test.

Collaborator Author

cmnbroad Jun 3, 2024

Done.

src/test/java/htsjdk/samtools/cram/ref/CRAMReferenceRegionTest.java

+                      // now transition back to the full sequence
+                      cramReferenceRegion.fetchReferenceBases(CRAMStructureTestHelper.REFERENCE_SEQUENCE_ZERO);
+                      Assert.assertEquals(cramReferenceRegion.getRegionLength(), fullRegionFragmentLength);
+                  }

Contributor

droazen May 31, 2024

Is CRAMReferenceRegion ever used across multiple threads? Given how the fetch methods update the internal state of the object on every call, it's clearly not safe to use in a parallelized manner. Do we need to make these methods synchronized?

Collaborator Author

cmnbroad Jun 3, 2024

It is currently not used anywhere across threads. As you noted, it's not thread-safe. I guess I could add synchronized, but that still wouldn't make it thread-safe, since the class is stateful, and the usage pattern is fetch... followed by one or more getCurrent... methods. So synchronized won't fix the statefulness.

Collaborator Author

cmnbroad Jun 3, 2024

Added a comment saying it's not thread-safe.
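
The point about `synchronized` can be sketched as follows. All names here (`StatefulRegionModel`, `RegionClient`, `fetchAndRead`) are hypothetical and are not part of htsjdk. The sketch only shows that per-method locking does not make a fetch-then-read pair atomic, while a caller-held lock does. It is not a proposal for the PR, which only added a comment.

```java
// Hypothetical model of a stateful "fetch, then read current state" API.
class StatefulRegionModel {
    private byte[] current = new byte[0];

    // Each method is individually synchronized...
    synchronized void fetch(final int length) {
        current = new byte[length];
    }

    synchronized int getCurrentLength() {
        return current.length;
    }
}

class RegionClient {
    // ...but only a lock held across BOTH calls makes the pair behave as one unit.
    static int fetchAndRead(final StatefulRegionModel region, final int length) {
        synchronized (region) {  // reentrant with the method-level locks on the same object
            region.fetch(length);
            return region.getCurrentLength();
        }
    }
}
```

If caller A runs `fetch(3)`, caller B then runs `fetch(7)`, and A then calls `getCurrentLength()`, A sees 7 even though every method is synchronized. Wrapping the pair, as `fetchAndRead` does, avoids that interleaving, but only if every caller follows the same convention.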

droazen assigned cmnbroad


          Code review comments.

669fd7e

droazen requested changes

Contributor

droazen left a comment

@cmnbroad Back to you with one final comment

src/main/java/htsjdk/samtools/cram/build/CRAMReferenceRegion.java Outdated

                       if ((referenceIndex != this.referenceIndex) ||
                               regionStart != 0 ||
-                              (regionLength < referenceBases.length)) {
+                              (regionLength != newSequenceRecord.getSequenceLength())) {
                           setCurrentSequence(referenceIndex);
                           referenceBases = referenceSource.getReferenceBases(sequenceRecord, true);

Contributor

droazen Jun 4, 2024

There is still a potential order-of-operations issue here: since the call to getReferenceBases() uses the instance variable sequenceRecord, the call to setCurrentSequence() must happen first. If the order of the statements were reversed in some future refactoring, we might have another terrible bug. Either add a comment explaining that setCurrentSequence() must happen first, or eliminate the order dependency by using newSequenceRecord in the getReferenceBases() call.
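
The second option in this comment can be sketched as below, under stated assumptions. `SequenceRecordModel`, `ReferenceSourceModel`, and `OrderIndependentRegion` are hypothetical simplifications, not the htsjdk types, and the single-argument `getReferenceBases` here is an invented stand-in for the call in the diff. The idea is that every statement reads the local record, so the order of the statements no longer matters.

```java
// Hypothetical stand-ins for the sequence record and reference source; not htsjdk types.
class SequenceRecordModel {
    final int index;
    final int length;

    SequenceRecordModel(final int index, final int length) {
        this.index = index;
        this.length = length;
    }
}

class ReferenceSourceModel {
    // Returns placeholder bases of the right length for the given record.
    byte[] getReferenceBases(final SequenceRecordModel record) {
        return new byte[record.length];
    }
}

class OrderIndependentRegion {
    private final SequenceRecordModel[] dictionary;
    private final ReferenceSourceModel source = new ReferenceSourceModel();
    private SequenceRecordModel sequenceRecord;  // instance state
    private byte[] referenceBases = new byte[0];
    private int regionLength = 0;

    OrderIndependentRegion(final int[] contigLengths) {
        dictionary = new SequenceRecordModel[contigLengths.length];
        for (int i = 0; i < contigLengths.length; i++) {
            dictionary[i] = new SequenceRecordModel(i, contigLengths[i]);
        }
    }

    void fetchFullSequence(final int index) {
        final SequenceRecordModel newRecord = dictionary[index];
        // Each statement below reads only the local newRecord, never the
        // sequenceRecord field, so the three may be reordered safely.
        referenceBases = source.getReferenceBases(newRecord);
        regionLength = newRecord.length;
        sequenceRecord = newRecord;  // deliberately last, to show order does not matter
    }

    int getRegionLength() {
        return regionLength;
    }

    int getBasesLength() {
        return referenceBases.length;
    }
}
```

Had `getReferenceBases` been passed the `sequenceRecord` field instead, assigning that field last would fetch bases for the previously selected contig, which is the kind of refactoring hazard the comment describes.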

cmnbroad added 2 commits

June 4, 2024 14:12


          One more change based on review.

ca12df0


          One more update.

7c473de

droazen approved these changes

droazen merged commit 127f3de into master

4 checks passed

droazen deleted the cn_fix_reference_region branch

June 4, 2024 18:40
