-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Require sequence dictionary MD5s when writing CRAM. #1612
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad I wasn't sure if you re-marked it as draft by accident or on purpose. It looks like it needs some touching up though. There's a bunch of stuff that I think should be changed though. Sorry if you're already aware / working on that.
// CRAMs are required to have a SAMHeader with a sequence dictionary that contains MD5 | ||
// checksums for each sequence. Check the proposed dictionary to see if it has MD5s, and | ||
// if not, attempt to update it with MD5s from the reference dictionary, if one exists. | ||
repairedHeader = repairSequenceMD5s(header, new HtsPath(referenceFasta.toUri().toString())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An aside, should we add direct HtsPath(Path) HtsPath(File) constructors?
repairedHeader.setSequenceDictionary(newDictionary); | ||
return repairedHeader; | ||
} catch (final IOException | RuntimeException e) { | ||
throw new RuntimeException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a CRAM or SAMException instead?
"%s Attempting to use the reference dictionary to repair missing MD5s required for CRAM output.", | ||
missingMD5Message)); | ||
final IOPath referenceDictionary = SAMSequenceDictionary.getFastaDictionaryFileName(referenceFasta); | ||
try (final InputStream fastaDictionaryStream = referenceDictionary.getInputStream()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems really awkward:
- We already have SAMSequenceDictionaryExtractor and this reimplements what it does although it might be more efficient.
- Immediately after this method returns we open a ReferenceSource which owns a ReferenceSequenceFile which can be queried directly for a sequence dictionary.
- This is IOPath aware, which is good, but it will fail in any complex cases in the future like using RefGt where there's no simple way to go from a path name to a dictionary file.
- I think the reference fasta can be null in the case the program wants to use the default one.
I think we should eliminate these lines and the new methods on SAMSequenceDictionary and instead add an accessor for the sequence dictionary to ReferenceSource. Then do the repair call after opening the reference source and get the sequence dictionary directly from the ReferenceSource.
/** | ||
* Given a fasta filename, return the name of the corresponding dictionary file. | ||
*/ | ||
public static IOPath getFastaDictionaryFileName(IOPath fastaFile) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should be adding more filename guessing and munging here. I think we can rely on ReferenceSequenceFile to find and provide the SequenceDictionary.
* @param fastaDictionaryStream InputStream connected to a fasta dictionary | ||
* @return the SAMSequenceDictionary from the fastaDictionaryStream | ||
*/ | ||
public static SAMSequenceDictionary loadSAMSequenceDictionary( final InputStream fastaDictionaryStream ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a good thing to have here. Can you update SAMSequenceDictionaryExtractor to call this method?
@@ -148,6 +148,23 @@ public SAMSequenceRecord setMd5(final String value) { | |||
return this; | |||
} | |||
|
|||
/** | |||
* Set the M5 attribute for this sequence using the provided sequenceBases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: M5 -> MD5
public SAMSequenceRecord setComputedMd5(final byte[] sequenceBases) { | ||
try { | ||
final MessageDigest md5Digester = MessageDigest.getInstance("MD5"); | ||
if (md5Digester != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method seems broken. It doesn't actually digest the bases. Also, the bases should be normalized.
Why not use SequenceUtil.calculateMD5String()?
/** | ||
* @return all sequences in this dictionary that do not have an MD5 attribute | ||
*/ | ||
List<SAMSequenceRecord> getSequencesWithMissingMD5s() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem like very specific methods. It's a bit funny to have them here as package private. I wonder if it makes more sense to move them to the callsite.
Fixes #506 (and by extension will also fix broadinstitute/gatk#5195). If the SAM header used to create a CRAM doesn't have a sequence dictionary with MD5s, the code will attempt to use consult the reference's dictionary to inject them if possible. Otherwise, the write operation will fail with an error message.