-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CG tag has the incorrect type. #1560
Comments
The SAM specification explicitly states the type for this tag is CG:B:I. However Htsjdk (sometimes or always?) writes these tags as CG:B:i. Htslib then silently ignores the tag, assuming it's some unofficial (and incorrect) abuse of an uppercase tag for some local purpose. Given that there is data published in the wild using the incorrect data type, it would be less problematic for us to simple handle the incorrect value sign than the minimal risk of misinterpretting someone's private tag data as CIGAR. (Plus they'd bring such woe onto themselves by using the official name-space.) Fixes samtools#1477 See also samtools/htsjdk#1560
The SAM specification explicitly states the type for this tag is CG:B:I. However Htsjdk (sometimes or always?) writes these tags as CG:B:i. Htslib then silently ignores the tag, assuming it's some unofficial (and incorrect) abuse of an uppercase tag for some local purpose. Given that there is data published in the wild using the incorrect data type, it would be less problematic for us to simple handle the incorrect value sign than the minimal risk of misinterpretting someone's private tag data as CIGAR. (Plus they'd bring such woe onto themselves by using the official name-space.) Fixes samtools/samtools#1477 See also samtools/htsjdk#1560
@jkbonfield Thanks for reporting. Looks like an oversight where we forgot to set the "isUnsigned" bit when setting the tag value. Adding provenance to format conversions is a good idea, but I'll have to think about where that could live. Currently the writer in htsjdk doesn't know the type that the reader read since it just gets the same SAMRecord object no matter what the input was. Is that @cmnbroad still the case in the new code? I assume it is for BAM/CRAM since we don't have different header versions that would be propagated. Does htslib automatically add PG lines when performing conversions or is it something special in samtools view? |
Description of the issue:
Using picard
SamFormatConverter
to turn a long-read CRAM to BAM with more than 65536 cigar ops generates aCG:B:i,
tag. The SAM specification dictates this must beCG:B:I,
. This may be overly picky, but samtools validates against and thus rejects this data. Obviously we can make Samtools more lenient, but it would also be good to correct the issue at source as it is generating malformed data that will cause issues for older copies of samtools.Your environment:
Whatever version of htsjdk is bundled in Picard 2.25.7. (I'm assuming this issue is part of htsjdk and not picard.)
openjdk 11.0.11
Ubuntu 18.04.5 LTS (bionic)
Steps to reproduce
See the file reported in samtools/samtools#1477
If you convert this to SAM or CRAM (not using samtools so the CIGAR is correct) and then convert to BAM using picard then we see the uncompressed BAM binary contains
CGBi
.PS. It would also be good to get
PG
headers added. I only tried Picard as a punt, as there's nothing in the headers to indicate that is infact what did the format conversion. Please provide data provenance for format conversions.The text was updated successfully, but these errors were encountered: