Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CG tag has the incorrect type. #1560

Open
jkbonfield opened this issue Aug 9, 2021 · 1 comment
Open

CG tag has the incorrect type. #1560

jkbonfield opened this issue Aug 9, 2021 · 1 comment

Comments

@jkbonfield
Copy link

jkbonfield commented Aug 9, 2021

Description of the issue:

Using picard SamFormatConverter to turn a long-read CRAM to BAM with more than 65536 cigar ops generates a CG:B:i, tag. The SAM specification dictates this must be CG:B:I,. This may be overly picky, but samtools validates against and thus rejects this data. Obviously we can make Samtools more lenient, but it would also be good to correct the issue at source as it is generating malformed data that will cause issues for older copies of samtools.

Your environment:

  • version of htsjdk
  • version of java
  • which OS

Whatever version of htsjdk is bundled in Picard 2.25.7. (I'm assuming this issue is part of htsjdk and not picard.)
openjdk 11.0.11
Ubuntu 18.04.5 LTS (bionic)

Steps to reproduce

See the file reported in samtools/samtools#1477

If you convert this to SAM or CRAM (not using samtools so the CIGAR is correct) and then convert to BAM using picard then we see the uncompressed BAM binary contains CGBi.

PS. It would also be good to get PG headers added. I only tried Picard as a punt, as there's nothing in the headers to indicate that is infact what did the format conversion. Please provide data provenance for format conversions.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Aug 9, 2021
The SAM specification explicitly states the type for this tag is
CG:B:I.  However Htsjdk (sometimes or always?) writes these tags as
CG:B:i.  Htslib then silently ignores the tag, assuming it's some
unofficial (and incorrect) abuse of an uppercase tag for some local
purpose.

Given that there is data published in the wild using the incorrect
data type, it would be less problematic for us to simple handle the
incorrect value sign than the minimal risk of misinterpretting
someone's private tag data as CIGAR. (Plus they'd bring such woe onto
themselves by using the official name-space.)

Fixes samtools#1477

See also samtools/htsjdk#1560
jkbonfield added a commit to jkbonfield/htslib that referenced this issue Aug 9, 2021
The SAM specification explicitly states the type for this tag is
CG:B:I.  However Htsjdk (sometimes or always?) writes these tags as
CG:B:i.  Htslib then silently ignores the tag, assuming it's some
unofficial (and incorrect) abuse of an uppercase tag for some local
purpose.

Given that there is data published in the wild using the incorrect
data type, it would be less problematic for us to simple handle the
incorrect value sign than the minimal risk of misinterpretting
someone's private tag data as CIGAR. (Plus they'd bring such woe onto
themselves by using the official name-space.)

Fixes samtools/samtools#1477

See also samtools/htsjdk#1560
@lbergelson
Copy link
Member

@jkbonfield Thanks for reporting. Looks like an oversight where we forgot to set the "isUnsigned" bit when setting the tag value.

Adding provenance to format conversions is a good idea, but I'll have to think about where that could live. Currently the writer in htsjdk doesn't know the type that the reader read since it just gets the same SAMRecord object no matter what the input was. Is that @cmnbroad still the case in the new code? I assume it is for BAM/CRAM since we don't have different header versions that would be propagated.

Does htslib automatically add PG lines when performing conversions or is it something special in samtools view?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants