Convert U to T instead of U to N when sam_parsing.
As this changes seq_nt16_table it changes it for all other uses of
this lookup table too, which will be widespread.  However this feels
like a reasonable thing to do given it only has an impact on data
which is currently out of bounds of what is expected.

Fixes samtools/samtools#2131
jkbonfield authored and whitwham committed Oct 25, 2024
1 parent 3c90481 commit 7f3b758
Showing 4 changed files with 12 additions and 2 deletions.
7 changes: 5 additions & 2 deletions hts.c
@@ -232,16 +232,19 @@ const char *hts_feature_string(void) {
 }
 
 
+// Converts ASCII to BAM nibble encoding.
+// Note 0123 is treated as ACGT (ABI colourspace encoding) and
+// U is treated as T.
 HTSLIB_EXPORT
 const unsigned char seq_nt16_table[256] = {
 15,15,15,15, 15,15,15,15, 15,15,15,15, 15,15,15,15,
 15,15,15,15, 15,15,15,15, 15,15,15,15, 15,15,15,15,
 15,15,15,15, 15,15,15,15, 15,15,15,15, 15,15,15,15,
 1, 2, 4, 8, 15,15,15,15, 15,15,15,15, 15, 0 /*=*/,15,15,
 15, 1,14, 2, 13,15,15, 4, 11,15,15,12, 15, 3,15,15,
-15,15, 5, 6, 8,15, 7, 9, 15,10,15,15, 15,15,15,15,
+15,15, 5, 6, 8, 8, 7, 9, 15,10,15,15, 15,15,15,15,
 15, 1,14, 2, 13,15,15, 4, 11,15,15,12, 15, 3,15,15,
-15,15, 5, 6, 8,15, 7, 9, 15,10,15,15, 15,15,15,15,
+15,15, 5, 6, 8, 8, 7, 9, 15,10,15,15, 15,15,15,15,
 
 15,15,15,15, 15,15,15,15, 15,15,15,15, 15,15,15,15,
 15,15,15,15, 15,15,15,15, 15,15,15,15, 15,15,15,15,
1 change: 1 addition & 0 deletions htslib/hts.h
@@ -455,6 +455,7 @@ int hts_parse_opt_list(htsFormat *opt, const char *str);
 The input character may be either an IUPAC ambiguity code, '=' for 0, or
 '0'/'1'/'2'/'3' for a result of 1/2/4/8. The result is encoded as 1/2/4/8
 for A/C/G/T or combinations of these bits for ambiguous bases.
+Additionally RNA U is treated as a T (8).
 */
 HTSLIB_EXPORT
 extern const unsigned char seq_nt16_table[256];
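
The rules in that doc comment can be sketched without the 256-entry table. This is only an illustration under stated assumptions, not htslib's implementation: `nt16_encode` is a hypothetical helper that takes the nibble value from the position of the base in the string "=ACMGRSVTWYHKDBN" (the ordering implied by the table values), maps U to T (8), and maps anything unrecognised to N (15).

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper, not part of htslib: returns the 4-bit BAM code
 * for one ASCII base, following the documented seq_nt16_table rules. */
static unsigned char nt16_encode(unsigned char c)
{
    /* The base at position i in this string has nibble value i. */
    static const char alphabet[] = "=ACMGRSVTWYHKDBN";

    /* '0'..'3' are colourspace digits treated as A, C, G, T (1, 2, 4, 8). */
    if (c >= '0' && c <= '3')
        return (unsigned char)(1u << (c - '0'));

    int up = toupper(c);
    if (up == 'U')              /* RNA uracil is stored as T */
        return 8;

    /* Guard against NUL: strchr would otherwise match the terminator. */
    const char *p = (up != '\0') ? strchr(alphabet, up) : NULL;
    return p ? (unsigned char)(p - alphabet) : 15;  /* unknown -> N */
}
```

With this sketch, `nt16_encode('U')` and `nt16_encode('T')` both give 8, whereas before this commit the table returned 15 (N) for U.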
3 changes: 3 additions & 0 deletions test/compare_sam.pl
@@ -163,6 +163,9 @@
 $ln1[9] = uc($ln1[9]);
 $ln2[9] = uc($ln2[9]);
 
+# RNA U to T is an expected change
+$ln1[9] =~ s/U/T/g;
+
 # Cram will populate a sequence string that starts as "*"
 $ln2[9] = "*" if ($ln1[9] eq "*");
 
3 changes: 3 additions & 0 deletions test/xx#u.sam
@@ -0,0 +1,3 @@
+@SQ SN:xx LN:20
+a1 99 xx 1 1 16M = 11 20 =ACMGRSVTWYHKDBN ****************
+b1 99 xx 1 1 16M = 11 20 =ACMGRSVUWYHKDBN ****************
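
The two test records differ only in the ninth base (T in a1, U in b1), so after encoding they should hold identical sequence data. The sketch below illustrates that round trip. It is a simplified stand-in, not htslib code: `encode_base`, `pack_seq` and `unpack_seq` are hypothetical names, and the packing (two bases per byte, first base in the high nibble) follows the BAM sequence layout.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* The base at position i in this string has nibble value i. */
static const char nt16_alphabet[] = "=ACMGRSVTWYHKDBN";

/* Hypothetical helper: ASCII base -> 4-bit code, with U stored as T. */
static unsigned char encode_base(unsigned char c)
{
    int up = toupper(c);
    if (up == 'U')
        return 8;
    const char *p = (up != '\0') ? strchr(nt16_alphabet, up) : NULL;
    return p ? (unsigned char)(p - nt16_alphabet) : 15;  /* unknown -> N */
}

/* Pack len bases into (len + 1) / 2 bytes, first base in the high nibble. */
static void pack_seq(const char *seq, size_t len, unsigned char *out)
{
    memset(out, 0, (len + 1) / 2);
    for (size_t i = 0; i < len; i++) {
        unsigned char code = encode_base((unsigned char)seq[i]);
        out[i / 2] |= (i % 2 == 0) ? (unsigned char)(code << 4) : code;
    }
}

/* Unpack len bases into out, which must hold len + 1 chars. */
static void unpack_seq(const unsigned char *packed, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char byte = packed[i / 2];
        unsigned char code = (i % 2 == 0) ? (byte >> 4) : (byte & 0x0F);
        out[i] = nt16_alphabet[code];
    }
    out[len] = '\0';
}
```

Packing b1's sequence `=ACMGRSVUWYHKDBN` and unpacking it again yields a1's sequence `=ACMGRSVTWYHKDBN`, which is why test/compare_sam.pl rewrites U to T in the input before comparing.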
