Skip to content

Commit

Permalink
Disallow commas and other punctuation in RNAME etc (PR #333)
Browse files Browse the repository at this point in the history
Disallow \ , "`' ()[]{}<> punctuation characters in reference sequence
names. Commas and angle brackets are used to delimit refnames in other
SAM fields (e.g. SA) and in VCF files, and restricting these other
characters facilitates future delimiter and quoting syntax.

Statistics gathered from various reference sequence archives suggest
that these characters appear vanishingly infrequently in refnames in
existing files in the wild.

Fixes the SAM aspects of #124, #167, #258, and #291.

Add appendix describing parsing `name:beg-end` when name allows colons:
pseudocode description of algorithm to detect ambiguous input, as proposed
in a comment on #124; suggest also accepting an alternative `{name}:beg-end`
delimited notation.

Add previously omitted SQ-AN history note.
  • Loading branch information
jmarshall committed Jan 11, 2019
1 parent 4f2b35f commit f36d071
Showing 1 changed file with 76 additions and 6 deletions.
82 changes: 76 additions & 6 deletions SAMv1.tex
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,9 @@
\newcommand*{\firstbytebox}[2]{\byteboxAux{#1}{#2}{\put(0,0){\line(0,1){\bytetotalheight}}}}
\newcommand*{\bytebox}[2]{\byteboxAux{#1}{#2}{}}

\newcommand*{\cclass}[1]{{\rm\sf :#1:}}
\newcommand*{\caret}{\textsuperscript{$\wedge$}}

\begin{document}

\input{SAMv1.ver}
Expand Down Expand Up @@ -176,6 +179,31 @@ \subsection{Terminologies and Concepts}
mapping, all the other mappings get mapping quality $<$Q3
and are ignored by most SNP/INDEL callers.}

\subsubsection{Character set restrictions}\label{sec:charset}

Reference sequence names, CIGAR strings, and several other field types are used as values or parts of values of other fields in SAM and related formats such as VCF.
To ensure that these other fields' representations are unambiguous, these field types disallow particular delimiter characters.

Query or read names may contain any printable ASCII characters in the range \verb"[!-~]" apart from `\verb"@"', so that SAM alignment lines can be easily distinguished from header lines.
(They are also limited in length.)

Reference sequence names may contain any printable ASCII characters in the range {\tt [!-\verb:~:]} apart from backslashes, commas, quotation marks, and brackets---i.e., apart from `{\tt \verb:\:\,,\,"`'\,()\,[]\,\verb:{}:\,<>}'---and may not start with `{\tt *}' or `{\tt =}'.%
\footnote{Characters that are \emph{not} disallowed include `{\tt |}', which historically appeared in reference names derived from NCBI FASTA files, and `{\tt :}', which appears in HLA allele names.
Appendix~\ref{sec:parse-region} describes approaches for parsing \emph{name}{\tt [:}\emph{begin}{\tt -}\emph{end}{\tt ]} region notation unambiguously even though \emph{name} may itself contain colons.}
Thus they match the following regular expression:
\begin{center}
{\tt [\verb"0-9A-Za-z!#$%&+./:;?@^_|~-"][\verb"0-9A-Za-z!#$%&*+./:;=?@^_|~-"]*}
\end{center}
% Pedantically this should be [[:rname:]^*=][[:rname:]]*, but we take advantage
% of POSIX (Issue 7) section 9.3.5/8 to elide the excess brackets for clarity.
\newcommand*{\rnameRegexp}{[\cclass{rname}\caret*=][\cclass{rname}]*}
\noindent
For clarity, elsewhere in this specification we write this set of allowed characters as a character class~{\tt [\cclass{rname}]} and extend the POSIX regular expression notation to use {\tt\caret *=} to indicate the omission of `{\tt *}' and `{\tt =}' from the character class.
Thus this regular expression can be written more clearly as {\tt\rnameRegexp}.
\subsection{The header section}
Each header line begins with the character `{\tt @}' followed by
one of the two-letter header record type codes defined in this section.
Expand Down Expand Up @@ -235,7 +263,7 @@ \subsection{The header section}
The {\tt SN} tags and all individual {\tt AN} names in all {\tt @SQ} lines
must be distinct.
The value of this field is used in the
alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt [!-)+-\char60\char62-\char126][!-\char126]*}\\\cline{2-3}
alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt\rnameRegexp}\\\cline{2-3}
& {\tt LN}* & Reference sequence length. \emph{Range}: $[1,\,2^{31}-1]$\\\cline{2-3}
& {\tt AH} & Indicates that this sequence is an alternate locus.%
\footnote{See \url{https://www.ncbi.nlm.nih.gov/grc/help/definitions} for descriptions of \emph{alternate locus} and \emph{primary assembly}.}
Expand All @@ -246,13 +274,12 @@ \subsection{The header section}
to this reference sequence.%
\footnote{For example, given `{\tt @SQ SN:MT AN:chrMT,M,chrM LN:16569}',
tools can ensure that a user's request for any of `MT', `chrMT', `M',
or~`chrM' succeeds and refers to the same sequence.
Note the restricted set of characters allowed in an alternative name.}
or~`chrM' succeeds and refers to the same sequence.}
These alternative names are not used elsewhere within the SAM file;
in particular, they must not appear in alignment records' {\sf RNAME}
or~{\sf RNEXT} fields.
\emph{Regular expression}: \emph{name}{\tt (,}\emph{name}{\tt )*}
where \emph{name} is {\tt [0-9A-Za-z][0-9A-Za-z*+.@\_|-]*}\\\cline{2-3}
where \emph{name} is {\tt\rnameRegexp}\\\cline{2-3}
& {\tt AS} & Genome assembly identifier. \\\cline{2-3}
& {\tt DS} & Description. UTF-8 encoding may be used.\\\cline{2-3}
& {\tt M5} & MD5 checksum of the sequence. See Section~\ref{sec:ref-md5}\\\cline{2-3}
Expand Down Expand Up @@ -354,11 +381,11 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord}
\hline
1 & {\sf QNAME} & String & \verb:[!-?A-~]{1,254}: & Query template NAME\\
2 & {\sf FLAG} & Int & $[0,\,2^{16}-1]$ & bitwise FLAG \\
3 & {\sf RNAME} & String & {\tt \char92*|[!-()+-\char60\char62-\char126][!-\char126]*} & Reference sequence NAME\\
3 & {\sf RNAME} & String & {\tt \verb"\*"|\rnameRegexp} & Reference sequence NAME\footnotemark \\
4 & {\sf POS} & Int & $[0,\,2^{31}-1]$ & 1-based leftmost mapping POSition \\
5 & {\sf MAPQ} & Int & $[0,\,2^8-1]$ & MAPping Quality \\
6 & {\sf CIGAR} & String & {\tt \char92*|([0-9]+[MIDNSHPX=])+} & CIGAR string \\
7 & {\sf RNEXT} & String & {\tt \char92*|=|[!-()+-\char60\char62-\char126][!-\char126]*} & Ref. name of the mate/next read\\
7 & {\sf RNEXT} & String & {\tt \verb"\*"|=|\rnameRegexp} & Reference name of the mate/next read \\
8 & {\sf PNEXT} & Int & $[0,\,2^{31}-1]$ & Position of the mate/next read \\
9 & {\sf TLEN} & Int & $[-2^{31}+1,\,2^{31}-1]$ & observed Template LENgth \\
10 & {\sf SEQ} & String & {\tt \char92*|[A-Za-z=.]+} & segment SEQuence\\
Expand All @@ -367,6 +394,9 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord}
\end{tabular}
\end{center}
\footnotetext{Reference sequence names may contain any printable ASCII characters with the exception of certain punctuation characters, and may not start with `{\tt *}' or `{\tt =}'.
See Section~\ref{sec:charset} for details and an explanation of the {\tt [\cclass{rname}]} notation.}
\begin{enumerate}
\item {\sf QNAME}: Query template NAME. Reads/segments having identical {\sf QNAME}
are regarded to come from the same template. A {\sf QNAME} `{\tt *}'
Expand Down Expand Up @@ -1227,6 +1257,40 @@ \subsection{C source code for computing bin number and overlapping bins}\label{s
\begin{appendices}
\appendix
\section{Parsing region notation}\label{sec:parse-region}
Parsing region notation such as \emph{name}{\tt [:}\emph{begin}{\tt [-}\emph{end}{\tt ]]} (in which omission of the outer bracketed portion indicates a request for the entire reference sequence) would be simple if \emph{name} could not itself contain `{\tt :}' characters, but this is not the case.
The set of valid reference sequence names is usually already known when parsing this notation---for example, because the associated {\tt @SQ} headers have already been encountered.
Tools can use this set to determine unambiguously which colons could delimit a known-valid reference sequence name.
In pseudocode form, a string~\emph{str} can be parsed as follows:
\begin{tabbing}
\qquad\quad
\= consider the rightmost `{\tt :}' character, if any, of \emph{str} \+\\
if \emph{str} is of the form `\emph{prefix}{\tt :NUM}' \= or `\emph{prefix}{\tt :NUM-NUM}' \\
\> or generally `\emph{prefix}{\tt :}\emph{suffix}' for some plausible interval suffix \\
then \\
\qquad \= if both \emph{prefix} and \emph{str} are in the known set then\quad{\sf\ldots error: ambiguous representation} \\
\> else if \emph{prefix} is in the known set then return (\emph{prefix}, {\tt NUM}\ldots{\tt NUM}) \\
\> else if \emph{str} is in the known set then return (\emph{str}, entire sequence) \\
\> else\quad{\sf\ldots error: unknown reference sequence name} \\
\\
else\qquad{\sf\ldots either {\sl str} does not contain a colon or the suffix is not plausibly numeric}
\\
\> if \emph{str} is in the known set then return (\emph{str}, entire sequence) \\
\> else\quad{\sf\ldots error: unknown reference sequence name or invalid interval syntax}
\end{tabbing}
\noindent
The check leading to ``{\sf error: ambiguous representation}'' is important as it prevents confusing interpretations of actually ambiguous input.
Typically the set of valid reference sequence names will not contain names that are prefixes of other names in the set, so in practice this error will not usually be encountered in non-malicious data.
Either in addition to this algorithm or as an alternative to it, tools can use additional delimiter characters to make an unambigiously parsable notation.
We recommend a convention using curly brackets around the reference sequence name--- \verb"{"\emph{name}\verb"}"{\tt [:}\emph{begin}{\tt [-}\emph{end}{\tt ]]} ---as being memorable, easily typed, unambiguous, and not expanded by most shells.
% (RNAME cannot contain commas, so Bash's {a,b} brace expansion won't occur.)
\section{SAM Version History}\label{sec:history}
This lists the date of each tagged SAM version along with changes that
Expand All @@ -1242,6 +1306,11 @@ \section{SAM Version History}\label{sec:history}
\subsection*{1.6: 28 November 2017 to current}
\begin{itemize}
\item\textbf{Restricted the allowable punctuation characters in reference sequence names} (in {\tt @SQ SN}, {\sf RNAME}, etc).
The sets of characters allowed in {\tt @SQ SN} and {\tt @SQ AN} are now identical, which enlarges the previous {\tt AN} set. (Jan 2019)
We recommend that implementations validating reference sequence names do so using the rules in Section~\ref{sec:charset}; are more lenient for files declaring $\mbox{\tt @HD VN} \leq 1.5$; and validate {\tt AN} only against these rules, not the previous more restrictive {\tt AN} rules.
\item Add {\tt @HD SS} sorting details header tag. (Oct 2018)
\item B array optional fields may have no entries---this was already representable in BAM, clarified that empty arrays are permitted in SAM too. (Jul 2018)
\item Add {\tt @SQ DS} header tag. (Jul 2018)
Expand All @@ -1253,6 +1322,7 @@ \subsection*{1.6: 28 November 2017 to current}
\subsection*{1.5: 23 May 2013 to November 2017}
\begin{itemize}
\item Add {\tt @SQ AN} header tag, allowing only alphanumeric and `\verb"*+.@_|-"' characters in its names. (Jul 2017)
\item Add {\tt @SQ AH} header tag. (Mar 2017)
\item Auxiliary tags migrated to SAMtags document. (Sep 2016)
\item Z and H auxiliary tags are permitted to be zero length. (Jun 2016)
Expand Down

0 comments on commit f36d071

Please sign in to comment.