Add appendix describing parsing name:beg-end when name allows colons
Pseudocode description of algorithm to detect ambiguous input,
as proposed in a comment on #124.
jmarshall committed Sep 19, 2018
1 parent 19d4da0 commit 8c27bc4
Showing 1 changed file with 34 additions and 1 deletion.
35 changes: 34 additions & 1 deletion SAMv1.tex
@@ -189,7 +189,10 @@ \subsubsection{Character set restrictions}\label{sec:charset}
Query or read names may contain any printable ASCII characters in the range \verb"[!-~]" apart from `\verb"@"', so that SAM alignment lines can be easily distinguished from header lines.
(They are also limited in length.)

Reference sequence names may contain any printable ASCII characters in the range {\tt [!-\verb:~:]} apart from backslashes, commas, quotation marks, and brackets---i.e., apart from `{\tt \verb:\:\,,\,"`'\,()\,[]\,\verb:{}:\,<>}'---and may not start with `{\tt *}' or `{\tt =}'.%
\footnote{Characters that are \emph{not} disallowed include `{\tt |}', which historically appeared in reference names derived from NCBI FASTA files, and `{\tt :}', which appears in HLA allele names.
Appendix~\ref{sec:parse-region} describes an approach for parsing \emph{name}{\tt [:}\emph{begin}{\tt -}\emph{end}{\tt ]} region notation unambiguously even though \emph{name} may itself contain colons.}
Thus they match the following regular expression:
\begin{center}
{\tt [\verb"0-9A-Za-z!#$%&+./:;?@^_|~-"][\verb"0-9A-Za-z!#$%&*+./:;=?@^_|~-"]*}
@@ -1245,6 +1248,36 @@ \subsection{C source code for computing bin number and overlapping bins}\label{s
\begin{appendices}
\appendix
\section{Parsing region notation}\label{sec:parse-region}
Parsing region notation such as \emph{name}{\tt [:}\emph{begin}{\tt [-}\emph{end}{\tt ]]} (in which omission of the outer bracketed portion indicates a request for the entire reference sequence) would be simple if \emph{name} could not itself contain `{\tt :}' characters, but this is not the case.
The set of valid reference sequence names is usually already known when parsing this notation---for example, because the associated {\tt @SQ} headers have already been encountered.
Tools can use this set to determine unambiguously which colons could delimit a known-valid reference sequence name.
In pseudocode form, a string~\emph{str} can be parsed as follows:
\begin{tabbing}
\qquad\quad
\= consider the rightmost `{\tt :}' character, if any, of \emph{str} \+\\
if \emph{str} is of the form `\emph{prefix}{\tt :NUM}' \= or `\emph{prefix}{\tt :NUM-NUM}' \\
\> or generally `\emph{prefix}{\tt :}\emph{suffix}' for some plausible interval suffix \\
then \\
\qquad \= if both \emph{prefix} and \emph{str} are in the known set then\quad{\sf\ldots error: ambiguous representation} \\
\> else if \emph{prefix} is in the known set then return (\emph{prefix}, {\tt NUM}\ldots{\tt NUM}) \\
\> else if \emph{str} is in the known set then return (\emph{str}, entire sequence) \\
\> else\quad{\sf\ldots error: unknown reference sequence name} \\
\\
else\qquad{\sf\ldots either {\sl str} does not contain a colon or the suffix is not plausibly numeric}
\\
\> if \emph{str} is in the known set then return (\emph{str}, entire sequence) \\
\> else\quad{\sf\ldots error: unknown reference sequence name}
\end{tabbing}
\noindent
The check leading to ``{\sf error: ambiguous representation}'' is important because it prevents genuinely ambiguous input from being silently interpreted one way or the other.
Typically the set of valid reference sequence names will not contain a name that consists of another name in the set followed by a colon and a plausible interval suffix, so in practice this error will not usually be encountered in non-malicious data.
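
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration rather than a reference implementation: the names {\tt parse\_region}, {\tt AmbiguousRegionError}, and {\tt UnknownReferenceError} are hypothetical, and the regular expression for a ``plausible interval suffix'' (here {\tt NUM}, {\tt NUM-}, or {\tt NUM-NUM}) is an assumption, since the text does not fix that grammar.

```python
import re

# Assumed grammar for a "plausible interval suffix": NUM, NUM-, or NUM-NUM.
# The specification text leaves the exact grammar open.
_SUFFIX = re.compile(r"([0-9]+)(?:-([0-9]+)?)?")


class AmbiguousRegionError(ValueError):
    """Both the whole string and its prefix are known reference names."""


class UnknownReferenceError(LookupError):
    """Neither the whole string nor its prefix is a known reference name."""


def parse_region(s, known_names):
    """Parse name[:begin[-end]] against a set of known reference names.

    Returns (name, begin, end); begin and end are None when the entire
    sequence is requested (end is also None for an open-ended 'NUM-').
    """
    # Consider only the rightmost colon, if any.
    prefix, colon, suffix = s.rpartition(":")
    m = _SUFFIX.fullmatch(suffix) if colon else None

    if m is not None:
        if prefix in known_names and s in known_names:
            raise AmbiguousRegionError(f"ambiguous representation: {s!r}")
        if prefix in known_names:
            begin = int(m.group(1))
            end = int(m.group(2)) if m.group(2) else None
            return prefix, begin, end

    # Either there is no colon, the suffix is not plausibly numeric,
    # or the prefix is not a known name: try the whole string as a name.
    if s in known_names:
        return s, None, None
    raise UnknownReferenceError(f"unknown reference sequence name: {s!r}")
```

For example, given the known set {\tt \{"chr1", "HLA-DRB1*15:01:01:01"\}}, the string {\tt HLA-DRB1*15:01:01:01} is returned as a whole-sequence request even though its final component looks numeric, because the prefix before the rightmost colon is not itself a known name.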
\section{SAM Version History}\label{sec:history}
This lists the date of each tagged SAM version along with changes that