[DRAFT] Disallow commas and other punctuation in RNAME etc
Add previously omitted SQ-AN history note.
jmarshall committed Sep 5, 2018
1 parent ecf37f8 commit 16cff42
Showing 1 changed file with 29 additions and 6 deletions.
35 changes: 29 additions & 6 deletions SAMv1.tex
@@ -32,6 +32,9 @@
\newcommand*{\firstbytebox}[2]{\byteboxAux{#1}{#2}{\put(0,0){\line(0,1){\bytetotalheight}}}}
\newcommand*{\bytebox}[2]{\byteboxAux{#1}{#2}{}}

+\newcommand*{\cclass}[1]{{\rm\sf :#1:}}
+\newcommand*{\caret}{\textsuperscript{$\wedge$}}
+
\makeindex

\begin{document}
@@ -178,6 +181,20 @@ \subsection{Terminologies and Concepts}
mapping, all the other mappings get mapping quality $<$Q3
and are ignored by most SNP/INDEL callers.}

+\subsubsection{Query and reference sequence names}\label{sec:qrnames}
+
+Reference sequence names may contain any printable ASCII characters in the range {\tt [!-\verb:~:]} apart from backslashes, commas, quotation marks, and brackets---i.e., apart from `{\tt \verb:\:\,,\,"`'\,()\,[]\,\verb:{}:\,<>}'---and may not start with `{\tt *}' or `{\tt =}'.
+Thus they match the following regular expression:
+\begin{center}
+{\tt [\verb"0-9A-Za-z!#$%&+./:;?@^_|~-"][\verb"0-9A-Za-z!#$%&*+./:;=?@^_|~-"]*}
+\end{center}
+\newcommand*{\rnameRegexp}{[\cclass{rname}\caret*=][\cclass{rname}]*}
+\noindent
+For clarity, elsewhere in this specification we write this set of characters as a character class~{\tt [\cclass{rname}]} and extend the POSIX regular expression notation to use {\tt\caret *=} to indicate the omission of `{\tt *}' and `{\tt =}' from the character class.
+Thus this regular expression can be written more clearly as {\tt\rnameRegexp}.
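
Reviewer's sketch (not part of the commit or the specification): the explicit regular expression above can be checked against the prose rule with a few lines of Python. The names `RNAME_RE`, `is_valid_rname`, and `allowed_by_prose` are hypothetical and chosen only for this illustration.

```python
import re

# Explicit character classes copied from the regular expression in the text.
FIRST = r"0-9A-Za-z!#$%&+./:;?@^_|~-"
REST = r"0-9A-Za-z!#$%&*+./:;=?@^_|~-"
RNAME_RE = re.compile(f"[{FIRST}][{REST}]*")

# Punctuation the prose excludes: backslash, comma, quotation marks, brackets.
EXCLUDED = set("\\,\"`'()[]{}<>")


def is_valid_rname(name: str) -> bool:
    """Return True if the whole name matches the reference name pattern."""
    return RNAME_RE.fullmatch(name) is not None


def allowed_by_prose(ch: str, first: bool) -> bool:
    """Apply the prose rule directly to a single character."""
    if not "!" <= ch <= "~" or ch in EXCLUDED:
        return False
    # '*' and '=' are only forbidden as the first character.
    return not (first and ch in "*=")
```

Under these definitions a name such as `HLA-A*01:01` is accepted, because `*` is only forbidden in the first position, while `chr1,chr2`, `*`, and `=` are rejected.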
\subsection{The header section}
Each header line begins with the character `{\tt @}' followed by
one of the two-letter header record type codes defined in this section.
@@ -229,7 +246,7 @@ \subsection{The header section}
The {\tt SN} tags and all individual {\tt AN} names in all {\tt @SQ} lines
must be distinct.
The value of this field is used in the
-alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt [!-)+-\char60\char62-\char126][!-\char126]*}\\\cline{2-3}
+alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt\rnameRegexp}\\\cline{2-3}
& {\tt LN}* & Reference sequence length. \emph{Range}: $[1,\,2^{31}-1]$\\\cline{2-3}
& {\tt AH} & Indicates that this sequence is an alternate locus.%
\footnote{See \url{https://www.ncbi.nlm.nih.gov/grc/help/definitions} for descriptions of \emph{alternate locus} and \emph{primary assembly}.}
@@ -240,13 +257,12 @@ \subsection{The header section}
to this reference sequence.%
\footnote{For example, given `{\tt @SQ SN:MT AN:chrMT,M,chrM LN:16569}',
tools can ensure that a user's request for any of `MT', `chrMT', `M',
-or~`chrM' succeeds and refers to the same sequence.
-Note the restricted set of characters allowed in an alternative name.}
+or~`chrM' succeeds and refers to the same sequence.}
These alternative names are not used elsewhere within the SAM file;
in particular, they must not appear in alignment records' {\sf RNAME}
or~{\sf RNEXT} fields.
\emph{Regular expression}: \emph{name}{\tt (,}\emph{name}{\tt )*}
-where \emph{name} is {\tt [0-9A-Za-z][0-9A-Za-z*+.@\_|-]*}\\\cline{2-3}
+where \emph{name} is {\tt\rnameRegexp}\\\cline{2-3}
& {\tt AS} & Genome assembly identifier. \\\cline{2-3}
& {\tt DS} & Description. UTF-8 encoding may be used.\\\cline{2-3}
& {\tt M5} & MD5 checksum of the sequence. See Section~\ref{sec:ref-md5}\\\cline{2-3}
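
Reviewer's sketch (not part of the commit): the `AN` semantics in this hunk, where any of a sequence's names resolves to the same `SN` and all names must be distinct across `@SQ` lines, might be implemented roughly as follows. The function name `build_alias_map` is hypothetical. The sketch assumes TAB-delimited `TAG:VALUE` header fields and, for brevity, does not validate each name against the reference name pattern.

```python
def build_alias_map(sq_lines):
    """Map every SN and AN name found in @SQ lines to its primary SN name."""
    alias_to_sn = {}
    for line in sq_lines:
        # Skip the leading "@SQ" token; split each field at the first colon.
        fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        sn = fields["SN"]
        names = [sn] + (fields["AN"].split(",") if "AN" in fields else [])
        for name in names:
            # SN values and all individual AN names must be distinct.
            if name in alias_to_sn:
                raise ValueError(f"duplicate reference name: {name}")
            alias_to_sn[name] = sn
    return alias_to_sn
```

With the footnote's example line, requests for `MT`, `chrMT`, `M`, or `chrM` would all resolve to `MT`.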
@@ -348,11 +364,11 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord}
\hline
1 & {\sf QNAME} & String & \verb:[!-?A-~]{1,254}: & Query template NAME\\
2 & {\sf FLAG} & Int & $[0,\,2^{16}-1]$ & bitwise FLAG \\
-3 & {\sf RNAME} & String & {\tt \char92*|[!-()+-\char60\char62-\char126][!-\char126]*} & Reference sequence NAME\\
+3 & {\sf RNAME} & String & {\tt \verb"\*"|\rnameRegexp} & Reference sequence NAME\footnotemark \\
4 & {\sf POS} & Int & $[0,\,2^{31}-1]$ & 1-based leftmost mapping POSition \\
5 & {\sf MAPQ} & Int & $[0,\,2^8-1]$ & MAPping Quality \\
6 & {\sf CIGAR} & String & {\tt \char92*|([0-9]+[MIDNSHPX=])+} & CIGAR string \\
-7 & {\sf RNEXT} & String & {\tt \char92*|=|[!-()+-\char60\char62-\char126][!-\char126]*} & Ref. name of the mate/next read\\
+7 & {\sf RNEXT} & String & {\tt \verb"\*"|=|\rnameRegexp} & Reference name of the mate/next read \\
8 & {\sf PNEXT} & Int & $[0,\,2^{31}-1]$ & Position of the mate/next read \\
9 & {\sf TLEN} & Int & $[-2^{31}+1,\,2^{31}-1]$ & observed Template LENgth \\
10 & {\sf SEQ} & String & {\tt \char92*|[A-Za-z=.]+} & segment SEQuence\\
@@ -361,6 +377,9 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord}
\end{tabular}
\end{center}
+\footnotetext{Reference sequence names may contain any printable ASCII characters with the exception of certain punctuation characters, and may not start with `{\tt *}' or `{\tt =}'.
+See Section~\ref{sec:qrnames} for details and an explanation of the {\tt [\cclass{rname}]} notation.}
\begin{enumerate}
\item {\sf QNAME}: Query template NAME. Reads/segments having identical {\sf QNAME}
are regarded to come from the same template. A {\sf QNAME} `{\tt *}'
@@ -1233,6 +1252,9 @@ \section{SAM Version History}\label{sec:history}
\subsection*{1.6: 28 November 2017 to current}
\begin{itemize}
+\item Restricted the allowable punctuation characters in RNAME and similar fields.
+The sets of characters allowed in {\tt @SQ SN} and {\tt @SQ AN} are now identical, which slightly enlarges the previous {\tt AN} set.
+(Sep 2018)
\item B array optional fields may have no entries---this was already representable in BAM, clarified that empty arrays are permitted in SAM too. (Jul 2018)
\item Add {\tt @SQ DS} header tag. (Jul 2018)
\item Add {\tt @RG BC} header tag. (Apr 2018)
@@ -1243,6 +1265,7 @@ \subsection*{1.6: 28 November 2017 to current}
\subsection*{1.5: 23 May 2013 to November 2017}
\begin{itemize}
+\item Add {\tt @SQ AN} header tag, allowing only alphanumeric and `\verb"*+-.@_|"' characters in its names. (Jul 2017)
\item Add {\tt @SQ AH} header tag. (Mar 2017)
\item Auxiliary tags migrated to SAMtags document. (Sep 2016)
\item Z and H auxiliary tags are permitted to be zero length. (Jun 2016)