Restrict allowed VCF Contig ID chars to those allowed in SAM RNAMEs
Disallow \ , "`' (){} punctuation characters in VCF contig IDs.
The characters []<> were already disallowed in VCF; this also relaxes
the prohibition of * to merely disallowing initial *.

Statistics gathered from various reference sequence archives suggest
that the characters restricted appear vanishingly infrequently in SAM
reference sequence names in existing files in the wild. To the extent
that all contig IDs in VCF files come from corresponding SAM/BAM files,
this means there is little concern about making the same restrictions
in VCF contig IDs.

Fixes samtools#124 and fixes samtools#167 for VCF; their SAM aspects were previously
fixed by PR samtools#333.
jmarshall committed Feb 6, 2020
1 parent 51e28f5 commit df03971
Showing 1 changed file with 12 additions and 1 deletion.
13 changes: 12 additions & 1 deletion VCFv4.3.tex
@@ -226,7 +226,14 @@ \subsubsection{Contig field format}
\end{verbatim}

\noindent
-Valid contig names must follow the reference sequence names allowed by the SAM format ("{\tt [!-)+-\char60\char62-\char126][!-\char126]*}") excluding the characters "\texttt{\textless\textgreater[]*}" to avoid clashes with symbolic alleles.
+Contig names follow the same rules as the SAM format's reference sequence names:
+they may contain any printable ASCII characters in the range \verb|[!-~]| apart from `{\tt\verb|\|\,,\,"`'\,()\,[]\,\verb|{}|\,<>}' and may not start with `{\tt *}' or `{\tt =}'.
+Thus they match the following regular expression:
+\begin{verbatim}
+[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
+\end{verbatim}
+\noindent
+In particular, excluding commas facilitates parsing \verb|##contig| lines, and excluding the characters `\verb|<>[]|' and initial~`{\tt *}' avoids clashes with symbolic alleles.
The contig names must not use a reserved symbolic allele name.
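
The character rules in this hunk can be checked mechanically. The following is a minimal Python sketch, not part of the commit: the regular expression is copied from the specification text above, while the names CONTIG_NAME_RE and is_valid_contig_name are hypothetical. It checks only the character rules and does not test whether a name collides with a reserved symbolic allele name.

```python
import re

# Regular expression copied verbatim from the VCFv4.3.tex text in this hunk.
# The constant and function names are illustrative assumptions.
CONTIG_NAME_RE = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_contig_name(name: str) -> bool:
    """Return True if the whole of `name` satisfies the character rules.

    The reserved symbolic allele name rule is NOT checked here.
    """
    return CONTIG_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["chr1", "HLA-A*01:01", "*chr1", "=chr1", "chr1,chr2", "ctg<1>"]:
        print(candidate, is_valid_contig_name(candidate))
```

Using fullmatch rather than match matters here: match would accept a name such as chr1,chr2 because its prefix chr1 is valid.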
@@ -2047,6 +2054,10 @@ \subsection{Changes to VCFv4.3}
\item Tables with Type and Number definitions for INFO and FORMAT reserved keys
+\item
+The set of characters allowed in VCF contig names is now the same as that allowed in SAM reference sequence names, which was restricted in January 2019.
+The characters `{\tt\verb|\|\,,\,"`'\,()\,\verb|{}|}' are now invalid in VCF contig names, while `{\tt *}' is now valid when not the first character.
+(The characters `{\tt []\,<>}' and initial~`{\tt *}'/`{\tt =}' were already invalid and remain so.)
The VCF specification previously disallowed colons (`{\tt :}') in contig names to avoid confusion when parsing breakends, but this was unnecessary.
Even with contig names containing colons, the breakend mate position notation can be unambiguously parsed because the ``{\tt :}\emph{pos}'' part is \textbf{always} present.
\end{itemize}
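
The point about colons can be illustrated with a short Python sketch, again not part of the commit. It assumes the mate locus has already been extracted from between the square brackets of a breakend ALT value, and the function name split_breakend_locus is hypothetical. Because the trailing ":pos" is always present and pos consists only of digits, splitting at the last colon recovers the contig name even when the name itself contains colons.

```python
import re
from typing import Tuple

# Greedy ".+" backtracks to the LAST colon, since everything after that colon
# must be ASCII digits. Pattern and function names are illustrative assumptions.
_LOCUS_RE = re.compile(r"(.+):([0-9]+)")


def split_breakend_locus(locus: str) -> Tuple[str, int]:
    """Split a breakend mate locus 'contig:pos' into (contig, pos)."""
    m = _LOCUS_RE.fullmatch(locus)
    if m is None:
        raise ValueError(f"not a contig:pos locus: {locus!r}")
    return m.group(1), int(m.group(2))


if __name__ == "__main__":
    print(split_breakend_locus("chr13:123456"))
    print(split_breakend_locus("HLA-A*01:01:500"))
```

This sketch handles only the locus part; recognising the four bracket orientations of a full breakend ALT string is left out.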