- GSuite version: 0.9
- Document version: 1.0a
- Date: May 27, 2015
- Authors: Sveinung Gundersen, Boris Simovski, Abdulrahman Azab, Diana Domanska, Eivind Hovig, and Geir Kjetil Sandve
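- A. Introduction and background
  - A.1 Overview
  - A.2 Suites of tracks
  - A.3 Location of tracks
  - A.4 Preprocessing of tracks into a binary format (BTrack)
  - A.5 Track types
- B. Example GSuite files
  - B.1 Example 1
  - B.2 Example 2
  - B.3 Example 3
- C. Syntax of the GSuite format
  - C.i Empty lines
  - C.ii Comment lines
  - C.1 Header lines
  - C.2 Column specification line
  - C.3 Track lines
- D. References
- E. Change log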
A. Introduction and background
A.1 Overview
GSuite is a simple tabular text format for specifying a suite (i.e. a set or collection) of genomic annotation tracks (simply called tracks in this document). A GSuite file does not contain any genomic data as such, but provides the metadata necessary to locate the track contents, information on whether the track has been preprocessed in a manner suitable for analysis (see the BTrack file format), some basic information on how to analyze the data (see the track type concept), as well as the reference genome build that the track coordinates are based upon. In addition, the user may add as many custom metadata columns as needed.
A.2 Suites of tracks
Central to the concept of track suites is the idea that tracks which take part in a GSuite file should be somewhat related in contents and format. Although the GSuite format allows heterogeneous tracks to be banded together in a single file, such files will typically not be useful for analysis purposes, as one would almost always need to restrict the contents and/or format as required by the analysis tools. For instance, a tool that finds the intersection of base pairs covered by all tracks in a suite would require all tracks to be of type "points" or "segments", not "function", as tracks of that type cover all base pairs (see the section "Track types" below for more information). For this reason, the GSuite file format specifies a set of four header variables that could (and should) be stated at the beginning of the file. These header variables function as a summary over the tracks in the file, providing a specific value if all the tracks are in accordance with each other. If the tracks vary on a particular aspect, the corresponding header variable is set to the reserved keyword "multiple". This typically indicates that the collection of tracks is not yet focused enough to be usable as a suite of tracks for analysis purposes.
A.3 Location of tracks
In order to analyze multiple tracks, one obviously needs to acquire such tracks. Some tracks one might have acquired directly from sequencing endeavors (e.g. ChIP-seq peaks), but often one needs to fetch such tracks from public repositories and databases such as those provided by the ENCODE [1] and Roadmap Epigenomics [2] projects. GSuite supports the specification of suites of tracks before the actual track files have been retrieved from a server. In such cases, the location of the tracks is termed "remote" and the GSuite file would typically contain an HTTP or FTP address to the remote location. Tracks that have been retrieved and are stored at the same place as the GSuite file are termed "local".
A.4 Preprocessing of tracks into a binary format (BTrack)
As part of the implementation of an analysis tool, one would typically need a track to be translated, or preprocessed, into a binary format before analysis takes place, as this greatly improves analysis speed. Often this is done behind the scenes inside the analysis tool. In the GSuite format, however, the concept of preprocessed binary versions of a track has been included explicitly as part of the format. The reason for this is that preprocessing typically takes some time per track, and when one works with multiple tracks (often hundreds) this step will consume a significant amount of time. Carrying out the preprocessing step as a one-time process, instead of every time one runs an analysis tool, thus saves much time for the user. Analysis tools therefore typically require the tracks in a GSuite to be preprocessed in advance.
Preprocessing of a GSuite file results in the tracks being stored in the BTrack format. BTrack is a binary format for genomic tracks that allows for fast retrieval and efficient analyses by storing data columns as numeric arrays. An analogue to the BTrack format in the domain of sequence alignment is the BAM format, which is a binary version of the textual SAM format. BTrack is thus the binary version of the previously published GTrack format [3].
BTrack is the new name for the previously unnamed internal track storage format used in the Genomic HyperBrowser [3,4]. The BTrack format has seen several major updates as part of the HyperBrowser code base, and will soon be released as a separate binary format that allows multiple tracks to be stored in a single binary file (currently unpublished). The GSuite format is intimately linked to the BTrack format, as a BTrack file would be able to store both a GSuite file and the actual track contents.
A.5 Track types
The concept of track types has been examined in detail in a previous publication [3]. Briefly, a track type is a characterization of the geometrical/mathematical properties of a track. A track is typically envisioned as data somehow located along the DNA sequence of a particular reference genome. The simplest track type is "points", which refers to single base pairs scattered along the genome, e.g. SNPs. "Segments" is the more common type and represents regions of the DNA, e.g. genes. With the addition of values and/or cross-genomic links, a total of 15 track types were delineated in [3]. The main usage scenario of track types is to limit which tracks it makes sense to use as input to a particular analysis tool. For example, an analysis of the base pair overlap of two tracks would typically require the tracks to be of type "segments". When it comes to the analysis of multiple tracks, one would typically require all tracks to be of the same track type. The GSuite format thus supports "track type" as one of the main header variables (see below). The following is a list of all 15 supported track types, as delineated in [3]:
- Points (P)
- Valued Points (VP)
- Segments (S)
- Valued Segments (VS)
- Genome Partition (GP)
- Step Function (SF)
- Function (F)
- Linked Points (LP)
- Linked Valued Points (LVP)
- Linked Segments (LS)
- Linked Valued Segments (LVS)
- Linked Genome Partition (LGP)
- Linked Step Function (LSF)
- Linked Function (LF)
- Linked Base Pairs (LBP)
B. Example GSuite files
Before going into the details of the GSuite format, one should be able to get a quick overview of the format by looking at these example files:
B.1 Example 1
- List of URLs
http://www.server.com/path/to/file.bed
http://www.server.com/path/to/file2.bed
http://www.server2.com/path/to/other_file.bed
ftp://www.server3.com/path/to/new_file.wig
B.2 Example 2
- Header lines
- List of URLs
##location: remote
##file format: primary
##track type: segments
##genome: hg38
http://www.server.com/path/to/file.bed
http://www.server.com/path/to/file2.bed
http://www.server2.com/path/to/other_file.bed
ftp://www.server3.com/path/to/new_file.gff
B.3 Example 3
- Header lines
- List of URLs
- including GSuite-specific URL schemes
- Comment lines
- Extra columns
##location: multiple
##file format: multiple
##track type: segments
##genome: hg38
###uri title p-values
http://www.server.com/path/to/file.bed track_1 0.002
http://www.server.com/path/to/file2.bed track_2 0.1
http://www.server2.com/path/to/other_file.bed track_3 1.0
ftp://www.server3.com/path/to/new_file.gff track_4 0.8
galaxy:/abcd1234abcd;bed track_5 0.012
hb:/my/track/name track_6 .
- Note:
For readability, this example has been formatted in columns using spaces. An actual GSuite file must use tabs as column separators, as required by the GSuite format.
C. Syntax of the GSuite format
GSuite is a tabular text file format. All GSuite filenames should end with ".gsuite". The GSuite format consists of 5 different line types, distinguished by the leading characters and numbered here by order of appearance in the file:
- C.i Empty lines
- C.ii Comment lines
- C.1 Header lines
- C.2 Column specification line
- C.3 Track lines
Note:
The Arabic numeral in each section number defines the order in which the line types must appear: i.e., the column specification line (C.2) must follow the header lines (C.1). Roman numerals indicate empty lines and comment lines, which may be present anywhere.
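The leading-character rule and the ordering rule can be illustrated with a short sketch. The function names `classify_line` and `check_line_order` are hypothetical, and the treatment of a line starting with four or more hash characters (classified here as a column specification line) is an assumption, since the specification does not address that case.

```python
# Rank of the ordered line types (C.1, C.2, C.3). Empty lines and comment
# lines (C.i, C.ii) are unordered and therefore have no rank.
LINE_ORDER = {"header": 1, "column specification": 2, "track": 3}


def classify_line(line: str) -> str:
    """Return the GSuite line type, judged by the leading characters."""
    if line.strip() == "":
        return "empty"
    # The longest prefix must be tested first: "###" also starts with "##".
    if line.startswith("###"):
        return "column specification"
    if line.startswith("##"):
        return "header"
    if line.startswith("#"):
        return "comment"
    return "track"


def check_line_order(lines) -> None:
    """Raise ValueError if an ordered line type appears after a later one."""
    stage = 0
    for number, line in enumerate(lines, start=1):
        kind = classify_line(line)
        rank = LINE_ORDER.get(kind)
        if rank is None:  # empty and comment lines may appear anywhere
            continue
        if rank < stage:
            raise ValueError(f"line {number}: {kind} line appears too late")
        stage = rank
```

A parser could call `check_line_order` once on the whole file before interpreting the individual lines.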
C.i Empty lines
- Leading characters: (none)
- Contents: only whitespace characters (space, tab, newline, return)
- Presence: optional
- Description: Empty lines are allowed anywhere in the GSuite file and are ignored by parsers.
C.ii Comment lines
- Leading characters: `#` (a single hash character)
- Example: `# this is a comment`
- Presence: optional
- Description: Comments are allowed anywhere and will be ignored by parsers. Note that a comment line following a track line is considered to be a comment for that track and can, for instance, be used by tools that create GSuite files to present track-specific error messages to the user.
C.1 Header lines
- Leading characters: `##`
- Syntax: `##variable:[ ]*value`, where
  - `variable` = header variable name
  - `[ ]*` = optional space characters
  - `value` = header variable value
- Example:
  ##location: local
  ##file format: preprocessed
  ##track type: segments
  ##genome: hg38
- Presence: optional in an input GSuite file, but auto-generated when a GSuite is created as output from a tool
A header variable contains information that relates to the whole of the GSuite file, and is thus a summary over all the tracks in the file. The header variable names are limited to a set of reserved keywords, each with a restricted set of values. The header variables are related to reserved columns of the track lines (see the section "Column specification line" below).
The following header variable names have reserved meaning, and are the only headers allowed by the GSuite format:
`location`
`file format`
`track type`
`genome`
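As an illustration of the header line syntax `##variable:[ ]*value`, the following sketch parses and validates a single header line. The names `parse_header_line` and `ALLOWED_VALUES` are hypothetical. Lowercasing everything except the genome value is one reading of the case rule stated in this section, and stripping surrounding whitespace from the name and value is an assumption.

```python
import re

# "##", the variable name, a colon, optional spaces, then the value.
HEADER_RE = re.compile(r"^##([^:]+):[ ]*(.*)$")

BASIC_TYPES = ["points", "valued points", "segments", "valued segments",
               "genome partition", "step function", "function"]
TRACK_TYPES = set(BASIC_TYPES) | {"linked " + t for t in BASIC_TYPES} | {"linked base pairs"}

# Allowed values per reserved header variable; None means "any string".
ALLOWED_VALUES = {
    "location": {"unknown", "remote", "local", "multiple"},
    "file format": {"unknown", "primary", "preprocessed", "multiple"},
    "track type": TRACK_TYPES | {"unknown", "multiple"},
    "genome": None,
}


def parse_header_line(line: str):
    """Return (name, value) for a header line, or raise ValueError."""
    match = HEADER_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a header line: {line!r}")
    name = match.group(1).strip().lower()
    value = match.group(2).strip()
    if name not in ALLOWED_VALUES:
        raise ValueError(f"unsupported header variable: {name!r}")
    if name != "genome":
        value = value.lower()  # assumption: genome values keep their case
    allowed = ALLOWED_VALUES[name]
    if allowed is not None and value not in allowed:
        raise ValueError(f"illegal value {value!r} for header {name!r}")
    return name, value
```

For example, `parse_header_line("##track type: segments")` yields the pair `("track type", "segments")`, while a header name outside the four reserved ones is rejected.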
In the following, the reserved header variables are described in detail:
- `location`: Specifies whether the data contents of all tracks in the GSuite are found at remote locations on the Internet, or whether they have been downloaded locally to the service parsing the GSuite file (see the section "Location of tracks" above). Note that the service parsing the GSuite may itself be located on e.g. a web server, but the tracks of the GSuite are still considered local if they are on the same server as the service.
  The `location` header is a summary of the different types of URI schemes present in the `uri` column in the track lines (see the section "Column specification line" below). All supported types of URIs are thus defined as either remote or local.
  Allowed values: `unknown`, `remote`, `local`, `multiple`
- `file format`: Specifies whether all tracks have been preprocessed into the binary format BTrack, which is a prerequisite for most analysis tools. The `file format` header variable is a summary of the contents of the `file_format` column in the track lines (see the section "Column specification line" below).
  Allowed values: `unknown`, `primary`, `preprocessed`, `multiple`
- `track type`: Specifies the track type common to all the tracks in the GSuite file, if any. See the section "Track types" above for more information. The `track type` header variable is a summary of the contents of the `track_type` column in the track lines (see the section "Column specification line" below).
  Note that if the track types of the tracks are different, but based upon the same basic type, the common track type of the GSuite file is set to the simplest track type that can be used to describe all tracks, if any. E.g. if two tracks have the types `valued segments` and `linked segments`, respectively, the track type of the GSuite file is `segments`. If there is no such simple track type, the keyword `multiple` is used.
  Allowed values: `unknown`, `points`, `valued points`, `segments`, `valued segments`, `genome partition`, `step function`, `function`, `linked points`, `linked valued points`, `linked segments`, `linked valued segments`, `linked genome partition`, `linked step function`, `linked function`, `linked base pairs`, `multiple`
- `genome`: Specifies the reference genome for all the tracks in the GSuite file. The `genome` header variable is a summary of the contents of the `genome` column in the track lines (see the section "Column specification line" below). The actual keyword for the genome build is dependent on the implementation of the analysis tools that will make use of the information. The GSuite format accepts any string as the genome.
  Allowed values: `unknown`, `multiple`, or any other string specifying a reference genome assembly
If a header variable is missing, it will be auto-generated from the track lines. If a header variable is present, but with a value that is inconsistent with the track lines, the parser will return an error. Note that all header variable lines except for the "genome" variable allow a mix of lower- and uppercase characters.
The following logic for the values `unknown` and `multiple` holds for all header variables:
- `unknown`: if at least one track has `unknown` as its value, the value of the GSuite header variable will also be `unknown`, regardless of the values for the other tracks.
- `multiple`: if at least one track has a different value than the others, the value of the GSuite header variable will be `multiple` (unless the value for one of the tracks is `unknown`, in which case that keyword takes precedence).
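The precedence between `unknown` and `multiple` can be sketched as a small function. The name `summarize_header` is hypothetical, and returning `unknown` for a GSuite without any tracks is an assumption that the specification does not state. The sketch also ignores the special rule for `track type`, where differing types may reduce to a simpler common type instead of `multiple`.

```python
def summarize_header(track_values) -> str:
    """Summarize per-track values into a single header variable value."""
    values = list(track_values)
    if not values:
        return "unknown"  # assumption: no tracks means nothing is known
    if "unknown" in values:
        return "unknown"  # "unknown" takes precedence over "multiple"
    if len(set(values)) > 1:
        return "multiple"  # at least one track differs from the others
    return values[0]  # all tracks agree on one value
```

For example, three tracks with genomes `hg38`, `hg19`, and `unknown` summarize to `unknown`, whereas `hg38` and `hg19` alone summarize to `multiple`.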
C.2 Column specification line
- Leading characters: `###`
- Syntax: `###col1 col2 col3...`, where
  - `col1`, `col2`, `col3` = column names
  - the whitespace between column names = a single tab character
- Example: `###uri title file_format track_type genome description p-value` (here: spaces instead of tabs)
- Default value: `###uri`
- Presence: optional. If not defined, the column specification line retains the default value. This means that a plain list of URIs is a valid GSuite file.
The column specification line is a tab-separated list of column names, containing metadata information about the tracks in the track suite defined in a GSuite file.
The GSuite specification defines a set of five reserved column names:
`uri`
`title`
`file_format`
`track_type`
`genome`
In addition, any number of custom column names can be specified, each representing varied types of metadata available for the tracks.
In the following, the reserved column names are described in detail:
- `uri`: A unique identifier of a track file, following the Uniform Resource Identifier (URI) format [5]. GSuite supports a diverse set of URI schemes:
  - URI schemes for data residing at a remote location (i.e., when `location` is `remote`): `ftp`, `http`, `https`, `rsync`
    Examples:
    ftp://ftp.server.com/path/to/file.bed
    http://www.server.com:8080/index?filename=track.wig
    rsync://server.com/path/to/file
  - URI schemes for data residing locally (i.e., when `location` is `local`): `file`, `galaxy`, `hb`, defined as follows:
    - `file`: Standard URI scheme for local files.
      Example: file:///path/to/file/bed
      Note: The `file` scheme does not support files residing in places other than "localhost". The host part of the URI is thus unneeded, hence the triple `/` characters.
    - `galaxy`: The `galaxy` scheme uniquely identifies a Galaxy dataset, but currently only works for the local installation of the Galaxy analysis framework [6] that is set up with GSuite support, i.e. one cannot (yet) provide a URI to a remote Galaxy installation.
      Syntax: galaxy:/dataset_key[/directory/structure/to/file]
      Example (taken from Example 3 above): galaxy:/abcd1234abcd;bed
      Multiple files can be stored within one Galaxy history element using the directory structure syntax.
    - `hb`: The `hb` scheme identifies a track stored in the BTrack format within the local installation of the GSuite HyperBrowser.
      Syntax: hb:/track/name/hierarchy
  - Note that for all the URI schemes except `hb`, GSuite supports the additional specification of a file suffix after a semicolon, as in this example:
    ftp://ftp.server.com/path/to/file;bed
    This is useful if the file path itself does not contain the suffix, and hence does not contain any information on the actual file format of the track.
  - Parser notes: Services available from e.g. the web should disable the `file` scheme, as it is inherently insecure.
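The mapping from URI scheme to location, together with the optional `;suffix` notation, can be sketched as follows. The helper names `split_uri` and `location_of` are hypothetical, and the sketch uses plain string splitting instead of a full RFC 3986 parser. Rejecting unsupported schemes with an error is an assumption. A semicolon that is part of the address itself (for instance inside a query string) would be misread as a suffix separator by this simplification.

```python
REMOTE_SCHEMES = {"ftp", "http", "https", "rsync"}
LOCAL_SCHEMES = {"file", "galaxy", "hb"}


def split_uri(uri: str):
    """Split a GSuite URI into (scheme, remainder, suffix or None)."""
    scheme, colon, rest = uri.partition(":")
    if not colon or not scheme:
        raise ValueError(f"not a URI: {uri!r}")
    scheme = scheme.lower()
    if scheme not in REMOTE_SCHEMES | LOCAL_SCHEMES:
        raise ValueError(f"unsupported URI scheme: {scheme!r}")
    suffix = None
    # All schemes except "hb" may carry a file suffix after a semicolon.
    if scheme != "hb" and ";" in rest:
        rest, _, suffix = rest.rpartition(";")
    return scheme, rest, suffix


def location_of(uri: str) -> str:
    """Return "remote" or "local" according to the URI scheme."""
    scheme, _, _ = split_uri(uri)
    return "remote" if scheme in REMOTE_SCHEMES else "local"
```

A web-facing service following the parser note above could simply remove `"file"` from `LOCAL_SCHEMES`, so that such URIs are rejected.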
- `title`: The title of the track, as specified by the user. Each track title must be unique within a specific GSuite, so that one may use the title as a key to uniquely reference specific tracks in a GSuite.
  Allowed values: *any*
- `file_format`: Specifies whether the track has been preprocessed into the binary format BTrack or not, as described in the section "Header lines" above.
  If the GSuite parser understands the file suffix to be an un-preprocessed format, `file_format` is automatically set to `primary`. Similarly, tracks in the BTrack format (including those with `hb` as URI scheme) automatically get `preprocessed` as `file_format`.
  Allowed values: `unknown`, `primary`, `preprocessed`
  Default value: `unknown`
- `track_type`: Specifies the track type of the track, as described in the section "Header lines" above. If the track is preprocessed into a BTrack file, the value of `track_type` is automatically collected from the BTrack file itself.
  Allowed values: `unknown`, `points`, `valued points`, `segments`, `valued segments`, `genome partition`, `step function`, `function`, `linked points`, `linked valued points`, `linked segments`, `linked valued segments`, `linked genome partition`, `linked step function`, `linked function`, `linked base pairs`
  Default value: `unknown`
- `genome`: Specifies the reference genome build used as the basis of the track, as described in the section "Header lines" above.
  Allowed values: `unknown`, or any other string specifying a reference genome
  Default value: `unknown`
- Custom columns: Any number of custom columns can be added. Any string can be used as the value for each track, so the GSuite format places few, if any, rules on the content. Missing values for custom columns are denoted with the period character: `.`
If the value in the "file_format" column is the same for all tracks in a GSuite, the column can be removed, leaving only the value of the "file format" header variable to speak for all individual tracks. The same logic holds also for the columns "track_type" and "genome".
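A consequence of this rule is that a reader of a GSuite file has to fall back to the header variable when a reserved column is absent. The following sketch shows one way to do that, assuming tracks and headers are held in plain dictionaries. The name `effective_value` and the dictionary layout are hypothetical, and falling back to the column default `unknown` when the header is also missing is an assumption.

```python
# Reserved columns that may be left out, mapped to their header variables.
HEADER_FOR_COLUMN = {
    "file_format": "file format",
    "track_type": "track type",
    "genome": "genome",
}


def effective_value(track: dict, headers: dict, column: str) -> str:
    """Return the value of a reserved column for one track.

    The per-track column wins if present; otherwise the matching header
    variable speaks for the track; otherwise the default "unknown" is used.
    """
    if column in track:
        return track[column]
    return headers.get(HEADER_FOR_COLUMN[column], "unknown")
```

For a file with the header `##genome: hg38` and no `genome` column, `effective_value(track, {"genome": "hg38"}, "genome")` gives `hg38` for every track.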
Column names are treated as case-insensitive. All column names must also be unique. The columns can be ordered in any way, but for readability it is recommended to use "uri" and "title" as the first two columns, if defined.
C.3 Track lines
- Leading characters: (none)
- Syntax: `val1 val2 val3...`, where
  - `val1`, `val2`, `val3` = column values
  - the whitespace between column values = a single tab character
- Example:
  ###uri title p-value
  http://www.server.com/path/to/file.bed My cool track 0.00013
  (here: spaces instead of tabs)
- Presence: optional. If no track lines are specified, the GSuite file represents an empty collection of tracks.
- Description: Each track is specified as a tab-separated list of metadata values, as defined by the column specification line. See the description of the reserved column names in the section "Column specification line" above for a more detailed discussion of the allowed values.
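To show how the column specification line and the track lines fit together, here is a minimal reader sketch. The function name `parse_tracks` is hypothetical. Header lines are skipped without being interpreted, and requiring each track line to have exactly as many values as there are columns is an assumption. The `.` marker is translated to `None` only for custom columns.

```python
RESERVED_COLUMNS = {"uri", "title", "file_format", "track_type", "genome"}


def parse_tracks(text: str) -> list:
    """Return one dict per track line, keyed by lowercased column name."""
    columns = ["uri"]  # default column specification: ###uri
    tracks = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # empty line
        if line.startswith("###"):
            # Column names are case-insensitive and must be unique.
            columns = [name.strip().lower() for name in line[3:].split("\t")]
            if len(set(columns)) != len(columns):
                raise ValueError(f"line {number}: duplicate column names")
            continue
        if line.startswith("#"):
            continue  # header (##) or comment (#) line, skipped here
        values = line.split("\t")
        if len(values) != len(columns):
            raise ValueError(f"line {number}: expected {len(columns)} values")
        track = {}
        for name, value in zip(columns, values):
            # "." denotes a missing value in custom columns.
            missing = name not in RESERVED_COLUMNS and value == "."
            track[name] = None if missing else value
        tracks.append(track)
    titles = [track["title"] for track in tracks if "title" in track]
    if len(set(titles)) != len(titles):
        raise ValueError("track titles must be unique within a GSuite")
    return tracks
```

Since the default column specification is `###uri`, a plain list of URIs parses into one single-key dictionary per line.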
D. References
1. ENCODE Project Consortium. "An integrated encyclopedia of DNA elements in the human genome." Nature 489.7414 (2012): 57-74.
2. Kundaje, Anshul, et al. "Integrative analysis of 111 reference human epigenomes." Nature 518.7539 (2015): 317-330.
3. Gundersen, Sveinung, et al. "Identifying elemental genomic track types and representing them uniformly." BMC Bioinformatics 12.1 (2011): 1.
4. Sandve, Geir K., et al. "The Genomic HyperBrowser: inferential genomics at the sequence level." Genome Biology 11.12 (2010): 1-12.
5. Uniform Resource Identifier (URI): Generic Syntax (https://tools.ietf.org/html/rfc3986)
6. Goecks, Jeremy, Anton Nekrutenko, and James Taylor. "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences." Genome Biology 11.8 (2010): R86.
E. Change log
v0.1 - 2015.07.06:
- Initial version of the GSuite specification document.
v0.2 - 2016.07.06:
- Fixed typos and cleaned up text several places. Ready for initial submission of the GSuite HyperBrowser manuscript.
v0.3 - 2018.05.23:
- Converted specification to Markdown
- Smaller changes in formatting and wording
v1.0a - 2020.07.13:
- Heavy cleanup and reformatting of Markdown
- Still quite a bit of cleanup to do in the syntax section