Interim taxonomy file format
This page describes the format used to represent the taxonomies that are the inputs and outputs of the Open Tree of Life taxonomy build system.
The format derives from NCBI and is intentionally rudimentary because our needs are minimal. A better format to use in the long run might be Darwin Core Archive, which is what is used by GBIF, EOL, and the Global Names Architecture (GNA).
Each source taxonomy (NCBI, GBIF, Index Fungorum, ...) has its own script that converts its native format into this format.
A taxonomy consists of a directory of files with fixed names. Example: mycobank/taxonomy.tsv, mycobank/synonyms.tsv, mycobank/about.md.
All files use the UTF-8 character encoding. Native taxonomy files often use some other encoding, so conversion might be necessary. Some aggregated taxonomies on the web have gotten this wrong and are a mess of mixed encodings and spurious re-encodings.
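As a minimal sketch of the conversion step, the following re-encodes bytes from a known source encoding and checks whether a byte string decodes cleanly as UTF-8. The function names are hypothetical, and the source encoding has to be known in advance; nothing here detects it.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Running is_valid_utf8 over a converted file before using it is a cheap way to catch a missed conversion, although it cannot detect text that was re-encoded twice.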
Four required columns, each column followed by tab - vertical bar - tab (even for the last column, which is unlike NCBI). The taxonomy build tool 'smasher' doesn't require the vertical bars; they are optional although they should be either all present or all absent. But some other consumers of these files may still require the vertical bars.
A header row of column names is recommended, but not required (for Smasher). If provided, it looks like:
uid | parent_uid | name | rank |
Each following row describes one taxon.
Columns:
- uid - an identifier for the taxon, unique within this file. Should be native accession number whenever possible. Usually this is an integer, but it need not be.
- parent_uid - the identifier of this taxon's parent, or the empty string if there is no parent (i.e., it's a root).
- name - arbitrary text for the taxon name; not necessarily unique within the file.
- rank - e.g. species, family, class. Should be all lower case. If no rank is assigned, or the rank is unknown, put "no rank".
Example (from NCBI):
5157 | 1028423 | Ceratocystis | genus |
5156 | 91171 | Gondwanamyces proteae | species |
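The row format described above can be read with a few lines of Python. This is only a sketch, not how Smasher parses the file: split_row and parse_taxonomy are hypothetical names, and when no header row is present the code assumes the column order shown in the examples on this page.

```python
from typing import Dict, Iterable, List

# Assumed column order when the file has no header row.
DEFAULT_COLUMNS = ["uid", "parent_uid", "name", "rank",
                   "sourceinfo", "uniqueName", "flags"]


def split_row(line: str) -> List[str]:
    """Split one row, with or without the tab-bar-tab separators."""
    line = line.rstrip("\r\n")
    if "\t|" in line:
        # Bar-delimited form: drop the terminator after the last column.
        for terminator in ("\t|\t", "\t|"):
            if line.endswith(terminator):
                line = line[: -len(terminator)]
                break
        return line.split("\t|\t")
    return line.split("\t")


def parse_taxonomy(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one dict per taxon, keyed by column name."""
    columns = DEFAULT_COLUMNS
    taxa = []
    seen_first = False
    for line in lines:
        if not line.strip():
            continue
        fields = split_row(line)
        if not seen_first:
            seen_first = True
            if fields[0] == "uid":
                # First row is a header: use it for the column names.
                columns = fields
                continue
        taxa.append(dict(zip(columns, fields)))
    return taxa
```

Called as parse_taxonomy(open("mycobank/taxonomy.tsv", encoding="utf-8")), this yields dicts whose parent_uid is the empty string for a root.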
Optional additional columns:
- sourceinfo: a comma-separated list of source specifiers, each one either a URL or a CURIE. If a URL, it should be either a DOI in the form of a URL, or a link to some other source such as a database. URLs usually begin 'http://' or 'https://' and DOI URLs begin 'http://dx.doi.org/10.'. A CURIE is an abbreviated URI using a prefix drawn from a known set, e.g. ncbi:1234 is taxon 1234 in the NCBI taxonomy. Other prefixes include gbif:, if: (Index Fungorum), mb: (Mycobank). New prefixes can be added but this is a manual process, so please request explicitly.
- uniqueName: a human-readable string that is unique to this taxon, typically the taxon name if it is unique, or taxon name followed by "([rank] in [ancestor])" where rank is the taxon's rank and ancestor is an ancestor that is unique to this taxon (among the taxa that have the same name). If the field is empty, the taxon name is already unique in the taxonomy.
- flags: a comma-separated list of flags or markers. Usually these are generated by taxonomy synthesis and are used to decide whether a taxon is 'hidden' or not. For example, if there's an 'extinct' flag then it may be desirable to suppress the taxon in an application. See here.
Example (from OTT) (long line):
2829583 | 4037065 | Symbiodinium pilosum | species | ncbi:2952,gbif:3207147,irmng:10996086,irmng:11902428 | | unclassified_inherited,infraspecific |
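The sourceinfo and flags fields are both comma-separated lists, so they can be unpacked as sketched below. The function names and the tuple layout are illustrative assumptions. The code treats anything that does not start with 'http://' or 'https://' as a CURIE, and it does not check prefixes against the known set.

```python
from typing import List, Optional, Set, Tuple

SourceSpec = Tuple[str, Optional[str], str]


def parse_sourceinfo(value: str) -> List[SourceSpec]:
    """Split a sourceinfo field into (kind, prefix, reference) tuples."""
    specs: List[SourceSpec] = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith(("http://", "https://")):
            specs.append(("url", None, item))
        else:
            # CURIE: prefix before the first colon, local id after it.
            prefix, _, local_id = item.partition(":")
            specs.append(("curie", prefix, local_id))
    return specs


def parse_flags(value: str) -> Set[str]:
    """Return the set of flags in a flags field (empty set if none)."""
    return {flag.strip() for flag in value.split(",") if flag.strip()}
```

An application that wants to suppress extinct taxa could then test "extinct" in parse_flags(row["flags"]) for each row.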
Usually there are synonyms. These go into a second file, synonyms.tsv. This file must have a header row:
uid | name | type | rank |
The header is necessary because it designates the order of the columns, which can sometimes change. These are the four columns:
- uid - the id for the taxon (from the taxonomy file) that this synonym resolves to
- name - the synonymic taxon name
- type - typically will be 'synonym' but could be any of the NCBI synonym types (authority, common name, etc.)
- rank - currently ignored for taxonomy synthesis.
Example from NCBI:
89373 | Flexibacteraceae | synonym | |
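Because the column order in synonyms.tsv can change, a reader should take the order from the header rather than hard-coding it. The sketch below does that and groups synonyms by the taxon id they resolve to. The names split_row and parse_synonyms are hypothetical, and the grouping is an illustrative choice, not part of the format.

```python
from typing import Dict, Iterable, List


def split_row(line: str) -> List[str]:
    """Split one row, with or without the tab-bar-tab separators."""
    line = line.rstrip("\r\n")
    if "\t|" in line:
        for terminator in ("\t|\t", "\t|"):
            if line.endswith(terminator):
                line = line[: -len(terminator)]
                break
        return line.split("\t|\t")
    return line.split("\t")


def parse_synonyms(lines: Iterable[str]) -> Dict[str, List[Dict[str, str]]]:
    """Map each taxon uid to the list of synonym records pointing at it."""
    rows = iter(lines)
    header = split_row(next(rows, ""))
    if "uid" not in header or "name" not in header:
        raise ValueError("synonyms.tsv must start with a header row")
    synonyms: Dict[str, List[Dict[str, str]]] = {}
    for line in rows:
        if not line.strip():
            continue
        # Column positions come from the header, not from a fixed order.
        record = dict(zip(header, split_row(line)))
        synonyms.setdefault(record["uid"], []).append(record)
    return synonyms
```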
When two records are combined into one, as when a newly learned synonymy reveals that two names name the same taxon, one of the records' ids is kept and the other one is retired. The file forwards.tsv lists all such retired ids and, for each, the id of the record it was merged into.
The file format is a simple tab-separated file with two columns and a header row, e.g.
id replacement
5533177 886365
5533176 135041
5533174 195815
3878986 385523
5533172 135041
5533171 898152
5533170 5533295
2983263 2983269
4967339 2915806
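A consumer holding old ids can use this file to map them to current ones. The following is a sketch under assumptions: the function names are hypothetical, and following a replacement that has itself been retired (a chain) is a precaution added here, not something this page states can occur.

```python
from typing import Dict, Iterable


def parse_forwards(lines: Iterable[str]) -> Dict[str, str]:
    """Read forwards.tsv into a dict of retired id -> replacement id."""
    rows = iter(lines)
    next(rows, None)  # skip the "id<TAB>replacement" header row
    forwards: Dict[str, str] = {}
    for line in rows:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 2 and fields[0]:
            forwards[fields[0]] = fields[1]
    return forwards


def resolve(uid: str, forwards: Dict[str, str]) -> str:
    """Follow forwards until reaching an id that is not retired.

    The seen set stops the loop if the data ever contained a cycle;
    in that case the id where the cycle was detected is returned.
    """
    seen = set()
    while uid in forwards and uid not in seen:
        seen.add(uid)
        uid = forwards[uid]
    return uid
```

An id that does not appear in the file is returned unchanged, so resolve can be applied to every id without checking first.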
File version.txt contains just the OTT version number e.g. "2.9draft12"
Taxonomies that are the output of smasher also contain a number of files to assist research into decisions made by smasher.
- conflicts.tsv - gives details on conflicts between source taxonomies.
- log.tsv - traces how node mappings were chosen, for a selected subset of nodes (the entire trace for all nodes would be way too big).
- deprecated.tsv - lists ids that were retired in this version (but only for ids that occur as OTUs in phylesystem). Also lists ids that were not suppressed before, but are suppressed now.
- a few others
Overall metadata for the taxonomy is placed in a separate file. The metadata format is currently under development. Smasher generates this in JSON format as about.json, but this file is currently not used programmatically, and is in the process of being overhauled. When generating a taxonomy according to this format in external tools, for now it is best to simply write a markdown or plain text file called about.md (in the same directory as taxonomy.tsv and synonyms.tsv).
The metadata provided in the file should include the source of the taxonomy (article or database) as a URL and any other descriptive information that's available. The metadata serves not only to describe the taxonomy but also to explain how to check its correctness against its source, and how to make corrections and other improvements should the source be updated. When using information from changing sources (databases), the date or dates of retrieval should be recorded.
This page was originally part of the open tree wiki. It was transferred here on 2014-02-06 and has been maintained here since then.