KEGG databases description

Andrea Martinez Vernon edited this page Mar 12, 2018 · 20 revisions

MetQy relies mainly on four KEGG databases for analysing physiological functions: KEGG orthology, enzyme, module and genome. See below for brief descriptions.

Use of KEGG data

MetQy contains in-built KEGG data (downloaded 20/02/2018) which is hidden from the user, in compliance with the KEGG FTP licence. Users with FTP access can use the parsing functions to process the KEGG database files and to provide up-to-date information to the query functions.

MetQy includes the following data entries:

| Database | Number of entries | Notes |
| --- | --- | --- |
| KEGG orthology | 21,800 | |
| KEGG genome | 5,244 | Genomes without annotations were removed. Genomes prn (T04692) and con (T04096) are not included due to limitations of the Windows OS folder naming convention. |
| KEGG enzyme | 6,087 | |
| KEGG module | 780 | Modules M00611 to M00618 have been removed, as these have KEGG module definitions that involve other modules. |

KEGG orthology

Modified from http://www.kegg.jp/kegg/ko.html

KEGG orthology contains information on individual genes and their functional orthologs, where individual orthologs are identified by a unique K number.

KEGG genome

Modified from http://www.kegg.jp/kegg/genome.html

KEGG genome is a repository of complete genomes, each identified by a unique T number and by a 3-4 letter code (Kanehisa 2017). These genomes are annotated for their gene content using KEGG orthology (i.e. K numbers), and 99.9% of the annotated genomes come from the RefSeq and GenBank databases.
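As a rough illustration of the identifiers mentioned so far, the sketch below classifies a string by its shape. The patterns are assumptions made for this example (a prefix letter followed by five digits, and a lowercase code of three or four letters); they are not taken from MetQy, and `classify_kegg_id` is a hypothetical helper name.

```python
import re

# Assumed identifier shapes (not taken from the MetQy source): a prefix letter
# followed by five digits, or a lowercase organism code of 3-4 letters.
_PATTERNS = {
    "orthology (K number)": re.compile(r"K\d{5}"),
    "genome (T number)": re.compile(r"T\d{5}"),
    "module (M number)": re.compile(r"M\d{5}"),
    "genome (organism code)": re.compile(r"[a-z]{3,4}"),
}


def classify_kegg_id(identifier: str) -> str:
    """Return a label for the kind of KEGG identifier, or 'unknown'."""
    for label, pattern in _PATTERNS.items():
        # fullmatch requires the whole string to fit the pattern.
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


print(classify_kegg_id("K00640"))  # orthology (K number)
print(classify_kegg_id("T04692"))  # genome (T number)
print(classify_kegg_id("prn"))     # genome (organism code)
```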

| Kingdom | Number of genomes |
| --- | --- |
| Eukaryota | 434 |
| Bacteria | 4,548 |
| Archaea | 262 |

Enzyme Commission (EC) numbers have been mapped to KEGG orthologs (KOs). Hence, each KEGG genome has both KEGG orthologs (K numbers) and EC numbers associated with it.
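The KO-to-EC mapping can be pictured as a simple lookup. The sketch below is not MetQy's implementation: `KO_TO_EC` and `genome_ec_numbers` are hypothetical names, the two mapping entries are illustrative only, and a real table would be parsed from the KEGG files.

```python
from typing import Dict, Iterable, List, Set

# Illustrative KO -> EC entries only; consult KEGG for the authoritative mapping.
KO_TO_EC: Dict[str, List[str]] = {
    "K00640": ["2.3.1.30"],
    "K01738": ["2.5.1.47"],
}


def genome_ec_numbers(genome_kos: Iterable[str],
                      ko_to_ec: Dict[str, List[str]]) -> Set[str]:
    """Collect the EC numbers reachable from a genome's K numbers."""
    ec_numbers: Set[str] = set()
    for ko in genome_kos:
        # K numbers without an EC mapping contribute nothing.
        ec_numbers.update(ko_to_ec.get(ko, []))
    return ec_numbers


# K99999 is a made-up placeholder standing in for an ortholog with no EC number.
example_genome = ["K00640", "K01738", "K99999"]
print(sorted(genome_ec_numbers(example_genome, KO_TO_EC)))  # ['2.3.1.30', '2.5.1.47']
```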

EC numbers

The EC (Enzyme Commission) nomenclature consists of 4 numerical positions separated by periods (e.g. "1.10.3.9" or "6.5.1.3"). The first position refers to the enzyme class and can be one of 6:

  • EC 1 - Oxidoreductases
  • EC 2 - Transferases
  • EC 3 - Hydrolases
  • EC 4 - Lyases
  • EC 5 - Isomerases
  • EC 6 - Ligases

The remaining positions provide more information, depending on the enzyme class.

See http://www.enzyme-database.org/class.php to investigate the classes, subclasses and sub-subclasses.
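The structure of an EC number can be illustrated with a short parser. This is a minimal sketch that covers only the six classes listed above; `parse_ec_number` is a hypothetical helper, not a MetQy function.

```python
from typing import Dict, List, Union

# The six enzyme classes listed in this page, keyed by the first EC position.
EC_CLASSES: Dict[int, str] = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def parse_ec_number(ec: str) -> Dict[str, Union[str, List[str]]]:
    """Split an EC number into its four positions and look up the class name."""
    positions = ec.split(".")
    if len(positions) != 4:
        raise ValueError(f"expected 4 positions separated by periods, got {ec!r}")
    class_id = int(positions[0])  # raises ValueError if not numeric
    if class_id not in EC_CLASSES:
        raise ValueError(f"unknown enzyme class {class_id} in {ec!r}")
    # Positions stay as strings so they are returned exactly as written.
    return {"class": EC_CLASSES[class_id], "positions": positions}


print(parse_ec_number("1.10.3.9")["class"])  # Oxidoreductases
print(parse_ec_number("6.5.1.3")["class"])   # Ligases
```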

KEGG module

Modified from http://www.kegg.jp/kegg/module.html

Finally, KEGG module is an expert-curated database that groups K numbers into modules.

There are four types of modules:

  • pathway modules refer to functional units in KEGG metabolic pathway maps,
  • structural complexes refer to molecular machines or complexes,
  • functional sets describe other essential sets, and
  • signature modules are groups of genes associated with a phenotype.

Examples of modules are those for the TCA cycle, nitrogen assimilation or methane oxidation.

KEGG module definition

Each KEGG module is defined by a logical expression of the involved KEGG orthologs. For example, the cysteine biosynthesis module (M00021) has two blocks, each composed of the following genes:

  1. K00640
  2. K01738|K12339|K13034|K17069

Note that the pipe (|) denotes an OR operation. In other module definitions, the ampersand (&) denotes an AND operation.
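To make the OR and AND operators concrete, the sketch below evaluates a single block expression against the set of K numbers found in a genome. It assumes a flat expression without parentheses in which & binds more tightly than |; that precedence is an assumption of this example, and `block_is_complete` is a hypothetical helper rather than a MetQy function.

```python
from typing import Set


def block_is_complete(block: str, genome_kos: Set[str]) -> bool:
    """Evaluate one block expression using | (OR) and & (AND).

    Assumes no parentheses and that & binds more tightly than |.
    """
    # Split on OR first, so each alternative is a group of AND-ed K numbers.
    for alternative in block.split("|"):
        required = [ko.strip() for ko in alternative.split("&")]
        if all(ko in genome_kos for ko in required):
            return True
    return False


# Second block of M00021 from the text: any one of the four K numbers suffices.
print(block_is_complete("K01738|K12339|K13034|K17069", {"K12339"}))  # True

# A made-up expression (not a real module block) showing AND inside OR.
print(block_is_complete("K00001&K00002|K00003", {"K00001"}))  # False
```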

The block-based definition of modules makes it straightforward to evaluate whether a genome contains a given module, by assessing each module block in turn. Here, we define the module completeness fraction (mcf) for each module, calculated as the number of complete blocks divided by the total number of blocks. A genome that contains a complete gene set for a module has an mcf of 1 for that module.
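The mcf calculation can be sketched for the M00021 example as follows. This is a simplified illustration, not MetQy's implementation: each block is represented as a set of alternative K numbers (OR only), and `module_completeness_fraction` is a hypothetical function name.

```python
from typing import List, Set

# Blocks of M00021 as given in the text; K numbers within a block are alternatives.
M00021_BLOCKS: List[Set[str]] = [
    {"K00640"},
    {"K01738", "K12339", "K13034", "K17069"},
]


def module_completeness_fraction(blocks: List[Set[str]],
                                 genome_kos: Set[str]) -> float:
    """Return the number of complete blocks divided by the total number of blocks."""
    if not blocks:
        raise ValueError("a module needs at least one block")
    # An OR-only block is complete when the genome has any of its alternatives.
    complete = sum(1 for alternatives in blocks if alternatives & genome_kos)
    return complete / len(blocks)


print(module_completeness_fraction(M00021_BLOCKS, {"K00640", "K13034"}))  # 1.0
print(module_completeness_fraction(M00021_BLOCKS, {"K00640"}))            # 0.5
```

Blocks that use the AND operator would need a richer representation than a single set of alternatives.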

REFERENCES