KEGG databases description
MetQy relies mainly on four KEGG databases for analysing physiological functions: KEGG orthology, enzyme, module and genome. See below for brief descriptions.
MetQy contains in-built KEGG data (downloaded 20/02/2018) which is hidden from the user, in compliance with the KEGG FTP licence. Users with FTP access can use the parsing functions to process the KEGG database files and to provide up-to-date information to the query functions.
MetQy includes the following data entries:
DATABASE | NUMBER OF ENTRIES | NOTES |
---|---|---|
KEGG orthology | 21,800 | |
KEGG genome | 5,244 | Genomes without annotations were removed. Genomes prn (T04692) and con (T04096) are not included due to limitations of the Windows OS folder naming convention. |
KEGG enzyme | 6,087 | |
KEGG module | 780 | Modules M00611 to M00618 have been removed, as these have KEGG module definitions that involve other modules. |
Modified from http://www.kegg.jp/kegg/ko.html
KEGG orthology contains information on individual genes and their functional orthologs, where individual orthologs are identified by a unique K number.
Modified from http://www.kegg.jp/kegg/genome.html
KEGG genome is a repository of complete genomes identified by a unique T number and by a 3-4 letter code (Kanehisa 2017). These genomes are annotated for their gene content using KEGG orthology (i.e. K numbers), and 99.9% of the annotated genomes come from the RefSeq and GenBank databases.
KINGDOM | NUMBER OF GENOMES |
---|---|
Eukaryota | 434 |
Bacteria | 4,548 |
Archaea | 262 |
Enzyme Commission (EC) numbers have been mapped to KEGG orthologs (KOs). Hence, each KEGG genome has both KEGG orthologs (K numbers) and EC numbers associated with it.
The EC (Enzyme Commission) nomenclature consists of 4 numerical positions separated by periods (e.g. "1.10.3.9" or "6.5.1.3"). The first position refers to the enzyme class and can be one of 6:
- EC 1 - Oxidoreductases
- EC 2 - Transferases
- EC 3 - Hydrolases
- EC 4 - Lyases
- EC 5 - Isomerases
- EC 6 - Ligases
The remaining positions provide more information, depending on the enzyme class.
See http://www.enzyme-database.org/class.php to investigate the classes, subclasses and sub-subclasses.
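As an illustration of the nomenclature described above, the following minimal Python sketch splits an EC number into its four positions and looks up the enzyme class from the first one. The function name `ec_class` and the `EC_CLASSES` table are hypothetical helpers written for this example; they are not part of MetQy.

```python
# Hypothetical lookup table built from the six classes listed above.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number: str) -> str:
    """Return the enzyme class name for an EC number such as '1.10.3.9'."""
    positions = ec_number.strip().split(".")
    if len(positions) != 4:
        raise ValueError(f"expected 4 positions separated by periods: {ec_number!r}")
    first = int(positions[0])  # only the first position determines the class
    if first not in EC_CLASSES:
        raise ValueError(f"unknown enzyme class {first} in {ec_number!r}")
    return EC_CLASSES[first]


print(ec_class("1.10.3.9"))  # Oxidoreductases
print(ec_class("6.5.1.3"))   # Ligases
```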
Modified from http://www.kegg.jp/kegg/module.html
Finally, KEGG module is an expert-curated database that groups K numbers into modules.
There are four types of modules:
- pathway modules refer to functional units in KEGG metabolic pathway maps,
- structural complexes refer to molecular machines or complexes,
- functional sets describe other essential sets, and
- signature modules are groups of genes associated with a phenotype.
Examples of modules are those for the TCA cycle, nitrogen assimilation or methane oxidation.
Each KEGG module is defined by a logical expression of the involved KEGG orthologs. For example, the cysteine biosynthesis module (M00021) has two blocks, each composed of the following genes:
- K00640
- K01738|K12339|K13034|K17069
Note that the pipe (|) denotes an OR operation. In other examples, the ampersand (&) denotes an AND operation.
The block-based definition of modules makes it straightforward to evaluate whether a genome contains a given module: each block is assessed separately. Here, we define the module completeness fraction (mcf) for each module, which is calculated as the number of fully complete blocks divided by the total number of blocks. A genome with a complete gene set for a module has an mcf of 1.
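The mcf calculation can be sketched in a few lines of Python, using the two blocks of M00021 given above. This is an illustration only, not MetQy's implementation: the function names are hypothetical, and the sketch assumes that a block contains no parentheses and that `&` binds more tightly than `|`.

```python
from typing import List, Set


def block_is_complete(block: str, genome_kos: Set[str]) -> bool:
    """True if any '|' alternative has all of its '&' members in the genome.

    Assumption: no parentheses, and '&' binds more tightly than '|'.
    """
    return any(
        all(ko.strip() in genome_kos for ko in alternative.split("&"))
        for alternative in block.split("|")
    )


def module_completeness_fraction(blocks: List[str], genome_kos: Set[str]) -> float:
    """Number of fully complete blocks divided by the total number of blocks."""
    if not blocks:
        raise ValueError("a module needs at least one block")
    complete = sum(block_is_complete(block, genome_kos) for block in blocks)
    return complete / len(blocks)


# Blocks of the cysteine biosynthesis module (M00021) as listed above.
M00021_BLOCKS = ["K00640", "K01738|K12339|K13034|K17069"]

print(module_completeness_fraction(M00021_BLOCKS, {"K00640", "K12339"}))  # 1.0
print(module_completeness_fraction(M00021_BLOCKS, {"K00640"}))            # 0.5
print(module_completeness_fraction(M00021_BLOCKS, set()))                 # 0.0
```

A genome carrying K00640 and any one of the four alternatives in the second block completes both blocks, so its mcf is 1. A genome carrying only K00640 completes one of two blocks, so its mcf is 0.5.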
- Kanehisa, M. et al., 2017. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1), pp.D353–D361.
- http://www.kegg.jp/kegg/ko.html
- http://www.kegg.jp/kegg/genome.html
- http://www.kegg.jp/kegg/module.html