From 74672b33d5c395ce906baa2c0cdbb107710f012e Mon Sep 17 00:00:00 2001
From: Wei Shen Downloadrelease page.
taxonkit name2taxid
:taxonkit create-taxdump
:taxonkit taxid-changelog/create-taxdump
:create-taxdump
. #91taxonkit create-taxdump
has no problem, it's just the changelog might not be perfect.taxonkit lca
:-K/--keep-invalid
: print the query even if no single valid taxid left. #89taxonkit name2taxid
:taxonkit reformat
:Related projects:
$HOME/.taxonkit
list: List taxonomic subtrees (TaxIds) below given TaxIds
lineage: Query taxonomic lineage of given TaxIds
reformat: Reformat lineage in canonical ranks
name2taxid: Convert taxon names to TaxIds
filter: Filter TaxIds by taxonomic rank range
lca: Compute lowest common ancestor (LCA) for TaxIds
taxid-changelog: Create TaxId changelog from dump archives
profile2cami*: Convert metagenomic profile table to CAMI format
cami-filter*: Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile
create-taxdump*: Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV
Note: *New commands since the publication.
"},{"location":"#benchmark","title":"Benchmark","text":"Versions: ETE=3.1.2, taxopy=0.5.0 (faster since 0.6.0), TaxonKit=0.7.2.
"},{"location":"#dataset","title":"Dataset","text":"taxdump.tar.gz
: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
Copy names.dmp, nodes.dmp, delnodes.dmp and merged.dmp to the data directory: $HOME/.taxonkit, e.g., /home/shenwei/.taxonkit.
Another data directory can be set with the flag --data-dir or the environment variable TAXONKIT_DB.
All-in-one command:
wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \ntar -zxvf taxdump.tar.gz\n\nmkdir -p $HOME/.taxonkit\ncp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit\n
Update dataset: simply re-download the taxdump files, uncompress them, and overwrite the old ones.
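After copying or updating the files, it can help to confirm that all four dump files are really in place. The following is a minimal sketch, not part of TaxonKit: check_taxdump_dir is a hypothetical helper, and it assumes that TAXONKIT_DB, when set, should be checked instead of the default $HOME/.taxonkit.

```shell
#!/usr/bin/env bash
# check_taxdump_dir [DIR]: report which of the four required dump files are
# missing. Hypothetical helper for illustration; not a TaxonKit command.
check_taxdump_dir() {
    # Assumption: fall back to TAXONKIT_DB, then to the default data directory.
    local dir="${1:-${TAXONKIT_DB:-$HOME/.taxonkit}}"
    local missing=0 f
    for f in names.dmp nodes.dmp delnodes.dmp merged.dmp; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Check the directory given on the command line, or the default one.
check_taxdump_dir "$@" || echo "re-download taxdump.tar.gz and copy the files again"
```

The function returns a non-zero status when any file is absent, so it can guard a pipeline before any taxonkit command runs.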
"},{"location":"#installation","title":"Installation","text":"Go to Download Page for more download options and changelogs.
TaxonKit is implemented in the Go programming language. Executable binaries for most popular operating systems are freely available on the release page.
Just download the compressed executable for your operating system and uncompress it with the tar -zxvf *.tar.gz command or other tools. Then:
For Linux-like systems
If you have root privilege, simply copy it to /usr/local/bin:
sudo cp taxonkit /usr/local/bin/\n
Or copy it to anywhere in the environment variable PATH:
mkdir -p $HOME/bin/; cp taxonkit $HOME/bin/\n
For Windows, just copy taxonkit.exe to C:\\WINDOWS\\system32.
conda install -c bioconda taxonkit\n
"},{"location":"#method-3-install-via-homebrew-out-of-date","title":"Method 3: Install via homebrew (out of date)","text":"brew install brewsci/bio/taxonkit\n
"},{"location":"#method-4-compile-from-source-latest-stabledev-version","title":"Method 4: Compile from source (latest stable/dev version)","text":"Install go
wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz\n\ntar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/\n\n# or \n# echo \"export PATH=$PATH:$HOME/go/bin\" >> ~/.bashrc\n# source ~/.bashrc\nexport PATH=$PATH:$HOME/go/bin\n
Compile TaxonKit
# ------------- the latest stable version -------------\n\ngo get -v -u github.com/shenwei356/taxonkit/taxonkit\n\n# The executable binary file is located in:\n# ~/go/bin/taxonkit\n# You can also move it to anywhere in the $PATH\nmkdir -p $HOME/bin\ncp ~/go/bin/taxonkit $HOME/bin/\n\n# --------------- the development version --------------\n\ngit clone https://github.com/shenwei356/taxonkit\ncd taxonkit/taxonkit/\ngo build\n\n# The executable binary file is located in:\n# ./taxonkit\n# You can also move it to anywhere in the $PATH\nmkdir -p $HOME/bin\ncp ./taxonkit $HOME/bin/\n
Supported shell: bash|zsh|fish|powershell
Bash:
# generate completion shell\ntaxonkit genautocomplete --shell bash\n\n# configure if never did.\n# install bash-completion if the \"complete\" command is not found.\necho \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion\necho \"source ~/.bash_completion\" >> ~/.bashrc\n
Zsh:
# generate completion shell\ntaxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit\n\n# configure if never did\necho 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc\necho \"autoload -U compinit; compinit\" >> ~/.zshrc\n
fish:
taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish\n
"},{"location":"#citation","title":"Citation","text":"If you use TaxonKit in your work, please cite:
Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006
"},{"location":"#contact","title":"Contact","text":"Create an issue to report bugs, propose new functions or ask for help.
"},{"location":"#license","title":"License","text":"MIT License
"},{"location":"#starchart","title":"Starchart","text":""},{"location":"bioinf/","title":"Bioinf","text":""},{"location":"chinese-dev/","title":"Development notes","text":""},{"location":"chinese-dev/#_1","title":"Comparison of existing tools","text":"To look up the lineage of an organism at NCBI, you can query the NCBI Taxonomy website by TaxID or by name. For example, searching for Homo sapiens or 9606 returns the taxonomic information for human, along with the codon table, Entrez record statistics, and more.
The same information is also available through E-utilities (ftp), NCBI's official toolkit.
$ esearch -db taxonomy -query \"txid9606 [Organism]\" \\\n | efetch -format xml \\\n | xtract -pattern Lineage -element Lineage\n
Several other tools provide similar functionality, including:
Tool | Language | Data access | Usage | Notes
E-utilities | shell/Perl/C++ | remote web calls | command line | official programs; Taxonomy operations are only part of their functionality
BioPython | Python | remote web calls | scripts | wraps the Entrez interface; Taxonomy operations are only part of its functionality
ETE Toolkit | Python | local database | scripts/command line | Taxonomy operations are only part of its functionality
Taxize | R | remote web calls | scripts | ropensci; supports multiple databases; fairly feature-rich
Taxopy | Python | local data files | scripts/command line | basic functions only
Several aspects are usually considered when choosing a tool:
Initially, all I wanted was to get lineages in the \"kingdom, phylum, class, order, family, genus, species\" format, and I found no ready-made tool for it. Later a second need came up that existing tools could not meet either: retrieving all TaxIDs under a given taxon. So I started writing a tool to do this and gradually extended its functionality.
In fact, the simplest approach is to download the data files and parse them yourself.
"},{"location":"chinese-dev/#ncbi-taxonomy","title":"NCBI Taxonomy data files","text":"The NCBI Taxonomy database organizes the taxonomic relationships of all organisms as a rooted tree. This differs from a phylogenetic tree, which is organized by evolutionary relationships and may be unrooted.
NCBI Taxonomy publishes its data in two formats. The old one is named taxdump.tar.gz, is about 50 MB in size, and contains the following files.
nodes.dmp # [current version] node information\n # key fields: tax_id, parent tax_id, rank\nnames.dmp # [current version] name information\n # key fields: tax_id, name_txt\nmerged.dmp # [up to now] merged nodes\n # key fields: old_tax_id, new_tax_id\ndelnodes.dmp # [up to now] deleted nodes\n # key fields: tax_id\n\ncitations.dmp # citations\ndivision.dmp # divisions\ngencode.dmp # genetic codes\ngc.prt # genetic code tables\nreadme.txt # documentation\n
The first four files are the most important:
nodes.dmp mainly contains, for the current version, the unique identifier (taxonomic identifier; TaxId, taxid, or tax_id for short) of every taxon node, its rank, and the TaxID of its parent node.
names.dmp mainly contains all TaxIDs of the current version together with their scientific names and synonyms.
merged.dmp contains all TaxIDs merged up to the current version and the new TaxIDs they were merged into.
delnodes.dmp contains all TaxIDs deleted up to the current version.
In February 2018 a new format was introduced that additionally contains lineage, type, and host information. The file is named new_taxdump.tar.gz and is about 110 MB in size. Compared with the old format it has more files and more content, mainly because of the added lineage and type information. In fact, the lineage can be computed from nodes.dmp and names.dmp. The new format contains the following files:
nodes.dmp\nnames.dmp\nmerged.dmp\ndelnodes.dmp\n\nfullnamelineage.dmp\nTaxIDlineage.dmp\nrankedlineage.dmp\n\nhost.dmp\ntypeoftype.dmp\ntypematerial.dmp\n\ncitations.dmp\ndivision.dmp\ngencode.dmp\nreadme.txt\n
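As noted above, a lineage can be derived from nodes.dmp and names.dmp alone. The following awk sketch shows the idea under stated assumptions: fields are separated by tab, pipe, tab as in the NCBI dump files, the root node is its own parent, and the root name is left out of the result. lineage_of is a hypothetical helper and is not how TaxonKit itself is implemented.

```shell
#!/usr/bin/env bash
# lineage_of TAXID NODES NAMES: print the semicolon-separated lineage of TAXID,
# highest rank first. Hypothetical helper; a sketch, not TaxonKit's code.
lineage_of() {
    awk -v query="$1" -F '\t[|]\t' '
        # First file (nodes.dmp): remember the parent of every TaxID.
        FILENAME == ARGV[1] { parent[$1] = $2; next }
        # Second file (names.dmp): keep scientific names only.
        $4 ~ /^scientific name/ { name[$1] = $2 }
        END {
            if (!(query in parent)) exit 1   # unknown TaxID
            id = query
            lineage = name[id]
            # Walk towards the root; the root is the node that is its own parent.
            while (parent[id] != id) {
                id = parent[id]
                # Prepend every ancestor except the root itself.
                if (parent[id] != id) lineage = name[id] ";" lineage
            }
            print lineage
        }
    ' "$2" "$3"
}

# Usage (paths are examples):
#   lineage_of 9606 "$HOME/.taxonkit/nodes.dmp" "$HOME/.taxonkit/names.dmp"
```

Every call re-reads both files, so this is only practical for a handful of queries; a tool that answers many queries would load the files once.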
NCBI Taxonomy data is updated daily, and the data at the start of each month (usually the 1st) is archived in the taxdump_archive/ directory. Archives in the old format go back to August 2014, while those in the new format go back only to December 2018.
Most of us have painful memories of installing bioinformatics software. Before conda, many packages required installing dependencies by hand and then compiling. Differences in operating system, OS version, and compiler version made installation very difficult, all the more so when developers paid no attention to cross-platform support and portability.
Good software must consider the following aspects:
By the time I implemented TaxonKit, I had already started writing seqkit and csvtk and had gained some experience, so TaxonKit can largely meet all of the requirements above.
TaxonKit is written in Go, which makes it easy to compile executable binaries for different architectures (x86/arm) of Linux, Windows, macOS, and other operating systems. Because Go is a compiled language, runtime efficiency is also assured. Convenience of configuration and use, on the other hand, depends on the developer.
The taxonomy data comes from the public NCBI Taxonomy data. On the choice of data access: querying the official web interface over the network is too slow, so only local access was considered. There are several ways to access the data locally:
Testing finally showed that parsing the data files directly is also fast, about 5 seconds (on an NVMe SSD), which fully meets the requirements. No database has to be set up, configuration is easier, and updating the data is simple. Recently this was optimized further to about 2 seconds. Memory usage is around 500 MB to 1.5 GB, which is acceptable.
TaxonKit is a command-line tool that uses subcommands for its different functions. Most subcommands support standard input/output, which makes it easy to chain them in command-line pipelines.
"},{"location":"chinese-dev/#_2","title":"Limitations","text":"Researchers working on biodiversity will be familiar with the NCBI Taxonomy database, which holds the species name and taxonomic information for every sequence in all NCBI nucleotide and protein sequence databases. Most ecological studies describe species composition based on the NCBI Taxonomy database, although other databases such as GTDB are now coming into use as well.
The NCBI Taxonomy database began in 1991 and has been updated along with Entrez and other databases; the web version was launched in 1996. Its official address is https://www.ncbi.nlm.nih.gov/taxonomy , and the public data can be downloaded from https://ftp.ncbi.nih.gov/pub/taxonomy/ . The data is updated hourly, and at the start of each month an archive is generated and stored in the taxdump_archive directory, going back as far as August 2014.
"},{"location":"chinese/#taxonkit","title":"Using TaxonKit","text":"TaxonKit is a command-line tool written in Go, with statically compiled executable binaries for different architectures (x86-64/arm64) of Linux, Windows, and macOS. The release archive is under 3 MB. Besides GitHub hosting, a mirror in China is provided for download, and installation via conda and homebrew is also supported. Users only need to download and uncompress it; it works out of the box without configuration, apart from downloading the NCBI Taxonomy data files and uncompressing them into the designated directory.
Download the latest release for your system from https://github.com/shenwei356/taxonkit/releases , uncompress it, and add its location to the PATH environment variable. Alternatively, install it with conda:
conda install taxonkit -c bioconda -y\n# for tabular data processing, csvtk is recommended as more efficient\nconda install csvtk -c bioconda -y\n
Test data can be obtained by downloading the project archive from https://github.com/shenwei356/taxonkit , or by cloning the project with git clone; the example directory contains the test data.
git clone https://github.com/shenwei356/taxonkit\n
TaxonKit is a command-line tool that uses subcommands for its different functions. Most subcommands support standard input/output, so they can be chained in command-line pipelines and integrated easily into analysis workflows.
Subcommand: Function
list: List the TaxIDs of all taxa below the given TaxId
lineage: Get the complete lineage for a TaxID
reformat: Convert the complete lineage into a custom \"kingdom, phylum, class, order, family, genus, species, strain\" format
name2taxid: Convert taxon names to TaxIDs
filter: Filter TaxIDs by taxonomic rank range
lca: Compute the lowest common ancestor (LCA)
taxid-changelog: Track TaxID changes
version: Show version information and check for new versions
genautocomplete: Generate shell autocompletion scripts
Notes:
Output can be written to a file by redirection (>). Alternatively, -o or --out-file specifies the output file, and a .gz suffix on the output file name is recognized automatically to write gzip format.
Except for list and taxid-changelog, the subcommands lineage, reformat, name2taxid, filter, and lca can read input from standard input (stdin) or from positional arguments, i.e., arguments after the command that carry no flag, such as taxonkit lineage taxids.txt
The TaxID column is specified with -i or --taxid-field.
TaxonKit parses the NCBI Taxonomy data files directly (about 2 seconds), which makes configuration easier and data updates simple; memory usage is around 500 MB to 1.5 GB. Data download:
# the download sometimes fails; retry a few times, or open the link in a browser\nwget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \ntar -zxvf taxdump.tar.gz\n\n# copy the extracted files to .taxonkit/ in the home directory, the default data directory\nmkdir -p $HOME/.taxonkit\ncp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit\n
"},{"location":"chinese/#list-taxidtaxid","title":"list: list all TaxIDs in the subtree of the given TaxId","text":"taxonkit list
lists the TaxIDs of all taxa in the subtree of the taxon with the given TaxID, optionally showing names and ranks. This is similar to the web version of NCBI Taxonomy.
For example:
# using the genus Homo (9605) and Akkermansia (239934), a well-known gut bacterial genus, as examples\n$ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934\n9605 [genus] Homo\n 9606 [species] Homo sapiens\n 63221 [subspecies] Homo sapiens neanderthalensis\n 741158 [subspecies] Homo sapiens subsp. 'Denisova'\n 1425170 [species] Homo heidelbergensis\n 2665952 [no rank] environmental samples\n 2665953 [species] Homo sapiens environmental sample\n\n239934 [genus] Akkermansia\n 239935 [species] Akkermansia muciniphila\n 349741 [strain] Akkermansia muciniphila ATCC BAA-835\n 512293 [no rank] environmental samples\n 512294 [species] uncultured Akkermansia sp.\n 1131822 [species] uncultured Akkermansia sp. SMG25\n 1262691 [species] Akkermansia sp. CAG:344\n 1263034 [species] Akkermansia muciniphila CAG:154\n 1679444 [species] Akkermansia glycaniphila\n 2608915 [no rank] unclassified Akkermansia\n 1131336 [species] Akkermansia sp. KLE1605\n ...\n
The most widely used function of list is retrieving all TaxIDs under a given taxon (e.g., bacteria, viruses, or a particular genus). These can then be used to extract the corresponding nucleotide/protein sequences from NCBI nt/nr and build a taxon-specific BLAST database. The official site provides detailed steps: http://bioinf.shenwei.me/taxonkit/tutorial .
# TaxIDs of all bacteria\n$ taxonkit list --show-rank --show-name --ids 2 > /dev/null\n
"},{"location":"chinese/#lineage-taxid","title":"lineage: get the complete lineage for TaxIDs","text":"The most common taxonomy-related task is getting the complete lineage for a TaxID. TaxonKit quickly computes lineages for the list of TaxIDs in an input file, optionally reporting names, ranks, and the TaxIDs along the lineage.
Note that, because Taxonomy data is updated frequently, some TaxIDs may have been deleted or merged into other TaxIDs. TaxonKit detects this automatically and prints a warning; for merged TaxIDs, it computes the lineage using the new TaxID.
# using the test data in example\n$ head taxids.txt\n9606\n9913\n376619\n# look up the lineages of the listed TaxIDs; tee prints to the screen and writes to a file\n$ taxonkit lineage taxids.txt | tee lineage.txt \n19:22:13.077 [WARN] taxid 92489 was merged into 796334\n19:22:13.077 [WARN] taxid 1458427 was merged into 1458425\n19:22:13.077 [WARN] taxid 123124124 not found\n19:22:13.077 [WARN] taxid 3 was deleted\n9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens\n9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus\n376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS\n349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\n239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B\n11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\n1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y\n123124124\n3\n92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae\n1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei\n
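The warnings in the example above (was merged into, was deleted) correspond to entries in merged.dmp and delnodes.dmp. The lookup can be sketched with awk as follows. resolve_taxid is a hypothetical helper, not TaxonKit's implementation, and it assumes the NCBI dump layout with tab and pipe separators. It does not consult nodes.dmp, so a TaxID that was never assigned is returned unchanged rather than reported as not found.

```shell
#!/usr/bin/env bash
# resolve_taxid TAXID MERGED DELNODES: print the TaxID to use for lookups.
# Prints the new TaxID for a merged one, the word "deleted" for a deleted one,
# and the input itself otherwise. Hypothetical helper for illustration.
resolve_taxid() {
    awk -v query="$1" -F '[\t|]+' '
        # First file (merged.dmp): old_tax_id -> new_tax_id.
        FILENAME == ARGV[1] { merged[$1] = $2; next }
        # Second file (delnodes.dmp): one deleted tax_id per line.
        { deleted[$1] = 1 }
        END {
            if (query in merged)       print merged[query]
            else if (query in deleted) print "deleted"
            else                       print query
        }
    ' "$2" "$3"
}

# Usage (paths are examples):
#   resolve_taxid 92489 "$HOME/.taxonkit/merged.dmp" "$HOME/.taxonkit/delnodes.dmp"
```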
Compared with other software, ETE is faster when the number of queries is small, while TaxonKit is faster when it is large. TaxonKit's speed is stable across data sizes, at 2-3 seconds, with most of the time spent parsing the Taxonomy data files.
To list the TaxId, rank, and name of every taxon in a lineage, take SARS-CoV-2 as an example.
# extract the lineage of SARS-CoV-2 with lineage\n$ echo \"2697049\" \\\n | taxonkit lineage -t -R \\\n | sed \"s/\\t/\\n/g\"\n2697049\nViruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2\n10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049\nsuperkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank\n
"},{"location":"chinese/#reformat","title":"reformat: generate standard-rank taxonomic annotation","text":"Sometimes the complete lineage is not needed, because many of its levels are both rarely used and incomplete. Usually only kingdom, phylum, class, order, family, genus, and species are kept.
Note that not every species has all of these ranks, especially viruses and some environmental samples. TaxonKit can replace missing taxa with custom content, such as \"__\". More usefully, TaxonKit can also fill in missing ranks with information from higher-level taxa (-F/--fill-miss-rank), for example:
# a virus without a genus\n$ echo 1327037 | taxonkit lineage | taxonkit reformat | cut -f 1,3\n1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y\n\n# the -F flag fills in the genus using family information\n$ echo 1327037 | taxonkit lineage | taxonkit reformat -F | cut -f 1,3\n1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y\n
The output format can be restricted to a subset of ranks and also supports tabs (\"\\t\"); combined with csvtk, another tool by the author, it can produce nicely formatted results.
Other useful options:
-P/--add-prefix: add a prefix to each rank, e.g., s__species.
-t/--show-lineage-taxids: output the TaxIDs corresponding to the taxa.
-r/--miss-rank-repl: replacement for the names of taxa that have no corresponding rank.
-S/--pseudo-strain: for taxids below species whose rank is neither subspecies nor strain, use the name of the lowest-level taxon as the strain name.
For example:
$ echo -ne \"349741\\n1327037\"\\\n | taxonkit lineage \\\n | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" -F -P \\\n | csvtk cut -t -f -2 \\\n | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species \\\n | csvtk pretty -t\n\ntaxid kingdom phylum class order family genus species\n349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila\n1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y\n\n# transpose with csvtk for easier viewing on small screens\n$ echo -ne \"349741\\n1327037\"\\\n | taxonkit lineage \\\n | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" -F -P \\\n | csvtk cut -t -f -2 \\\n | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species \\\n | csvtk transpose -t \\\n | csvtk pretty -H -t\n\ntaxid 349741 1327037\nkingdom k__Bacteria k__Viruses\nphylum p__Verrucomicrobia p__Uroviricota\nclass c__Verrucomicrobiae c__Caudoviricetes\norder o__Verrucomicrobiales o__Caudovirales\nfamily f__Akkermansiaceae f__Siphoviridae\ngenus g__Akkermansia g__unclassified Siphoviridae genus\nspecies s__Akkermansia muciniphila s__Croceibacter phage P2559Y\n\n# down to strain level, using SARS-CoV-2 as an example\n$ echo -ne \"2697049\"\\\n | taxonkit lineage \\\n | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" -F -P -S \\\n | csvtk cut -t -f -2 \\\n | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species,strain \\\n | csvtk transpose -t \\\n | csvtk pretty -H -t\n\ntaxid 2697049\nkingdom k__Viruses\nphylum p__Pisuviricota\nclass c__Pisoniviricetes\norder o__Nidovirales\nfamily f__Coronaviridae\ngenus g__Betacoronavirus\nspecies s__Severe acute respiratory syndrome-related coronavirus\nstrain t__Severe acute respiratory syndrome coronavirus 2\n
"},{"location":"chinese/#name2taxid-taxid","title":"name2taxid: convert taxon names to TaxIDs","text":"Converting taxon names to TaxIDs is straightforward. The only caveat is that some TaxIds share the same name, for example:
# -i specifies the column, -r shows the rank, -L hides the lineage\n$ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L\nDrosophila 7215 genus\nDrosophila 32281 subgenus\nDrosophila 2081351 genus\n
Once you have the TaxIDs, you can pipe them straight into taxonkit for further processing, but remember to specify the TaxId column with -i.
filter filters TaxIDs by a range of taxonomic ranks; note that this is a range, not just one specific rank. For example, ranks at genus and below are selected with -L genus -E genus, which is similar to <= genus.
$ cat taxids2.txt \\\n | taxonkit filter -L genus -E genus \\\n | taxonkit lineage -r -n -L \\\n | csvtk -Ht cut -f 1,3,2 \\\n | csvtk pretty -H -t\n239934 genus Akkermansia\n239935 species Akkermansia muciniphila\n349741 strain Akkermansia muciniphila ATCC BAA-835\n
"},{"location":"chinese/#lca-lca","title":"lca: compute the lowest common ancestor (LCA)","text":"Take the genus Homo as an example:
$ taxonkit list --ids 9605 -nr --indent \" \" \n9605 [genus] Homo\n 9606 [species] Homo sapiens\n 63221 [subspecies] Homo sapiens neanderthalensis\n 741158 [subspecies] Homo sapiens subsp. 'Denisova'\n 1425170 [species] Homo heidelbergensis\n 2665952 [no rank] environmental samples\n 2665953 [species] Homo sapiens environmental sample\n
The TaxID separator can be set with -s/--separater
and defaults to \" \".
# Compute the lowest common ancestor of two taxa listed above: the Neanderthal subspecies (63221) and the Homo sapiens environmental sample (2665953)\n$ echo 63221 2665953 | taxonkit lca\n63221 2665953 9605\n\n# Another separator, with an accidental extra space\n$ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\"\na 63221,2665953\nb 63221, 741158\n\n$ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" \\\n | taxonkit lca -i 2 -s \",\"\na 63221,2665953 9605\nb 63221, 741158 9606\n
"},{"location":"chinese/#taxid-changelog-taxid","title":"TaxID changelog: tracking TaxID changes","text":"NCBI Taxonomy data is updated daily, and the data from the start of each month (usually the 1st) is archived in the taxdump_archive/
directory. Archives in the old format go back to August 2014, while those in the new format only go back to December 2018.
TaxonKit can track the monthly changes of every TaxID and write them to a CSV file that can be queried with command-line tools. The data and documentation are hosted separately at https://github.com/shenwei356/taxid-changelog .
Beyond simple additions, deletions and merges, the author classifies TaxID changes into finer types. The output format is as follows
# column comment\ntaxid # taxid\nversion # version / time of archive, e.g, 2019-07-01\nchange # change, values:\n # NEW newly added\n # REUSE_DEL deleted previously, now added back\n # REUSE_MER merged previously, now added back\n # DELETE deleted\n # MERGE merged into another TaxID\n # ABSORB other TaxIDs merged into this TaxID\n # CHANGE_NAME name changed\n # CHANGE_RANK rank changed\n # CHANGE_LIN_LIN lineage TaxIDs unchanged, but the lineage changed (names of some TaxIDs changed)\n # CHANGE_LIN_TAX lineage TaxIDs changed\n # CHANGE_LIN_LEN lineage length/depth changed\nchange-value # variable values for changes: \n # 1) new taxid for MERGE\n # 2) merged taxids for ABSORB\n # 3) empty for others\nname # scientific name\nrank # rank\nlineage # complete lineage of the taxid\nlineage-taxids # taxids of the lineage\n
The data file can be downloaded from the site above: taxid-changelog.csv.gz
is about 130 MB (2.2 GB uncompressed). Because it is in gzip format, it can be analyzed without decompressing it first. The examples below use pigz
instead of zcat
and gzip
for faster decompression.
Example 1: even a superkingdom can disappear. For instance, Viroids was deleted in May 2019. The author stumbled upon this by chance one day and, wanting to get to the bottom of it, developed this subcommand.
# Download\nwget -c https://github.com/shenwei356/taxid-changelog/releases/download/v2021.01/taxid-changelog.csv.gz\n# Install a multi-threaded decompression tool, or use gzip instead.\nconda install pigz\n\n$ pigz -cd taxid-changelog.csv.gz \\\n | csvtk grep -f rank -p superkingdom \\\n | csvtk pretty \ntaxid version change change-value name rank lineage lineage-taxids\n2 2014-08-01 NEW Bacteria superkingdom cellular organisms;Bacteria 131567;2\n2157 2014-08-01 NEW Archaea superkingdom cellular organisms;Archaea 131567;2157\n2759 2014-08-01 NEW Eukaryota superkingdom cellular organisms;Eukaryota 131567;2759\n10239 2014-08-01 NEW Viruses superkingdom Viruses 10239\n12884 2014-08-01 NEW Viroids superkingdom Viroids 12884\n12884 2019-05-01 DELETE Viroids superkingdom Viroids 12884\n
Example 2: SARS-CoV-2. The virus was added in February 2020, and its name, lineage and other information were changed in March and June. Queries are fast, too.
# Only some of the columns are shown in this example.\n$ time pigz -cd taxid-changelog.csv.gz \\\n | csvtk grep -f taxid -p 2697049 \\\n | csvtk cut -f version,change,name,rank \\\n | csvtk pretty\n\nversion change name rank\n2020-02-01 NEW Wuhan seafood market pneumonia virus species\n2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank\n2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank\n2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank\n2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank\n2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate\n2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank\n\nreal 0m7.644s\nuser 0m16.749s\nsys 0m3.985s\n
More interesting findings are described in taxid-changelog
"},{"location":"download/","title":"Download","text":"TaxonKit
is implemented in the Go programming language; executable binaries for most popular operating systems are freely available on the release page.
taxonkit name2taxid
:Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006
"},{"location":"download/#links","title":"Links","text":"Tips
taxonkit version
to check update !!!taxonkit genautocomplete
to update Bash completion !!!Download Page
TaxonKit
is implemented in the Go programming language; executable binaries for most popular operating systems are freely available on the release page.
Download the compressed executable file for your operating system and uncompress it with tar -zxvf *.tar.gz
or another tool. Then:
For Linux-like systems
If you have root privileges, simply copy it to /usr/local/bin
:
sudo cp taxonkit /usr/local/bin/\n
Or copy it to any directory listed in the environment variable PATH
:
mkdir -p $HOME/bin/; cp taxonkit $HOME/bin/\n
For Windows, just copy taxonkit.exe
to C:\\WINDOWS\\system32
.
conda install -c bioconda taxonkit\n
"},{"location":"download/#method-3-install-via-homebrew-may-not-the-lastest-version","title":"Method 3: Install via homebrew (may not be the latest version)","text":"brew install brewsci/bio/taxonkit\n
"},{"location":"download/#method-4-compile-from-source-latest-stabledev-version","title":"Method 4: Compile from source (latest stable/dev version)","text":"Install go
wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz\n\ntar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/\n\n# or \n# echo \"export PATH=$PATH:$HOME/go/bin\" >> ~/.bashrc\n# source ~/.bashrc\nexport PATH=$PATH:$HOME/go/bin\n
Compile TaxonKit
# ------------- the latest stable version -------------\n\ngo get -v -u github.com/shenwei356/taxonkit/taxonkit\n\n# The executable binary file is located in:\n# ~/go/bin/taxonkit\n# You can also move it to anywhere in the $PATH\nmkdir -p $HOME/bin\ncp ~/go/bin/taxonkit $HOME/bin/\n\n# --------------- the development version --------------\n\ngit clone https://github.com/shenwei356/taxonkit\ncd taxonkit/taxonkit/\ngo build\n\n# The executable binary file is located in:\n# ./taxonkit\n# You can also move it to anywhere in the $PATH\nmkdir -p $HOME/bin\ncp ./taxonkit $HOME/bin/\n
Supported shell: bash|zsh|fish|powershell
Bash:
# generate completion shell\ntaxonkit genautocomplete --shell bash\n\n# configure if never did.\n# install bash-completion if the \"complete\" command is not found.\necho \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion\necho \"source ~/.bash_completion\" >> ~/.bashrc\n
Zsh:
# generate completion shell\ntaxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit\n\n# configure if never did\necho 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc\necho \"autoload -U compinit; compinit\" >> ~/.zshrc\n
fish:
taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish\n
"},{"location":"download/#dataset","title":"Dataset","text":"taxdump.tar.gz
: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz names.dmp
, nodes.dmp
, delnodes.dmp
and merged.dmp
to data directory: $HOME/.taxonkit
, e.g., /home/shenwei/.taxonkit
,--data-dir
, or environment variable TAXONKIT_DB
.All-in-one command:
wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \ntar -zxvf taxdump.tar.gz\n\nmkdir -p $HOME/.taxonkit\ncp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit\n
Update dataset: Simply re-download the taxdump files, uncompress them, and overwrite the old ones.
"},{"location":"download/#release-history","title":"Release history","text":"taxonkit reformat
:-T/--trim
also does not add the prefix for missing ranks lower than the current rank. #82-s/--miss-rank-repl-suffix
to set the suffix for estimated taxon names. #85taxonkit filter
:taxonkit lca
:-b/--buffer-size
to set the size of the line buffer. #75--separater
-> --separator
, the former is still available for backward compatibility.taxonkit reformat
:taxonkit taxid-changelog
:taxonkit reformat
:-S/--pseudo-strain
does not require -F/--fill-miss-rank
now.{t}
, {S}
, and T
outputs nothing when using -S/--pseudo-strain
.taxonkit create-taxdump
:int32
instead of uint32
, as BLAST and DIAMOND do. #70taxonkit list
:taxonkit
:TAXONKIT_DB
is set, explicitly setting --data-dir
will override the value of TAXONKIT_DB
.taxonkit reformat
:{K}
for rank kingdom
. #64-I/--taxid-field
.taxonkit create-taxdump
: -A/--field-accession
and no rank names given: the colname of the accession column would be treated as one of the ranks, which messed up all the ranks.--field-accession-re
which wrongly remove prefix like Sp_
. #65taxonkit list
:taxonkit create-taxdump
: taxonkit create-taxdump
: fix bug of missing Class rank, contributed by @apcamargo. The flag --gtdb
did not take effect. #57taxonkit create-taxdump
: Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV. #56taxonkit cami2-filter
: fix option --show-rank
which did not work in v0.10.0.taxonkit cami2-filter
: Remove taxa of given TaxIds and their descendants in CAMI metagenomic profiletaxonkit reformat
: fix panic for deleted taxid using -F/--fill-miss-rank
. #55taxonkit profile2cami
: converting metagenomic profile table to CAMI formattaxonkit reformat
:-I/--taxid-field
.taxonkit lca
:taxonkit genautocomplete
:taxonkit lineage
:-R/--show-lineage-ranks
for appending ranks of all levels.taxonkit filter
:-E/--equal-to
supports multiple values.-n/--save-predictable-norank
: do not discard some special ranks without order when using -L, where rank of the closest higher node is still lower than rank cutoff.taxonkit reformat
:{t}
for subspecies/strain
, {T}
for strain
. Thanks @wqssf102 for feedback.-S/--pseudo-strain
for using the node with the lowest rank as the strain name, only if its rank is lower than \"species\" and is neither \"subspecies\" nor \"strain\". taxonkit filter
: --list-order
or --list-ranks
. #36-N/--discard-noranks
to explicitly filter out \"no rank\", \"clade\". #37taxonkit
: 2-3X faster taxonomy data loading.taxonkit filter
: filtering TaxIds by taxonomic rank range. #32taxonkit lca
: Computing lowest common ancestor (LCA) for TaxIds.taxonkit reformat
:-P/--add-prefix
: add prefixes for all ranks, single prefix for a rank is defined by flag --prefix-X
, where X
may be k
, p
, c
, o
, f
, s
, S
.-T/--trim
: do not fill missing rank lower than current rank.taxonkit list
: do not duplicate root node.taxonkit reformat -F
: fix taxids of abbreviated lineage containing names shared by different taxids. #35taxonkit lineage
: -n/--show-name
for appending scientific name.-L/--no-lineage
for hiding the lineage; this is for quickly retrieving names and/or ranks.taxonkit reformat
:-F/--fill-miss-rank
.taxonkit list
:taxonkit name2taxid
: new flag -s/--sci-name
for limiting to searching scientific names. #29taxonkit version
: make checking update optionaltaxonkit
: requiring delnodes.dmp and merged.dmp.taxonkit lineage
: detect deleted and merged taxids now. #19taxonkit list/name2taxid
: add short flag -r
for --show-rank
, -n
for --show-name
.taxonkit taxid-changelog
: rewrite logic, fix bug and add more change typestaxonkit taxid-changelog
: change output of ABSORB
, do not merged into one record for changes in different versions.taxonkit taxid-changelog
: name
and rank
.taxonkit taxid-changelog
: for creating taxid changelog from dump archive--line-buffered
to disable output buffer. #11--names-file
and --nodes-file
with --data-dir
, also support environment variable TAXONKIT_DB
. #17taxonkit reformat
: detects lineages containing unofficial taxon name and won't show panic message.taxonkit name2taxid
: supports synonym names. #9taxonkit lineage
: add flag -r/--show-rank
to print rank at another new column.taxonkit reformat
:-F/--fill-miss-rank
to estimate and fill missing rank with original lineage information\\t
, \\n
, #5taxonkit lineage
:1
#7-d/--delimiter
.taxonkit list
: fix bug of no output for leaf nodes of the taxonomic tree. #4genautocomplete
to generate shell autocompletion script!name2taxid
to query taxid by taxon scientific name.lineage
, reformat
: changed flags and default operations, check the usage.taxonkit lineage
, add an extra column of lineage in Taxid. #3. e.g.,taxonkit reformat
: supports reading stdin from output of taxonkit lineage
, reformatted lineages are appended to input data.-f/--formated-rank
from taxonkit lineage
, using taxonkit reformat
can achieve the same result.--fill
for taxonkit reformat
, which estimates and fills missing rank with original lineage informationtaxonkit reformat
which reformats full lineage to custom formattaxonkit lineage
, users can query lineage of given taxon IDs from filetaxonkit list
, users can choose output in readable JSON format by flag --json
so the taxonomy tree can be collapsed and expanded in a modern text editor.Show lineage detail of a TaxId. The command below works on Windows with the help of csvtk.
$ echo \"2697049\" \\\n | taxonkit lineage -t \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -Ht\n\n10239 superkingdom Viruses\n2559587 clade Riboviria\n2732396 kingdom Orthornavirae\n2732408 phylum Pisuviricota\n2732506 class Pisoniviricetes\n76804 order Nidovirales\n2499399 suborder Cornidovirineae\n11118 family Coronaviridae\n2501931 subfamily Orthocoronavirinae\n694002 genus Betacoronavirus\n2509511 subgenus Sarbecovirus\n694009 species Severe acute respiratory syndrome-related coronavirus\n2697049 no rank Severe acute respiratory syndrome coronavirus 2\n
Example data.
$ cat taxids3.txt\n376619\n349741\n239935\n314101\n11932\n1327037\n83333\n1408252\n2605619\n2697049\n
Format to 7-level ranks (\"superkingdom phylum class order family genus species\").
$ cat taxids3.txt \\\n | taxonkit reformat -I 1\n\n376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis\n349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B\n11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\n1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y\n83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli\n1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli\n2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli\n2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related coronavirus\n
Format to 8-level ranks (\"superkingdom phylum class order family genus species subspecies/strain\").
$ cat taxids3.txt \\\n | taxonkit reformat -I 1 -f \"{k};{p};{c};{o};{f};{g};{s};{t}\"\n\n376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica LVS\n349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\n239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;\n314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B;\n11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle;\n1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y;\n83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli K-12\n1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli R178\n2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;\n2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related coronavirus;\n
Replace missing ranks with Unassigned
and output tab-delimited format.
$ cat taxids3.txt \\\n | taxonkit reformat -I 1 -r \"Unassigned\" -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\\n | csvtk pretty -H -t\n\n376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis Francisella tularensis subsp. holarctica LVS\n349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835\n239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Unassigned\n314101 Bacteria Unassigned Unassigned Unassigned Unassigned Unassigned uncultured murine large bowel bacterium BAC 54B Unassigned\n11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle Unassigned\n1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Unassigned Croceibacter phage P2559Y Unassigned\n83333 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli K-12\n1408252 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli R178\n2605619 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Unassigned\n2697049 Viruses Pisuviricota Pisoniviricetes Nidovirales Coronaviridae Betacoronavirus Severe acute respiratory syndrome-related coronavirus Unassigned\n
Fill missing ranks and add prefixes.
$ cat taxids3.txt \\\n | taxonkit reformat -I 1 -F -P -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\\n | csvtk pretty -H -t\n\n376619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Thiotrichales f__Francisellaceae g__Francisella s__Francisella tularensis t__Francisella tularensis subsp. holarctica LVS\n349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__Akkermansia muciniphila ATCC BAA-835\n239935 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__unclassified Akkermansia muciniphila subspecies/strain\n314101 k__Bacteria p__unclassified Bacteria phylum c__unclassified Bacteria class o__unclassified Bacteria order f__unclassified Bacteria family g__unclassified Bacteria genus s__uncultured murine large bowel bacterium BAC 54B t__unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain\n11932 k__Viruses p__Artverviricota c__Revtraviricetes o__Ortervirales f__Retroviridae g__Intracisternal A-particles s__Mouse Intracisternal A-particle t__unclassified Mouse Intracisternal A-particle subspecies/strain\n1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y t__unclassified Croceibacter phage P2559Y subspecies/strain\n83333 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli K-12\n1408252 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli R178\n2605619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__unclassified Escherichia coli subspecies/strain\n2697049 k__Viruses p__Pisuviricota 
c__Pisoniviricetes o__Nidovirales f__Coronaviridae g__Betacoronavirus s__Severe acute respiratory syndrome-related coronavirus t__unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain\n
When there are no nodes of rank \"subspecies\" or \"strain\", we can switch on -S/--pseudo-strain
to use the node with the lowest rank as the subspecies/strain name, provided its rank is lower than \"species\".
$ cat taxids3.txt \\\n | taxonkit lineage -r -L \\\n | taxonkit reformat -I 1 -F -S -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\\n | cut -f 1,2,9,10 \\\n | csvtk add-header -t -n \"taxid,rank,species,strain\" \\\n | csvtk pretty -t\n\ntaxid rank species strain\n------- ---------- ----------------------------------------------------- ------------------------------------------------------------------------------\n376619 strain Francisella tularensis Francisella tularensis subsp. holarctica LVS\n349741 strain Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835\n239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain\n314101 species uncultured murine large bowel bacterium BAC 54B unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain\n11932 species Mouse Intracisternal A-particle unclassified Mouse Intracisternal A-particle subspecies/strain\n1327037 species Croceibacter phage P2559Y unclassified Croceibacter phage P2559Y subspecies/strain\n83333 strain Escherichia coli Escherichia coli K-12\n1408252 subspecies Escherichia coli Escherichia coli R178\n2605619 no rank Escherichia coli Escherichia coli O16:H48\n2697049 no rank Severe acute respiratory syndrome-related coronavirus Severe acute respiratory syndrome coronavirus 2\n
List eight-level lineage for all TaxIds of rank lower than or equal to species, including some nodes with \"no rank\". When filtering with -L/--lower-than
, you can use -n/--save-predictable-norank
to keep some special ranks without order, where the rank of the closest higher node is still lower than the rank cutoff.
$ time taxonkit list --ids 1 \\\n | taxonkit filter -L species -E species -R -N -n \\\n | taxonkit lineage -n -r -L \\\n | taxonkit reformat -I 1 -F -S -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\\n | csvtk cut -Ht -l -f 1,3,2,1,4-11 \\\n | csvtk add-header -t -n \"taxid,rank,name,lineage,kingdom,phylum,class,order,family,genus,species,strain\" \\\n | pigz -c > result.tsv.gz\n\nreal 0m25.167s\nuser 2m14.809s\nsys 0m7.197s\n\n$ pigz -cd result.tsv.gz \\\n | csvtk grep -t -f taxid -p 2697049 \\\n | csvtk transpose -t \\\n | csvtk pretty -H -t\n\ntaxid 2697049\nrank no rank\nname Severe acute respiratory syndrome coronavirus 2\nlineage Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2\nkingdom Viruses\nphylum Pisuviricota\nclass Pisoniviricetes\norder Nidovirales\nfamily Coronaviridae\ngenus Betacoronavirus\nspecies Severe acute respiratory syndrome-related coronavirus\nstrain Severe acute respiratory syndrome coronavirus 2\n
"},{"location":"tutorial/#mapping-old-species-names-to-new-ones","title":"Mapping old species names to new ones","text":"Some species names in papers or websites might have changed. We can try querying their TaxIds via the old names and then retrieve the new ones.
cat example/changed_species_names.txt\nLactobacillus fermentum\nMycoplasma gallinaceum\n\n# TaxonKit >= v0.15.1\ncat example/changed_species_names.txt \\\n | taxonkit name2taxid \\\n | taxonkit lineage -i 2 -n \\\n | cut -f 1,4\n\nLactobacillus fermentum Limosilactobacillus fermentum\nMycoplasma gallinaceum\n
Whoops, there's no information for Mycoplasma gallinaceum
. Then we check the taxid-changelog.
zcat taxonkit/taxid-changelog.csv.gz \\\n | csvtk grep -f name -P example/changed_species_names.txt \\\n | csvtk cut -f taxid,version,change,name,rank \\\n | csvtk pretty\n\ntaxid version change name rank\n----- ---------- -------------- ----------------------- -------\n1613 2013-02-21 NEW Lactobacillus fermentum species\n1613 2016-03-01 ABSORB Lactobacillus fermentum species\n1613 2016-03-01 CHANGE_LIN_LEN Lactobacillus fermentum species\n29556 2013-02-21 NEW Mycoplasma gallinaceum species\n29556 2016-03-01 CHANGE_LIN_LEN Mycoplasma gallinaceum species\n29556 2021-01-01 CHANGE_NAME Mycoplasma gallinaceum species\n29556 2021-01-01 CHANGE_LIN_LIN Mycoplasma gallinaceum species\n
We can see that the names have changed. The full change history can be queried with the TaxId, e.g.,
taxid version change change-value name rank\n----- ---------- -------------- ------------ ------------------------- -------\n29556 2013-02-21 NEW Mycoplasma gallinaceum species\n29556 2016-03-01 CHANGE_LIN_LEN Mycoplasma gallinaceum species\n29556 2020-09-01 CHANGE_NAME Mycoplasmopsis gallinacea species\n29556 2020-09-01 CHANGE_LIN_TAX Mycoplasmopsis gallinacea species\n29556 2021-01-01 CHANGE_NAME Mycoplasma gallinaceum species\n29556 2021-01-01 CHANGE_LIN_LIN Mycoplasma gallinaceum species\n29556 2021-09-01 CHANGE_NAME Mycoplasmopsis gallinacea species\n29556 2021-09-01 CHANGE_LIN_LIN Mycoplasmopsis gallinacea species\n29556 2023-03-01 CHANGE_LIN_LIN Mycoplasmopsis gallinacea species\n
Then we just use their TaxIds to retrieve the new names. The final commands are:
zcat taxonkit/taxid-changelog.csv.gz \\\n | csvtk grep -f name -P example/changed_species_names.txt \\\n | csvtk uniq -f taxid \\\n | csvtk cut -f name,taxid \\\n | csvtk del-header \\\n | csvtk csv2tab \\\n | taxonkit lineage -i 2 -n \\\n | cut -f 1,4\n\nLactobacillus fermentum Limosilactobacillus fermentum\nMycoplasma gallinaceum Mycoplasmopsis gallinacea\n
"},{"location":"tutorial/#add-taxonomy-information-to-blast-result","title":"Add taxonomy information to BLAST result","text":"A BLAST result file blast_result.txt
, where the second column is the accession of matched sequences.
head -n 5 blast_result.txt | csvtk pretty -Ht\n\nxxxxxxxxxxxxxxxxxxxxx/2/ccs XM_013496560.1 78.745 494 99 3 6361 6851 895 1385 6.53e-83 326 \nxxxxxxxxxxxxxxxxxxxxx/2/ccs XM_013496560.1 78.543 494 100 3 17168 17658 895 1385 3.04e-81 320 \nxxxxxxxxxxxxxxxxxxxxx/76/ccs LR699760.1 100.000 37 0 0 8139 8175 14507874 14507910 4.27e-06 69.4\nxxxxxxxxxxxxxxxxxxxxx/80/ccs HG994975.1 80.556 540 81 16 8269 8798 3821290 3820765 8.65e-104 394 \nxxxxxxxxxxxxxxxxxxxxx/80/ccs HG994975.1 77.805 410 89 2 9590 9998 3819858 3819450 5.51e-61 252\n
Prepare acc2taxid.tsv
file from nucl_gb.accession2taxid.gz file. Here we use the accession
column instead of accession.version
column, in case of unmatched versions for some accessions.
zcat nucl_gb.accession2taxid.gz | cut -f 1,3 | gzip -c > acc2taxid.tsv.gz\n
Extract needed acc2taxid subset to reduce memory usage.
# extract accession and deduplicate and remove versions\ncut -f 2 blast_result.txt | csvtk uniq -Ht | csvtk replace -Ht -p '\\.\\d+$' > acc.txt\n\n# grep from acc2taxid.tsv.gz\nzcat acc2taxid.tsv.gz | grep -w -f acc.txt > hit.acc2taxid.tsv\n
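As a hedged illustration of the version-stripping step above, the sketch below does the same thing with plain sed instead of csvtk replace. The function name strip_version is hypothetical (it is not part of TaxonKit or csvtk), and it assumes that a version suffix is always a dot followed by digits at the end of the accession.

```shell
# strip_version: hypothetical helper, not part of TaxonKit or csvtk.
# Reads accessions on stdin, one per line, and removes a trailing
# version suffix such as .1 (assumed form: a dot followed by digits).
strip_version() {
  sed -E 's/[.][0-9]+$//'
}

# Sample accession taken from the BLAST result shown earlier.
echo XM_013496560.1 | strip_version
```

Accessions that carry no version suffix pass through unchanged, so the helper can be applied to mixed input.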
Prepare taxid2name.tsv
, species names are retrieved for the taxids.
cut -f 2 hit.acc2taxid.tsv | taxonkit reformat -f '{s}' -I 1 > hit.taxid2name.tsv\n
Append taxids according to the accessions, and append species names for the taxids.
csvtk add-header -t --names \"qseqid,sseqid,pident,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore\" blast_result.txt \\\n | csvtk mutate -t -f sseqid -n taxid \\\n | csvtk replace -t -k hit.acc2taxid.tsv -f taxid -p '(.+)\\.\\d+' -r '{kv}' \\\n | csvtk mutate -t -f taxid -n species \\\n | csvtk replace -t -k hit.taxid2name.tsv -f species -p '(.+)' -r '{kv}' \\\n | head -n 5 | csvtk pretty -t\n\nqseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore taxid species \n---------------------------- -------------- ------- ------ -------- ------- ------ ----- -------- -------- --------- -------- ----- --------------------\nxxxxxxxxxxxxxxxxxxxxx/2/ccs XM_013496560.1 78.745 494 99 3 6361 6851 895 1385 6.53e-83 326 44415 Eimeria mitis \nxxxxxxxxxxxxxxxxxxxxx/2/ccs XM_013496560.1 78.543 494 100 3 17168 17658 895 1385 3.04e-81 320 44415 Eimeria mitis \nxxxxxxxxxxxxxxxxxxxxx/76/ccs LR699760.1 100.000 37 0 0 8139 8175 14507874 14507910 4.27e-06 69.4 3702 Arabidopsis thaliana\nxxxxxxxxxxxxxxxxxxxxx/80/ccs HG994975.1 80.556 540 81 16 8269 8798 3821290 3820765 8.65e-104 394 5802 Eimeria tenella\n
"},{"location":"tutorial/#parsing-krakenbracken-result","title":"Parsing kraken/bracken result","text":"Example Data
Run Kraken2 and Bracken
KRAKEN_DB=/home/shenwei/ws/db/kraken/k2_pluspf\nTHREADS=16\n\nCLASSIFICATION_LVL=S\nTHRESHOLD=10\n\nREAD_LEN=100\nSAMPLE=SRS014459-Stool.fasta.gz\n\nBRACKEN_OUTPUT_FILE=$SAMPLE\n\nkraken2 --db ${KRAKEN_DB} --threads ${THREADS} -report ${SAMPLE}.kreport $SAMPLE > ${SAMPLE}.kraken\n\nest_abundance.py -i ${SAMPLE}.kreport -k ${KRAKEN_DB}/database${READ_LEN}mers.kmer_distrib \\\n -l ${CLASSIFICATION_LVL} -t ${THRESHOLD} -o ${BRACKEN_OUTPUT_FILE}.bracken\n
Original format
$ head -n 15 SRS014459-Stool.fasta.gz_bracken_species.kreport\n100.00 9491 0 R 1 root\n99.85 9477 0 R1 131567 cellular organisms\n99.85 9477 0 D 2 Bacteria\n66.08 6271 0 D1 1783270 FCB group\n66.08 6271 0 D2 68336 Bacteroidetes/Chlorobi group\n66.08 6271 0 P 976 Bacteroidetes\n66.08 6271 0 C 200643 Bacteroidia\n66.08 6271 0 O 171549 Bacteroidales\n34.45 3270 0 F 815 Bacteroidaceae\n34.45 3270 0 G 816 Bacteroides\n10.43 990 990 S 246787 Bacteroides cellulosilyticus\n7.98 757 757 S 28116 Bacteroides ovatus\n3.10 293 0 G1 2646097 unclassified Bacteroides\n1.06 100 100 S 2755405 Bacteroides sp. CACC 737\n0.49 46 46 S 2650157 Bacteroides sp. HF-5287\n
Converting to MetaPhlAn2 format. (Similar to kreport2mpa.py)
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\\n | csvtk cut -Ht -f 5,1 \\\n | taxonkit lineage \\\n | taxonkit reformat -i 3 -P -f \"{k}|{p}|{c}|{o}|{f}|{g}|{s}\" \\\n | csvtk cut -Ht -f 4,2 \\\n | csvtk replace -Ht -p \"(\\|[kpcofgs]__)+$\" \\\n | csvtk replace -Ht -p \"\\|[kpcofgs]__\\|\" -r \"|\" \\\n | csvtk uniq -Ht \\\n | csvtk grep -Ht -p k__ -v \\\n > SRS014459-Stool.fasta.gz_bracken_species.kreport.format\n\n$ head -n 10 SRS014459-Stool.fasta.gz_bracken_species.kreport.format\n\nk__Bacteria 99.85\nk__Bacteria|p__Bacteroidetes 66.08\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia 66.08\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales 66.08\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae 34.45\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides 34.45\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides cellulosilyticus 10.43\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides ovatus 7.98\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. CACC 737 1.06\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. HF-5287 0.49\n
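The two `csvtk replace` steps in the pipeline above remove ranks that were left empty, i.e. a bare prefix such as `g__` with no name after it. As an illustration of that clean-up only, here is a minimal sketch using plain awk. The function name `trim_empty_ranks` is hypothetical and is not part of TaxonKit or csvtk; the sketch assumes tab-separated lines with the lineage in the first column.

```shell
# Hypothetical helper: remove empty rank placeholders from column 1
# of tab-separated "lineage<TAB>abundance" lines.
trim_empty_ranks() {
  awk -F'\t' -v OFS='\t' '{
    # drop empty ranks at the end of the lineage, e.g. "|g__|s__"
    sub(/(\|[kpcofgs]__)+$/, "", $1)
    # drop empty ranks in the middle, repeating until none are left
    while ($1 ~ /\|[kpcofgs]__\|/) sub(/\|[kpcofgs]__\|/, "|", $1)
    print
  }'
}

# toy input, not taken from the example data
printf 'k__Bacteria|p__|c__Bacteroidia|o__|f__|g__|s__\t1.00\n' | trim_empty_ranks
```

The loop repeats the substitution so that several adjacent empty ranks in the middle of a lineage are also removed.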
Converting to Qiime format
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\\n | csvtk cut -Ht -f 5,1 \\\n | taxonkit lineage \\\n | taxonkit reformat -i 3 -P -f \"{k}; {p}; {c}; {o}; {f}; {g}; {s}\" \\\n | csvtk cut -Ht -f 4,2 \\\n | csvtk replace -Ht -p \"(; [kpcofgs]__)+$\" \\\n | csvtk replace -Ht -p \"; [kpcofgs]__; \" -r \"; \" \\\n | csvtk uniq -Ht \\\n | csvtk grep -Ht -p k__ -v \\\n | head -n 10\n\nk__Bacteria 99.85\nk__Bacteria; p__Bacteroidetes 66.08\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia 66.08\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales 66.08\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae 34.45\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides 34.45\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides cellulosilyticus 10.43\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides ovatus 7.98\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. CACC 737 1.06\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. HF-5287 0.49\n
Save taxon proportion and taxid, and get lineage, name and rank.
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\\n | csvtk cut -Ht -f 1,5 \\\n | taxonkit lineage -i 2 -n -r \\\n | csvtk cut -Ht -f 1,2,5,4,3 \\\n | head -n 10 \\\n | csvtk pretty -Ht\n\n100.00 1 no rank root root\n99.85 131567 no rank cellular organisms cellular organisms\n99.85 2 superkingdom Bacteria cellular organisms;Bacteria\n66.08 1783270 clade FCB group cellular organisms;Bacteria;FCB group\n66.08 68336 clade Bacteroidetes/Chlorobi group cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group\n66.08 976 phylum Bacteroidetes cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes\n66.08 200643 class Bacteroidia cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia\n66.08 171549 order Bacteroidales cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales\n34.45 815 family Bacteroidaceae cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae\n34.45 816 genus Bacteroides cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides\n
Keep only taxa at the species rank or lower, and output lineages in the format \"superkingdom phylum class order family genus species\".
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\\n | csvtk cut -Ht -f 1,5 \\\n | taxonkit filter -N -E species -L species -i 2 \\\n | taxonkit lineage -i 2 -n -r \\\n | taxonkit reformat -i 3 -f \"{k};{p};{c};{o};{f};{g};{s}\" \\\n | csvtk cut -Ht -f 1,2,5,4,6 \\\n | csvtk add-header -t -n abundance,taxid,rank,name,lineage \\\n | head -n 10 \\\n | csvtk pretty -t\n\nabundance taxid rank name lineage\n--------- ------- ------- ---------------------------- --------------------------------------------------------------------------------------------------------\n10.43 246787 species Bacteroides cellulosilyticus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides cellulosilyticus\n7.98 28116 species Bacteroides ovatus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides ovatus\n1.06 2755405 species Bacteroides sp. CACC 737 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CACC 737\n0.49 2650157 species Bacteroides sp. HF-5287 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5287\n0.99 2528203 species Bacteroides sp. A1C1 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. A1C1\n0.28 2763022 species Bacteroides sp. M10 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. M10\n0.16 2650158 species Bacteroides sp. HF-5141 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5141\n0.12 2715212 species Bacteroides sp. CBA7301 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CBA7301\n5.10 817 species Bacteroides fragilis Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides fragilis\n
"},{"location":"tutorial/#making-nr-blastdb-for-specific-taxids","title":"Making nr blastdb for specific taxids","text":"Attention:
(2023-11-27) BLAST+ 2.15.0 supports limiting a search to a group of organisms without first using a custom script to get all species-level Taxonomy IDs (taxids) for the group. Details.
E.g., Search of the nr BLAST database limited to Bacteria (taxID 2).
blastp -db nr -taxids 2 -query ...\n
(2019) BLAST+ 2.8.1 was released with new databases that allow you to limit your search by taxonomy using information built into the BLAST databases, so you no longer need to build a blastdb for specific taxids.
Changes:
Data:
Hardware in this tutorial
Tools:
Steps:
Listing all taxids below $id
using taxonkit.
id=6656\n\n# 6656 is the phylum Arthropoda\n# echo 6656 | taxonkit lineage | taxonkit reformat\n# 6656 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda Eukaryota;Arthropoda;;;;;\n\n# 2 bacteria\n# 2157 archaea\n# 4751 fungi\n# 10239 virus\n\n# time: 2s\ntaxonkit list --ids $id --indent \"\" > $id.taxid.txt\n\n# taxonkit list --ids 2,4751,10239 --indent \"\" > microbe.taxid.txt\n\nwc -l $id.taxid.txt\n# 518373 6656.taxid.txt\n
Retrieving target accessions. There are two options:
From prot.accession2taxid.gz (faster, recommended). Note that some accessions are not in nr
.
# time: 4min\npigz -dc prot.accession2taxid.gz \\\n | csvtk grep -t -f taxid -P $id.taxid.txt \\\n | csvtk cut -t -f accession.version,taxid \\\n | sed 1d \\\n > $id.acc2taxid.txt\n\ncut -f 1 $id.acc2taxid.txt > $id.acc.txt\n\nwc -l $id.acc.txt\n# 8174609 6656.acc.txt\n
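The filter above keeps the rows of the accession-to-taxid table whose taxid appears in `$id.taxid.txt`. Where csvtk is not available, the same idea can be sketched with awk. This is only an illustration: the function name `filter_acc2taxid` is hypothetical, and the sketch assumes the table is tab-separated with a header line and the columns accession, accession.version, taxid and gi, in that order.

```shell
# Hypothetical helper: keep rows whose taxid is listed in a taxid file.
#   $1    : file with one taxid per line (must not be empty)
#   stdin : tab-separated table with a header line; columns assumed to be
#           accession, accession.version, taxid, gi
filter_acc2taxid() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { keep[$1] = 1; next }           # first file: remember wanted taxids
    FNR > 1 && ($3 in keep) { print $2, $3 }   # skip header; print accession.version, taxid
  ' "$1" -
}
```

For example: `pigz -dc prot.accession2taxid.gz | filter_acc2taxid $id.taxid.txt > $id.acc2taxid.txt`. All wanted taxids are held in memory, which should be acceptable for lists of the size shown above.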
From pre-formatted nr
blastdb
# time: 40min\nblastdbcmd -db nr -entry all -outfmt \"%a %T\" | pigz -c > nr.acc2taxid.txt.gz\n\npigz -dc nr.acc2taxid.txt.gz | wc -l\n# 555220892\n\n# time: 3min\npigz -dc nr.acc2taxid.txt.gz \\\n | csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \\\n | cut -d ' ' -f 1 \\\n > $id.acc.txt\n\nwc -l $id.acc.txt\n# 6928021 6656.acc.txt\n
Retrieving FASTA sequences from pre-formatted blastdb. There are two options:
From nr.fa
exported from pre-formatted blastdb (faster, smaller output file, recommended). DO NOT directly download nr.gz
from NCBI FTP, in which the FASTA headers are not well formatted.
# 1. exporting nr.fa from pre-formatted blastdb\n\n# time: 117min (run only once)\nblastdbcmd -db nr -dbtype prot -entry all -outfmt \"%f\" -out - | pigz -c > nr.fa.gz\n\n# =====================================================================\n\n# 2. filtering sequences belonging to $id\n\n# ---------------------------------------------------------------------\n\n# method 1) (for cases where $id.acc.txt is not very huge)\n# time: 80min\n# perl one-liner is used to unfold records having multiple accessions\ntime cat <(echo) <(pigz -dc nr.fa.gz) \\\n | perl -e 'BEGIN{ $/ = \"\\n>\"; <>; } while(<>){s/>$//; $i = index $_, \"\\n\"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print \">$_\"; next; }; $h = \">$h\"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\\W+//; print \">$h1\\n$s\";} } ' \\\n | seqkit grep -f $id.acc.txt -o nr.$id.fa.gz\n\n# ---------------------------------------------------------------------\n\n# method 2) (**faster**)\n\n# 33min (run only once)\n# (1). split nr.fa.gz. # Note: I have 16 cpus.\n$ time seqkit split2 -p 15 nr.fa.gz\n\n# (2). parallelize unfolding\n$ cat _unfold_blastdb_fa.sh\n#!/bin/sh\nperl -e 'BEGIN{ $/ = \"\\n>\"; <>; } while(<>){s/>$//; $i = index $_, \"\\n\"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print \">$_\"; next; }; $h = \">$h\"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\\W+//; print \">$h1\\n$s\";} } '\n\n# 10 min\ntime ls nr.fa.gz.split/nr.part_*.fa.gz \\\n | rush -j 15 -v id=$id 'cat <(echo) <(pigz -dc {}) \\\n | ./_unfold_blastdb_fa.sh \\\n | seqkit grep -f {id}.acc.txt -o nr.{id}.{%@nr\\.(.+)$} '\n\n# (3). merge result\ncat nr.$id.part*.fa.gz > nr.$id.fa.gz\nrm nr.$id.part*.fa.gz\n\n# ---------------------------------------------------------------------\n\n# method 3) (for huge $id.acc.txt file, e.g., bacteria)\n\n# (1). split ${id}.acc.txt into several parts. 
chunk size depends on lines and RAM (64G for me).\nsplit -d -l 300000000 $id.acc.txt $id.acc.txt.part_\n\n# (2). filter\ntime ls $id.acc.txt.part_* \\\n | rush -j 1 --immediate-output -v id=$id \\\n 'echo {}; cat <(echo) <(pigz -dc nr.fa.gz ) \\\n | ./_unfold_blastdb_fa.sh \\\n | seqkit grep -f {} -o nr.{id}.{%@(part_.+)}.fa.gz '\n\n# (3). merge\ncat nr.$id.part*.fa.gz > nr.$id.fa.gz\n\n# clean\nrm nr.$id.part*.fa.gz\nrm $id.acc.txt.part_*\n\n# (4). optionally adding taxid, you may edit replacement (-r) below\n# split\ntime split -d -l 200000000 $id.acc2taxid.txt $id.acc2taxid.txt.part_\n\nln -s nr.$id.fa.gz nr.$id.with-taxid.part0.fa.gz \ni=0\nfor f in $id.acc2taxid.txt.part_* ; do\n echo $f\n time pigz -cd nr.$id.with-taxid.part$i.fa.gz \\\n | seqkit replace -k $f -p \"^([^\\-]+?) \" -r \"{kv}-\\$1 \" -K -U -o nr.$id.with-taxid.part$(($i+1)).fa.gz;\n /bin/rm nr.$id.with-taxid.part$i.fa.gz\n i=$(($i+1));\ndone\nmv nr.$id.with-taxid.part$i.fa.gz nr.$id.with-taxid.fa.gz\n\n# =====================================================================\n\n# 3. counting sequences\n#\n# ls -lh nr.$id.fa.gz\n# -rw-r--r-- 1 shenwei shenwei 902M 9\u6708 13 01:42 nr.6656.fa.gz\n#\npigz -dc nr.$id.fa.gz | grep '^>' -c\n\n# 6928017\n# Here 6928017 ~= 6928021 ($id.acc.txt)\n
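The Perl one-liner used above "unfolds" records whose header lists several accessions, so that each accession gets its own FASTA record carrying the same sequence. The idea can be sketched in awk as follows. This is a simplified illustration, not a replacement for the one-liner: the function name `unfold_headers` is hypothetical, and the sketch assumes that entries in a folded header are separated by a space followed by `>`.

```shell
# Hypothetical helper: write one FASTA record per accession.
# Assumes folded header entries are separated by " >".
unfold_headers() {
  awk '
    # print the buffered record once for every entry in its header
    function emit(   i, n, parts) {
      if (header == "") return
      n = split(header, parts, " >")
      for (i = 1; i <= n; i++) printf ">%s\n%s", parts[i], seq
    }
    /^>/ { emit(); header = substr($0, 2); seq = ""; next }  # new record starts
    { seq = seq $0 "\n" }                                    # collect sequence lines
    END { emit() }
  '
}

# toy input with one folded and one ordinary record
printf '>A1 first protein >B2 second protein\nMKV\nLLA\n>C3 single\nMM\n' | unfold_headers
```

Each record is buffered in memory until the next header is seen, so memory use depends on the longest single record rather than on the file size.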
Directly from pre-formatted blastdb
# time: 5h20min\nblastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz\n\n# counting sequences\n#\n# Note that the headers of the FASTA output by blastdbcmd are \"folded\"\n# for accessions from different species with same sequences, so the\n# number may be smaller than $(wc -l $id.acc.txt).\npigz -dc nr.$id.fa.gz | grep '^>' -c\n# 1577383\n\n# counting accessions\n#\n# ls -lh nr.$id.fa.gz\n# -rw-r--r-- 1 shenwei shenwei 2.1G 9\u6708 13 03:38 nr.6656.fa.gz\n#\npigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\\n>/g' | grep '^>' -c\n# 288415413\n
makeblastdb
pigz -dc nr.$id.fa.gz > nr.$id.fa\n\n# time: 3min ($nr.$id.fa from step 3 option 1)\n#\n# building $nr.$id.fa from step 3 option 2 with -parse_seqids would produce error:\n#\n# BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1\n#\nmakeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id\n\n# rm nr.$id.fa\n
blastp (optional)
# blastdb nr.$id is built from sequences in step 3 option 1\n#\nblastp -num_threads 16 -db nr.$id -query t4.fa > t4.fa.blast\n# real 0m20.866s\n\n# $ cat t4.fa.blast | grep Query= -A 10\n# Query= A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-Hd1a\n#\n# Length=35\n Score E\n# Sequences producing significant alignments: (Bits) Value\n\n# 2MPQ_A Chain A, Solution structure of the sodium channel toxin Hd1a 72.4 2e-17\n# A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-... 72.4 2e-17\n# ADB56726.1 HNTX-IV.2 precursor [Haplopelma hainanum] 66.6 9e-15\n# D2Y233.1 RecName: Full=Mu-theraphotoxin-Hhn1b 2; Short=Mu-TRTX-H... 66.6 9e-15\n# ADB56830.1 HNTX-IV.3 precursor [Haplopelma hainanum] 66.6 9e-15\n
You can change the TaxId of interest.
Rank counts of common categories.
$ echo Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae \\\n | rush -D ' ' -T b \\\n 'taxonkit list --ids $(echo {} | taxonkit name2taxid | cut -f 2) \\\n | sed 1d \\\n | taxonkit filter -i 2 -E genus -L genus \\\n | taxonkit lineage -L -r \\\n | csvtk freq -H -t -f 2 -nr \\\n > stats.{}.tsv '\n\n$ csvtk -t join --outer-join stats.*.tsv \\\n | csvtk add-header -t -n \"rank,$(ls stats.*.tsv | rush -k 'echo {@stats.(.+).tsv}' | paste -sd, )\" \\\n | csvtk csv2md -t\n
Similar data on NCBI Taxonomy
rank Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae
species 12482 460940 1349648 156908 957297 191026
strain 354 40643 3486 2352 33 50
genus 205 4112 90882 6844 64148 16202
isolate 7 503 809 76 17 3
species group 2 77 251 22 214 5
serotype 218
serogroup 136
subsection 21 21
subspecies 632 24523 158 17043 7212
forma specialis 521 220 179 33 1
species subgroup 23 101 101
biotype 7 10
morph 12 3 4 5
section 437 37 2 398
genotype 12 12
series 9 5 4
varietas 25 8499 1100 2 7188
forma 4 560 185 6 315
subgenus 1 1558 10 1414 112
pathogroup 5
subvariety 5 5
Count of all ranks
$ time taxonkit list --ids 1 \\\n | taxonkit lineage -L -r \\\n | csvtk freq -H -t -f 2 -nr \\\n | csvtk pretty -H -t\n\nspecies 1879659\nno rank 222743\ngenus 96625\nstrain 44483\nsubspecies 25174\nfamily 9492\nvarietas 8524\nsubfamily 3050\ntribe 2213\norder 1660\nsubgenus 1618\nisolate 1319\nserotype 1216\nclade 886\nsuperfamily 865\nforma specialis 741\nforma 564\nsubtribe 508\nsection 437\nclass 429\nsuborder 372\nspecies group 330\nphylum 272\nsubclass 156\nserogroup 138\ninfraorder 130\nspecies subgroup 124\nsuperorder 55\nsubphylum 33\nparvorder 26\nsubsection 21\ngenotype 20\ninfraclass 18\nbiotype 17\nmorph 12\nkingdom 11\nseries 9\nsuperclass 6\ncohort 5\npathogroup 5\nsubvariety 5\nsuperkingdom 4\nsubcohort 3\nsubkingdom 1\nsuperphylum 1\n\nreal 0m3.663s\nuser 0m15.897s\nsys 0m1.010s\n
Ranks of taxa at or below species.
$ taxonkit list --ids 1 \\\n | taxonkit filter --lower-than species --equal-to species \\\n | taxonkit lineage -L -r \\\n | csvtk freq -Ht -nr -f 2 \\\n | csvtk add-header -t -n rank,count \\\n | csvtk pretty -t\n\nrank count\n--------------- -------\nspecies 1880044\nno rank 222756\nstrain 44483\nsubspecies 25171\nvarietas 8524\nisolate 1319\nserotype 1216\nclade 885\nforma specialis 741\nforma 564\nserogroup 138\ngenotype 20\nbiotype 17\nmorph 12\npathogroup 5\nsubvariety 5\n
Sometimes one needs to build a database that combines bacteria and archaea from GTDB with viruses from NCBI. The idea is to export lineages from both GTDB and NCBI using taxonkit reformat, and then create taxdump files from them with taxonkit create-taxdump.
Exporting taxonomic lineages of taxa with rank equal to species from GTDB-taxdump.
taxonkit list --data-dir gtdb-taxdump/R207/ --ids 1 --indent \"\" \\\n | taxonkit filter --data-dir gtdb-taxdump/R207/ --equal-to species \\\n | taxonkit reformat --data-dir gtdb-taxdump/R207/ --taxid-field 1 \\\n --format \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" \\\n -o gtdb.tsv\n
Exporting taxonomic lineages of viral taxa with rank equal to or lower than species from NCBI taxdump. Taxa below the species level whose rank is \"no rank\" are treated as taxa of strain rank (--pseudo-strain
, taxonkit v0.14.1 needed).
# taxid of Viruses: 10239\ntaxonkit list --data-dir ~/.taxonkit --ids 10239 --indent \"\" \\\n | taxonkit filter --data-dir ~/.taxonkit --equal-to species --lower-than species \\\n | taxonkit reformat --data-dir ~/.taxonkit --taxid-field 1 \\\n --pseudo-strain --format \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\\n -o ncbi-viral.tsv\n
Creating taxdump from lineages above.
(awk '{print $_\"\\t\"}' gtdb.tsv; cat ncbi-viral.tsv) \\\n | taxonkit create-taxdump \\\n --field-accession 1 \\\n -R \"superkingdom,phylum,class,order,family,genus,species,strain\" \\\n -O taxdump\n\n# we use --field-accession 1 to output the mapping file between old taxids and new ones.\n$ grep 2697049 taxdump/taxid.map # SARS-COV-2\n2697049 216305222\n
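In the command above, the awk step appends an empty field to every line of gtdb.tsv, so that the GTDB lines have the same number of columns as ncbi-viral.tsv, which carries an extra strain column. A slightly more general sketch of that padding step is shown below. The function name `pad_columns` is hypothetical, and the column count in the usage note is an assumption based on the two reformat commands (TaxId plus eight ranks).

```shell
# Hypothetical helper: pad tab-separated lines with empty fields
# until every line has at least N columns.
#   $1 : target number of columns
pad_columns() {
  awk -F'\t' -v OFS='\t' -v n="$1" '{
    for (i = NF + 1; i <= n; i++) $i = ""   # add missing trailing fields
    print
  }'
}
```

For example, `(pad_columns 9 < gtdb.tsv; cat ncbi-viral.tsv)` could stand in for the first line of the pipeline. Lines that already have enough columns are passed through unchanged.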
Some tests:
# SARS-COV-2 in NCBI taxonomy\n$ echo 2697049 \\\n | taxonkit lineage -t --data-dir ~/.taxonkit \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | taxonkit lineage -r -n -L --data-dir ~/.taxonkit \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -Ht\n10239 superkingdom Viruses\n2559587 clade Riboviria\n2732396 kingdom Orthornavirae\n2732408 phylum Pisuviricota\n2732506 class Pisoniviricetes\n76804 order Nidovirales\n2499399 suborder Cornidovirineae\n11118 family Coronaviridae\n2501931 subfamily Orthocoronavirinae\n694002 genus Betacoronavirus\n2509511 subgenus Sarbecovirus\n694009 species Severe acute respiratory syndrome-related coronavirus\n2697049 no rank Severe acute respiratory syndrome coronavirus 2\n\n$ echo \"Severe acute respiratory syndrome coronavirus 2\" | taxonkit name2taxid --data-dir taxdump/\nSevere acute respiratory syndrome coronavirus 2 216305222\n\n$ echo 216305222 \\\n | taxonkit lineage -t --data-dir taxdump/ \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | taxonkit lineage -r -n -L --data-dir taxdump/ \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -Ht\n1287770734 superkingdom Viruses\n1506901452 phylum Pisuviricota\n1091693597 class Pisoniviricetes\n37745009 order Nidovirales\n738421640 family Coronaviridae\n906833049 genus Betacoronavirus\n1015862491 species Severe acute respiratory syndrome-related coronavirus\n216305222 strain Severe acute respiratory syndrome coronavirus 2\n\n\n\n$ echo \"Escherichia coli\" | taxonkit name2taxid --data-dir taxdump/\nEscherichia coli 1945799576\n\n$ echo 1945799576 \\\n | taxonkit lineage -t --data-dir taxdump/ \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | taxonkit lineage -r -n -L --data-dir taxdump/ \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -Ht\n609216830 superkingdom Bacteria\n1641076285 phylum Proteobacteria\n329474883 class Gammaproteobacteria\n1012954932 order Enterobacterales\n87250111 family Enterobacteriaceae\n1187493883 
genus Escherichia\n1945799576 species Escherichia coli\n
"},{"location":"usage/","title":"Usage and Examples","text":"Table of Contents
Download and uncompress taxdump.tar.gz: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
Copy names.dmp, nodes.dmp, delnodes.dmp and merged.dmp to the data directory $HOME/.taxonkit, e.g., /home/shenwei/.taxonkit,
or to some other directory, which you can later refer to using the flag --data-dir or the environment variable TAXONKIT_DB.
All-in-one command:
wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \ntar -zxvf taxdump.tar.gz\n\nmkdir -p $HOME/.taxonkit\ncp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit\n
Update dataset: Simply re-download the taxdump files, uncompress them, and overwrite the old ones.
"},{"location":"usage/#taxonkit","title":"taxonkit","text":"TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit\n\nVersion: 0.15.1\n\nAuthor: Wei Shen <shenwei356@gmail.com>\n\nSource code: https://github.com/shenwei356/taxonkit\nDocuments : https://bioinf.shenwei.me/taxonkit\nCitation : https://www.sciencedirect.com/science/article/pii/S1673852721000837\n\nDataset:\n\n Please download and uncompress \"taxdump.tar.gz\":\n ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz\n\n and copy \"names.dmp\", \"nodes.dmp\", \"delnodes.dmp\" and \"merged.dmp\" to data directory:\n \"/home/shenwei/.taxonkit\"\n\n or some other directory, and later you can refer to using flag --data-dir,\n or environment variable TAXONKIT_DB.\n\n When environment variable TAXONKIT_DB is set, explicitly setting --data-dir will\n overide the value of TAXONKIT_DB.\n\nUsage:\n taxonkit [command]\n\nAvailable Commands:\n cami-filter Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile\n create-taxdump Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV\n filter Filter TaxIds by taxonomic rank range\n genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)\n lca Compute lowest common ancestor (LCA) for TaxIds\n lineage Query taxonomic lineage of given TaxIds\n list List taxonomic subtrees of given TaxIds\n name2taxid Convert taxon names to TaxIds\n profile2cami Convert metagenomic profile table to CAMI format\n reformat Reformat lineage in canonical ranks\n taxid-changelog Create TaxId changelog from dump archives\n version print version information and check for update\n\nFlags:\n --data-dir string directory containing nodes.dmp and names.dmp (default \"/home/shenwei/.taxonkit\")\n -h, --help help for taxonkit\n --line-buffered use line buffering on output, i.e., immediately writing to stdin/file for\n every line of output\n -o, --out-file string out file (\"-\" for stdout, suffix .gz for gzipped out) (default \"-\")\n -j, 
--threads int number of CPUs. 4 is enough (default 4)\n --verbose print verbose information\n\nUse \"taxonkit [command] --help\" for more information about a command.\n
"},{"location":"usage/#list","title":"list","text":"Usage
List taxonomic subtrees of given TaxIds\n\nAttentions:\n 1. When multiple taxids are given, the output may contain duplicated records\n if some taxids are descendants of others.\n\nExamples:\n\n $ taxonkit list --ids 9606 -n -r --indent \" \"\n 9606 [species] Homo sapiens\n 63221 [subspecies] Homo sapiens neanderthalensis\n 741158 [subspecies] Homo sapiens subsp. 'Denisova'\n\n $ taxonkit list --ids 9606 --indent \"\"\n 9606\n 63221\n 741158\n\nUsage:\n taxonkit list [flags]\n\nFlags:\n -h, --help help for list\n -i, --ids string TaxId(s), multiple values should be separated by comma\n -I, --indent string indent (default \" \")\n -J, --json output in JSON format. you can save the result in file with suffix \".json\" and\n open with modern text editor\n -n, --show-name output scientific name\n -r, --show-rank output rank\n
Examples
Default usage.
$ taxonkit list --ids 9605,239934\n9605\n9606\n 63221\n 741158\n1425170\n2665952\n 2665953\n\n239934\n239935\n 349741\n512293\n 512294\n 1131822\n 1262691\n 1263034\n1679444\n2608915\n 1131336\n...\n
Removing indent. The list could be used to extract sequences from BLAST database with blastdbcmd
(see tutorial)
$ taxonkit list --ids 9605,239934 --indent \"\"\n9605\n9606\n63221\n741158\n1425170\n2665952\n2665953\n\n239934\n239935\n349741\n512293\n512294\n1131822\n1262691\n1263034\n1679444\n...\n
Performance: Time and memory usage for the whole taxonomy tree:
$ # emptying the buffers cache\n$ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\"\n\n$ memusg -t taxonkit list --ids 1 --indent \"\" --verbose > t0.txt\n21:05:01.782 [INFO] parsing merged file: /home/shenwei/.taxonkit/names.dmp\n21:05:01.782 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp\n21:05:01.782 [INFO] parsing delnodes file: /home/shenwei/.taxonkit/names.dmp\n21:05:01.816 [INFO] 61023 merged nodes parsed\n21:05:01.889 [INFO] 437929 delnodes parsed\n21:05:03.178 [INFO] 2303979 names parsed\n\nelapsed time: 3.290s\npeak rss: 742.77 MB\n
Adding names
$ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934\n9605 [genus] Homo\n 9606 [species] Homo sapiens\n 63221 [subspecies] Homo sapiens neanderthalensis\n 741158 [subspecies] Homo sapiens subsp. 'Denisova'\n 1425170 [species] Homo heidelbergensis\n 2665952 [no rank] environmental samples\n 2665953 [species] Homo sapiens environmental sample\n\n239934 [genus] Akkermansia\n 239935 [species] Akkermansia muciniphila\n 349741 [strain] Akkermansia muciniphila ATCC BAA-835\n 512293 [no rank] environmental samples\n 512294 [species] uncultured Akkermansia sp.\n 1131822 [species] uncultured Akkermansia sp. SMG25\n 1262691 [species] Akkermansia sp. CAG:344\n 1263034 [species] Akkermansia muciniphila CAG:154\n 1679444 [species] Akkermansia glycaniphila\n 2608915 [no rank] unclassified Akkermansia\n 1131336 [species] Akkermansia sp. KLE1605\n 1574264 [species] Akkermansia sp. KLE1797\n...\n
Performance: Time and memory usage for whole taxonomy tree:
$ # emptying the buffers cache\n$ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\"\n\n$ memusg -t taxonkit list --show-rank --show-name --ids 1 > t1.txt\nelapsed time: 5.341s\npeak rss: 1.04 GB\n
Output in JSON format. You can easily collapse and expand the taxonomy tree in a modern text editor.
$ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 --json\n{\n \"9605 [genus] Homo\": {\n \"9606 [species] Homo sapiens\": {\n \"63221 [subspecies] Homo sapiens neanderthalensis\": {\n },\n \"741158 [subspecies] Homo sapiens subsp. 'Denisova'\": {\n }\n },\n \"1425170 [species] Homo heidelbergensis\": {\n }\n },\n \"239934 [genus] Akkermansia\": {\n \"239935 [species] Akkermansia muciniphila\": {\n \"349741 [no rank] Akkermansia muciniphila ATCC BAA-835\": {\n }\n },\n \"512293 [no rank] environmental samples\": {\n \"512294 [species] uncultured Akkermansia sp.\": {\n },\n \"1131822 [species] uncultured Akkermansia sp. SMG25\": {\n },\n \"1262691 [species] Akkermansia sp. CAG:344\": {\n },\n \"1263034 [species] Akkermansia muciniphila CAG:154\": {\n }\n },\n \"1679444 [species] Akkermansia glycaniphila\": {\n },\n \"2608915 [no rank] unclassified Akkermansia\": {\n \"1131336 [species] Akkermansia sp. KLE1605\": {\n },\n \"1574264 [species] Akkermansia sp. KLE1797\": {\n },\n \"1574265 [species] Akkermansia sp. KLE1798\": {\n },\n \"1638783 [species] Akkermansia sp. UNK.MGS-1\": {\n },\n \"1755639 [species] Akkermansia sp. MC_55\": {\n }\n }\n }\n}\n
Snapshot of taxonomy (taxid 1) in kate:
Usage
Query taxonomic lineage of given TaxIds\n\nInput:\n\n - List of TaxIds, one TaxId per line.\n - Or tab-delimited format, please specify TaxId field \n with flag -i/--taxid-field (default 1).\n - Supporting (gzipped) file or STDIN.\n\nOutput:\n\n 1. Input line data.\n 2. (Optional) Status code (-c/--show-status-code), values:\n - \"-1\" for queries not found in whole database.\n - \"0\" for deleted TaxIds, provided by \"delnodes.dmp\".\n - New TaxIds for merged TaxIds, provided by \"merged.dmp\".\n - Taxids for these found in \"nodes.dmp\".\n 3. Lineage, delimiter can be changed with flag -d/--delimiter.\n 4. (Optional) TaxIds taxons in the lineage (-t/--show-lineage-taxids)\n 5. (Optional) Name (-n/--show-name)\n 6. (Optional) Rank (-r/--show-rank)\n\nFilter out invalid and deleted taxids, and replace merged \ntaxids with new ones:\n\n # input is one-column-taxid\n $ taxonkit lineage -c taxids.txt \\\n | awk '$2>0' \\\n | cut -f 2-\n\n # taxids are in 3rd field in a 4-columns tab-delimited file,\n # for $5, where 5 = 4 + 1.\n $ cat input.txt \\\n | taxonkit lineage -c -i 3 \\\n | csvtk filter2 -H -t -f '$5>0' \\\n | csvtk -H -t cut -f -3\n\nUsage:\n taxonkit lineage [flags]\n\nFlags:\n -d, --delimiter string field delimiter in lineage (default \";\")\n -h, --help help for lineage\n -L, --no-lineage do not show lineage, when user just want names or/and ranks\n -R, --show-lineage-ranks appending ranks of all levels\n -t, --show-lineage-taxids appending lineage consisting of taxids\n -n, --show-name appending scientific name\n -r, --show-rank appending rank of taxids\n -c, --show-status-code show status code before lineage\n -i, --taxid-field int field index of taxid. input data should be tab-separated (default 1)\n
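The status-code filter described in the usage text relies on the second column being a positive TaxId for found or merged TaxIds, "0" for deleted ones and "-1" for unknown ones. The sketch below runs the same awk and cut steps on toy input, so the effect can be seen without a taxonomy database. The function name `keep_valid_taxids` is hypothetical, and the toy lines only mimic the layout described above: the lineages are placeholders, not real output.

```shell
# Hypothetical helper: drop deleted/unknown TaxIds and replace merged
# TaxIds with their new ones, given "taxid<TAB>status<TAB>lineage" lines.
keep_valid_taxids() {
  awk -F'\t' '$2 > 0' | cut -f 2-
}

# toy input: a valid, a deleted, an unknown and a merged TaxId
printf '9606\t9606\tlineage-A\n3\t0\t\n123124124\t-1\t\n92489\t796334\tlineage-B\n' \
  | keep_valid_taxids
```

After the filter, the first column holds the current TaxId, so merged TaxIds such as 92489 appear under their new value.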
Examples
Full lineage:
# note that 123124124 is a fake taxid, 3 was deleted, 92489,1458427 were merged\n$ cat taxids.txt \n9606\n9913\n376619\n349741\n239935\n314101\n11932\n1327037\n123124124\n3\n92489\n1458427\n\n$ taxonkit lineage taxids.txt | tee lineage.txt \n19:22:13.077 [WARN] taxid 92489 was merged into 796334\n19:22:13.077 [WARN] taxid 1458427 was merged into 1458425\n19:22:13.077 [WARN] taxid 123124124 not found\n19:22:13.077 [WARN] taxid 3 was deleted\n9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens\n9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus\n376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS\n349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\n239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B\n11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\n1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y\n123124124\n3\n92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae\n1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raicheisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei\n\n# wrapped table with csvtk pretty (>v0.26.0)\n$ taxonkit lineage taxids.txt | csvtk pretty -Ht -x ';' -W 70 -S bold\n\u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513\n\u2503 9606 \u2503 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria; \u2503\n\u2503 \u2503 Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi; 
\u2503\n\u2503 \u2503 Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota; \u2503\n\u2503 \u2503 Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates; \u2503\n\u2503 \u2503 Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae; \u2503\n\u2503 \u2503 Homo;Homo sapiens \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 9913 \u2503 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria; \u2503\n\u2503 \u2503 Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi; \u2503\n\u2503 \u2503 Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota; \u2503\n\u2503 \u2503 Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla; \u2503\n\u2503 \u2503 Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 376619 \u2503 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria; \u2503\n\u2503 \u2503 Thiotrichales;Francisellaceae;Francisella;Francisella tularensis; \u2503\n\u2503 \u2503 Francisella tularensis subsp. 
holarctica; \u2503\n\u2503 \u2503 Francisella tularensis subsp. holarctica LVS \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 349741 \u2503 cellular organisms;Bacteria;PVC group;Verrucomicrobia; \u2503\n\u2503 \u2503 Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia; \u2503\n\u2503 \u2503 Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 239935 \u2503 cellular organisms;Bacteria;PVC group;Verrucomicrobia; \u2503\n\u2503 \u2503 Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia; \u2503\n\u2503 \u2503 Akkermansia muciniphila 
\u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 314101 \u2503 cellular organisms;Bacteria;environmental samples; \u2503\n\u2503 \u2503 uncultured murine large bowel bacterium BAC 54B \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 11932 \u2503 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes; \u2503\n\u2503 \u2503 Ortervirales;Retroviridae;unclassified Retroviridae; \u2503\n\u2503 \u2503 Intracisternal A-particles;Mouse Intracisternal A-particle \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 1327037 \u2503 
Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes; \u2503\n\u2503 \u2503 Caudovirales;Siphoviridae;unclassified Siphoviridae; \u2503\n\u2503 \u2503 Croceibacter phage P2559Y \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 92489 \u2503 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria; \u2503\n\u2503 \u2503 Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 1458427 \u2503 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria; \u2503\n\u2503 \u2503 Burkholderiales;Comamonadaceae;Serpentinomonas; \u2503\n\u2503 \u2503 Serpentinomonas raichei 
\u2503\n\u2517\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u253b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u251b\n
Speed.
$ time echo 9606 | taxonkit lineage \n9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens\n\nreal 0m1.190s\nuser 0m2.365s\nsys 0m0.170s\n\n# all TaxIds\n$ time taxonkit list --ids 1 --indent \"\" | taxonkit lineage > t\n\nreal 0m4.249s\nuser 0m16.418s\nsys 0m1.221s\n
Checking deleted or merged taxids
$ taxonkit lineage --show-status-code taxids.txt | tee lineage.withcode.txt\n\n# valid\n$ cat lineage.withcode.txt | awk '$2 > 0' | cut -f 1,2\n9606 9606\n9913 9913\n376619 376619\n349741 349741\n239935 239935\n314101 314101\n11932 11932\n1327037 1327037\n92489 796334\n1458427 1458425\n\n# merged\n$ cat lineage.withcode.txt | awk '$2 > 0 && $2 != $1' | cut -f 1,2\n92489 796334\n1458427 1458425\n\n# deleted\n$ cat lineage.withcode.txt | awk '$2 == 0' | cut -f 1\n3\n\n# invalid\n$ cat lineage.withcode.txt | awk '$2 < 0' | cut -f 1\n123124124\n
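The four awk filters above can also be folded into a single pass. The following is a minimal sketch, not part of TaxonKit: `classify_status` is a hypothetical helper name, and it assumes the input is tab-delimited with the query taxid in field 1 and the status code in field 2, with negative codes marking taxids that were not found.

```shell
# Hypothetical helper: label each taxid using the status-code column.
# Assumes tab-delimited input: field 1 = query taxid, field 2 = status code.
classify_status() {
    awk -F'\t' -v OFS='\t' '
        $2 < 0   { print $1, "invalid"; next }      # not found in the database
        $2 == 0  { print $1, "deleted"; next }      # listed as deleted
        $2 != $1 { print $1, "merged", $2; next }   # merged: third column is the new taxid
                 { print $1, "valid" }'
}

# Demo on a few rows taken from the example output above
# (the -1 code for the invalid taxid is an assumption).
printf '9606\t9606\n92489\t796334\n3\t0\n123124124\t-1\n' | classify_status
```

In practice the function would be fed with `cut -f 1,2 lineage.withcode.txt`.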
Filter out invalid and deleted taxids, and replace merged taxids with new ones. The second command below requires csvtk.
# input is a one-column file of taxids\n$ taxonkit lineage -c taxids.txt \\\n | awk '$2>0' \\\n | cut -f 2-\n\n# taxids are in the 3rd field of a 4-column tab-delimited file,\n# so the status code is appended as field $5 (4 + 1).\n$ cat input.txt \\\n | taxonkit lineage -c -i 3 \\\n | csvtk filter2 -H -t -f '$5>0' \\\n | csvtk -H -t cut -f -3\n
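If csvtk is not available, plain awk can do a similar job. This is a sketch under stated assumptions rather than an equivalent of the pipeline above: `keep_valid_replace_merged` is a hypothetical name, the input is assumed to be the output of `taxonkit lineage -c -i 3` on a 4-column file (status code in field 5, lineage in field 6), and the original 4-column layout is kept by overwriting the taxid in field 3 and dropping the lineage.

```shell
# Hypothetical helper: keep rows with a positive status code and
# overwrite the taxid in field 3 with that code (the current taxid).
keep_valid_replace_merged() {
    awk -F'\t' -v OFS='\t' '$5 > 0 { print $1, $2, $5, $4 }'
}

# Demo with made-up sample names (s1..s4) and taxids from the examples above.
printf 's1\tx\t92489\ty\t796334\tlineage\ns2\tx\t3\ty\t0\t\ns3\tx\t123124124\ty\t-1\t\ns4\tx\t9606\ty\t9606\tlineage\n' \
    | keep_valid_replace_merged
```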
Only show name and rank.
$ taxonkit lineage -r -n -L taxids.txt \\\n | csvtk pretty -H -t\n9606 Homo sapiens species\n9913 Bos taurus species\n376619 Francisella tularensis subsp. holarctica LVS strain\n349741 Akkermansia muciniphila ATCC BAA-835 strain\n239935 Akkermansia muciniphila species\n314101 uncultured murine large bowel bacterium BAC 54B species\n11932 Mouse Intracisternal A-particle species\n1327037 Croceibacter phage P2559Y species\n123124124 \n3 \n92489 Erwinia oleae species\n1458427 Serpentinomonas raichei species\n
Show lineage consisting of taxids:
$ taxonkit lineage -t taxids.txt\n9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;40674;32525;9347;1437010;314146;9443;376913;314293;9526;314295;9604;207598;9605;9606\n9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;40674;32525;9347;1437010;314145;91561;9845;35500;9895;27592;9903;9913\n376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS 131567;2;1224;1236;72273;34064;262;263;119857;376619\n349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741\n239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935\n314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 131567;2;48479;314101\n11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 10239;2559587;2732397;2732409;2732514;2169561;11632;35276;11749;11932\n1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 10239;2731341;2731360;2731618;2731619;28883;10699;196894;1327037\n123124124\n3\n92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 131567;2;1224;1236;91347;1903409;551;796334\n1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei 131567;2;1224;28216;80840;80864;2490452;1458425\n
or read taxids from STDIN:
$ cat taxids.txt | taxonkit lineage\n
Also show the ranks of all nodes:
$ echo 2697049 \\\n | taxonkit lineage -t -R \\\n | csvtk transpose -Ht\n2697049\nViruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2\n10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049\nsuperkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank\n
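The three semicolon-separated columns can be lined up node by node with awk alone. The sketch below assumes the untransposed column order shown above (query taxid, lineage names, lineage taxids, lineage ranks); `unfold_lineage` is a hypothetical helper name.

```shell
# Hypothetical helper: print one "taxid<TAB>rank<TAB>name" row per lineage node.
# Assumes tab-delimited input: taxid, names, taxids, ranks (each ';'-separated).
unfold_lineage() {
    awk -F'\t' -v OFS='\t' '
        {
            n = split($2, name, ";")
            split($3, taxid, ";")
            split($4, rank, ";")
            for (i = 1; i <= n; i++) print taxid[i], rank[i], name[i]
        }'
}

# Demo on a row shortened to the first three nodes of the lineage above.
printf '2697049\tViruses;Riboviria;Orthornavirae\t10239;2559587;2732396\tsuperkingdom;clade;kingdom\n' \
    | unfold_lineage
```

With TaxonKit installed, the function would be used as `echo 2697049 | taxonkit lineage -t -R | unfold_lineage`.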
Another way to show the lineage details of a TaxId:
$ echo 2697049 \\\n | taxonkit lineage -t \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -H -t \n10239 superkingdom Viruses\n2559587 clade Riboviria\n2732396 kingdom Orthornavirae\n2732408 phylum Pisuviricota\n2732506 class Pisoniviricetes\n76804 order Nidovirales\n2499399 suborder Cornidovirineae\n11118 family Coronaviridae\n2501931 subfamily Orthocoronavirinae\n694002 genus Betacoronavirus\n2509511 subgenus Sarbecovirus\n694009 species Severe acute respiratory syndrome-related coronavirus\n2697049 no rank Severe acute respiratory syndrome coronavirus 2\n
Usage
Reformat lineage in canonical ranks\n\nInput:\n\n - List of TaxIds or lineages, one record per line.\n The lineage can be a complete lineage or only one taxonomy name.\n - Or tab-delimited format.\n Plese specify the lineage field with flag -i/--lineage-field (default 2).\n Or specify the TaxId field with flag -I/--taxid-field (default 0),\n which overrides -i/--lineage-field.\n - Supporting (gzipped) file or STDIN.\n\nOutput:\n\n 1. Input line data.\n 2. Reformated lineage.\n 3. (Optional) TaxIds taxons in the lineage (-t/--show-lineage-taxids)\n\nAmbiguous names:\n\n - Some TaxIds have the same complete lineage, empty result is returned\n by default. You can use the flag -a/--output-ambiguous-result to\n return one possible result\n\nOutput format can be formated by flag --format, available placeholders:\n\n {k}: superkingdom\n {K}: kingdom\n {p}: phylum\n {c}: class\n {o}: order\n {f}: family\n {g}: genus\n {s}: species\n {t}: subspecies/strain\n\n {S}: subspecies\n {T}: strain\n\nWhen these're no nodes of rank \"subspecies\" nor \"strain\",\nyou can switch on -S/--pseudo-strain to use the node with lowest rank\nas subspecies/strain name, if which rank is lower than \"species\".\nThis flag affects {t}, {S}, {T}.\n\nOutput format can contains some escape charactors like \"\\t\".\n\nUsage:\n taxonkit reformat [flags]\n\nFlags:\n -P, --add-prefix add prefixes for all ranks, single prefix for a rank is defined\n by flag --prefix-X\n -d, --delimiter string field delimiter in input lineage (default \";\")\n -F, --fill-miss-rank fill missing rank with lineage information of the next higher rank\n -f, --format string output format, placeholders of rank are needed (default\n \"{k};{p};{c};{o};{f};{g};{s}\")\n -h, --help help for reformat\n -i, --lineage-field int field index of lineage. 
data should be tab-separated (default 2)\n -r, --miss-rank-repl string replacement string for missing rank\n -p, --miss-rank-repl-prefix string prefix for estimated taxon level (default \"unclassified \")\n -s, --miss-rank-repl-suffix string suffix for estimated taxon names. \"rank\" for rank name, \"\" for no\n suffix (default \"rank\")\n -R, --miss-taxid-repl string replacement string for missing taxid\n -a, --output-ambiguous-result output one of the ambigous result\n --prefix-K string prefix for kingdom, used along with flag -P/--add-prefix (default\n \"K__\")\n --prefix-S string prefix for subspecies, used along with flag -P/--add-prefix\n (default \"S__\")\n --prefix-T string prefix for strain, used along with flag -P/--add-prefix (default\n \"T__\")\n --prefix-c string prefix for class, used along with flag -P/--add-prefix (default \"c__\")\n --prefix-f string prefix for family, used along with flag -P/--add-prefix (default\n \"f__\")\n --prefix-g string prefix for genus, used along with flag -P/--add-prefix (default \"g__\")\n --prefix-k string prefix for superkingdom, used along with flag -P/--add-prefix\n (default \"k__\")\n --prefix-o string prefix for order, used along with flag -P/--add-prefix (default \"o__\")\n --prefix-p string prefix for phylum, used along with flag -P/--add-prefix (default\n \"p__\")\n --prefix-s string prefix for species, used along with flag -P/--add-prefix (default\n \"s__\")\n --prefix-t string prefix for subspecies/strain, used along with flag\n -P/--add-prefix (default \"t__\")\n -S, --pseudo-strain use the node with lowest rank as strain name, only if which rank\n is lower than \"species\" and not \"subpecies\" nor \"strain\". It\n affects {t}, {S}, {T}. This flag needs flag -F\n -t, --show-lineage-taxids show corresponding taxids of reformated lineage\n -I, --taxid-field int field index of taxid. input data should be tab-separated. 
it\n overrides -i/--lineage-field\n -T, --trim do not fill or add prefix for missing rank lower than current rank\n
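To make the placeholder mechanism concrete, here is a toy sketch of how a format string could be filled from rank/name pairs. It is an illustration only and not how TaxonKit is implemented; `fill_format` and its `key=value` argument convention are assumptions made for this example. Placeholders without a value are replaced by an empty string, which mirrors the empty fields seen for missing ranks.

```shell
# Hypothetical helper: fill_format FORMAT key=value...
# Replaces each {key} in FORMAT with its value; unfilled placeholders become empty.
fill_format() {
    fmt=$1
    shift
    printf '%s\n' "$@" | awk -v fmt="$fmt" '
        {
            eq  = index($0, "=")
            key = "{" substr($0, 1, eq - 1) "}"
            val = substr($0, eq + 1)
            # literal (non-regex) replacement of every occurrence of key
            out  = ""
            rest = fmt
            while ((pos = index(rest, key)) > 0) {
                out  = out substr(rest, 1, pos - 1) val
                rest = substr(rest, pos + length(key))
            }
            fmt = out rest
        }
        END {
            gsub(/[{][A-Za-z][}]/, "", fmt)   # drop placeholders that got no value
            print fmt
        }'
}

fill_format '{k};{p};{c}' k=Bacteria p=Verrucomicrobia
```

Matching is case-sensitive, so `{s}` and `{S}` stay distinct, as they are in the placeholder list above.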
Examples:
For version > 0.8.0, reformat
accepts TaxIds as input via flag -I/--taxid-field
.
$ echo 239935 | taxonkit reformat -I 1\n239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n\n$ echo 349741 | taxonkit reformat -I 1 -f \"{k}|{p}|{c}|{o}|{f}|{g}|{s}|{t}\" -F -t\n349741 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila|Akkermansia muciniphila ATCC BAA-835 2|74201|203494|48461|1647988|239934|239935|349741\n
Example lineage (produced by: taxonkit lineage taxids.txt | awk '$2!=\"\"' > lineage.txt
).
$ cat lineage.txt\n9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens\n9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus\n376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS\n349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\n239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B\n11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\n1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y\n92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae\n1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei\n
Default output format (\"{k};{p};{c};{o};{f};{g};{s}\"
).
# reformatted lineages are appended to the input data\n$ taxonkit reformat lineage.txt \n...\n239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n...\n\n$ taxonkit reformat lineage.txt | tee lineage.txt.reformat\n\n$ cut -f 1,3 lineage.txt.reformat\n9606 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens\n9913 Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus\n376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis\n349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B\n11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\n1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y\n92489 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae\n1458427 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei\n\n# aligned \n$ cat lineage.txt \\\n | taxonkit reformat \\\n | csvtk -H -t cut -f 1,3 \\\n | csvtk -H -t sep -f 2 -s ';' -R \\\n | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species \\\n | csvtk pretty -t\n\ntaxid kingdom phylum class order family genus species\n------- --------- --------------- ------------------- ------------------ --------------- -------------------------- -----------------------------------------------\n9606 Eukaryota Chordata Mammalia Primates Hominidae Homo 
Homo sapiens\n9913 Eukaryota Chordata Mammalia Artiodactyla Bovidae Bos Bos taurus\n376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis\n349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila\n239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila\n314101 Bacteria uncultured murine large bowel bacterium BAC 54B\n11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle\n1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Croceibacter phage P2559Y\n92489 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia Erwinia oleae\n1458427 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Serpentinomonas Serpentinomonas raichei\n
Placeholders for subspecies/strain
({t}
), subspecies
({S}
), and strain
({T}
) are also available.
# default operation\n$ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\\n | taxonkit lineage -n -r \\\n | taxonkit reformat -f '{t};{S};{T}' \\\n | csvtk -H -t cut -f 1,4,3,5 \\\n | csvtk -H -t sep -f 4 -s ';' -R \\\n | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\\n | csvtk pretty -t\n\ntaxid rank name subspecies/strain subspecies strain\n------- ---------- ----------------------------------------------- --------------------- --------------------- ---------------------\n239935 species Akkermansia muciniphila \n83333 strain Escherichia coli K-12 Escherichia coli K-12 Escherichia coli K-12\n1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 \n2697049 no rank Severe acute respiratory syndrome coronavirus 2 \n2605619 no rank Escherichia coli O16:H48\n\n# fill missing ranks\n# see example below for -F/--fill-miss-rank\n#\n$ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\\n | taxonkit lineage -n -r \\\n | taxonkit reformat -f '{t};{S};{T}' --fill-miss-rank \\\n | csvtk -H -t cut -f 1,4,3,5 \\\n | csvtk -H -t sep -f 4 -s ';' -R \\\n | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\\n | csvtk pretty -t\n\ntaxid rank name subspecies/strain subspecies strain\n------- ---------- ----------------------------------------------- ------------------------------------------------------------------------------------ ----------------------------------------------------------------------------- -------------------------------------------------------------------------\n239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain unclassified Akkermansia muciniphila subspecies unclassified Akkermansia muciniphila strain\n83333 strain Escherichia coli K-12 Escherichia coli K-12 unclassified Escherichia coli subspecies Escherichia coli K-12\n1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia 
coli R178 unclassified Escherichia coli R178 strain\n2697049 no rank Severe acute respiratory syndrome coronavirus 2 unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain unclassified Severe acute respiratory syndrome-related coronavirus subspecies unclassified Severe acute respiratory syndrome-related coronavirus strain\n2605619 no rank Escherichia coli O16:H48 unclassified Escherichia coli subspecies/strain unclassified Escherichia coli subspecies unclassified Escherichia coli strain\n
When there are no nodes of rank \"subspecies\" or \"strain\", you can switch on -S/--pseudo-strain
to use the node with the lowest rank as the subspecies/strain name, provided its rank is lower than \"species\". Version 0.14.1 or later is recommended.
$ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\\n | taxonkit lineage -n -r \\\n | taxonkit reformat -f '{t};{S};{T}' --pseudo-strain \\\n | csvtk -H -t cut -f 1,4,3,5 \\\n | csvtk -H -t sep -f 4 -s ';' -R \\\n | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\\n | csvtk pretty -t\n\ntaxid rank name subspecies/strain subspecies strain\n------- ---------- ----------------------------------------------- ----------------------------------------------- ----------------------------------------------- -----------------------------------------------\n239935 species Akkermansia muciniphila\n83333 strain Escherichia coli K-12 Escherichia coli K-12 Escherichia coli K-12\n1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178\n2697049 no rank Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2\n2605619 no rank Escherichia coli O16:H48 Escherichia coli O16:H48 Escherichia coli O16:H48 Escherichia coli O16:H48\n
Add prefix (-P/--add-prefix
).
$ cat lineage.txt \\\n | taxonkit reformat -P \\\n | csvtk -H -t cut -f 1,3\n\n9606 k__Eukaryota;p__Chordata;c__Mammalia;o__Primates;f__Hominidae;g__Homo;s__Homo sapiens\n9913 k__Eukaryota;p__Chordata;c__Mammalia;o__Artiodactyla;f__Bovidae;g__Bos;s__Bos taurus\n376619 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Thiotrichales;f__Francisellaceae;g__Francisella;s__Francisella tularensis\n349741 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila\n239935 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila\n314101 k__Bacteria;p__;c__;o__;f__;g__;s__uncultured murine large bowel bacterium BAC 54B\n11932 k__Viruses;p__Artverviricota;c__Revtraviricetes;o__Ortervirales;f__Retroviridae;g__Intracisternal A-particles;s__Mouse Intracisternal A-particle\n1327037 k__Viruses;p__Uroviricota;c__Caudoviricetes;o__Caudovirales;f__Siphoviridae;g__;s__Croceibacter phage P2559Y\n92489 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Erwiniaceae;g__Erwinia;s__Erwinia oleae\n1458427 k__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__Comamonadaceae;g__Serpentinomonas;s__Serpentinomonas raichei\n
Show corresponding taxids of reformatted lineage (flag -t/--show-lineage-taxids
)
$ cat lineage.txt \\\n | taxonkit reformat -t \\\n | csvtk -H -t cut -f 1,4 \\\n | csvtk -H -t sep -f 2 -s ';' -R \\\n | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species \\\n | csvtk pretty -t\n\ntaxid kingdom phylum class order family genus species\n------- ------- ------- ------- ------- ------- ------- -------\n9606 2759 7711 40674 9443 9604 9605 9606\n9913 2759 7711 40674 91561 9895 9903 9913\n376619 2 1224 1236 72273 34064 262 263\n349741 2 74201 203494 48461 1647988 239934 239935\n239935 2 74201 203494 48461 1647988 239934 239935\n314101 2 314101\n11932 10239 2732409 2732514 2169561 11632 11749 11932\n1327037 10239 2731618 2731619 28883 10699 1327037\n92489 2 1224 1236 91347 1903409 551 796334\n1458427 2 1224 28216 80840 80864 2490452 1458425\n
Use custom symbols for unclassified ranks (-r/--miss-rank-repl
)
$ taxonkit reformat lineage.txt -r \"__\" | cut -f 3\nEukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens\nEukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus\nBacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis\nBacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\nBacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\nBacteria;__;__;__;__;__;uncultured murine large bowel bacterium BAC 54B\nViruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\nViruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;__;Croceibacter phage P2559Y\nBacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae\nBacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei\n\n$ taxonkit reformat lineage.txt -r Unassigned | cut -f 3\nEukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens\nEukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus\nBacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis\nBacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\nBacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\nBacteria;Unassigned;Unassigned;Unassigned;Unassigned;Unassigned;uncultured murine large bowel bacterium BAC 54B\nViruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\nViruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;Unassigned;Croceibacter phage P2559Y\nBacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia 
oleae\nBacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei\n
Estimate and fill missing ranks with original lineage information (-F/--fill-miss-rank
, very useful for formatting input data for LEfSe). You can change the prefix \"unclassified\" using flag -p/--miss-rank-repl-prefix
.
$ cat lineage.txt \\\n | taxonkit reformat -F \\\n | csvtk -H -t cut -f 1,3 \\\n | csvtk -H -t sep -f 2 -s ';' -R \\\n | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species \\\n | csvtk pretty -t\n\ntaxid kingdom phylum class order family genus species\n------- --------- ---------------------------- --------------------------- --------------------------- ---------------------------- ------------------------------- -----------------------------------------------\n9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens\n9913 Eukaryota Chordata Mammalia Artiodactyla Bovidae Bos Bos taurus\n376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis\n349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila\n239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila\n314101 Bacteria unclassified Bacteria phylum unclassified Bacteria class unclassified Bacteria order unclassified Bacteria family unclassified Bacteria genus uncultured murine large bowel bacterium BAC 54B\n11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle\n1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae unclassified Siphoviridae genus Croceibacter phage P2559Y\n92489 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia Erwinia oleae\n1458427 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Serpentinomonas Serpentinomonas raichei\n
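The filling rule visible in this output (an empty rank becomes "unclassified", then the nearest higher taxon name, then the rank name) can be imitated on an already reformatted seven-level lineage. This is a rough sketch for illustration, not TaxonKit's implementation; `fill_missing_ranks` is a hypothetical name, and the rank order and the fixed "unclassified " prefix are taken from the default format and output shown above.

```shell
# Hypothetical helper: fill empty fields of a 7-level ';'-separated lineage
# with "unclassified <nearest higher taxon> <rank name>".
fill_missing_ranks() {
    awk -F';' -v OFS=';' '
        BEGIN { n = split("superkingdom phylum class order family genus species", rank, " ") }
        {
            last = ""
            for (i = 1; i <= n; i++) {
                if ($i == "") {
                    if (last != "") $i = "unclassified " last " " rank[i]
                } else {
                    last = $i      # remember the nearest higher taxon name
                }
            }
            print
        }'
}

printf 'Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y\n' \
    | fill_missing_ranks
```

Unlike the real flag, this sketch sees only the seven reformatted fields, so it cannot use intermediate nodes of the full lineage.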
Do not add prefix or suffix for estimated nodes:
$ echo 314101 | taxonkit reformat -I 1\n314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B\n$ echo 314101 | taxonkit reformat -I 1 -F -p \"\" -s \"\"\n314101 Bacteria;Bacteria;Bacteria;Bacteria;Bacteria;Bacteria;uncultured murine large bowel bacterium BAC 54B\n
Output only some ranks.
$ cat lineage.txt \\\n | taxonkit reformat -F -f \"{s};{p}\"\\\n | csvtk -H -t cut -f 1,3 \\\n | csvtk -H -t sep -f 2 -s ';' -R \\\n | csvtk add-header -t -n taxid,species,phylum \\\n | csvtk pretty -t\n\ntaxid species phylum\n------- ----------------------------------------------- ----------------------------\n9606 Homo sapiens Chordata\n9913 Bos taurus Chordata\n376619 Francisella tularensis Proteobacteria\n349741 Akkermansia muciniphila Verrucomicrobia\n239935 Akkermansia muciniphila Verrucomicrobia\n314101 uncultured murine large bowel bacterium BAC 54B unclassified Bacteria phylum\n11932 Mouse Intracisternal A-particle Artverviricota\n1327037 Croceibacter phage P2559Y Uroviricota\n92489 Erwinia oleae Proteobacteria\n1458427 Serpentinomonas raichei Proteobacteria\n
For TaxIds whose rank is higher than the lowest rank in -f/--format
, use -T/--trim
to avoid filling in missing ranks lower than the current rank.
$ echo -ne \"2\\n239934\\n239935\\n\" \\\n | taxonkit lineage \\\n | taxonkit reformat -F \\\n | sed -r \"s/;+$//\" \\\n | csvtk -H -t cut -f 1,3\n\n2 Bacteria;unclassified Bacteria phylum;unclassified Bacteria class;unclassified Bacteria order;unclassified Bacteria family;unclassified Bacteria genus;unclassified Bacteria species\n239934 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;unclassified Akkermansia species\n239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n\n$ echo -ne \"2\\n239934\\n239935\\n\" \\\n | taxonkit lineage \\\n | taxonkit reformat -F -T \\\n | sed -r \"s/;+$//\" \\\n | csvtk -H -t cut -f 1,3\n\n2 Bacteria\n239934 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia\n239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n
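The `sed` step in the pipelines above only strips the run of semicolons left at the end of a line by empty trailing ranks. A minimal sketch using plain `sed`; the helper name `strip_trailing_semicolons` and the sample input are hypothetical and not real TaxonKit output:

```shell
#!/bin/sh
# Hypothetical helper: remove any run of ";" at the end of each line,
# mirroring the sed step used after "taxonkit reformat -F -T".
strip_trailing_semicolons() {
    sed -E 's/;+$//'
}

# Hand-written sample (assumed shape): a lineage whose lower ranks are empty.
printf 'Bacteria;;;;;;\n' | strip_trailing_semicolons
# prints: Bacteria
```

Semicolons between non-empty ranks are left untouched, because the pattern is anchored to the end of the line.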
Tabs are supported in the format string.
$ echo 9606 \\\n | taxonkit lineage \\\n | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{S}\" \\\n | csvtk cut -t -f -2\n\n9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens\n
List seven-level lineage for all TaxIds.
# replace empty taxon with \"Unassigned\"\n$ taxonkit list --ids 1 \\\n | taxonkit lineage \\\n | taxonkit reformat -r Unassigned \\\n | gzip -c > all.lineage.tsv.gz\n\n# tab-delimited seven-levels\n$ taxonkit list --ids 1 \\\n | taxonkit lineage \\\n | taxonkit reformat -r Unassigned -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" \\\n | csvtk cut -H -t -f -2 \\\n | head -n 5 \\\n | csvtk pretty -H -t\n\n# 8-level\n$ taxonkit list --ids 1 \\\n | taxonkit lineage \\\n | taxonkit reformat -r Unassigned -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\\n | csvtk cut -H -t -f -2 \\\n | head -n 5 \\\n | csvtk pretty -H -t\n\n# Fill and trim\n$ memusg -t -s ' taxonkit list --ids 1 \\\n | taxonkit lineage \\\n | taxonkit reformat -F -T \\\n | sed -r \"s/;+$//\" \\\n | gzip -c > all.lineage.tsv.gz '\n\nelapsed time: 19.930s\npeak rss: 6.25 GB\n
From TaxId to 7-rank lineage:
$ cat taxids.txt | taxonkit lineage | taxonkit reformat\n\n# for taxonkit v0.8.0 or later versions\n$ cat taxids.txt | taxonkit reformat -I 1\n
Some TaxIds have the same complete lineage; an empty result is returned for them by default. You can use the flag -a/--output-ambiguous-result
to return one possible result. See #42
$ echo -ne \"2507530\\n2516889\\n\" | taxonkit lineage --data-dir . | taxonkit reformat --data-dir . -t \n19:18:29.770 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. But you can use -a/--output-ambiguous-result to return one possible result\n19:18:29.770 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. But you can use -a/--output-ambiguous-result to return one possible result\n2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019\n2516889 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019\n\n$ echo -ne \"2507530\\n2516889\\n\" | taxonkit lineage --data-dir . | taxonkit reformat --data-dir . -t -a\n2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530\n2516889 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 
8 KA-2019 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530\n
Usage
Convert taxon names to TaxIds\n\nAttention:\n\n 1. Some TaxIds share the same names, e.g, Drosophila.\n These input lines are duplicated with multiple TaxIds.\n\n $ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L\n Drosophila 7215 genus\n Drosophila 32281 subgenus\n Drosophila 2081351 genus\n\nUsage:\n taxonkit name2taxid [flags]\n\nFlags:\n -h, --help help for name2taxid\n -i, --name-field int field index of name. data should be tab-separated (default 1)\n -s, --sci-name only searching scientific names\n -r, --show-rank show rank\n
Examples
Example data
$ cat names.txt\nHomo sapiens\nAkkermansia muciniphila ATCC BAA-835\nAkkermansia muciniphila\nMouse Intracisternal A-particle\nWei Shen\nuncultured murine large bowel bacterium BAC 54B\nCroceibacter phage P2559Y\n
Default.
# taxonkit name2taxid names.txt\n$ cat names.txt | taxonkit name2taxid | csvtk pretty -H -t\nHomo sapiens 9606\nAkkermansia muciniphila ATCC BAA-835 349741\nAkkermansia muciniphila 239935\nMouse Intracisternal A-particle 11932\nWei Shen \nuncultured murine large bowel bacterium BAC 54B 314101\nCroceibacter phage P2559Y 1327037\n
Show rank.
$ cat names.txt | taxonkit name2taxid --show-rank | csvtk pretty -H -t\nHomo sapiens 9606 species\nAkkermansia muciniphila ATCC BAA-835 349741 strain\nAkkermansia muciniphila 239935 species\nMouse Intracisternal A-particle 11932 species\nWei Shen \nuncultured murine large bowel bacterium BAC 54B 314101 species\nCroceibacter phage P2559Y 1327037 species\n
From name to lineage.
$ cat names.txt | taxonkit name2taxid | taxonkit lineage --taxid-field 2\nHomo sapiens 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens\nAkkermansia muciniphila ATCC BAA-835 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\nAkkermansia muciniphila 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\nMouse Intracisternal A-particle 11932 Viruses;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\nWei Shen\nuncultured murine large bowel bacterium BAC 54B 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B\nCroceibacter phage P2559Y 1327037 Viruses;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y\n
Convert old names to new names.
$ echo Lactobacillus fermentum | taxonkit name2taxid | taxonkit lineage -i 2 -n | cut -f 1,2,4\nLactobacillus fermentum 1613 Limosilactobacillus fermentum\n
Some TaxIds share the same scientific name, e.g., Drosophila.
$ echo Drosophila \\\n | taxonkit name2taxid \\\n | taxonkit lineage -i 2 -r \\\n | taxonkit reformat -i 3 \\\n | csvtk cut -H -t -f 1,2,4,5 \\\n | csvtk pretty -H -t\nDrosophila 7215 genus Eukaryota;Arthropoda;Insecta;Diptera;Drosophilidae;Drosophila;\nDrosophila 32281 subgenus Eukaryota;Arthropoda;Insecta;Diptera;Drosophilidae;Drosophila;\nDrosophila 2081351 genus Eukaryota;Basidiomycota;Agaricomycetes;Agaricales;Psathyrellaceae;Drosophila;\n
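Because one name can map to several TaxIds, it can help to list the ambiguous names before continuing. A minimal sketch with `awk`, assuming two-column (name, TaxId) tab-separated input as produced by `taxonkit name2taxid`; the helper names and the inline sample (built from the TaxIds shown above) are hypothetical:

```shell
#!/bin/sh
# Sample name<TAB>TaxId rows; the Drosophila TaxIds are copied from the
# example above. In practice this would be "taxonkit name2taxid" output.
sample_name2taxid() {
    printf 'Drosophila\t7215\n'
    printf 'Drosophila\t32281\n'
    printf 'Drosophila\t2081351\n'
    printf 'Homo sapiens\t9606\n'
}

# Hypothetical helper: print each name that maps to more than one TaxId,
# followed by its comma-separated TaxIds. Rows with an empty TaxId
# (names that were not found) are ignored.
ambiguous_names() {
    awk -F'\t' '
        $2 != "" {
            n[$1]++
            ids[$1] = (ids[$1] == "" ? $2 : ids[$1] "," $2)
        }
        END {
            for (name in n)
                if (n[name] > 1)
                    printf "%s\t%s\n", name, ids[name]
        }'
}

sample_name2taxid | ambiguous_names
```

Names reported this way need a manual choice of TaxId, for example by inspecting rank and lineage as in the example above.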
Usage
Filter TaxIds by taxonomic rank range\n\nAttentions:\n\n 1. Flag -L/--lower-than and -H/--higher-than are exclusive, and can be\n used along with -E/--equal-to which values can be different.\n 2. A list of pre-ordered ranks is in ~/.taxonkit/ranks.txt, you can use\n your list by -r/--rank-file, the format specification is below.\n 3. All ranks in taxonomy database should be defined in rank file.\n 4. Ranks can be removed with black list via -B/--black-list.\n\n 5. TaxIDs with no rank are kept by default!!!\n They can be optionally discarded by -N/--discard-noranks.\n 6. [Recommended] When filtering with -L/--lower-than, you can use\n -n/--save-predictable-norank to save some special ranks without order,\n where rank of the closest higher node is still lower than rank cutoff.\n\nRank file:\n\n 1. Blank lines or lines starting with \"#\" are ignored.\n 2. Ranks are in decending order and case ignored.\n 3. Ranks with same order should be in one line separated with comma (\",\", no space).\n 4. 
Ranks without order should be assigned a prefix symbol \"!\" for each rank.\n\nUsage:\n taxonkit filter [flags]\n\nFlags:\n -B, --black-list strings black list of ranks to discard, e.g., '-B \"no rank\" -B \"clade\"\n -N, --discard-noranks discard all ranks without order, type \"taxonkit filter --help\" for details\n -R, --discard-root discard root taxid, defined by --root-taxid\n -E, --equal-to strings output TaxIds with rank equal to some ranks, multiple values can be\n separated with comma \",\" (e.g., -E \"genus,species\"), or give multiple\n times (e.g., -E genus -E species)\n -h, --help help for filter\n -H, --higher-than string output TaxIds with rank higher than a rank, exclusive with --lower-than\n --list-order list user defined ranks in order, from \"$HOME/.taxonkit/ranks.txt\"\n --list-ranks list ordered ranks in taxonomy database, sorted in user defined order\n -L, --lower-than string output TaxIds with rank lower than a rank, exclusive with --higher-than\n -r, --rank-file string user-defined ordered taxonomic ranks, type \"taxonkit filter --help\"\n for details\n --root-taxid uint32 root taxid (default 1)\n -n, --save-predictable-norank do not discard some special ranks without order when using -L, where\n rank of the closest higher node is still lower than rank cutoff\n -i, --taxid-field int field index of taxid. input data should be tab-separated (default 1)\n
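The rank file rules above can be illustrated with a small file. This is a sketch only: the file below is hypothetical, it is not the default ~/.taxonkit/ranks.txt, and the grouping of `subspecies` with `strain` is chosen purely for illustration. The helper `list_rank_order` is likewise an assumption and not part of TaxonKit:

```shell
#!/bin/sh
# Write a small, hypothetical rank file that follows the four rules above.
rank_file=$(mktemp)
cat > "$rank_file" <<'EOF'
# ranks in descending order, one level per line
superkingdom
phylum
class
order
family
genus
species
subspecies,strain

# ranks without order carry a "!" prefix
!no rank
!clade
EOF

# Hypothetical helper: print "level<TAB>rank" for every ordered rank
# (1 = highest), skipping blank lines, comments and "!"-prefixed ranks.
list_rank_order() {
    awk '
        /^[[:space:]]*$/ || /^#/ { next }   # blank or comment line
        /^!/ { next }                       # rank without order
        {
            level++
            n = split(tolower($0), ranks, ",")
            for (i = 1; i <= n; i++) printf "%d\t%s\n", level, ranks[i]
        }' "$1"
}

list_rank_order "$rank_file"
```

A file like this would be passed with -r/--rank-file. Note rule 3: every rank present in the taxonomy database has to be defined, so this shortened file would not be enough for the full NCBI taxonomy.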
Examples
Example data
$ echo 349741 | taxonkit lineage -t | cut -f 3 | sed 's/;/\\n/g' > taxids2.txt\n\n$ cat taxids2.txt\n131567\n2\n1783257\n74201\n203494\n48461\n1647988\n239934\n239935\n349741\n\n$ cat taxids2.txt | taxonkit lineage -r | csvtk -Ht cut -f 1,3,2 | csvtk pretty -H -t\n131567 no rank cellular organisms\n2 superkingdom cellular organisms;Bacteria\n1783257 clade cellular organisms;Bacteria;PVC group\n74201 phylum cellular organisms;Bacteria;PVC group;Verrucomicrobia\n203494 class cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae\n48461 order cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales\n1647988 family cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae\n239934 genus cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia\n239935 species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n349741 strain cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\n
Equal to certain rank(s) (-E/--equal-to
)
$ cat taxids2.txt \\\n | taxonkit filter -E Phylum -E Class \\\n | taxonkit lineage -r \\\n | csvtk -Ht cut -f 1,3,2 \\\n | csvtk pretty -H -t\n74201 phylum cellular organisms;Bacteria;PVC group;Verrucomicrobia\n203494 class cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae\n
Lower than a rank (-L/--lower-than
)
$ cat taxids2.txt \\\n | taxonkit filter -L genus \\\n | taxonkit lineage -r -n -L \\\n | csvtk -Ht cut -f 1,3,2 \\\n | csvtk pretty -H -t\n239935 species Akkermansia muciniphila\n349741 strain Akkermansia muciniphila ATCC BAA-835\n
Higher than a rank (-H/--higher-than
)
$ cat taxids2.txt \\\n | taxonkit filter -H phylum \\\n | taxonkit lineage -r -n -L \\\n | csvtk -Ht cut -f 1,3,2 \\\n | csvtk pretty -H -t\n2 superkingdom Bacteria\n
TaxIds with no rank are kept by default! \"no rank\" and \"clade\" have no rank and can be filtered out via -N/--discard-noranks
. Further ranks can be removed with a blacklist via -B/--black-list
.
# 562 is the TaxId of Escherichia coli\n$ taxonkit list --ids 562 \\\n | taxonkit filter -L species \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk freq -Ht -f 2 -nr \\\n | csvtk pretty -H -t\nstrain 2950\nno rank 149\nserotype 141\nserogroup 95\nisolate 1\nsubspecies 1\n\n$ taxonkit list --ids 562 \\\n | taxonkit filter -L species -N -B strain \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk freq -Ht -f 2 -nr \\\n | csvtk pretty -H -t\nserotype 141\nserogroup 95\nisolate 1\nsubspecies 1\n
Combination of -L/-H
with -E
.
$ cat taxids2.txt \\\n | taxonkit filter -L genus -E genus \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -H -t\n239934 genus Akkermansia\n239935 species Akkermansia muciniphila\n349741 strain Akkermansia muciniphila ATCC BAA-835\n
Special cases of \"no rank\". (-n/--save-predictable-norank
). When filtering with -L/--lower-than
, you can use -n/--save-predictable-norank
to save some special ranks without order, where rank of the closest higher node is still lower than rank cutoff.
$ echo -ne \"2605619\\n1327037\\n\" \\\n | taxonkit lineage -t \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -H -t \n131567 no rank cellular organisms\n2 superkingdom Bacteria\n1224 phylum Proteobacteria\n1236 class Gammaproteobacteria\n91347 order Enterobacterales\n543 family Enterobacteriaceae\n561 genus Escherichia\n562 species Escherichia coli\n2605619 no rank Escherichia coli O16:H48\n\n10239 superkingdom Viruses\n2731341 clade Duplodnaviria\n2731360 clade Heunggongvirae\n2731618 phylum Uroviricota\n2731619 class Caudoviricetes\n28883 order Caudovirales\n10699 family Siphoviridae\n196894 no rank unclassified Siphoviridae\n1327037 species Croceibacter phage P2559Y\n\n# save taxids\n$ echo -ne \"2605619\\n1327037\\n\" \\\n | taxonkit lineage -t \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | tee taxids4.txt\n131567\n2\n1224\n1236\n91347\n543\n561\n562\n2605619\n10239\n2731341\n2731360\n2731618\n2731619\n28883\n10699\n196894\n1327037\n
Now, filter nodes of rank <= species.
$ cat taxids4.txt \\\n | taxonkit filter -L species -E species -N -n \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -H -t\n562 species Escherichia coli\n2605619 no rank Escherichia coli O16:H48\n1327037 species Croceibacter phage P2559Y\n
Note that 2605619 (no rank) is saved because its parent node 562 is <= species.
Usage
Compute lowest common ancestor (LCA) for TaxIds\n\nAttention:\n\n 1. This command computes LCA TaxId for a list of TaxIds \n in a field (\"-i/--taxids-field) of tab-delimited file or STDIN.\n 2. TaxIDs should have the same separator (\"-s/--separator\"),\n single charactor separator is prefered.\n 3. Empty lines or lines without valid TaxIds in the field are omitted.\n 4. If some TaxIds are not found in database, it returns 0.\n\nExamples:\n\n $ echo 239934, 239935, 349741 | taxonkit lca -s \", \"\n 239934, 239935, 349741 239934\n\n $ time echo 239934 239935 349741 9606 | taxonkit lca\n 239934 239935 349741 9606 131567\n\nUsage:\n taxonkit lca [flags] \n\nFlags:\n -b, --buffer-size string size of line buffer, supported unit: K, M, G. You need to increase the\n value when \"bufio.Scanner: token too long\" error occured (default \"1M\")\n -h, --help help for lca\n --separater string separater for TaxIds. This flag is same to --separator. (default \" \")\n -s, --separator string separator for TaxIds (default \" \")\n -D, --skip-deleted skip deleted TaxIds and compute with left ones\n -U, --skip-unfound skip unfound TaxIds and compute with left ones\n -i, --taxids-field int field index of TaxIds. Input data should be tab-separated (default 1)\n
Examples:
Example data
$ taxonkit list --ids 9605 -nr --indent \" \"\n9605 [genus] Homo\n 9606 [species] Homo sapiens\n 63221 [subspecies] Homo sapiens neanderthalensis\n 741158 [subspecies] Homo sapiens subsp. 'Denisova'\n 1425170 [species] Homo heidelbergensis\n 2665952 [no rank] environmental samples\n 2665953 [species] Homo sapiens environmental sample\n
Simple one
$ echo 63221 2665953 | taxonkit lca\n63221 2665953 9605\n
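Conceptually, the LCA is the last node shared by the root-to-node paths of all input TaxIds. The sketch below illustrates that idea with `awk` on lineage TaxId paths; it is not how TaxonKit computes the LCA internally. The helper name is hypothetical, and the paths are partial lineages starting at genus Homo (9605), read off the tree above:

```shell
#!/bin/sh
# Hypothetical helper: given lines of ";"-separated lineage TaxIds
# (highest rank first), print the last TaxId shared by all lines.
# Prints 0 when nothing is shared (a choice made for this sketch).
lca_from_lineages() {
    awk -F';' '
        NR == 1 { n = NF; for (i = 1; i <= NF; i++) common[i] = $i; next }
        {
            # shrink the shared prefix to where this line still agrees
            for (i = 1; i <= n && i <= NF; i++)
                if ($i != common[i]) break
            n = i - 1
        }
        END { if (n > 0) print common[n]; else print 0 }'
}

# 63221 (Homo sapiens neanderthalensis) vs 2665953 (environmental sample)
printf '9605;9606;63221\n9605;2665952;2665953\n' | lca_from_lineages
# prints: 9605
```

The result agrees with the `taxonkit lca` output above, where the two TaxIds only share the genus node 9605.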
Custom field (-i/--taxids-field
) and separator (-s/--separator
).
$ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\"\na 63221,2665953\nb 63221, 741158\n\n$ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" \\\n | taxonkit lca -i 2 -s \",\"\na 63221,2665953 9605\nb 63221, 741158 9606\n
Merged TaxIds.
# merged\n$ echo 92487 92488 92489 | taxonkit lca\n10:08:26.578 [WARN] taxid 92489 was merged into 796334\n92487 92488 92489 1236\n
Deleted TaxIds: you can omit these and continue computing with the remaining ones (-D/--skip-deleted
).
$ echo 1 2 3 | taxonkit lca \n10:30:17.678 [WARN] taxid 3 not found\n1 2 3 0\n\n$ time echo 1 2 3 | taxonkit lca -D\n10:29:31.828 [WARN] taxid 3 was deleted\n1 2 3 1\n
TaxIds not found in the database: you can omit these and continue computing with the remaining ones (-U/--skip-unfound
).
$ echo 61021 61022 11111111 | taxonkit lca\n10:31:44.929 [WARN] taxid 11111111 not found\n61021 61022 11111111 0\n\n$ echo 61021 61022 11111111 | taxonkit lca -U\n10:32:02.772 [WARN] taxid 11111111 not found\n61021 61022 11111111 2628496\n
Usage
Create TaxId changelog from dump archives\n\nSteps:\n\n # dependencies:\n # rush - https://github.com/shenwei356/rush/\n\n mkdir -p archive; cd archive;\n\n # --------- download ---------\n\n # option 1\n # for fast network connection\n wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp*.zip\n\n # option 2\n # for slow network connection\n url=https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/\n wget $url -O - -o /dev/null \\\n | grep taxdmp | perl -ne '/(taxdmp_.+?.zip)/; print \"$1\\n\";' \\\n | rush -j 2 -v url=$url 'axel -n 5 {url}/{}' \\\n --immediate-output -c -C download.rush\n\n # --------- unzip ---------\n\n ls taxdmp*.zip | rush -j 1 'unzip {} names.dmp nodes.dmp merged.dmp delnodes.dmp -d {@_(.+)\\.}'\n\n # optionally compress .dmp files with pigz, for saving disk space\n fd .dmp$ | rush -j 4 'pigz {}'\n\n # --------- create log ---------\n\n cd ..\n taxonkit taxid-changelog -i archive -o taxid-changelog.csv.gz --verbose\n\nOutput format (CSV):\n\n # fields comments\n taxid # taxid\n version # version / time of archive, e.g, 2019-07-01\n change # change, values:\n # NEW newly added\n # REUSE_DEL deleted taxids being reused\n # REUSE_MER merged taxids being reused\n # DELETE deleted\n # MERGE merged into another taxid\n # ABSORB other taxids merged into this one\n # CHANGE_NAME scientific name changed\n # CHANGE_RANK rank changed\n # CHANGE_LIN_LIN lineage taxids remain but lineage remain\n # CHANGE_LIN_TAX lineage taxids changed\n # CHANGE_LIN_LEN lineage length changed\n change-value # variable values for changes: \n # 1) new taxid for MERGE\n # 2) merged taxids for ABSORB\n # 3) empty for others\n name # scientific name\n rank # rank\n lineage # complete lineage of the taxid\n lineage-taxids # taxids of the lineage\n\n # you can use csvtk to investigate them. 
e.g.,\n csvtk grep -f taxid -p 1390515 taxid-changelog.csv.gz\n\nUsage:\n taxonkit taxid-changelog [flags]\n\nFlags:\n -i, --archive string directory containing uncompressed dumped archives\n -h, --help help for taxid-changelog\n
Details
Example 1 (E. coli with TaxId 562
)
$ pigz -cd taxid-changelog.csv.gz \\\n | csvtk grep -f taxid -p 562 \\\n | csvtk pretty\ntaxid version change change-value name rank lineage lineage-taxids\n562 2014-08-01 NEW Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562\n562 2014-08-01 ABSORB 662101;662104 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562\n562 2015-11-01 ABSORB 1637691 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562\n562 2016-10-01 CHANGE_LIN_LIN Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562\n562 2018-06-01 ABSORB 469598 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562\n\n# merged taxids\n$ pigz -cd taxid-changelog.csv.gz \\\n | csvtk grep -f taxid -p 662101,662104,1637691,469598 \\\n | csvtk pretty\ntaxid version change change-value name rank lineage lineage-taxids\n469598 2014-08-01 NEW Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598\n469598 2016-10-01 CHANGE_LIN_LIN Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598\n469598 2018-06-01 MERGE 562 Escherichia sp. 
3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598\n662101 2014-08-01 MERGE 562 \n662104 2014-08-01 MERGE 562 \n1637691 2015-04-01 DELETE \n1637691 2015-05-01 REUSE_DEL Escherichia sp. MAR species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. MAR 131567;2;1224;1236;91347;543;561;1637691\n1637691 2015-11-01 MERGE 562 Escherichia sp. MAR species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. MAR 131567;2;1224;1236;91347;543;561;1637691\n
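The MERGE records can be turned into a simple old-to-new TaxId mapping. A minimal sketch with `awk`, assuming the first four CSV columns (taxid, version, change, change-value) contain no commas, which holds for the records shown above; later columns such as name and lineage may need a real CSV parser like csvtk. The helper names and the reduced sample rows are hypothetical:

```shell
#!/bin/sh
# Rows reduced to the first four changelog columns, taken from the
# merged-TaxId records shown above.
sample_changelog() {
    cat <<'EOF'
taxid,version,change,change-value
662101,2014-08-01,MERGE,562
662104,2014-08-01,MERGE,562
1637691,2015-04-01,DELETE,
1637691,2015-05-01,REUSE_DEL,
1637691,2015-11-01,MERGE,562
469598,2018-06-01,MERGE,562
EOF
}

# Hypothetical helper: print "old_taxid<TAB>new_taxid" for every MERGE
# record, skipping the header line.
merged_pairs() {
    awk -F',' 'NR > 1 && $3 == "MERGE" { printf "%s\t%s\n", $1, $4 }'
}

sample_changelog | merged_pairs
```

With the real file, the same helper could be fed from `pigz -cd taxid-changelog.csv.gz`, as in the examples above.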
Example 2 (SARS-CoV-2).
$ time pigz -cd taxid-changelog.csv.gz \\\n | csvtk grep -f taxid -p 2697049 \\\n | csvtk pretty\ntaxid version change change-value name rank lineage lineage-taxids\n2697049 2020-02-01 NEW Wuhan seafood market pneumonia virus species Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;unclassified Betacoronavirus;Wuhan seafood market pneumonia virus 10239;2559587;76804;2499399;11118;2501931;694002;696098;2697049\n2697049 2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049\n2697049 2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049\n2697049 2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049\n2697049 2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 
10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049\n2697049 2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049\n2697049 2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049\n\nreal 0m7.644s\nuser 0m16.749s\nsys 0m3.985s\n
Example 3 (All subspecies and strains in Akkermansia muciniphila 239935)
# species in Akkermansia\n$ taxonkit list --show-rank --show-name --indent \" \" --ids 239935\n239935 [species] Akkermansia muciniphila\n 349741 [strain] Akkermansia muciniphila ATCC BAA-835\n\n# check them all \n$ pigz -cd taxid-changelog.csv.gz \\\n | csvtk grep -f taxid -P <(taxonkit list --indent \"\" --ids 239935) \\\n | csvtk pretty lineage-taxids\ntaxid version change change-value name rank lineage lineage-taxids\n239935 2014-08-01 NEW Akkermansia muciniphila species cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Akkermansia;Akkermansia muciniphila 131567;2;51290;74201;203494;48461;203557;239934;239935\n239935 2015-05-01 CHANGE_LIN_TAX Akkermansia muciniphila species cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;51290;74201;203494;48461;1647988;239934;239935\n239935 2016-03-01 CHANGE_LIN_TAX Akkermansia muciniphila species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935\n239935 2016-05-01 ABSORB 1834199 Akkermansia muciniphila species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935\n349741 2014-08-01 NEW Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;51290;74201;203494;48461;203557;239934;239935;349741\n349741 2015-05-01 CHANGE_LIN_TAX Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia 
group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;51290;74201;203494;48461;1647988;239934;239935;349741\n349741 2016-03-01 CHANGE_LIN_TAX Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741\n349741 2020-07-01 CHANGE_RANK Akkermansia muciniphila ATCC BAA-835 strain cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741\n
More
"},{"location":"usage/#create-taxdump","title":"create-taxdump","text":"Usage
Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV\n\nInput format: \n 0. For GTDB taxonomy file, just use --gtdb.\n We use the numeric assembly accession as the taxon at subspecies rank.\n (without the prefix GCA_ and GCF_, and version number).\n 1. The input file should be tab-delimited, at least one column is needed.\n 2. Ranks can be given either via the first row or the flag --rank-names.\n 3. The column containing the genome/assembly accession is recommended to\n generate TaxId mapping file (taxid.map, id -> taxid).\n -A/--field-accession, field contaning genome/assembly accession \n --field-accession-re, regular expression to extract the accession\n Note that mutiple TaxIds pointing to the same accession are listed as\n comma-seperated integers. \n\nAttentions:\n 1. Names should be distinct in taxa of different ranks.\n But for these missing some taxon nodes, using names of parent nodes is allowed:\n\n GB_GCA_018897955.1 d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155\n\n It can also detect duplicate names with different ranks, e.g.,\n the Class and Genus have the same name B47-G6, and the Order and Family\n between them have different names. In this case, we reassign a new TaxId\n by increasing the TaxId until it being distinct.\n\n GB_GCA_003663585.1 d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585\n\n 2. Taxa from different parents may have the same name.\n We will assign different TaxIds to them. \n\n E.g., in ICTV, many viruses from different species have the same names.\n In practice, we set the \"Virus names(s)\" as a subspecies rank and also\n specify it as the accession.\n\n Species Virus name(s)\n Jerseyvirus SETP3 Salmonella phage SETP7\n Jerseyvirus SETP7 Salmonella phage SETP7\n\n 3. 
The generated TaxIds are not consecutive numbers, however some tools like MMSeqs2\n required this, you can use the script below for convertion:\n\n https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/fix_taxdump.py\n\nUsage:\n taxonkit create-taxdump [flags] \n\nFlags:\n -A, --field-accession int field index of assembly accession (genome ID), for outputting taxid.map\n --field-accession-re string regular expression to extract assembly accession (default\n \"^\\\\w\\\\w_(.+)$\")\n --force overwrite existed output directory\n --gtdb input files are GTDB taxonomy file\n --gtdb-re-subs string regular expression to extract assembly accession as the subspecies\n (default \"^\\\\w\\\\w_GC[AF]_(.+)\\\\.\\\\d+$\")\n -h, --help help for create-taxdump\n --line-chunk-size int number of lines to process for each thread, and 4 threads is fast\n enough. (default 5000)\n --null strings null value of taxa (default [,NULL,NA])\n -x, --old-taxdump-dir string taxdump directory of the previous version, for generating merged.dmp\n and delnodes.dmp\n -O, --out-dir string output directory\n -R, --rank-names strings names of all ranks, leave it empty to use the first row of input as\n rank names\n
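To see what the two default regular expressions listed in the flags capture, they can be tried on a GTDB-style identifier with `sed`. This is a sketch under assumptions: the patterns are portable approximations (`\w` and `\d` are written as bracket expressions because POSIX `sed` lacks them), the helper names are hypothetical, and it assumes the first capture group is what gets used:

```shell
#!/bin/sh
# Approximations of the defaults listed above:
#   --field-accession-re  ^\w\w_(.+)$
#   --gtdb-re-subs        ^\w\w_GC[AF]_(.+)\.\d+$
# Both helpers print the text matched by the first capture group.
extract_accession() {
    sed -E 's/^[[:alnum:]_]{2}_(.+)$/\1/'
}
extract_gtdb_subspecies() {
    sed -E 's/^[[:alnum:]_]{2}_GC[AF]_(.+)\.[0-9]+$/\1/'
}

# Identifier taken from the GTDB example above.
echo 'GB_GCA_018897955.1' | extract_accession
# prints: GCA_018897955.1
echo 'GB_GCA_018897955.1' | extract_gtdb_subspecies
# prints: 018897955
```

The second result matches the description in the input format notes: the numeric assembly accession without the GCA_/GCF_ prefix and without the version number.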
Examples:
GTDB. See more: https://github.com/shenwei356/gtdb-taxdump
$ taxonkit create-taxdump --gtdb ar53_taxonomy_r207.tsv.gz bac120_taxonomy_r207.tsv.gz --out-dir taxdump\n16:42:35.213 [INFO] 317542 records saved to taxdump/taxid.map\n16:42:35.460 [INFO] 401815 records saved to taxdump/nodes.dmp\n16:42:35.611 [INFO] 401815 records saved to taxdump/names.dmp\n16:42:35.611 [INFO] 0 records saved to taxdump/merged.dmp\n16:42:35.611 [INFO] 0 records saved to taxdump/delnodes.dmp\n
ICTV. See more: https://github.com/shenwei356/ictv-taxdump
MGV. Only order, family, and genus information is available.
$ cat mgv_contig_info.tsv \\\n | csvtk cut -t -f ictv_order,ictv_family,ictv_genus,votu_id,contig_id \\\n | sed 1d \\\n > mgv.tsv\n\n$ taxonkit create-taxdump mgv.tsv --out-dir mgv --force -A 5 -R order,family,genus,species\n23:33:18.098 [INFO] 189680 records saved to mgv/taxid.map\n23:33:18.131 [INFO] 58102 records saved to mgv/nodes.dmp\n23:33:18.150 [INFO] 58102 records saved to mgv/names.dmp\n23:33:18.150 [INFO] 0 records saved to mgv/merged.dmp\n23:33:18.150 [INFO] 0 records saved to mgv/delnodes.dmp\n\n$ head -n 5 mgv/taxid.map \nMGV-GENOME-0364295 677052301\nMGV-GENOME-0364296 677052301\nMGV-GENOME-0364303 1414406025\nMGV-GENOME-0364311 1849074420\nMGV-GENOME-0364312 2074846424\n\n$ echo 677052301 | taxonkit lineage --data-dir mgv/ \n677052301 Caudovirales;crAss-phage;OTU-61123\n\n$ echo 677052301 | taxonkit reformat --data-dir mgv/ -I 1 -P\n677052301 k__;p__;c__;o__Caudovirales;f__crAss-phage;g__;s__OTU-61123\n\n$ grep MGV-GENOME-0364295 mgv.tsv \nCaudovirales crAss-phage NULL OTU-61123 MGV-GENOME-0364295\n
Custom lineages with the first row as rank names and treating one column as accession.
$ csvtk pretty -t example/taxonomy.tsv \nid superkingdom phylum class order family genus species\n--------------- ------------ -------------- ------------------- ---------------- ------------------ -------------- --------------------------\nGCF_001027105.1 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus aureus\nGCF_001096185.1 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Streptococcus pneumoniae\nGCF_001544255.1 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecium\nGCF_002949675.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Shigella Shigella dysenteriae\nGCF_002950215.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Shigella Shigella flexneri\nGCF_006742205.1 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus epidermidis\nGCF_000006945.2 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Salmonella Salmonella enterica\nGCF_000017205.1 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa\nGCF_003697165.2 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli\nGCF_009759685.1 Bacteria Proteobacteria Gammaproteobacteria Moraxellales Moraxellaceae Acinetobacter Acinetobacter baumannii\nGCF_000148585.2 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Streptococcus mitis\nGCF_000392875.1 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecalis\nGCF_000742135.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Klebsiella Klebsiella pneumonia\n\n# the first column as accession\n$ taxonkit create-taxdump -A 1 example/taxonomy.tsv -O example/taxdump\n16:31:31.828 [INFO] I will use the first row of input as rank names\n16:31:31.843 
[INFO] 13 records saved to example/taxdump/taxid.map\n16:31:31.843 [INFO] 39 records saved to example/taxdump/nodes.dmp\n16:31:31.843 [INFO] 39 records saved to example/taxdump/names.dmp\n16:31:31.843 [INFO] 0 records saved to example/taxdump/merged.dmp\n16:31:31.843 [INFO] 0 records saved to example/taxdump/delnodes.dmp\n\n$ export TAXONKIT_DB=example/taxdump\n$ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r\n1527235303 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species\n2983929374 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species\n3809813362 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis species\n4145431389 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecium species\n1569132721 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus species\n1920251658 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus epidermidis species\n3843752343 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa species\n72054943 Bacteria;Proteobacteria;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii species\n1678121664 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica species\n524994882 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella dysenteriae species\n2695851945 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella flexneri species\n3958205156 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae species\n4093283224 
Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli species\n\n$ head -n 3 example/taxdump/taxid.map\nGCF_001027105.1 1569132721\nGCF_001096185.1 2983929374\nGCF_001544255.1 4145431389\n
Custom lineages with the first row as rank names (pure lineage data)
$ csvtk cut -t -f 2- example/taxonomy.tsv | head -n 2 | csvtk pretty -t \nsuperkingdom phylum class order family genus species\n------------ ---------- ------- ---------- ----------------- -------------- ---------------------\nBacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus aureus\n\n$ csvtk cut -t -f 2- example/taxonomy.tsv \\\n | taxonkit create-taxdump -O example/taxdump2\n16:53:08.604 [INFO] I will use the first row of input as rank names\n16:53:08.614 [INFO] 39 records saved to example/taxdump2/nodes.dmp\n16:53:08.614 [INFO] 39 records saved to example/taxdump2/names.dmp\n16:53:08.614 [INFO] 0 records saved to example/taxdump2/merged.dmp\n16:53:08.615 [INFO] 0 records saved to example/taxdump2/delnodes.dmp\n\n$ export TAXONKIT_DB=example/taxdump2\n$ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r | head -n 2\n1527235303 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species\n2983929374 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species\n
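The help text above says that when two taxa of different ranks share a name (the class and genus B47-G6), a new TaxId is found by increasing the TaxId until it is distinct. A minimal sketch of that rule follows. It assumes the TaxId is derived from a hash of the name, and it uses cksum (CRC-32) purely as a stand-in; the hash TaxonKit actually uses is not stated here. The helper assign_taxid and the registry file are hypothetical names for illustration.

```shell
# Sketch only: cksum (CRC-32) stands in for whatever hash TaxonKit really uses.
registry=$(mktemp)   # one "taxid<TAB>rank:name" line per assigned taxon

# assign_taxid RANK NAME: set $taxid to a TaxId not owned by any other taxon.
assign_taxid() {
    key="$1:$2"
    # Hash the name alone, so equal names at different ranks collide on purpose.
    taxid=$(printf '%s' "$2" | cksum | cut -d ' ' -f 1)
    while :; do
        owner=$(awk -F '\t' -v id="$taxid" '$1 == id { print $2 }' "$registry")
        [ -z "$owner" ] && break            # free: claim it below
        [ "$owner" = "$key" ] && return 0   # already assigned to this taxon
        taxid=$((taxid + 1))                # owned by another taxon: try the next value
    done
    printf '%s\t%s\n' "$taxid" "$key" >> "$registry"
}

assign_taxid class B47-G6; class_id=$taxid
assign_taxid genus B47-G6; genus_id=$taxid
echo "class=$class_id genus=$genus_id"
```

In this sketch the genus ends up with the class TaxId plus one, and asking again for the class returns its original TaxId.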
Usage
Generate shell autocompletion script\n\nSupported shell: bash|zsh|fish|powershell\n\nBash:\n\n # generate completion shell\n taxonkit genautocomplete --shell bash\n\n # configure if never did.\n # install bash-completion if the \"complete\" command is not found.\n echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion\n echo \"source ~/.bash_completion\" >> ~/.bashrc\n\nZsh:\n\n # generate completion shell\n taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit\n\n # configure if never did\n echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc\n echo \"autoload -U compinit; compinit\" >> ~/.zshrc\n\nfish:\n\n taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish\n\nUsage:\n taxonkit genautocomplete [flags]\n\nFlags:\n --file string autocompletion file (default \"/home/shenwei/.bash_completion.d/taxonkit.sh\")\n -h, --help help for genautocomplete\n --type string autocompletion type (currently only bash supported) (default \"bash\")\n
"},{"location":"usage/#profile2cami","title":"profile2cami","text":"Usage
Convert metagenomic profile table to CAMI format\n\nInput format: \n 1. The input file should be tab-delimited\n 2. At least two columns needed:\n a) TaxId of taxon at species or lower rank.\n b) Abundance (could be percentage, automatically detected or use -p/--percentage).\n\nAttentions:\n 1. Some TaxIds may be merged to another ones in current taxonomy version,\n the abundances will be summed up.\n 2. Some TaxIds may be deleted in current taxonomy version,\n the abundances can be optionally recomputed with the flag -R/--recompute-abd.\n\nUsage:\n taxonkit profile2cami [flags]\n\nFlags:\n -a, --abundance-field int field index of abundance. input data should be tab-separated (default 2)\n -h, --help help for profile2cami\n -0, --keep-zero keep taxons with abundance of zero\n -p, --percentage abundance is in percentage\n -R, --recompute-abd recompute abundance if some TaxIds are deleted in current taxonomy version\n -s, --sample-id string sample ID in result file\n -r, --show-rank strings only show TaxIds and names of these ranks (default\n [superkingdom,phylum,class,order,family,genus,species,strain])\n -i, --taxid-field int field index of taxid. input data should be tab-separated (default 1)\n -t, --taxonomy-id string taxonomy ID in result file\n
Examples
Test data. Note that 2824115 is merged to 483329, and 1657696 is deleted in the current taxonomy version.
$ cat example/abundance.tsv \n2824115 0.2 merged to 483329\n483329 0.2 absord 2824115\n239935 0.5 no change\n1657696 0.1 deleted\n
Example:
$ taxonkit profile2cami -s sample1 -t 2021-10-01 \\\n example/abundance.tsv\n\n13:17:40.552 [WARN] taxid is deleted in current taxonomy version: 1657696\n13:17:40.552 [WARN] you may recomputed abundance with the flag -R/--recompute-abd\n@SampleID:sample1\n@Version:0.10.0\n@Ranks:superkingdom|phylum|class|order|family|genus|species|strain\n@TaxonomyID:2021-10-01\n@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE\n2 superkingdom 2 Bacteria 50.000000000000000\n2759 superkingdom 2759 Eukaryota 40.000000000000000\n74201 phylum 2|74201 Bacteria|Verrucomicrobia 50.000000000000000\n6656 phylum 2759|6656 Eukaryota|Arthropoda 40.000000000000000\n203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 50.000000000000000\n50557 class 2759|6656|50557 Eukaryota|Arthropoda|Insecta 40.000000000000000\n48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 50.000000000000000\n7041 order 2759|6656|50557|7041 Eukaryota|Arthropoda|Insecta|Coleoptera 40.000000000000000\n1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 50.000000000000000\n57514 family 2759|6656|50557|7041|57514 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae 40.000000000000000\n239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 50.000000000000000\n57515 genus 2759|6656|50557|7041|57514|57515 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus 40.000000000000000\n239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 50.000000000000000\n483329 species 2759|6656|50557|7041|57514|57515|483329 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus|Nicrophorus carolina 40.000000000000000\n
Recompute (normalize) the abundance
$ taxonkit profile2cami -s sample1 -t 2021-10-01 \\\n example/abundance.tsv --recompute-abd\n13:19:23.647 [WARN] taxid is deleted in current taxonomy version: 1657696\n@SampleID:sample1\n@Version:0.10.0\n@Ranks:superkingdom|phylum|class|order|family|genus|species|strain\n@TaxonomyID:2021-10-01\n@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE\n2 superkingdom 2 Bacteria 55.555555555555557\n2759 superkingdom 2759 Eukaryota 44.444444444444450\n74201 phylum 2|74201 Bacteria|Verrucomicrobia 55.555555555555557\n6656 phylum 2759|6656 Eukaryota|Arthropoda 44.444444444444450\n203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 55.555555555555557\n50557 class 2759|6656|50557 Eukaryota|Arthropoda|Insecta 44.444444444444450\n48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 55.555555555555557\n7041 order 2759|6656|50557|7041 Eukaryota|Arthropoda|Insecta|Coleoptera 44.444444444444450\n1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 55.555555555555557\n57514 family 2759|6656|50557|7041|57514 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae 44.444444444444450\n239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 55.555555555555557\n57515 genus 2759|6656|50557|7041|57514|57515 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus 44.444444444444450\n239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 55.555555555555557\n483329 species 2759|6656|50557|7041|57514|57515|483329 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus|Nicrophorus carolina 44.444444444444450\n
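The percentages in the two outputs above can be reproduced by hand: after merging 2824115 into 483329 and dropping the deleted 1657696, the remaining abundances are 0.4 and 0.5, and recomputing divides each by their sum, 0.9. The awk sketch below only illustrates that arithmetic on the test data. It is not TaxonKit's implementation, and the merged and deleted lookups are hard-coded assumptions taken from the example.

```shell
# Recompute relative abundances for the example data (illustrative arithmetic only).
result=$(printf '2824115\t0.2\n483329\t0.2\n239935\t0.5\n1657696\t0.1\n' |
    awk -F '\t' '
        BEGIN { merged["2824115"] = "483329"; deleted["1657696"] = 1 }
        {
            id = ($1 in merged) ? merged[$1] : $1   # follow the merge
            if (id in deleted) next                 # drop deleted TaxIds
            if (!(id in abd)) order[++n] = id       # remember first-seen order
            abd[id] += $2
            total += $2
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s\t%.4f\n", order[i], 100 * abd[order[i]] / total
        }')
printf '%s\n' "$result"
```

This prints 44.4444 for 483329 and 55.5556 for 239935, matching the species rows of the recomputed profile.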
See https://github.com/shenwei356/sun2021-cami-profiles
Usage
Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile\n\nInput format: \n The CAMI (Taxonomic) Profiling Output Format \n - https://github.com/CAMI-challenge/contest_information/blob/master/file_formats/CAMI_TP_specification.mkd\n - One file with mutiple samples is also supported.\n\nHow to:\n - No extra taxonomy data needed, so the original taxonomic information are\n used and not changed.\n - A mini taxonomic tree is built from records with abundance greater than\n zero, and only leaves are retained for later use. The rank of leaves may\n be \"strain\", \"species\", or \"no rank\".\n - Relative abundances (in percentage) are recomputed for all leaves\n (reference genome).\n - A new taxonomic tree is built from these leaves, and abundances are \n cumulatively added up from leaves to the root.\n\nExamples:\n 1. Remove Archaea, Bacteria, and EukaryoteS, only keep Viruses:\n taxonkit cami-filter -t 2,2157,2759 test.profile -o test.filter.profile\n 2. Remove Viruses:\n taxonkit cami-filter -t 10239 test.profile -o test.filter.profile\n\nUsage:\n taxonkit cami-filter [flags]\n\nFlags:\n --field-percentage int field index of PERCENTAGE (default 5)\n --field-rank int field index of taxid (default 2)\n --field-taxid int field index of taxid (default 1)\n --field-taxpath int field index of TAXPATH (default 3)\n --field-taxpathsn int field index of TAXPATHSN (default 4)\n -h, --help help for cami-filter\n --leaf-ranks strings only consider leaves at these ranks (default [species,strain,no rank])\n --show-rank strings only show TaxIds and names of these ranks (default\n [superkingdom,phylum,class,order,family,genus,species,strain])\n --taxid-sep string separator of taxid in TAXPATH and TAXPATHSN (default \"|\")\n -t, --taxids strings the parent taxid(s) to filter out\n -f, --taxids-file strings file(s) for the parent taxid(s) to filter out, one taxid per line\n
Examples:
taxonkit profile2cami -s sample1 -t 2021-10-01 \\\n example/abundance.tsv --recompute-abd \\\n | taxonkit cami-filter -t 2759\n@SampleID:sample1\n@Version:0.10.0\n@Ranks:superkingdom|phylum|class|order|family|genus|species|strain\n@TaxonomyID:2021-10-01\n@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE\n2 superkingdom 2 Bacteria 100.000000000000000\n74201 phylum 2|74201 Bacteria|Verrucomicrobia 100.000000000000000\n203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 100.000000000000000\n48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 100.000000000000000\n1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 100.000000000000000\n239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 100.000000000000000\n239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 100.000000000000000\n
NCBI taxonomy, version 2021-01-21
TaxIDs. Root node 1 is removed. These data should be updated along with the NCBI taxonomy dataset. TaxId sets of the sizes listed in the script below are sampled from nodes.dmp.
# shuffle all taxids\ncut -f 1 nodes.dmp | grep -w -v 1 | shuf > ids.txt\n\n# extract n taxids for testing\nfor n in 1 10 100 1000 2000 4000 6000 8000 10000 20000 40000 60000 80000 100000; do \n head -n $n ids.txt > taxids.n$n.txt\ndone\n
ETE
sudo pip3 install ete3\n\n# create database (the Python statements are fed to python3 via a here-document)\n# http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html#upgrading-the-local-database\npython3 - <<'EOF'\nfrom ete3 import NCBITaxa\nncbi = NCBITaxa()\nncbi.update_taxonomy_database()\nEOF\n
TaxonKit
mkdir -p $HOME/.taxonkit\nmkdir -p $HOME/bin/\n\n# data\nwget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \ntar -zxvf taxdump.tar.gz -C $HOME/.taxonkit\n\n# binary\nwget https://github.com/shenwei356/taxonkit/releases/download/v0.7.2/taxonkit_linux_amd64.tar.gz\ntar -zxvf taxonkit_linux_amd64.tar.gz -C $HOME/bin/\n
taxopy
sudo pip3 install -U taxopy\n\n# taxopy: identical dump files copied from taxonkit\nmkdir -p ~/.taxopy\ncp ~/.taxonkit/{nodes.dmp,names.dmp} ~/.taxopy\n
Scripts and commands are listed below. Python scripts were written following the official documentation, and parallelized querying was not used for any tool, including TaxonKit.
ETE get_lineage.ete.py < $infile > $outfile\ntaxopy get_lineage.taxopy.py < $infile > $outfile\ntaxonkit taxonkit lineage --threads 1 --delimiter \"; \" < $infile > $outfile\n
A Python script memusg was used to measure the running time and peak memory usage of a process. A Perl script run.pl
is used to run the tests automatically and generate data for plotting.
Running benchmark:
$ # emptying the buffers cache\n$ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\"\n\ntime perl run.pl -n 3 run_benchmark.sh -o bench.get_lineage.tsv\n
Checking result:
$ md5sum taxids.n*.lineage\n\n# clear\n$ rm *.lineage *.out\n
Plotting benchmark result. R libraries dplyr, ggplot2, scales, ggthemes, and ggrepel are needed.
# reformat dataset\n# tools: https://github.com/shenwei356/csvtk/\n\nfor f in taxids.n*.txt; do wc -l $f; done \\\n | sort -k 1,1n \\\n | awk '{ print($2\"\\t\"$1) }' \\\n > dataset_rename.tsv\n\ncat bench.get_lineage.tsv \\\n | csvtk sort -t -L dataset:<(cut -f 1 dataset_rename.tsv) -k dataset:u -k app \\\n | csvtk replace -t -f dataset -k dataset_rename.tsv -p '(.+)' -r '{kv}' \\\n > bench.get_lineage.reformat.tsv\n\n./plot2.R -i bench.get_lineage.reformat.tsv --width 6 --height 4 --dpi 600 \\\n --labcolor \"log10(queries)\" --labshape \"Tools\"\n
Result
"},{"location":"bench/#benchmark-2-taxonkit-multi-threaded-scalability","title":"Benchmark 2: TaxonKit multi-threaded scalability","text":"Running benchmark:
$ # emptying the buffers cache\n$ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\"\n\n\n$ time perl run.pl -n 3 run_benchmark_taxonkit.sh -o bench.taxonkit.tsv\n$ rm *.lineage *.out\n
Plotting benchmark result.
cat bench.taxonkit.tsv \\\n | csvtk sort -t -L dataset:<(cut -f 1 dataset_rename.tsv) -k dataset:u -k app \\\n | csvtk replace -t -f dataset -k dataset_rename.tsv -p '(.+)' -r '{kv}' \\\n > bench.taxonkit.reformat.tsv\n\n./plot_threads2.R -i bench.taxonkit.reformat.tsv --width 6 --height 4 --dpi 600 \\\n --labcolor \"log10(queries)\" --labshape \"Threads\"\n
Result
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit","text":"Related projects:
$HOME/.taxonkit
list
List taxonomic subtrees (TaxIds) below given TaxIds lineage
Query taxonomic lineage of given TaxIds reformat
Reformat lineage in canonical ranks name2taxid
Convert taxon names to TaxIds filter
Filter TaxIds by taxonomic rank range lca
Compute lowest common ancestor (LCA) for TaxIds taxid-changelog
Create TaxId changelog from dump archives profile2cami
* Convert metagenomic profile table to CAMI format cami-filter
* Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile create-taxdump
* Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV Note: *New commands since the publication.
"},{"location":"#benchmark","title":"Benchmark","text":"Versions: ETE=3.1.2, taxopy=0.5.0 (faster since 0.6.0), TaxonKit=0.7.2.
"},{"location":"#dataset","title":"Dataset","text":"taxdump.tar.gz
: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz names.dmp
, nodes.dmp
, delnodes.dmp
and merged.dmp
to data directory: $HOME/.taxonkit
, e.g., /home/shenwei/.taxonkit
,--data-dir
, or environment variable TAXONKIT_DB
.All-in-one command:
wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \ntar -zxvf taxdump.tar.gz\n\nmkdir -p $HOME/.taxonkit\ncp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit\n
Update dataset: Simply re-download the taxdump files, uncompress them, and overwrite the old ones.
"},{"location":"#installation","title":"Installation","text":"Go to Download Page for more download options and changelogs.
TaxonKit
is implemented in the Go programming language; executable binaries for most popular operating systems are freely available on the release page.
Just download the compressed executable file for your operating system and decompress it with the tar -zxvf *.tar.gz
command or another tool. Then:
For Linux-like systems
If you have root privileges, simply copy it to /usr/local/bin
:
sudo cp taxonkit /usr/local/bin/\n
Or copy it to any directory listed in the PATH environment variable
:
mkdir -p $HOME/bin/; cp taxonkit $HOME/bin/\n
For Windows, just copy taxonkit.exe
to C:\\WINDOWS\\system32
.
conda install -c bioconda taxonkit\n
"},{"location":"#method-3-install-via-homebrew-out-of-date","title":"Method 3: Install via homebrew (out of date)","text":"brew install brewsci/bio/taxonkit\n
"},{"location":"#method-4-compile-from-source-latest-stabledev-version","title":"Method 4: Compile from source (latest stable/dev version)","text":"Install go
wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz\n\ntar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/\n\n# or \n# echo \"export PATH=$PATH:$HOME/go/bin\" >> ~/.bashrc\n# source ~/.bashrc\nexport PATH=$PATH:$HOME/go/bin\n
Compile TaxonKit
# ------------- the latest stable version -------------\n\ngo get -v -u github.com/shenwei356/taxonkit/taxonkit\n\n# The executable binary file is located in:\n# ~/go/bin/taxonkit\n# You can also move it to anywhere in the $PATH\nmkdir -p $HOME/bin\ncp ~/go/bin/taxonkit $HOME/bin/\n\n# --------------- the development version --------------\n\ngit clone https://github.com/shenwei356/taxonkit\ncd taxonkit/taxonkit/\ngo build\n\n# The executable binary file is located in:\n# ./taxonkit\n# You can also move it to anywhere in the $PATH\nmkdir -p $HOME/bin\ncp ./taxonkit $HOME/bin/\n
Supported shell: bash|zsh|fish|powershell
Bash:
# generate completion shell\ntaxonkit genautocomplete --shell bash\n\n# configure if never did.\n# install bash-completion if the \"complete\" command is not found.\necho \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion\necho \"source ~/.bash_completion\" >> ~/.bashrc\n
Zsh:
# generate completion shell\ntaxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit\n\n# configure if never did\necho 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc\necho \"autoload -U compinit; compinit\" >> ~/.zshrc\n
fish:
taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish\n
"},{"location":"#citation","title":"Citation","text":"If you use TaxonKit in your work, please cite:
Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006
"},{"location":"#contact","title":"Contact","text":"Create an issue to report bugs, propose new functions or ask for help.
"},{"location":"#license","title":"License","text":"MIT License
"},{"location":"#starchart","title":"Starchart","text":""},{"location":"bioinf/","title":"Bioinf","text":""},{"location":"chinese-dev/","title":"Development notes","text":""},{"location":"chinese-dev/#_1","title":"Comparison of existing tools","text":"To obtain the lineage information of an organism from NCBI, you can query the NCBI Taxonomy website by TaxID or name. For example, you can use Homo sapiens
or 9606
to look up the taxonomic information for human, as well as the codon table, Entrez record statistics, and so on.
You can also use NCBI's official toolkit, E-utilities (ftp).
$ esearch -db taxonomy -query \"txid9606 [Organism]\" \\\n | efetch -format xml \\\n | xtract -pattern Lineage -element Lineage\n
In addition, some other tools provide similar functionality. A partial list:
Tool | Language | Data access | Usage | Notes
E-utilities | shell/Perl/C++ | remote web calls | command line | official program; taxonomy operations are only part of its functionality
BioPython | Python | remote web calls | scripts | wraps the Entrez interface; taxonomy operations are only part of its functionality
ETE Toolkit | Python | local database | scripts/command line | taxonomy operations are only part of its functionality
Taxize | R | remote web calls | scripts | ropensci; supports multiple databases; fairly rich functionality
Taxopy | Python | local data files | scripts/command line | basic functionality only
Choosing a tool generally involves several considerations:
Initially, the only feature I wanted was to obtain lineages in the \"kingdom, phylum, class, order, family, genus, species\" format, but I found no ready-made tool for it. Later another need came up that existing tools could not meet: retrieving all TaxIDs under a given taxon. So I started writing a tool to do this and gradually extended its features.
In fact, the simplest approach is to download the data files and parse them yourself.
"},{"location":"chinese-dev/#ncbi-taxonomy","title":"NCBI Taxonomy data files","text":"The NCBI Taxonomy database organizes the taxonomic relationships of all organisms into a rooted tree. This differs from a phylogenetic tree, which is organized by evolutionary relationships and may be unrooted.
NCBI Taxonomy public data is available in two formats. The old one is named taxdump.tar.gz
, is about 50 MB in size, and contains the following files.
nodes.dmp # [current version] node information\n # key fields: tax_id, parent tax_id, rank\nnames.dmp # [current version] name information\n # key fields: tax_id, name_txt\nmerged.dmp # [up to now] merged nodes\n # key fields: old_tax_id, new_tax_id\ndelnodes.dmp # [up to now] deleted nodes\n # key fields: tax_id\n\ncitations.dmp # citation information\ndivision.dmp # division information\ngencode.dmp # genetic code information\ngc.prt # genetic code tables\nreadme.txt # documentation\n
The first four files are the most important:
nodes.dmp
mainly contains, for every taxon node in the current version, its unique identifier (taxonomic identifier, abbreviated as TaxId, taxid, or tax_id), its taxonomic rank, and the TaxID of its parent node. names.dmp
mainly contains all TaxIDs in the current version with their scientific names and synonyms. merged.dmp
contains all TaxIDs merged up to the current version and the new TaxIDs they were merged into. delnodes.dmp
contains all TaxIDs deleted up to the current version. In February 2018, a new format was introduced that additionally includes lineage, type, and host information. Its file name is new_taxdump.tar.gz
, and the file is about 110 MB. Compared with the old format, the new one has more files and more content, mainly because of the added lineage and type information. In fact, lineages can be computed from nodes.dmp
and names.dmp
. The new format contains the following files:
nodes.dmp\nnames.dmp\nmerged.dmp\ndelnodes.dmp\n\nfullnamelineage.dmp\nTaxIDlineage.dmp\nrankedlineage.dmp\n\nhost.dmp\ntypeoftype.dmp\ntypematerial.dmp\n\ncitations.dmp\ndivision.dmp\ngencode.dmp\nreadme.txt\n
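As a rough illustration of the remark that a lineage can be computed from nodes.dmp and names.dmp, the sketch below walks parent links with awk. It assumes the usual .dmp layout (fields separated by tab, pipe, tab; tax_id and parent tax_id in the first two columns of nodes.dmp; tax_id, name_txt, and name class in columns 1, 2, and 4 of names.dmp). The two files here are tiny toy excerpts in which Bacteria hangs directly off the root, and the function name lineage is hypothetical.

```shell
workdir=$(mktemp -d)
# Toy excerpts in the .dmp layout; the real files are far larger.
printf '1\t|\t1\t|\tno rank\t|\n2\t|\t1\t|\tsuperkingdom\t|\n74201\t|\t2\t|\tphylum\t|\n' > "$workdir/nodes.dmp"
printf '1\t|\troot\t|\t\t|\tscientific name\t|\n2\t|\tBacteria\t|\t\t|\tscientific name\t|\n2\t|\teubacteria\t|\t\t|\tgenbank common name\t|\n74201\t|\tVerrucomicrobia\t|\t\t|\tscientific name\t|\n' > "$workdir/names.dmp"

# lineage TAXID: print scientific names from just below the root down to TAXID.
lineage() {
    awk -F '\t[|]\t' -v query="$1" '
        FNR == NR { parent[$1] = $2; next }         # first file: nodes.dmp
        $4 ~ /^scientific name/ { name[$1] = $2 }   # second file: names.dmp
        END {
            out = ""
            # Climb until the root, whose parent is itself.
            for (id = query; (id in parent) && parent[id] != id; id = parent[id])
                out = name[id] (out == "" ? "" : ";" out)
            print out
        }' "$workdir/nodes.dmp" "$workdir/names.dmp"
}

lineage 74201
```

On the toy data this prints Bacteria;Verrucomicrobia. Only scientific names are kept, so synonyms and common names in names.dmp do not leak into the lineage.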
NCBI Taxonomy\u6570\u636e\u6bcf\u5929\u90fd\u5728\u66f4\u65b0\uff0c\u6bcf\u6708\u521d\uff08\u5927\u591a\u4e3a1\u53f7\uff09\u7684\u6570\u636e\u4f5c\u4e3a\u5b58\u6863\u4fdd\u5b58\u5728 taxdump_archive/
\u76ee\u5f55\uff0c \u65e7\u7248\u672c\u6700\u65e9\u6570\u636e\u52302014\u5e748\u6708\uff0c\u65b0\u7248\u672c\u53ea\u52302018\u5e7412\u6708\u3002
Most of us have painful memories of installing bioinformatics software. Before conda, many tools required installing dependencies by hand and then compiling from source. Differences in operating system, OS version, and compiler version made installation very difficult, all the more so when developers paid no attention to cross-platform support and portability.
Good software has to take the following aspects into account:
When I implemented TaxonKit, I had already started writing seqkit and csvtk, so I had some experience and could largely meet all of the above requirements.
TaxonKit is written in Go, which makes it easy to compile executable binaries for different architectures (x86/arm) of Linux, Windows, macOS, and other operating systems. Because Go is a compiled language, runtime efficiency is also assured. Convenience of configuration and use, on the other hand, depends on the developer.
The taxonomy data comes from the public NCBI Taxonomy data. As for data access, querying the official web interface over the network is too slow, so only local access was considered. There are several ways to access the data locally:
Testing eventually showed that parsing the data files directly is also fast, about 5 seconds on an NVMe SSD, which fully meets the requirements. No database has to be set up, so configuration is easier and the data is simple to update. Recent optimization brought the loading time down to about 2 seconds. Memory usage is about 500 MB to 1.5 GB, which is acceptable.
TaxonKit is a command-line tool that uses subcommands for different functions. Most subcommands support standard input/output, which makes it easy to chain them in command-line pipelines.
"},{"location":"chinese-dev/#_2","title":"Limitations","text":"Researchers working on biodiversity will be familiar with the NCBI Taxonomy database, which holds the species name and taxonomic information for every sequence in NCBI's nucleotide and protein sequence databases. Most ecological studies describe species composition based on the NCBI Taxonomy database, although other databases such as GTDB are now coming into use.
The NCBI Taxonomy database was started in 1991 and has been updated along with Entrez and other databases ever since; the web version was launched in 1996. Its official address is https://www.ncbi.nlm.nih.gov/taxonomy , and the public data can be downloaded from https://ftp.ncbi.nih.gov/pub/taxonomy/ . The data is updated hourly, and at the beginning of each month a snapshot is archived in the taxdump_archive directory, going back as far as August 2014.
"},{"location":"chinese/#taxonkit","title":"Using TaxonKit","text":"TaxonKit is a command-line tool written in Go. Statically compiled executable binaries are provided for different architectures (x86-64/arm64) of Linux, Windows, and macOS. The release archives are under 3 MB; besides being hosted on GitHub, they are also available from a mirror in China, and installation via conda and homebrew is supported. The binary works out of the box after downloading and decompressing, with no configuration; the only other step is to download the NCBI Taxonomy data files and decompress them into the expected directory.
Download the latest release for your system from https://github.com/shenwei356/taxonkit/releases , decompress it, and add it to the PATH environment variable. Alternatively, install it with conda:
conda install taxonkit -c bioconda -y\n# For processing tabular data, csvtk is recommended as the more efficient option\nconda install csvtk -c bioconda -y\n
The test data can be obtained by downloading the project archive from https://github.com/shenwei356/taxonkit or by cloning the repository with git clone; the example directory contains the test data.
git clone https://github.com/shenwei356/taxonkit\n
TaxonKit is a command-line tool that uses subcommands for different functions. Most subcommands support standard input/output, so they are easy to chain in command-line pipelines and to integrate into analysis workflows.
Subcommand: Function
list: List the TaxIDs of all taxa below the given TaxId
lineage: Get the complete lineage for TaxIDs
reformat: Convert the complete lineage into a custom format of canonical ranks (kingdom, phylum, class, order, family, genus, species, strain)
name2taxid: Convert taxon names to TaxIDs
filter: Filter TaxIDs by a range of taxonomic ranks
lca: Compute the lowest common ancestor (LCA)
taxid-changelog: Track the change history of TaxIDs
version: Show version information and check for new versions
genautocomplete: Generate the shell autocompletion script
Notes:
Output can be redirected (>) to a file. Alternatively, -o or --out-file specifies an output file, and the .gz suffix is recognized automatically to write gzip-compressed output.
Apart from list and taxid-changelog, the subcommands lineage, reformat, name2taxid, filter, and lca can read input from standard input (stdin) or take it as positional arguments, i.e., arguments after the command that are not attached to any flag, e.g., taxonkit lineage taxids.txt
The column holding the TaxIDs is specified with -i or --taxid-field.
TaxonKit parses the NCBI Taxonomy data files directly (about 2 seconds), which makes configuration easier and data updates simple; memory usage is about 500 MB to 1.5 GB. To download the data:
# The download sometimes fails; retry a few times, or download this link in a browser\nwget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \ntar -zxvf taxdump.tar.gz\n\n# Copy the extracted files to .taxonkit/ in the home directory, the program's default data directory\nmkdir -p $HOME/.taxonkit\ncp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit\n
"},{"location":"chinese/#list-taxidtaxid","title":"list: list all TaxIDs in the subtree of the given TaxId","text":"taxonkit list
lists the TaxIDs of all taxa in the subtree rooted at the taxon of the given TaxID, optionally showing names and ranks. This is similar to what the NCBI Taxonomy web interface offers.
For example:
# Using the genus Homo (9605) and the well-known gut genus Akkermansia (239934) as examples\n$ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934\n9605 [genus] Homo\n 9606 [species] Homo sapiens\n 63221 [subspecies] Homo sapiens neanderthalensis\n 741158 [subspecies] Homo sapiens subsp. 'Denisova'\n 1425170 [species] Homo heidelbergensis\n 2665952 [no rank] environmental samples\n 2665953 [species] Homo sapiens environmental sample\n\n239934 [genus] Akkermansia\n 239935 [species] Akkermansia muciniphila\n 349741 [strain] Akkermansia muciniphila ATCC BAA-835\n 512293 [no rank] environmental samples\n 512294 [species] uncultured Akkermansia sp.\n 1131822 [species] uncultured Akkermansia sp. SMG25\n 1262691 [species] Akkermansia sp. CAG:344\n 1263034 [species] Akkermansia muciniphila CAG:154\n 1679444 [species] Akkermansia glycaniphila\n 2608915 [no rank] unclassified Akkermansia\n 1131336 [species] Akkermansia sp. KLE1605\n ...\n
The most common use of list is to get all TaxIDs under a given group (such as bacteria, viruses, or a particular genus) in order to retrieve the corresponding nucleotide/protein sequences from NCBI nt/nr and build a group-specific BLAST database. The official site provides detailed steps: http://bioinf.shenwei.me/taxonkit/tutorial .
# TaxIDs of all bacteria\n$ taxonkit list --show-rank --show-name --ids 2 > /dev/null\n
"},{"location":"chinese/#lineage-taxid","title":"lineage: get the complete lineage for TaxIDs","text":"The most common operation on taxonomy data is getting the complete lineage for a TaxID. TaxonKit quickly computes lineages for a list of TaxIDs given in an input file, and can optionally report the name, the rank, and the TaxIDs of the lineage.
Note that because the Taxonomy data is updated frequently, some TaxIDs may have been deleted or merged into other TaxIDs. TaxonKit detects these cases automatically and prints a warning; for merged TaxIDs, it computes the lineage from the new TaxID.
# Use the test data in the example directory\n$ head taxids.txt\n9606\n9913\n376619\n# Look up the lineages for the given list of TaxIDs; tee prints to the screen and writes to a file\n$ taxonkit lineage taxids.txt | tee lineage.txt \n19:22:13.077 [WARN] taxid 92489 was merged into 796334\n19:22:13.077 [WARN] taxid 1458427 was merged into 1458425\n19:22:13.077 [WARN] taxid 123124124 not found\n19:22:13.077 [WARN] taxid 3 was deleted\n9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens\n9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus\n376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS\n349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\n239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B\n11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\n1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y\n123124124\n3\n92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae\n1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei\n
Compared with other software, ETE is faster when the number of queries is small, while TaxonKit is faster when it is large. TaxonKit's run time is stable across data sizes, consistently 2-3 seconds, most of which is spent parsing the Taxonomy data files.
The TaxId, rank, and name of every taxon in the lineage can also be listed, for example for SARS-CoV-2.
# Extract the lineage of SARS-CoV-2 with lineage\n$ echo \"2697049\" \\\n | taxonkit lineage -t -R \\\n | sed \"s/\\t/\\n/g\"\n2697049\nViruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2\n10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049\nsuperkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank\n
"},{"location":"chinese/#reformat","title":"reformat: generate taxonomic annotation in standard ranks","text":"Sometimes the complete taxonomic lineage is not needed, because many ranks are both rarely used and incomplete. Usually only kingdom, phylum, class, order, family, genus, and species are wanted.
Note that not every species has a complete set of these ranks, especially viruses and some environmental samples. TaxonKit can replace a missing taxon with custom content, such as \u201c__\u201d. Even more useful, TaxonKit can fill in missing ranks using information from higher-rank taxa (-F/--fill-miss-rank
), for example:
# A virus without a genus\n$ echo 1327037 | taxonkit lineage | taxonkit reformat | cut -f 1,3\n1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y\n\n# The -F flag fills in the missing genus using the family information\n$ echo 1327037 | taxonkit lineage | taxonkit reformat -F | cut -f 1,3\n1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae genus;Croceibacter phage P2559Y\n
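The filling rule visible in the output above can be imitated with a few lines of awk. This is only a sketch of the idea, not TaxonKit's implementation: the helper name fill_missing and the list of rank names are assumptions made for illustration, and the input is the seven-field lineage copied from the example.

```shell
# Hypothetical helper: fill empty ranks in a ';'-separated lineage with
# "unclassified <nearest higher taxon> <rank>", mimicking the -F output shown above.
fill_missing() {
  awk -F';' -v OFS=';' '
    BEGIN { split("kingdom phylum class order family genus species", rank, " ") }
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "") $i = "unclassified " last " " rank[i]  # fill from the closest higher rank
        else last = $i                                       # remember the last named taxon
      }
      print
    }'
}

# Lineage copied from the example above; the empty sixth field is the missing genus.
echo 'Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y' | fill_missing
```

The sketch only handles gaps that follow at least one named rank; a lineage that starts with an empty field would need extra handling.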
The output format can be restricted to a subset of ranks and also supports tabs (\"\\t\"); combined with csvtk, another tool by the author, this produces nicely formatted results.
Other useful options:
-P/--add-prefix: add a prefix to each rank, e.g., s__species.
-t/--show-lineage-taxids: output the TaxIDs of the taxa in the lineage.
-r/--miss-rank-repl: replacement for the names of taxa without a corresponding rank.
-S/--pseudo-strain: for TaxIDs whose rank is below species and is neither subspecies nor strain, use the name of the lowest-rank taxon as the strain name.
For example:
$ echo -ne \"349741\\n1327037\"\\\n | taxonkit lineage \\\n | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" -F -P \\\n | csvtk cut -t -f -2 \\\n | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species \\\n | csvtk pretty -t\n\ntaxid kingdom phylum class order family genus species\n349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila\n1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y\n\n# For easier viewing on small screens, transpose the table with csvtk\n$ echo -ne \"349741\\n1327037\"\\\n | taxonkit lineage \\\n | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" -F -P \\\n | csvtk cut -t -f -2 \\\n | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species \\\n | csvtk transpose -t \\\n | csvtk pretty -H -t\n\ntaxid 349741 1327037\nkingdom k__Bacteria k__Viruses\nphylum p__Verrucomicrobia p__Uroviricota\nclass c__Verrucomicrobiae c__Caudoviricetes\norder o__Verrucomicrobiales o__Caudovirales\nfamily f__Akkermansiaceae f__Siphoviridae\ngenus g__Akkermansia g__unclassified Siphoviridae genus\nspecies s__Akkermansia muciniphila s__Croceibacter phage P2559Y\n\n# Down to the strain level, using SARS-CoV-2 as an example\n$ echo -ne \"2697049\"\\\n | taxonkit lineage \\\n | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" -F -P -S \\\n | csvtk cut -t -f -2 \\\n | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species,strain \\\n | csvtk transpose -t \\\n | csvtk pretty -H -t\n\ntaxid 2697049\nkingdom k__Viruses\nphylum p__Pisuviricota\nclass c__Pisoniviricetes\norder o__Nidovirales\nfamily f__Coronaviridae\ngenus g__Betacoronavirus\nspecies s__Severe acute respiratory syndrome-related coronavirus\nstrain t__Severe acute respiratory syndrome coronavirus 2\n
"},{"location":"chinese/#name2taxid-taxid","title":"name2taxid: convert taxon names to TaxIDs","text":"Converting taxon names to TaxIDs is straightforward. The only thing to watch for is that several TaxIds can share the same name, for example:
# -i specifies the TaxId column, -r shows the rank, -L hides the lineage\n$ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L\nDrosophila 7215 genus\nDrosophila 32281 subgenus\nDrosophila 2081351 genus\n
Once the TaxIDs are obtained, they can be passed straight to other taxonkit subcommands, but remember to specify the TaxId column with -i
.
filter selects TaxIDs by a range of taxonomic ranks, not just a single rank. For example, to keep genus and all ranks below it, use -L genus -E genus
, which acts like <= genus
.
$ cat taxids2.txt \\\n | taxonkit filter -L genus -E genus \\\n | taxonkit lineage -r -n -L \\\n | csvtk -Ht cut -f 1,3,2 \\\n | csvtk pretty -H -t\n239934 genus Akkermansia\n239935 species Akkermansia muciniphila\n349741 strain Akkermansia muciniphila ATCC BAA-835\n
"},{"location":"chinese/#lca-lca","title":"lca: compute the lowest common ancestor (LCA)","text":"Take the genus Homo as an example:
$ taxonkit list --ids 9605 -nr --indent \" \" \n9605 [genus] Homo\n 9606 [species] Homo sapiens\n 63221 [subspecies] Homo sapiens neanderthalensis\n 741158 [subspecies] Homo sapiens subsp. 'Denisova'\n 1425170 [species] Homo heidelbergensis\n 2665952 [no rank] environmental samples\n 2665953 [species] Homo sapiens environmental sample\n
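To make the idea of a lowest common ancestor concrete, the small tree listed above can be walked by hand. The following is a minimal sketch under stated assumptions, not TaxonKit's algorithm: the file name toy-parents.txt and the helper lca are hypothetical, and the child-to-parent links are my reading of the listing, with 9605 treated as the root of the toy tree.

```shell
# Toy child -> parent table (assumed from the listing above; not read from nodes.dmp).
cat > toy-parents.txt <<'EOF'
9606 9605
63221 9606
741158 9606
1425170 9605
2665952 9605
2665953 2665952
EOF

# Hypothetical helper: print the LCA of two TaxIds in the toy tree.
lca() {
  awk -v a="$1" -v b="$2" '
    { parent[$1] = $2 }
    END {
      # Mark the first TaxId and all of its ancestors.
      for (x = a; x != ""; x = parent[x]) seen[x] = 1
      # Climb from the second TaxId until a marked node is reached.
      for (y = b; y != "" && !(y in seen); y = parent[y]) ;
      print y
    }' toy-parents.txt
}

lca 63221 2665953
lca 63221 741158
```

Marking one path to the root and climbing the other is the simplest approach; for many queries on the full taxonomy a real tool would keep the tree in memory rather than re-read a file per query.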
The separator between TaxIDs can be set with -s/--separater
; the default is \" \".
# Compute the LCA of two taxa, here the Neanderthal subspecies (63221) and the Homo sapiens environmental sample (2665953) listed above\n$ echo 63221 2665953 | taxonkit lca\n63221 2665953 9605\n\n# Another separator, with an accidental extra space\n$ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\"\na 63221,2665953\nb 63221, 741158\n\n$ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" \\\n | taxonkit lca -i 2 -s \",\"\na 63221,2665953 9605\nb 63221, 741158 9606\n
"},{"location":"chinese/#taxid-changelog-taxid","title":"TaxID changelog: track the change history of TaxIDs","text":"NCBI Taxonomy data is updated daily. A snapshot from the beginning of each month (usually the 1st) is archived in the taxdump_archive/
directory; archives in the old format go back to August 2014, while those in the new format only go back to December 2018.
TaxonKit can track the monthly changes of every TaxID and write them to a CSV file that can be queried with command-line tools. The data and documentation are hosted separately at https://github.com/shenwei356/taxid-changelog .
Besides simple additions, deletions, and merges, the author subdivided TaxID changes into finer categories. The output format is as follows:
# column comment\ntaxid # taxid\nversion # version / time of archive, e.g, 2019-07-01\nchange # change, values:\n # NEW newly added\n # REUSE_DEL previously deleted, now added back\n # REUSE_MER previously merged, now added back\n # DELETE deleted\n # MERGE merged into another TaxID\n # ABSORB other TaxIDs merged into this TaxID\n # CHANGE_NAME scientific name changed\n # CHANGE_RANK rank changed\n # CHANGE_LIN_LIN TaxIDs of the lineage unchanged, but the lineage changed (names of some TaxIDs changed)\n # CHANGE_LIN_TAX TaxIDs of the lineage changed\n # CHANGE_LIN_LEN length/depth of the lineage changed\nchange-value # variable values for changes: \n # 1) new taxid for MERGE\n # 2) merged taxids for ABSORB\n # 3) empty for others\nname # scientific name\nrank # rank\nlineage # complete lineage of the taxid\nlineage-taxids # taxids of the lineage\n
The data file, taxid-changelog.csv.gz, can be downloaded from the site above. It is about 130 MB, or 2.2 GB uncompressed; since it is gzip-compressed, it can be analyzed without decompressing it first. The examples below use pigz instead of zcat and gzip to speed up decompression.
Example 1: even a superkingdom can disappear. Viroids, for instance, was deleted in May 2019. The author stumbled on this one day, decided to get to the bottom of it, and developed this subcommand.
# Download\nwget -c https://github.com/shenwei356/taxid-changelog/releases/download/v2021.01/taxid-changelog.csv.gz\n# Install pigz, a multi-threaded (de)compression tool, or use gzip instead.\nconda install pigz\n\n$ pigz -cd taxid-changelog.csv.gz \\\n | csvtk grep -f rank -p superkingdom \\\n | csvtk pretty \ntaxid version change change-value name rank lineage lineage-taxids\n2 2014-08-01 NEW Bacteria superkingdom cellular organisms;Bacteria 131567;2\n2157 2014-08-01 NEW Archaea superkingdom cellular organisms;Archaea 131567;2157\n2759 2014-08-01 NEW Eukaryota superkingdom cellular organisms;Eukaryota 131567;2759\n10239 2014-08-01 NEW Viruses superkingdom Viruses 10239\n12884 2014-08-01 NEW Viroids superkingdom Viroids 12884\n12884 2019-05-01 DELETE Viroids superkingdom Viroids 12884\n
Example 2: SARS-CoV-2. The virus was added in February 2020, and its name, lineage, and other information were changed in March and June. The query is also fast.
# Only some of the columns are shown in this example.\n$ time pigz -cd taxid-changelog.csv.gz \\\n | csvtk grep -f taxid -p 2697049 \\\n | csvtk cut -f version,change,name,rank \\\n | csvtk pretty\n\nversion change name rank\n2020-02-01 NEW Wuhan seafood market pneumonia virus species\n2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank\n2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank\n2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank\n2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank\n2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate\n2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank\n\nreal 0m7.644s\nuser 0m16.749s\nsys 0m3.985s\n
More interesting findings are described in taxid-changelog
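Because the changelog is plain CSV, simple summaries need nothing beyond standard tools. As a small illustration, the sketch below counts records per change type with awk. The sample file name changelog-sample.csv is hypothetical, and its rows are the version, change, and rank columns copied from the SARS-CoV-2 output above; the name column is left out so that no field needs CSV quoting.

```shell
# Sample rows copied from the SARS-CoV-2 output above (name column omitted).
cat > changelog-sample.csv <<'EOF'
version,change,rank
2020-02-01,NEW,species
2020-03-01,CHANGE_NAME,no rank
2020-03-01,CHANGE_RANK,no rank
2020-03-01,CHANGE_LIN_LEN,no rank
2020-06-01,CHANGE_LIN_LEN,no rank
2020-07-01,CHANGE_RANK,isolate
2020-08-01,CHANGE_RANK,no rank
EOF

# Count records per change type, skipping the header line.
awk -F, 'NR > 1 { n[$2]++ } END { for (c in n) print c, n[c] }' changelog-sample.csv | sort
```

Splitting on commas is only safe here because the sample has no quoted fields; the full taxid-changelog.csv.gz should be read with a CSV-aware tool such as csvtk, as in the examples above.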
"},{"location":"download/","title":"Download","text":"TaxonKit
is implemented in the Go programming language; executable binaries for most popular operating systems are freely available on the release page.
taxonkit create-taxdump
:taxonkit taxid-changelog/create-taxdump
:create-taxdump
. #91taxonkit create-taxdump
has no problem, it's just the changelog might not be perfect.taxonkit lca
:-K/--keep-invalid
: print the query even if no single valid taxid left. #89
Shen, W., Ren, H., TaxonKit: a practical and efficient NCBI Taxonomy toolkit, Journal of Genetics and Genomics, https://doi.org/10.1016/j.jgg.2021.03.006
"},{"location":"download/#links","title":"Links","text":"Tips
Run taxonkit version
to check for updates.
Run taxonkit genautocomplete
to update Bash completion.
Download Page
TaxonKit
is implemented in the Go programming language; executable binaries for most popular operating systems are freely available on the release page.
Just download the compressed executable file for your operating system and decompress it with the tar -zxvf *.tar.gz
command or other tools. Then:
For Linux-like systems
If you have root privilege simply copy it to /usr/local/bin
:
sudo cp taxonkit /usr/local/bin/\n
Or copy to anywhere in the environment variable PATH
:
mkdir -p $HOME/bin/; cp taxonkit $HOME/bin/\n
For Windows, just copy taxonkit.exe
to C:\\WINDOWS\\system32
.
conda install -c bioconda taxonkit\n
"},{"location":"download/#method-3-install-via-homebrew-may-not-the-lastest-version","title":"Method 3: Install via homebrew (may not be the latest version)","text":"brew install brewsci/bio/taxonkit\n
"},{"location":"download/#method-4-compile-from-source-latest-stabledev-version","title":"Method 4: Compile from source (latest stable/dev version)","text":"Install go
wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz\n\ntar -zxf go1.17.13.linux-amd64.tar.gz -C $HOME/\n\n# or \n# echo \"export PATH=$PATH:$HOME/go/bin\" >> ~/.bashrc\n# source ~/.bashrc\nexport PATH=$PATH:$HOME/go/bin\n
Compile TaxonKit
# ------------- the latest stable version -------------\n\ngo get -v -u github.com/shenwei356/taxonkit/taxonkit\n\n# The executable binary file is located in:\n# ~/go/bin/taxonkit\n# You can also move it to anywhere in the $PATH\nmkdir -p $HOME/bin\ncp ~/go/bin/taxonkit $HOME/bin/\n\n# --------------- the development version --------------\n\ngit clone https://github.com/shenwei356/taxonkit\ncd taxonkit/taxonkit/\ngo build\n\n# The executable binary file is located in:\n# ./taxonkit\n# You can also move it to anywhere in the $PATH\nmkdir -p $HOME/bin\ncp ./taxonkit $HOME/bin/\n
Supported shell: bash|zsh|fish|powershell
Bash:
# generate completion shell\ntaxonkit genautocomplete --shell bash\n\n# configure if never did.\n# install bash-completion if the \"complete\" command is not found.\necho \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion\necho \"source ~/.bash_completion\" >> ~/.bashrc\n
Zsh:
# generate completion shell\ntaxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit\n\n# configure if never did\necho 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc\necho \"autoload -U compinit; compinit\" >> ~/.zshrc\n
fish:
taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish\n
"},{"location":"download/#dataset","title":"Dataset","text":"Download and uncompress taxdump.tar.gz: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
Copy names.dmp, nodes.dmp, delnodes.dmp and merged.dmp to the data directory: $HOME/.taxonkit, e.g., /home/shenwei/.taxonkit.
Alternatively, specify another directory with the flag --data-dir or the environment variable TAXONKIT_DB.
All-in-one command:
wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \ntar -zxvf taxdump.tar.gz\n\nmkdir -p $HOME/.taxonkit\ncp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit\n
Update dataset: Simply re-download the taxdump files, uncompress them, and overwrite the old ones.
"},{"location":"download/#release-history","title":"Release history","text":"taxonkit name2taxid
:taxonkit reformat
:-T/--trim
also does not add the prefix for missing ranks lower than the current rank. #82-s/--miss-rank-repl-suffix
to set the suffix for estimated taxon names. #85taxonkit filter
:taxonkit lca
:-b/--buffer-size
to set the size of the line buffer. #75--separater
-> --separator
, the former is still available for backward compatibility.taxonkit reformat
:taxonkit taxid-changelog
:taxonkit reformat
:-S/--pseudo-strain
does not require -F/--fill-miss-rank
now.{t}
, {S}
, and T
outputs nothing when using -S/--pseudo-strain
.taxonkit create-taxdump
:int32
instead of uint32
, as BLAST and DIAMOND do. #70taxonkit list
:taxonkit
:TAXONKIT_DB
is set, explicitly setting --data-dir
will override the value of TAXONKIT_DB
.taxonkit reformat
:{K}
for rank kingdom
. #64 -I/--taxid-field
.taxonkit create-taxdump
: -A/--field-accession
and no rank names given: the colname of the accession column would be treated as one of the ranks, which messed up all the ranks.--field-accession-re
which wrongly remove prefix like Sp_
. #65taxonkit list
:taxonkit create-taxdump
: taxonkit create-taxdump
: fix bug of missing Class rank, contributed by @apcamargo. The flag --gtdb
was not affected. #57 taxonkit create-taxdump
: Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV. #56taxonkit cami2-filter
: fix option --show-rank
which did not work in v0.10.0.taxonkit cami2-filter
: Remove taxa of given TaxIds and their descendants in CAMI metagenomic profiletaxonkit reformat
: fix panic for deleted taxid using -F/--fill-miss-rank
. #55taxonkit profile2cami
: converting metagenomic profile table to CAMI formattaxonkit reformat
:-I/--taxid-field
.taxonkit lca
:taxonkit genautocomplete
:taxonkit lineage
:-R/--show-lineage-ranks
for appending ranks of all levels.taxonkit filter
:-E/--equal-to
supports multiple values.-n/--save-predictable-norank
: do not discard some special ranks without order when using -L, where rank of the closest higher node is still lower than rank cutoff.taxonkit reformat
:{t}
for subspecies/strain
, {T}
for strain
. Thanks @wqssf102 for feedback.-S/--pseudo-strain
for using the node with the lowest rank as the strain name, only if its rank is lower than \"species\" and is neither \"subspecies\" nor \"strain\". taxonkit filter
: --list-order
or --list-ranks
. #36-N/--discard-noranks
to explicitly filter out \"no rank\", \"clade\". #37taxonkit
: 2-3X faster taxonomy data loading.taxonkit filter
: filtering TaxIds by taxonomic rank range. #32taxonkit lca
: Computing lowest common ancestor (LCA) for TaxIds.taxonkit reformat
:-P/--add-prefix
: add prefixes for all ranks, single prefix for a rank is defined by flag --prefix-X
, where X
may be k
, p
, c
, o
, f
, s
, S
.-T/--trim
: do not fill missing rank lower than current rank.taxonkit list
: do not duplicate root node.taxonkit reformat -F
: fix taxids of abbreviated lineage containing names shared by different taxids. #35taxonkit lineage
: -n/--show-name
for appending scientific name.-L/--no-lineage
for hiding lineage; this is for quickly retrieving names and/or ranks.taxonkit reformat
:-F/--fill-miss-rank
.taxonkit list
:taxonkit name2taxid
: new flag -s/--sci-name
for limiting to searching scientific names. #29taxonkit version
: make checking update optionaltaxonkit
: requiring delnodes.dmp and merged.dmp.taxonkit lineage
: detect deleted and merged taxids now. #19taxonkit list/name2taxid
: add short flag -r
for --show-rank
, -n
for --show-name
.taxonkit taxid-changelog
: rewrite logic, fix bug and add more change typestaxonkit taxid-changelog
: change output of ABSORB
, do not merge changes in different versions into one record.taxonkit taxid-changelog
: name
and rank
.taxonkit taxid-changelog
: for creating taxid changelog from dump archive--line-buffered
to disable output buffer. #11--names-file
and --nodes-file
with --data-dir
, also support environment variable TAXONKIT_DB
. #17taxonkit reformat
: detects lineages containing unofficial taxon name and won't show panic message.taxonkit name2taxid
: supports synonym names. #9taxonkit lineage
: add flag -r/--show-rank
to print the rank in a new column.taxonkit reformat
:-F/--fill-miss-rank
to estimate and fill missing rank with original lineage information\\t
, \\n
, #5taxonkit lineage
:1
#7-d/--delimiter
.taxonkit list
: fix bug of no output for leaf nodes of the taxonomic tree. #4genautocomplete
to generate shell autocompletion script!name2taxid
to query taxid by taxon scientific name.lineage
, reformat
: changed flags and default operations, check the usage.taxonkit lineage
, add an extra column of lineage in Taxid. #3. e.g.,taxonkit reformat
: supports reading stdin from output of taxonkit lineage
, reformatted lineages are appended to input data.-f/--formated-rank
from taxonkit lineage
, using taxonkit reformat
can achieve the same result.--fill
for taxonkit reformat
, which estimates and fills missing rank with original lineage informationtaxonkit reformat
which reformats full lineage to custom formattaxonkit lineage
, users can query lineage of given taxon IDs from filetaxonkit list
, users can choose output in readable JSON format by flag --json
so the taxonomy tree can be collapsed and expanded in modern text editors.
Show lineage detail of a TaxId. The command below works on Windows with the help of csvtk.
$ echo \"2697049\" \\\n | taxonkit lineage -t \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -Ht\n\n10239 superkingdom Viruses\n2559587 clade Riboviria\n2732396 kingdom Orthornavirae\n2732408 phylum Pisuviricota\n2732506 class Pisoniviricetes\n76804 order Nidovirales\n2499399 suborder Cornidovirineae\n11118 family Coronaviridae\n2501931 subfamily Orthocoronavirinae\n694002 genus Betacoronavirus\n2509511 subgenus Sarbecovirus\n694009 species Severe acute respiratory syndrome-related coronavirus\n2697049 no rank Severe acute respiratory syndrome coronavirus 2\n
Example data.
$ cat taxids3.txt\n376619\n349741\n239935\n314101\n11932\n1327037\n83333\n1408252\n2605619\n2697049\n
Format to 7-level ranks (\"superkingdom phylum class order family genus species\").
$ cat taxids3.txt \\\n | taxonkit reformat -I 1\n\n376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis\n349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B\n11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\n1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y\n83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli\n1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli\n2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli\n2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related coronavirus\n
Format to 8-level ranks (\"superkingdom phylum class order family genus species subspecies/strain\").
$ cat taxids3.txt \\\n | taxonkit reformat -I 1 -f \"{k};{p};{c};{o};{f};{g};{s};{t}\"\n\n376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica LVS\n349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\n239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;\n314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B;\n11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle;\n1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y;\n83333 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli K-12\n1408252 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;Escherichia coli R178\n2605619 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;\n2697049 Viruses;Pisuviricota;Pisoniviricetes;Nidovirales;Coronaviridae;Betacoronavirus;Severe acute respiratory syndrome-related coronavirus;\n
Replace missing ranks with Unassigned
and output tab-delimited format.
$ cat taxids3.txt \\\n | taxonkit reformat -I 1 -r \"Unassigned\" -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\\n | csvtk pretty -H -t\n\n376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis Francisella tularensis subsp. holarctica LVS\n349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835\n239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila Unassigned\n314101 Bacteria Unassigned Unassigned Unassigned Unassigned Unassigned uncultured murine large bowel bacterium BAC 54B Unassigned\n11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle Unassigned\n1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Unassigned Croceibacter phage P2559Y Unassigned\n83333 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli K-12\n1408252 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Escherichia coli R178\n2605619 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli Unassigned\n2697049 Viruses Pisuviricota Pisoniviricetes Nidovirales Coronaviridae Betacoronavirus Severe acute respiratory syndrome-related coronavirus Unassigned\n
Fill missing ranks and add prefixes.
$ cat taxids3.txt \\\n | taxonkit reformat -I 1 -F -P -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\\n | csvtk pretty -H -t\n\n376619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Thiotrichales f__Francisellaceae g__Francisella s__Francisella tularensis t__Francisella tularensis subsp. holarctica LVS\n349741 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__Akkermansia muciniphila ATCC BAA-835\n239935 k__Bacteria p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales f__Akkermansiaceae g__Akkermansia s__Akkermansia muciniphila t__unclassified Akkermansia muciniphila subspecies/strain\n314101 k__Bacteria p__unclassified Bacteria phylum c__unclassified Bacteria class o__unclassified Bacteria order f__unclassified Bacteria family g__unclassified Bacteria genus s__uncultured murine large bowel bacterium BAC 54B t__unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain\n11932 k__Viruses p__Artverviricota c__Revtraviricetes o__Ortervirales f__Retroviridae g__Intracisternal A-particles s__Mouse Intracisternal A-particle t__unclassified Mouse Intracisternal A-particle subspecies/strain\n1327037 k__Viruses p__Uroviricota c__Caudoviricetes o__Caudovirales f__Siphoviridae g__unclassified Siphoviridae genus s__Croceibacter phage P2559Y t__unclassified Croceibacter phage P2559Y subspecies/strain\n83333 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli K-12\n1408252 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__Escherichia coli R178\n2605619 k__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Enterobacterales f__Enterobacteriaceae g__Escherichia s__Escherichia coli t__unclassified Escherichia coli subspecies/strain\n2697049 k__Viruses p__Pisuviricota 
c__Pisoniviricetes o__Nidovirales f__Coronaviridae g__Betacoronavirus s__Severe acute respiratory syndrome-related coronavirus t__unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain\n
When there are no nodes of rank \"subspecies\" or \"strain\", we can switch on -S/--pseudo-strain
to use the node with the lowest rank as the subspecies/strain name, if that rank is lower than \"species\".
$ cat taxids3.txt \\\n | taxonkit lineage -r -L \\\n | taxonkit reformat -I 1 -F -S -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\\n | cut -f 1,2,9,10 \\\n | csvtk add-header -t -n \"taxid,rank,species,strain\" \\\n | csvtk pretty -t\n\ntaxid rank species strain\n------- ---------- ----------------------------------------------------- ------------------------------------------------------------------------------\n376619 strain Francisella tularensis Francisella tularensis subsp. holarctica LVS\n349741 strain Akkermansia muciniphila Akkermansia muciniphila ATCC BAA-835\n239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain\n314101 species uncultured murine large bowel bacterium BAC 54B unclassified uncultured murine large bowel bacterium BAC 54B subspecies/strain\n11932 species Mouse Intracisternal A-particle unclassified Mouse Intracisternal A-particle subspecies/strain\n1327037 species Croceibacter phage P2559Y unclassified Croceibacter phage P2559Y subspecies/strain\n83333 strain Escherichia coli Escherichia coli K-12\n1408252 subspecies Escherichia coli Escherichia coli R178\n2605619 no rank Escherichia coli Escherichia coli O16:H48\n2697049 no rank Severe acute respiratory syndrome-related coronavirus Severe acute respiratory syndrome coronavirus 2\n
List the eight-level lineage for all TaxIds of rank lower than or equal to species, including some nodes with \"no rank\". When filtering with -L/--lower-than
, you can use -n/--save-predictable-norank
to keep some special ranks without order, where the rank of the closest higher node is still lower than the rank cutoff.
$ time taxonkit list --ids 1 \\\n | taxonkit filter -L species -E species -R -N -n \\\n | taxonkit lineage -n -r -L \\\n | taxonkit reformat -I 1 -F -S -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\\n | csvtk cut -Ht -l -f 1,3,2,1,4-11 \\\n | csvtk add-header -t -n \"taxid,rank,name,lineage,kingdom,phylum,class,order,family,genus,species,strain\" \\\n | pigz -c > result.tsv.gz\n\nreal 0m25.167s\nuser 2m14.809s\nsys 0m7.197s\n\n$ pigz -cd result.tsv.gz \\\n | csvtk grep -t -f taxid -p 2697049 \\\n | csvtk transpose -t \\\n | csvtk pretty -H -t\n\ntaxid 2697049\nrank no rank\nname Severe acute respiratory syndrome coronavirus 2\nlineage Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2\nkingdom Viruses\nphylum Pisuviricota\nclass Pisoniviricetes\norder Nidovirales\nfamily Coronaviridae\ngenus Betacoronavirus\nspecies Severe acute respiratory syndrome-related coronavirus\nstrain Severe acute respiratory syndrome coronavirus 2\n
"},{"location":"tutorial/#mapping-old-species-names-to-new-ones","title":"Mapping old species names to new ones","text":"Some species names in papers or websites might changed, we can try querying their TaxIds via their old new names and then retrieve the new ones.
cat example/changed_species_names.txt\nLactobacillus fermentum\nMycoplasma gallinaceum\n\n# TaxonKit >= v0.15.1\ncat example/changed_species_names.txt \\\n | taxonkit name2taxid \\\n | taxonkit lineage -i 2 -n \\\n | cut -f 1,4\n\nLactobacillus fermentum Limosilactobacillus fermentum\nMycoplasma gallinaceum\n
Oops, there's no information for Mycoplasma gallinaceum
. Then we check the taxid-changelog.
zcat taxonkit/taxid-changelog.csv.gz \\\n | csvtk grep -f name -P example/changed_species_names.txt \\\n | csvtk cut -f taxid,version,change,name,rank \\\n | csvtk pretty\n\ntaxid version change name rank\n----- ---------- -------------- ----------------------- -------\n1613 2013-02-21 NEW Lactobacillus fermentum species\n1613 2016-03-01 ABSORB Lactobacillus fermentum species\n1613 2016-03-01 CHANGE_LIN_LEN Lactobacillus fermentum species\n29556 2013-02-21 NEW Mycoplasma gallinaceum species\n29556 2016-03-01 CHANGE_LIN_LEN Mycoplasma gallinaceum species\n29556 2021-01-01 CHANGE_NAME Mycoplasma gallinaceum species\n29556 2021-01-01 CHANGE_LIN_LIN Mycoplasma gallinaceum species\n
We can see that the names have changed. The full change history can be queried with the taxid, e.g.,
taxid version change change-value name rank\n----- ---------- -------------- ------------ ------------------------- -------\n29556 2013-02-21 NEW Mycoplasma gallinaceum species\n29556 2016-03-01 CHANGE_LIN_LEN Mycoplasma gallinaceum species\n29556 2020-09-01 CHANGE_NAME Mycoplasmopsis gallinacea species\n29556 2020-09-01 CHANGE_LIN_TAX Mycoplasmopsis gallinacea species\n29556 2021-01-01 CHANGE_NAME Mycoplasma gallinaceum species\n29556 2021-01-01 CHANGE_LIN_LIN Mycoplasma gallinaceum species\n29556 2021-09-01 CHANGE_NAME Mycoplasmopsis gallinacea species\n29556 2021-09-01 CHANGE_LIN_LIN Mycoplasmopsis gallinacea species\n29556 2023-03-01 CHANGE_LIN_LIN Mycoplasmopsis gallinacea species\n
Then we just use their TaxIds to retrieve the new names. The final commands are:
zcat taxonkit/taxid-changelog.csv.gz \\\n | csvtk grep -f name -P example/changed_species_names.txt \\\n | csvtk uniq -f taxid \\\n | csvtk cut -f name,taxid \\\n | csvtk del-header \\\n | csvtk csv2tab \\\n | taxonkit lineage -i 2 -n \\\n | cut -f 1,4\n\nLactobacillus fermentum Limosilactobacillus fermentum\nMycoplasma gallinaceum Mycoplasmopsis gallinacea\n
"},{"location":"tutorial/#add-taxonomy-information-to-blast-result","title":"Add taxonomy information to BLAST result","text":"An blast result file blast_result.txt
, where the second column is the accession of matched sequences.
head -n 5 blast_result.txt | csvtk pretty -Ht\n\nxxxxxxxxxxxxxxxxxxxxx/2/ccs XM_013496560.1 78.745 494 99 3 6361 6851 895 1385 6.53e-83 326 \nxxxxxxxxxxxxxxxxxxxxx/2/ccs XM_013496560.1 78.543 494 100 3 17168 17658 895 1385 3.04e-81 320 \nxxxxxxxxxxxxxxxxxxxxx/76/ccs LR699760.1 100.000 37 0 0 8139 8175 14507874 14507910 4.27e-06 69.4\nxxxxxxxxxxxxxxxxxxxxx/80/ccs HG994975.1 80.556 540 81 16 8269 8798 3821290 3820765 8.65e-104 394 \nxxxxxxxxxxxxxxxxxxxxx/80/ccs HG994975.1 77.805 410 89 2 9590 9998 3819858 3819450 5.51e-61 252\n
Prepare acc2taxid.tsv
file from nucl_gb.accession2taxid.gz file. Here we use the accession
column instead of accession.version
column, in case of unmatched versions for some accessions.
zcat nucl_gb.accession2taxid.gz | cut -f 1,3 | gzip -c > acc2taxid.tsv.gz\n
Extract needed acc2taxid subset to reduce memory usage.
# extract accession and deduplicate and remove versions\ncut -f 2 blast_result.txt | csvtk uniq -Ht | csvtk replace -Ht -p '\\.\\d+$' > acc.txt\n\n# grep from acc2taxid.tsv.gz\nzcat acc2taxid.tsv.gz | grep -w -f acc.txt > hit.acc2taxid.tsv\n
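The version-stripping step above can also be sketched with standard tools only. This is a minimal sketch, not how csvtk implements it: `strip_version` and `dedup` are hypothetical helper names, and the demo accessions are taken from the example BLAST result. Versions are removed before deduplication here, so that two versions of one accession collapse into a single entry.

```shell
#!/bin/sh
# Hypothetical helper (not part of csvtk or TaxonKit): remove a trailing
# ".<digits>" version suffix from every accession read from stdin.
strip_version() {
    sed -E 's/\.[0-9]+$//'
}

# Hypothetical helper: drop repeated lines, keeping first-seen order.
dedup() {
    awk '!seen[$0]++'
}

# Demo on accessions from the example BLAST result.
printf 'XM_013496560.1\nXM_013496560.1\nLR699760.1\nHG994975.1\n' \
    | strip_version \
    | dedup
# prints:
# XM_013496560
# LR699760
# HG994975
```

Accessions without a version suffix pass through unchanged, because the pattern only matches a dot followed by digits at the end of the line.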
Prepare hit.taxid2name.tsv
, in which species names are retrieved for the taxids.
cut -f 2 hit.acc2taxid.tsv | taxonkit reformat -f '{s}' -I 1 > hit.taxid2name.tsv\n
Append taxids according to the accessions, and append species names for the taxids.
csvtk add-header -t --names \"qseqid,sseqid,pident,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore\" blast_result.txt \\\n | csvtk mutate -t -f sseqid -n taxid \\\n | csvtk replace -t -k hit.acc2taxid.tsv -f taxid -p '(.+)\\.\\d+' -r '{kv}' \\\n | csvtk mutate -t -f taxid -n species \\\n | csvtk replace -t -k hit.taxid2name.tsv -f species -p '(.+)' -r '{kv}' \\\n | head -n 5 | csvtk pretty -t\n\nqseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore taxid species \n---------------------------- -------------- ------- ------ -------- ------- ------ ----- -------- -------- --------- -------- ----- --------------------\nxxxxxxxxxxxxxxxxxxxxx/2/ccs XM_013496560.1 78.745 494 99 3 6361 6851 895 1385 6.53e-83 326 44415 Eimeria mitis \nxxxxxxxxxxxxxxxxxxxxx/2/ccs XM_013496560.1 78.543 494 100 3 17168 17658 895 1385 3.04e-81 320 44415 Eimeria mitis \nxxxxxxxxxxxxxxxxxxxxx/76/ccs LR699760.1 100.000 37 0 0 8139 8175 14507874 14507910 4.27e-06 69.4 3702 Arabidopsis thaliana\nxxxxxxxxxxxxxxxxxxxxx/80/ccs HG994975.1 80.556 540 81 16 8269 8798 3821290 3820765 8.65e-104 394 5802 Eimeria tenella\n
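For readers without csvtk, the accession-to-taxid lookup above can be approximated with awk. This is only a sketch under assumptions: `append_taxid` is a hypothetical helper, the mapping file has two tab-separated columns (accession without version, taxid) like hit.acc2taxid.tsv, and the accession sits in column 2 of the BLAST table. Hits whose accession is missing from the mapping get an empty taxid.

```shell
#!/bin/sh
# Hypothetical helper: append a taxid column to tab-delimited BLAST hits
# read from stdin. $1 is a two-column file: accession (no version), taxid.
append_taxid() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { taxid[$1] = $2; next }   # first file: load the mapping
        {
            acc = $2
            sub(/\.[0-9]+$/, "", acc)        # drop the version suffix
            print $0, (acc in taxid ? taxid[acc] : "")
        }
    ' "$1" -
}

# Demo with two accessions and taxids from the example above.
map=$(mktemp)
printf 'XM_013496560\t44415\nLR699760\t3702\n' > "$map"
printf 'q1\tXM_013496560.1\t78.745\nq2\tLR699760.1\t100.000\n' \
    | append_taxid "$map"
rm -f "$map"
```

The demo appends 44415 to the first hit and 3702 to the second. The same pattern could be repeated with a taxid-to-name file to add the species column.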
"},{"location":"tutorial/#parsing-krakenbracken-result","title":"Parsing kraken/bracken result","text":"Example Data
Run Kraken2 and Bracken
KRAKEN_DB=/home/shenwei/ws/db/kraken/k2_pluspf\nTHREADS=16\n\nCLASSIFICATION_LVL=S\nTHRESHOLD=10\n\nREAD_LEN=100\nSAMPLE=SRS014459-Stool.fasta.gz\n\nBRACKEN_OUTPUT_FILE=$SAMPLE\n\nkraken2 --db ${KRAKEN_DB} --threads ${THREADS} --report ${SAMPLE}.kreport $SAMPLE > ${SAMPLE}.kraken\n\nest_abundance.py -i ${SAMPLE}.kreport -k ${KRAKEN_DB}/database${READ_LEN}mers.kmer_distrib \\\n -l ${CLASSIFICATION_LVL} -t ${THRESHOLD} -o ${BRACKEN_OUTPUT_FILE}.bracken\n
Original format
$ head -n 15 SRS014459-Stool.fasta.gz_bracken_species.kreport\n100.00 9491 0 R 1 root\n99.85 9477 0 R1 131567 cellular organisms\n99.85 9477 0 D 2 Bacteria\n66.08 6271 0 D1 1783270 FCB group\n66.08 6271 0 D2 68336 Bacteroidetes/Chlorobi group\n66.08 6271 0 P 976 Bacteroidetes\n66.08 6271 0 C 200643 Bacteroidia\n66.08 6271 0 O 171549 Bacteroidales\n34.45 3270 0 F 815 Bacteroidaceae\n34.45 3270 0 G 816 Bacteroides\n10.43 990 990 S 246787 Bacteroides cellulosilyticus\n7.98 757 757 S 28116 Bacteroides ovatus\n3.10 293 0 G1 2646097 unclassified Bacteroides\n1.06 100 100 S 2755405 Bacteroides sp. CACC 737\n0.49 46 46 S 2650157 Bacteroides sp. HF-5287\n
Converting to MetaPhlAn2 format. (Similar to kreport2mpa.py)
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\\n | csvtk cut -Ht -f 5,1 \\\n | taxonkit lineage \\\n | taxonkit reformat -i 3 -P -f \"{k}|{p}|{c}|{o}|{f}|{g}|{s}\" \\\n | csvtk cut -Ht -f 4,2 \\\n | csvtk replace -Ht -p \"(\\|[kpcofgs]__)+$\" \\\n | csvtk replace -Ht -p \"\\|[kpcofgs]__\\|\" -r \"|\" \\\n | csvtk uniq -Ht \\\n | csvtk grep -Ht -p k__ -v \\\n > SRS014459-Stool.fasta.gz_bracken_species.kreport.format\n\n$ head -n 10 SRS014459-Stool.fasta.gz_bracken_species.kreport.format\n\nk__Bacteria 99.85\nk__Bacteria|p__Bacteroidetes 66.08\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia 66.08\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales 66.08\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae 34.45\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides 34.45\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides cellulosilyticus 10.43\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides ovatus 7.98\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. CACC 737 1.06\nk__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides sp. HF-5287 0.49\n
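The two `csvtk replace` steps above remove ranks that are left empty after `taxonkit reformat -P`. A rough sed equivalent is sketched below, assuming a \"|\"-delimited lineage with the prefixes k__, p__, c__, o__, f__, g__ and s__; `trim_empty_ranks` is a hypothetical helper name. Unlike a single replacement pass, the loop also handles several consecutive empty ranks in the middle of a lineage.

```shell
#!/bin/sh
# Hypothetical helper: remove empty prefixed ranks (such as "g__" with no
# name) from "|"-delimited lineages read from stdin.
trim_empty_ranks() {
    # 1st expression: cut a run of empty ranks at the end of the lineage.
    # Loop (:a ... ta): remove empty ranks in the middle, one at a time.
    sed -E \
        -e 's/(\|[kpcofgs]__)+$//' \
        -e ':a' \
        -e 's/\|[kpcofgs]__\|/|/' \
        -e 'ta'
}

# Demo: a phylum-level node, and a phage lineage without a genus.
printf '%s\n' \
    'k__Bacteria|p__Bacteroidetes|c__|o__|f__|g__|s__' \
    'k__Viruses|p__Uroviricota|c__Caudoviricetes|o__Caudovirales|f__Siphoviridae|g__|s__Croceibacter phage P2559Y' \
    | trim_empty_ranks
# prints:
# k__Bacteria|p__Bacteroidetes
# k__Viruses|p__Uroviricota|c__Caudoviricetes|o__Caudovirales|f__Siphoviridae|s__Croceibacter phage P2559Y
```

The sketch works on the lineage string only; in the pipeline above the abundance column follows the lineage, so the end-of-line anchor would need to be adapted.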
Converting to Qiime format
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\\n | csvtk cut -Ht -f 5,1 \\\n | taxonkit lineage \\\n | taxonkit reformat -i 3 -P -f \"{k}; {p}; {c}; {o}; {f}; {g}; {s}\" \\\n | csvtk cut -Ht -f 4,2 \\\n | csvtk replace -Ht -p \"(; [kpcofgs]__)+$\" \\\n | csvtk replace -Ht -p \"; [kpcofgs]__; \" -r \"; \" \\\n | csvtk uniq -Ht \\\n | csvtk grep -Ht -p k__ -v \\\n | head -n 10\n\nk__Bacteria 99.85\nk__Bacteria; p__Bacteroidetes 66.08\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia 66.08\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales 66.08\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae 34.45\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides 34.45\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides cellulosilyticus 10.43\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides ovatus 7.98\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. CACC 737 1.06\nk__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__Bacteroides sp. HF-5287 0.49\n
Save taxon proportion and taxid, and get lineage, name and rank.
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\\n | csvtk cut -Ht -f 1,5 \\\n | taxonkit lineage -i 2 -n -r \\\n | csvtk cut -Ht -f 1,2,5,4,3 \\\n | head -n 10 \\\n | csvtk pretty -Ht\n\n100.00 1 no rank root root\n99.85 131567 no rank cellular organisms cellular organisms\n99.85 2 superkingdom Bacteria cellular organisms;Bacteria\n66.08 1783270 clade FCB group cellular organisms;Bacteria;FCB group\n66.08 68336 clade Bacteroidetes/Chlorobi group cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group\n66.08 976 phylum Bacteroidetes cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes\n66.08 200643 class Bacteroidia cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia\n66.08 171549 order Bacteroidales cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales\n34.45 815 family Bacteroidaceae cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae\n34.45 816 genus Bacteroides cellular organisms;Bacteria;FCB group;Bacteroidetes/Chlorobi group;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides\n
Only save species or lower level and get lineage in format of \"superkingdom phylum class order family genus species\".
$ cat SRS014459-Stool.fasta.gz_bracken_species.kreport \\\n | csvtk cut -Ht -f 1,5 \\\n | taxonkit filter -N -E species -L species -i 2 \\\n | taxonkit lineage -i 2 -n -r \\\n | taxonkit reformat -i 3 -f \"{k};{p};{c};{o};{f};{g};{s}\" \\\n | csvtk cut -Ht -f 1,2,5,4,6 \\\n | csvtk add-header -t -n abundance,taxid,rank,name,lineage \\\n | head -n 10 \\\n | csvtk pretty -t\n\nabundance taxid rank name lineage\n--------- ------- ------- ---------------------------- --------------------------------------------------------------------------------------------------------\n10.43 246787 species Bacteroides cellulosilyticus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides cellulosilyticus\n7.98 28116 species Bacteroides ovatus Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides ovatus\n1.06 2755405 species Bacteroides sp. CACC 737 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CACC 737\n0.49 2650157 species Bacteroides sp. HF-5287 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5287\n0.99 2528203 species Bacteroides sp. A1C1 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. A1C1\n0.28 2763022 species Bacteroides sp. M10 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. M10\n0.16 2650158 species Bacteroides sp. HF-5141 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. HF-5141\n0.12 2715212 species Bacteroides sp. CBA7301 Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides sp. CBA7301\n5.10 817 species Bacteroides fragilis Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Bacteroides fragilis\n
"},{"location":"tutorial/#making-nr-blastdb-for-specific-taxids","title":"Making nr blastdb for specific taxids","text":"Attention:
(2023-11-27) BLAST+ 2.15.0 supports limiting a group of organisms without first using a custom script to get all species-level Taxonomy IDs (taxids) for the group. Details.
E.g., Search of the nr BLAST database limited to Bacteria (taxID 2).
blastp -db nr -taxids 2 -query ...\n
(2019) BLAST+ 2.8.1 is released with new databases, which allows you to limit your search by taxonomy using information built into the BLAST databases. So you don't need to build blastdb for specific taxids now.
Changes:
Data:
Hardware in this tutorial
Tools:
Steps:
Listing all taxids below $id
using taxonkit.
id=6656\n\n# 6656 is the phylum Arthropoda\n# echo 6656 | taxonkit lineage | taxonkit reformat\n# 6656 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda Eukaryota;Arthropoda;;;;;\n\n# 2 bacteria\n# 2157 archaea\n# 4751 fungi\n# 10239 virus\n\n# time: 2s\ntaxonkit list --ids $id --indent \"\" > $id.taxid.txt\n\n# taxonkit list --ids 2,4751,10239 --indent \"\" > microbe.taxid.txt\n\nwc -l $id.taxid.txt\n# 518373 6656.taxid.txt\n
Retrieving target accessions. There are two options:
From prot.accession2taxid.gz (faster, recommended). Note that some accessions are not in nr
.
# time: 4min\npigz -dc prot.accession2taxid.gz \\\n | csvtk grep -t -f taxid -P $id.taxid.txt \\\n | csvtk cut -t -f accession.version,taxid \\\n | sed 1d \\\n > $id.acc2taxid.txt\n\ncut -f 1 $id.acc2taxid.txt > $id.acc.txt\n\nwc -l $id.acc.txt\n# 8174609 6656.acc.txt\n
From pre-formatted nr
blastdb
# time: 40min\nblastdbcmd -db nr -entry all -outfmt \"%a %T\" | pigz -c > nr.acc2taxid.txt.gz\n\npigz -dc nr.acc2taxid.txt.gz | wc -l\n# 555220892\n\n# time: 3min\npigz -dc nr.acc2taxid.txt.gz \\\n | csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \\\n | cut -d ' ' -f 1 \\\n > $id.acc.txt\n\nwc -l $id.acc.txt\n# 6928021 6656.acc.txt\n
Retrieving FASTA sequences from pre-formatted blastdb. There are two options:
From nr.fa
exported from pre-formatted blastdb (faster, smaller output file, recommended). DO NOT directly download nr.gz
from the NCBI FTP site, in which the FASTA headers are not well formatted.
# 1. exporting nr.fa from pre-formatted blastdb\n\n# time: 117min (run only once)\nblastdbcmd -db nr -dbtype prot -entry all -outfmt \"%f\" -out - | pigz -c > nr.fa.gz\n\n# =====================================================================\n\n# 2. filtering sequences belonging to $id\n\n# ---------------------------------------------------------------------\n\n# method 1) (for cases where $id.acc.txt is not huge)\n# time: 80min\n# perl one-liner is used to unfold records having multiple accessions\ntime cat <(echo) <(pigz -dc nr.fa.gz) \\\n | perl -e 'BEGIN{ $/ = \"\\n>\"; <>; } while(<>){s/>$//; $i = index $_, \"\\n\"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print \">$_\"; next; }; $h = \">$h\"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\\W+//; print \">$h1\\n$s\";} } ' \\\n | seqkit grep -f $id.acc.txt -o nr.$id.fa.gz\n\n# ---------------------------------------------------------------------\n\n# method 2) (**faster**)\n\n# 33min (run only once)\n# (1). split nr.fa.gz. # Note: I have 16 cpus.\n$ time seqkit split2 -p 15 nr.fa.gz\n\n# (2). parallelize unfolding\n$ cat _unfold_blastdb_fa.sh\n#!/bin/sh\nperl -e 'BEGIN{ $/ = \"\\n>\"; <>; } while(<>){s/>$//; $i = index $_, \"\\n\"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print \">$_\"; next; }; $h = \">$h\"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\\W+//; print \">$h1\\n$s\";} } '\n\n# 10 min\ntime ls nr.fa.gz.split/nr.part_*.fa.gz \\\n | rush -j 15 -v id=$id 'cat <(echo) <(pigz -dc {}) \\\n | ./_unfold_blastdb_fa.sh \\\n | seqkit grep -f {id}.acc.txt -o nr.{id}.{%@nr\\.(.+)$} '\n\n# (3). merge result\ncat nr.$id.part*.fa.gz > nr.$id.fa.gz\nrm nr.$id.part*.fa.gz\n\n# ---------------------------------------------------------------------\n\n# method 3) (for huge $id.acc.txt file, e.g., bacteria)\n\n# (1). split ${id}.acc.txt into several parts. 
chunk size depends on lines and RAM (64G for me).\nsplit -d -l 300000000 $id.acc.txt $id.acc.txt.part_\n\n# (2). filter\ntime ls $id.acc.txt.part_* \\\n | rush -j 1 --immediate-output -v id=$id \\\n 'echo {}; cat <(echo) <(pigz -dc nr.fa.gz ) \\\n | ./_unfold_blastdb_fa.sh \\\n | seqkit grep -f {} -o nr.{id}.{%@(part_.+)}.fa.gz '\n\n# (3). merge\ncat nr.$id.part*.fa.gz > nr.$id.fa.gz\n\n# clean\nrm nr.$id.part*.fa.gz\nrm $id.acc.txt.part_*\n\n# (4). optionally adding taxid, you may edit replacement (-r) below\n# split\ntime split -d -l 200000000 $id.acc2taxid.txt $id.acc2taxid.txt.part_\n\nln -s nr.$id.fa.gz nr.$id.with-taxid.part0.fa.gz \ni=0\nfor f in $id.acc2taxid.txt.part_* ; do\n echo $f\n time pigz -cd nr.$id.with-taxid.part$i.fa.gz \\\n | seqkit replace -k $f -p \"^([^\\-]+?) \" -r \"{kv}-\\$1 \" -K -U -o nr.$id.with-taxid.part$(($i+1)).fa.gz;\n /bin/rm nr.$id.with-taxid.part$i.fa.gz\n i=$(($i+1));\ndone\nmv nr.$id.with-taxid.part$i.fa.gz nr.$id.with-taxid.fa.gz\n\n# =====================================================================\n\n# 3. counting sequences\n#\n# ls -lh nr.$id.fa.gz\n# -rw-r--r-- 1 shenwei shenwei 902M 9\u6708 13 01:42 nr.6656.fa.gz\n#\npigz -dc nr.$id.fa.gz | grep '^>' -c\n\n# 6928017\n# Here 6928017 ~= 6928021 ($id.acc.txt)\n
Directly from pre-formatted blastdb
# time: 5h20min\nblastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz\n\n# counting sequences\n#\n# Note that the headers of FASTA output by blastdbcmd are \"folded\"\n# for accessions from different species with the same sequence, so the\n# number may be smaller than $(wc -l $id.acc.txt).\npigz -dc nr.$id.fa.gz | grep '^>' -c\n# 1577383\n\n# counting accessions\n#\n# ls -lh nr.$id.fa.gz\n# -rw-r--r-- 1 shenwei shenwei 2.1G 9\u6708 13 03:38 nr.6656.fa.gz\n#\npigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\\n>/g' | grep '^>' -c\n# 288415413\n
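The difference between counting sequences and counting accessions can be seen on a toy FASTA stream. The sketch below assumes, as the sed command above does, that every \">\" inside a header line starts another folded entry; `count_records` and `count_accessions` are hypothetical helper names, and the toy records are made up.

```shell
#!/bin/sh
# Hypothetical helper: count FASTA records (lines starting with ">").
count_records() {
    grep -c '^>'
}

# Hypothetical helper: count accessions, treating every ">" inside a
# header line as the start of another (folded) entry.
count_accessions() {
    grep '^>' | tr '>' '\n' | grep -c .
}

# Toy input: the first record carries two folded accessions.
fasta=$(printf '>A1 protein x >B2 protein x\nMKV\n>C3 protein y\nMKT\n')
printf '%s\n' "$fasta" | count_records      # prints 2
printf '%s\n' "$fasta" | count_accessions   # prints 3
```

Using `tr` instead of a newline in the sed replacement keeps the sketch portable to sed implementations that do not interpret `\n` there.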
makeblastdb
pigz -dc nr.$id.fa.gz > nr.$id.fa\n\n# time: 3min (nr.$id.fa from step 3 option 1)\n#\n# building nr.$id.fa from step 3 option 2 with -parse_seqids would produce an error:\n#\n# BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1\n#\nmakeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id\n\n# rm nr.$id.fa\n
blastp (optional)
# blastdb nr.$id is built from sequences in step 3 option 1\n#\nblastp -num_threads 16 -db nr.$id -query t4.fa > t4.fa.blast\n# real 0m20.866s\n\n# $ cat t4.fa.blast | grep Query= -A 10\n# Query= A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-Hd1a\n#\n# Length=35\n Score E\n# Sequences producing significant alignments: (Bits) Value\n\n# 2MPQ_A Chain A, Solution structure of the sodium channel toxin Hd1a 72.4 2e-17\n# A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-... 72.4 2e-17\n# ADB56726.1 HNTX-IV.2 precursor [Haplopelma hainanum] 66.6 9e-15\n# D2Y233.1 RecName: Full=Mu-theraphotoxin-Hhn1b 2; Short=Mu-TRTX-H... 66.6 9e-15\n# ADB56830.1 HNTX-IV.3 precursor [Haplopelma hainanum] 66.6 9e-15\n
You can change the TaxId of interest.
Rank counts of common categories.
$ echo Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae \\\n | rush -D ' ' -T b \\\n 'taxonkit list --ids $(echo {} | taxonkit name2taxid | cut -f 2) \\\n | sed 1d \\\n | taxonkit filter -i 2 -E genus -L genus \\\n | taxonkit lineage -L -r \\\n | csvtk freq -H -t -f 2 -nr \\\n > stats.{}.tsv '\n\n$ csvtk -t join --outer-join stats.*.tsv \\\n | csvtk add-header -t -n \"rank,$(ls stats.*.tsv | rush -k 'echo {@stats.(.+).tsv}' | paste -sd, )\" \\\n | csvtk csv2md -t\n
Similar data on NCBI Taxonomy
rank Archaea Bacteria Eukaryota Fungi Metazoa Viridiplantae species 12482 460940 1349648 156908 957297 191026 strain 354 40643 3486 2352 33 50 genus 205 4112 90882 6844 64148 16202 isolate 7 503 809 76 17 3 species group 2 77 251 22 214 5 serotype 218 serogroup 136 subsection 21 21 subspecies 632 24523 158 17043 7212 forma specialis 521 220 179 33 1 species subgroup 23 101 101 biotype 7 10 morph 12 3 4 5 section 437 37 2 398 genotype 12 12 series 9 5 4 varietas 25 8499 1100 2 7188 forma 4 560 185 6 315 subgenus 1 1558 10 1414 112 pathogroup 5 subvariety 5 5
Count of all ranks
$ time taxonkit list --ids 1 \\\n | taxonkit lineage -L -r \\\n | csvtk freq -H -t -f 2 -nr \\\n | csvtk pretty -H -t\n\nspecies 1879659\nno rank 222743\ngenus 96625\nstrain 44483\nsubspecies 25174\nfamily 9492\nvarietas 8524\nsubfamily 3050\ntribe 2213\norder 1660\nsubgenus 1618\nisolate 1319\nserotype 1216\nclade 886\nsuperfamily 865\nforma specialis 741\nforma 564\nsubtribe 508\nsection 437\nclass 429\nsuborder 372\nspecies group 330\nphylum 272\nsubclass 156\nserogroup 138\ninfraorder 130\nspecies subgroup 124\nsuperorder 55\nsubphylum 33\nparvorder 26\nsubsection 21\ngenotype 20\ninfraclass 18\nbiotype 17\nmorph 12\nkingdom 11\nseries 9\nsuperclass 6\ncohort 5\npathogroup 5\nsubvariety 5\nsuperkingdom 4\nsubcohort 3\nsubkingdom 1\nsuperphylum 1\n\nreal 0m3.663s\nuser 0m15.897s\nsys 0m1.010s\n
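Where csvtk is not available, the frequency step can be approximated with standard Unix tools. The following is only a sketch: rank_freq is a hypothetical helper name, and the sample rows reuse TaxIds and ranks from the taxonkit list examples in this document rather than real command output. It assumes two tab-separated columns (TaxId, rank), the layout that csvtk freq -f 2 relies on above.

```shell
#!/bin/sh
# rank_freq (hypothetical helper): count how often each rank occurs.
# Input : tab-separated lines, TaxId in column 1 and rank in column 2.
# Output: rank<TAB>count, most frequent rank first.
rank_freq() {
  cut -f 2 \
    | sort \
    | uniq -c \
    | sort -k1,1nr \
    | awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print $0 "\t" n }'
}

# Sample rows (assumption: copied from the list examples, not real output).
printf '9605\tgenus\n9606\tspecies\n63221\tsubspecies\n741158\tsubspecies\n1425170\tspecies\n2665952\tno rank\n2665953\tspecies\n' \
  | rank_freq
```

With TaxonKit installed, the helper would be fed by taxonkit list --ids 1 | taxonkit lineage -L -r. Ranks containing spaces, such as "no rank", are kept intact because cut splits on tabs only.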
Ranks of taxa at or below species.
$ taxonkit list --ids 1 \\\n | taxonkit filter --lower-than species --equal-to species \\\n | taxonkit lineage -L -r \\\n | csvtk freq -Ht -nr -f 2 \\\n | csvtk add-header -t -n rank,count \\\n | csvtk pretty -t\n\nrank count\n--------------- -------\nspecies 1880044\nno rank 222756\nstrain 44483\nsubspecies 25171\nvarietas 8524\nisolate 1319\nserotype 1216\nclade 885\nforma specialis 741\nforma 564\nserogroup 138\ngenotype 20\nbiotype 17\nmorph 12\npathogroup 5\nsubvariety 5\n
Sometimes one needs to build a database that combines bacteria and archaea from GTDB with viruses from NCBI. The idea is to export lineages from both GTDB and NCBI using taxonkit reformat, and then create taxdump files from them with taxonkit create-taxdump.
Exporting taxonomic lineages of taxa with rank equal to species from GTDB-taxdump.
taxonkit list --data-dir gtdb-taxdump/R207/ --ids 1 --indent \"\" \\\n | taxonkit filter --data-dir gtdb-taxdump/R207/ --equal-to species \\\n | taxonkit reformat --data-dir gtdb-taxdump/R207/ --taxid-field 1 \\\n --format \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" \\\n -o gtdb.tsv\n
Exporting taxonomic lineages of viral taxa with rank equal to or lower than species from NCBI taxdump. Taxa below species whose rank is \"no rank\" are treated as taxa of strain rank (--pseudo-strain
, taxonkit v0.14.1 needed).
# taxid of Viruses: 10239\ntaxonkit list --data-dir ~/.taxonkit --ids 10239 --indent \"\" \\\n | taxonkit filter --data-dir ~/.taxonkit --equal-to species --lower-than species \\\n | taxonkit reformat --data-dir ~/.taxonkit --taxid-field 1 \\\n --pseudo-strain --format \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\\n -o ncbi-viral.tsv\n
Creating taxdump from lineages above.
(awk '{print $_\"\\t\"}' gtdb.tsv; cat ncbi-viral.tsv) \\\n | taxonkit create-taxdump \\\n --field-accession 1 \\\n -R \"superkingdom,phylum,class,order,family,genus,species,strain\" \\\n -O taxdump\n\n# we use --field-accession 1 to output the mapping file between old taxids and new ones.\n$ grep 2697049 taxdump/taxid.map # SARS-CoV-2\n2697049 216305222\n
Some tests:
# SARS-COV-2 in NCBI taxonomy\n$ echo 2697049 \\\n | taxonkit lineage -t --data-dir ~/.taxonkit \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | taxonkit lineage -r -n -L --data-dir ~/.taxonkit \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -Ht\n10239 superkingdom Viruses\n2559587 clade Riboviria\n2732396 kingdom Orthornavirae\n2732408 phylum Pisuviricota\n2732506 class Pisoniviricetes\n76804 order Nidovirales\n2499399 suborder Cornidovirineae\n11118 family Coronaviridae\n2501931 subfamily Orthocoronavirinae\n694002 genus Betacoronavirus\n2509511 subgenus Sarbecovirus\n694009 species Severe acute respiratory syndrome-related coronavirus\n2697049 no rank Severe acute respiratory syndrome coronavirus 2\n\n$ echo \"Severe acute respiratory syndrome coronavirus 2\" | taxonkit name2taxid --data-dir taxdump/\nSevere acute respiratory syndrome coronavirus 2 216305222\n\n$ echo 216305222 \\\n | taxonkit lineage -t --data-dir taxdump/ \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | taxonkit lineage -r -n -L --data-dir taxdump/ \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -Ht\n1287770734 superkingdom Viruses\n1506901452 phylum Pisuviricota\n1091693597 class Pisoniviricetes\n37745009 order Nidovirales\n738421640 family Coronaviridae\n906833049 genus Betacoronavirus\n1015862491 species Severe acute respiratory syndrome-related coronavirus\n216305222 strain Severe acute respiratory syndrome coronavirus 2\n\n\n\n$ echo \"Escherichia coli\" | taxonkit name2taxid --data-dir taxdump/\nEscherichia coli 1945799576\n\n$ echo 1945799576 \\\n | taxonkit lineage -t --data-dir taxdump/ \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | taxonkit lineage -r -n -L --data-dir taxdump/ \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -Ht\n609216830 superkingdom Bacteria\n1641076285 phylum Proteobacteria\n329474883 class Gammaproteobacteria\n1012954932 order Enterobacterales\n87250111 family Enterobacteriaceae\n1187493883 
genus Escherichia\n1945799576 species Escherichia coli\n
"},{"location":"usage/","title":"Usage and Examples","text":"Table of Contents
taxdump.tar.gz
: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz names.dmp
, nodes.dmp
, delnodes.dmp
and merged.dmp
to data directory: $HOME/.taxonkit
, e.g., /home/shenwei/.taxonkit
,--data-dir
, or environment variable TAXONKIT_DB
.All-in-one command:
wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \ntar -zxvf taxdump.tar.gz\n\nmkdir -p $HOME/.taxonkit\ncp names.dmp nodes.dmp delnodes.dmp merged.dmp $HOME/.taxonkit\n
Update dataset: Simply re-download the taxdump files, uncompress them, and overwrite the old ones.
"},{"location":"usage/#taxonkit","title":"taxonkit","text":"TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit\n\nVersion: 0.16.0\n\nAuthor: Wei Shen <shenwei356@gmail.com>\n\nSource code: https://github.com/shenwei356/taxonkit\nDocuments : https://bioinf.shenwei.me/taxonkit\nCitation : https://www.sciencedirect.com/science/article/pii/S1673852721000837\n\nDataset:\n\n Please download and uncompress \"taxdump.tar.gz\":\n http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz\n\n and copy \"names.dmp\", \"nodes.dmp\", \"delnodes.dmp\" and \"merged.dmp\" to data directory:\n \"/home/shenwei/.taxonkit\"\n\n or some other directory, and later you can refer to using flag --data-dir,\n or environment variable TAXONKIT_DB.\n\n When environment variable TAXONKIT_DB is set, explicitly setting --data-dir will\n overide the value of TAXONKIT_DB.\n\nUsage:\n taxonkit [command]\n\nAvailable Commands:\n cami-filter Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile\n create-taxdump Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV\n filter Filter TaxIds by taxonomic rank range\n genautocomplete generate shell autocompletion script (bash|zsh|fish|powershell)\n lca Compute lowest common ancestor (LCA) for TaxIds\n lineage Query taxonomic lineage of given TaxIds\n list List taxonomic subtrees of given TaxIds\n name2taxid Convert taxon names to TaxIds\n profile2cami Convert metagenomic profile table to CAMI format\n reformat Reformat lineage in canonical ranks\n taxid-changelog Create TaxId changelog from dump archives\n version print version information and check for update\n\nFlags:\n --data-dir string directory containing nodes.dmp and names.dmp (default \"/home/shenwei/.taxonkit\")\n -h, --help help for taxonkit\n --line-buffered use line buffering on output, i.e., immediately writing to stdin/file for\n every line of output\n -o, --out-file string out file (\"-\" for stdout, suffix .gz for gzipped out) (default \"-\")\n -j, 
--threads int number of CPUs. 4 is enough (default 4)\n --verbose print verbose information\n\nUse \"taxonkit [command] --help\" for more information about a command.\n
"},{"location":"usage/#list","title":"list","text":"Usage
List taxonomic subtrees of given TaxIds\n\nAttention:\n 1. When multiple taxids are given, the output may contain duplicated records\n if some taxids are descendants of others.\n\nExamples:\n\n $ taxonkit list --ids 9606 -n -r --indent \" \"\n 9606 [species] Homo sapiens\n 63221 [subspecies] Homo sapiens neanderthalensis\n 741158 [subspecies] Homo sapiens subsp. 'Denisova'\n\n $ taxonkit list --ids 9606 --indent \"\"\n 9606\n 63221\n 741158\n\nUsage:\n taxonkit list [flags]\n\nFlags:\n -h, --help help for list\n -i, --ids string TaxId(s), multiple values should be separated by comma\n -I, --indent string indent (default \" \")\n -J, --json output in JSON format. you can save the result in file with suffix \".json\" and\n open with modern text editor\n -n, --show-name output scientific name\n -r, --show-rank output rank\n
Examples
Default usage.
$ taxonkit list --ids 9605,239934\n9605\n9606\n 63221\n 741158\n1425170\n2665952\n 2665953\n\n239934\n239935\n 349741\n512293\n 512294\n 1131822\n 1262691\n 1263034\n1679444\n2608915\n 1131336\n...\n
Removing indent. The list could be used to extract sequences from BLAST database with blastdbcmd
(see tutorial)
$ taxonkit list --ids 9605,239934 --indent \"\"\n9605\n9606\n63221\n741158\n1425170\n2665952\n2665953\n\n239934\n239935\n349741\n512293\n512294\n1131822\n1262691\n1263034\n1679444\n...\n
Performance: Time and memory usage for the whole taxonomy tree:
$ # emptying the buffers cache\n$ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\"\n\n$ memusg -t taxonkit list --ids 1 --indent \"\" --verbose > t0.txt\n21:05:01.782 [INFO] parsing merged file: /home/shenwei/.taxonkit/names.dmp\n21:05:01.782 [INFO] parsing names file: /home/shenwei/.taxonkit/names.dmp\n21:05:01.782 [INFO] parsing delnodes file: /home/shenwei/.taxonkit/names.dmp\n21:05:01.816 [INFO] 61023 merged nodes parsed\n21:05:01.889 [INFO] 437929 delnodes parsed\n21:05:03.178 [INFO] 2303979 names parsed\n\nelapsed time: 3.290s\npeak rss: 742.77 MB\n
Adding names
$ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934\n9605 [genus] Homo\n 9606 [species] Homo sapiens\n 63221 [subspecies] Homo sapiens neanderthalensis\n 741158 [subspecies] Homo sapiens subsp. 'Denisova'\n 1425170 [species] Homo heidelbergensis\n 2665952 [no rank] environmental samples\n 2665953 [species] Homo sapiens environmental sample\n\n239934 [genus] Akkermansia\n 239935 [species] Akkermansia muciniphila\n 349741 [strain] Akkermansia muciniphila ATCC BAA-835\n 512293 [no rank] environmental samples\n 512294 [species] uncultured Akkermansia sp.\n 1131822 [species] uncultured Akkermansia sp. SMG25\n 1262691 [species] Akkermansia sp. CAG:344\n 1263034 [species] Akkermansia muciniphila CAG:154\n 1679444 [species] Akkermansia glycaniphila\n 2608915 [no rank] unclassified Akkermansia\n 1131336 [species] Akkermansia sp. KLE1605\n 1574264 [species] Akkermansia sp. KLE1797\n...\n
Performance: Time and memory usage for whole taxonomy tree:
$ # emptying the buffers cache\n$ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\"\n\n$ memusg -t taxonkit list --show-rank --show-name --ids 1 > t1.txt\nelapsed time: 5.341s\npeak rss: 1.04 GB\n
Output in JSON format. The taxonomy tree can then be collapsed and expanded easily in a modern text editor.
$ taxonkit list --show-rank --show-name --indent \" \" --ids 9605,239934 --json\n{\n \"9605 [genus] Homo\": {\n \"9606 [species] Homo sapiens\": {\n \"63221 [subspecies] Homo sapiens neanderthalensis\": {\n },\n \"741158 [subspecies] Homo sapiens subsp. 'Denisova'\": {\n }\n },\n \"1425170 [species] Homo heidelbergensis\": {\n }\n },\n \"239934 [genus] Akkermansia\": {\n \"239935 [species] Akkermansia muciniphila\": {\n \"349741 [no rank] Akkermansia muciniphila ATCC BAA-835\": {\n }\n },\n \"512293 [no rank] environmental samples\": {\n \"512294 [species] uncultured Akkermansia sp.\": {\n },\n \"1131822 [species] uncultured Akkermansia sp. SMG25\": {\n },\n \"1262691 [species] Akkermansia sp. CAG:344\": {\n },\n \"1263034 [species] Akkermansia muciniphila CAG:154\": {\n }\n },\n \"1679444 [species] Akkermansia glycaniphila\": {\n },\n \"2608915 [no rank] unclassified Akkermansia\": {\n \"1131336 [species] Akkermansia sp. KLE1605\": {\n },\n \"1574264 [species] Akkermansia sp. KLE1797\": {\n },\n \"1574265 [species] Akkermansia sp. KLE1798\": {\n },\n \"1638783 [species] Akkermansia sp. UNK.MGS-1\": {\n },\n \"1755639 [species] Akkermansia sp. MC_55\": {\n }\n }\n }\n}\n
Snapshot of taxonomy (taxid 1) in kate:
Usage
Query taxonomic lineage of given TaxIds\n\nInput:\n\n - List of TaxIds, one TaxId per line.\n - Or tab-delimited format, please specify TaxId field \n with flag -i/--taxid-field (default 1).\n - Supporting (gzipped) file or STDIN.\n\nOutput:\n\n 1. Input line data.\n 2. (Optional) Status code (-c/--show-status-code), values:\n - \"-1\" for queries not found in whole database.\n - \"0\" for deleted TaxIds, provided by \"delnodes.dmp\".\n - New TaxIds for merged TaxIds, provided by \"merged.dmp\".\n - Taxids for these found in \"nodes.dmp\".\n 3. Lineage, delimiter can be changed with flag -d/--delimiter.\n 4. (Optional) TaxIds taxons in the lineage (-t/--show-lineage-taxids)\n 5. (Optional) Name (-n/--show-name)\n 6. (Optional) Rank (-r/--show-rank)\n\nFilter out invalid and deleted taxids, and replace merged \ntaxids with new ones:\n\n # input is one-column-taxid\n $ taxonkit lineage -c taxids.txt \\\n | awk '$2>0' \\\n | cut -f 2-\n\n # taxids are in 3rd field in a 4-columns tab-delimited file,\n # for $5, where 5 = 4 + 1.\n $ cat input.txt \\\n | taxonkit lineage -c -i 3 \\\n | csvtk filter2 -H -t -f '$5>0' \\\n | csvtk -H -t cut -f -3\n\nUsage:\n taxonkit lineage [flags]\n\nFlags:\n -d, --delimiter string field delimiter in lineage (default \";\")\n -h, --help help for lineage\n -L, --no-lineage do not show lineage, when user just want names or/and ranks\n -R, --show-lineage-ranks appending ranks of all levels\n -t, --show-lineage-taxids appending lineage consisting of taxids\n -n, --show-name appending scientific name\n -r, --show-rank appending rank of taxids\n -c, --show-status-code show status code before lineage\n -i, --taxid-field int field index of taxid. input data should be tab-separated (default 1)\n
Examples
Full lineage:
# note that 123124124 is a fake taxid, 3 was deleted, 92489,1458427 were merged\n$ cat taxids.txt \n9606\n9913\n376619\n349741\n239935\n314101\n11932\n1327037\n123124124\n3\n92489\n1458427\n\n$ taxonkit lineage taxids.txt | tee lineage.txt \n19:22:13.077 [WARN] taxid 92489 was merged into 796334\n19:22:13.077 [WARN] taxid 1458427 was merged into 1458425\n19:22:13.077 [WARN] taxid 123124124 not found\n19:22:13.077 [WARN] taxid 3 was deleted\n9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens\n9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus\n376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS\n349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\n239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B\n11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\n1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y\n123124124\n3\n92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae\n1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raicheisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei\n\n# wrapped table with csvtk pretty (>v0.26.0)\n$ taxonkit lineage taxids.txt | csvtk pretty -Ht -x ';' -W 70 -S bold\n\u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513\n\u2503 9606 \u2503 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria; \u2503\n\u2503 \u2503 Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi; 
\u2503\n\u2503 \u2503 Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota; \u2503\n\u2503 \u2503 Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates; \u2503\n\u2503 \u2503 Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae; \u2503\n\u2503 \u2503 Homo;Homo sapiens \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 9913 \u2503 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria; \u2503\n\u2503 \u2503 Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi; \u2503\n\u2503 \u2503 Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota; \u2503\n\u2503 \u2503 Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla; \u2503\n\u2503 \u2503 Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 376619 \u2503 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria; \u2503\n\u2503 \u2503 Thiotrichales;Francisellaceae;Francisella;Francisella tularensis; \u2503\n\u2503 \u2503 Francisella tularensis subsp. 
holarctica; \u2503\n\u2503 \u2503 Francisella tularensis subsp. holarctica LVS \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 349741 \u2503 cellular organisms;Bacteria;PVC group;Verrucomicrobia; \u2503\n\u2503 \u2503 Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia; \u2503\n\u2503 \u2503 Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 239935 \u2503 cellular organisms;Bacteria;PVC group;Verrucomicrobia; \u2503\n\u2503 \u2503 Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia; \u2503\n\u2503 \u2503 Akkermansia muciniphila 
\u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 314101 \u2503 cellular organisms;Bacteria;environmental samples; \u2503\n\u2503 \u2503 uncultured murine large bowel bacterium BAC 54B \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 11932 \u2503 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes; \u2503\n\u2503 \u2503 Ortervirales;Retroviridae;unclassified Retroviridae; \u2503\n\u2503 \u2503 Intracisternal A-particles;Mouse Intracisternal A-particle \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 1327037 \u2503 
Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes; \u2503\n\u2503 \u2503 Caudovirales;Siphoviridae;unclassified Siphoviridae; \u2503\n\u2503 \u2503 Croceibacter phage P2559Y \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 92489 \u2503 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria; \u2503\n\u2503 \u2503 Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae \u2503\n\u2523\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u254b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u252b\n\u2503 1458427 \u2503 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria; \u2503\n\u2503 \u2503 Burkholderiales;Comamonadaceae;Serpentinomonas; \u2503\n\u2503 \u2503 Serpentinomonas raichei 
\u2503\n\u2517\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u253b\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u251b\n
Speed.
$ time echo 9606 | taxonkit lineage \n9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens\n\nreal 0m1.190s\nuser 0m2.365s\nsys 0m0.170s\n\n# all TaxIds\n$ time taxonkit list --ids 1 --indent \"\" | taxonkit lineage > t\n\nreal 0m4.249s\nuser 0m16.418s\nsys 0m1.221s\n
Checking deleted or merged taxids
$ taxonkit lineage --show-status-code taxids.txt | tee lineage.withcode.txt\n\n# valid\n$ cat lineage.withcode.txt | awk '$2 > 0' | cut -f 1,2\n9606 9606\n9913 9913\n376619 376619\n349741 349741\n239935 239935\n314101 314101\n11932 11932\n1327037 1327037\n92489 796334\n1458427 1458425\n\n# merged\n$ cat lineage.withcode.txt | awk '$2 > 0 && $2 != $1' | cut -f 1,2\n92489 796334\n1458427 1458425\n\n# deleted\n$ cat lineage.withcode.txt | awk '$2 == 0' | cut -f 1\n3\n\n# invalid\n$ cat lineage.withcode.txt | awk '$2 < 0' | cut -f 1\n123124124\n
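The four awk filters above can also be folded into a single pass. This is a minimal sketch, assuming the original TaxId is in column 1 and the status code in column 2, as in lineage.withcode.txt. classify_status is a hypothetical helper name, and the sample rows copy TaxId/status pairs from the outputs above (with -1 for the TaxId that was not found, as described in the usage text).

```shell
#!/bin/sh
# classify_status (hypothetical helper): label each query TaxId.
# Input : tab-separated lines, query TaxId in column 1, status code in column 2.
# Output: TaxId<TAB>label, where label is invalid, deleted, merged or valid.
classify_status() {
  awk -F '\t' '{
    if      ($2 < 0)   label = "invalid"   # not found in the database
    else if ($2 == 0)  label = "deleted"   # listed in delnodes.dmp
    else if ($2 != $1) label = "merged"    # replaced by a new TaxId
    else               label = "valid"
    print $1 "\t" label
  }'
}

# Sample rows (assumption: TaxId/status pairs copied from the examples above).
printf '9606\t9606\n92489\t796334\n3\t0\n123124124\t-1\n' | classify_status
```

With TaxonKit installed, the same helper would read from taxonkit lineage --show-status-code taxids.txt; extra columns such as the lineage are ignored.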
Filtering out invalid and deleted taxids and replacing merged taxids with new ones. The second command requires csvtk.
# input is one-column-taxid\n$ taxonkit lineage -c taxids.txt \\\n | awk '$2>0' \\\n | cut -f 2-\n\n# taxids are in 3rd field in a 4-columns tab-delimited file,\n# for $5, where 5 = 4 + 1.\n$ cat input.txt \\\n | taxonkit lineage -c -i 3 \\\n | csvtk filter2 -H -t -f '$5>0' \\\n | csvtk -H -t cut -f -3\n
Only show name and rank.
$ taxonkit lineage -r -n -L taxids.txt \\\n | csvtk pretty -H -t\n9606 Homo sapiens species\n9913 Bos taurus species\n376619 Francisella tularensis subsp. holarctica LVS strain\n349741 Akkermansia muciniphila ATCC BAA-835 strain\n239935 Akkermansia muciniphila species\n314101 uncultured murine large bowel bacterium BAC 54B species\n11932 Mouse Intracisternal A-particle species\n1327037 Croceibacter phage P2559Y species\n123124124 \n3 \n92489 Erwinia oleae species\n1458427 Serpentinomonas raichei species\n
Show lineage consisting of taxids:
$ taxonkit lineage -t taxids.txt\n9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;40674;32525;9347;1437010;314146;9443;376913;314293;9526;314295;9604;207598;9605;9606\n9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus 131567;2759;33154;33208;6072;33213;33511;7711;89593;7742;7776;117570;117571;8287;1338369;32523;32524;40674;32525;9347;1437010;314145;91561;9845;35500;9895;27592;9903;9913\n376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. 
holarctica LVS 131567;2;1224;1236;72273;34064;262;263;119857;376619\n349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741\n239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935\n314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B 131567;2;48479;314101\n11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle 10239;2559587;2732397;2732409;2732514;2169561;11632;35276;11749;11932\n1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y 10239;2731341;2731360;2731618;2731619;28883;10699;196894;1327037\n123124124\n3\n92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae 131567;2;1224;1236;91347;1903409;551;796334\n1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei 131567;2;1224;28216;80840;80864;2490452;1458425\n
or read taxids from STDIN:
$ cat taxids.txt | taxonkit lineage\n
And ranks of all nodes:
$ echo 2697049 \\\n | taxonkit lineage -t -R \\\n | csvtk transpose -Ht\n2697049\nViruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2\n10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049\nsuperkingdom;clade;kingdom;phylum;class;order;suborder;family;subfamily;genus;subgenus;species;no rank\n
Another way to show the lineage details of a TaxId:
$ echo 2697049 \\\n | taxonkit lineage -t \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -H -t \n10239 superkingdom Viruses\n2559587 clade Riboviria\n2732396 kingdom Orthornavirae\n2732408 phylum Pisuviricota\n2732506 class Pisoniviricetes\n76804 order Nidovirales\n2499399 suborder Cornidovirineae\n11118 family Coronaviridae\n2501931 subfamily Orthocoronavirinae\n694002 genus Betacoronavirus\n2509511 subgenus Sarbecovirus\n694009 species Severe acute respiratory syndrome-related coronavirus\n2697049 no rank Severe acute respiratory syndrome coronavirus 2\n
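If csvtk is not at hand, the name and TaxId columns printed by `taxonkit lineage -t` can also be paired with plain awk. The following is only a minimal sketch: `pair_lineage` is a hypothetical helper name, and it assumes the tab-delimited three-column layout (TaxId, lineage, lineage TaxIds) shown in the examples above. It does not report ranks.

```shell
# pair_lineage (hypothetical helper): read "taxid<TAB>names<TAB>taxids" lines,
# the layout printed by `taxonkit lineage -t`, and print one
# "taxid<TAB>name" row per node of each lineage.
pair_lineage() {
    awk -F '\t' '
        NF >= 3 {
            n = split($2, names, ";")   # lineage names
            m = split($3, ids, ";")     # lineage TaxIds
            if (n != m) next            # skip rows whose columns do not line up
            for (i = 1; i <= n; i++) print ids[i] "\t" names[i]
        }'
}
```

Lines that carry only a TaxId (for example, TaxIds that were not found) have fewer than three fields and are skipped. A possible invocation would be `taxonkit lineage -t taxids.txt | pair_lineage`.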
Usage
Reformat lineage in canonical ranks\n\nInput:\n\n - List of TaxIds or lineages, one record per line.\n The lineage can be a complete lineage or only one taxonomy name.\n - Or tab-delimited format.\n Plese specify the lineage field with flag -i/--lineage-field (default 2).\n Or specify the TaxId field with flag -I/--taxid-field (default 0),\n which overrides -i/--lineage-field.\n - Supporting (gzipped) file or STDIN.\n\nOutput:\n\n 1. Input line data.\n 2. Reformated lineage.\n 3. (Optional) TaxIds taxons in the lineage (-t/--show-lineage-taxids)\n\nAmbiguous names:\n\n - Some TaxIds have the same complete lineage, empty result is returned\n by default. You can use the flag -a/--output-ambiguous-result to\n return one possible result\n\nOutput format can be formated by flag --format, available placeholders:\n\n {k}: superkingdom\n {K}: kingdom\n {p}: phylum\n {c}: class\n {o}: order\n {f}: family\n {g}: genus\n {s}: species\n {t}: subspecies/strain\n\n {S}: subspecies\n {T}: strain\n\nWhen these're no nodes of rank \"subspecies\" nor \"strain\",\nyou can switch on -S/--pseudo-strain to use the node with lowest rank\nas subspecies/strain name, if which rank is lower than \"species\".\nThis flag affects {t}, {S}, {T}.\n\nOutput format can contains some escape charactors like \"\\t\".\n\nUsage:\n taxonkit reformat [flags]\n\nFlags:\n -P, --add-prefix add prefixes for all ranks, single prefix for a rank is defined\n by flag --prefix-X\n -d, --delimiter string field delimiter in input lineage (default \";\")\n -F, --fill-miss-rank fill missing rank with lineage information of the next higher rank\n -f, --format string output format, placeholders of rank are needed (default\n \"{k};{p};{c};{o};{f};{g};{s}\")\n -h, --help help for reformat\n -i, --lineage-field int field index of lineage. 
data should be tab-separated (default 2)\n -r, --miss-rank-repl string replacement string for missing rank\n -p, --miss-rank-repl-prefix string prefix for estimated taxon level (default \"unclassified \")\n -s, --miss-rank-repl-suffix string suffix for estimated taxon names. \"rank\" for rank name, \"\" for no\n suffix (default \"rank\")\n -R, --miss-taxid-repl string replacement string for missing taxid\n -a, --output-ambiguous-result output one of the ambigous result\n --prefix-K string prefix for kingdom, used along with flag -P/--add-prefix (default\n \"K__\")\n --prefix-S string prefix for subspecies, used along with flag -P/--add-prefix\n (default \"S__\")\n --prefix-T string prefix for strain, used along with flag -P/--add-prefix (default\n \"T__\")\n --prefix-c string prefix for class, used along with flag -P/--add-prefix (default \"c__\")\n --prefix-f string prefix for family, used along with flag -P/--add-prefix (default\n \"f__\")\n --prefix-g string prefix for genus, used along with flag -P/--add-prefix (default \"g__\")\n --prefix-k string prefix for superkingdom, used along with flag -P/--add-prefix\n (default \"k__\")\n --prefix-o string prefix for order, used along with flag -P/--add-prefix (default \"o__\")\n --prefix-p string prefix for phylum, used along with flag -P/--add-prefix (default\n \"p__\")\n --prefix-s string prefix for species, used along with flag -P/--add-prefix (default\n \"s__\")\n --prefix-t string prefix for subspecies/strain, used along with flag\n -P/--add-prefix (default \"t__\")\n -S, --pseudo-strain use the node with lowest rank as strain name, only if which rank\n is lower than \"species\" and not \"subpecies\" nor \"strain\". It\n affects {t}, {S}, {T}. This flag needs flag -F\n -t, --show-lineage-taxids show corresponding taxids of reformated lineage\n -I, --taxid-field int field index of taxid. input data should be tab-separated. 
it\n overrides -i/--lineage-field\n -T, --trim do not fill or add prefix for missing rank lower than current rank\n
Examples:
Since v0.8.0, reformat accepts TaxIds as input via the flag -I/--taxid-field.
$ echo 239935 | taxonkit reformat -I 1\n239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n\n$ echo 349741 | taxonkit reformat -I 1 -f \"{k}|{p}|{c}|{o}|{f}|{g}|{s}|{t}\" -F -t\n349741 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila|Akkermansia muciniphila ATCC BAA-835 2|74201|203494|48461|1647988|239934|239935|349741\n
Example lineage (produced by: taxonkit lineage taxids.txt | awk '$2!=\"\"' > lineage.txt
).
$ cat lineage.txt\n9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens\n9913 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Laurasiatheria;Artiodactyla;Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus\n376619 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis;Francisella tularensis subsp. holarctica;Francisella tularensis subsp. holarctica LVS\n349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\n239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B\n11932 Viruses;Riboviria;Pararnavirae;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\n1327037 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y\n92489 cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae\n1458427 cellular organisms;Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei\n
Default output format (\"{k};{p};{c};{o};{f};{g};{s}\"
).
# reformated lineages are appended to the input data\n$ taxonkit reformat lineage.txt \n...\n239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n...\n\n$ \n$ taxonkit reformat lineage.txt | tee lineage.txt.reformat\n\n$ cut -f 1,3 lineage.txt.reformat\n9606 Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens\n9913 Eukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus\n376619 Bacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis\n349741 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B\n11932 Viruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\n1327037 Viruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;;Croceibacter phage P2559Y\n92489 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae\n1458427 Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei\n\n# aligned \n$ cat lineage.txt \\\n | taxonkit reformat \\\n | csvtk -H -t cut -f 1,3 \\\n | csvtk -H -t sep -f 2 -s ';' -R \\\n | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\\n | csvtk pretty -t\n\ntaxid kindom phylum class order family genus species\n------- --------- --------------- ------------------- ------------------ --------------- -------------------------- -----------------------------------------------\n9606 Eukaryota Chordata Mammalia Primates Hominidae Homo 
Homo sapiens\n9913 Eukaryota Chordata Mammalia Artiodactyla Bovidae Bos Bos taurus\n376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis\n349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila\n239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila\n314101 Bacteria uncultured murine large bowel bacterium BAC 54B\n11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle\n1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae Croceibacter phage P2559Y\n92489 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia Erwinia oleae\n1458427 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Serpentinomonas Serpentinomonas raichei\n
The placeholders {t} (subspecies/strain), {S} (subspecies), and {T} (strain) are also available.
# default operation\n$ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\\n | taxonkit lineage -n -r \\\n | taxonkit reformat -f '{t};{S};{T}' \\\n | csvtk -H -t cut -f 1,4,3,5 \\\n | csvtk -H -t sep -f 4 -s ';' -R \\\n | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\\n | csvtk pretty -t\n\ntaxid rank name subspecies/strain subspecies strain\n------- ---------- ----------------------------------------------- --------------------- --------------------- ---------------------\n239935 species Akkermansia muciniphila \n83333 strain Escherichia coli K-12 Escherichia coli K-12 Escherichia coli K-12\n1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178 \n2697049 no rank Severe acute respiratory syndrome coronavirus 2 \n2605619 no rank Escherichia coli O16:H48\n\n# fill missing ranks\n# see example below for -F/--fill-miss-rank\n#\n$ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\\n | taxonkit lineage -n -r \\\n | taxonkit reformat -f '{t};{S};{T}' --fill-miss-rank \\\n | csvtk -H -t cut -f 1,4,3,5 \\\n | csvtk -H -t sep -f 4 -s ';' -R \\\n | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\\n | csvtk pretty -t\n\ntaxid rank name subspecies/strain subspecies strain\n------- ---------- ----------------------------------------------- ------------------------------------------------------------------------------------ ----------------------------------------------------------------------------- -------------------------------------------------------------------------\n239935 species Akkermansia muciniphila unclassified Akkermansia muciniphila subspecies/strain unclassified Akkermansia muciniphila subspecies unclassified Akkermansia muciniphila strain\n83333 strain Escherichia coli K-12 Escherichia coli K-12 unclassified Escherichia coli subspecies Escherichia coli K-12\n1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia 
coli R178 unclassified Escherichia coli R178 strain\n2697049 no rank Severe acute respiratory syndrome coronavirus 2 unclassified Severe acute respiratory syndrome-related coronavirus subspecies/strain unclassified Severe acute respiratory syndrome-related coronavirus subspecies unclassified Severe acute respiratory syndrome-related coronavirus strain\n2605619 no rank Escherichia coli O16:H48 unclassified Escherichia coli subspecies/strain unclassified Escherichia coli subspecies unclassified Escherichia coli strain\n
When there are no nodes of rank \"subspecies\" or \"strain\", you can switch on -S/--pseudo-strain to use the node with the lowest rank as the subspecies/strain name, provided that its rank is lower than \"species\". Using v0.14.1 or a later version is recommended.
$ echo -ne \"239935\\n83333\\n1408252\\n2697049\\n2605619\\n\" \\\n | taxonkit lineage -n -r \\\n | taxonkit reformat -f '{t};{S};{T}' --pseudo-strain \\\n | csvtk -H -t cut -f 1,4,3,5 \\\n | csvtk -H -t sep -f 4 -s ';' -R \\\n | csvtk -H -t add-header -n \"taxid,rank,name,subspecies/strain,subspecies,strain\" \\\n | csvtk pretty -t\n\ntaxid rank name subspecies/strain subspecies strain\n------- ---------- ----------------------------------------------- ----------------------------------------------- ----------------------------------------------- -----------------------------------------------\n239935 species Akkermansia muciniphila\n83333 strain Escherichia coli K-12 Escherichia coli K-12 Escherichia coli K-12\n1408252 subspecies Escherichia coli R178 Escherichia coli R178 Escherichia coli R178\n2697049 no rank Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2 Severe acute respiratory syndrome coronavirus 2\n2605619 no rank Escherichia coli O16:H48 Escherichia coli O16:H48 Escherichia coli O16:H48 Escherichia coli O16:H48\n
Add prefix (-P/--add-prefix
).
$ cat lineage.txt \\\n | taxonkit reformat -P \\\n | csvtk -H -t cut -f 1,3\n\n9606 k__Eukaryota;p__Chordata;c__Mammalia;o__Primates;f__Hominidae;g__Homo;s__Homo sapiens\n9913 k__Eukaryota;p__Chordata;c__Mammalia;o__Artiodactyla;f__Bovidae;g__Bos;s__Bos taurus\n376619 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Thiotrichales;f__Francisellaceae;g__Francisella;s__Francisella tularensis\n349741 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila\n239935 k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Akkermansiaceae;g__Akkermansia;s__Akkermansia muciniphila\n314101 k__Bacteria;p__;c__;o__;f__;g__;s__uncultured murine large bowel bacterium BAC 54B\n11932 k__Viruses;p__Artverviricota;c__Revtraviricetes;o__Ortervirales;f__Retroviridae;g__Intracisternal A-particles;s__Mouse Intracisternal A-particle\n1327037 k__Viruses;p__Uroviricota;c__Caudoviricetes;o__Caudovirales;f__Siphoviridae;g__;s__Croceibacter phage P2559Y\n92489 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Erwiniaceae;g__Erwinia;s__Erwinia oleae\n1458427 k__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Burkholderiales;f__Comamonadaceae;g__Serpentinomonas;s__Serpentinomonas raichei\n
Show the corresponding TaxIds of the reformatted lineage (flag -t/--show-lineage-taxids)
$ cat lineage.txt \\\n | taxonkit reformat -t \\\n | csvtk -H -t cut -f 1,4 \\\n | csvtk -H -t sep -f 2 -s ';' -R \\\n | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\\n | csvtk pretty -t\n\ntaxid kindom phylum class order family genus species\n------- ------ ------- ------- ------- ------- ------- -------\n9606 2759 7711 40674 9443 9604 9605 9606\n9913 2759 7711 40674 91561 9895 9903 9913\n376619 2 1224 1236 72273 34064 262 263\n349741 2 74201 203494 48461 1647988 239934 239935\n239935 2 74201 203494 48461 1647988 239934 239935\n314101 2 314101\n11932 10239 2732409 2732514 2169561 11632 11749 11932\n1327037 10239 2731618 2731619 28883 10699 1327037\n92489 2 1224 1236 91347 1903409 551 796334\n1458427 2 1224 28216 80840 80864 2490452 1458425\n
Use custom symbols for unclassified ranks (-r/--miss-rank-repl)
$ taxonkit reformat lineage.txt -r \"__\" | cut -f 3\nEukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens\nEukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus\nBacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis\nBacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\nBacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\nBacteria;__;__;__;__;__;uncultured murine large bowel bacterium BAC 54B\nViruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\nViruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;__;Croceibacter phage P2559Y\nBacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia oleae\nBacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei\n\n$ taxonkit reformat lineage.txt -r Unassigned | cut -f 3\nEukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens\nEukaryota;Chordata;Mammalia;Artiodactyla;Bovidae;Bos;Bos taurus\nBacteria;Proteobacteria;Gammaproteobacteria;Thiotrichales;Francisellaceae;Francisella;Francisella tularensis\nBacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\nBacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\nBacteria;Unassigned;Unassigned;Unassigned;Unassigned;Unassigned;uncultured murine large bowel bacterium BAC 54B\nViruses;Artverviricota;Revtraviricetes;Ortervirales;Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\nViruses;Uroviricota;Caudoviricetes;Caudovirales;Siphoviridae;Unassigned;Croceibacter phage P2559Y\nBacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Erwiniaceae;Erwinia;Erwinia 
oleae\nBacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Serpentinomonas;Serpentinomonas raichei\n
Estimate and fill missing ranks with original lineage information (-F/--fill-miss-rank), which is very useful for formatting input data for LEfSe. You can change the prefix \"unclassified\" using the flag -p/--miss-rank-repl-prefix.
$ cat lineage.txt \\\n | taxonkit reformat -F \\\n | csvtk -H -t cut -f 1,3 \\\n | csvtk -H -t sep -f 2 -s ';' -R \\\n | csvtk add-header -t -n taxid,kindom,phylum,class,order,family,genus,species \\\n | csvtk pretty -t\n\ntaxid kindom phylum class order family genus species\n------- --------- ---------------------------- --------------------------- --------------------------- ---------------------------- ------------------------------- -----------------------------------------------\n9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens\n9913 Eukaryota Chordata Mammalia Artiodactyla Bovidae Bos Bos taurus\n376619 Bacteria Proteobacteria Gammaproteobacteria Thiotrichales Francisellaceae Francisella Francisella tularensis\n349741 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila\n239935 Bacteria Verrucomicrobia Verrucomicrobiae Verrucomicrobiales Akkermansiaceae Akkermansia Akkermansia muciniphila\n314101 Bacteria unclassified Bacteria phylum unclassified Bacteria class unclassified Bacteria order unclassified Bacteria family unclassified Bacteria genus uncultured murine large bowel bacterium BAC 54B\n11932 Viruses Artverviricota Revtraviricetes Ortervirales Retroviridae Intracisternal A-particles Mouse Intracisternal A-particle\n1327037 Viruses Uroviricota Caudoviricetes Caudovirales Siphoviridae unclassified Siphoviridae genus Croceibacter phage P2559Y\n92489 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Erwiniaceae Erwinia Erwinia oleae\n1458427 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Serpentinomonas Serpentinomonas raichei\n
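The filled names in the table above follow a visible pattern: an empty rank becomes "unclassified", then the name of the nearest higher rank that is present, then the rank name. The awk sketch below imitates that pattern on an already reformatted seven-rank lineage. It is an illustration of the output shown here, not TaxonKit's actual implementation; `fill_missing` is a hypothetical name, and the default format, prefix, and suffix are assumed.

```shell
# fill_missing (hypothetical helper): read seven-rank lineages in the default
# "{k};{p};{c};{o};{f};{g};{s}" layout and replace each empty field with
# "unclassified <nearest higher non-empty name> <rank name>".
fill_missing() {
    awk -F ';' -v OFS=';' '
        BEGIN { split("superkingdom;phylum;class;order;family;genus;species", rank, ";") }
        {
            last = ""                       # nearest higher name seen so far
            for (i = 1; i <= NF; i++) {
                if ($i == "") {
                    if (last != "") $i = "unclassified " last " " rank[i]
                } else {
                    last = $i               # only original names are remembered
                }
            }
            print
        }'
}
```

For real data, `taxonkit reformat -F` remains the right tool, since it works from the full lineage and the taxonomy database rather than from the seven fields alone.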
Do not add prefix or suffix for estimated nodes:
$ echo 314101 | taxonkit reformat -I 1\n314101 Bacteria;;;;;;uncultured murine large bowel bacterium BAC 54B\n$ echo 314101 | taxonkit reformat -I 1 -F -p \"\" -s \"\"\n314101 Bacteria;Bacteria;Bacteria;Bacteria;Bacteria;Bacteria;uncultured murine large bowel bacterium BAC 54B\n
Only some ranks.
$ cat lineage.txt \\\n | taxonkit reformat -F -f \"{s};{p}\"\\\n | csvtk -H -t cut -f 1,3 \\\n | csvtk -H -t sep -f 2 -s ';' -R \\\n | csvtk add-header -t -n taxid,species,phylum \\\n | csvtk pretty -t\n\ntaxid species phylum\n------- ----------------------------------------------- ----------------------------\n9606 Homo sapiens Chordata\n9913 Bos taurus Chordata\n376619 Francisella tularensis Proteobacteria\n349741 Akkermansia muciniphila Verrucomicrobia\n239935 Akkermansia muciniphila Verrucomicrobia\n314101 uncultured murine large bowel bacterium BAC 54B unclassified Bacteria phylum\n11932 Mouse Intracisternal A-particle Artverviricota\n1327037 Croceibacter phage P2559Y Uroviricota\n92489 Erwinia oleae Proteobacteria\n1458427 Serpentinomonas raichei Proteobacteria\n
For TaxIds whose rank is higher than the lowest rank in -f/--format, use -T/--trim to avoid filling missing ranks lower than the current rank.
$ echo -ne \"2\\n239934\\n239935\\n\" \\\n | taxonkit lineage \\\n | taxonkit reformat -F \\\n | sed -r \"s/;+$//\" \\\n | csvtk -H -t cut -f 1,3\n\n2 Bacteria;unclassified Bacteria phylum;unclassified Bacteria class;unclassified Bacteria order;unclassified Bacteria family;unclassified Bacteria genus;unclassified Bacteria species\n239934 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;unclassified Akkermansia species\n239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n\n$ echo -ne \"2\\n239934\\n239935\\n\" \\\n | taxonkit lineage \\\n | taxonkit reformat -F -T \\\n | sed -r \"s/;+$//\" \\\n | csvtk -H -t cut -f 1,3\n\n2 Bacteria\n239934 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia\n239935 Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n
Tabs are supported in the format string:
$ echo 9606 \\\n | taxonkit lineage \\\n | taxonkit reformat -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{S}\" \\\n | csvtk cut -t -f -2\n\n9606 Eukaryota Chordata Mammalia Primates Hominidae Homo Homo sapiens\n
List seven-level lineage for all TaxIds.
# replace empty taxon with \"Unassigned\"\n$ taxonkit list --ids 1 \\\n | taxonkit lineage \\\n | taxonkit reformat -r Unassigned \\\n | gzip -c > all.lineage.tsv.gz\n\n# tab-delimited seven-levels\n$ taxonkit list --ids 1 \\\n | taxonkit lineage \\\n | taxonkit reformat -r Unassigned -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\" \\\n | csvtk cut -H -t -f -2 \\\n | head -n 5 \\\n | csvtk pretty -H -t\n\n# 8-level\n$ taxonkit list --ids 1 \\\n | taxonkit lineage \\\n | taxonkit reformat -r Unassigned -f \"{k}\\t{p}\\t{c}\\t{o}\\t{f}\\t{g}\\t{s}\\t{t}\" \\\n | csvtk cut -H -t -f -2 \\\n | head -n 5 \\\n | csvtk pretty -H -t\n\n# Fill and trim\n$ memusg -t -s ' taxonkit list --ids 1 \\\n | taxonkit lineage \\\n | taxonkit reformat -F -T \\\n | sed -r \"s/;+$//\" \\\n | gzip -c > all.lineage.tsv.gz '\n\nelapsed time: 19.930s\npeak rss: 6.25 GB\n
From taxid to 7-ranks lineage:
$ cat taxids.txt | taxonkit lineage | taxonkit reformat\n\n# for taxonkit v0.8.0 or later versions\n$ cat taxids.txt | taxonkit reformat -I 1\n
Some TaxIds share the same complete lineage; by default, an empty result is returned for them. You can use the flag -a/--output-ambiguous-result to return one possible result. See #42.
$ echo -ne \"2507530\\n2516889\\n\" | taxonkit lineage --data-dir . | taxonkit reformat --data-dir . -t \n19:18:29.770 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. But you can use -a/--output-ambiguous-result to return one possible result\n19:18:29.770 [WARN] we can't distinguish the TaxIds (2507530, 2516889) for lineage: cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019. But you can use -a/--output-ambiguous-result to return one possible result\n2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019\n2516889 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019\n\n$ echo -ne \"2507530\\n2516889\\n\" | taxonkit lineage --data-dir . | taxonkit reformat --data-dir . -t -a\n2507530 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 8 KA-2019 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530\n2516889 cellular organisms;Eukaryota;Opisthokonta;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes incertae sedis;Russulales;Russulaceae;Russula;unclassified Russula;Russula sp. 
8 KA-2019 Eukaryota;Basidiomycota;Agaricomycetes;Russulales;Russulaceae;Russula;Russula sp. 8 KA-2019 2759;5204;155619;452342;5401;5402;2507530\n
Usage
Convert taxon names to TaxIds\n\nAttention:\n\n 1. Some TaxIds share the same names, e.g, Drosophila.\n These input lines are duplicated with multiple TaxIds.\n\n $ echo Drosophila | taxonkit name2taxid | taxonkit lineage -i 2 -r -L\n Drosophila 7215 genus\n Drosophila 32281 subgenus\n Drosophila 2081351 genus\n\nUsage:\n taxonkit name2taxid [flags]\n\nFlags:\n -h, --help help for name2taxid\n -i, --name-field int field index of name. data should be tab-separated (default 1)\n -s, --sci-name only searching scientific names\n -r, --show-rank show rank\n
Examples
Example data
$ cat names.txt\nHomo sapiens\nAkkermansia muciniphila ATCC BAA-835\nAkkermansia muciniphila\nMouse Intracisternal A-particle\nWei Shen\nuncultured murine large bowel bacterium BAC 54B\nCroceibacter phage P2559Y\n
Default.
# taxonkit name2taxid names.txt\n$ cat names.txt | taxonkit name2taxid | csvtk pretty -H -t\nHomo sapiens 9606\nAkkermansia muciniphila ATCC BAA-835 349741\nAkkermansia muciniphila 239935\nMouse Intracisternal A-particle 11932\nWei Shen \nuncultured murine large bowel bacterium BAC 54B 314101\nCroceibacter phage P2559Y 1327037\n
Show rank.
$ cat names.txt | taxonkit name2taxid --show-rank | csvtk pretty -H -t\nHomo sapiens 9606 species\nAkkermansia muciniphila ATCC BAA-835 349741 strain\nAkkermansia muciniphila 239935 species\nMouse Intracisternal A-particle 11932 species\nWei Shen \nuncultured murine large bowel bacterium BAC 54B 314101 species\nCroceibacter phage P2559Y 1327037 species\n
From name to lineage.
$ cat names.txt | taxonkit name2taxid | taxonkit lineage --taxid-field 2\nHomo sapiens 9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens\nAkkermansia muciniphila ATCC BAA-835 349741 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\nAkkermansia muciniphila 239935 cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\nMouse Intracisternal A-particle 11932 Viruses;Ortervirales;Retroviridae;unclassified Retroviridae;Intracisternal A-particles;Mouse Intracisternal A-particle\nWei Shen\nuncultured murine large bowel bacterium BAC 54B 314101 cellular organisms;Bacteria;environmental samples;uncultured murine large bowel bacterium BAC 54B\nCroceibacter phage P2559Y 1327037 Viruses;Caudovirales;Siphoviridae;unclassified Siphoviridae;Croceibacter phage P2559Y\n
Convert old names to new names.
$ echo Lactobacillus fermentum | taxonkit name2taxid | taxonkit lineage -i 2 -n | cut -f 1,2,4\nLactobacillus fermentum 1613 Limosilactobacillus fermentum\n
Some TaxIds share the same scientific name, e.g., Drosophila.
$ echo Drosophila \\\n | taxonkit name2taxid \\\n | taxonkit lineage -i 2 -r \\\n | taxonkit reformat -i 3 \\\n | csvtk cut -H -t -f 1,2,4,5 \\\n | csvtk pretty -H -t\nDrosophila 7215 genus Eukaryota;Arthropoda;Insecta;Diptera;Drosophilidae;Drosophila;\nDrosophila 32281 subgenus Eukaryota;Arthropoda;Insecta;Diptera;Drosophilidae;Drosophila;\nDrosophila 2081351 genus Eukaryota;Basidiomycota;Agaricomycetes;Agaricales;Psathyrellaceae;Drosophila;\n
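Because such input lines are duplicated in the output, one per matching TaxId, it can be useful to flag ambiguous names before processing further. The sketch below counts the TaxIds per name in the two-column, tab-delimited output of name2taxid; `ambiguous_names` is a hypothetical helper, not part of TaxonKit.

```shell
# ambiguous_names (hypothetical helper): read "name<TAB>taxid" lines, as printed
# by `taxonkit name2taxid`, and print "name<TAB>count" for every name that
# maps to more than one TaxId. Names without a TaxId are ignored.
ambiguous_names() {
    awk -F '\t' '
        $2 != "" { count[$1]++ }
        END {
            for (name in count)
                if (count[name] > 1) print name "\t" count[name]
        }'
}
```

A possible invocation would be `cat names.txt | taxonkit name2taxid | ambiguous_names`. The order of the reported names is not defined, since awk does not guarantee array traversal order.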
Usage
Filter TaxIds by taxonomic rank range\n\nAttention:\n\n 1. Flag -L/--lower-than and -H/--higher-than are exclusive, and can be\n used along with -E/--equal-to which values can be different.\n 2. A list of pre-ordered ranks is in ~/.taxonkit/ranks.txt, you can use\n your list by -r/--rank-file, the format specification is below.\n 3. All ranks in taxonomy database should be defined in rank file.\n 4. Ranks can be removed with black list via -B/--black-list.\n\n 5. TaxIDs with no rank are kept by default!!!\n They can be optionally discarded by -N/--discard-noranks.\n 6. [Recommended] When filtering with -L/--lower-than, you can use\n -n/--save-predictable-norank to save some special ranks without order,\n where rank of the closest higher node is still lower than rank cutoff.\n\nRank file:\n\n 1. Blank lines or lines starting with \"#\" are ignored.\n 2. Ranks are in decending order and case ignored.\n 3. Ranks with same order should be in one line separated with comma (\",\", no space).\n 4. 
Ranks without order should be assigned a prefix symbol \"!\" for each rank.\n\nUsage:\n taxonkit filter [flags]\n\nFlags:\n -B, --black-list strings black list of ranks to discard, e.g., '-B \"no rank\" -B \"clade\"\n -N, --discard-noranks discard all ranks without order, type \"taxonkit filter --help\" for details\n -R, --discard-root discard root taxid, defined by --root-taxid\n -E, --equal-to strings output TaxIds with rank equal to some ranks, multiple values can be\n separated with comma \",\" (e.g., -E \"genus,species\"), or give multiple\n times (e.g., -E genus -E species)\n -h, --help help for filter\n -H, --higher-than string output TaxIds with rank higher than a rank, exclusive with --lower-than\n --list-order list user defined ranks in order, from \"$HOME/.taxonkit/ranks.txt\"\n --list-ranks list ordered ranks in taxonomy database, sorted in user defined order\n -L, --lower-than string output TaxIds with rank lower than a rank, exclusive with --higher-than\n -r, --rank-file string user-defined ordered taxonomic ranks, type \"taxonkit filter --help\"\n for details\n --root-taxid uint32 root taxid (default 1)\n -n, --save-predictable-norank do not discard some special ranks without order when using -L, where\n rank of the closest higher node is still lower than rank cutoff\n -i, --taxid-field int field index of taxid. input data should be tab-separated (default 1)\n
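The rank file rules above can be read as a small parsing procedure. The sketch below turns a rank file into "rank, order" pairs, where a smaller number means a higher rank and "-" marks ranks without order. It is only an interpretation of the format description, under the assumption that each non-comment line is one level; `rank_order` is a hypothetical helper and does not reflect how TaxonKit itself loads the file.

```shell
# rank_order (hypothetical helper): parse a rank file following the rules above.
#   - blank lines and lines starting with "#" are ignored
#   - each remaining line is one level, in descending order (case ignored)
#   - ranks sharing a level are separated by commas
#   - ranks prefixed with "!" have no order and are printed with "-"
rank_order() {
    awk '
        /^[ \t]*$/ || /^#/ { next }
        {
            level++
            n = split(tolower($0), ranks, ",")
            for (i = 1; i <= n; i++) {
                r = ranks[i]
                if (substr(r, 1, 1) == "!") print substr(r, 2) "\t-"
                else print r "\t" level
            }
        }'
}
```

With such a table, "lower than genus" simply means an order number greater than that of genus, which is the comparison that -L/--lower-than and -H/--higher-than are described as making.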
Examples
Example data
$ echo 349741 | taxonkit lineage -t | cut -f 3 | sed 's/;/\\n/g' > taxids2.txt\n\n$ cat taxids2.txt\n131567\n2\n1783257\n74201\n203494\n48461\n1647988\n239934\n239935\n349741\n\n$ cat taxids2.txt | taxonkit lineage -r | csvtk -Ht cut -f 1,3,2 | csvtk pretty -H -t\n131567 no rank cellular organisms\n2 superkingdom cellular organisms;Bacteria\n1783257 clade cellular organisms;Bacteria;PVC group\n74201 phylum cellular organisms;Bacteria;PVC group;Verrucomicrobia\n203494 class cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae\n48461 order cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales\n1647988 family cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae\n239934 genus cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia\n239935 species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila\n349741 strain cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835\n
Equal to certain rank(s) (-E/--equal-to
)
$ cat taxids2.txt \\\n | taxonkit filter -E Phylum -E Class \\\n | taxonkit lineage -r \\\n | csvtk -Ht cut -f 1,3,2 \\\n | csvtk pretty -H -t\n74201 phylum cellular organisms;Bacteria;PVC group;Verrucomicrobia\n203494 class cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae\n
Lower than a rank (-L/--lower-than
)
$ cat taxids2.txt \\\n | taxonkit filter -L genus \\\n | taxonkit lineage -r -n -L \\\n | csvtk -Ht cut -f 1,3,2 \\\n | csvtk pretty -H -t\n239935 species Akkermansia muciniphila\n349741 strain Akkermansia muciniphila ATCC BAA-835\n
Higher than a rank (-H/--higher-than
)
$ cat taxids2.txt \\\n | taxonkit filter -H phylum \\\n | taxonkit lineage -r -n -L \\\n | csvtk -Ht cut -f 1,3,2 \\\n | csvtk pretty -H -t\n2 superkingdom Bacteria\n
TaxIDs with no rank are kept by default! \"no rank\" and \"clade\" are ranks without order and can be filtered out via -N/--discard-noranks
. Further ranks can be removed with a blacklist via -B/--black-list
.
# 562 is the TaxId of Escherichia coli\n$ taxonkit list --ids 562 \\\n | taxonkit filter -L species \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk freq -Ht -f 2 -nr \\\n | csvtk pretty -H -t\nstrain 2950\nno rank 149\nserotype 141\nserogroup 95\nisolate 1\nsubspecies 1\n\n$ taxonkit list --ids 562 \\\n | taxonkit filter -L species -N -B strain \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk freq -Ht -f 2 -nr \\\n | csvtk pretty -H -t\nserotype 141\nserogroup 95\nisolate 1\nsubspecies 1\n
Combination of -L/-H
with -E
.
$ cat taxids2.txt \\\n | taxonkit filter -L genus -E genus \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -H -t\n239934 genus Akkermansia\n239935 species Akkermansia muciniphila\n349741 strain Akkermansia muciniphila ATCC BAA-835\n
Special cases of \"no rank\" (-n/--save-predictable-norank
). When filtering with -L/--lower-than
, you can use -n/--save-predictable-norank
to keep some special ranks without order, namely those whose closest higher node has a rank that is still lower than the rank cutoff.
$ echo -ne \"2605619\\n1327037\\n\" \\\n | taxonkit lineage -t \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -H -t \n131567 no rank cellular organisms\n2 superkingdom Bacteria\n1224 phylum Proteobacteria\n1236 class Gammaproteobacteria\n91347 order Enterobacterales\n543 family Enterobacteriaceae\n561 genus Escherichia\n562 species Escherichia coli\n2605619 no rank Escherichia coli O16:H48\n\n10239 superkingdom Viruses\n2731341 clade Duplodnaviria\n2731360 clade Heunggongvirae\n2731618 phylum Uroviricota\n2731619 class Caudoviricetes\n28883 order Caudovirales\n10699 family Siphoviridae\n196894 no rank unclassified Siphoviridae\n1327037 species Croceibacter phage P2559Y\n\n# save taxids\n$ echo -ne \"2605619\\n1327037\\n\" \\\n | taxonkit lineage -t \\\n | csvtk cut -Ht -f 3 \\\n | csvtk unfold -Ht -f 1 -s \";\" \\\n | tee taxids4.txt\n131567\n2\n1224\n1236\n91347\n543\n561\n562\n2605619\n10239\n2731341\n2731360\n2731618\n2731619\n28883\n10699\n196894\n1327037\n
Now, filter nodes of rank <= species.
$ cat taxids4.txt \\\n | taxonkit filter -L species -E species -N -n \\\n | taxonkit lineage -r -n -L \\\n | csvtk cut -Ht -f 1,3,2 \\\n | csvtk pretty -H -t\n562 species Escherichia coli\n2605619 no rank Escherichia coli O16:H48\n1327037 species Croceibacter phage P2559Y\n
Note that 2605619 (no rank) is saved because its parent node 562 is <= species.
Usage
Compute lowest common ancestor (LCA) for TaxIds\n\nAttention:\n\n 1. This command computes LCA TaxId for a list of TaxIds \n in a field (\"-i/--taxids-field) of tab-delimited file or STDIN.\n 2. TaxIDs should have the same separator (\"-s/--separator\"),\n single charactor separator is prefered.\n 3. Empty lines or lines without valid TaxIds in the field are omitted.\n 4. If some TaxIds are not found in database, it returns 0.\n\nExamples:\n\n $ echo 239934, 239935, 349741 | taxonkit lca -s \", \"\n 239934, 239935, 349741 239934\n\n $ time echo 239934 239935 349741 9606 | taxonkit lca\n 239934 239935 349741 9606 131567\n\nUsage:\n taxonkit lca [flags] \n\nFlags:\n -b, --buffer-size string size of line buffer, supported unit: K, M, G. You need to increase the\n value when \"bufio.Scanner: token too long\" error occured (default \"1M\")\n -h, --help help for lca\n --separater string separater for TaxIds. This flag is same to --separator. (default \" \")\n -s, --separator string separator for TaxIds (default \" \")\n -D, --skip-deleted skip deleted TaxIds and compute with left ones\n -U, --skip-unfound skip unfound TaxIds and compute with left ones\n -i, --taxids-field int field index of TaxIds. Input data should be tab-separated (default 1)\n
Examples:
Example data
$ taxonkit list --ids 9605 -nr --indent \" \"\n9605 [genus] Homo\n 9606 [species] Homo sapiens\n 63221 [subspecies] Homo sapiens neanderthalensis\n 741158 [subspecies] Homo sapiens subsp. 'Denisova'\n 1425170 [species] Homo heidelbergensis\n 2665952 [no rank] environmental samples\n 2665953 [species] Homo sapiens environmental sample\n
Simple example
$ echo 63221 2665953 | taxonkit lca\n63221 2665953 9605\n
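As a sketch of the idea behind this result (not TaxonKit's actual implementation), the LCA of several TaxIds is the last TaxId shared by their root-to-node paths. The function name lca_of_paths is hypothetical, and the paths are shortened to start at genus 9605, following the example data above.

```shell
# lca_of_paths: read semicolon-separated root-to-node TaxId paths, one per
# line, and print the last TaxId of their longest common prefix.
# Prints 0 when the paths share no TaxId at all.
lca_of_paths() {
    awk -F';' '
        NR == 1 { n = NF; for (i = 1; i <= NF; i++) p[i] = $i; next }
        {
            # shrink the common prefix until it matches this path
            i = 1
            while (i <= n && i <= NF && p[i] == $i) i++
            n = i - 1
        }
        END { if (n > 0) print p[n]; else print 0 }
    '
}

# Paths of 63221 and 2665953, truncated to start at genus Homo (9605).
printf '9605;9606;63221\n9605;2665952;2665953\n' | lca_of_paths
```

In practice, full paths could come from taxonkit lineage -t, as used elsewhere in this document.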
Custom field (-i/--taxids-field
) and separator (-s/--separator
).
$ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\"\na 63221,2665953\nb 63221, 741158\n\n$ echo -ne \"a\\t63221,2665953\\nb\\t63221, 741158\\n\" \\\n | taxonkit lca -i 2 -s \",\"\na 63221,2665953 9605\nb 63221, 741158 9606\n
Merged TaxIds.
# merged\n$ echo 92487 92488 92489 | taxonkit lca\n10:08:26.578 [WARN] taxid 92489 was merged into 796334\n92487 92488 92489 1236\n
Deleted TaxIds: you can omit these and continue computing with the remaining ones (-D/--skip-deleted
).
$ echo 1 2 3 | taxonkit lca \n10:30:17.678 [WARN] taxid 3 not found\n1 2 3 0\n\n$ time echo 1 2 3 | taxonkit lca -D\n10:29:31.828 [WARN] taxid 3 was deleted\n1 2 3 1\n
TaxIDs not found in the database: you can omit these and continue computing with the remaining ones (-U/--skip-unfound
).
$ echo 61021 61022 11111111 | taxonkit lca\n10:31:44.929 [WARN] taxid 11111111 not found\n61021 61022 11111111 0\n\n$ echo 61021 61022 11111111 | taxonkit lca -U\n10:32:02.772 [WARN] taxid 11111111 not found\n61021 61022 11111111 2628496\n
Usage
Create TaxId changelog from dump archives\n\nAttention:\n 1. This command was originally designed for NCBI taxonomy, where the the TaxIds are stable.\n 2. For other taxonomic data created by \"taxonkit create-taxdump\", e.g., GTDB-taxdump,\n some change events might be wrong, because\n a) There would be dramatic changes between the two versions.\n b) Different taxons in multiple versions might have the same TaxIds, because we only\n check and eliminate taxid collision within a single version.\n So a single version of taxonomic data created by \"taxonkit create-taxdump\" has no problem,\n it's just the changelog might not be perfect.\n\nSteps:\n\n # dependencies:\n # rush - https://github.com/shenwei356/rush/\n\n mkdir -p archive; cd archive;\n\n # --------- download ---------\n\n # option 1\n # for fast network connection\n wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp*.zip\n\n # option 2\n # for slow network connection\n url=https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/\n wget $url -O - -o /dev/null \\\n | grep taxdmp | perl -ne '/(taxdmp_.+?.zip)/; print \"$1\\n\";' \\\n | rush -j 2 -v url=$url 'axel -n 5 {url}/{}' \\\n --immediate-output -c -C download.rush\n\n # --------- unzip ---------\n\n ls taxdmp*.zip | rush -j 1 'unzip {} names.dmp nodes.dmp merged.dmp delnodes.dmp -d {@_(.+)\\.}'\n\n # optionally compress .dmp files with pigz, for saving disk space\n fd .dmp$ | rush -j 4 'pigz {}'\n\n # --------- create log ---------\n\n cd ..\n taxonkit taxid-changelog -i archive -o taxid-changelog.csv.gz --verbose\n\nOutput format (CSV):\n\n # fields comments\n taxid # taxid\n version # version / time of archive, e.g, 2019-07-01\n change # change, values:\n # NEW newly added\n # REUSE_DEL deleted taxids being reused\n # REUSE_MER merged taxids being reused\n # DELETE deleted\n # MERGE merged into another taxid\n # ABSORB other taxids merged into this one\n # CHANGE_NAME scientific name changed\n # CHANGE_RANK rank changed\n # 
CHANGE_LIN_LIN lineage taxids remain but lineage changed\n # CHANGE_LIN_TAX lineage taxids changed\n # CHANGE_LIN_LEN lineage length changed\n change-value # variable values for changes: \n # 1) new taxid for MERGE\n # 2) merged taxids for ABSORB\n # 3) empty for others\n name # scientific name\n rank # rank\n lineage # complete lineage of the taxid\n lineage-taxids # taxids of the lineage\n\n # you can use csvtk to investigate them. e.g.,\n csvtk grep -f taxid -p 1390515 taxid-changelog.csv.gz\n\nUsage:\n taxonkit taxid-changelog [flags]\n\nFlags:\n -i, --archive string directory containing uncompressed dumped archives\n -h, --help help for taxid-changelog\n
Details
Example 1 (E.coli with taxid 562
)
$ pigz -cd taxid-changelog.csv.gz \\\n | csvtk grep -f taxid -p 562 \\\n | csvtk pretty\ntaxid version change change-value name rank lineage lineage-taxids\n562 2014-08-01 NEW Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562\n562 2014-08-01 ABSORB 662101;662104 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562\n562 2015-11-01 ABSORB 1637691 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562\n562 2016-10-01 CHANGE_LIN_LIN Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562\n562 2018-06-01 ABSORB 469598 Escherichia coli species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli 131567;2;1224;1236;91347;543;561;562\n\n# merged taxids\n$ pigz -cd taxid-changelog.csv.gz \\\n | csvtk grep -f taxid -p 662101,662104,1637691,469598 \\\n | csvtk pretty\ntaxid version change change-value name rank lineage lineage-taxids\n469598 2014-08-01 NEW Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598\n469598 2016-10-01 CHANGE_LIN_LIN Escherichia sp. 3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598\n469598 2018-06-01 MERGE 562 Escherichia sp. 
3_2_53FAA species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. 3_2_53FAA 131567;2;1224;1236;91347;543;561;469598\n662101 2014-08-01 MERGE 562 \n662104 2014-08-01 MERGE 562 \n1637691 2015-04-01 DELETE \n1637691 2015-05-01 REUSE_DEL Escherichia sp. MAR species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. MAR 131567;2;1224;1236;91347;543;561;1637691\n1637691 2015-11-01 MERGE 562 Escherichia sp. MAR species cellular organisms;Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia;Escherichia sp. MAR 131567;2;1224;1236;91347;543;561;1637691\n
Example 2 (SARS-CoV-2).
$ time pigz -cd taxid-changelog.csv.gz \\\n | csvtk grep -f taxid -p 2697049 \\\n | csvtk pretty\ntaxid version change change-value name rank lineage lineage-taxids\n2697049 2020-02-01 NEW Wuhan seafood market pneumonia virus species Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;unclassified Betacoronavirus;Wuhan seafood market pneumonia virus 10239;2559587;76804;2499399;11118;2501931;694002;696098;2697049\n2697049 2020-03-01 CHANGE_NAME Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049\n2697049 2020-03-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049\n2697049 2020-03-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;76804;2499399;11118;2501931;694002;2509511;694009;2697049\n2697049 2020-06-01 CHANGE_LIN_LEN Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 
10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049\n2697049 2020-07-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 isolate Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049\n2697049 2020-08-01 CHANGE_RANK Severe acute respiratory syndrome coronavirus 2 no rank Viruses;Riboviria;Orthornavirae;Pisuviricota;Pisoniviricetes;Nidovirales;Cornidovirineae;Coronaviridae;Orthocoronavirinae;Betacoronavirus;Sarbecovirus;Severe acute respiratory syndrome-related coronavirus;Severe acute respiratory syndrome coronavirus 2 10239;2559587;2732396;2732408;2732506;76804;2499399;11118;2501931;694002;2509511;694009;2697049\n\nreal 0m7.644s\nuser 0m16.749s\nsys 0m3.985s\n
Example 3 (All subspecies and strains in Akkermansia muciniphila 239935)
# species in Akkermansia\n$ taxonkit list --show-rank --show-name --indent \" \" --ids 239935\n239935 [species] Akkermansia muciniphila\n 349741 [strain] Akkermansia muciniphila ATCC BAA-835\n\n# check them all \n$ pigz -cd taxid-changelog.csv.gz \\\n | csvtk grep -f taxid -P <(taxonkit list --indent \"\" --ids 239935) \\\n | csvtk pretty lineage-taxids\ntaxid version change change-value name rank lineage lineage-taxids\n239935 2014-08-01 NEW Akkermansia muciniphila species cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Akkermansia;Akkermansia muciniphila 131567;2;51290;74201;203494;48461;203557;239934;239935\n239935 2015-05-01 CHANGE_LIN_TAX Akkermansia muciniphila species cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;51290;74201;203494;48461;1647988;239934;239935\n239935 2016-03-01 CHANGE_LIN_TAX Akkermansia muciniphila species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935\n239935 2016-05-01 ABSORB 1834199 Akkermansia muciniphila species cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila 131567;2;1783257;74201;203494;48461;1647988;239934;239935\n349741 2014-08-01 NEW Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;51290;74201;203494;48461;203557;239934;239935;349741\n349741 2015-05-01 CHANGE_LIN_TAX Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;Chlamydiae/Verrucomicrobia 
group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;51290;74201;203494;48461;1647988;239934;239935;349741\n349741 2016-03-01 CHANGE_LIN_TAX Akkermansia muciniphila ATCC BAA-835 no rank cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741\n349741 2020-07-01 CHANGE_RANK Akkermansia muciniphila ATCC BAA-835 strain cellular organisms;Bacteria;PVC group;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae;Akkermansia;Akkermansia muciniphila;Akkermansia muciniphila ATCC BAA-835 131567;2;1783257;74201;203494;48461;1647988;239934;239935;349741\n
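A minimal sketch for pulling one kind of change event out of the changelog with plain awk, using the column order from the output format (taxid, version, change, change-value, ...). The function name merged_into and the sample file are hypothetical; the three rows are copied from Example 1. The comma splitting is naive and assumes no quoted fields, so csvtk remains the safer choice for the real file.

```shell
# merged_into: print "old-taxid new-taxid version" for every MERGE event.
# Naive comma splitting: fine for these sample rows, not for quoted fields.
merged_into() {
    awk -F',' 'NR > 1 && $3 == "MERGE" { print $1, $4, $2 }'
}

# Sample rows taken from Example 1 (hypothetical file name).
cat > changelog-sample.csv <<'EOF'
taxid,version,change,change-value,name,rank,lineage,lineage-taxids
662101,2014-08-01,MERGE,562,,,,
662104,2014-08-01,MERGE,562,,,,
1637691,2015-04-01,DELETE,,,,,
EOF

merged_into < changelog-sample.csv
```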
More
"},{"location":"usage/#create-taxdump","title":"create-taxdump","text":"Usage
Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV\n\nInput format:\n 0. For GTDB taxonomy file, just use --gtdb.\n We use the numeric assembly accession as the taxon at subspecies rank.\n (without the prefix GCA_ and GCF_, and version number).\n 1. The input file should be tab-delimited, at least one column is needed.\n 2. Ranks can be given either via the first row or the flag --rank-names.\n 3. The column containing the genome/assembly accession is recommended to\n generate TaxId mapping file (taxid.map, id -> taxid).\n -A/--field-accession, field contaning genome/assembly accession\n --field-accession-re, regular expression to extract the accession\n Note that mutiple TaxIds pointing to the same accession are listed as\n comma-seperated integers.\n\nAttention:\n 1. Duplicated taxon names wit different ranks are allowed since v0.16.0, since\n the rank and taxon name are contatenated for generating the TaxId.\n 2. The generated TaxIds are not consecutive numbers, however some tools like MMSeqs2\n required this, you can use the script below for convertion:\n\n https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/fix_taxdump.py\n\n 3. 
We only check and eliminate taxid collision within a single version of taxonomy data.\n Therefore, if you create taxid-changelog with \"taxid-changelog\", different taxons\n in multiple versions might have the same TaxIds and some change events might be wrong.\n\n So a single version of taxonomic data created by \"taxonkit create-taxdump\" has no problem,\n it's just the changelog might not be perfect.\n\nUsage:\n taxonkit create-taxdump [flags]\n\nFlags:\n -A, --field-accession int field index of assembly accession (genome ID), for outputting\n taxid.map\n -S, --field-accession-as-subspecies treate the accession as subspecies rank\n --field-accession-re string regular expression to extract assembly accession (default \"^(.+)$\")\n --force overwrite existing output directory\n --gtdb input files are GTDB taxonomy file\n --gtdb-re-subs string regular expression to extract assembly accession as the\n subspecies (default \"^\\\\w\\\\w_GC[AF]_(.+)\\\\.\\\\d+$\")\n -h, --help help for create-taxdump\n --line-chunk-size int number of lines to process for each thread, and 4 threads is\n fast enough. (default 5000)\n --null strings null value of taxa (default [,NULL,NA])\n -x, --old-taxdump-dir string taxdump directory of the previous version, for generating\n merged.dmp and delnodes.dmp\n -O, --out-dir string output directory\n -R, --rank-names strings names of all ranks, leave it empty to use the (lowercase) first\n row of input as rank names\n
Examples:
GTDB. See more: https://github.com/shenwei356/gtdb-taxdump
$ taxonkit create-taxdump --gtdb ar53_taxonomy_r207.tsv.gz bac120_taxonomy_r207.tsv.gz --out-dir taxdump\n16:42:35.213 [INFO] 317542 records saved to taxdump/taxid.map\n16:42:35.460 [INFO] 401815 records saved to taxdump/nodes.dmp\n16:42:35.611 [INFO] 401815 records saved to taxdump/names.dmp\n16:42:35.611 [INFO] 0 records saved to taxdump/merged.dmp\n16:42:35.611 [INFO] 0 records saved to taxdump/delnodes.dmp\n
ICTV. See more: https://github.com/shenwei356/ictv-taxdump
MGV. Only order, family, and genus information is available.
$ cat mgv_contig_info.tsv \\\n | csvtk cut -t -f ictv_order,ictv_family,ictv_genus,votu_id,contig_id \\\n | sed 1d \\\n > mgv.tsv\n\n$ taxonkit create-taxdump mgv.tsv --out-dir mgv --force -A 5 -R order,family,genus,species\n23:33:18.098 [INFO] 189680 records saved to mgv/taxid.map\n23:33:18.131 [INFO] 58102 records saved to mgv/nodes.dmp\n23:33:18.150 [INFO] 58102 records saved to mgv/names.dmp\n23:33:18.150 [INFO] 0 records saved to mgv/merged.dmp\n23:33:18.150 [INFO] 0 records saved to mgv/delnodes.dmp\n\n$ head -n 5 mgv/taxid.map \nMGV-GENOME-0364295 677052301\nMGV-GENOME-0364296 677052301\nMGV-GENOME-0364303 1414406025\nMGV-GENOME-0364311 1849074420\nMGV-GENOME-0364312 2074846424\n\n$ echo 677052301 | taxonkit lineage --data-dir mgv/ \n677052301 Caudovirales;crAss-phage;OTU-61123\n\n$ echo 677052301 | taxonkit reformat --data-dir mgv/ -I 1 -P\n677052301 k__;p__;c__;o__Caudovirales;f__crAss-phage;g__;s__OTU-61123\n\n$ grep MGV-GENOME-0364295 mgv.tsv \nCaudovirales crAss-phage NULL OTU-61123 MGV-GENOME-0364295\n
Custom lineages with the first row as rank names and treating one column as accession.
$ csvtk pretty -t example/taxonomy.tsv \nid superkingdom phylum class order family genus species\n--------------- ------------ -------------- ------------------- ---------------- ------------------ -------------- --------------------------\nGCF_001027105.1 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus aureus\nGCF_001096185.1 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Streptococcus pneumoniae\nGCF_001544255.1 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecium\nGCF_002949675.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Shigella Shigella dysenteriae\nGCF_002950215.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Shigella Shigella flexneri\nGCF_006742205.1 Bacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus epidermidis\nGCF_000006945.2 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Salmonella Salmonella enterica\nGCF_000017205.1 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas Pseudomonas aeruginosa\nGCF_003697165.2 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia Escherichia coli\nGCF_009759685.1 Bacteria Proteobacteria Gammaproteobacteria Moraxellales Moraxellaceae Acinetobacter Acinetobacter baumannii\nGCF_000148585.2 Bacteria Firmicutes Bacilli Lactobacillales Streptococcaceae Streptococcus Streptococcus mitis\nGCF_000392875.1 Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus Enterococcus faecalis\nGCF_000742135.1 Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Klebsiella Klebsiella pneumonia\n\n# the first column as accession\n$ taxonkit create-taxdump -A 1 example/taxonomy.tsv -O example/taxdump\n16:31:31.828 [INFO] I will use the first row of input as rank names\n16:31:31.843 
[INFO] 13 records saved to example/taxdump/taxid.map\n16:31:31.843 [INFO] 39 records saved to example/taxdump/nodes.dmp\n16:31:31.843 [INFO] 39 records saved to example/taxdump/names.dmp\n16:31:31.843 [INFO] 0 records saved to example/taxdump/merged.dmp\n16:31:31.843 [INFO] 0 records saved to example/taxdump/delnodes.dmp\n\n$ export TAXONKIT_DB=example/taxdump\n$ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r | csvtk pretty -Ht\n793223984 Bacteria;Proteobacteria;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii species\n1220345221 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa species\n561101225 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella flexneri species\n1969112428 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella dysenteriae species\n599451526 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli species\n2034984046 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica species\n1859674812 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae species\n773201972 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus species\n1295317147 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus epidermidis species\n182402976 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecium species\n1566113429 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis species\n891083107 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species\n1357145446 
Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species\n\n$ head -n 3 example/taxdump/taxid.map\nGCF_001027105.1 773201972\nGCF_001096185.1 891083107\nGCF_001544255.1 182402976\n
Custom lineages with the first row as rank names (pure lineage data)
$ csvtk cut -t -f 2- example/taxonomy.tsv | head -n 2 | csvtk pretty -t \nsuperkingdom phylum class order family genus species\n------------ ---------- ------- ---------- ----------------- -------------- ---------------------\nBacteria Firmicutes Bacilli Bacillales Staphylococcaceae Staphylococcus Staphylococcus aureus\n\n$ csvtk cut -t -f 2- example/taxonomy.tsv \\\n | taxonkit create-taxdump -O example/taxdump2\n16:53:08.604 [INFO] I will use the first row of input as rank names\n16:53:08.614 [INFO] 39 records saved to example/taxdump2/nodes.dmp\n16:53:08.614 [INFO] 39 records saved to example/taxdump2/names.dmp\n16:53:08.614 [INFO] 0 records saved to example/taxdump2/merged.dmp\n16:53:08.615 [INFO] 0 records saved to example/taxdump2/delnodes.dmp\n\n$ export TAXONKIT_DB=example/taxdump2\n$ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r | head -n 2\n793223984 Bacteria;Proteobacteria;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii species\n1220345221 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa species\n
Usage
Generate shell autocompletion script\n\nSupported shell: bash|zsh|fish|powershell\n\nBash:\n\n # generate completion shell\n taxonkit genautocomplete --shell bash\n\n # configure if never did.\n # install bash-completion if the \"complete\" command is not found.\n echo \"for bcfile in ~/.bash_completion.d/* ; do source \\$bcfile; done\" >> ~/.bash_completion\n echo \"source ~/.bash_completion\" >> ~/.bashrc\n\nZsh:\n\n # generate completion shell\n taxonkit genautocomplete --shell zsh --file ~/.zfunc/_taxonkit\n\n # configure if never did\n echo 'fpath=( ~/.zfunc \"${fpath[@]}\" )' >> ~/.zshrc\n echo \"autoload -U compinit; compinit\" >> ~/.zshrc\n\nfish:\n\n taxonkit genautocomplete --shell fish --file ~/.config/fish/completions/taxonkit.fish\n\nUsage:\n taxonkit genautocomplete [flags]\n\nFlags:\n --file string autocompletion file (default \"/home/shenwei/.bash_completion.d/taxonkit.sh\")\n -h, --help help for genautocomplete\n --type string autocompletion type (currently only bash supported) (default \"bash\")\n
"},{"location":"usage/#profile2cami","title":"profile2cami","text":"Usage
Convert metagenomic profile table to CAMI format\n\nInput format: \n 1. The input file should be tab-delimited\n 2. At least two columns needed:\n a) TaxId of taxon at species or lower rank.\n b) Abundance (could be percentage, automatically detected or use -p/--percentage).\n\nAttention:\n 1. Some TaxIds may be merged to another ones in current taxonomy version,\n the abundances will be summed up.\n 2. Some TaxIds may be deleted in current taxonomy version,\n the abundances can be optionally recomputed with the flag -R/--recompute-abd.\n\nUsage:\n taxonkit profile2cami [flags]\n\nFlags:\n -a, --abundance-field int field index of abundance. input data should be tab-separated (default 2)\n -h, --help help for profile2cami\n -0, --keep-zero keep taxons with abundance of zero\n -p, --percentage abundance is in percentage\n -R, --recompute-abd recompute abundance if some TaxIds are deleted in current taxonomy version\n -s, --sample-id string sample ID in result file\n -r, --show-rank strings only show TaxIds and names of these ranks (default\n [superkingdom,phylum,class,order,family,genus,species,strain])\n -i, --taxid-field int field index of taxid. input data should be tab-separated (default 1)\n -t, --taxonomy-id string taxonomy ID in result file\n
Examples
Test data. Note that 2824115
is merged into 483329
and 1657696
is deleted in the current taxonomy version.
$ cat example/abundance.tsv \n2824115 0.2 merged to 483329\n483329 0.2 absord 2824115\n239935 0.5 no change\n1657696 0.1 deleted\n
Example:
$ taxonkit profile2cami -s sample1 -t 2021-10-01 \\\n example/abundance.tsv\n\n13:17:40.552 [WARN] taxid is deleted in current taxonomy version: 1657696\n13:17:40.552 [WARN] you may recomputed abundance with the flag -R/--recompute-abd\n@SampleID:sample1\n@Version:0.10.0\n@Ranks:superkingdom|phylum|class|order|family|genus|species|strain\n@TaxonomyID:2021-10-01\n@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE\n2 superkingdom 2 Bacteria 50.000000000000000\n2759 superkingdom 2759 Eukaryota 40.000000000000000\n74201 phylum 2|74201 Bacteria|Verrucomicrobia 50.000000000000000\n6656 phylum 2759|6656 Eukaryota|Arthropoda 40.000000000000000\n203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 50.000000000000000\n50557 class 2759|6656|50557 Eukaryota|Arthropoda|Insecta 40.000000000000000\n48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 50.000000000000000\n7041 order 2759|6656|50557|7041 Eukaryota|Arthropoda|Insecta|Coleoptera 40.000000000000000\n1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 50.000000000000000\n57514 family 2759|6656|50557|7041|57514 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae 40.000000000000000\n239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 50.000000000000000\n57515 genus 2759|6656|50557|7041|57514|57515 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus 40.000000000000000\n239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 50.000000000000000\n483329 species 2759|6656|50557|7041|57514|57515|483329 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus|Nicrophorus carolina 40.000000000000000\n
Recompute (normalize) the abundance
$ taxonkit profile2cami -s sample1 -t 2021-10-01 \\\n example/abundance.tsv --recompute-abd\n13:19:23.647 [WARN] taxid is deleted in current taxonomy version: 1657696\n@SampleID:sample1\n@Version:0.10.0\n@Ranks:superkingdom|phylum|class|order|family|genus|species|strain\n@TaxonomyID:2021-10-01\n@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE\n2 superkingdom 2 Bacteria 55.555555555555557\n2759 superkingdom 2759 Eukaryota 44.444444444444450\n74201 phylum 2|74201 Bacteria|Verrucomicrobia 55.555555555555557\n6656 phylum 2759|6656 Eukaryota|Arthropoda 44.444444444444450\n203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 55.555555555555557\n50557 class 2759|6656|50557 Eukaryota|Arthropoda|Insecta 44.444444444444450\n48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 55.555555555555557\n7041 order 2759|6656|50557|7041 Eukaryota|Arthropoda|Insecta|Coleoptera 44.444444444444450\n1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 55.555555555555557\n57514 family 2759|6656|50557|7041|57514 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae 44.444444444444450\n239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 55.555555555555557\n57515 genus 2759|6656|50557|7041|57514|57515 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus 44.444444444444450\n239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 55.555555555555557\n483329 species 2759|6656|50557|7041|57514|57515|483329 Eukaryota|Arthropoda|Insecta|Coleoptera|Silphidae|Nicrophorus|Nicrophorus carolina 44.444444444444450\n
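The recomputed percentages above can be reproduced by hand. The following is a minimal sketch, not TaxonKit's implementation: the function name `recompute` and the hard-coded `merged` and `deleted` tables are assumptions for illustration, with values copied from example/abundance.tsv.

```python
# Illustrative only: a hypothetical re-implementation of the normalization
# behind -R/--recompute-abd. Values come from example/abundance.tsv.
abundance = {"2824115": 0.2, "483329": 0.2, "239935": 0.5, "1657696": 0.1}
merged = {"2824115": "483329"}  # old TaxId -> TaxId it was merged into
deleted = {"1657696"}           # TaxIds missing from the current taxonomy


def recompute(abundance, merged, deleted):
    """Drop deleted TaxIds, fold merged ones together, rescale to 100%."""
    totals = {}
    for taxid, value in abundance.items():
        if taxid in deleted:
            continue
        taxid = merged.get(taxid, taxid)
        totals[taxid] = totals.get(taxid, 0.0) + value
    total = sum(totals.values())
    return {taxid: 100.0 * value / total for taxid, value in totals.items()}


percentages = recompute(abundance, merged, deleted)
for taxid, pct in sorted(percentages.items()):
    print(taxid, f"{pct:.6f}")
```

The remaining abundance is 0.5 + 0.4 = 0.9, so 0.5 / 0.9 gives about 55.56% for Akkermansia muciniphila and 0.4 / 0.9 gives about 44.44% for Nicrophorus carolina, matching the PERCENTAGE column.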
See https://github.com/shenwei356/sun2021-cami-profiles
Usage
Remove taxa of given TaxIds and their descendants in CAMI metagenomic profile\n\nInput format: \n The CAMI (Taxonomic) Profiling Output Format \n - https://github.com/CAMI-challenge/contest_information/blob/master/file_formats/CAMI_TP_specification.mkd\n - One file with mutiple samples is also supported.\n\nHow to:\n - No extra taxonomy data needed, so the original taxonomic information are\n used and not changed.\n - A mini taxonomic tree is built from records with abundance greater than\n zero, and only leaves are retained for later use. The rank of leaves may\n be \"strain\", \"species\", or \"no rank\".\n - Relative abundances (in percentage) are recomputed for all leaves\n (reference genome).\n - A new taxonomic tree is built from these leaves, and abundances are \n cumulatively added up from leaves to the root.\n\nExamples:\n 1. Remove Archaea, Bacteria, and EukaryoteS, only keep Viruses:\n taxonkit cami-filter -t 2,2157,2759 test.profile -o test.filter.profile\n 2. Remove Viruses:\n taxonkit cami-filter -t 10239 test.profile -o test.filter.profile\n\nUsage:\n taxonkit cami-filter [flags]\n\nFlags:\n --field-percentage int field index of PERCENTAGE (default 5)\n --field-rank int field index of taxid (default 2)\n --field-taxid int field index of taxid (default 1)\n --field-taxpath int field index of TAXPATH (default 3)\n --field-taxpathsn int field index of TAXPATHSN (default 4)\n -h, --help help for cami-filter\n --leaf-ranks strings only consider leaves at these ranks (default [species,strain,no rank])\n --show-rank strings only show TaxIds and names of these ranks (default\n [superkingdom,phylum,class,order,family,genus,species,strain])\n --taxid-sep string separator of taxid in TAXPATH and TAXPATHSN (default \"|\")\n -t, --taxids strings the parent taxid(s) to filter out\n -f, --taxids-file strings file(s) for the parent taxid(s) to filter out, one taxid per line\n
Examples:
taxonkit profile2cami -s sample1 -t 2021-10-01 \\\n example/abundance.tsv --recompute-abd \\\n | taxonkit cami-filter -t 2759\n@SampleID:sample1\n@Version:0.10.0\n@Ranks:superkingdom|phylum|class|order|family|genus|species|strain\n@TaxonomyID:2021-10-01\n@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE\n2 superkingdom 2 Bacteria 100.000000000000000\n74201 phylum 2|74201 Bacteria|Verrucomicrobia 100.000000000000000\n203494 class 2|74201|203494 Bacteria|Verrucomicrobia|Verrucomicrobiae 100.000000000000000\n48461 order 2|74201|203494|48461 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales 100.000000000000000\n1647988 family 2|74201|203494|48461|1647988 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae 100.000000000000000\n239934 genus 2|74201|203494|48461|1647988|239934 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia 100.000000000000000\n239935 species 2|74201|203494|48461|1647988|239934|239935 Bacteria|Verrucomicrobia|Verrucomicrobiae|Verrucomicrobiales|Akkermansiaceae|Akkermansia|Akkermansia muciniphila 100.000000000000000\n
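The "How to" steps in the usage text (keep leaves, drop filtered taxa, renormalize, accumulate up to the root) can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the actual cami-filter code: `cami_filter` is a hypothetical helper that takes leaf records as `(TAXPATH, PERCENTAGE)` pairs and ignores ranks and names.

```python
def cami_filter(leaves, remove_taxids):
    """Hypothetical sketch of the filtering logic.

    leaves: list of (taxpath, percentage), taxpath like "2|74201|...|239935".
    Returns {taxid: cumulative percentage} for the remaining taxa.
    """
    remove = set(remove_taxids)
    # Keep a leaf only if none of its ancestors (or itself) is filtered out.
    kept = []
    for taxpath, pct in leaves:
        path = taxpath.split("|")
        if not remove.intersection(path):
            kept.append((path, pct))

    total = sum(pct for _, pct in kept)
    if total == 0:
        return {}

    # Renormalize leaves, then add each leaf's share to all of its ancestors.
    cumulative = {}
    for path, pct in kept:
        share = 100.0 * pct / total
        for taxid in path:
            cumulative[taxid] = cumulative.get(taxid, 0.0) + share
    return cumulative


leaves = [
    ("2|74201|203494|48461|1647988|239934|239935", 55.555555555555557),
    ("2759|6656|50557|7041|57514|57515|483329", 44.444444444444450),
]
result = cami_filter(leaves, ["2759"])
for taxid, pct in result.items():
    print(taxid, f"{pct:.6f}")
```

With Eukaryota (2759) removed, the only remaining leaf carries the whole abundance, so every TaxId on its path ends up at 100%, as in the example output above.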
NCBI taxonomy, version 2021-01-21
TaxIDs. Root node 1
is removed. These data should be updated along with the NCBI taxonomy dataset. TaxId sets of different sizes are sampled from nodes.dmp
.
# shuffle all taxids\ncut -f 1 nodes.dmp | grep -w -v 1 | shuf > ids.txt\n\n# extract n taxids for testing\nfor n in 1 10 100 1000 2000 4000 6000 8000 10000 20000 40000 60000 80000 100000; do \n head -n $n ids.txt > taxids.n$n.txt\ndone\n
ETE
sudo pip3 install ete3\n\n# create database\n# http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html#upgrading-the-local-database\nfrom ete3 import NCBITaxa\nncbi = NCBITaxa()\nncbi.update_taxonomy_database()\n
TaxonKit
mkdir -p $HOME/.taxonkit\nmkdir -p $HOME/bin/\n\n# data\nwget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz \ntar -zxvf taxdump.tar.gz -C $HOME/.taxonkit\n\n# binary\nwget https://github.com/shenwei356/taxonkit/releases/download/v0.7.2/taxonkit_linux_amd64.tar.gz\ntar -zxvf taxonkit_linux_amd64.tar.gz -C $HOME/bin/\n
taxopy
sudo pip3 install -U taxopy\n\n# taxoopy identical dump files copied from taxonkit\nmkdir -p ~/.taxopy\ncp ~/.taxonkit/{nodes.dmp,names.dmp} ~/.taxopy\n
The scripts and commands are listed below. Python scripts were written following the official documentation, and parallelized querying was not used for any tool, including TaxonKit.
ETE get_lineage.ete.py < $infile > $outfile\ntaxopy get_lineage.taxopy.py < $infile > $outfile\ntaxonkit taxonkit lineage --threads 1 --delimiter \"; \" < $infile > $outfile\n
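For readers unfamiliar with what a lineage query involves, here is a minimal sketch in plain Python. It is not the code of any benchmarked tool. It only assumes the NCBI dump layout (fields separated by tab, pipe, tab) and uses a three-node toy excerpt; the helper names `parse_nodes`, `parse_names` and `lineage` are hypothetical.

```python
# Toy excerpts in the nodes.dmp / names.dmp layout (illustrative data only).
NODES = (
    "1\t|\t1\t|\tno rank\t|\n"
    "131567\t|\t1\t|\tno rank\t|\n"
    "2\t|\t131567\t|\tsuperkingdom\t|\n"
)
NAMES = (
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "131567\t|\tcellular organisms\t|\t\t|\tscientific name\t|\n"
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n"
    "2\t|\teubacteria\t|\t\t|\tgenbank common name\t|\n"
)


def split_fields(line):
    """Split one dump line into its fields."""
    return line.rstrip("\n").rstrip("\t|").split("\t|\t")


def parse_nodes(text):
    """Return {taxid: parent taxid} from nodes.dmp content."""
    parent = {}
    for line in text.splitlines():
        fields = split_fields(line)
        parent[int(fields[0])] = int(fields[1])
    return parent


def parse_names(text):
    """Return {taxid: scientific name} from names.dmp content."""
    names = {}
    for line in text.splitlines():
        fields = split_fields(line)
        if fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names


def lineage(taxid, parent, names, delimiter="; "):
    """Walk from taxid up to the root (its own parent), excluding the root."""
    path = []
    while parent[taxid] != taxid:
        path.append(names.get(taxid, str(taxid)))
        taxid = parent[taxid]
    return delimiter.join(reversed(path))


parent = parse_nodes(NODES)
names = parse_names(NAMES)
print(lineage(2, parent, names))
```

Excluding the root node from the output is a choice made for this sketch; each tool decides for itself how to format the lineage.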
A Python script, memusg, was used to compute the running time and peak memory usage of a process. A Perl script, run.pl
, was used to run the tests automatically and generate data for plotting.
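To give a rough idea of the kind of measurement involved, the sketch below times a child process and reads its peak resident memory with Python's standard library. It is not memusg and does not reproduce how memusg works; the function name `measure` is hypothetical. It assumes a Unix-like system, where `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and return (exit code, elapsed seconds, peak RSS of children).

    Caveat: RUSAGE_CHILDREN reports the maximum over all child processes
    waited for so far, so call this once per Python process to get a value
    that belongs to a single command.
    """
    start = time.perf_counter()
    proc = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    elapsed = time.perf_counter() - start
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return proc.returncode, elapsed, peak


# Example: measure a trivial Python child process.
returncode, elapsed, peak = measure([sys.executable, "-c", "pass"])
print(f"exit={returncode} time={elapsed:.3f}s peak_rss={peak}")
```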
Running benchmark:
$ # emptying the buffers cache\n$ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\"\n\ntime perl run.pl -n 3 run_benchmark.sh -o bench.get_lineage.tsv\n
Checking result:
$ md5sum taxids.n*.lineage\n\n# clear\n$ rm *.lineage *.out\n
Plotting the benchmark results. The R libraries dplyr
, ggplot2
, scales
, ggthemes
, ggrepel
are needed.
# reformat dataset\n# tools: https://github.com/shenwei356/csvtk/\n\nfor f in taxids.n*.txt; do wc -l $f; done \\\n | sort -k 1,1n \\\n | awk '{ print($2\"\\t\"$1) }' \\\n > dataset_rename.tsv\n\ncat bench.get_lineage.tsv \\\n | csvtk sort -t -L dataset:<(cut -f 1 dataset_rename.tsv) -k dataset:u -k app \\\n | csvtk replace -t -f dataset -k dataset_rename.tsv -p '(.+)' -r '{kv}' \\\n > bench.get_lineage.reformat.tsv\n\n./plot2.R -i bench.get_lineage.reformat.tsv --width 6 --height 4 --dpi 600 \\\n --labcolor \"log10(queries)\" --labshape \"Tools\"\n
Result
"},{"location":"bench/#benchmark-2-taxonkit-multi-threaded-scalability","title":"Benchmark 2: TaxonKit multi-threaded scalability","text":"Running benchmark:
$ # emptying the buffers cache\n$ su -c \"free && sync && echo 3 > /proc/sys/vm/drop_caches && free\"\n\n\n$ time perl run.pl -n 3 run_benchmark_taxonkit.sh -o bench.taxonkit.tsv\n$ rm *.lineage *.out\n
Plotting benchmark result.
cat bench.taxonkit.tsv \\\n | csvtk sort -t -L dataset:<(cut -f 1 dataset_rename.tsv) -k dataset:u -k app \\\n | csvtk replace -t -f dataset -k dataset_rename.tsv -p '(.+)' -r '{kv}' \\\n > bench.taxonkit.reformat.tsv\n\n./plot_threads2.R -i bench.taxonkit.reformat.tsv --width 6 --height 4 --dpi 600 \\\n --labcolor \"log10(queries)\" --labshape \"Threads\"\n
Result
"}]} \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index 9c74b1d..57bdc89 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,42 +2,42 @@TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit
-Version: 0.15.1
+Version: 0.16.0
Author: Wei Shen <shenwei356@gmail.com>
@@ -711,7 +711,7 @@ taxonkitlistLink
Usage
List taxonomic subtrees of given TaxIds
-Attentions:
+Attention:
1. When multiple taxids are given, the output may contain duplicated records
if some taxids are descendants of others.
@@ -1783,7 +1783,7 @@ filterFilter TaxIds by taxonomic rank range
-Attentions:
+Attention:
1. Flag -L/--lower-than and -H/--higher-than are exclusive, and can be
used along with -E/--equal-to which values can be different.
@@ -2119,6 +2119,16 @@ taxid-changelogCreate TaxId changelog from dump archives
+Attention:
+ 1. This command was originally designed for NCBI taxonomy, where the TaxIds are stable.
+ 2. For other taxonomic data created by "taxonkit create-taxdump", e.g., GTDB-taxdump,
+ some change events might be wrong, because
+ a) There would be dramatic changes between the two versions.
+ b) Different taxons in multiple versions might have the same TaxIds, because we only
+ check and eliminate taxid collision within a single version.
+ So a single version of taxonomic data created by "taxonkit create-taxdump" has no problem,
+ it's just the changelog might not be perfect.
+
Steps:
# dependencies:
@@ -2268,7 +2278,7 @@ create-taxdumpCreate NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV
-Input format:
+Input format:
0. For GTDB taxonomy file, just use --gtdb.
We use the numeric assembly accession as the taxon at subspecies rank.
(without the prefix GCA_ and GCF_, and version number).
@@ -2276,60 +2286,47 @@ create-taxdumpcreate-taxdump16:31:31.843 [INFO] 0 records saved to example/taxdump/delnodes.dmp
$ export TAXONKIT_DB=example/taxdump
-$ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r
-1527235303 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species
-2983929374 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species
-3809813362 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis species
-4145431389 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecium species
-1569132721 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus species
-1920251658 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus epidermidis species
-3843752343 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa species
-72054943 Bacteria;Proteobacteria;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii species
-1678121664 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica species
-524994882 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella dysenteriae species
-2695851945 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella flexneri species
-3958205156 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae species
-4093283224 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli species
+$ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r | csvtk pretty -Ht
+793223984 Bacteria;Proteobacteria;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii species
+1220345221 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa species
+561101225 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella flexneri species
+1969112428 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Shigella;Shigella dysenteriae species
+599451526 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli species
+2034984046 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Salmonella;Salmonella enterica species
+1859674812 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae species
+773201972 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus species
+1295317147 Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus epidermidis species
+182402976 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecium species
+1566113429 Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;Enterococcus faecalis species
+891083107 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species
+1357145446 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species
$ head -n 3 example/taxdump/taxid.map
-GCF_001027105.1 1569132721
-GCF_001096185.1 2983929374
-GCF_001544255.1 4145431389
+GCF_001027105.1 773201972
+GCF_001096185.1 891083107
+GCF_001544255.1 182402976
@@ -2448,8 +2445,8 @@ create-taxdump$ export TAXONKIT_DB=example/taxdump2
$ taxonkit list --ids 1 | taxonkit filter -E species | taxonkit lineage -r | head -n 2
-1527235303 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus mitis species
-2983929374 Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pneumoniae species
+793223984 Bacteria;Proteobacteria;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter baumannii species
+1220345221 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas aeruginosa species