create-taxdump: accepts arbitrary ranks #60

Closed
shenwei356 opened this issue May 21, 2022 · 1 comment

Comments

@shenwei356
Owner

This issue comes from shenwei356/ictv-taxdump#1.

The highest rank in the ICTV taxonomy is the "realm", which is currently ignored in ictv-taxdump. Because the taxonkit create-taxdump command only supports a fixed number of ranks, there is no way to include it without removing other ranks. For my purposes, having the realm is usually more important.

Ideally, we would have a taxdump that includes all the ICTV ranks (including subgenus, subfamily, suborder, etc.), but this might conflict with taxonkit's philosophy of using NCBI's "canonical ranks".

I think this requires reimplementing the command, or adding a new command, that accepts arbitrary ranks. It should be easy.

  • It still accepts a tab-delimited table as input, but the column order determines the hierarchy of ranks.
  • Rank names can be given in the first row or via the --rank-names option, without the limit of 8 ranks.
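
To make the proposal concrete, here is a minimal Python sketch of how column order alone could define the hierarchy. It is not taxonkit's implementation: the function name `lineage_edges`, the placeholder taxon names, and the choice to attach a taxon to the nearest non-null ancestor when a rank is missing are all assumptions for illustration. The null values mirror the default of `--null`.

```python
NULLS = {"", "NULL", "NA"}  # assumed to mirror the default of --null


def lineage_edges(table_text, rank_names=None):
    """Yield (rank, name, parent_name) once for every taxon in a tab-delimited table.

    Columns are read left to right, so the column order is the rank hierarchy.
    If rank_names is None, the first row supplies the rank names.
    """
    rows = [line.split("\t") for line in table_text.splitlines() if line]
    if rank_names is None:
        rank_names, rows = rows[0], rows[1:]
    seen = set()
    for row in rows:
        parent = None  # taxa in the first non-null column have no parent
        for rank, name in zip(rank_names, row):
            if name in NULLS:
                continue  # missing rank: keep the last known parent (assumption)
            key = (rank, name, parent)
            if key not in seen:
                seen.add(key)
                yield key
            parent = name


# Hypothetical input: any number of ranks, hierarchy given by column order.
table = "realm\tfamily\tgenus\nRealmA\tFamilyA\tGenusA\nRealmA\tNA\tGenusB\n"
for edge in lineage_edges(table):
    print(edge)
```

Because nothing in the sketch depends on a fixed list of ranks, adding a "realm" or "subfamily" column only changes the header row.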
shenwei356 added a commit that referenced this issue May 30, 2022
create-taxdump: accepts arbitrary ranks. #60
@shenwei356
Owner Author

The GTDB mode (--gtdb) remains compatible with the previous version; its behavior is unchanged.

The command can now also handle the ICTV taxonomy better (shenwei356/ictv-taxdump#1).

Usage

Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB and ICTV

Input format: 
  0. For GTDB taxonomy file, just use --gtdb.
     We use the numeric assembly accession as the taxon at subspecies rank.
     (without the prefix GCA_ and GCF_, and version number).
  1. The input file should be tab-delimited, at least one column is needed.
  2. Ranks can be given either via the first row or the flag --rank-names.
  3. The column containing the genome/assembly accession is recommended to
     generate TaxId mapping file (taxid.map, id -> taxid).
       -A/--field-accession,    field containing genome/assembly accession
       --field-accession-re,    regular expression to extract the accession
     Note that multiple TaxIds pointing to the same accession are listed as
     comma-separated integers.

Attentions:
  1. Names should be distinct in taxa of different ranks.
     But for lineages missing some taxon nodes, using the names of parent nodes is allowed:

       GB_GCA_018897955.1      d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155

     It can also detect duplicate names with different ranks, e.g.,
     the Class and Genus have the same name B47-G6, and the Order and Family
     between them have different names. In this case, we reassign a new TaxId
     by incrementing the TaxId until it is distinct.

       GB_GCA_003663585.1      d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585

  2. Taxa from different parents may have the same name.
     We will assign different TaxIds to them. 

     E.g., in ICTV, many viruses from different species have the same names.
     In practice, we set the "Virus name(s)" as a sub-species rank and also
     specify it as the accession.

       Species             Virus name(s)
       Jerseyvirus SETP3   Salmonella phage SETP7
       Jerseyvirus SETP7   Salmonella phage SETP7
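
The two notes above can be sketched in Python as follows. This is only an illustration under stated assumptions: the help text says that a conflicting TaxId is increased until it is distinct, but it does not say how the initial TaxId is derived, so the CRC32 hash below is a stand-in, and `assign_taxid` and the `used` registry are hypothetical names.

```python
import zlib


def assign_taxid(name, lineage, used):
    """Return a TaxId for the taxon (name, lineage), bumping it on conflicts.

    used maps TaxId -> (name, lineage) and is updated in place.
    The CRC32-based starting value is a stand-in, not taxonkit's actual hash.
    """
    key = (name, lineage)
    taxid = (zlib.crc32(name.lower().encode("utf-8")) & 0x7FFFFFFF) or 1
    # Increase the TaxId until it is free or already belongs to this taxon.
    while taxid in used and used[taxid] != key:
        taxid += 1
    used[taxid] = key
    return taxid


used = {}
# Same virus name under two different species: two distinct TaxIds.
a = assign_taxid("Salmonella phage SETP7", ("Jerseyvirus SETP3",), used)
b = assign_taxid("Salmonella phage SETP7", ("Jerseyvirus SETP7",), used)
print(a != b)
```

Asking again for a taxon that was already registered returns the TaxId it received the first time, so the assignment is stable for a given input order.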

Usage:
  taxonkit create-taxdump [flags] 

Flags:
  -A, --field-accession int         field index of assembly accession (genome ID), for outputting taxid.map
      --field-accession-re string   regular expression to extract assembly accession (default
                                    "^\\w\\w_(.+)$")
      --force                       overwrite existing output directory
      --gtdb                        input files are GTDB taxonomy file
      --gtdb-re-subs string         regular expression to extract assembly accession as the subspecies
                                    (default "^\\w\\w_GC[AF]_(.+)\\.\\d+$")
  -h, --help                        help for create-taxdump
      --line-chunk-size int         number of lines to process for each thread, and 4 threads is fast
                                    enough. (default 5000)
      --null strings                null value of taxa (default [,NULL,NA])
  -x, --old-taxdump-dir string      taxdump directory of the previous version, for generating merged.dmp
                                    and delnodes.dmp
  -O, --out-dir string              output directory
  -R, --rank-names strings          names of all ranks, leave it empty to use the first row of input as
                                    rank names
