-
Notifications
You must be signed in to change notification settings - Fork 8
8. Updating reference databases
In a new folder, download updated version of MitoFish database according to workflow outlined in Section 1.
- Get accession number of complete mitogenomes
The list of complete mitogenomes has not been updated since Dec 2021, so they don't have to be processed unless there are new genomes
mkdir mitogenomes
cd mitogenomes
wget http://mitofish.aori.u-tokyo.ac.jp/files/mitogenomes.zip
unzip mitogenomes.zip
ls *.fa | cut -d "_" -f1-2 | grep -v complete | sort | uniq >../complete.new.accession
- Get accession number of complete+partial mitogenomes
cd ..
wget http://mitofish.aori.u-tokyo.ac.jp/files/complete_partial_mitogenomes.zip
unzip complete_partial_mitogenomes.zip
grep ">" mito-all | cut -d "|" -f2 | sort | uniq >complete.partial.new.accession
Combine and sort accession numbers - skip if there are no new complete mitogenomes
cat complete.new.accession complete.partial.new.accession | sort | uniq >mitofish.new.accession
Get accession numbers of current dataset
cut -f1 mitofish.old.tsv | grep -v "Accession" | sort | uniq >mitofish.old.accession
- Get list of newly added accession numbers using
comm
comm -23 mitofish.new.accession mitofish.old.accession >new.accession
- Get list of accession numbers missing in the new MitoFish version. These records may either just not be present in the MitoFish database, or have been removed by the submitter due to redundancies etc.
comm -13 mitofish.new.accession mitofish.old.accession >missing.accession
- Check if line numbers add up
wc -l mitofish.new.accession mitofish.old.accession new.accession missing.accession
- Get gene definitions and sequences using NCBI's
efetch
Unix tool bundled within Entrez Direct
To install the EDirect software,
wget ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh
and execute
for i in `cat new.accession`
do
echo $i
# Get gene definition
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_definition >>new.definition.tsv
# Get Taxonomy ID
echo "#$i," >>new.taxid.tsv
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDQualifier -element INSDQualifier_value | grep taxon | sed "s/taxon://" >>new.taxid.tsv
done
# Sort output by accession number
sort -k 1 new.definition.tsv | uniq | tr "#" "_" >new.definition.sorted.tsv
cat new.taxid.tsv | tr -d "\n" | tr "#" "\n" | tr "," "\t" | sort -k 1 | uniq | grep -v ^$ >new.taxid.sorted.tsv
Note:
efetch
works rather quickly and has the advantage of being able to retrieve unverified NCBI records. However, it does not output class information and uniform number of columns for taxonomy
- Run
efetch
to retrieve gene definitions formissing.accession
This is to check whether missing accessions are unverified/deleted records or just not included in MitoFish
for i in `cat missing.accession`
do
echo $i
# Get gene definition
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_definition >>missing.definition.tsv
Look through missing.definition.tsv
to check for UNVERIFIED records. Then, check whether line numbers of missing.accession
and missing.definition.tsv
match. If they match, that means these "missing" records are not removed from NCBI.
wc -l missing.accession missing.definition.tsv
Scenario #1: If missing.accession
and missing.definition.tsv
have the same line numbers (see example output below), this means that these "missing" records are not removed from NCBI. They are simply not present in the MitoFish database
4678 missing.accession
4678 missing.definition.tsv
9356 total
Scenario #2: If missing.definition.tsv
contains less number of lines than missing.accession
, this means that some records have indeed been removed from NCBI. Using the commands below, we can extract these "real" missing accession numbers to remove them from the latest mitohelper release.
# Get accession numbers from missing.definition.tsv
cut -f1 missing.definition.tsv >notmissing.acc
grep -v -f notmissing.acc missing.accession >realmissing.accession
- We use RESCRIPt to retrieve sequence and taxonomy information:
Nowadays, RESCRIPt is not perfectly synced with the most current NCBI data, resulting in records with no taxonomic information. Batches with missing data leads to plugin errors where RESCRIPt is unable to complete partial downloads. Because of this, records with missing data have to be sequentially removed from the accession list and RESCRIPt will have to be re-run multiple times with increasingly smaller batch sizes (specified in the
split -l
command below).
# Split accession number into smaller files of 1,000 lines each
# RESCRIPt requires a header for input accession ids
echo "id" >id
split -l 1000 new.accession
for i in `ls xa*`
do
cat id $i >rescript.acc
qiime rescript get-ncbi-data --m-accession-ids-file rescript.acc --o-sequences "$i".seq.qza --o-taxonomy "$i".tax.qza --p-n-jobs 5
qiime tools export --input-path "$i".tax.qza --output-path "$i".taxout
qiime tools export --input-path "$i".seq.qza --output-path "$i".seqout
done
- Format sequences of new records from RESCRIPt output
cat *seqout/dna-sequences.fasta | tr "\n" "\t" | tr ">" "\n" | sort -k 1 | uniq | grep -v ^$ >new.seq.sorted.tsv
- Convert taxonomy to mitohelper's TSV format
cat *taxout/taxonomy.tsv | grep -v "Feature ID" | sed "s/k__Metazoa/Eukaryota/" \
| sed "s/; [p/c/o/f/g]__/;/g" | tr ";" "\t" \
| awk -F "\t" '{OFS=";"}{print $1,$2,$3,$4,$5,$6,$7,$7,$8}' \
| sed "s/; s__/ /" | tr ";" "\t" | sort -k 1 | uniq >new.tax.sorted.tsv
- Use
efetch
to retrieve sequences of records that cannot be downloaded by RESCRIPt. These sequences can be added tonew.seq.sorted.tsv
. Taxonomy data can be downloaded, manually processed, and added tonew.tax.sorted.tsv
.
for i in `cat rescript.missing.acc`
do
echo $i
# Get sequence data
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_sequence >>rescript.missing.seq.tsv
# Get taxonomy data - this is not formatted like the mitohelper's datasets and should be manually formatted in Excel or other software
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_taxonomy INSDSeq_organism >>rescript.missing.tax.tsv
done
Format retrieved sequence data to match Mitohelper's format - replace DNA bases in lower case to upper case
cat rescript.missing.seq.tsv | tr '[:lower:]' '[:upper:]' >>new.seq.sorted.tsv
Another way to retrieve the gene definitions is to follow the instructions in Section 2 and Section 3. This is efficient if you have a large number of gene definitions to retrieve.
-
Add classifications scheme used in "Fishes of the World, 5th edition" (Nelson et al. 2016) to
new.tax.sorted.tsv
viavlookup
in Excel or R merge command -
Remove carriage return
^M
characters
cat new.tax.sorted.tsv | tr -d '\r' >new.taxonomy.sorted.tsv
- Combine new gene definition list, taxonomy information, and sequences into a tab-separated table according to the reference file format. Also, remove carriage return
^M
characters. Filter outUNVERIFIED
records.
paste -d "\t" new.definition.sorted.tsv new.taxid.sorted.tsv new.taxonomy.sorted.tsv new.seq.sorted.tsv \
| awk -F "\t" '{OFS="\t"}{print $1,$2,$4,$6,$7,$8,$9,$10,$11,$12,$16,$13,$14}'| tr -d '\r' \
| grep -v UNVERIFIED | sort | uniq >new.records.tsv
- Append table with new records to previous reference file
cat mitofish.old.tsv new.records.tsv >mitofish.add.tsv
- Filter out records that were removed, if present
grep -v -f realmissing.accession mitofish.add.tsv >mitofish."$month".xxx.tsv