7. Create 12S and COI reference files

The algorithm for extracting 12S rRNA records has been updated in the Nov2022 version to be more specific to genomes with 12S rRNA gene annotations. Unannotated mitochondrial genomes are now excluded from mitohelper's reference files. Sequences of the 12S rRNA gene are now extracted from complete mitochondrial genomes to construct a more gene-specific 12S rRNA dataset.

Get 12S rRNA gene sequences from complete mitogenomes

Get list of ribosomal and complete mitogenome accession numbers from current mitohelper release mitofish.all.$month.tsv (if re-populating entire database) or from new.records.tsv (if updating current database)

cat mitofish.all."$month".tsv  | grep -e ribosomal -e rRNA -e 12S | cut -f1 | sort | uniq >ribosomal.new.acc

grep -e "genome" -e "complete mito" -e "DNA, complete sequence" \
mitofish.all."$month".tsv | cut -f1 | sort | uniq >complete.new.mitogenome.acc

cat ribosomal.new.acc complete.new.mitogenome.acc | sort | uniq >ribosomal.complete.acc

From a list of complete mitogenome accession numbers, we use efetch to retrieve the annotations and sequences in each mitogenome

for i in `cat ribosomal.complete.acc`
do
echo $i
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDQualifier -element INSDQualifier_value >"$i".features
done

Example truncated output (e.g. MN394789.features)

Coregonus migratorius
mitochondrion
genomic DNA
Kul52
taxon:884168
tRNA-Phe
GCTGGCGTAGCTTAACTAAAGCATAACACTGAAGCTGTTAAGATGGACCCTAGAAAGTCCCGCAGGCA
12S ribosomal RNA
CAAAGGCTTGGTCCTGACTTTATTATCAGCTTTAACTGAACTTACA
tRNA-Val
CAGAGTGTAGCTAAGACAGAAGAGCACCTCCCTTACACCGAGAAGACATCCGTGCAAATCGGGTCACCCTGA

Example 12S rRNA gene annotations. Note the varying line numbers and format.

12S ribosomal RNA
CAAAGGCTTGGTCCTGACTTTACT

12S ribosomal RNA
putative
AAAAGATTTAATCCTAACCTTTCTATCAGCTCTAACTCAAACTACACATG

12S ribosomal RNA
non-experimental evidence, no additional details recorded
CAAAGGCTTGGTCCTGACTTTACTGTCAACTTTAGCTAGACTT

12S ribosomal RNA
rrnS
CAAAGGCTTGGTCCTGACTTTACTATCAGCTTTAACCCAATTT

12S ribosomal RNA
SAAAGGCCTGGTCCTGACTTTTTTGTCGGCTTTTACTATACTTATACATGCAAGCA\

12S ribosomal RNA
putative
GeneID:64984682
CAAAGGCTTGGTCCTGACTTTACTATCAGCTT

Extract 12S rRNA gene sequences from mitogenome annotations

grep -A 3 -e "12S" -e "s-rRNA" -e "12 ribo" -e "12 rRNA" -e "small subunit ribo" -e "small ribo" -e "rrnS" *features | \
sed "s/^.*features-[ATGCN][ATGCN][ATGCN][ATGCN][ATGCN]/#&/" | \
grep -B 1 "#" | sed "s/#.*-/#/" | sed s"/^.*features/>&/"  | grep -v "\-\-" | tr -d "\n" | tr ">" "\n" | \
grep -v tRNA | sed "s/^/#>/" | tr "#" "\n" | sed "s/.features.*$//" | grep -v ^$ | \
grep -v "^>$" >12S.mitogenome.fasta

Check how many genomes do not contain the 12S rRNA gene. Review and update manually as some genomes contain non ATCGN sequence characters, e.g. S in KJ433564:

ls *.features | cut -d "." -f1 | sort | uniq >12S.dedup.acc
grep ">" 12S.mitogenome.fasta| tr -d ">" | sort | uniq >12S.grep.acc
grep -v -f 12S.grep.acc 12S.dedup.acc >12S.nohit.acc

Some genomes are complete but do not contain any annotations (e.g. OD909576). These will not be included in the 12S reference files.
Retrieve records for processed genomes

Reformat fasta files to tab-separated format

cat 12S.mitogenome.fasta | sed "s/>.*$/&#/" | tr -d "\n" | tr "#" "\t" | tr ">" "\n" | grep -v ^$ | sort -k 1 >12S.seq.sorted.tsv

Get sequence IDs. Considering that some mitogenomes have >12S rRNA genes, extract the fasta accession numbers without deduplication

grep ">" 12S.mitogenome.fasta | cut -d ">" -f2 | sort >12S.seqID.acc

If updating, use grep to retrieve metadata for these 12S records. grep is performed line by line because some records have multiple copies of the 12S rRNA gene:

for i in `cat 12S.seqID.acc`
do
grep -m 1 "$i" new.records.tsv >>new.12S.tsv
done

paste -d "\t" 12S.seq.sorted.tsv new.12S.tsv  | awk -F "\t" '{OFS="\t"}{print $1,$4,$5,$6,$7,$8,$9,$10,$11,$12,$2,$14,$15}' >mitofish.12S.new.tsv
cat mitofish.12S.old.tsv mitofish.12S.new.tsv >mitofish.12S.$month.tsv

If not updating (re-populating an entire database), efetch and RESCRIPt are faster than grep in retrieving metadata for these 12S sequences

efetch code

for i in `cat 12S.seqID.acc`
do
echo $i

# Get gene definition
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_definition >>12S.definition.tsv

# Get taxID
echo "#$i," >>12S.taxid.tsv
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDQualifier -element INSDQualifier_value | grep taxon | sed "s/taxon://" >>12S.taxid.tsv
done

Process efetch output. Did not do uniq because some genomes contain multiple copy of 12S rRNA

sort -k 1 12S.definition.tsv | tr "#" "_" >12S.definition.sorted.tsv
cat 12S.taxid.tsv | tr -d "\n" | tr "#" "\n" | tr "," "\t" | sort -k 1 | uniq | grep -v ^$ >12S.taxid.sorted.tsv

RESCRIPt code. Note that RESCRIPt does not accept duplicated seqID

cat 12S.seqID.acc | sort | uniq >12S.seqID.sorted.acc
echo "id" >id 
split -l 1000 12S.seqID.sorted.acc

for i in `ls xa*`
do
cat id $i >rescript.acc
qiime rescript get-ncbi-data --m-accession-ids-file rescript.acc --o-sequences "$i".seq.qza --o-taxonomy "$i".tax.qza
qiime tools export --input-path "$i".tax.qza --output-path "$i".taxout
done

Convert taxonomy to mitohelper's TSV format

cat *taxout/taxonomy.tsv | grep -v "Feature ID" | sed "s/k__Metazoa/Eukaryota/" \
| sed "s/; [p/c/o/f/g]__/;/g" | tr ";" "\t" \
| awk -F "\t" '{OFS=";"}{print $1,$2,$3,$4,$5,$6,$7,$7,$8}' \
| sed "s/; s__/ /" | tr ";" "\t" | sort -k 1 >12S.mitogenome.taxonomy.sorted.tsv

Check and combine files manually to produce mitofish.12S.extracted.tsv

Get 12S rRNA gene sequences from gene records

Case-insensitive keyword searches for 12S rRNA genes in non-complete mitogenomes from database reference file

grep -i -e "12S ribo" -e "12S rRNA" -e "12S small" -e "12 ribosomal RNA" mitofish.all."$month".tsv >mitofish.12S.partI.tsv
grep -v -f 12S.seqID.sorted.acc mitofish.12S.partI.tsv >mitofish.12S.filtered.tsv

Extract 12S genes from other studies with NCBI records that may not contain the keywords above (e.g. small subunit ribosomal RNA)

12S.unconventional.acc is available in the db.scripts folder of the mitohelper repository and contains 993 records.

First, add the beginning of line indicator ^ in front of each accession number to improve the grep search specificity

sed "s/^/\^/" 12S.unconventional.acc >12S.unconventional.grep

Then, write a script grep.sh to retrieve these unconventional accession numbers from mitohelper.all."$month".tsv. To speed up the process, only the first match -m 1 is reported.

for i in `cat 12S.unconventional.grep`
do
grep -m 1 $i mitofish.all."$month".tsv >>mitofish.12S.unconventional.tsv
grep -v -f 12S.seqID.sorted.acc mitofish.12S.unconventional.tsv >mitofish.12S.unconventional.filtered.tsv
done

Combine files and de-duplicate

cat mitofish.12S.extracted.tsv mitofish.12S.partI.tsv mitofish.12S.unconventional.filtered.tsv | tr -d "\r" | sort | uniq >mitofish.12S.noheader.tsv
cat mitofish.header mitofish.12S.noheader.tsv >mitofish.12S."$month".tsv

Create COI reference file

Extract COI hits and filter out cytochrome b records

grep -i -e "(COI)" -e "CO1" -e "cytochrome c oxidase subunit I " \
-e "cytochrome c oxidase subunit 1" -e "COI " -e "complete genome" -e "complete mito" -e "complete sequence" \
-e "COX1" -e "(COXI)" -e "cytochrome oxidase subunit I " -e "genome assembly" mitofish.all."$month".tsv \
| grep -v -i "cytochrome b" | grep -v -i "cytb"  >mitofish.COI.partI.tsv

Extract COI records that may not contain the keywords above (e.g. partial genomes)

grep -f COI.unconventional.acc mitofish.all."$month".tsv >mitofish.COI.unconventional.tsv

COI.unconventional.acc is available in the db.scripts folder of the mitohelper repository

JN184086
AB282832
KP266818

We found three records so far, and this is by no means a complete list of "unconventional" records.

Note: Other databases such as The Barcode of Life Database can contain additional records not deposited in NCBI.

Check for duplicates, then combine everything

# There should be no output from the grep command, if there is no duplicate
grep -f COI.unconventional.acc mitofish.COI.partI.tsv
cat mitofish.header mitofish.COI.partI.tsv mitofish.COI.unconventional.tsv >mitofish.COI."$month".tsv

To update new COI records, perform the keyword search on new.records.tsv, filter out missing records, then append to mitofish.COI.$month.tsv

grep -i -e "(COI)" -e "CO1" -e "cytochrome c oxidase subunit I " \
-e "cytochrome c oxidase subunit 1" -e "COI " -e "complete genome" -e "complete mito" -e "complete sequence" \
-e "COX1" -e "(COXI)" -e "cytochrome oxidase subunit I " -e "genome assembly" new.records.tsv \
| grep -v -i "cytochrome b" | grep -v -i "cytb"  >mitofish.COI.new.tsv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

7. Create 12S and COI reference files

Get 12S rRNA gene sequences from complete mitogenomes

Get 12S rRNA gene sequences from gene records

Create COI reference file

Clone this wiki locally