Skip to content

7. Create 12S and COI reference files

shenjean edited this page May 11, 2023 · 152 revisions

The algorithm for extracting 12S rRNA records has been updated in the Nov2022 version to be more specific to genomes with 12S rRNA gene annotations. Unannotated mitochondrial genomes are now excluded from mitohelper's reference files. Sequences of the 12S rRNA gene are now extracted from complete mitochondrial genomes to construct a more gene-specific 12S rRNA dataset.

Get 12S rRNA gene sequences from complete mitogenomes

  • Get list of ribosomal and complete mitogenome accession numbers from current mitohelper release mitofish.all.$month.tsv (if re-populating entire database) or from new.records.tsv (if updating current database)
cat mitofish.all."$month".tsv  | grep -e ribosomal -e rRNA -e 12S | cut -f1 | sort | uniq >ribosomal.new.acc

grep -e "genome" -e "complete mito" -e "DNA, complete sequence" \
mitofish.all."$month".tsv | cut -f1 | sort | uniq >complete.new.mitogenome.acc

cat ribosomal.new.acc complete.new.mitogenome.acc | sort | uniq >ribosomal.complete.acc
  • From a list of complete mitogenome accession numbers, we use efetch to retrieve the annotations and sequences in each mitogenome
for i in `cat ribosomal.complete.acc`
do
echo $i
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDQualifier -element INSDQualifier_value >"$i".features
done
  • Example truncated output (e.g. MN394789.features)
Coregonus migratorius
mitochondrion
genomic DNA
Kul52
taxon:884168
tRNA-Phe
GCTGGCGTAGCTTAACTAAAGCATAACACTGAAGCTGTTAAGATGGACCCTAGAAAGTCCCGCAGGCA
12S ribosomal RNA
CAAAGGCTTGGTCCTGACTTTATTATCAGCTTTAACTGAACTTACA
tRNA-Val
CAGAGTGTAGCTAAGACAGAAGAGCACCTCCCTTACACCGAGAAGACATCCGTGCAAATCGGGTCACCCTGA
  • Example 12S rRNA gene annotations. Note the varying line numbers and format.
12S ribosomal RNA
CAAAGGCTTGGTCCTGACTTTACT
12S ribosomal RNA
putative
AAAAGATTTAATCCTAACCTTTCTATCAGCTCTAACTCAAACTACACATG
12S ribosomal RNA
non-experimental evidence, no additional details recorded
CAAAGGCTTGGTCCTGACTTTACTGTCAACTTTAGCTAGACTT
12S ribosomal RNA
rrnS
CAAAGGCTTGGTCCTGACTTTACTATCAGCTTTAACCCAATTT
12S ribosomal RNA
SAAAGGCCTGGTCCTGACTTTTTTGTCGGCTTTTACTATACTTATACATGCAAGCA\
12S ribosomal RNA
putative
GeneID:64984682
CAAAGGCTTGGTCCTGACTTTACTATCAGCTT
  • Extract 12S rRNA gene sequences from mitogenome annotations
grep -A 3 -e "12S" -e "s-rRNA" -e "12 ribo" -e "12 rRNA" -e "small subunit ribo" -e "small ribo" -e "rrnS" *features | \
sed "s/^.*features-[ATGCN][ATGCN][ATGCN][ATGCN][ATGCN]/#&/" | \
grep -B 1 "#" | sed "s/#.*-/#/" | sed s"/^.*features/>&/"  | grep -v "\-\-" | tr -d "\n" | tr ">" "\n" | \
grep -v tRNA | sed "s/^/#>/" | tr "#" "\n" | sed "s/.features.*$//" | grep -v ^$ | \
grep -v "^>$" >12S.mitogenome.fasta
  • Check how many genomes do not contain the 12S rRNA gene. Review and update manually as some genomes contain non ATCGN sequence characters, e.g. S in KJ433564:
ls *.features | cut -d "." -f1 | sort | uniq >12S.dedup.acc
grep ">" 12S.mitogenome.fasta| tr -d ">" | sort | uniq >12S.grep.acc
grep -v -f 12S.grep.acc 12S.dedup.acc >12S.nohit.acc
  • Some genomes are complete but do not contain any annotations (e.g. OD909576). These will not be included in the 12S reference files.

  • Retrieve records for processed genomes

Reformat fasta files to tab-separated format

cat 12S.mitogenome.fasta | sed "s/>.*$/&#/" | tr -d "\n" | tr "#" "\t" | tr ">" "\n" | grep -v ^$ | sort -k 1 >12S.seq.sorted.tsv

Get sequence IDs. Considering that some mitogenomes have >12S rRNA genes, extract the fasta accession numbers without deduplication

grep ">" 12S.mitogenome.fasta | cut -d ">" -f2 | sort >12S.seqID.acc

If updating, use grep to retrieve metadata for these 12S records. grep is performed line by line because some records have multiple copies of the 12S rRNA gene:

for i in `cat 12S.seqID.acc`
do
grep -m 1 "$i" new.records.tsv >>new.12S.tsv
done
paste -d "\t" 12S.seq.sorted.tsv new.12S.tsv  | awk -F "\t" '{OFS="\t"}{print $1,$4,$5,$6,$7,$8,$9,$10,$11,$12,$2,$14,$15}' >mitofish.12S.new.tsv
cat mitofish.12S.old.tsv mitofish.12S.new.tsv >mitofish.12S.$month.tsv

If not updating (re-populating an entire database), efetch and RESCRIPt are faster than grep in retrieving metadata for these 12S sequences

efetch code

for i in `cat 12S.seqID.acc`
do
echo $i

# Get gene definition
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_definition >>12S.definition.tsv

# Get taxID
echo "#$i," >>12S.taxid.tsv
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDQualifier -element INSDQualifier_value | grep taxon | sed "s/taxon://" >>12S.taxid.tsv
done

Process efetch output. Did not do uniq because some genomes contain multiple copy of 12S rRNA

sort -k 1 12S.definition.tsv | tr "#" "_" >12S.definition.sorted.tsv
cat 12S.taxid.tsv | tr -d "\n" | tr "#" "\n" | tr "," "\t" | sort -k 1 | uniq | grep -v ^$ >12S.taxid.sorted.tsv

RESCRIPt code. Note that RESCRIPt does not accept duplicated seqID

cat 12S.seqID.acc | sort | uniq >12S.seqID.sorted.acc
echo "id" >id 
split -l 1000 12S.seqID.sorted.acc

for i in `ls xa*`
do
cat id $i >rescript.acc
qiime rescript get-ncbi-data --m-accession-ids-file rescript.acc --o-sequences "$i".seq.qza --o-taxonomy "$i".tax.qza
qiime tools export --input-path "$i".tax.qza --output-path "$i".taxout
done
  • Convert taxonomy to mitohelper's TSV format
cat *taxout/taxonomy.tsv | grep -v "Feature ID" | sed "s/k__Metazoa/Eukaryota/" \
| sed "s/; [p/c/o/f/g]__/;/g" | tr ";" "\t" \
| awk -F "\t" '{OFS=";"}{print $1,$2,$3,$4,$5,$6,$7,$7,$8}' \
| sed "s/; s__/ /" | tr ";" "\t" | sort -k 1 >12S.mitogenome.taxonomy.sorted.tsv
  • Check and combine files manually to produce mitofish.12S.extracted.tsv

Get 12S rRNA gene sequences from gene records

  • Case-insensitive keyword searches for 12S rRNA genes in non-complete mitogenomes from database reference file
grep -i -e "12S ribo" -e "12S rRNA" -e "12S small" -e "12 ribosomal RNA" mitofish.all."$month".tsv >mitofish.12S.partI.tsv
grep -v -f 12S.seqID.sorted.acc mitofish.12S.partI.tsv >mitofish.12S.filtered.tsv
  • Extract 12S genes from other studies with NCBI records that may not contain the keywords above (e.g. small subunit ribosomal RNA)

12S.unconventional.acc is available in the db.scripts folder of the mitohelper repository and contains 993 records.

First, add the beginning of line indicator ^ in front of each accession number to improve the grep search specificity

sed "s/^/\^/" 12S.unconventional.acc >12S.unconventional.grep 

Then, write a script grep.sh to retrieve these unconventional accession numbers from mitohelper.all."$month".tsv. To speed up the process, only the first match -m 1 is reported.

for i in `cat 12S.unconventional.grep`
do
grep -m 1 $i mitofish.all."$month".tsv >>mitofish.12S.unconventional.tsv
grep -v -f 12S.seqID.sorted.acc mitofish.12S.unconventional.tsv >mitofish.12S.unconventional.filtered.tsv
done
  • Combine files and de-duplicate
cat mitofish.12S.extracted.tsv mitofish.12S.partI.tsv mitofish.12S.unconventional.filtered.tsv | tr -d "\r" | sort | uniq >mitofish.12S.noheader.tsv
cat mitofish.header mitofish.12S.noheader.tsv >mitofish.12S."$month".tsv

Create COI reference file

  • Extract COI hits and filter out cytochrome b records
grep -i -e "(COI)" -e "CO1" -e "cytochrome c oxidase subunit I " \
-e "cytochrome c oxidase subunit 1" -e "COI " -e "complete genome" -e "complete mito" -e "complete sequence" \
-e "COX1" -e "(COXI)" -e "cytochrome oxidase subunit I " -e "genome assembly" mitofish.all."$month".tsv \
| grep -v -i "cytochrome b" | grep -v -i "cytb"  >mitofish.COI.partI.tsv
  • Extract COI records that may not contain the keywords above (e.g. partial genomes)
grep -f COI.unconventional.acc mitofish.all."$month".tsv >mitofish.COI.unconventional.tsv
  • COI.unconventional.acc is available in the db.scripts folder of the mitohelper repository
JN184086
AB282832
KP266818

We found three records so far, and this is by no means a complete list of "unconventional" records.

Note: Other databases such as The Barcode of Life Database can contain additional records not deposited in NCBI.

  • Check for duplicates, then combine everything
# There should be no output from the grep command, if there is no duplicate
grep -f COI.unconventional.acc mitofish.COI.partI.tsv
cat mitofish.header mitofish.COI.partI.tsv mitofish.COI.unconventional.tsv >mitofish.COI."$month".tsv
  • To update new COI records, perform the keyword search on new.records.tsv, filter out missing records, then append to mitofish.COI.$month.tsv
grep -i -e "(COI)" -e "CO1" -e "cytochrome c oxidase subunit I " \
-e "cytochrome c oxidase subunit 1" -e "COI " -e "complete genome" -e "complete mito" -e "complete sequence" \
-e "COX1" -e "(COXI)" -e "cytochrome oxidase subunit I " -e "genome assembly" new.records.tsv \
| grep -v -i "cytochrome b" | grep -v -i "cytb"  >mitofish.COI.new.tsv