-
Notifications
You must be signed in to change notification settings - Fork 8
7. Create 12S and COI reference files
The algorithm for extracting 12S rRNA records has been updated in the Nov2022 version to be more specific to genomes with 12S rRNA gene annotations. Unannotated mitochondrial genomes are now excluded from mitohelper's reference files. Sequences of the 12S rRNA gene are now extracted from complete mitochondrial genomes to construct a more gene-specific 12S rRNA dataset.
- If updating, extract 12S rRNA accession numbers in previous release
cut -f1 mitofish.12S.Nov2022.tsv >12S.old.acc
- Get list of complete mitogenome accession numbers from current mitohelper release or
new.records.tsv
cat mitofish.all."$month".tsv | grep -e 12S -e ribosomal -e rRNA | cut -f1 | sort | uniq >12S.new.mitogenome.acc
grep -e "genome" -e "complete mito" -e "DNA, complete sequence" \
mitofish.all."$month".tsv | cut -f1 | sort | uniq >complete.new.mitogenome.acc
cat 12S.new.mitogenome.acc complete.new.mitogenome.acc | sort | uniq >12S.new.all.acc
- From a list of complete mitogenome accession numbers, we use efetch to retrieve the annotations and sequences in each mitogenome
for i in `cat 12S.toextract.acc`
do
echo $i
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDQualifier -element INSDQualifier_value >"$i".features
done
- Example truncated output (e.g.
MN394789.features
)
Coregonus migratorius
mitochondrion
genomic DNA
Kul52
taxon:884168
tRNA-Phe
GCTGGCGTAGCTTAACTAAAGCATAACACTGAAGCTGTTAAGATGGACCCTAGAAAGTCCCGCAGGCA
12S ribosomal RNA
CAAAGGCTTGGTCCTGACTTTATTATCAGCTTTAACTGAACTTACA
tRNA-Val
CAGAGTGTAGCTAAGACAGAAGAGCACCTCCCTTACACCGAGAAGACATCCGTGCAAATCGGGTCACCCTGA
- Example 12S rRNA gene annotations. Note the varying line numbers and format.
12S ribosomal RNA
CAAAGGCTTGGTCCTGACTTTACT
12S ribosomal RNA
putative
AAAAGATTTAATCCTAACCTTTCTATCAGCTCTAACTCAAACTACACATG
12S ribosomal RNA
non-experimental evidence, no additional details recorded
CAAAGGCTTGGTCCTGACTTTACTGTCAACTTTAGCTAGACTT
12S ribosomal RNA
rrnS
CAAAGGCTTGGTCCTGACTTTACTATCAGCTTTAACCCAATTT
12S ribosomal RNA
SAAAGGCCTGGTCCTGACTTTTTTGTCGGCTTTTACTATACTTATACATGCAAGCA\
12S ribosomal RNA
putative
GeneID:64984682
CAAAGGCTTGGTCCTGACTTTACTATCAGCTT
- Extract 12S rRNA gene sequences from mitogenome annotations
grep -A 3 -e "12S" -e "s-rRNA" -e "12 ribo" -e "12 rRNA" -e "small subunit ribo" -e "small ribo" -e "rrnS" *features | \
sed "s/^.*features-[ATGCN][ATGCN][ATGCN][ATGCN][ATGCN]/#&/" | \
grep -B 1 "#" | sed "s/#.*-/#/" | sed s"/^.*features/>&/" | grep -v "\-\-" | tr -d "\n" | tr ">" "\n" | \
grep -v tRNA | sed "s/^/#>/" | tr "#" "\n" | sed "s/.features.*$//" | grep -v ^$ | \
grep -v "^>$" >12S.mitogenome.fasta
- Check how many genomes do not contain the 12S rRNA gene. Review and update manually as some genomes contain non ATCGN sequence characters, e.g. S in KJ433564:
ls *.features | cut -d "." -f1 | sort | uniq >12S.dedup.acc
grep ">" 12S.mitogenome.fasta| tr -d ">" | sort | uniq >12S.grep.acc
grep -v -f 12S.grep.acc 12S.dedup.acc >12S.nohit.acc
-
Some genomes are complete but do not contain any annotations (e.g. OD909576). These will not be included in the 12S reference files.
-
Retrieve records for processed genomes
Reformat fasta files to tab-separated format
cat 12S.mitogenome.fasta 12S.others.fasta | sed "s/>.*$/&#/" | tr -d "\n" | tr "#" "\t" | tr ">" "\n" | grep -v ^$ >12S.seq.sorted.tsv
``
Get sequence IDs. Considering that some mitogenomes have >12S rRNA genes, extract the fasta accession numbers without deduplication
```sh
grep ">" *fasta | cut -d ">" -f2 >12S.seqID.acc
efetch and RESCRIPt are faster than grep
in retrieving metadata for these 12S sequences
efetch
code
for i in `cat 12S.seqID.acc`
do
echo $i
# Get gene definition
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_definition >>12S.definition.tsv
# Get taxID
echo "#$i," >>12S.taxid.tsv
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDQualifier -element INSDQualifier_value | grep taxon | sed "s/taxon://" >>12S.taxid.tsv
done
Process efetch output. Did not do uniq
because some genomes contain multiple copy of 12S rRNA
sort -k 1 12S.definition.tsv | tr "#" "_" >12S.definition.sorted.tsv
cat 12S.taxid.tsv | tr -d "\n" | tr "#" "\n" | tr "," "\t" | sort -k 1 | uniq | grep -v ^$ >12S.taxid.sorted.tsv
RESCRIPt code. Note that RESCRIPt does not accept duplicated seqID
cat 12S.seqID.acc | sort | uniq >12S.seqID.sorted.acc
echo "id" >id
split -l 1000 12S.seqID.sorted.acc
for i in `ls xa*`
do
cat id $i >rescript.acc
qiime rescript get-ncbi-data --m-accession-ids-file rescript.acc --o-sequences "$i".seq.qza --o-taxonomy "$i".tax.qza
qiime tools export --input-path "$i".tax.qza --output-path "$i".taxout
done
- Convert taxonomy to mitohelper's TSV format
cat *taxout/taxonomy.tsv | grep -v "Feature ID" | sed "s/k__Metazoa/Eukaryota/" \
| sed "s/; [p/c/o/f/g]__/;/g" | tr ";" "\t" \
| awk -F "\t" '{OFS=";"}{print $1,$2,$3,$4,$5,$6,$7,$7,$8}' \
| sed "s/; s__/ /" | tr ";" "\t" | sort -k 1 >12S.mitogenome.taxonomy.sorted.tsv
- Check and combine files manually to produce
mitofish.12S.extracted.tsv
- Case-insensitive keyword searches for 12S rRNA genes in non-complete mitogenomes from database reference file
grep -i -e "12S ribo" -e "12S rRNA" -e "12S small" -e "12 ribosomal RNA" mitofish.all."$month".tsv >mitofish.12S.partI.tsv
grep -v -f 12S.seqID.sorted.acc mitofish.12S.partI.tsv >mitofish.12S.filtered.tsv
- Extract 12S genes from other studies with NCBI records that may not contain the keywords above (e.g. small subunit ribosomal RNA)
12S.unconventional.acc
is available in thedb.scripts
folder of the mitohelper repository and contains 993 records.
First, add the beginning of line indicator ^
in front of each accession number to improve the grep search specificity
sed "s/^/\^/" 12S.unconventional.acc >12S.unconventional.grep
Then, write a script grep.sh
to retrieve these unconventional accession numbers from mitohelper.all."$month".tsv
. To speed up the process, only the first match -m 1
is reported.
for i in `cat 12S.unconventional.grep`
do
grep -m 1 $i mitofish.all."$month".tsv >>mitofish.12S.unconventional.tsv
grep -v -f 12S.seqID.sorted.acc mitofish.12S.unconventional.tsv >mitofish.12S.unconventional.filtered.tsv
done
- Combine files and de-duplicate
cat mitofish.12S.extracted.tsv mitofish.12S.partI.tsv mitofish.12S.unconventional.filtered.tsv | tr -d "\r" | sort | uniq >mitofish.12S.noheader.tsv
cat mitofish.header mitofish.12S.noheader.tsv >mitofish.12S."$month".tsv
- Extract COI hits and filter out cytochrome b records
grep -i -e "(COI)" -e "CO1" -e "cytochrome c oxidase subunit I " \
-e "cytochrome c oxidase subunit 1" -e "COI " -e "complete genome" -e "complete mito" -e "complete sequence" \
-e "COX1" -e "(COXI)" -e "cytochrome oxidase subunit I " -e "genome assembly" mitofish.all."$month".tsv \
| grep -v -i "cytochrome b" | grep -v -i "cytb" >mitofish.COI.partI.tsv
- Extract COI records that may not contain the keywords above (e.g. partial genomes)
grep -f COI.unconventional.acc mitofish.all."$month".tsv >mitofish.COI.unconventional.tsv
-
COI.unconventional.acc
is available in thedb.scripts
folder of the mitohelper repository
JN184086
AB282832
KP266818
We found three records so far, and this is by no means a complete list of "unconventional" records.
Note: Other databases such as The Barcode of Life Database can contain additional records not deposited in NCBI.
- Check for duplicates, then combine everything
# There should be no output from the grep command, if there is no duplicate
grep -f COI.unconventional.acc mitofish.COI.partI.tsv
cat mitofish.header mitofish.COI.partI.tsv mitofish.COI.unconventional.tsv >mitofish.COI."$month".tsv