-
Notifications
You must be signed in to change notification settings - Fork 8
7. Create 12S and COI reference files
The algorithm for extracting 12S rRNA records has been updated in the Nov2022 version to be more specific to genomes with 12S rRNA gene annotations. Unannotated mitochondrial genomes are now excluded from mitohelper's reference files. Sequences of the 12S rRNA gene are now extracted from complete mitochondrial genomes to construct a more gene-specific 12S rRNA dataset.
- Get list of ribosomal and complete mitogenome accession numbers from current mitohelper release
mitofish.all.$month.tsv
(if re-populating entire database) or fromnew.definition.tsv
ornew.records.tsv
(if updating current database)
cat new.definition.tsv | grep -e ribosomal -e rRNA -e 12S -e genome -e "complete mito" -e "DNA, complete sequence" \
| grep -v UNVERIFIED | grep -v "large subunit ribosomal" | cut -f1 >ribosomal.new.acc
- From a list of partial+complete mitogenome accession numbers, we use efetch to retrieve the annotations and sequences in each mitogenome
for i in `cat ribosomal.new.acc`
do
echo $i
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDQualifier -element INSDQualifier_value >"$i".features
done
- Example truncated output (e.g.
MN394789.features
)
Coregonus migratorius
mitochondrion
genomic DNA
Kul52
taxon:884168
tRNA-Phe
GCTGGCGTAGCTTAACTAAAGCATAACACTGAAGCTGTTAAGATGGACCCTAGAAAGTCCCGCAGGCA
12S ribosomal RNA
CAAAGGCTTGGTCCTGACTTTATTATCAGCTTTAACTGAACTTACA
tRNA-Val
CAGAGTGTAGCTAAGACAGAAGAGCACCTCCCTTACACCGAGAAGACATCCGTGCAAATCGGGTCACCCTGA
- Example 12S rRNA gene annotations. Note the varying line numbers and format.
12S ribosomal RNA
CAAAGGCTTGGTCCTGACTTTACT
12S ribosomal RNA
putative
AAAAGATTTAATCCTAACCTTTCTATCAGCTCTAACTCAAACTACACATG
12S ribosomal RNA
non-experimental evidence, no additional details recorded
CAAAGGCTTGGTCCTGACTTTACTGTCAACTTTAGCTAGACTT
12S ribosomal RNA
rrnS
CAAAGGCTTGGTCCTGACTTTACTATCAGCTTTAACCCAATTT
12S ribosomal RNA
SAAAGGCCTGGTCCTGACTTTTTTGTCGGCTTTTACTATACTTATACATGCAAGCA\
12S ribosomal RNA
putative
GeneID:64984682
CAAAGGCTTGGTCCTGACTTTACTATCAGCTT
- Extract 12S rRNA gene sequences from mitogenome annotations
grep -A 3 -e "12S" -e "12 ribo" -e "12 rRNA" -e "small subunit ribo" -e "small ribo" -e "rrnS" *features | \
sed "s/^.*features-[ATGCN][ATGCN][ATGCN][ATGCN][ATGCN]/#&/" | \
grep -B 1 "#" | sed "s/#.*-/#/" | sed s"/^.*features/>&/" | grep -v "\-\-" | tr -d "\n" | tr ">" "\n" | \
grep -v tRNA | grep -v 16S | sed "s/^/#>/" | tr "#" "\n" | sed "s/.features.*$//" | grep -v ^$ | \
grep -v "^>$" >12S.mitogenome.fasta
- Check how many genomes do not contain the 12S rRNA gene. Review and update manually as some genomes contain non ATCGN sequence characters, e.g. S in KJ433564:
ls *.features | cut -d "." -f1 | sort | uniq >12S.dedup.acc
grep ">" 12S.mitogenome.fasta| tr -d ">" | sort | uniq >12S.grep.acc
grep -v -f 12S.grep.acc 12S.dedup.acc >12S.nohit.acc
-
Some genomes are complete but do not contain any annotations (e.g. OD909576). These will not be included in the 12S reference files.
-
Manually check duplicated records to see if they contain multiple copies of 12S rRNA genes, or if they include 16S rRNA gene annotated similarly to 12S genes
-
Retrieve records for processed genomes
Reformat fasta files to tab-separated format
cat 12S.mitogenome.fasta | sed "s/>.*$/&#/" | tr -d "\n" | tr "#" "\t" | tr ">" "\n" | grep -v ^$ | sort -k 1 >12S.seq.sorted.tsv
Get sequence IDs. Considering that some mitogenomes have >1 12S rRNA genes, extract the fasta accession numbers without deduplication
grep ">" 12S.mitogenome.fasta | cut -d ">" -f2 | sort >12S.seqID.acc
If updating, use
grep
to retrieve metadata for these 12S records.grep
is performed line by line because some records have multiple copies of the 12S rRNA gene:
for i in `cat 12S.seqID.acc`
do
grep -m 1 "$i" new.records.tsv >>new.12S.tsv
done
paste -d "\t" 12S.seq.sorted.tsv new.12S.tsv | awk -F "\t" '{OFS="\t"}{print $1,$4,$5,$6,$7,$8,$9,$10,$11,$12,$2,$14,$15}' >mitofish.12S.new.tsv
cat mitofish.12S.old.tsv mitofish.12S.new.tsv >mitofish.12S.$month.tsv
If not updating (re-populating an entire database), efetch and RESCRIPt are faster than grep
in retrieving metadata for these 12S sequences
efetch
code
for i in `cat 12S.seqID.acc`
do
echo $i
# Get gene definition
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDSeq -element INSDSeq_primary-accession INSDSeq_definition >>12S.definition.tsv
# Get taxID
echo "#$i," >>12S.taxid.tsv
efetch -db nuccore -id $i -format gpc | xtract -pattern INSDQualifier -element INSDQualifier_value | grep taxon | sed "s/taxon://" >>12S.taxid.tsv
done
Process efetch output. Did not do uniq
because some genomes contain multiple copy of 12S rRNA
sort -k 1 12S.definition.tsv | tr "#" "_" >12S.definition.sorted.tsv
cat 12S.taxid.tsv | tr -d "\n" | tr "#" "\n" | tr "," "\t" | sort -k 1 | uniq | grep -v ^$ >12S.taxid.sorted.tsv
RESCRIPt code. Note that RESCRIPt does not accept duplicated seqID
cat 12S.seqID.acc | sort | uniq >12S.seqID.sorted.acc
echo "id" >id
split -l 1000 12S.seqID.sorted.acc
for i in `ls xa*`
do
cat id $i >rescript.acc
qiime rescript get-ncbi-data --m-accession-ids-file rescript.acc --o-sequences "$i".seq.qza --o-taxonomy "$i".tax.qza
qiime tools export --input-path "$i".tax.qza --output-path "$i".taxout
done
- Convert taxonomy to mitohelper's TSV format
cat *taxout/taxonomy.tsv | grep -v "Feature ID" | sed "s/k__Metazoa/Eukaryota/" \
| sed "s/; [p/c/o/f/g]__/;/g" | tr ";" "\t" \
| awk -F "\t" '{OFS=";"}{print $1,$2,$3,$4,$5,$6,$7,$7,$8}' \
| sed "s/; s__/ /" | tr ";" "\t" | sort -k 1 >12S.mitogenome.taxonomy.sorted.tsv
- Check and combine files manually to produce
mitofish.12S.extracted.tsv
- Case-insensitive keyword searches for 12S rRNA genes in non-complete mitogenomes from database reference file
grep -i -e "12S ribo" -e "12S rRNA" -e "12S small" -e "12 ribosomal RNA" mitofish.all."$month".tsv >mitofish.12S.partI.tsv
grep -v -f 12S.seqID.sorted.acc mitofish.12S.partI.tsv >mitofish.12S.filtered.tsv
- Extract 12S genes from other studies with NCBI records that may not contain the keywords above (e.g. small subunit ribosomal RNA)
12S.unconventional.acc
is available in thedb.scripts
folder of the mitohelper repository and contains 993 records.
First, add the beginning of line indicator ^
in front of each accession number to improve the grep search specificity
sed "s/^/\^/" 12S.unconventional.acc >12S.unconventional.grep
Then, write a script grep.sh
to retrieve these unconventional accession numbers from mitohelper.all."$month".tsv
. To speed up the process, only the first match -m 1
is reported.
for i in `cat 12S.unconventional.grep`
do
grep -m 1 $i mitofish.all."$month".tsv >>mitofish.12S.unconventional.tsv
grep -v -f 12S.seqID.sorted.acc mitofish.12S.unconventional.tsv >mitofish.12S.unconventional.filtered.tsv
done
- Combine files and de-duplicate
cat mitofish.12S.extracted.tsv mitofish.12S.partI.tsv mitofish.12S.unconventional.filtered.tsv | tr -d "\r" | sort | uniq >mitofish.12S.noheader.tsv
cat mitofish.header mitofish.12S.noheader.tsv >mitofish.12S."$month".tsv
- Extract COI hits and filter out cytochrome b and false positive records
grep -i -e "(COI)" -e "CO1" -e "cytochrome c oxidase subunit I " \
-e "cytochrome c oxidase subunit 1" -e "COI " -e "complete genome" -e "complete mito" -e "mitochondrial DNA, complete" \
-e "COX1" -e "(COXI)" -e "cytochrome oxidase subunit I " -e "genome assembly" mitofish.all."$month".tsv \
| grep -v -i "cytochrome b" | grep -v -i "cytb" >mitofish.COI.candidate.tsv
- Filter out and inspect false positive hits that contain the keyword
complete sequence
but not COI. The stringsCOI
orCO1
could be in strain/isolate names - e.g.Mexico1
,Acanthobunocephalus nicoi
,Bryconamericus plutarcoi
. Remove records that are true positives for the next step.
cat mitofish.COI.candidate.tsv | grep -v cyto | grep -v genome | grep -v COI | grep -v "mitochondrial DNA, complete" >mitofish.nonCOI.tsv
- Remove false positive hits
cut -f1 mitofish.nonCOI.tsv >mitofish.nonCOI.acc
grep -v -f mitofish.nonCOI.acc mitofish.COI.candidate.tsv >mitofish.COI.partI.tsv
- Extract COI records that may not contain the keywords above (e.g. partial genomes)
grep -f COI.unconventional.acc mitofish.all."$month".tsv >mitofish.COI.unconventional.tsv
-
COI.unconventional.acc
is available in thedb.scripts
folder of the mitohelper repository
JN184086
AB282832
KP266818
We found three records so far, and this is by no means a complete list of "unconventional" records.
Note: Other databases such as The Barcode of Life Database can contain additional records not deposited in NCBI.
- Check for duplicates, then combine everything
# There should be no output from the grep command, if there is no duplicate
grep -f COI.unconventional.acc mitofish.COI.partI.tsv
cat mitofish.header mitofish.COI.partI.tsv mitofish.COI.unconventional.tsv >mitofish.COI."$month".tsv
- To update new COI records, perform the keyword search on
new.records.tsv
, filter out false positive records, then append tomitofish.COI.$month.tsv