Antimicrobial Peptide discovery pipelines typically fall into two major
categories. One approach, is to start with a biological tissue or
secretion with activity against bacteria, perform a fractionation step
and then isolate the active fractions for further fractionation. This
process is continued until a pure fraction is obtained for chemical
analysis. This “bioassay guided fractionation” approach is time
consuming and requires a large amount of starting material. An
alternative, and more recent approach is to start by sequencing the
transcriptome (or genome) of an organism and use in silico analyses to
identify a relatively small number of candidates. These candidate
sequences are then used to synthesise peptides to allow testing of
antibacterial activity. ampir
is designed to fit into this latter type
of pipeline as one of the in silico steps. Recent successful examples
demonstrating the overall principle of this technique include:
- A study by Kim et al 2016 which identified 11 AMPs with strong antibacterial activity in the American cockroach.
- A study by Yoo et al 2014 which identified 10 novel AMPs in the centipede, Scolopendra subspinipes mutilans
Both these studies used a long sequence of in silico steps to go from around 70-80k transcripts in the whole transcriptome down to a short list of around 20 candidate sequences. Key steps include:
- Screening based on physicochemical properties
- Removal of homologs to known AMPs
- Differential expression (e.g. upregulated in response to bacterial challenge).
ampir
comes with two predictive models, one called mature
, which is
designed to work with mature peptides whereas another called precursor
is designed to work with full length precursor sequences as input.
The study by Yoo et al provides mature peptide sequences for the final
candidate set of 18 peptides which were sent for synthesis. Here we
demonstrate the use of ampir
’s mature model to predict whether these
peptides are antimicrobial. Since these 18 candidates have already been
subjected to a comprehensive analysis we do not expect that ampir
will
be able to distinguish between those that passed the final step (assay
of the synthesised product) but we do expect that the majority should be
classified as AMPs by ampir
.
The mature peptides from Yoo et al are provided in the file
centipede.xlsx
which we can read in as follows:
library(tidyverse)
centipede_peps <- readxl::read_excel("raw_data/case_studies/centipede.xlsx")
This dataset has the following information
ESTID | PepID | Seq | Verified |
---|---|---|---|
Unigene34794_All | SmAE_P1 | KKKHRQFLGIRNYYKEFIPNLSDITSPLHVLLKK | verified |
Unigene45899_All | SmAE_P2 | RKLWWRLAAVLTCKSIVKYGHGGERLIEPSSSWFPPKFPSG | unverified |
Unigene30751_All | SmAE_P3 | QDCPGVRDHLVRGALDLGGVANRTTSRSKYGTKKPKK | unverified |
Unigene39330_All | SmAE_P4 | GPGETGKKRKRELVSSQLGHLVGRKKISKALSQKISEIWGD | unverified |
Unigene22235_All | SmAE_P5 | YGGGYGYRRPYYGGYHGGYYRRPYYYGGYYGGGYKYKHWGCRFF | verified |
NA | SmAE_P6 | YGGGYGYRRPYYGGYHGGYYRRPYYYGGYYGGGYKYKHWGCRFF | verified |
To run ampir
we need the Seq
column as well as a column with unique
names for each sequence (we will use PepID
). The data can be prepared
for input to ampir
as follows:
library(ampir)
# Creates a data.frame where the first column is the ID and the second is the sequence
centipede_ampir_in <- centipede_peps %>% select(PepID,Seq)
Now we run predictions using the “mature” model.
# Run the predictions using the "mature" model
centipede_ampir <- predict_amps(centipede_ampir_in, model = "mature")
head(centipede_ampir) %>% knitr::kable()
PepID | Seq | prob_AMP |
---|---|---|
SmAE_P1 | KKKHRQFLGIRNYYKEFIPNLSDITSPLHVLLKK | 0.2861474 |
SmAE_P2 | RKLWWRLAAVLTCKSIVKYGHGGERLIEPSSSWFPPKFPSG | 0.8081439 |
SmAE_P3 | QDCPGVRDHLVRGALDLGGVANRTTSRSKYGTKKPKK | 0.6778932 |
SmAE_P4 | GPGETGKKRKRELVSSQLGHLVGRKKISKALSQKISEIWGD | 0.9585819 |
SmAE_P5 | YGGGYGYRRPYYGGYHGGYYRRPYYYGGYYGGGYKYKHWGCRFF | 0.7131284 |
SmAE_P6 | YGGGYGYRRPYYGGYHGGYYRRPYYYGGYYGGGYKYKHWGCRFF | 0.7131284 |
Note that the outputs from ampir include the original columns of the
input data as well as a new column called prob_AMP
which contains the
probability that the sequence is an AMP.
To look at the overall performance of ampir
we plot the density of
prob_AMP
values along the range from 0 to 1. This shows that although
ampir
correctly predicts almost all peptides are AMPs, it is unable to
tell whether these would pass verification or not.
centipede_verification <- centipede_ampir %>% left_join(centipede_peps)
ggplot(centipede_verification) + geom_density(aes(x=prob_AMP)) + geom_point(aes(y=0,x=prob_AMP,color=Verified)) + xlim(0,1)
A more typical use for ampir
is as an early step in an AMP discovery
pipeline. As an example we consider pipelines for discovery of AMPs from
frogs in the genus Odorrana. Odorrana is a diverse genus of frogs
native to East Asia. Their skin secretions have proven to be a rich
source of novel antimicrobial peptides, many of which have been
synthesised and assayed to demonstrate activity against bacteria. There
are currently 53 reviewed and 95 unreviewed entries in UniProt for
Odorrana and the keyword “Antimicrobial”.
Discovery of AMPs in Odorana exemplifies the potential use case for
ampir. To demonstrate its use we obtained complete transcriptome
sequences for Odorrana margaretae from supplementary material provided
in Qiao et al
2013.
Unfortunately none of these sequences have confirmed antibacterial
activity (due to lack of testing) but since Odorrana is a diverse
genus we were able to identify 10 O. margaretae sequences that are
close homologs of AMPs from other Odorrana species. Given that there
are at least 100 AMPs in the Human proteome and over 200 in Arabidopsis
thaliana this is almost certainly a vast underestimate of the true
number of AMPs in O. margaretae. For our purposes it is sufficient to
demonstrate the utility of ampir
as an early step in an AMP discovery
pipeline.
Required Data: The code examples shown below assume that you have
downloaded and unpacked the AMP_pub
data package.
The following two command-line steps are not shown for brevity:
- Predicting protein sequences from the O. margaretae transcriptome 01_transdecoder.sh
- Detecting Odorrana AMP homologs in O. margaretae 02_blastmatch.sh
Firstly we identify the set of known AMPs in O. margaretae by importing BLAST results showing matches between O. margaretae and known Odorrana AMPs.
blast_colnames <- c('qaccver','saccver','pident','length','mismatch','gapopen','qstart','qend','sstart','send','evalue','bitscore')
omarg_amps_blast <- read_tsv("raw_data/case_studies/o_margaretae_uniprot.blastp", col_names = blast_colnames) %>%
group_by(qaccver) %>% top_n(1,bitscore)
omarg_amp_homologs <- omarg_amps_blast$qaccver
omarg_amp_homologs
## [1] "odorranaTranscripts_10060_347.p1" "odorranaTranscripts_11116_407.p1"
## [3] "odorranaTranscripts_13513_363.p1" "odorranaTranscripts_13966_372.p1"
## [5] "odorranaTranscripts_27291_286.p1" "odorranaTranscripts_28862_308.p1"
## [7] "odorranaTranscripts_36173_329.p1" "odorranaTranscripts_36173_329.p1"
## [9] "odorranaTranscripts_36742_286.p1" "odorranaTranscripts_37089_292.p1"
Next we use the read_faa()
function from ampir
to read the entire
O. margaretae transcriptome.
omarg_aa <- read_faa("raw_data/case_studies/o_margaretae_aa.fasta")
# Since these are raw outputs from Trinity we also use the following step to clean their sequence names
omarg_aa <- omarg_aa %>% mutate(seq_name = str_match(seq_name,"^[^\\ ]*")[,1])
head(omarg_aa) %>% knitr::kable()
seq_name | seq_aa |
---|---|
odorranaTranscripts_10000_515.p1 | MEAEQSTEGGQSQVSENPHSEYGLTDNVERIVENEKASTEKASKQKVDLQALPTRAYLDQTVVPILLQGLSVLAKERPPNPIEFLASYLLKNKAQFEDRS |
odorranaTranscripts_10002_1290.p1 | MDAPWELAGGLCCTHRGIITDGNNSPGAKLGINFDGNRSLMAWLWRPWLVIGSDVFGQGIVTTRDCSTSPIQPHVIRGMVDRWDLGAREVVVIFGYLAV |
odorranaTranscripts_10003_792.p2 | MTVNITQLLAPVEVELSAINPSPPILFTIKKPPPPIIPFFRQFCLPVQKC |
odorranaTranscripts_10004_2529.p1 | MQLIGPGCRLPAGANILKERRQETVHFAKLIFSWLVNKMKLCSLSLPSFYAHDWMSKVNTVSPFH |
odorranaTranscripts_10004_2529.p2 | MQFDPEVARPQPSLCFEVHSHLQIHPYGDYMQEPQWGGVRCPRLAATGSPIESTGRMLAPFS |
odorranaTranscripts_10005_1223.p1 | SSEEVLNLRFPLHRACRDGDLHCLRSLLQGPPGNTAQQLGKEDSCYGWTPIHWAAHFGKLECLVQLVGAGCSVNASTTRFAQTPAHIAAFGGHPHCLMWLIQAGANVNKQDYVGETPVHKAARAGSLDCLKALVSNGGQIDFKNASGLTAADLAHSQGFTECAQFLLNLQNSQLNGFYCTNSLNGVPQNASSHLNGGINHRKRSFHDLEASVVKKLRTEMNSFGTEYMTGRLEEDDSMHVENHSDISGGIGQFSDRCCNIPMIEDLKPQESTSNRPTDANNFTVSATQTPFRCISQYAYF |
Now we use the predict_amps()
function in ampir
to predict AMPs.
Since these are full length proteins we use the “precursor” model and
since there are a large number of sequences we set n_cores
to the
number of cores on our desktop machine. This step takes around 20
seconds on a 2018 Mac Mini.
omarg_aa_ampir <- predict_amps(omarg_aa,n_cores=6, model = "precursor")
Finally we plot the distribution of prob_AMP
scores for the whole
transcriptome and for AMP homologs. This plot suggests that by using
ampir
as an initial pipeline step with a cut-off (prob_AMP
>0.5)
the search space for AMPs can be reduced from around 77,000 to just
5,000. With so few verified AMPs for this organism it is difficult to
determine whether a higher cut-off (e.g. prob_AMP
>0.9) could be used
to provide greater specificity.
omarg_aa_ampir_plot <- omarg_aa_ampir %>%
mutate(amp = ifelse(seq_name %in% omarg_amp_homologs,"AMP Homolog","Non-AMP")) %>%
add_count(amp)
ggplot(omarg_aa_ampir_plot,aes(x=amp,y=prob_AMP)) +
geom_violin(aes(color=amp)) + xlab("")