This R
package provides functionality to annotate transcripts with genetic variants that introduce premature termination codons (stop codons) with predicted escape from NMD. More details can be found in the package documentation, and in our preprint here.
The aenmd
package depends on transcript annotations that are provided via data packages that can be found in the aenmd_data repository on GitHub. Further on, a command line interface/utility (aenmd_cli) is available on GitHub as well.
Below we have code for installing and getting started with aenmd
. Since installation of Github packages can be somewhat "interesting" we also provide a notebook on Google colab, so that users can install aenmd
in a controlled environment. You can get started by following this link: getting started on Colab.
#- need the remotes package
if(!require(remotes)){
install.packages("remotes")
}
#- install data package (current default is ENSEMBL v 105)
remotes::install_github("kostkalab/aenmd_data/aenmd.data.ensdb.v105")
#- install aenmd package
remotes::install_github("kostkalab/aenmd")
Sometimes installation of aenmd
can fail. If that happens, the following might help:
-
R
version. Check that the version ofR
you are using is at least4.1.0
. You can do so by running the commandR.Version()
-
Trouble installing Bioconductor dependencies. If Bioconductor dependencies fail to install, it might help to (a) check/update your Bioconductor versiona and (b) install offending packages with
BiocManager::install()
. The following commands might be helpful (also see the Bioconcductor installation instructions):# 1. UPDATE BIOCONDUCTOR # #- This will (amongst other things) identify out-of-date packages. #- Run the suggested command to update your installation > BiocManager::valid() # ... # #create a valid installation with # #BiocManager::install(c( # "BiocManager", "broom", "bslib", "cpp11", "curl", #"dbplyr", "digest", # "downlit", "emdbook", "emmeans", "future", "gargle", #"ggfun", "ggplotify", # "gprofiler2", "igraph", "jsonlite", "lme4", "locfit", #"matrixStats", # "mvtnorm", "parallelly", "pbapply", "pkgbuild", #"plotly", "pROC", # "processx", "Rcpp", "RcppAnnoy", "RcppArmadillo", #"readxl", "rmarkdown", # "scatterpie", "sf", "shiny", "sp", "styler", "testthat", #"torch", # "usethis", "uwot", "vctrs", "xml2" #), update = TRUE, ask = FALSE, force = TRUE) # # ...
# 2. INSTALL BIOCONDUCTOR DEPENDENCIES WITH BIOCMANAGER #- aenmd's bioconductor dependencies > aenmd_bioc_deps = c("BiocFileCache", "Biostrings", "BSgenome", "BSgenome.Hsapiens.NCBI.GRCh38","GenomeInfoDb", "GenomicRanges", "IRanges", "S4Vectors", "SummarizedExperiment", "VariantAnnotation") #- Install them with BiocManager > BiocManager::install(aenmd_bioc_deps)
If this does not help, please reach out to the developers.
Below is a short example, where we use a small vcf file provided with the aenmd
package.
First, we load variants we'd like to annotate. aenmd
uses GenomicRanges
with ref
and alt
metadata columns to represent variants.
#- load aenmd package
library(aenmd) #- automatically loads annotation data as well
#- load variants to annotate (1,000 random ClinVar variants)
vcf_file <- system.file('extdata/clinvar_20221211_noinfo_sample1k.vcf.gz', package = 'aenmd')
vcf <- aenmd:::parse_vcf_VariantAnnotation(vcf_file)
vcf_rng <- vcf$vcf_rng
vcf_rng
#GRanges object with 1000 ranges and 5 metadata columns:
# seqnames ranges strand | param_range_id ref alt qual filter
# <Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <DNAStringSet> <numeric> <character>
# [1] 1 940501-941150 * | NA GGAGCCTGCA...CAGATCTCCT G NA .
# [2] 1 942504-942505 * | NA CG C NA .
# [3] 1 1041417 * | NA C T NA .
#...
Next, we filter out variants that:
- overlap splice regions
- don't overlap coding sequence
- are SNVs but don't generate stop codons
We also generate a unique key for each variant.
#- Variant filtering
vcf_rng_fil <- process_variants(vcf_rng)
vcf_rng_fil
#GRanges object with 70 ranges and 7 metadata columns:
# seqnames ranges strand | param_range_id ref alt qual filter type key
# <Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <DNAStringSet> <numeric> <character> <character> <character>
# [1] 1 12011551 * | NA C T NA . snv 1:012011551|C|T
# [2] 1 99902707 * | NA C T NA . snv 1:099902707|C|T
# [3] 1 115727014 * | NA C A NA . snv 1:115727014|C|A
#...
For the remaining variants we annotate each overlapping transcript:
#- annotate
vcf_rng_ann <- annotate_nmd(vcf_rng_fil, rettype="gr")
vcf_rng_ann
#GRanges object with 142 ranges and 8 metadata columns:
# seqnames ranges strand | param_range_id ref alt qual filter type key res_aenmd
# <Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <DNAStringSet> <numeric> <character> <character> <character> <DataFrame>
# ENST00000235329|1:012011551|C|T 1 12011551 * | NA C T NA . snv 1:012011551|C|T TRUE:TRUE:FALSE:...
# ENST00000294724|1:099902707|C|T 1 99902707 * | NA C T NA . snv 1:099902707|C|T TRUE:FALSE:FALSE:...
# ENST00000361915|1:099902707|C|T 1 99902707 * | NA C T NA . snv 1:099902707|C|T TRUE:FALSE:FALSE:...
#...
Note the res_aenmd
column in the metadata of vcf_rng_ann.
This contains aenmd
's results.
Also, each range is named by the transcript-variant combination analyzed.
Below, we explore aenmd
's results
vcf_rng_ann$res_aenmd
#DataFrame with 142 rows and 7 columns
# is_ptc is_last is_penultimate is_cssProximal is_single is_407plus transcript
# <logical> <logical> <logical> <logical> <logical> <logical> <character>
#1 TRUE TRUE FALSE FALSE FALSE FALSE ENST00000235329
#2 TRUE FALSE FALSE FALSE FALSE FALSE ENST00000294724
#3 TRUE FALSE FALSE FALSE FALSE FALSE ENST00000361915
#4 TRUE FALSE FALSE FALSE FALSE FALSE ENST00000370163
#5 TRUE FALSE FALSE FALSE FALSE FALSE ENST00000370165
#... ... ... ... ... ... ... ...
#138 TRUE FALSE FALSE FALSE FALSE FALSE ENST00000288447
#139 TRUE FALSE FALSE FALSE FALSE FALSE ENST00000357033
#140 TRUE FALSE FALSE FALSE FALSE FALSE ENST00000447523
#141 FALSE FALSE FALSE FALSE FALSE FALSE ENST00000303391
#142 FALSE FALSE FALSE FALSE FALSE FALSE ENST00000453960
The first column indicates whether a variant causes a premature termination codon (PTC) in a transcript. The next five columns indicate rules as to whether the variant/transcript combination is expected to escape NMD. The last column gives the transcript name.
Above we see that the last two variant/transcript combinations do not have a PTC; the first combination is predicted to escape NMD, because the PTC is located in the last exon.
With the current version the following known issues exist:
-
Mitochondrial chromosomes have alternative start/stop codons, also in vertebrates.
This is not currently implemented inaenmd
, so that PTC calls for locations on the MT chromosome are only with respect to "normal" stop codons. -
aenmd
filters out variants that overlap splice regions in any annotated transcript. Therefore, variants that overlap splice regions in some transcripts but generate PTCs in others are currently not evaluated.