GitHub

aenmd

R package for annotating predicted escape from nonsense-mediated decay (NMD)

Introduction
Installing aenmd
Getting started
Known issues

Introduction

This R package provides functionality to annotate transcripts with genetic variants that introduce premature termination codons (stop codons) with predicted escape from NMD. More details can be found in the package documentation, and in our preprint here.

The aenmd package depends on transcript annotations that are provided via data packages that can be found in the aenmd_data repository on GitHub. Further on, a command line interface/utility (aenmd_cli) is available on GitHub as well.

Getting started on Google colab

Below we have code for installing and getting started with aenmd. Since installation of Github packages can be somewhat "interesting" we also provide a notebook on Google colab, so that users can install aenmd in a controlled environment. You can get started by following this link: getting started on Colab.

Installing `aenmd`

#- need the remotes package
if(!require(remotes)){
    install.packages("remotes")
}

#- install data package (current default is ENSEMBL v 105)
remotes::install_github("kostkalab/aenmd_data/aenmd.data.ensdb.v105")

#- install aenmd package
remotes::install_github("kostkalab/aenmd")

Trouble shooting installing `aenmd`

Sometimes installation of aenmd can fail. If that happens, the following might help:

R version. Check that the version of R you are using is at least 4.1.0. You can do so by running the command R.Version()

Trouble installing Bioconductor dependencies. If Bioconductor dependencies fail to install, it might help to (a) check/update your Bioconductor versiona and (b) install offending packages with BiocManager::install(). The following commands might be helpful (also see the Bioconcductor installation instructions):

# 1. UPDATE BIOCONDUCTOR
#
#- This will (amongst other things) identify out-of-date packages. 
#- Run the suggested command to update your installation
> BiocManager::valid()
# ...
#
#create a valid installation with
#
#BiocManager::install(c(
#  "BiocManager", "broom", "bslib", "cpp11", "curl", #"dbplyr", "digest",
#  "downlit", "emdbook", "emmeans", "future", "gargle", #"ggfun", "ggplotify",
#  "gprofiler2", "igraph", "jsonlite", "lme4", "locfit", #"matrixStats",
#  "mvtnorm", "parallelly", "pbapply", "pkgbuild", #"plotly", "pROC",
#  "processx", "Rcpp", "RcppAnnoy", "RcppArmadillo", #"readxl", "rmarkdown",
#  "scatterpie", "sf", "shiny", "sp", "styler", "testthat", #"torch",
#  "usethis", "uwot", "vctrs", "xml2"
#), update = TRUE, ask = FALSE, force = TRUE)
#
# ...

# 2. INSTALL BIOCONDUCTOR DEPENDENCIES WITH BIOCMANAGER

#- aenmd's bioconductor dependencies

> aenmd_bioc_deps = c("BiocFileCache", "Biostrings", "BSgenome", "BSgenome.Hsapiens.NCBI.GRCh38","GenomeInfoDb", "GenomicRanges", "IRanges", "S4Vectors", "SummarizedExperiment", "VariantAnnotation")

#- Install them with BiocManager
> BiocManager::install(aenmd_bioc_deps)

If this does not help, please reach out to the developers.

Getting started

Below is a short example, where we use a small vcf file provided with the aenmd package.

1. Import variants from vcf file

First, we load variants we'd like to annotate. aenmd uses GenomicRanges with ref and alt metadata columns to represent variants.

#- load aenmd package
library(aenmd) #- automatically loads annotation data as well

#- load variants to annotate (1,000 random ClinVar variants)
vcf_file <- system.file('extdata/clinvar_20221211_noinfo_sample1k.vcf.gz', package = 'aenmd')
vcf      <- aenmd:::parse_vcf_VariantAnnotation(vcf_file)
vcf_rng  <- vcf$vcf_rng
vcf_rng
#GRanges object with 1000 ranges and 5 metadata columns:
#         seqnames              ranges strand | param_range_id                     ref            alt      qual      filter
#            <Rle>           <IRanges>  <Rle> |       <factor>          <DNAStringSet> <DNAStringSet> <numeric> <character>
#     [1]        1       940501-941150      * |             NA GGAGCCTGCA...CAGATCTCCT              G        NA           .
#     [2]        1       942504-942505      * |             NA                      CG              C        NA           .
#     [3]        1             1041417      * |             NA                       C              T        NA           .
#...

2. Filter out variants we don't annotate

Next, we filter out variants that:

overlap splice regions
don't overlap coding sequence
are SNVs but don't generate stop codons

We also generate a unique key for each variant.

#- Variant filtering
vcf_rng_fil <- process_variants(vcf_rng)
vcf_rng_fil
#GRanges object with 70 ranges and 7 metadata columns:
#       seqnames              ranges strand | param_range_id                     ref            alt      qual      filter        type                    key
#          <Rle>           <IRanges>  <Rle> |       <factor>          <DNAStringSet> <DNAStringSet> <numeric> <character> <character>            <character>
#   [1]        1            12011551      * |             NA                       C              T        NA           .         snv        1:012011551|C|T
#   [2]        1            99902707      * |             NA                       C              T        NA           .         snv        1:099902707|C|T
#   [3]        1           115727014      * |             NA                       C              A        NA           .         snv        1:115727014|C|A
#...

3. Annotate remaining variant/transcript pairs

For the remaining variants we annotate each overlapping transcript:

#- annotate
vcf_rng_ann <- annotate_nmd(vcf_rng_fil, rettype="gr")
vcf_rng_ann
#GRanges object with 142 ranges and 8 metadata columns:
#                                  seqnames    ranges strand | param_range_id            ref            alt      qual      filter        type             key            res_aenmd
#                                     <Rle> <IRanges>  <Rle> |       <factor> <DNAStringSet> <DNAStringSet> <numeric> <character> <character>     <character>          <DataFrame>
#  ENST00000235329|1:012011551|C|T        1  12011551      * |             NA              C              T        NA           .         snv 1:012011551|C|T  TRUE:TRUE:FALSE:...
#  ENST00000294724|1:099902707|C|T        1  99902707      * |             NA              C              T        NA           .         snv 1:099902707|C|T TRUE:FALSE:FALSE:...
#  ENST00000361915|1:099902707|C|T        1  99902707      * |             NA              C              T        NA           .         snv 1:099902707|C|T TRUE:FALSE:FALSE:...
#...

Note the res_aenmd column in the metadata of vcf_rng_ann. This contains aenmd's results. Also, each range is named by the transcript-variant combination analyzed. Below, we explore aenmd's results

vcf_rng_ann$res_aenmd
#DataFrame with 142 rows and 7 columns
#       is_ptc   is_last is_penultimate is_cssProximal is_single is_407plus      transcript
#    <logical> <logical>      <logical>      <logical> <logical>  <logical>     <character>
#1        TRUE      TRUE          FALSE          FALSE     FALSE      FALSE ENST00000235329
#2        TRUE     FALSE          FALSE          FALSE     FALSE      FALSE ENST00000294724
#3        TRUE     FALSE          FALSE          FALSE     FALSE      FALSE ENST00000361915
#4        TRUE     FALSE          FALSE          FALSE     FALSE      FALSE ENST00000370163
#5        TRUE     FALSE          FALSE          FALSE     FALSE      FALSE ENST00000370165
#...       ...       ...            ...            ...       ...        ...             ...
#138      TRUE     FALSE          FALSE          FALSE     FALSE      FALSE ENST00000288447
#139      TRUE     FALSE          FALSE          FALSE     FALSE      FALSE ENST00000357033
#140      TRUE     FALSE          FALSE          FALSE     FALSE      FALSE ENST00000447523
#141     FALSE     FALSE          FALSE          FALSE     FALSE      FALSE ENST00000303391
#142     FALSE     FALSE          FALSE          FALSE     FALSE      FALSE ENST00000453960

The first column indicates whether a variant causes a premature termination codon (PTC) in a transcript. The next five columns indicate rules as to whether the variant/transcript combination is expected to escape NMD. The last column gives the transcript name.

Above we see that the last two variant/transcript combinations do not have a PTC; the first combination is predicted to escape NMD, because the PTC is located in the last exon.

Known Issues

With the current version the following known issues exist:

Mitochondrial chromosomes have alternative start/stop codons, also in vertebrates.
This is not currently implemented in aenmd, so that PTC calls for locations on the MT chromosome are only with respect to "normal" stop codons.
aenmd filters out variants that overlap splice regions in any annotated transcript. Therefore, variants that overlap splice regions in some transcripts but generate PTCs in others are currently not evaluated.

Name		Name	Last commit message	Last commit date
Latest commit History 83 Commits
R		R
inst/extdata		inst/extdata
man		man
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
Readme.md		Readme.md
bitbucket-pipelines.yml		bitbucket-pipelines.yml
colab_install_aenmd.ipynb		colab_install_aenmd.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

aenmd

R package for annotating predicted escape from nonsense-mediated decay (NMD)

Introduction

Getting started on Google colab

Installing `aenmd`

Trouble shooting installing `aenmd`

Getting started

1. Import variants from vcf file

2. Filter out variants we don't annotate

3. Annotate remaining variant/transcript pairs

Known Issues

About

Releases

Packages

Languages

License

kostkalab/aenmd

Folders and files

Latest commit

History

Repository files navigation

aenmd

R package for annotating predicted escape from nonsense-mediated decay (NMD)

Introduction

Getting started on Google colab

Installing aenmd

Trouble shooting installing aenmd

Getting started

1. Import variants from vcf file

2. Filter out variants we don't annotate

3. Annotate remaining variant/transcript pairs

Known Issues

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Installing `aenmd`

Trouble shooting installing `aenmd`

Packages