
An R package for quantifying transposable elements at the loci-level

License: MIT (LICENSE.md); a second LICENSE file of unrecognized type is also present.

coriell-research/rmskProfiler


rmskProfiler

This package provides an end-to-end workflow for quantifying transposable elements (TEs) from RNA-seq data at the level of individual loci and importing the results into R for downstream analysis.

Installation

You can install the development version of rmskProfiler from GitHub with:

# install.packages("pak")
pak::pak("coriell-research/rmskProfiler")

Python dependencies

This package also requires a Python installation managed through either conda or virtualenv. The helper function install_rmskProfiler() uses reticulate to download the necessary Python dependencies into a dedicated environment called “r-rmskProfiler”. For example, on my machine, which uses mambaforge, I can install the environment like so:

install_rmskProfiler(
  method = "conda",
  conda = "/home/gennaro/mambaforge/condabin/conda",
  channel = "bioconda"
  )

Once the environment is installed, activate it each time you load the package in a new R session:

library(rmskProfiler)
reticulate::use_condaenv("r-rmskProfiler")

If you do not want to generate your own index and would rather use one of the pre-built versions (hg38 and mm10, obtained with the function defaults), you can skip the Python dependencies entirely: Python is only used during index generation.
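As a rough check before generating an index, the sketch below asks reticulate whether a conda environment with the expected name is visible. It assumes a conda-based install; the variable names `env_name` and `has_env` are illustrative and not part of rmskProfiler.

```r
# Name of the environment that install_rmskProfiler() is described as creating
env_name <- "r-rmskProfiler"

# conda_list() errors when no conda binary can be found, so trap that case
envs <- tryCatch(reticulate::conda_list(), error = function(e) NULL)

# TRUE only when conda was found and lists an environment with this name
has_env <- !is.null(envs) && env_name %in% envs$name

if (has_env) {
  reticulate::use_condaenv(env_name, required = TRUE)
} else {
  message("Environment '", env_name, "' not found; run install_rmskProfiler() first")
}
```

If the environment was created with virtualenv instead, `reticulate::virtualenv_list()` would be the analogous lookup.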

Salmon dependency

This package assumes that you have a recent version of Salmon installed and available on your PATH. If not, please follow the latest installation instructions before using this package.
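A quick way to confirm that Salmon is reachable from R is to look it up on the PATH with base R. This is only a convenience sketch, not a function provided by rmskProfiler, and the variable name `salmon_path` is illustrative.

```r
# Sys.which() returns the full path to the executable, or "" if it is not found
salmon_path <- unname(Sys.which("salmon"))

if (nzchar(salmon_path)) {
  # Ask Salmon to report its version so you can judge whether it is recent enough
  salmon_version <- system2(salmon_path, "--version", stdout = TRUE, stderr = TRUE)
  message("Found ", paste(salmon_version, collapse = " "), " at ", salmon_path)
} else {
  message("salmon was not found on the PATH; install it before running quantification")
}
```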

Usage

The package has three main components:

  1. Generate a Salmon index using unique RepeatMasker elements + transcripts + genomic decoy
  2. Quantify reads with Salmon using Gibbs sampling
  3. Import the Salmon quants as a SummarizedExperiment for downstream differential expression analysis

A full pipeline that generates the index, quantifies reads, and imports the quantifications looks like:

library(rmskProfiler)
reticulate::use_condaenv("r-rmskProfiler")


# Generate the Salmon index for humans using 12 threads 
# and save to directory named 'hg38-resources'
generateIndex("hg38-resources", species = "Hs", threads = 12)

fq1 <- c("/path/to/sample1.R1.fq.gz", "/path/to/sample2.R1.fq.gz", "/path/to/sample3.R1.fq.gz")
fq2 <- c("/path/to/sample1.R2.fq.gz", "/path/to/sample2.R2.fq.gz", "/path/to/sample3.R2.fq.gz")
sample_names <- c("sample1", "sample2", "sample3")

# Perform quantification with Salmon on fastq files
salmonQuant(
  fq1 = fq1, 
  fq2 = fq2, 
  sample_names = sample_names, 
  resource_dir = "hg38-resources", 
  out_dir = "quants", 
  "--gcBias",                       # Additional arguments can be passed as character strings
  "--seqBias",
  "--posBias",
  "--threads 12"
  )

# Import the transcripts and TE loci counts as a SummarizedExperiment object
# rowData contains transcript and TE annotations and GRanges
se <- importQuants("quants", resources_dir = "hg38-resources")

# Proceed to downstream analysis using edgeR
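
The downstream step is left open above. As one possible shape for it, the sketch below runs a standard edgeR quasi-likelihood workflow on a toy SummarizedExperiment. Everything about the data is an assumption made for illustration: the simulated counts, the assay name "counts", and the two-group design are not taken from rmskProfiler, so check the assay names and sample metadata of your own imported object before adapting it.

```r
library(SummarizedExperiment)
library(edgeR)

# Toy stand-in for an imported object: 100 features x 6 samples (simulated data)
set.seed(1)
toy_counts <- matrix(
  rnbinom(600, mu = 50, size = 5),
  nrow = 100,
  dimnames = list(paste0("feature", 1:100), paste0("sample", 1:6))
)
toy_se <- SummarizedExperiment(assays = list(counts = toy_counts))

# Hypothetical design: three control and three treated samples
group <- factor(rep(c("control", "treated"), each = 3))
design <- model.matrix(~group)

# Build the DGEList, drop low-count features, and normalize library sizes
y <- DGEList(counts = assay(toy_se, "counts"), group = group)
keep <- filterByExpr(y, design)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)

# Estimate dispersions and test the treated vs. control coefficient
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
res <- glmQLFTest(fit, coef = 2)
top <- topTags(res, n = 10)
```

With real data, replace `toy_se` with the imported object and `group` with your own sample annotation.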

Filtering and aggregating

The imported object holds counts for individual TE loci and individual transcripts, but sometimes a subfamily-level or gene-level analysis is preferable. The following code filters out TE loci assigned to exons and then sums TE-locus counts to the subfamily level and transcript counts to the gene level.

Below, any TE locus that overlaps an exon, 3’ UTR, or 5’ UTR on either strand, or whose exact sequence occurs at more than one location in the genome, is removed from downstream analysis. Among transcripts, only protein-coding ones are kept.

library(SummarizedExperiment)

# Select loci to keep
loci <- subset(
  rowData(se), 
  !is.na(Hash) & 
  hasUnstrandedExonic == FALSE & 
  hasUnstranded3UTR == FALSE & 
  hasUnstranded5UTR == FALSE &
  N_Loci == 1,
  select = "Hash",
  drop = TRUE
  )

# Select only protein coding transcripts
tx <- subset(
  rowData(se), 
  !is.na(transcript_id) & gene_type == "protein_coding", 
  "transcript_id", 
  drop = TRUE
  )

# Filter the SummarizedExperiment object
keep <- rownames(se) %in% c(tx, loci)
filtered <- se[keep, ]

# Sum assays to the gene/subfamily level
aggregated <- aggregateCounts(filtered, level = "subfamily")
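
To make the idea of aggregation concrete, here is a minimal base R illustration of summing locus-level counts to the subfamily level with `rowsum()`. It is not the implementation of aggregateCounts(); the count matrix and the subfamily labels are made up for the example.

```r
# Made-up counts for four TE loci across three samples
loci_counts <- matrix(
  c(10, 0, 5,
     3, 2, 1,
     7, 7, 7,
     1, 4, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("locus1", "locus2", "locus3", "locus4"),
    c("sample1", "sample2", "sample3")
  )
)

# Hypothetical subfamily assignment for each locus, in row order
subfamily <- c("AluY", "AluY", "L1HS", "L1HS")

# Sum the rows that share a subfamily label; one output row per subfamily
subfamily_counts <- rowsum(loci_counts, group = subfamily)
subfamily_counts
#      sample1 sample2 sample3
# AluY      13       2       6
# L1HS       8      11       7
```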
