This page documents the content and provenance of the data files within the repository.
Transcript sequences:
data/yeast_CDS_w_250utrs.fa
Coding sequence locations:
data/yeast_CDS_w_250utrs.gff3
These files hold S288c annotations and ORF sequences.
These files were created as follows:
- The genome release R64-2-1 (file name S288C_reference_genome_R64-2-1_20150113.tgz) was downloaded from the Saccharomyces Genome Database.
- The files saccharomyces_cerevisiae_R64-2-1_20150113.gff and S288C_reference_sequence_R64-2-1_20150113.fsa were extracted from the .tgz file.
- The sequence and annotation files for the whole approximate Saccharomyces cerevisiae transcriptome were prepared using script_for_transcript_annotation.Rmd.
The files can be used as inputs to riboviz. However, yeast_CDS_w_250utrs.fa and yeast_CDS_w_250utrs.gff3 were downsampled to provide a manageable data set for demonstration purposes, as described in the next section.
Transcript sequences to align to, from just the left arm of chromosome 1:
vignette/input/yeast_YAL_CDS_w_250utrs.fa
Matched genome feature file, specifying coding sequences locations (start and stop coordinates):
vignette/input/yeast_YAL_CDS_w_250utrs.gff3
As the yeast data files described in the previous section are very large, they were downsampled for demonstration purposes. The data files yeast_CDS_w_250utrs.fa and yeast_CDS_w_250utrs.gff3 were filtered to keep only ORFs in the left arm of chromosome 1, that is, ORFs whose names have the form YALnnnx. This produced the above yeast genome and annotation data files.
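The filtering step can be illustrated with a minimal Python sketch. It is not the script used to produce the vignette files. It assumes that each FASTA header and each GFF3 seqid (column 1) begins with the systematic ORF name, and that left-arm chromosome 1 names match YAL followed by three digits and W or C. The function names and the pattern are hypothetical.

```python
import re

# Assumption: names look like "YAL068C" (YAL + three digits + W or C).
# Names with a suffix, such as "YAL064W-B", also match on their prefix.
YAL_PATTERN = re.compile(r"^YAL\d{3}[WC]")


def filter_fasta_by_name(lines, pattern=YAL_PATTERN):
    """Yield the lines of FASTA records whose header name matches pattern."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Decide once per record, using the text after ">".
            keep = bool(pattern.match(line[1:]))
        if keep:
            yield line


def filter_gff_by_seqid(lines, pattern=YAL_PATTERN):
    """Yield GFF3 comment lines and features whose seqid matches pattern."""
    for line in lines:
        seqid = line.split("\t", 1)[0]
        if line.startswith("#") or pattern.match(seqid):
            yield line
```

Both functions accept any iterable of lines, so an open file handle can be passed in directly and the result written out with `writelines`.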
The document Appendix A1: Yeast Nomenclature Systematic Open Reading Frame (ORF) and Other Genetic Designations describes the ORF naming convention.
The files can be used as inputs to riboviz.
rRNA sequences to avoid aligning to:
vignette/input/yeast_rRNA_R64-1-1.fa
This file was created as follows:
- The genome release R64-1-1 (file name S288C_reference_genome_R64-1-1_20110203.tgz) was downloaded from the Saccharomyces Genome Database.
- The file rna_coding_R64-1-1_20110203.fasta was extracted from the .tgz file.
- Selected RDN-n-n sequences were copied and pasted from this file.
The file can be used as an input to riboviz.
data/yeast_codon_table.tsv
This file was produced using script_for_transcript_annotation.Rmd as part of the preparation described in Saccharomyces cerevisiae (yeast) genome and annotation data above.
An identical table (differing only in the commented lines in the header) is produced by:
python -m riboviz.tools.get_cds_codons \
-f data/yeast_CDS_w_250utrs.fa \
-g data/yeast_CDS_w_250utrs.gff3 \
-c data/yeast_codon_table_alternative.tsv
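To check that two such tables differ only in their commented header lines, a comparison along the following lines could be used. This is a minimal sketch, assuming that commented lines begin with `#`. The function names are hypothetical and not part of riboviz.

```python
def table_lines(path, comment_prefix="#"):
    """Return the lines of a text file, skipping commented lines."""
    with open(path) as handle:
        return [
            line.rstrip("\n")
            for line in handle
            if not line.startswith(comment_prefix)
        ]


def same_table(path_a, path_b):
    """Return True if two files match once commented lines are ignored."""
    return table_lines(path_a) == table_lines(path_b)
```

For example, `same_table("data/yeast_codon_table.tsv", "data/yeast_codon_table_alternative.tsv")` would compare the two files named above.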
data/yeast_codon_pos_i200.RData
This file was produced using version 1.0 of script_for_transcript_annotation.Rmd.
The file contains the same data as data/yeast_codon_table.tsv from codon position 201 onwards.
The .RData file is obsolete as of March 2021, and will be removed once we have updated the code section CalculateCodonSpecificRibosomeDensity to use input data formatted from data/yeast_codon_table.tsv instead.
data/yeast_features.tsv
Data within this file was derived as follows:
1. Length_log10: genome release R64-2-1 (GFF) from the Saccharomyces Genome Database.
2. utr, 5' UTR length: Arribere, J.A. and Gilbert, W.V. "Roles for transcript leaders in translation and mRNA decay revealed by transcript leader sequencing", Genome Res. 2013. 23: 977-987 doi:10.1101/gr.150342.112
3. polyA length: Subtelny, A.O. et al. "Poly(A)-tail profiling reveals an embryonic switch in translational control", Nature, 2014, 508: 66-71 doi:10.1038/nature13007
4. uATGs: Estimated from 2 by counting the upstream ATGs in the annotated 5' UTR.
5. utr_gc: Estimated from 2 by calculating the proportion of G/C in the annotated 5' UTR.
6. FE_cap: Estimated from 2 using sequences of length 70 nt from the 5' end of the mRNA transcript, with folding energies calculated at 37 degrees Centigrade following the Supplementary Methods of Weinberg et al. 2016 "Improved Ribosome-Footprint and mRNA Measurements Provide Insights into Dynamics and Regulation of Yeast Translation", Cell Reports, 14(7), 23 February 2016, 1787-1799 doi:10.1016/j.celrep.2016.01.043. Calculations were done using RNAfold in the ViennaRNA package.
7. FE_atg: Estimated from the 30 nt upstream of the ATG.
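The two sequence-derived features, uATGs and utr_gc, amount to simple counts over the annotated 5' UTR sequence. The sketch below is illustrative only and is not the code used to build the file. It assumes the 5' UTR is available as a DNA string and that ATGs are counted in any reading frame. The function names are hypothetical.

```python
def count_upstream_atgs(utr_seq):
    """Count ATG triplets in a 5' UTR sequence, in any reading frame."""
    # "ATG" cannot overlap with itself, so str.count finds every occurrence.
    return utr_seq.upper().count("ATG")


def gc_fraction(utr_seq):
    """Return the proportion of G and C bases in a sequence (0.0 if empty)."""
    seq = utr_seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```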
data/yeast_tRNAs.tsv
A-site displacement values (based on standard yeast data from Ingolia 2009, Weinberg & Shah 2016):
data/yeast_standard_asite_disp_length.txt
These files are all read by generate_stats_figs.R to help with generating plots and tables of results data.
~1mi-sampled RPFs wild-type no additive:
vignette/input/SRR1042855_s1mi.fastq.gz
~1mi-sampled RPFs wild-type + 3-AT:
vignette/input/SRR1042864_s1mi.fastq.gz
These read data files are downsampled ribosome profiling data from Saccharomyces cerevisiae. The data has been downsampled to provide a dataset that is realistic, but small enough to run quickly.
The data is from the paper Guydosh N.R. and Green R. "Dom34 rescues ribosomes in 3' untranslated regions", Cell. 2014 Feb 27;156(5):950-62. doi: 10.1016/j.cell.2014.02.006. The NCBI accession for the whole dataset is #GSE52968:
- SRX386986: GSM1279570: wild-type no additive, SRR1042855
- SRX386995: GSM1279579: wild-type plus 3-AT, SRR1042864
In July 2017, these files were imported using NCBI's fastq-dump
and gzipped to produce:
SRR1042855.fastq.gz
SRR1042864.fastq.gz
(these files are not in the repository)
Note: The NCBI SRA (Sequence Read Archive) comments that "With release 2.9.1 of sra-tools we have finally made available the tool fasterq-dump, a replacement for the much older fastq-dump" and that "fastq-dump is still supported as it handles more corner cases than fasterq-dump, but it is likely to be deprecated in the future."
These files can alternatively be accessed via SRA Explorer:
- Search for: SRR1042855
- Select GSM1279570: wild-type no additive; Saccharomyces cerevisiae; OTHER
- Click Add 1 to collection
- Search for: SRR1042864
- Select GSM1279579: wild-type 3-AT; Saccharomyces cerevisiae; OTHER
- Click Add 1 to collection
- Click 2 saved datasets
- Click Bash script for downloading FastQ files
- Click Download
- Run the bash script, e.g.
$ source sra_explorer_fastq_download.sh
- Warning: the download may take ~40 minutes or more. The files are 1.5GB and 2.2GB respectively.
To produce the downsampled read data files, reads were sampled uniformly at random from each file at a rate of 1 in 50, giving about 1 million reads in total.
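The tool used for the original sampling is not recorded here. As an illustration only, the following Python sketch keeps each 4-line FASTQ record independently with probability 1/50, which gives approximately (not exactly) 1 in 50 reads. The function names and the `seed` parameter are hypothetical.

```python
import gzip
import random


def sample_fastq_records(lines, rate=1 / 50, seed=None):
    """Yield 4-line FASTQ records, keeping each with probability `rate`."""
    rng = random.Random(seed)
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            # One random draw per read decides whether it is kept.
            if rng.random() < rate:
                yield from record
            record = []


def downsample_fastq_gz(in_path, out_path, rate=1 / 50, seed=None):
    """Write a random sample of the reads in a gzipped FASTQ file."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(sample_fastq_records(src, rate, seed))
```

Fixing `seed` makes the sample reproducible. This sketch assumes each read occupies exactly four lines, which holds for the usual unwrapped FASTQ layout.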
data/simdata/
folder:
multiplex_barcodes.tsv
multiplex.fastq
multiplex_umi_barcode_adaptor.fastq
multiplex_umi_barcode.fastq
umi3.fastq
umi3_umi_adaptor.fastq
umi3_umi.fastq
umi5_umi3.fastq
umi5_umi3_umi_adaptor.fastq
umi5_umi3_umi.fastq
deplex/num_reads.tsv
deplex/Tag0.fastq
deplex/Tag1.fastq
deplex/Tag2.fastq
deplex/Unassigned.fastq
These files are simple simulated FASTQ files to test adaptor trimming, UMI extraction and deduplication using UMI-tools when invoked from within the riboviz workflow.
These files were created by running riboviz.tools.create_fastq_simdata.
The files can be used as inputs to riboviz.
data/demultiplex/
Test data for riboviz.tools.demultiplex_fastq.
Data was imported from https://github.com/ewallace/pyRNATagSeq, commit 6ffd465fb0a80d2134bad9d2147c877c3b363720 (Thu May 11 23:44:13 2017).
- Sample_4reads_R1.fastq.gz: artificial sample with 4 read 1s.
- Sample_4reads_R2.fastq.gz: 4 read 2s corresponding to Sample_4reads_R1.fastq.gz.
- Sample_init10000_R1.fastq.gz: initial 10000 read 1s from a paired-end S. cerevisiae dataset.
- Sample_init10000_R2.fastq.gz: 10000 read 2s corresponding to Sample_init10000_R1.fastq.gz.
- TagSeqBarcodedOligos2015.txt: TagSeq barcoded oligos used in Shishkin, et al. (2015). "Simultaneous generation of many RNA-seq libraries in a single reaction", Nature Methods, 12(4), 323–325. doi: 10.1038/nmeth.3313
data/Mok-tinysim-gffsam
folder.
Used by rscripts/tests/testthat/test_bam_to_h5.R.
Created using:
- riboviz, test-bam-to-h5-238 branch, commit 7b944eb, Wed Feb 3 07:57:11 2021.
- example-datasets, origin branch, commit 24c2fe4, Mon Jan 18 17:21:17 2021.
- amandamok/simRiboSeq, master branch, commit 8367709, Wed Jan 13 13:51:18 2021.
Get example-datasets:
$ git clone https://github.com/riboviz/example-datasets/
Get amandamok/simRiboSeq:
$ git clone https://github.com/amandamok/simRiboSeq/
$ ls simRiboSeq/simulation_runs/riboviz/
...
tiny_2genes.fq
Create configuration files directory in riboviz:
$ cd riboviz
$ mkdir -p Mok-tinysim/input
$ cp ../simRiboSeq/simulation_runs/riboviz/tiny_2genes.fq Mok-tinysim/input/
$ cp ../example-datasets/simulated/mok/Mok-tinysim_config.yaml .
Edit the .yaml file and change ../../riboviz/ to ../:
orf_fasta_file: ../example-datasets/simulated/mok/annotation/tiny_2genes_20utrs.fa
orf_gff_file: ../example-datasets/simulated/mok/annotation/tiny_2genes_20utrs.gff3
rrna_fasta_file: ../example-datasets/simulated/mok/contaminants/Sc_rRNA_example.fa
Run riboviz:
$ nextflow run prep_riboviz.nf -params-file Mok-tinysim_config.yaml -ansi-log false
Create and populate data/Mok-tinysim-gffsam:
$ mkdir data/Mok-tinysim-gffsam
$ samtools view -h Mok-tinysim/output/A/A.bam > data/Mok-tinysim-gffsam/A.sam
$ cp ../example-datasets/simulated/mok/annotation/tiny_2genes_20utrs.gff3 data/Mok-tinysim-gffsam/
riboviz/test/data/trim_5p_mismatch.sam
riboviz/test/data/trim_5pos5neg.sam
These files are used by riboviz.test.test_trim_5p_mismatch for testing riboviz.trim_5p_mismatch.
These files were created by running riboviz using vignette/vignette_config.yaml and the data in vignette/input/. Lines were copied and pasted from the output SAM files, then manually edited to produce a desired range of outcomes.
WTnone_rRNA_map_20.sam
WTnone_rRNA_map_20.bam
WTnone_rRNA_map_20.bam.bai
WTnone_rRNA_map_6_primary.sam
WTnone_rRNA_map_6_primary.bam
WTnone_rRNA_map_6_primary.bam.bai
WTnone_rRNA_map_14_secondary.sam
WTnone_rRNA_map_14_secondary.bam
WTnone_rRNA_map_14_secondary.bam.bai
The SAM files were created from the file tmp/WTnone/rRNA_map.sam from a run of the vignette (using riboviz commit 9efaf93, 08/10/2020):
- WTnone_rRNA_map_20.sam: the first 20 sequences from rRNA_map.sam.
- WTnone_rRNA_map_6_primary.sam: the 6 mapped (primary) sequences from WTnone_rRNA_map_20.sam.
- WTnone_rRNA_map_14_secondary.sam: the 14 remaining unmapped, non-primary, sequences from WTnone_rRNA_map_20.sam.
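How the primary and remaining sequences were separated is not recorded here. One way to make such a split is to test the SAM FLAG field (column 2), as in the sketch below. It relies on the FLAG bits defined in the SAM specification (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name is hypothetical.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def split_sam_lines(lines):
    """Split SAM lines into (header, primary_mapped, other) lists."""
    header, primary_mapped, other = [], [], []
    for line in lines:
        if line.startswith("@"):
            # Header lines would be copied to both output files.
            header.append(line)
            continue
        flag = int(line.split("\t")[1])
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            other.append(line)
        else:
            primary_mapped.append(line)
    return header, primary_mapped, other
```

Writing `header + primary_mapped` to one file and `header + other` to another would give two SAM files that together contain every alignment line from the input.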
The BAM and BAI files were created as follows:
$ samtools view -b <FILE>.sam | samtools sort -@1 -O bam -o <FILE>.bam
$ samtools index <FILE>.bam