This example-datasets repository is for the configuration files and genome/annotation files needed to run the riboviz ribosome profiling pipeline on specific datasets. It aims to:
- provide specific example datasets for new users to try or to adapt
- share up-to-date tested example datasets between the riboviz development team
The main riboviz repository contains documentation of how to run riboviz in general.
This section will contain suggestions of example datasets to start with.
- config.yaml files that describe all parameters for the riboviz run
- transcriptome or ORFeome files needed:
- .fasta files of transcript/extended-ORF sequences
- .gff files that describe the CDS/ORF position within the fasta file
- .fasta files of contaminants to exclude (rRNA, tRNA, etc)
Generally, the transcriptome fasta/gff files and contaminant fasta files would be referred to by multiple config.yaml files in the same species.
- read files, which are too big
- fastq or fastq.gz
- bam, sam, etc
- genome fasta files. Instead, please refer to a genome build or link to the file.
- genome-centric gffs also probably do not belong here
- processed data files such as riboviz outputs
- everything else not specifically listed in "what belongs in example-datasets"
Please open an issue on GitHub if there is something we have overlooked.
Caution: no repository should exceed 1GB in size. GitHub's "What is my disk quota?" page says: "If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down."
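As a rough way to keep an eye on the 1GB guideline before committing new files, the following minimal sketch sums the sizes of files in a local checkout and lists the largest ones. The function names are hypothetical and not part of riboviz. It assumes that 1GB means 2**30 bytes and it only measures the working tree, so history stored in .git (which also counts towards repository size) is skipped.

```python
import os

# Assumption: the 1GB guideline is interpreted here as 2**30 bytes.
SIZE_LIMIT_BYTES = 1024 ** 3


def iter_file_sizes(root):
    """Yield (size_in_bytes, path) for each regular file under root, skipping .git."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so that os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield os.path.getsize(path), path


def directory_size(root):
    """Return the total size in bytes of the working-tree files under root."""
    return sum(size for size, _ in iter_file_sizes(root))


def largest_files(root, n=5):
    """Return the n largest files under root as (size_in_bytes, path), largest first."""
    return sorted(iter_file_sizes(root), reverse=True)[:n]
```

Calling `directory_size(".")` from the top of a checkout and comparing the result with `SIZE_LIMIT_BYTES` gives a quick indication of how close the working tree is to the limit, and `largest_files(".")` shows which files contribute most.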
The repository is organised roughly phylogenetically into subfolders. The example dataset yaml files for each species are in the same folder, with fasta/gff files in a subfolder annotation.
We have organised the repository into top-level folders by kingdom, and within each kingdom by genus (e.g. /fungi/neurospora, /animalia/homo). Kingdom and genus names are all lower case to avoid confusion with weblinks. When we set up the repository, this seemed a useful compromise between human readability and easy navigability.
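The naming convention above can be expressed as a small check. This is only an illustrative sketch: the pattern and function name are hypothetical, and it assumes that kingdom and genus folder names consist solely of lower-case ASCII letters.

```python
import re

# Hypothetical pattern for "<kingdom>/<genus>", both in lower-case ASCII letters,
# with an optional leading or trailing slash.
FOLDER_PATTERN = re.compile(r"^/?[a-z]+/[a-z]+/?$")


def follows_folder_convention(path):
    """Return True if path looks like a lower-case kingdom/genus folder."""
    return FOLDER_PATTERN.match(path) is not None
```

For instance, `follows_folder_convention("/fungi/neurospora")` is true, whereas a path with a capitalised genus such as `fungi/Neurospora` is rejected.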
For example, the files for re-analysis of the yeast meiosis dataset from Brar 2012, aligned to an approximate transcriptome with CDS flanked by 250nt UTRs at both ends, are in fungi/saccharomyces:
- fungi/saccharomyces/Brar_2012_Meiosis_RPF_6-samples_CDS_w_250utrs_config.yaml
- fungi/saccharomyces/annotation for the annotation gff and fasta files:
  - Saccharomyces_cerevisiae_yeast_CDS_w_250utrs.fa: fasta file of approximate transcripts
  - Saccharomyces_cerevisiae_yeast_CDS_w_250utrs.gff3: gff file of locations of ORFs on those transcripts
  - Saccharomyces_cerevisiae_yeast_CDS_w_250utrs_annotation_provenance.txt: describes the provenance of these (where those files came from)
- fungi/saccharomyces/contaminants for the contaminants fasta files:
  - Saccharomyces_cerevisiae_yeast_rRNA_R64-1-1.fa
  - Saccharomyces_cerevisiae_yeast_rRNA_R64-1-1-fasta_provenance.txt: describes the provenance of the fasta file
Each of these directories contains a README.md file with more detailed information.
animals (humans, mice, flies, worms, etc.)
archaea (Sulfolobus, Thermococcus, etc.)
eubacteria (E. coli, B. subtilis, etc.)
yeasts, mushrooms, moulds, etc.
cress, grasses, trees, etc.
eukaryotes that aren't animals, plants, or fungi (Toxoplasma, Plasmodium, etc.).
This may be convenient, despite protista being a dated and polyphyletic category. Please file a GitHub issue to suggest a change.
Artificial datasets that don't come from a complete real genome.
We welcome community contributions!
We request that example datasets are submitted only once they have been tested thoroughly, i.e. riboviz runs successfully on the example dataset with relevant .fastq-format data. Please submit by forking the repository and opening a pull request that contains only:
- config.yaml files that describe all parameters for the riboviz run, and IF NEEDED:
- transcriptome or ORFeome files needed:
- .fasta files of transcript/extended-ORF sequences
- .gff files that describe the CDS/ORF position within the fasta file
- .fasta files of contaminants to exclude (rRNA, tRNA, etc)
The .fasta/.gff files would not be needed if example-datasets already had an analysis of another dataset on the same transcriptome, so please check first.
The config.yaml file should contain all parameters needed to run riboviz. This is described in prep-riboviz-config.md.
If your example dataset runs riboviz on published data in archives such as GEO/SRA/ENA, please ensure that config.yaml fastq filenames correspond to the accession numbers of the relevant SRA/ENA files.
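One way to spot fastq filenames that do not refer to an archive accession is sketched below. It is an illustration only: the function name is hypothetical, the filenames in the usage note are made up, and it assumes that "corresponding to" an accession means the file's base name contains a run accession of the form SRR, ERR or DRR followed by at least six digits.

```python
import os
import re

# Assumption: run accessions look like SRR/ERR/DRR followed by six or more digits.
RUN_ACCESSION = re.compile(r"[SED]RR[0-9]{6,}")


def files_without_run_accession(filenames):
    """Return the filenames whose base name contains no run accession."""
    return [
        name
        for name in filenames
        if RUN_ACCESSION.search(os.path.basename(name)) is None
    ]
```

Given the made-up list `["SRR0000001.fastq.gz", "ERR0000002_1.fq", "sample1.fastq.gz"]`, only `sample1.fastq.gz` is reported, which would be a prompt to rename that file after its accession.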
Please begin the config.yaml with a provenance entry providing metadata on the riboviz run: the authors of the file, the version of riboviz that ran on the dataset, and the data source including publication reference and DOI. For example:
provenance:
  authors: # people who put together this config.yaml file
  - author: John Smith III
    email: John.Smith.III@ed.ac.uk
  - author: ...
    email: ...
  website: https://www.ed.ac.uk/some-bio-project
  date: 2020-04-01
  riboviz-version: TAG | COMMIT
  GEO: GSExxxxxxx # gene expression omnibus references for dataset, if relevant
  reference: Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling, Ingolia et al 2009
  DOI: 10.1126/science.1168978
  notes: >
    Re-analysis of data from Ingolia 2009 to an updated yeast transcriptome.
We are currently (May 2020) reviewing the format of this in issue #riboviz166, so the format may change.
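Until the format is settled, a quick self-check of the provenance entry might look like the sketch below. It works on a config that has already been parsed into a Python dictionary (for example with PyYAML's `yaml.safe_load`). The function name and the choice of which fields to treat as expected are assumptions based on the example above, not a riboviz requirement.

```python
# Assumption: these are the fields a reviewer would expect, following the
# description above (authors, riboviz version, publication reference and DOI).
EXPECTED_FIELDS = ("authors", "riboviz-version", "reference", "DOI")


def missing_provenance_fields(config):
    """Return the expected provenance fields that are absent or empty in config."""
    provenance = config.get("provenance")
    if not isinstance(provenance, dict):
        # The whole provenance entry is missing or is not a mapping.
        return ["provenance"]
    return [field for field in EXPECTED_FIELDS if not provenance.get(field)]
```

An empty list means every expected field is present. A config with no provenance entry at all yields `["provenance"]`.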
Annotation files (.fasta files of transcript/extended-ORF sequences, .gff files that describe the CDS/ORF position within the fasta file) should be placed within the annotation subfolder for the relevant genus. They should ideally be checked with check_fasta_gff.py, which currently checks if start and stop codons are as expected. This can be run as follows:
$ python -m riboviz.tools.check_fasta_gff -f FASTA -g GFF
For example,
$ python -m riboviz.tools.check_fasta_gff -f data/yeast_CDS_w_250utrs.fa \
-g data/yeast_CDS_w_250utrs.gff3
You can submit files with non-ATG start codons or in-frame stops if you have good reason to do so; check_fasta_gff.py is a diagnostic, not a prescription.
We are currently working on improving specification and testing for annotation files, see #riboviz174.
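To illustrate the kind of start and stop codon checks discussed above, here is a minimal standalone sketch. It is not the implementation of check_fasta_gff.py: the function name is hypothetical, it takes a CDS that has already been extracted as a DNA string, and it assumes the standard genetic code stop codons TAA, TAG and TGA.

```python
# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_issues(cds):
    """Return a list of human-readable issues found in a CDS DNA string."""
    cds = cds.upper()
    issues = []
    if len(cds) % 3 != 0:
        issues.append("length not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        issues.append("no ATG start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        issues.append("no stop codon at end")
    # Any stop codon strictly between the first and last codon is in-frame and internal.
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        issues.append("internal in-frame stop codon")
    return issues
```

A sequence such as `ATGAAATAA` produces no issues, while one beginning with `TTG` is flagged for its non-ATG start. As noted above, a flagged feature is something to explain, not necessarily something to fix.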
The contaminants file is a .fasta-format file of everything that you want ignored in the downstream riboviz analysis. It will generally encompass ribosomal RNA (rRNA) from your species of interest, perhaps also transfer RNA and other abundant non-coding RNA sequences.
Provenance files are .txt-format files that describe where the annotation and contaminants files come from. Ideally they should include information on repositories, genome releases, references, etc. They are kept as separate files because .fasta files do not generally accept comments in the header.
For an example, see: fungi/saccharomyces/annotation/Saccharomyces_cerevisiae_yeast_CDS_w_250utrs_annotation_provenance.txt
When your example dataset is complete, please submit a pull request to the master branch and we will review it.
We aim to implement automatic checking using the configuration validation option for nextflow, see issue #riboviz172.