Example datasets to run with RiboViz.

This example-datasets repository is for the configuration files and genome/annotation files needed to run the riboviz ribosome profiling pipeline on specific datasets. It aims to:

provide specific example datasets for new users to try or to adapt
share up-to-date tested example datasets between the riboviz development team

The main riboviz repository contains documentation of how to run riboviz in general.

Start here - DRAFT

This section will contain suggestions of example datasets to start with.

Contents and structure of example-datasets

What belongs in example-datasets

config.yaml files that describe all parameters for the riboviz run
transcriptome or ORFeome files needed:
- .fasta files of transcript/extended-ORF sequences
- .gff files that describe the CDS/ORF position within the fasta file
.fasta files of contaminants to exclude (rRNA, tRNA, etc)

Generally, the transcriptome fasta/gff files and contaminant fasta files would be referred to by multiple config.yaml files in the same species.

What does not belong here

read files, which are too big
- fastq or fastq.gz
- bam, sam, etc
genome fasta files. Instead, please refer to a genome build or link to the file.
- genome-centric gffs also probably do not belong here
processed data files such as riboviz outputs
everything else not specifically listed in "what belongs in example-datasets"
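
As a rough illustration of the read-file rule above, the following sketch flags file names that end in one of the listed read-file suffixes. The function names and the suffix tuple are hypothetical helpers written for this example; the repository itself does not ship or enforce such a check.

```python
from pathlib import PurePosixPath

# Suffixes copied from the "read files" entries in the list above.
# Assumption: matching is by file-name suffix only, case-insensitively.
READ_FILE_SUFFIXES = (".fastq", ".fastq.gz", ".bam", ".sam")


def is_read_file(path: str) -> bool:
    """Return True if the file name ends with a read-file suffix."""
    name = PurePosixPath(path).name.lower()
    return name.endswith(READ_FILE_SUFFIXES)


def find_read_files(paths):
    """Return the subset of paths that look like read files, in input order."""
    return [p for p in paths if is_read_file(p)]
```

Running `find_read_files` over the files staged for a pull request gives a quick list of anything that should be linked to rather than committed.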

Please open an issue on github if there is something we have overlooked.

Caution: no repository should exceed 1GB in size. GitHub's "What is my disk quota?" page comments: "If your repository exceeds 1GB, you might receive a polite email from GitHub Support requesting that you reduce the size of the repository to bring it back down."

Repository structure is loosely phylogenetic

The repository is organised roughly phylogenetically into subfolders. The example dataset yaml files for each species sit together in the same folder, with the fasta/gff files in a subfolder named annotation.

We have organised the repository into top-level folders by kingdom, and within each kingdom by genus (e.g. /fungi/neurospora, /animalia/homo). Kingdom and genus names are all lower case to avoid confusion with weblinks. When we set up the repository, this seemed a useful compromise that provides both human readability and easy navigability.
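
The kingdom/genus convention can be sketched as a small path helper. This is only an illustration: `dataset_dir` and `KINGDOMS` are hypothetical names, and the set holds the six kingdom folders listed later in this README. The simulated folder is left out because its internal layout is not described here.

```python
# Kingdom folder names as listed in this README (all lower case).
KINGDOMS = {"animalia", "archaea", "bacteria", "fungi", "plantae", "protista"}


def dataset_dir(kingdom: str, genus: str) -> str:
    """Build the lower-case kingdom/genus folder for an example dataset."""
    kingdom = kingdom.strip().lower()
    genus = genus.strip().lower()
    if kingdom not in KINGDOMS:
        raise ValueError(f"unknown kingdom folder: {kingdom!r}")
    return f"{kingdom}/{genus}"
```

For instance, `dataset_dir("Fungi", "Saccharomyces")` yields the folder used in the brewer's yeast example.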

An example of an example dataset from brewer's yeast

For example, the files for re-analysis of the yeast meiosis dataset from Brar 2012, aligned to an approximate transcriptome with CDS flanked by 250nt UTRs at both ends, are in fungi/saccharomyces:

fungi/saccharomyces/Brar_2012_Meiosis_RPF_6-samples_CDS_w_250utrs_config.yaml
fungi/saccharomyces/annotation for the annotation gff and fasta files:
- Saccharomyces_cerevisiae_yeast_CDS_w_250utrs.fa - fasta file of approximate transcripts
- Saccharomyces_cerevisiae_yeast_CDS_w_250utrs.gff3 - gff file of locations of ORFs on those transcripts
- Saccharomyces_cerevisiae_yeast_CDS_w_250utrs_annotation_provenance.txt - describes the provenance of these (where those files came from)
fungi/saccharomyces/contaminants for the contaminants fasta files:
- Saccharomyces_cerevisiae_yeast_rRNA_R64-1-1.fa
- Saccharomyces_cerevisiae_yeast_rRNA_R64-1-1-fasta_provenance.txt - describes the provenance of the fasta file

Top-level directories are kingdoms, with an artificial one for simulated data

Each of these directories contains a README.md file with more detailed information

animalia

animals (humans, mice, flies, worms, etc.)

archaea

archaea (Sulfolobus, Thermococcus, etc.)

bacteria

eubacteria (E. coli, B. subtilis, etc.)

fungi

yeasts, mushrooms, moulds, etc.

plantae

cress, grasses, trees, etc.

protista

eukaryotes that aren't animals, plants, or fungi (Toxoplasma, Plasmodium, etc.).

This may be convenient, despite protista being a dated and polyphyletic category. Please file a github issue to suggest a change.

simulated

Artificial datasets that don't come from a complete real genome.

How to submit an example dataset

We welcome community contributions!

We request that example datasets are submitted only once they have been tested thoroughly, i.e. riboviz runs on the example dataset with relevant .fastq-format data. Please submit by forking the repository and opening a pull request that contains only:

config.yaml files that describe all parameters for the riboviz run, and IF NEEDED:
transcriptome or ORFeome files needed:
- .fasta files of transcript/extended-ORF sequences
- .gff files that describe the CDS/ORF position within the fasta file
.fasta files of contaminants to exclude (rRNA, tRNA, etc)

The .fasta/.gff files would not be needed if example-datasets already had an analysis of another dataset on the same transcriptome, so please check first.

config.yaml

The config.yaml file should contain all parameters needed to run riboviz. This is described in prep-riboviz-config.md.

If your example dataset runs riboviz on published data in archives such as GEO/SRA/ENA, please ensure that config.yaml fastq filenames correspond to the accession numbers of the relevant SRA/ENA files.
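
One way to spot-check this is to test whether each fastq file name begins with a run accession. The sketch below assumes run-level accessions with the SRR, ERR or DRR prefixes used by SRA, ENA and DDBJ; the function name is hypothetical, and this README does not prescribe an exact naming rule.

```python
import re

# Assumption: file names start with a run accession such as SRR..., ERR... or DRR...
RUN_ACCESSION = re.compile(r"^(?:SRR|ERR|DRR)\d+")


def starts_with_run_accession(filename: str) -> bool:
    """Return True if the file name begins with an SRA/ENA/DDBJ run accession."""
    return RUN_ACCESSION.match(filename) is not None
```

File names that fail the check are not necessarily wrong, but they are worth a second look before submitting.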

Please begin the config.yaml with a provenance entry providing metadata on the riboviz run, the authors of the file, the version of riboviz that ran on the dataset, and the data source including publication reference and DOI, for example:

provenance:
  authors: # people who put together this config.yaml file
  - author: John Smith III
    email: John.Smith.III@ed.ac.uk
  - author: ...
    email: ...
  website: https://www.ed.ac.uk/some-bio-project
  date: 2020-04-01
  riboviz-version: TAG | COMMIT
  GEO: GSExxxxxxx # gene expression omnibus references for dataset, if relevant
  reference: Genome-Wide Analysis in Vivo of Translation with Nucleotide Resolution Using Ribosome Profiling, Ingolia et al 2009
  DOI: 10.1126/science.1168978
  notes: >
    Re-analysis of data from Ingolia 2009 to an updated yeast transcriptome.

We are currently (May 2020) reviewing the format of this in issue #riboviz166, so the format may change.
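
Until the format settles, a quick sanity check of the provenance entry might look like the sketch below. It works on a config that has already been loaded into a Python dict. The list of expected keys is an assumption drawn from the example above (authors, riboviz-version, reference, DOI), not an official schema, and `missing_provenance_keys` is a hypothetical helper.

```python
# Assumption: these keys mirror the example provenance entry above;
# the format is under review and may change.
EXPECTED_KEYS = ("authors", "riboviz-version", "reference", "DOI")


def missing_provenance_keys(config: dict) -> list:
    """Return expected provenance keys that are absent or empty in config."""
    provenance = config.get("provenance") or {}
    return [key for key in EXPECTED_KEYS if not provenance.get(key)]
```

An empty list means every expected key is present with a non-empty value.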

annotation files

Annotation files (.fasta files of transcript/extended-ORF sequences, .gff files that describe the CDS/ORF position within the fasta file) should be placed within the annotation subfolder described above. They should ideally be checked with check_fasta_gff.py, which currently checks whether start and stop codons are as expected. This can be run as follows:

$ python -m riboviz.tools.check_fasta_gff -f FASTA -g GFF

For example,

$ python -m riboviz.tools.check_fasta_gff -f data/yeast_CDS_w_250utrs.fa \
    -g data/yeast_CDS_w_250utrs.gff3

You can submit files with non-ATG start codons or in-frame stops if you have good reason to do so; check_fasta_gff.py is a diagnostic, not a prescription.
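
To make the kind of check concrete, here is a minimal sketch of start/stop codon checks on a single CDS sequence. It is not the implementation of check_fasta_gff.py. The function name, the issue messages and the use of the standard stop codons (TAA, TAG, TGA) are assumptions made for illustration.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(cds: str) -> list:
    """Return a list of human-readable issues found in a CDS sequence."""
    cds = cds.upper()
    issues = []
    if len(cds) % 3 != 0:
        issues.append("length not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        issues.append("no ATG start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        issues.append("no stop codon at end")
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        issues.append("internal in-frame stop codon")
    return issues
```

An empty list means the sequence starts with ATG, ends with a stop codon and has no in-frame stop in between. A non-empty list is a prompt to look closer, not a reason to reject the file.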

We are currently working on improving specification and testing for annotation files, see #riboviz174.

contaminant files

This is a .fasta-format file of everything that you want ignored in the downstream riboviz analysis. It will generally include ribosomal RNA (rRNA) from your species of interest, and perhaps also transfer RNA (tRNA) and other abundant non-coding RNA sequences.

provenance files

These are .txt format files that describe provenance or metadata covering where the annotation and contaminants come from. Ideally they should include data on repositories, genome releases, references, etc. These are in separate files, because .fasta files do not generally accept comments in the header.

For an example, see: fungi/saccharomyces/annotation/Saccharomyces_cerevisiae_yeast_CDS_w_250utrs_annotation_provenance.txt

pull request

When your example dataset is complete, please put in a pull request to the master branch and we will review.

We aim to implement automatic checking using the configuration validation option for nextflow, see issue #riboviz172.
