Skip to content

Commit

Permalink
fix: updated documentation and pipeline description
Browse files Browse the repository at this point in the history
  • Loading branch information
mkatsanto committed Feb 2, 2024
1 parent 2226bf5 commit 5b87ca4
Show file tree
Hide file tree
Showing 5 changed files with 76 additions and 17 deletions.
22 changes: 8 additions & 14 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -269,23 +269,16 @@ your run.
# Sample downloads from SRA
An independent Snakemake workflow `workflow/rules/sra_download.smk` is included
for the download of SRA samples with [sra-tools].
> Note: as of Snakemake 7.3.1, only profile conda is supported.
> Singularity fails because the *sra-tools* Docker container only has `sh`
but `bash` is required.
> Note: The workflow uses the implicit temporary directory
from snakemake, which is called with [resources.tmpdir].
for the download of SRA samples.
The workflow expects the following config:
* `samples`, a sample table (tsv) with column *sample* containing *SRR* identifiers,
* `samples`, a sample table (tsv) with column *sample* containing *RR* identifiers,
see example [here](tests/input_files/sra_samples.tsv).
* `outdir`, an output directory
* `samples_out`, a pointer to a modified sample table with location of fastq files
* `cluster_log_dir`, the cluster log directory.
For executing the example one can use the following
For executing the example one can use the following conda execution
(with activated *zarp* environment):
```bash
Expand All @@ -297,11 +290,13 @@ snakemake --snakefile="workflow/rules/sra_download.smk" \
log_dir="logs" \
cluster_log_dir="logs/cluster_log"
```
or the singularity one by replacing ```local-conda``` with ```local-singularity```
After successful execution, `results/sra_downloads/sra_samples.out.tsv` should contain:
```tsv
sample fq1 fq2
SRR18552868 results/sra_downloads/SRR18552868/SRR18552868.fastq.gz
SRR18549672 results/sra_downloads/SRR18549672/SRR18549672_1.fastq.gz results/sra_downloads/SRR18549672/SRR18549672_2.fastq.gz
ssample fq1 fq2
SRR18552868 results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz
SRR18549672 results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
ERR2248142 results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz
```
Expand Down Expand Up @@ -355,5 +350,4 @@ After successful execution - if all parameters could be either inferred or were
[slurm]: <https://slurm.schedmd.com/documentation.html>
[zavolan-lab]: <https://www.biozentrum.unibas.ch/research/researchgroups/overview/unit/zavolan/research-group-mihaela-zavolan/>
[pipeline-documentation]: pipeline_documentation.md
[sra-tools]: <https://github.com/ncbi/sra-tools>
[resources.tmpdir]: <https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?#standard-resources>
65 changes: 65 additions & 0 deletions pipeline_documentation.md
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,16 @@ on installation and usage please see [here](README.md).
- [`map_genome_star`](#map_genome_star)
- [`quantification_salmon`](#quantification_salmon)
- [`genome_quantification_kallisto`](#genome_quantification_kallisto)
- [Description of SRA download workflow steps](#description-of-sra-download-workflow-steps)
- [SRA Sequencing mode-independent ](#sra-sequencing-mode-independent)
- [`get_layout`](#get_layout)
- [`prefetch`](#prefetch)
- [`add_fq_file_path`](#add_fq_file_path)
- [SRA Sequencing mode-specific](#sra-sequencing-mode-specific)
- [`fasterq_dump`](#fasterq_dump)
- [`compress_fastq`](#remove_polya_cutadapt)
- [`process_fastq`](#process_fastq)


## Third-party software used

Expand All @@ -59,13 +69,16 @@ on installation and usage please see [here](README.md).
| **bedtools** | [GPLv2][license-gpl2] | _"[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF"_ | [code][code-bedtools] / [manual][code-bedtools] |
| **cutadapt** | [MIT][license-mit] | _"[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads"_ | [code][code-cutadapt] / [manual][docs-cutadapt] / [publication][pub-cutadapt] |
| **gffread** | [MIT][license-mit] | _"[...] validate, filter, convert and perform various other operations on GFF files"_ | [code][code-gffread] / [manual][docs-gffread] |
| **Entrez Direct** | [custom][license-entrez-direct] | _"[...] an advanced method for accessing the NCBI's set of interconnected databases from a UNIX terminal window"_ | [code][code-entrez-direct] / [manual][docs-entrez-direct] / [publication][pub-entrez-direct] |
| **FastQC** | [GPLv3][license-gpl3] | _"A quality control analysis tool for high throughput sequencing data"_ | [code][code-fastqc] / [manual][docs-fastqc] |
| **ImageMagick** | [custom][license-imagemagick]^ | _"[...] create, edit, compose, or convert bitmap images"_ | [code][code-imagemagick] / [manual][docs-imagemagick] |
| **kallisto** | [BSD-2][license-bsd2] | _"[...] program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads"_ | [code][code-kallisto] / [manual][docs-kallisto] / [publication][pub-kallisto] |
| **MultiQC** | [GPLv3][license-gpl3] | _"Aggregate results from bioinformatics analyses across many samples into a single report"_ | [code][code-multiqc] / [manual][docs-multiqc] / [publication][pub-multiqc] |
| **pigz** | [custom][license-pigz] | _"[...] parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data"_ | [code][code-pigz] / [manual][docs-pigz] |
| **RSeqC** | [GPLv3][license-gpl3] | _"[...] comprehensively evaluate different aspects of RNA-seq experiments, such as sequence quality, GC bias, polymerase chain reaction bias, nucleotide composition bias, sequencing depth, strand specificity, coverage uniformity and read distribution over the genome structure."_ | [code][code-rseqc] / [manual][docs-rseqc] / [publication][pub-rseqc] |
| **Salmon** | [GPLv3][license-gpl3] | _"Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment"_ | [code][code-salmon] / [manual][docs-salmon] / [publication][pub-salmon] |
| **SAMtools** | [MIT][license-mit] | _"[...] suite of programs for interacting with high-throughput sequencing data"_ | [code][code-samtools] / [manual][docs-samtools] / [publication][pub-samtools] |
| **SRA Tools** | [custom][license-sra-tools] | _"[...] collection of tools and libraries for using data in the INSDC Sequence Read Archives"_ | [code][code-sra-tools] / [manual][docs-sra-tools] |
| **STAR** | [MIT][license-mit] | _"**S**pliced **T**ranscripts **A**lignment to a **R**eference"_ - _"RNA-seq aligner"_ | [code][code-star] / [manual][docs-star] / [publication][pub-star] |

^ compatible with [GPLv3][license-gpl3]
Expand Down Expand Up @@ -715,18 +728,63 @@ Generate pseudoalignments of reads to transcripts with
- `--single`: Quantify single-end reads **(single-end only)**
- `--pseudobam`: Save pseudoalignments to transcriptome to BAM file


## Description of SRA download workflow steps

> This separate workflow consists of three Snakemake files: A main `sra_download.smk` and an
> individual Snakemake file for each sequencing mode (single-end and
> paired-end), as parameters for some tools differ between sequencing modes.
> The main `sra_download.smk` contains general steps for downloading the samples
> from the SRA repository and determining the sequencing mode in order to execute
> the appropriate subsequent rules.Individual steps of the workflow are described
> briefly, and links to the respective software manuals are given. Parameters that
> can be modified by the user (via the samples table) are also described. Descriptions
> for steps for which individual "rules" exist for single- and paired-end
> sequencing libraries are combined, and only differences between the modes are
> highlighted.

### SRA Sequencing mode-independent

#### `get_layout`
Get the library type of each sample (paired or single-end) using efetch [**Entrez direct**](#third-party-software-used).
- **Output**
- A file with a name either PAIRED or SINGLE which is used downstream to run the
appropriate subworkflow.

#### `prefetch`
Download the SRA entry using [**SRA Tools**](#third-party-software-used)

#### `add_fq_file_path`
Aggregate the fastq file(s) path(s) in a table for all samples.

### SRA Sequencing mode-specific

#### `fasterq_dump`
Converts SRA entry to fastq file(s) using [**SRA Tools**](#third-party-software-used)

### `compress_fastq`
Compresses fastq file(s) to .gz format. [**pigz**](#third-party-software-used)

### `process_fastq`
Keep the fastq.gz file path in a table, later aggregated in one table in `add_fq_file_path`.


[code-alfa]: <https://github.com/biocompibens/ALFA>
[code-bedgraphtobigwig]: <https://github.com/ucscGenomeBrowser/kent>
[code-bedtools]: <https://github.com/arq5x/bedtools2>
[code-cutadapt]: <https://github.com/marcelm/cutadapt>
[code-gffread]: <https://github.com/gpertea/gffread>
[code-entrez-direct]: <https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/>
[code-fastqc]: <https://github.com/s-andrews/FastQC>
[code-imagemagick]: <https://github.com/ImageMagick/ImageMagick/>
[code-kallisto]: <https://github.com/pachterlab/kallisto>
[code-multiqc]: <https://github.com/ewels/MultiQC>
[code-pigz]: <https://github.com/madler/pigz>
[code-rseqc]: <http://rseqc.sourceforge.net/>
[code-salmon]: <https://github.com/COMBINE-lab/salmon>
[code-samtools]: <https://github.com/samtools/samtools>
[code-sra-tools]: <https://github.com/ncbi/sra-tools>
[code-star]: <https://github.com/alexdobin/STAR>
[custom-script-gtf-to-bed12]: <https://github.com/zavolanlab/zgtf>
[custom-script-tin]: <https://github.com/zavolanlab/tin-score-calculation>
Expand All @@ -738,25 +796,32 @@ Generate pseudoalignments of reads to transcripts with
[docs-cutadapt]: <https://cutadapt.readthedocs.io/en/stable/>
[docs-cutadapt-m]: <https://cutadapt.readthedocs.io/en/stable/guide.html#filtering-reads>
[docs-gffread]: <http://ccb.jhu.edu/software/stringtie/gff.shtml#gffread>
[docs-entrez-direct]: <https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/README>
[docs-fastqc]: <http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/>
[docs-imagemagick]: <https://imagemagick.org/>
[docs-kallisto]: <http://pachterlab.github.io/kallisto/manual.html>
[docs-multiqc]: <https://multiqc.info/docs/>
[docs-pigz]:<https://zlib.net/pigz/pigz.pdf>
[docs-rseqc]: <http://rseqc.sourceforge.net/#usage-information>
[docs-salmon]: <https://salmon.readthedocs.io/en/latest/>
[docs-salmon-selective-alignment]: <https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/>
[docs-samtools]: <http://www.htslib.org/doc/samtools.html>
[docs-snakemake]: <https://snakemake.readthedocs.io/en/stable/>
[docs-snakemake-target-rule]: <https://snakemake.readthedocs.io/en/stable/tutorial/basics.html#step-7-adding-a-target-rule>
[docs-sra-tools]: <https://github.com/ncbi/sra-tools/wiki>
[docs-star]: <https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf>
[docs-star-rpm-norm]: <https://ycl6.gitbooks.io/rna-seq-data-analysis/visualization.html>
[license-bsd2]: <https://opensource.org/licenses/BSD-2-Clause>
[license-entrez-direct]: <https://www.ncbi.nlm.nih.gov/books/NBK179288/>
[license-gpl2]: <https://opensource.org/licenses/GPL-2.0>
[license-gpl3]: <https://opensource.org/licenses/GPL-3.0>
[license-imagemagick]: <https://github.com/ImageMagick/ImageMagick/blob/master/LICENSE>
[license-mit]: <https://opensource.org/licenses/MIT>
[license-pigz]: <https://github.com/madler/pigz/blob/master/README>
[license-sra-tools]: <https://github.com/ncbi/sra-tools/blob/master/LICENSE>
[pub-alfa]: <https://doi.org/10.1186/s12864-019-5624-2>
[pub-cutadapt]: <https://doi.org/10.14806/ej.17.1.200>
[pub-entrez-direct]: <https://www.ncbi.nlm.nih.gov/books/NBK179288/>
[pub-kallisto]: <https://doi.org/10.1038/nbt.3519>
[pub-multiqc]: <https://doi.org/10.1093/bioinformatics/btw354>
[pub-rseqc]: <https://doi.org/10.1093/bioinformatics/bts356>
Expand Down
2 changes: 1 addition & 1 deletion workflow/rules/sra_download.smk
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ rule all:


checkpoint get_layout:
"Get the layout of each sample."
"Get the library type of each sample (paired or single-end)."
output:
outdir=directory(os.path.join(config["outdir"], "get_layout", "{sample}")),
params:
Expand Down
2 changes: 1 addition & 1 deletion workflow/rules/sra_paired_end.smk
Original file line number Diff line number Diff line change
Expand Up @@ -85,7 +85,7 @@ rule compress_fastq_pe:


rule process_fastq_pe:
"Aggregate names of samples"
"Keep the fastq.gz file paths in a table"
input:
file1=os.path.join(
config["outdir"], "compress", "{sample}", "{sample}_1.fastq.gz"
Expand Down
2 changes: 1 addition & 1 deletion workflow/rules/sra_single_end.smk
Original file line number Diff line number Diff line change
Expand Up @@ -73,7 +73,7 @@ rule compress_fastq_se:


rule process_fastq_se:
"Compress fastq inplace with pigz at best (9) compression level."
"Keep the fastq.gz file path in a table"
input:
file=os.path.join(config["outdir"], "compress", "{sample}", "{sample}.fastq.gz"),
output:
Expand Down

0 comments on commit 5b87ca4

Please sign in to comment.