Skip to content

Commit

Permalink
feat: ci pipeline correction (#90)
Browse files Browse the repository at this point in the history
* fix: removed absolute paths for reference data

* feat: human reference accession number for CI config

* feat: configuration grouping for reference data

* feat: requiring a ref section in the configuration, with only species and accession number

* fix: allowing for additional dots in sample schema to allow for the test data to be worked with

* fix: input directory for test workflow

* fix: typo fna -> fa suffix

* fix: corrected data paths

* fix: changed min_length for CI to 10 and alpha > 1

* fix: more and diffently attributed sample for CI test

* fix: removed outdated commented about p-value correction

* feat: update to qualimap wrapper to v4.4.0

* feat: de analysis does no longer fail, when condition a and b are confused. Also: no volcano for non-significant data

* fix: attempt to run with different input path

* fix: pointing test to workflow profile

* feat: added additional debugging infos for non exsisting files

* feat: added additional debugging infos for non exsisting files - now with format string

* feat: added pwd to debug information

* fix: relative directory in test one level up

* fix: formatting

* fix: removed suffix from sample config

* fix: print existence of input dir

* fix: added directory listing for current dir

* fix: gnarf - wrong namespace ...

* gnarf

* fix: trying with submodule checkout

* fix: fixed typo, bumed checkout versions to 3

* only testing

* again only test

* again full test

* fix: hopefully working sample inference

* fix: added presumably missing submodule

* fix: same correction as before

* fix: same correction as before - now reallywq

* fix: corrected introduced missing return
  • Loading branch information
cmeesters authored Sep 17, 2024
1 parent 04fcfcf commit c41657a
Show file tree
Hide file tree
Showing 10 changed files with 81 additions and 61 deletions.
18 changes: 14 additions & 4 deletions .github/workflows/main.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ jobs:
Formatting:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- name: Formatting
uses: github/super-linter@v4
env:
Expand All @@ -23,7 +23,13 @@ jobs:
Linting:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2

- name: Checkout repository
uses: actions/checkout@v3
with:
submodules: recursive

- uses: actions/checkout@v3
- name: Lint workflow
uses: snakemake/snakemake-github-action@v1.24.0
with:
Expand All @@ -38,14 +44,18 @@ jobs:
- Linting
- Formatting
steps:
- uses: actions/checkout@v2
- name: Checkout repository
uses: actions/checkout@v3
with:
submodules: recursive
- uses: actions/checkout@v3

- name: Test workflow
uses: snakemake/snakemake-github-action@v1.24.0
with:
directory: .test
snakefile: workflow/Snakefile
args: "--configfile .test/config-simple/config.yml --use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache --all-temp --workflow-profile .test/profile/"
args: "--configfile .test/config-simple/config.yml --workflow-profile .test/profile --use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache --all-temp"

# - name: Test report
# uses: snakemake/snakemake-github-action@v1.24.0
Expand Down
20 changes: 11 additions & 9 deletions .test/config-simple/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -7,22 +7,21 @@ samples: samples.csv
workflow: "workflow-transcriptome-de_phe"

# this is the input directory. All samples are looked for in this directory
inputdir: "."
inputdir: "./ngs-test-data/reads"
# Repository URL:
repo: "https://github.com/snakemake-workflows/transriptome-differential-expression"

## Workflow-specific Parameters:

# NCBI accession number
accession: "GCA_917627325.4"
# Genome fasta (absolute path)
genome: "/lustre/miifs01/project/m2_zdvhpc/transcriptome_data/GCA_917627325.4_PGI_CHIRRI_v4_genomic.fa"
# Annotation GFF/GTF (absolute path)
annotation: "/lustre/miifs01/project/m2_zdvhpc/transcriptome_data/GCA_917627325.4_PGI_CHIRRI_v4_genomic.gff"
# these samples ought to contain all samples comprising of the
ref:
# species
species: "Homo sapiens"

# NCBI accession number of the reference data set.
accession: "GCF_000001405.40"

# Minimum read length, put 0 if you want to proceed with all reads.
min_length: 200
min_length: 10

# Maximum number of CPUs in your partition/PC/server
max_cpus: 40
Expand Down Expand Up @@ -91,6 +90,9 @@ point_width: 20
#
mincount: 10
#
# Type I error cutoff value
alpha: 10
#
# in addition to the full heatmap, plot the top number of different
# values, ranked by the top ratio between the two traits
threshold_plot: 10
Expand Down
6 changes: 4 additions & 2 deletions .test/config-simple/samples.csv
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
sample condition condition2 batch platform purity
../ngs-test-data/reads/a.chr21.1.fq treated condition2 batch1 NANOPORE 1
../ngs-test-data/reads/b.chr21.1.fq untreated condition2 batch1 NANOPORE 1
a.chr21.1 treated condition2 batch1 NANOPORE 1
a.chr21.2 untreated condition2 batch1 NANOPORE 1
b.chr21.1 treated condition2 batch1 NANOPORE 1
b.chr21.2 untreated condition2 batch1 NANOPORE 1
14 changes: 4 additions & 10 deletions config/Mainz-MogonNHR/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -13,21 +13,17 @@ repo: "https://github.com/snakemake-workflows/transriptome-differential-expressi

## Workflow-specific Parameters:

# NCBI accession number
accession: "GCA_917627325.4"
# Genome fasta (absolute path)
genome: "/lustre/miifs01/project/nhr-zdvhpc/transcriptome_data/GCA_917627325.4_PGI_CHIRRI_v4_genomic.fa"
# Annotation GFF/GTF (absolute path)
annotation: "/lustre/miifs01/project/nhr-zdvhpc/transcriptome_data/GCA_917627325.4_PGI_CHIRRI_v4_genomic.gff"
# these samples ought to contain all samples comprising of the
ref:
species: "Chironomus riparius"
# NCBI accession number of the reference data set
accession: "GCA_917627325.4"

# Minimum read length, put 0 if you want to proceed with all reads.
min_length: 200

# Maximum number of CPUs in your partition/PC/server
max_cpus: 40


# Minimap2 indexing options
minimap_index_opts: ""

Expand Down Expand Up @@ -95,8 +91,6 @@ mincount: 10
#
# Type I error cutoff value:
alpha: 0.05
# an adjustment for multiple testing will be performed using the Holm-Sidak
# method.
#
# in addition to the full heatmap, plot the top number of different
# values, ranked by the top ratio between the two traits
Expand Down
16 changes: 11 additions & 5 deletions workflow/rules/commons.smk
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
import os
from glob import glob
from pathlib import Path
import sys
from itertools import product
Expand Down Expand Up @@ -34,8 +35,11 @@ validate(samples, schema="../schemas/samples.schema.yaml")
def get_mapped_reads_input(sample):
path = Path(os.path.join(config["inputdir"], sample))
for extension in exts:
if os.path.exists(path.with_suffix(extension)):
return path.with_suffix(extension)
# we need to append the extension with +, because
# path.with_suffix might consider everything after a . in
# the file name a suffix!
if os.path.exists(str(path) + extension):
return str(path) + extension

raise WorkflowError(
f"No valid sample found for sample: '{sample}' with possible extension '{exts}'"
Expand All @@ -47,9 +51,11 @@ def aggregate_input(samples):
valids = list()
for sample, ext in product(samples, exts):
path = Path(os.path.join(config["inputdir"], sample))

if os.path.exists(path.with_suffix(ext)):
valids.append(path.with_suffix(ext))
# we need to append the extension with +, because
# path.with_suffix might consider everything after a . in
# the file name a suffix!
if os.path.exists(str(path) + ext):
valids.append(str(path) + ext)

if not len(valids):
raise WorkflowError(f"no valid samples found, allowed extensions are: '{exts}'")
Expand Down
2 changes: 1 addition & 1 deletion workflow/rules/qc.smk
Original file line number Diff line number Diff line change
Expand Up @@ -90,7 +90,7 @@ rule map_qc:
log:
"logs/qualimap/{sample}.log",
wrapper:
"v3.13.4/bio/qualimap/bamqc"
"v4.4.0/bio/qualimap/bamqc"


rule compress_map_qc:
Expand Down
8 changes: 4 additions & 4 deletions workflow/rules/ref.smk
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ rule get_genome:
# generic name:
temp("ncbi_dataset.zip"),
params:
accession=config["accession"],
accession=config["ref"]["accession"],
log:
"logs/refs/get_genome.log",
conda:
Expand All @@ -24,11 +24,11 @@ rule extract_genome:
input:
rules.get_genome.output,
output:
"references/genomic.fna",
"references/genomic.fa",
group:
"reference"
params:
accession=config["accession"],
accession=config["ref"]["accession"],
log:
"logs/refs/extract_genome.log",
conda:
Expand All @@ -47,7 +47,7 @@ rule extract_annotation:
group:
"reference"
params:
accession=config["accession"],
accession=config["ref"]["accession"],
log:
"logs/refs/get_annotation.log",
conda:
Expand Down
10 changes: 2 additions & 8 deletions workflow/schemas/config.schema.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -73,19 +73,13 @@ properties:
ref:
type: object
properties:
n_chromosomes:
type: integer
species:
type: string
release:
type: integer
build:
accession:
type: string
required:
- species
- release
- build
- n_chromosomes
- accession

primers:
type: object
Expand Down
2 changes: 1 addition & 1 deletion workflow/schemas/samples.schema.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ properties:
sample:
type: string
description: sample name/identifier (alphanumeric string, that may additionally contain '_' and '-')
pattern: "^[a-zA-Z_0-9-]+$"
pattern: "^[a-zA-Z_0-9-.]+$"
group:
type: string
description: group of samples called jointly (alphanumeric string, that may additionally contain '_' and '-')
Expand Down
46 changes: 29 additions & 17 deletions workflow/scripts/de_analysis.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
import os
import sys
import pickle as pkl
from pathlib import Path

import matplotlib

Expand Down Expand Up @@ -84,8 +85,13 @@
)
else:
summary = stat_res.summary()
# performing LFC shrinkage
stat_res.lfc_shrink(coeff=f"condition_{b_condition}_vs_{a_condition}")
# performing LFC shrinkage - we try both combination, because, we
# have no foreknowledge of which conditions comes first
try:
stat_res.lfc_shrink(coeff=f"condition_{a_condition}_vs_{b_condition}")
except KeyError:
stat_res.lfc_shrink(coeff=f"condition_{b_condition}_vs_{a_condition}")


stat_res.results_df.to_csv(snakemake.output.lfc_analysis)

Expand Down Expand Up @@ -164,18 +170,24 @@
)
plt.savefig(snakemake.output.de_top_heatmap)

visuz.GeneExpression.volcano(
df=stat_res.results_df.fillna(1),
lfc="log2FoldChange",
pv="padj",
lfc_thr=(snakemake.config["lfc_null"], snakemake.config["lfc_null"]),
pv_thr=(snakemake.config["alpha"], snakemake.config["alpha"]),
sign_line=True,
gstyle=2,
show=False,
plotlegend=True,
legendpos="upper right",
legendanchor=(1.46, 1),
figtype="svg",
)
os.rename("volcano.svg", snakemake.output.volcano_plot)
# our test case has no significant values
# in our CI test, we have no significant data, hence:
if snakemake.config["alpha"] < 0.9:
visuz.GeneExpression.volcano(
df=stat_res.results_df.fillna(1),
lfc="log2FoldChange",
pv="padj",
lfc_thr=(snakemake.config["lfc_null"], snakemake.config["lfc_null"]),
pv_thr=(snakemake.config["alpha"], snakemake.config["alpha"]),
sign_line=True,
gstyle=2,
show=False,
plotlegend=True,
legendpos="upper right",
legendanchor=(1.46, 1),
figtype="svg",
)
os.rename("volcano.svg", snakemake.output.volcano_plot)
else:
Path(snakemake.output.volcano_plot).touch()

0 comments on commit c41657a

Please sign in to comment.