Improvements for translational efficiency analysis with anota2seq #90

Open
FelixKrueger opened this issue Jan 30, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@FelixKrueger
Contributor

Description of feature

Background: We used a public dataset with matched Ribo-seq and RNA-seq and tried to get the anota2seq step to work. [by we I mostly mean my colleague @naiarabediaga]

As the workflow failed via the ribo-seq pipeline, we tried to get it to work by downloading all relevant files and running the script locally. As the contrast file, we used:

id,variable,reference,target,batch,pair
KI_LIF_vs_WT,treatment,WT_LIF,KI_LIF,,pair

Script: anota2seqrun.r

We noticed that several options are missing in the opt list (lines 111-140), and some of them are crucial for the pipeline to run correctly. I assume some of these can be passed via the extra_anota2seq_run_args parameter, but it's currently a little obscure. Maybe important ones could be exposed and/or get mentioned more explicitly?

Here are the missing options Naiara identified, which have prevented the pipeline from running smoothly:

  1. opt$gene_id_col: This option defines the row.names when creating the count.table in line 198. While I assume this option may have been carried over from previous scripts, I don't see it. As a temporary workaround, I manually defined row.names as "gene_id", but it would be better if this option were properly included.
count.table <-
  read_delim_flexible(
    file = "salmon.merged.gene_counts_scaled.tsv",
    header = TRUE,
    row.names = "gene_id",
    check.names = FALSE
  )
  2. opt$samples_pairing_col: While this doesn't seem to be a major issue (because, in its absence, the script uses the order in the sample sheet), it would still be useful to have the option to explicitly specify the column for sample pairing. If this parameter were added to the opt list, the script would be able to take sample pairing into account during processing (line 300), which would improve the analysis pipeline.

  3. opt$subset_to_contrast_samples: This is a real issue. This variable is set to FALSE by default. If you don't subset both the counts table and the sample sheet to include only the samples involved in the contrast, the pipeline crashes. The subsetting is already there (lines 264-269), but since the variable is set to FALSE, the conditional block is never executed. Because we can currently only run a single contrast but will likely have merged Salmon matrices, the pipeline crashes whenever the run contains additional samples. Exposing this value as a boolean switch (maybe with the default being TRUE) should solve this issue.

  4. opt$exclude_samples_col: This option is meant to remove samples with specified values in a given field (largely complementary to the outcome of 3). The comments say "probably don't use this (4.) as well as the above (3.)". Exposing this variable would allow excluding samples, e.g. if QC steps indicate failure.

  5. opt$samples_batch_col: This option is set to NULL, and thus seems to be missing from the script. Like opt$samples_pairing_col, it hasn't caused the script to crash, but it would be beneficial to have the ability to define a batch column for batch effect correction or other related purposes. The conditional for this option is mentioned in lines 321-323.
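
For illustration, the behaviour described for opt$gene_id_col and opt$subset_to_contrast_samples can be sketched roughly as below. This is a minimal, self-contained example with toy data, not the actual code from anota2seqrun.r: the opt entries other than the two named above, the sample sheet columns (sample, treatment), and the extra WT_2i condition are assumptions made up for the sketch.

```r
# Hypothetical option list; only gene_id_col and subset_to_contrast_samples
# are named in this issue, the remaining entries are assumptions.
opt <- list(
  gene_id_col = "gene_id",
  sample_id_col = "sample",
  contrast_variable = "treatment",
  reference_level = "WT_LIF",
  target_level = "KI_LIF",
  subset_to_contrast_samples = TRUE
)

# Toy sample sheet: four samples belong to the contrast, two do not.
sample.sheet <- data.frame(
  sample = c("s1", "s2", "s3", "s4", "s5", "s6"),
  treatment = c("WT_LIF", "WT_LIF", "KI_LIF", "KI_LIF", "WT_2i", "WT_2i"),
  stringsAsFactors = FALSE
)

# Toy merged count matrix with a gene identifier column.
counts.raw <- data.frame(
  gene_id = c("geneA", "geneB", "geneC"),
  s1 = c(10, 0, 5), s2 = c(12, 1, 4), s3 = c(30, 2, 6),
  s4 = c(28, 0, 7), s5 = c(9, 3, 5), s6 = c(11, 2, 4),
  stringsAsFactors = FALSE
)

# Use the configured column as row names instead of a hard-coded "gene_id".
count.table <- counts.raw[, setdiff(colnames(counts.raw), opt$gene_id_col), drop = FALSE]
rownames(count.table) <- counts.raw[[opt$gene_id_col]]

# Restrict both tables to the samples that take part in the contrast.
if (opt$subset_to_contrast_samples) {
  contrast.levels <- c(opt$reference_level, opt$target_level)
  keep <- sample.sheet[[opt$contrast_variable]] %in% contrast.levels
  sample.sheet <- sample.sheet[keep, , drop = FALSE]

  # Fail early if a sample from the sheet is missing from the counts.
  contrast.samples <- sample.sheet[[opt$sample_id_col]]
  stopifnot(all(contrast.samples %in% colnames(count.table)))
  count.table <- count.table[, contrast.samples, drop = FALSE]
}
```

With the switch left at FALSE, s5 and s6 would stay in both tables, which is the situation that triggers the crash described in point 3.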

Most of these changes don't appear to be major, but they currently prevent the ANOTA2SEQ process from completing successfully with real-world data.

Again, many thanks!

@FelixKrueger FelixKrueger added the enhancement New feature or request label Jan 30, 2025
@pinin4fjords
Member

pinin4fjords commented Jan 30, 2025

> We noticed that several options are missing in the opt list (lines 111-140), and some of them are crucial for the pipeline to run correctly. I assume some of these can be passed via the extra_anota2seq_run_args parameter, but it's currently a little obscure. Maybe important ones could be exposed and/or get mentioned more explicitly?

In general, the module was written with structure copied over from other related modules. Not all options were 'plumbed in' to the workflow during the initial development.

> Here are the missing options Naiara identified, which have prevented the pipeline from running smoothly:
>
>   1. opt$gene_id_col: This option defines the row.names when creating the count.table in line 198. While I assume this option may have been carried over from previous scripts, I don't see it. As a temporary workaround, I manually defined row.names as "gene_id", but it would be better if this option were properly included.
>
> count.table <-
>   read_delim_flexible(
>     file = "salmon.merged.gene_counts_scaled.tsv",
>     header = TRUE,
>     row.names = "gene_id",
>     check.names = FALSE
>   )

Maybe file a separate feature request for this to increase workflow flexibility; it would be a useful thing for someone to pick up in a future release.

>   2. opt$samples_pairing_col: While this doesn't seem to be a major issue (because, in its absence, the script uses the order in the sample sheet), it would still be useful to have the option to explicitly specify the column for sample pairing. If this parameter were added to the opt list, the script would be able to take sample pairing into account during processing (line 300), which would improve the analysis pipeline.

This is (or should be) passed through from the pair column in the contrast file, as per the documentation. Please file a separate bug if that's not happening.

>   3. opt$subset_to_contrast_samples: This is a real issue. This variable is set to FALSE by default. If you don't subset both the counts table and the sample sheet to include only the samples involved in the contrast, the pipeline crashes. The subsetting is already there (lines 264-269), but since the variable is set to FALSE, the conditional block is never executed. Because we can currently only run a single contrast but will likely have merged Salmon matrices, the pipeline crashes whenever the run contains additional samples. Exposing this value as a boolean switch (maybe with the default being TRUE) should solve this issue.

Since you say the pipeline crashes without this set to TRUE, let's just hard-code TRUE from the modules.conf. See #91.

>   4. opt$exclude_samples_col: This option is meant to remove samples with specified values in a given field (largely complementary to the outcome of 3). The comments say "probably don't use this (4.) as well as the above (3.)". Exposing this variable would allow excluding samples, e.g. if QC steps indicate failure.

Maybe file a separate feature request for this as well; it would be a useful thing for someone to pick up in a future release.

>   5. opt$samples_batch_col: This option is set to NULL, and thus seems to be missing from the script. Like opt$samples_pairing_col, it hasn't caused the script to crash, but it would be beneficial to have the ability to define a batch column for batch effect correction or other related purposes. The conditional for this option is mentioned in lines 321-323.

As with the pair column, this is (or should be) passed through from the batch column in the contrast file, as per the documentation. Please file a separate bug if that's not happening.

> Most of these changes don't appear to be major, but they currently prevent the ANOTA2SEQ process from completing successfully with real-world data.
>
> Again, many thanks!
