feat: local refs #100

yeising · 2024-10-25T10:04:29Z

Enables users to input their own reference data instead of providing an accession number, which will then be used to download reference data.

Summary by CodeRabbit

New Features
- Added genome and annotation file parameters to configuration files for enhanced transcriptome analysis.
- Introduced a new function to validate and retrieve reference file paths.
Improvements
- Enhanced clarity and error handling in workflow rules related to genome and annotation file extraction.
- Updated documentation to clarify usage of local and remote reference files.
- Updated the accession parameter handling to improve user experience.

These changes improve the usability and reliability of the transcriptome analysis workflow.

coderabbitai · 2024-10-25T10:04:37Z

Warning

Rate limit exceeded

@cmeesters has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 8 minutes and 8 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Files that changed from the base of the PR and between 383f475 and faefa48.

Walkthrough

The pull request introduces updates to several configuration and workflow files related to transcriptome analysis. New parameters for genome and annotation file paths are added in two configuration files, enhancing clarity and usability. Additionally, a new function for validating reference files is introduced in the workflow, alongside modifications to existing rules to utilize this function, improving error handling and specificity in file paths.

Changes

File	Change Summary
`.test/config-simple/config.yml`	Added `genome` and `annotation` parameters under `ref` section, specifying paths to genome and annotation files.
`config/Mainz-MogonNHR/config.yml`	Added `genome` and `annotation` parameters under `ref`, specifying paths and allowed file extensions for reference files.
`workflow/rules/commons.smk`	Introduced `get_reference_files` function for validating and retrieving reference file paths from the configuration.
`workflow/rules/ref.smk`	Updated output path in `get_references` rule and modified input sources in `get_genome` and `get_annotation` rules to use the new function for file retrieval.

Possibly related PRs

feat: qm report #95: The changes in .test/config-simple/config.yml regarding the addition of new parameters for genome and annotation files are related to the modifications in the same file in PR feat: qm report #95, which also involves updates to configuration parameters relevant to transcriptome workflows.

Suggested reviewers

yeising

Poem

In the meadow, where data flows,
New paths for genomes and annotations grow.
With files in order, the workflow sings,
Error checks and clarity, oh what joy it brings!
Hop along, let’s analyze with glee,
For every rabbit knows, it’s data harmony! 🐇✨

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (6)

workflow/rules/ref.smk (2)

10-10: Consider making the get_genome rule conditional.

While organizing downloads in a "references" directory is good, this rule should only run when using NCBI accessions, not when local references are provided.

Consider using a conditional workflow or making this rule optional when local references are specified in the config. This would prevent unnecessary downloads when local files are used.

Line range hint 1-61: Consider separating local and NCBI reference handling into distinct rules.

The current implementation combines local and NCBI reference handling within the same rules, which could make debugging and maintenance more challenging. Consider restructuring the workflow to have separate rules for each source type:

Create distinct rules for local references:

use_local_genome

use_local_annotation

Create distinct rules for NCBI references:

download_ncbi_genome

download_ncbi_annotation

Use a rule selection mechanism based on config settings

This separation would make the workflow more maintainable and easier to debug.

Would you like me to provide a detailed example of this alternative structure?
.test/config-simple/config.yml (1)
19-22: Fix trailing whitespace and consider documenting supported file formats.

The new parameters are well-structured, but there's a minor formatting issue.

Apply this diff to remove trailing whitespace:
-    annotation: "./ngs-test-data/ref/annotation.chr21.gtf"    
+    annotation: "./ngs-test-data/ref/annotation.chr21.gtf"
Consider expanding the comments to list all supported file formats:
-    # Genome FA or FASTA or FNA
+    # Genome file path (supported formats: .fa, .fasta, .fna)
-    # Annotation GFF or GTF
+    # Annotation file path (supported formats: .gff, .gtf)
🧰 Tools

🪛 yamllint

[error] 22-22: trailing spaces

(trailing-spaces)
config/Mainz-MogonNHR/config.yml (2)

27-28: Remove extra blank line.

There are too many consecutive blank lines (3 > 2).

Remove one blank line to comply with YAML formatting guidelines.

🧰 Tools

🪛 yamllint

[warning] 28-28: too many blank lines

(3 > 2) (empty-lines)

17-23: Consider adding path validation in the workflow.

To ensure robustness, consider adding validation for the reference file paths:

Verify file existence

Validate file extensions match the allowed patterns

Check file format integrity (e.g., valid FASTA/GFF format)

Would you like me to provide an example implementation of the validation logic?

🧰 Tools

🪛 yamllint

[error] 23-23: trailing spaces

(trailing-spaces)
workflow/rules/commons.smk (1)
63-68: Enhance error messages with specific requirements.

The error messages could be more helpful by including the supported file formats and providing guidance on how to fix the issue.

Consider updating the error messages:
     if genome or annotation:
         raise ValueError(
-            "Only one reference file provided, provide either both genome and annotation or an NCBI accession number."
+            f"Only one reference file provided. You must provide either:\n"
+            f"1. Both genome (.fa, .fna, .fasta) and annotation (.gtf, .gff) files, or\n"
+            f"2. An NCBI accession number with optional local files"
         )
-    raise ValueError("No valid reference files or accession number provided.")
+    raise ValueError(
+        "No valid reference files or accession number provided. Please check:\n"
+        "1. Genome file exists and has extension: .fa, .fna, or .fasta\n"
+        "2. Annotation file exists and has extension: .gtf or .gff\n"
+        "3. NCBI accession number is provided in the config"
+    )

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e9f1495 and dea4c71.

📒 Files selected for processing (4)

.test/config-simple/config.yml (1 hunks)
config/Mainz-MogonNHR/config.yml (1 hunks)
workflow/rules/commons.smk (1 hunks)
workflow/rules/ref.smk (4 hunks)

🧰 Additional context used

📓 Learnings (1)

config/Mainz-MogonNHR/config.yml (1)

Learnt from: cmeesters
PR: snakemake-workflows/transcriptome-differential-expression#96
File: config/Mainz-MogonNHR/config.yml:69-69
Timestamp: 2024-10-11T15:29:48.149Z
Learning: Changes to DESeq2 parameters in `config/Mainz-MogonNHR/config.yml` might be addressed and documented in separate commits or PRs.

🪛 yamllint

.test/config-simple/config.yml

[error] 22-22: trailing spaces

(trailing-spaces)

config/Mainz-MogonNHR/config.yml

[error] 23-23: trailing spaces

(trailing-spaces)

[warning] 28-28: too many blank lines

(3 > 2) (empty-lines)

🪛 Gitleaks

config/Mainz-MogonNHR/config.yml

25-25: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🔇 Additional comments (4)

workflow/rules/ref.smk (1)
25-27: 🛠️ Refactor suggestion

Enhance input validation and error handling.

While the changes support both local and downloaded files, there are several areas for improvement:

Add validation for the input genome file format (should be FASTA)

Improve error handling in the shell command

Consider moving the default path to config

Here's a suggested improvement for the shell command:
-[ -f "{input}" ] && mv {input} {output} || unzip -p {input} ncbi_dataset/data/{params.accession}/*.fna > {output} 2>> {log}
+if [ -f "{input}" ]; then
+    # Validate FASTA format
+    if head -n 1 "{input}" | grep -q '^>'; then
+        mv "{input}" "{output}"
+    else
+        echo "Error: Input file is not in FASTA format" >&2
+        exit 1
+    fi
+else
+    if ! unzip -p "{input}" ncbi_dataset/data/{params.accession}/*.fna > "{output}" 2>> "{log}"; then
+        echo "Error: Failed to extract genome from zip file" >&2
+        exit 1
+    fi
+fi
Also applies to: 40-40
.test/config-simple/config.yml (1)

19-22: Clarify precedence between local files and accession number.

The configuration now supports both local reference files and NCBI accession numbers, but it's unclear which takes precedence if both are provided. This could lead to confusion or unexpected behavior.

Let's check if this is handled in the workflow rules:

Consider:

Making these parameters mutually exclusive in the schema

Adding validation to ensure only one method is used

Documenting the precedence rules in the configuration comments

🧰 Tools

🪛 yamllint

[error] 22-22: trailing spaces

(trailing-spaces)

config/Mainz-MogonNHR/config.yml (1)

17-17: LGTM! Clear documentation of reference data options.

The comment effectively communicates the two available options for providing reference data.
workflow/rules/commons.smk (1)
53-61: 🛠️ Refactor suggestion

Verify accession number format when only one reference file is provided.

When operating with a mix of local files and accession numbers, it would be beneficial to validate the accession number format to catch potential issues early.

Let's check if there are any existing accession number validations in the codebase:

Consider adding accession validation:
     if accession:
+        # Validate accession number format (e.g., GCF_000001405.39)
+        if not re.match(r'^GCF_\d{9}\.\d+$', accession):
+            raise ValueError(f"Invalid accession number format: {accession}")
         if genome:
             return {"genome": genome}
         if annotation:
             return {"annotation": annotation}
         return {}

workflow/rules/ref.smk

.test/config-simple/config.yml

config/Mainz-MogonNHR/config.yml

workflow/rules/commons.smk

cmeesters

ah, sorry, the review must be finished to be visible!

config/Mainz-MogonNHR/config.yml

workflow/rules/commons.smk

workflow/rules/ref.smk

config/Mainz-MogonNHR/config.yml

.test/config-simple/config.yml

config/Mainz-MogonNHR/config.yml

… clarity

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (4)

workflow/rules/ref.smk (2)
Line range hint 1-4: Update localrules declaration and ensure consistency.

The localrules declaration contains outdated rule names:

extract_annotation → get_annotation

extract_genome → get_genome

Also, consider adding get_references to the localrules if it doesn't require computational resources.

Apply this fix:
localrules:
    get_genome,
-    extract_annotation,
-    extract_genome,
+    get_annotation,
+    get_references,
Missing directory creation for references path

The verification confirms that:

No explicit directory creation for "references/" exists in the workflow

While Snakemake's directory() function is used in other rules (qc.smk), it's not used for the references directory

The references directory is used as input/output in multiple rules (datamod.smk)

To prevent potential failures, add either:

A directory("references") output in an appropriate rule

Or use shell: "mkdir -p references" before the dataset download

🔗 Analysis chain

Line range hint 7-19: Consider directory creation and local reference handling.

While the rule handles NCBI dataset downloads, two concerns need attention:

The output directory "references/" might not exist, potentially causing failures

This rule seems unnecessary when using local references (as per PR objectives)

Let's verify if the directory is created elsewhere:

Consider:

Adding directory creation if not handled elsewhere

Making this rule conditional based on whether local references are used

The previous search might have been too specific. Let's broaden our search to understand:

If there are any directory creation commands

The main workflow files that might handle this

Other mentions of the references directory

Any shell scripts that might handle setup

This will help us determine if the directory creation is handled elsewhere or if it's indeed missing.

Let's examine if the references directory is handled in datamod.smk or if there are any Snakemake directory creation patterns. This will help us confirm if the directory creation is truly missing or if it's handled through Snakemake's built-in directory function or other rules.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Search for directory creation
rg -l 'mkdir.*references'
Length of output: 27

Script:
#!/bin/bash
# Let's try different approaches to find directory creation patterns

# Check for any mkdir commands in the workflow
rg -l 'mkdir'

# Look for directory creation in Snakefile or other workflow files
fd -e smk -e snakefile -i snakefile

# Check if references directory is mentioned in other rules
rg -l 'references/'

# Look for directory creation in shell commands or scripts
fd -e sh
Length of output: 132

Script:
#!/bin/bash
# Let's check the content of datamod.smk to see if it handles the references directory
cat workflow/rules/datamod.smk

# Also check if there are any directory creation patterns in Snakemake rules
rg -A 5 'directory\(' 

# Look for any shell commands that might create directories
rg -A 3 'shell:.*mkdir'
Length of output: 1221
config/Mainz-MogonNHR/config.yml (2)
16-19: Enhance documentation clarity for reference file requirements.

While the documentation explains the basic options, it would be helpful to clarify:

Whether both genome and annotation files are required when providing local files

The exact conditions when an accession number is required (e.g., "if either genome OR annotation is missing")
 # The reference genome respectively transcriptome, is defined here:
 # You can provide either 'genome' or 'annotation' reference files locally.
-# If both 'genome' and 'annotation' are available no 'accession' number is needed. 
-# In any other case an 'accession' number is needed to download missing reference data remotely.
+# If both 'genome' and 'annotation' are available locally, no 'accession' number is needed.
+# An 'accession' number is required if either the genome or annotation file is missing,
+# as it will be used to download the missing reference data remotely.
🧰 Tools

🪛 yamllint

[error] 18-18: trailing spaces

(trailing-spaces)

31-32: Enhance accession number documentation.

The comment could be more specific about the accession number format and provide an example.
-    # Remote reference data
-    # NCBI accession number of the reference data set; can be left empty if both reference files are available locally"
+    # Remote reference data
+    # NCBI GenBank assembly accession (e.g., GCA_XXXXXXXXX.X) for downloading missing reference files.
+    # Can be left empty if both genome and annotation files are provided locally.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between dea4c71 and 3d0abdf.

📒 Files selected for processing (2)

config/Mainz-MogonNHR/config.yml (1 hunks)
workflow/rules/ref.smk (3 hunks)

🧰 Additional context used

📓 Learnings (1)

config/Mainz-MogonNHR/config.yml (1)

Learnt from: cmeesters
PR: snakemake-workflows/transcriptome-differential-expression#96
File: config/Mainz-MogonNHR/config.yml:69-69
Timestamp: 2024-10-11T15:29:48.149Z
Learning: Changes to DESeq2 parameters in `config/Mainz-MogonNHR/config.yml` might be addressed and documented in separate commits or PRs.

🪛 Gitleaks

config/Mainz-MogonNHR/config.yml

33-33: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🪛 yamllint

config/Mainz-MogonNHR/config.yml

[error] 18-18: trailing spaces

(trailing-spaces)

[error] 22-22: trailing spaces

(trailing-spaces)

[error] 27-27: trailing spaces

(trailing-spaces)

[error] 30-30: trailing spaces

(trailing-spaces)

[warning] 36-36: too many blank lines

(3 > 2) (empty-lines)

🔇 Additional comments (2)

workflow/rules/ref.smk (1)

61-63: ⚠️ Potential issue

Previous review comments still apply; additional concerns noted.

The previous review comment about improving shell command reliability is still valid. Additionally:

The trailing semicolon is unnecessary

Error messages should be more descriptive

The command structure is inconsistent with the get_genome rule

Please address the previous review comments and maintain consistency with the get_genome rule implementation.
config/Mainz-MogonNHR/config.yml (1)
23-29: ⚠️ Potential issue

Improve configuration portability and documentation.

The absolute paths make the configuration less portable across different environments.

The comments about empty fields could be more specific.

There are trailing spaces that should be removed.

Apply these changes:
-    # Local reference data
-    # Genome file path; can be left empty if remote download is prefered.
-    # allowed extensions: .FA or .FNA or .FASTA
-    genome: "/lustre/project/nhr-zdvhpc/transcriptome_data/GCA_917627325.4_PGI_CHIRRI_v4_genomic.fa"
-    # Annotation file path; can be left empty if remote download is prefered. 
-    # allowed extensions: .GFF or .GTF
-    annotation: "/lustre/project/nhr-zdvhpc/transcriptome_data/GCA_917627325.4_PGI_CHIRRI_v4_genomic.gff"
+    # Local reference data
+    # Genome file path (can be empty if using accession for remote download)
+    # Allowed extensions: .FA, .FNA, or .FASTA
+    genome: "${REFERENCE_DATA}/GCA_917627325.4_PGI_CHIRRI_v4_genomic.fa"
+    # Annotation file path (can be empty if using accession for remote download)
+    # Allowed extensions: .GFF or .GTF
+    annotation: "${REFERENCE_DATA}/GCA_917627325.4_PGI_CHIRRI_v4_genomic.gff"
This addresses the same portability concerns raised in previous reviews.

🧰 Tools

🪛 yamllint

[error] 27-27: trailing spaces

(trailing-spaces)

workflow/rules/ref.smk

config/Mainz-MogonNHR/config.yml

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)

config/Mainz-MogonNHR/config.yml (2)
16-19: Enhance documentation clarity.

While the documentation clearly explains the relationship between local files and accession numbers, consider adding these clarifications:

Specify that both genome and annotation files must be provided together when using local files

Mention the supported file formats (.fa/.fna/.fasta for genome, .gff/.gtf for annotation) in this section
 # The reference genome respectively transcriptome, is defined here:
 # You can provide either 'genome' or 'annotation' reference files locally.
-# If both 'genome' and 'annotation' are available no 'accession' number is needed.
-# In any other case an 'accession' number is needed to download missing reference data remotely.
+# If both 'genome' (*.fa/*.fna/*.fasta) and 'annotation' (*.gff/*.gtf) files are provided locally,
+# no 'accession' number is needed. In any other case, an 'accession' number is required to
+# download missing reference data remotely.
31-33: Improve documentation for remote reference configuration.

The comment could be more specific about when the accession number is required.
 # Remote reference data
-# NCBI accession number of the reference data set; can be left empty if both reference files are available locally"
+# NCBI accession number of the reference data set (e.g., GCA_917627325.4).
+# Required only when either genome or annotation files are not provided locally.
+# Must be a valid NCBI GenBank/RefSeq assembly accession.
Note: The Gitleaks warning about the accession number being a potential API key can be safely ignored as this is a valid NCBI accession number format.

🧰 Tools

🪛 Gitleaks

33-33: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 3d0abdf and 6d223d7.

📒 Files selected for processing (1)

config/Mainz-MogonNHR/config.yml (1 hunks)

🧰 Additional context used

📓 Learnings (1)

config/Mainz-MogonNHR/config.yml (1)

Learnt from: cmeesters
PR: snakemake-workflows/transcriptome-differential-expression#96
File: config/Mainz-MogonNHR/config.yml:69-69
Timestamp: 2024-10-11T15:29:48.149Z
Learning: Changes to DESeq2 parameters in `config/Mainz-MogonNHR/config.yml` might be addressed and documented in separate commits or PRs.

🪛 Gitleaks

config/Mainz-MogonNHR/config.yml

33-33: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

config/Mainz-MogonNHR/config.yml

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)

workflow/rules/commons.smk (1)

36-38: Enhance function documentation for better maintainability.

The docstring should be expanded to include:

Parameter description and type
Return value structure
Possible exceptions and their conditions

 def get_reference_files(config):
     """
-    Get reference files from config file and validate them.
+    Validate and retrieve reference file paths from configuration.
+
+    Args:
+        config (dict): Configuration dictionary containing reference information
+            with optional keys: 'ref.genome', 'ref.annotation', 'ref.accession'
+
+    Returns:
+        dict: Dictionary containing validated paths with possible keys:
+            - {'genome': str, 'annotation': str} when both files are valid
+            - {'genome': str} or {'annotation': str} when one file is valid with accession
+            - {} when only accession is provided
+
+    Raises:
+        ValueError: If only one reference file is provided without accession,
+                   or if no valid reference files or accession are provided
     """

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 6d223d7 and 7e9b794.

📒 Files selected for processing (1)

workflow/rules/commons.smk (1 hunks)

🔇 Additional comments (1)

workflow/rules/commons.smk (1)
56-73: Verify the reference file handling logic.

The error handling and return logic look good, with clear messages and proper handling of different input combinations. Let's verify that the implementation aligns with the PR objectives.

✅ Verification successful

Reference file handling logic is correctly implemented and used consistently

The verification confirms that:

The get_reference_files function is properly integrated with the workflow rules in ref.smk

The function's return values are correctly used in both get_genome and get_annotation rules

The shell commands handle both local files and NCBI dataset cases appropriately

Error handling aligns with the workflow's requirements, throwing descriptive errors for invalid combinations

The implementation correctly supports:

Using local genome and annotation files when both are provided

Using NCBI accession when local files aren't available

Proper validation of file extensions and existence

Clear error messages for invalid combinations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that the reference file handling is used consistently across the workflow

# Test 1: Check for usage of get_reference_files
echo "Checking for get_reference_files usage:"
rg "get_reference_files" -l

# Test 2: Check for related rule definitions
echo -e "\nChecking for related rules:"
rg "rule (get_genome|get_annotation)" -A 5

# Test 3: Check for config schema validation
echo -e "\nChecking config schema for reference file definitions:"
cat ../schemas/config.schema.yaml | rg -A 10 "ref:"
Length of output: 1132

Script:
#!/bin/bash
# Let's verify the reference handling implementation more thoroughly

# Check the full implementation of get_reference_files function
echo "Checking get_reference_files implementation:"
rg "def get_reference_files" -A 20 workflow/rules/commons.smk

# Check how the function is used in ref.smk
echo -e "\nChecking ref.smk implementation:"
cat workflow/rules/ref.smk

# Check for any config validation or schema in the repository
echo -e "\nChecking for config schema files:"
fd "config.*\.ya?ml"

# Check for test files related to reference handling
echo -e "\nChecking for related test files:"
fd "test.*" | rg "ref|reference"
Length of output: 3199

workflow/rules/commons.smk

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)

.test/config-simple/config.yml (2)
19-20: LGTM! Consider enhancing the file extension documentation.

The comment about file extensions is helpful, but could be more explicit about case sensitivity and whether compressed formats are supported.

Consider updating the comment to:
-    # Genome FA or FASTA or FNA
+    # Reference genome file (supported extensions: .fa, .fasta, .fna, case-insensitive)
24-25: LGTM! Consider clarifying the accession usage.

The comment explains why the accession is empty in the test configuration. However, it would be helpful to document the relationship between accession and local files.

Consider updating the comments to:
-    # empty, because every real reference is too big.
-    accession: ""
+    # NCBI accession number (leave empty when using local genome/annotation files)
+    # Empty in test config as real reference datasets are too large
+    accession: ""

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 7e9b794 and 5b7000d.

📒 Files selected for processing (1)

.test/config-simple/config.yml (1 hunks)

🧰 Additional context used

🪛 yamllint

.test/config-simple/config.yml

[error] 22-22: trailing spaces

(trailing-spaces)

.test/config-simple/config.yml

…nnotation missing

coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)

.test/config-simple/config.yml (2)

19-20: Enhance file extension documentation for genome configuration.

The comment should explicitly list all supported extensions with their case sensitivity.

-    # Genome FA or FASTA or FNA
+    # Genome file (supported extensions: .fa, .fasta, .fna, case-insensitive)

24-25: Clarify the relationship between local files and accession.

The comment could better explain that the accession is optional when local genome and annotation files are provided.

-    # empty, because every real reference is too big.
+    # Optional: NCBI accession number. Not required when local genome and annotation files are provided.
+    # Set to None in test config as real reference data would be too large.

workflow/rules/commons.smk (1)

35-38: Enhance function documentation with detailed docstring.

The docstring should include parameters, return values, and exceptions for better maintainability.

 def get_reference_files(config):
     """
-    Get reference files from config file and validate them.
+    Validate and retrieve reference file paths from configuration.
+    
+    Args:
+        config (dict): Configuration dictionary containing reference paths and accession
+            
+    Returns:
+        dict: Dictionary containing validated paths for 'genome' and/or 'annotation'
+            
+    Raises:
+        ValueError: If only one reference file is provided or if no valid references exist
+        KeyError: If accessing non-existent config keys
     """

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 5b7000d and 383f475.

📒 Files selected for processing (2)

.test/config-simple/config.yml (1 hunks)
workflow/rules/commons.smk (1 hunks)

🧰 Additional context used

🪛 yamllint

.test/config-simple/config.yml

[error] 22-22: trailing spaces

(trailing-spaces)

🔇 Additional comments (2)

workflow/rules/commons.smk (2)

39-54: Past review comment is still applicable.

The validation logic issues highlighted in the previous review remain unaddressed.

59-65: Verify the partial return behavior with accession number.

The current logic allows returning partial reference data when an accession number is present. This might lead to unexpected behavior if the workflow expects both genome and annotation files.

✅ Verification successful

Let me gather more context about the workflow configuration and usage patterns.

Let me check one more aspect to understand the workflow's behavior with the reference files.

Based on the gathered context, I can now provide the final response:

The partial return behavior is intentional and correct.

The current implementation in get_reference_files is correct because:

The workflow supports both local and remote reference files, as evidenced by the config schema and usage patterns.
When an accession number is present, the workflow downloads missing files through the get_genome and get_annotation rules, which handle both local files and NCBI dataset downloads.
The partial returns are properly handled by the dependent rules, which use .get() with a fallback to "references/ncbi_dataset.zip" when a specific reference file is not provided.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usage of get_reference_files to verify how partial returns are handled
rg -A 5 "get_reference_files"

Length of output: 1019

Script:

#!/bin/bash
# Check workflow configuration patterns and defaults
rg -A 10 "config\[.ref" 

# Look for any validation or error handling related to reference files
rg -B 3 -A 7 "raise.*reference"

Length of output: 2667

Script:

#!/bin/bash
# Check the rule dependencies and workflow structure
rg -B 2 -A 5 "rule.*genome|rule.*annotation"

# Look for any config schema or validation
fd -e yaml -e yml | xargs cat

Length of output: 23285

workflow/rules/commons.smk

cmeesters

awesome, finally working

yeising added 5 commits October 25, 2024 11:16

feat: add flexibility for reference data

d187b5b

fix: added new flexibility to test data

4d308a8

doc: clarified config and helper function

81ba5aa

style: snakefmt

543c513

feat: accepts one reference file if accession number is provided

da79d7b

yeising added 3 commits October 25, 2024 12:05

fix: snakefmt

8cb7244

style: snakefmt

db78897

style: removed redundancy

dea4c71

yeising marked this pull request as ready for review October 25, 2024 10:35

yeising requested a review from cmeesters October 25, 2024 10:35

coderabbitai bot reviewed Oct 25, 2024

View reviewed changes

cmeesters reviewed Oct 25, 2024

View reviewed changes

yeising added 4 commits October 25, 2024 13:45

fix: amde shell cmd more robust, provided comments, renamed rules for…

1e9ed03

… clarity

fix: provided logs for first case

c3ad5b7

doc: clarified references

3d0abdf

fix: allowed suffixes

c3afbf5

coderabbitai bot reviewed Oct 25, 2024

View reviewed changes

workflow/rules/ref.smk Show resolved Hide resolved

config/Mainz-MogonNHR/config.yml Outdated Show resolved Hide resolved

style: formatting

6d223d7

coderabbitai bot reviewed Oct 25, 2024

View reviewed changes

config/Mainz-MogonNHR/config.yml Outdated Show resolved Hide resolved

fix: clearar more stable reference validation

7e9b794

coderabbitai bot reviewed Oct 28, 2024

View reviewed changes

workflow/rules/commons.smk Outdated Show resolved Hide resolved

fix: trying to avoid downloading references during the ci run

5b7000d

coderabbitai bot reviewed Oct 28, 2024

View reviewed changes

.test/config-simple/config.yml Show resolved Hide resolved

cmeesters added 2 commits October 29, 2024 10:31

feat: better differentiation within the error message for genome or a…

fa0dc99

…nnotation missing

feat: trying explicit none for accession number

383f475

coderabbitai bot reviewed Oct 29, 2024

View reviewed changes

workflow/rules/commons.smk Outdated Show resolved Hide resolved

cmeesters added 3 commits October 29, 2024 10:48

feat: clearer annotation

0e81aea

fix: refactoring raising value error - should work, now

3205b07

fix: format string for error message

d48deb8

fix: doubled code might have produced the error

faefa48

cmeesters approved these changes Oct 29, 2024

View reviewed changes

cmeesters merged commit 5fed515 into main Oct 29, 2024
6 checks passed

cmeesters deleted the feat-local-refs branch October 29, 2024 10:04

github-actions bot mentioned this pull request Oct 29, 2024

chore(main): release 1.5.0 #101

Closed

coderabbitai bot mentioned this pull request Oct 29, 2024

feat!: isoform analysis #99

Open

github-actions bot mentioned this pull request Nov 11, 2024

chore(main): release 1.5.0 #103

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: local refs #100

feat: local refs #100

yeising commented Oct 25, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 25, 2024 •

edited

Loading

Rate limit exceeded

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

cmeesters left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

cmeesters left a comment

feat: local refs #100

feat: local refs #100

Conversation

yeising commented Oct 25, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

coderabbitai bot commented Oct 25, 2024 • edited Loading

Rate limit exceeded

Walkthrough

Changes

Possibly related PRs

Suggested reviewers

Poem

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

cmeesters left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

cmeesters left a comment

Choose a reason for hiding this comment

yeising commented Oct 25, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 25, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)