Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: local refs #100

Merged
merged 21 commits into from
Oct 29, 2024
Merged

feat: local refs #100

merged 21 commits into from
Oct 29, 2024

Conversation

yeising
Copy link
Collaborator

@yeising yeising commented Oct 25, 2024

Enables users to input their own reference data instead of providing an accession number, which will then be used to download reference data.

Summary by CodeRabbit

  • New Features

    • Added genome and annotation file parameters to configuration files for enhanced transcriptome analysis.
    • Introduced a new function to validate and retrieve reference file paths.
  • Improvements

    • Enhanced clarity and error handling in workflow rules related to genome and annotation file extraction.
    • Updated documentation to clarify usage of local and remote reference files.
    • Updated the accession parameter handling to improve user experience.

These changes improve the usability and reliability of the transcriptome analysis workflow.

Copy link
Contributor

coderabbitai bot commented Oct 25, 2024

Warning

Rate limit exceeded

@cmeesters has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 8 minutes and 8 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Files that changed from the base of the PR and between 383f475 and faefa48.

Walkthrough

The pull request introduces updates to several configuration and workflow files related to transcriptome analysis. New parameters for genome and annotation file paths are added in two configuration files, enhancing clarity and usability. Additionally, a new function for validating reference files is introduced in the workflow, alongside modifications to existing rules to utilize this function, improving error handling and specificity in file paths.

Changes

File Change Summary
.test/config-simple/config.yml Added genome and annotation parameters under ref section, specifying paths to genome and annotation files.
config/Mainz-MogonNHR/config.yml Added genome and annotation parameters under ref, specifying paths and allowed file extensions for reference files.
workflow/rules/commons.smk Introduced get_reference_files function for validating and retrieving reference file paths from the configuration.
workflow/rules/ref.smk Updated output path in get_references rule and modified input sources in get_genome and get_annotation rules to use the new function for file retrieval.

Possibly related PRs

  • feat: qm report #95: The changes in .test/config-simple/config.yml regarding the addition of new parameters for genome and annotation files are related to the modifications in the same file in PR feat: qm report #95, which also involves updates to configuration parameters relevant to transcriptome workflows.

Suggested reviewers

  • yeising

Poem

In the meadow, where data flows,
New paths for genomes and annotations grow.
With files in order, the workflow sings,
Error checks and clarity, oh what joy it brings!
Hop along, let’s analyze with glee,
For every rabbit knows, it’s data harmony! 🐇✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@yeising yeising marked this pull request as ready for review October 25, 2024 10:35
@yeising yeising requested a review from cmeesters October 25, 2024 10:35
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (6)
workflow/rules/ref.smk (2)

10-10: Consider making the get_genome rule conditional.

While organizing downloads in a "references" directory is good, this rule should only run when using NCBI accessions, not when local references are provided.

Consider using a conditional workflow or making this rule optional when local references are specified in the config. This would prevent unnecessary downloads when local files are used.


Line range hint 1-61: Consider separating local and NCBI reference handling into distinct rules.

The current implementation combines local and NCBI reference handling within the same rules, which could make debugging and maintenance more challenging. Consider restructuring the workflow to have separate rules for each source type:

  1. Create distinct rules for local references:

    • use_local_genome
    • use_local_annotation
  2. Create distinct rules for NCBI references:

    • download_ncbi_genome
    • download_ncbi_annotation
  3. Use a rule selection mechanism based on config settings

This separation would make the workflow more maintainable and easier to debug.

Would you like me to provide a detailed example of this alternative structure?

.test/config-simple/config.yml (1)

19-22: Fix trailing whitespace and consider documenting supported file formats.

The new parameters are well-structured, but there's a minor formatting issue.

Apply this diff to remove trailing whitespace:

-    annotation: "./ngs-test-data/ref/annotation.chr21.gtf"    
+    annotation: "./ngs-test-data/ref/annotation.chr21.gtf"

Consider expanding the comments to list all supported file formats:

-    # Genome FA or FASTA or FNA
+    # Genome file path (supported formats: .fa, .fasta, .fna)
-    # Annotation GFF or GTF
+    # Annotation file path (supported formats: .gff, .gtf)
🧰 Tools
🪛 yamllint

[error] 22-22: trailing spaces

(trailing-spaces)

config/Mainz-MogonNHR/config.yml (2)

27-28: Remove extra blank line.

There are too many consecutive blank lines (3 > 2).

Remove one blank line to comply with YAML formatting guidelines.

🧰 Tools
🪛 yamllint

[warning] 28-28: too many blank lines

(3 > 2) (empty-lines)


17-23: Consider adding path validation in the workflow.

To ensure robustness, consider adding validation for the reference file paths:

  1. Verify file existence
  2. Validate file extensions match the allowed patterns
  3. Check file format integrity (e.g., valid FASTA/GFF format)

Would you like me to provide an example implementation of the validation logic?

🧰 Tools
🪛 yamllint

[error] 23-23: trailing spaces

(trailing-spaces)

workflow/rules/commons.smk (1)

63-68: Enhance error messages with specific requirements.

The error messages could be more helpful by including the supported file formats and providing guidance on how to fix the issue.

Consider updating the error messages:

     if genome or annotation:
         raise ValueError(
-            "Only one reference file provided, provide either both genome and annotation or an NCBI accession number."
+            f"Only one reference file provided. You must provide either:\n"
+            f"1. Both genome (.fa, .fna, .fasta) and annotation (.gtf, .gff) files, or\n"
+            f"2. An NCBI accession number with optional local files"
         )
-    raise ValueError("No valid reference files or accession number provided.")
+    raise ValueError(
+        "No valid reference files or accession number provided. Please check:\n"
+        "1. Genome file exists and has extension: .fa, .fna, or .fasta\n"
+        "2. Annotation file exists and has extension: .gtf or .gff\n"
+        "3. NCBI accession number is provided in the config"
+    )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e9f1495 and dea4c71.

📒 Files selected for processing (4)
  • .test/config-simple/config.yml (1 hunks)
  • config/Mainz-MogonNHR/config.yml (1 hunks)
  • workflow/rules/commons.smk (1 hunks)
  • workflow/rules/ref.smk (4 hunks)
🧰 Additional context used
📓 Learnings (1)
config/Mainz-MogonNHR/config.yml (1)
Learnt from: cmeesters
PR: snakemake-workflows/transcriptome-differential-expression#96
File: config/Mainz-MogonNHR/config.yml:69-69
Timestamp: 2024-10-11T15:29:48.149Z
Learning: Changes to DESeq2 parameters in `config/Mainz-MogonNHR/config.yml` might be addressed and documented in separate commits or PRs.
🪛 yamllint
.test/config-simple/config.yml

[error] 22-22: trailing spaces

(trailing-spaces)

config/Mainz-MogonNHR/config.yml

[error] 23-23: trailing spaces

(trailing-spaces)


[warning] 28-28: too many blank lines

(3 > 2) (empty-lines)

🪛 Gitleaks
config/Mainz-MogonNHR/config.yml

25-25: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🔇 Additional comments (4)
workflow/rules/ref.smk (1)

25-27: 🛠️ Refactor suggestion

Enhance input validation and error handling.

While the changes support both local and downloaded files, there are several areas for improvement:

  1. Add validation for the input genome file format (should be FASTA)
  2. Improve error handling in the shell command
  3. Consider moving the default path to config

Here's a suggested improvement for the shell command:

-[ -f "{input}" ] && mv {input} {output} || unzip -p {input} ncbi_dataset/data/{params.accession}/*.fna > {output} 2>> {log}
+if [ -f "{input}" ]; then
+    # Validate FASTA format
+    if head -n 1 "{input}" | grep -q '^>'; then
+        mv "{input}" "{output}"
+    else
+        echo "Error: Input file is not in FASTA format" >&2
+        exit 1
+    fi
+else
+    if ! unzip -p "{input}" ncbi_dataset/data/{params.accession}/*.fna > "{output}" 2>> "{log}"; then
+        echo "Error: Failed to extract genome from zip file" >&2
+        exit 1
+    fi
+fi

Also applies to: 40-40

.test/config-simple/config.yml (1)

19-22: Clarify precedence between local files and accession number.

The configuration now supports both local reference files and NCBI accession numbers, but it's unclear which takes precedence if both are provided. This could lead to confusion or unexpected behavior.

Let's check if this is handled in the workflow rules:

Consider:

  1. Making these parameters mutually exclusive in the schema
  2. Adding validation to ensure only one method is used
  3. Documenting the precedence rules in the configuration comments
🧰 Tools
🪛 yamllint

[error] 22-22: trailing spaces

(trailing-spaces)

config/Mainz-MogonNHR/config.yml (1)

17-17: LGTM! Clear documentation of reference data options.

The comment effectively communicates the two available options for providing reference data.

workflow/rules/commons.smk (1)

53-61: 🛠️ Refactor suggestion

Verify accession number format when only one reference file is provided.

When operating with a mix of local files and accession numbers, it would be beneficial to validate the accession number format to catch potential issues early.

Let's check if there are any existing accession number validations in the codebase:

Consider adding accession validation:

     if accession:
+        # Validate accession number format (e.g., GCF_000001405.39)
+        if not re.match(r'^GCF_\d{9}\.\d+$', accession):
+            raise ValueError(f"Invalid accession number format: {accession}")
         if genome:
             return {"genome": genome}
         if annotation:
             return {"annotation": annotation}
         return {}

Copy link
Collaborator

@cmeesters cmeesters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, sorry, the review must be finished to be visible!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (4)
workflow/rules/ref.smk (2)

Line range hint 1-4: Update localrules declaration and ensure consistency.

The localrules declaration contains outdated rule names:

  • extract_annotationget_annotation
  • extract_genomeget_genome

Also, consider adding get_references to the localrules if it doesn't require computational resources.

Apply this fix:

localrules:
    get_genome,
-    extract_annotation,
-    extract_genome,
+    get_annotation,
+    get_references,

Missing directory creation for references path

The verification confirms that:

  1. No explicit directory creation for "references/" exists in the workflow
  2. While Snakemake's directory() function is used in other rules (qc.smk), it's not used for the references directory
  3. The references directory is used as input/output in multiple rules (datamod.smk)

To prevent potential failures, add either:

  • A directory("references") output in an appropriate rule
  • Or use shell: "mkdir -p references" before the dataset download
🔗 Analysis chain

Line range hint 7-19: Consider directory creation and local reference handling.

While the rule handles NCBI dataset downloads, two concerns need attention:

  1. The output directory "references/" might not exist, potentially causing failures
  2. This rule seems unnecessary when using local references (as per PR objectives)

Let's verify if the directory is created elsewhere:

Consider:

  1. Adding directory creation if not handled elsewhere
  2. Making this rule conditional based on whether local references are used

The previous search might have been too specific. Let's broaden our search to understand:

  1. If there are any directory creation commands
  2. The main workflow files that might handle this
  3. Other mentions of the references directory
  4. Any shell scripts that might handle setup

This will help us determine if the directory creation is handled elsewhere or if it's indeed missing.


Let's examine if the references directory is handled in datamod.smk or if there are any Snakemake directory creation patterns. This will help us confirm if the directory creation is truly missing or if it's handled through Snakemake's built-in directory function or other rules.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for directory creation
rg -l 'mkdir.*references'

Length of output: 27


Script:

#!/bin/bash
# Let's try different approaches to find directory creation patterns

# Check for any mkdir commands in the workflow
rg -l 'mkdir'

# Look for directory creation in Snakefile or other workflow files
fd -e smk -e snakefile -i snakefile

# Check if references directory is mentioned in other rules
rg -l 'references/'

# Look for directory creation in shell commands or scripts
fd -e sh

Length of output: 132


Script:

#!/bin/bash
# Let's check the content of datamod.smk to see if it handles the references directory
cat workflow/rules/datamod.smk

# Also check if there are any directory creation patterns in Snakemake rules
rg -A 5 'directory\(' 

# Look for any shell commands that might create directories
rg -A 3 'shell:.*mkdir'

Length of output: 1221

config/Mainz-MogonNHR/config.yml (2)

16-19: Enhance documentation clarity for reference file requirements.

While the documentation explains the basic options, it would be helpful to clarify:

  1. Whether both genome and annotation files are required when providing local files
  2. The exact conditions when an accession number is required (e.g., "if either genome OR annotation is missing")
 # The reference genome respectively transcriptome, is defined here:
 # You can provide either 'genome' or 'annotation' reference files locally.
-# If both 'genome' and 'annotation' are available no 'accession' number is needed. 
-# In any other case an 'accession' number is needed to download missing reference data remotely.
+# If both 'genome' and 'annotation' are available locally, no 'accession' number is needed.
+# An 'accession' number is required if either the genome or annotation file is missing,
+# as it will be used to download the missing reference data remotely.
🧰 Tools
🪛 yamllint

[error] 18-18: trailing spaces

(trailing-spaces)


31-32: Enhance accession number documentation.

The comment could be more specific about the accession number format and provide an example.

-    # Remote reference data
-    # NCBI accession number of the reference data set; can be left empty if both reference files are available locally"
+    # Remote reference data
+    # NCBI GenBank assembly accession (e.g., GCA_XXXXXXXXX.X) for downloading missing reference files.
+    # Can be left empty if both genome and annotation files are provided locally.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between dea4c71 and 3d0abdf.

📒 Files selected for processing (2)
  • config/Mainz-MogonNHR/config.yml (1 hunks)
  • workflow/rules/ref.smk (3 hunks)
🧰 Additional context used
📓 Learnings (1)
config/Mainz-MogonNHR/config.yml (1)
Learnt from: cmeesters
PR: snakemake-workflows/transcriptome-differential-expression#96
File: config/Mainz-MogonNHR/config.yml:69-69
Timestamp: 2024-10-11T15:29:48.149Z
Learning: Changes to DESeq2 parameters in `config/Mainz-MogonNHR/config.yml` might be addressed and documented in separate commits or PRs.
🪛 Gitleaks
config/Mainz-MogonNHR/config.yml

33-33: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🪛 yamllint
config/Mainz-MogonNHR/config.yml

[error] 18-18: trailing spaces

(trailing-spaces)


[error] 22-22: trailing spaces

(trailing-spaces)


[error] 27-27: trailing spaces

(trailing-spaces)


[error] 30-30: trailing spaces

(trailing-spaces)


[warning] 36-36: too many blank lines

(3 > 2) (empty-lines)

🔇 Additional comments (2)
workflow/rules/ref.smk (1)

61-63: ⚠️ Potential issue

Previous review comments still apply; additional concerns noted.

The previous review comment about improving shell command reliability is still valid. Additionally:

  1. The trailing semicolon is unnecessary
  2. Error messages should be more descriptive
  3. The command structure is inconsistent with the get_genome rule

Please address the previous review comments and maintain consistency with the get_genome rule implementation.

config/Mainz-MogonNHR/config.yml (1)

23-29: ⚠️ Potential issue

Improve configuration portability and documentation.

  1. The absolute paths make the configuration less portable across different environments.
  2. The comments about empty fields could be more specific.
  3. There are trailing spaces that should be removed.

Apply these changes:

-    # Local reference data
-    # Genome file path; can be left empty if remote download is prefered.
-    # allowed extensions: .FA or .FNA or .FASTA
-    genome: "/lustre/project/nhr-zdvhpc/transcriptome_data/GCA_917627325.4_PGI_CHIRRI_v4_genomic.fa"
-    # Annotation file path; can be left empty if remote download is prefered. 
-    # allowed extensions: .GFF or .GTF
-    annotation: "/lustre/project/nhr-zdvhpc/transcriptome_data/GCA_917627325.4_PGI_CHIRRI_v4_genomic.gff"
+    # Local reference data
+    # Genome file path (can be empty if using accession for remote download)
+    # Allowed extensions: .FA, .FNA, or .FASTA
+    genome: "${REFERENCE_DATA}/GCA_917627325.4_PGI_CHIRRI_v4_genomic.fa"
+    # Annotation file path (can be empty if using accession for remote download)
+    # Allowed extensions: .GFF or .GTF
+    annotation: "${REFERENCE_DATA}/GCA_917627325.4_PGI_CHIRRI_v4_genomic.gff"

This addresses the same portability concerns raised in previous reviews.

🧰 Tools
🪛 yamllint

[error] 27-27: trailing spaces

(trailing-spaces)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
config/Mainz-MogonNHR/config.yml (2)

16-19: Enhance documentation clarity.

While the documentation clearly explains the relationship between local files and accession numbers, consider adding these clarifications:

  • Specify that both genome and annotation files must be provided together when using local files
  • Mention the supported file formats (.fa/.fna/.fasta for genome, .gff/.gtf for annotation) in this section
 # The reference genome respectively transcriptome, is defined here:
 # You can provide either 'genome' or 'annotation' reference files locally.
-# If both 'genome' and 'annotation' are available no 'accession' number is needed.
-# In any other case an 'accession' number is needed to download missing reference data remotely.
+# If both 'genome' (*.fa/*.fna/*.fasta) and 'annotation' (*.gff/*.gtf) files are provided locally,
+# no 'accession' number is needed. In any other case, an 'accession' number is required to
+# download missing reference data remotely.

31-33: Improve documentation for remote reference configuration.

The comment could be more specific about when the accession number is required.

 # Remote reference data
-# NCBI accession number of the reference data set; can be left empty if both reference files are available locally"
+# NCBI accession number of the reference data set (e.g., GCA_917627325.4).
+# Required only when either genome or annotation files are not provided locally.
+# Must be a valid NCBI GenBank/RefSeq assembly accession.

Note: The Gitleaks warning about the accession number being a potential API key can be safely ignored as this is a valid NCBI accession number format.

🧰 Tools
🪛 Gitleaks

33-33: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 3d0abdf and 6d223d7.

📒 Files selected for processing (1)
  • config/Mainz-MogonNHR/config.yml (1 hunks)
🧰 Additional context used
📓 Learnings (1)
config/Mainz-MogonNHR/config.yml (1)
Learnt from: cmeesters
PR: snakemake-workflows/transcriptome-differential-expression#96
File: config/Mainz-MogonNHR/config.yml:69-69
Timestamp: 2024-10-11T15:29:48.149Z
Learning: Changes to DESeq2 parameters in `config/Mainz-MogonNHR/config.yml` might be addressed and documented in separate commits or PRs.
🪛 Gitleaks
config/Mainz-MogonNHR/config.yml

33-33: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
workflow/rules/commons.smk (1)

36-38: Enhance function documentation for better maintainability.

The docstring should be expanded to include:

  • Parameter description and type
  • Return value structure
  • Possible exceptions and their conditions
 def get_reference_files(config):
     """
-    Get reference files from config file and validate them.
+    Validate and retrieve reference file paths from configuration.
+
+    Args:
+        config (dict): Configuration dictionary containing reference information
+            with optional keys: 'ref.genome', 'ref.annotation', 'ref.accession'
+
+    Returns:
+        dict: Dictionary containing validated paths with possible keys:
+            - {'genome': str, 'annotation': str} when both files are valid
+            - {'genome': str} or {'annotation': str} when one file is valid with accession
+            - {} when only accession is provided
+
+    Raises:
+        ValueError: If only one reference file is provided without accession,
+                   or if no valid reference files or accession are provided
     """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 6d223d7 and 7e9b794.

📒 Files selected for processing (1)
  • workflow/rules/commons.smk (1 hunks)
🔇 Additional comments (1)
workflow/rules/commons.smk (1)

56-73: Verify the reference file handling logic.

The error handling and return logic look good, with clear messages and proper handling of different input combinations. Let's verify that the implementation aligns with the PR objectives.

✅ Verification successful

Reference file handling logic is correctly implemented and used consistently

The verification confirms that:

  • The get_reference_files function is properly integrated with the workflow rules in ref.smk
  • The function's return values are correctly used in both get_genome and get_annotation rules
  • The shell commands handle both local files and NCBI dataset cases appropriately
  • Error handling aligns with the workflow's requirements, throwing descriptive errors for invalid combinations

The implementation correctly supports:

  • Using local genome and annotation files when both are provided
  • Using NCBI accession when local files aren't available
  • Proper validation of file extensions and existence
  • Clear error messages for invalid combinations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that the reference file handling is used consistently across the workflow

# Test 1: Check for usage of get_reference_files
echo "Checking for get_reference_files usage:"
rg "get_reference_files" -l

# Test 2: Check for related rule definitions
echo -e "\nChecking for related rules:"
rg "rule (get_genome|get_annotation)" -A 5

# Test 3: Check for config schema validation
echo -e "\nChecking config schema for reference file definitions:"
cat ../schemas/config.schema.yaml | rg -A 10 "ref:"

Length of output: 1132


Script:

#!/bin/bash
# Let's verify the reference handling implementation more thoroughly

# Check the full implementation of get_reference_files function
echo "Checking get_reference_files implementation:"
rg "def get_reference_files" -A 20 workflow/rules/commons.smk

# Check how the function is used in ref.smk
echo -e "\nChecking ref.smk implementation:"
cat workflow/rules/ref.smk

# Check for any config validation or schema in the repository
echo -e "\nChecking for config schema files:"
fd "config.*\.ya?ml"

# Check for test files related to reference handling
echo -e "\nChecking for related test files:"
fd "test.*" | rg "ref|reference"

Length of output: 3199

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
.test/config-simple/config.yml (2)

19-20: LGTM! Consider enhancing the file extension documentation.

The comment about file extensions is helpful, but could be more explicit about case sensitivity and whether compressed formats are supported.

Consider updating the comment to:

-    # Genome FA or FASTA or FNA
+    # Reference genome file (supported extensions: .fa, .fasta, .fna, case-insensitive)

24-25: LGTM! Consider clarifying the accession usage.

The comment explains why the accession is empty in the test configuration. However, it would be helpful to document the relationship between accession and local files.

Consider updating the comments to:

-    # empty, because every real reference is too big.
-    accession: ""
+    # NCBI accession number (leave empty when using local genome/annotation files)
+    # Empty in test config as real reference datasets are too large
+    accession: ""
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 7e9b794 and 5b7000d.

📒 Files selected for processing (1)
  • .test/config-simple/config.yml (1 hunks)
🧰 Additional context used
🪛 yamllint
.test/config-simple/config.yml

[error] 22-22: trailing spaces

(trailing-spaces)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
.test/config-simple/config.yml (2)

19-20: Enhance file extension documentation for genome configuration.

The comment should explicitly list all supported extensions with their case sensitivity.

-    # Genome FA or FASTA or FNA
+    # Genome file (supported extensions: .fa, .fasta, .fna, case-insensitive)

24-25: Clarify the relationship between local files and accession.

The comment could better explain that the accession is optional when local genome and annotation files are provided.

-    # empty, because every real reference is too big.
+    # Optional: NCBI accession number. Not required when local genome and annotation files are provided.
+    # Set to None in test config as real reference data would be too large.
workflow/rules/commons.smk (1)

35-38: Enhance function documentation with detailed docstring.

The docstring should include parameters, return values, and exceptions for better maintainability.

 def get_reference_files(config):
     """
-    Get reference files from config file and validate them.
+    Validate and retrieve reference file paths from configuration.
+    
+    Args:
+        config (dict): Configuration dictionary containing reference paths and accession
+            
+    Returns:
+        dict: Dictionary containing validated paths for 'genome' and/or 'annotation'
+            
+    Raises:
+        ValueError: If only one reference file is provided or if no valid references exist
+        KeyError: If accessing non-existent config keys
     """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 5b7000d and 383f475.

📒 Files selected for processing (2)
  • .test/config-simple/config.yml (1 hunks)
  • workflow/rules/commons.smk (1 hunks)
🧰 Additional context used
🪛 yamllint
.test/config-simple/config.yml

[error] 22-22: trailing spaces

(trailing-spaces)

🔇 Additional comments (2)
workflow/rules/commons.smk (2)

39-54: Past review comment is still applicable.

The validation logic issues highlighted in the previous review remain unaddressed.


59-65: Verify the partial return behavior with accession number.

The current logic allows returning partial reference data when an accession number is present. This might lead to unexpected behavior if the workflow expects both genome and annotation files.

✅ Verification successful

Let me gather more context about the workflow configuration and usage patterns.


Let me check one more aspect to understand the workflow's behavior with the reference files.


Based on the gathered context, I can now provide the final response:

The partial return behavior is intentional and correct.

The current implementation in get_reference_files is correct because:

  1. The workflow supports both local and remote reference files, as evidenced by the config schema and usage patterns.
  2. When an accession number is present, the workflow downloads missing files through the get_genome and get_annotation rules, which handle both local files and NCBI dataset downloads.
  3. The partial returns are properly handled by the dependent rules, which use .get() with a fallback to "references/ncbi_dataset.zip" when a specific reference file is not provided.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usage of get_reference_files to verify how partial returns are handled
rg -A 5 "get_reference_files"

Length of output: 1019


Script:

#!/bin/bash
# Check workflow configuration patterns and defaults
rg -A 10 "config\[.ref" 

# Look for any validation or error handling related to reference files
rg -B 3 -A 7 "raise.*reference"

Length of output: 2667


Script:

#!/bin/bash
# Check the rule dependencies and workflow structure
rg -B 2 -A 5 "rule.*genome|rule.*annotation"

# Look for any config schema or validation
fd -e yaml -e yml | xargs cat

Length of output: 23285

Copy link
Collaborator

@cmeesters cmeesters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome, finally working

@cmeesters cmeesters merged commit 5fed515 into main Oct 29, 2024
6 checks passed
@cmeesters cmeesters deleted the feat-local-refs branch October 29, 2024 10:04
@coderabbitai coderabbitai bot mentioned this pull request Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants