-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: local refs #100
feat: local refs #100
Conversation
Warning Rate limit exceeded@cmeesters has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 8 minutes and 8 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. WalkthroughThe pull request introduces updates to several configuration and workflow files related to transcriptome analysis. New parameters for genome and annotation file paths are added in two configuration files, enhancing clarity and usability. Additionally, a new function for validating reference files is introduced in the workflow, alongside modifications to existing rules to utilize this function, improving error handling and specificity in file paths. Changes
Possibly related PRs
Suggested reviewers
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (6)
workflow/rules/ref.smk (2)
10-10
: Consider making theget_genome
rule conditional.While organizing downloads in a "references" directory is good, this rule should only run when using NCBI accessions, not when local references are provided.
Consider using a conditional workflow or making this rule optional when local references are specified in the config. This would prevent unnecessary downloads when local files are used.
Line range hint
1-61
: Consider separating local and NCBI reference handling into distinct rules.The current implementation combines local and NCBI reference handling within the same rules, which could make debugging and maintenance more challenging. Consider restructuring the workflow to have separate rules for each source type:
Create distinct rules for local references:
use_local_genome
use_local_annotation
Create distinct rules for NCBI references:
download_ncbi_genome
download_ncbi_annotation
Use a rule selection mechanism based on config settings
This separation would make the workflow more maintainable and easier to debug.
Would you like me to provide a detailed example of this alternative structure?
.test/config-simple/config.yml (1)
19-22
: Fix trailing whitespace and consider documenting supported file formats.The new parameters are well-structured, but there's a minor formatting issue.
Apply this diff to remove trailing whitespace:
- annotation: "./ngs-test-data/ref/annotation.chr21.gtf" + annotation: "./ngs-test-data/ref/annotation.chr21.gtf"Consider expanding the comments to list all supported file formats:
- # Genome FA or FASTA or FNA + # Genome file path (supported formats: .fa, .fasta, .fna) - # Annotation GFF or GTF + # Annotation file path (supported formats: .gff, .gtf)🧰 Tools
🪛 yamllint
[error] 22-22: trailing spaces
(trailing-spaces)
config/Mainz-MogonNHR/config.yml (2)
27-28
: Remove extra blank line.There are too many consecutive blank lines (3 > 2).
Remove one blank line to comply with YAML formatting guidelines.
🧰 Tools
🪛 yamllint
[warning] 28-28: too many blank lines
(3 > 2) (empty-lines)
17-23
: Consider adding path validation in the workflow.To ensure robustness, consider adding validation for the reference file paths:
- Verify file existence
- Validate file extensions match the allowed patterns
- Check file format integrity (e.g., valid FASTA/GFF format)
Would you like me to provide an example implementation of the validation logic?
🧰 Tools
🪛 yamllint
[error] 23-23: trailing spaces
(trailing-spaces)
workflow/rules/commons.smk (1)
63-68
: Enhance error messages with specific requirements.The error messages could be more helpful by including the supported file formats and providing guidance on how to fix the issue.
Consider updating the error messages:
if genome or annotation: raise ValueError( - "Only one reference file provided, provide either both genome and annotation or an NCBI accession number." + f"Only one reference file provided. You must provide either:\n" + f"1. Both genome (.fa, .fna, .fasta) and annotation (.gtf, .gff) files, or\n" + f"2. An NCBI accession number with optional local files" ) - raise ValueError("No valid reference files or accession number provided.") + raise ValueError( + "No valid reference files or accession number provided. Please check:\n" + "1. Genome file exists and has extension: .fa, .fna, or .fasta\n" + "2. Annotation file exists and has extension: .gtf or .gff\n" + "3. NCBI accession number is provided in the config" + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- .test/config-simple/config.yml (1 hunks)
- config/Mainz-MogonNHR/config.yml (1 hunks)
- workflow/rules/commons.smk (1 hunks)
- workflow/rules/ref.smk (4 hunks)
🧰 Additional context used
📓 Learnings (1)
config/Mainz-MogonNHR/config.yml (1)
Learnt from: cmeesters PR: snakemake-workflows/transcriptome-differential-expression#96 File: config/Mainz-MogonNHR/config.yml:69-69 Timestamp: 2024-10-11T15:29:48.149Z Learning: Changes to DESeq2 parameters in `config/Mainz-MogonNHR/config.yml` might be addressed and documented in separate commits or PRs.
🪛 yamllint
.test/config-simple/config.yml
[error] 22-22: trailing spaces
(trailing-spaces)
config/Mainz-MogonNHR/config.yml
[error] 23-23: trailing spaces
(trailing-spaces)
[warning] 28-28: too many blank lines
(3 > 2) (empty-lines)
🪛 Gitleaks
config/Mainz-MogonNHR/config.yml
25-25: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🔇 Additional comments (4)
workflow/rules/ref.smk (1)
25-27
: 🛠️ Refactor suggestionEnhance input validation and error handling.
While the changes support both local and downloaded files, there are several areas for improvement:
- Add validation for the input genome file format (should be FASTA)
- Improve error handling in the shell command
- Consider moving the default path to config
Here's a suggested improvement for the shell command:
-[ -f "{input}" ] && mv {input} {output} || unzip -p {input} ncbi_dataset/data/{params.accession}/*.fna > {output} 2>> {log} +if [ -f "{input}" ]; then + # Validate FASTA format + if head -n 1 "{input}" | grep -q '^>'; then + mv "{input}" "{output}" + else + echo "Error: Input file is not in FASTA format" >&2 + exit 1 + fi +else + if ! unzip -p "{input}" ncbi_dataset/data/{params.accession}/*.fna > "{output}" 2>> "{log}"; then + echo "Error: Failed to extract genome from zip file" >&2 + exit 1 + fi +fiAlso applies to: 40-40
.test/config-simple/config.yml (1)
19-22
: Clarify precedence between local files and accession number.The configuration now supports both local reference files and NCBI accession numbers, but it's unclear which takes precedence if both are provided. This could lead to confusion or unexpected behavior.
Let's check if this is handled in the workflow rules:
Consider:
- Making these parameters mutually exclusive in the schema
- Adding validation to ensure only one method is used
- Documenting the precedence rules in the configuration comments
🧰 Tools
🪛 yamllint
[error] 22-22: trailing spaces
(trailing-spaces)
config/Mainz-MogonNHR/config.yml (1)
17-17
: LGTM! Clear documentation of reference data options.The comment effectively communicates the two available options for providing reference data.
workflow/rules/commons.smk (1)
53-61
: 🛠️ Refactor suggestionVerify accession number format when only one reference file is provided.
When operating with a mix of local files and accession numbers, it would be beneficial to validate the accession number format to catch potential issues early.
Let's check if there are any existing accession number validations in the codebase:
Consider adding accession validation:
if accession: + # Validate accession number format (e.g., GCF_000001405.39) + if not re.match(r'^GCF_\d{9}\.\d+$', accession): + raise ValueError(f"Invalid accession number format: {accession}") if genome: return {"genome": genome} if annotation: return {"annotation": annotation} return {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, sorry, the review must be finished to be visible!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (4)
workflow/rules/ref.smk (2)
Line range hint
1-4
: Update localrules declaration and ensure consistency.The
localrules
declaration contains outdated rule names:
extract_annotation
→get_annotation
extract_genome
→get_genome
Also, consider adding
get_references
to the localrules if it doesn't require computational resources.Apply this fix:
localrules: get_genome, - extract_annotation, - extract_genome, + get_annotation, + get_references,
Missing directory creation for references path
The verification confirms that:
- No explicit directory creation for "references/" exists in the workflow
- While Snakemake's
directory()
function is used in other rules (qc.smk), it's not used for the references directory- The references directory is used as input/output in multiple rules (datamod.smk)
To prevent potential failures, add either:
- A
directory("references")
output in an appropriate rule- Or use
shell: "mkdir -p references"
before the dataset download🔗 Analysis chain
Line range hint
7-19
: Consider directory creation and local reference handling.While the rule handles NCBI dataset downloads, two concerns need attention:
- The output directory "references/" might not exist, potentially causing failures
- This rule seems unnecessary when using local references (as per PR objectives)
Let's verify if the directory is created elsewhere:
Consider:
- Adding directory creation if not handled elsewhere
- Making this rule conditional based on whether local references are used
The previous search might have been too specific. Let's broaden our search to understand:
- If there are any directory creation commands
- The main workflow files that might handle this
- Other mentions of the references directory
- Any shell scripts that might handle setup
This will help us determine if the directory creation is handled elsewhere or if it's indeed missing.
Let's examine if the references directory is handled in datamod.smk or if there are any Snakemake directory creation patterns. This will help us confirm if the directory creation is truly missing or if it's handled through Snakemake's built-in directory function or other rules.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Search for directory creation rg -l 'mkdir.*references'Length of output: 27
Script:
#!/bin/bash # Let's try different approaches to find directory creation patterns # Check for any mkdir commands in the workflow rg -l 'mkdir' # Look for directory creation in Snakefile or other workflow files fd -e smk -e snakefile -i snakefile # Check if references directory is mentioned in other rules rg -l 'references/' # Look for directory creation in shell commands or scripts fd -e shLength of output: 132
Script:
#!/bin/bash # Let's check the content of datamod.smk to see if it handles the references directory cat workflow/rules/datamod.smk # Also check if there are any directory creation patterns in Snakemake rules rg -A 5 'directory\(' # Look for any shell commands that might create directories rg -A 3 'shell:.*mkdir'Length of output: 1221
config/Mainz-MogonNHR/config.yml (2)
16-19
: Enhance documentation clarity for reference file requirements.While the documentation explains the basic options, it would be helpful to clarify:
- Whether both genome and annotation files are required when providing local files
- The exact conditions when an accession number is required (e.g., "if either genome OR annotation is missing")
# The reference genome respectively transcriptome, is defined here: # You can provide either 'genome' or 'annotation' reference files locally. -# If both 'genome' and 'annotation' are available no 'accession' number is needed. -# In any other case an 'accession' number is needed to download missing reference data remotely. +# If both 'genome' and 'annotation' are available locally, no 'accession' number is needed. +# An 'accession' number is required if either the genome or annotation file is missing, +# as it will be used to download the missing reference data remotely.🧰 Tools
🪛 yamllint
[error] 18-18: trailing spaces
(trailing-spaces)
31-32
: Enhance accession number documentation.The comment could be more specific about the accession number format and provide an example.
- # Remote reference data - # NCBI accession number of the reference data set; can be left empty if both reference files are available locally" + # Remote reference data + # NCBI GenBank assembly accession (e.g., GCA_XXXXXXXXX.X) for downloading missing reference files. + # Can be left empty if both genome and annotation files are provided locally.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- config/Mainz-MogonNHR/config.yml (1 hunks)
- workflow/rules/ref.smk (3 hunks)
🧰 Additional context used
📓 Learnings (1)
config/Mainz-MogonNHR/config.yml (1)
Learnt from: cmeesters PR: snakemake-workflows/transcriptome-differential-expression#96 File: config/Mainz-MogonNHR/config.yml:69-69 Timestamp: 2024-10-11T15:29:48.149Z Learning: Changes to DESeq2 parameters in `config/Mainz-MogonNHR/config.yml` might be addressed and documented in separate commits or PRs.
🪛 Gitleaks
config/Mainz-MogonNHR/config.yml
33-33: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🪛 yamllint
config/Mainz-MogonNHR/config.yml
[error] 18-18: trailing spaces
(trailing-spaces)
[error] 22-22: trailing spaces
(trailing-spaces)
[error] 27-27: trailing spaces
(trailing-spaces)
[error] 30-30: trailing spaces
(trailing-spaces)
[warning] 36-36: too many blank lines
(3 > 2) (empty-lines)
🔇 Additional comments (2)
workflow/rules/ref.smk (1)
61-63
:⚠️ Potential issuePrevious review comments still apply; additional concerns noted.
The previous review comment about improving shell command reliability is still valid. Additionally:
- The trailing semicolon is unnecessary
- Error messages should be more descriptive
- The command structure is inconsistent with the
get_genome
rulePlease address the previous review comments and maintain consistency with the
get_genome
rule implementation.config/Mainz-MogonNHR/config.yml (1)
23-29
:⚠️ Potential issueImprove configuration portability and documentation.
- The absolute paths make the configuration less portable across different environments.
- The comments about empty fields could be more specific.
- There are trailing spaces that should be removed.
Apply these changes:
- # Local reference data - # Genome file path; can be left empty if remote download is prefered. - # allowed extensions: .FA or .FNA or .FASTA - genome: "/lustre/project/nhr-zdvhpc/transcriptome_data/GCA_917627325.4_PGI_CHIRRI_v4_genomic.fa" - # Annotation file path; can be left empty if remote download is prefered. - # allowed extensions: .GFF or .GTF - annotation: "/lustre/project/nhr-zdvhpc/transcriptome_data/GCA_917627325.4_PGI_CHIRRI_v4_genomic.gff" + # Local reference data + # Genome file path (can be empty if using accession for remote download) + # Allowed extensions: .FA, .FNA, or .FASTA + genome: "${REFERENCE_DATA}/GCA_917627325.4_PGI_CHIRRI_v4_genomic.fa" + # Annotation file path (can be empty if using accession for remote download) + # Allowed extensions: .GFF or .GTF + annotation: "${REFERENCE_DATA}/GCA_917627325.4_PGI_CHIRRI_v4_genomic.gff"This addresses the same portability concerns raised in previous reviews.
🧰 Tools
🪛 yamllint
[error] 27-27: trailing spaces
(trailing-spaces)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
config/Mainz-MogonNHR/config.yml (2)
16-19
: Enhance documentation clarity.While the documentation clearly explains the relationship between local files and accession numbers, consider adding these clarifications:
- Specify that both genome and annotation files must be provided together when using local files
- Mention the supported file formats (.fa/.fna/.fasta for genome, .gff/.gtf for annotation) in this section
# The reference genome respectively transcriptome, is defined here: # You can provide either 'genome' or 'annotation' reference files locally. -# If both 'genome' and 'annotation' are available no 'accession' number is needed. -# In any other case an 'accession' number is needed to download missing reference data remotely. +# If both 'genome' (*.fa/*.fna/*.fasta) and 'annotation' (*.gff/*.gtf) files are provided locally, +# no 'accession' number is needed. In any other case, an 'accession' number is required to +# download missing reference data remotely.
31-33
: Improve documentation for remote reference configuration.The comment could be more specific about when the accession number is required.
# Remote reference data -# NCBI accession number of the reference data set; can be left empty if both reference files are available locally" +# NCBI accession number of the reference data set (e.g., GCA_917627325.4). +# Required only when either genome or annotation files are not provided locally. +# Must be a valid NCBI GenBank/RefSeq assembly accession.Note: The Gitleaks warning about the accession number being a potential API key can be safely ignored as this is a valid NCBI accession number format.
🧰 Tools
🪛 Gitleaks
33-33: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- config/Mainz-MogonNHR/config.yml (1 hunks)
🧰 Additional context used
📓 Learnings (1)
config/Mainz-MogonNHR/config.yml (1)
Learnt from: cmeesters PR: snakemake-workflows/transcriptome-differential-expression#96 File: config/Mainz-MogonNHR/config.yml:69-69 Timestamp: 2024-10-11T15:29:48.149Z Learning: Changes to DESeq2 parameters in `config/Mainz-MogonNHR/config.yml` might be addressed and documented in separate commits or PRs.
🪛 Gitleaks
config/Mainz-MogonNHR/config.yml
33-33: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
workflow/rules/commons.smk (1)
36-38
: Enhance function documentation for better maintainability.The docstring should be expanded to include:
- Parameter description and type
- Return value structure
- Possible exceptions and their conditions
def get_reference_files(config): """ - Get reference files from config file and validate them. + Validate and retrieve reference file paths from configuration. + + Args: + config (dict): Configuration dictionary containing reference information + with optional keys: 'ref.genome', 'ref.annotation', 'ref.accession' + + Returns: + dict: Dictionary containing validated paths with possible keys: + - {'genome': str, 'annotation': str} when both files are valid + - {'genome': str} or {'annotation': str} when one file is valid with accession + - {} when only accession is provided + + Raises: + ValueError: If only one reference file is provided without accession, + or if no valid reference files or accession are provided """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- workflow/rules/commons.smk (1 hunks)
🔇 Additional comments (1)
workflow/rules/commons.smk (1)
56-73
: Verify the reference file handling logic.The error handling and return logic look good, with clear messages and proper handling of different input combinations. Let's verify that the implementation aligns with the PR objectives.
✅ Verification successful
Reference file handling logic is correctly implemented and used consistently
The verification confirms that:
- The
get_reference_files
function is properly integrated with the workflow rules inref.smk
- The function's return values are correctly used in both
get_genome
andget_annotation
rules- The shell commands handle both local files and NCBI dataset cases appropriately
- Error handling aligns with the workflow's requirements, throwing descriptive errors for invalid combinations
The implementation correctly supports:
- Using local genome and annotation files when both are provided
- Using NCBI accession when local files aren't available
- Proper validation of file extensions and existence
- Clear error messages for invalid combinations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that the reference file handling is used consistently across the workflow # Test 1: Check for usage of get_reference_files echo "Checking for get_reference_files usage:" rg "get_reference_files" -l # Test 2: Check for related rule definitions echo -e "\nChecking for related rules:" rg "rule (get_genome|get_annotation)" -A 5 # Test 3: Check for config schema validation echo -e "\nChecking config schema for reference file definitions:" cat ../schemas/config.schema.yaml | rg -A 10 "ref:"Length of output: 1132
Script:
#!/bin/bash # Let's verify the reference handling implementation more thoroughly # Check the full implementation of get_reference_files function echo "Checking get_reference_files implementation:" rg "def get_reference_files" -A 20 workflow/rules/commons.smk # Check how the function is used in ref.smk echo -e "\nChecking ref.smk implementation:" cat workflow/rules/ref.smk # Check for any config validation or schema in the repository echo -e "\nChecking for config schema files:" fd "config.*\.ya?ml" # Check for test files related to reference handling echo -e "\nChecking for related test files:" fd "test.*" | rg "ref|reference"Length of output: 3199
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
.test/config-simple/config.yml (2)
19-20
: LGTM! Consider enhancing the file extension documentation.The comment about file extensions is helpful, but could be more explicit about case sensitivity and whether compressed formats are supported.
Consider updating the comment to:
- # Genome FA or FASTA or FNA + # Reference genome file (supported extensions: .fa, .fasta, .fna, case-insensitive)
24-25
: LGTM! Consider clarifying the accession usage.The comment explains why the accession is empty in the test configuration. However, it would be helpful to document the relationship between accession and local files.
Consider updating the comments to:
- # empty, because every real reference is too big. - accession: "" + # NCBI accession number (leave empty when using local genome/annotation files) + # Empty in test config as real reference datasets are too large + accession: ""
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- .test/config-simple/config.yml (1 hunks)
🧰 Additional context used
🪛 yamllint
.test/config-simple/config.yml
[error] 22-22: trailing spaces
(trailing-spaces)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
.test/config-simple/config.yml (2)
19-20
: Enhance file extension documentation for genome configuration.
The comment should explicitly list all supported extensions with their case sensitivity.
- # Genome FA or FASTA or FNA
+ # Genome file (supported extensions: .fa, .fasta, .fna, case-insensitive)
24-25
: Clarify the relationship between local files and accession.
The comment could better explain that the accession is optional when local genome and annotation files are provided.
- # empty, because every real reference is too big.
+ # Optional: NCBI accession number. Not required when local genome and annotation files are provided.
+ # Set to None in test config as real reference data would be too large.
workflow/rules/commons.smk (1)
35-38
: Enhance function documentation with detailed docstring.
The docstring should include parameters, return values, and exceptions for better maintainability.
def get_reference_files(config):
"""
- Get reference files from config file and validate them.
+ Validate and retrieve reference file paths from configuration.
+
+ Args:
+ config (dict): Configuration dictionary containing reference paths and accession
+
+ Returns:
+ dict: Dictionary containing validated paths for 'genome' and/or 'annotation'
+
+ Raises:
+ ValueError: If only one reference file is provided or if no valid references exist
+ KeyError: If accessing non-existent config keys
"""
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- .test/config-simple/config.yml (1 hunks)
- workflow/rules/commons.smk (1 hunks)
🧰 Additional context used
🪛 yamllint
.test/config-simple/config.yml
[error] 22-22: trailing spaces
(trailing-spaces)
🔇 Additional comments (2)
workflow/rules/commons.smk (2)
39-54
: Past review comment is still applicable.
The validation logic issues highlighted in the previous review remain unaddressed.
59-65
: Verify the partial return behavior with accession number.
The current logic allows returning partial reference data when an accession number is present. This might lead to unexpected behavior if the workflow expects both genome and annotation files.
✅ Verification successful
Let me gather more context about the workflow configuration and usage patterns.
Let me check one more aspect to understand the workflow's behavior with the reference files.
Based on the gathered context, I can now provide the final response:
The partial return behavior is intentional and correct.
The current implementation in get_reference_files
is correct because:
- The workflow supports both local and remote reference files, as evidenced by the config schema and usage patterns.
- When an accession number is present, the workflow downloads missing files through the
get_genome
andget_annotation
rules, which handle both local files and NCBI dataset downloads. - The partial returns are properly handled by the dependent rules, which use
.get()
with a fallback to "references/ncbi_dataset.zip" when a specific reference file is not provided.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for usage of get_reference_files to verify how partial returns are handled
rg -A 5 "get_reference_files"
Length of output: 1019
Script:
#!/bin/bash
# Check workflow configuration patterns and defaults
rg -A 10 "config\[.ref"
# Look for any validation or error handling related to reference files
rg -B 3 -A 7 "raise.*reference"
Length of output: 2667
Script:
#!/bin/bash
# Check the rule dependencies and workflow structure
rg -B 2 -A 5 "rule.*genome|rule.*annotation"
# Look for any config schema or validation
fd -e yaml -e yml | xargs cat
Length of output: 23285
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome, finally working
Enables users to input their own reference data instead of providing an accession number, which will then be used to download reference data.
Summary by CodeRabbit
New Features
Improvements
These changes improve the usability and reliability of the transcriptome analysis workflow.