Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DI-920 Changes for Medicover internal AF VCF #19

Merged
merged 8 commits into from
Nov 29, 2024
Merged

Conversation

rklocke
Copy link
Collaborator

@rklocke rklocke commented Nov 26, 2024

  • Add run mode choices where in mode 'no_qc' we can find Medicover VCFs and not find the QC status files in the GRCh37 project like we would for CEN or TWE
  • Small changes to bash script to give better logging and check memory usage etc, use fields 2 and 3 of input file (since we longer output the pandas index from the Python file) and name output file by input file
  • Update README for changes

This change is Reviewable

Summary by CodeRabbit

  • New Features

    • Introduced a new command-line argument --run_mode for the VCF merging script, allowing users to choose between find_qc and no_qc modes.
    • Added functionality to locate and process VCF files directly.
    • Enhanced logging and resource usage reporting in the merging script.
  • Documentation

    • Updated README to clarify operational modes and provide example commands, improving user understanding of the scripts.
  • Bug Fixes

    • Improved error handling and output messaging for project handling in the VCF processing script.
  • Chores

    • Updated output file naming conventions for clarity and consistency.

@pep8speaks
Copy link

pep8speaks commented Nov 26, 2024

Hello @rklocke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 514:80: E501 line too long (88 > 79 characters)

Comment last updated at 2024-11-29 09:42:59 UTC

Copy link

coderabbitai bot commented Nov 26, 2024

Walkthrough

This pull request introduces significant updates to the documentation and scripts involved in processing VCF files. The README.md now delineates two operational modes for the find_vcfs_to_merge.py script: find_qc and no_qc. The script has been enhanced with a new --run_mode parameter, allowing users to specify how VCF files are processed. Additionally, the merge_VCF_AF.sh script has been improved with better logging and memory usage reporting. These changes collectively enhance clarity, usability, and functionality.

Changes

File Change Summary
DI-435/README.md Updated to clarify operational modes (find_qc, no_qc), added example commands, reformatted inputs, and updated output descriptions.
DI-435/find_vcfs_to_merge.py Added --run_mode parameter, restructured main function for mode handling, refined get_qc_files logic, and introduced find_medicover_vcf_files function.
DI-435/merge_VCF_AF.sh Enhanced logging with timestamps, added _get_peak_usage function for resource reporting, modified input handling, and updated output filenames.

Possibly related PRs

  • Di 435 bash improvements #17: The changes in merge_VCF_AF.sh regarding the handling of VCF files and improvements in error handling are relevant as they complement the updates made in the main PR to the same script, enhancing its overall functionality and robustness.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE

📥 Commits

Reviewing files that changed from the base of the PR and between fcf4410 and 2a8721e.

📒 Files selected for processing (3)
  • DI-435/README.md (1 hunks)
  • DI-435/find_vcfs_to_merge.py (5 hunks)
  • DI-435/merge_VCF_AF.sh (4 hunks)
🧰 Additional context used
📓 Learnings (1)
DI-435/merge_VCF_AF.sh (1)
Learnt from: rklocke
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:58-63
Timestamp: 2024-11-22T12:19:56.287Z
Learning: When improving bash scripts, prefer adding `set -exo pipefail` at the top to enhance error handling and robustness.
🪛 LanguageTool
DI-435/README.md

[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e.g. "*CEN38". - -o --outfile_prefix: Prefix to use to name the output files,...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~38-~38: Loose punctuation mark.
Context: ...t files, e.g. CEN38. - -r --run_mode: A choice of whether to find and use QC ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~41-~41: Loose punctuation mark.
Context: ...section below. - -e --end (optional): A date used to find DNAnexus (GRCh37) p...

(UNLIKELY_OPENING_PUNCTUATION)


[style] ~49-~49: Consider removing “of” to be more concise
Context: ...for each of these projects and reads in all of the QC status files into one merged datafra...

(ALL_OF_THE)

🪛 Markdownlint (0.35.0)
DI-435/README.md

59-59: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


8-8: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


9-9: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


9-9: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


35-35: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


46-46: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


54-54: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


54-54: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


60-60: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


66-66: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


66-66: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


70-70: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


11-11: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


16-16: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


18-18: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


28-28: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


76-76: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


5-5: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


36-36: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


48-48: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


53-53: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


55-55: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


62-62: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


65-65: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


67-67: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


72-72: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


18-18: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


28-28: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


76-76: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


83-83: null
Files should end with a single newline character

(MD047, single-trailing-newline)

🪛 Ruff (0.8.0)
DI-435/find_vcfs_to_merge.py

60-60: Trailing comma missing

Add trailing comma

(COM812)


218-218: Trailing comma missing

Add trailing comma

(COM812)


355-355: Missing return type annotation for public function find_medicover_vcf_files

(ANN201)


355-355: Missing type annotation for function argument projects

(ANN001)


377-377: Trailing comma missing

Add trailing comma

(COM812)


384-384: Trailing comma missing

Add trailing comma

(COM812)


391-391: Trailing comma missing

Add trailing comma

(COM812)


392-392: Trailing comma missing

Add trailing comma

(COM812)


396-396: Trailing comma missing

Add trailing comma

(COM812)


477-477: Trailing comma missing

Add trailing comma

(COM812)


483-483: Trailing comma missing

Add trailing comma

(COM812)


496-496: Trailing comma missing

Add trailing comma

(COM812)


513-513: Trailing comma missing

Add trailing comma

(COM812)


523-523: Trailing comma missing

Add trailing comma

(COM812)


532-532: Trailing comma missing

Add trailing comma

(COM812)


538-538: Trailing comma missing

Add trailing comma

(COM812)


543-543: Trailing comma missing

Add trailing comma

(COM812)


548-548: Trailing comma missing

Add trailing comma

(COM812)


556-556: Trailing comma missing

Add trailing comma

(COM812)


569-569: Trailing comma missing

Add trailing comma

(COM812)


576-576: Trailing comma missing

Add trailing comma

(COM812)


585-585: Trailing comma missing

Add trailing comma

(COM812)


591-591: Trailing comma missing

Add trailing comma

(COM812)


595-595: Trailing comma missing

Add trailing comma

(COM812)


598-598: Trailing comma missing

Add trailing comma

(COM812)


604-604: Trailing comma missing

Add trailing comma

(COM812)

🪛 Shellcheck (0.10.0)
DI-435/merge_VCF_AF.sh

[warning] 23-23: Quote this to prevent word splitting.

(SC2046)


[info] 23-23: Consider using pgrep instead of grepping ps output.

(SC2009)


[info] 30-30: Double quote to prevent globbing and word splitting.

(SC2086)

🔇 Additional comments (1)
DI-435/merge_VCF_AF.sh (1)

15-17: Logging output is properly directed for traceability

Excellent work directing stdout and stderr to log files. This will help us keep track of the script's execution, much like keeping a meticulous recipe book.

DI-435/README.md Outdated
Comment on lines 11 to 33
```
python3 find_vcfs_to_merge.py \
--assay "*CEN38" \
--outfile_prefix CEN38 \
--run_mode find_qc
```
Example command to find TWE GRCh38 VCFs where the GRCh37 runs were between two dates:
```
python3 find_vcfs_to_merge.py \
--assay "*TWE38" \
--outfile_prefix TWE38 \
--run_mode find_qc \
--start 2024-03-15 \
--end 2024-08-04
```

**Searching Dates**
These dates restrict the projects collated to only GRCh38 projects
which have corresponding GRCh37 projects which were created within the specified dates.
Example command to find Medicover VCFs with no QC status files:
```
python3 find_vcfs_to_merge.py \
--assay "*TWE38M" \
--outfile_prefix TWE38M \
--run_mode no_qc
```
Copy link

@coderabbitai coderabbitai bot Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Ah, these examples need the proper garnish, mon ami!

Just like we label our mise en place, we must label our code blocks! They're missing their language specifiers, and the presentation needs proper spacing.

Add language specifiers and proper spacing to the code blocks:

-```
+```bash
 python3 find_vcfs_to_merge.py \

Apply this to all three code blocks and add blank lines before and after each block for proper presentation.

🧰 Tools
🪛 Markdownlint (0.35.0)

11-11: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


16-16: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


18-18: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


28-28: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


18-18: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


28-28: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rklocke Thanks for making the updates.


If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!

DI-435/README.md Outdated
Comment on lines 35 to 41
#### Inputs
- `-a --assay`: The GRCh38 project search term for in DNAnexus, e.g. `"*CEN38"`.
- `-o --outfile_prefix`: Prefix to use to name the output files, e.g. `CEN38`.
- `-r --run_mode`: A choice of whether to find and use QC status files (`find_qc`) or not (`no_qc`).
In `find_qc` mode the following inputs are allowed:
- `-s --start (optional)`: A date used to find DNAnexus (GRCh37) projects created after, see searching dates section below.
- `-e --end (optional)`: A date used to find DNAnexus (GRCh37) projects created before, see searching dates section below.
Copy link

@coderabbitai coderabbitai bot Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

The ingredients list is perfect, but the presentation needs finesse!

Your parameter descriptions are clear as consommé, but the formatting could use a touch of refinement. Let's add proper spacing around the headings and lists, just as we maintain proper spacing between courses.

Add blank lines:

  • Before and after the "Inputs" heading
  • Before and after the parameter list
  • Between the main parameters and the find_qc mode specific parameters
🧰 Tools
🪛 LanguageTool

[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e.g. "*CEN38". - -o --outfile_prefix: Prefix to use to name the output files,...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~38-~38: Loose punctuation mark.
Context: ...t files, e.g. CEN38. - -r --run_mode: A choice of whether to find and use QC ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~41-~41: Loose punctuation mark.
Context: ...section below. - -e --end (optional): A date used to find DNAnexus (GRCh37) p...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 Markdownlint (0.35.0)

35-35: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


36-36: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rklocke, thanks for making the changes!

(^_^)/


If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!

DI-435/README.md Outdated
Comment on lines 46 to 68
#### find_qc mode
How the script works in `find_qc` mode:
1. Finds all DNAnexus projects with suffix `--assay`.
2. Finds the related GRCh37 project (and between `--start` and `--end` dates, if provided) for each of these projects and reads in all of the QC status files into one merged dataframe. If multiple QC status files exist in a project, the one created last is used.
3. Finds all raw VCFs in each of the DNAnexus projects.
4. Splits these VCFs into a list of validation (including control) samples and non-validation samples based on naming conventions
5. Removes any samples which are duplicates or any which failed QC at any time based on information within the QC status files.
6. Creates a final list of all VCFs to merge and writes this out to file with`--outfile_prefix`.
5. Removes the first instance of a sample which is duplicated or any which failed QC at any time based on information within the QC status files.
6. Creates a final list of all VCFs to merge and writes this out to file with `--outfile_prefix`.
##### Outputs
- A TSV listing the VCF files for all non-validation samples to merge (named by `{outfile_prefix}_files_to_merge.txt`)
- A CSV of all validation samples found (`{outfile_prefix}_validation_samples.csv`)
- A CSV of all projects within search but missing QC file in DNAnexus and therefore not included (`{outfile_prefix}_projects_missing_QC.csv`).


#### no_qc mode
How the script works in `no_qc` mode:
1. Finds all DNAnexus projects with suffix `--assay`
2. Finds all raw VCFs in each of the DNAnexus projects.
3. Removes all samples which are duplicates.
4. Creates a final list of all VCFs to merge and writes this to file with `--outfile_prefix`.
##### Outputs
- A TSV listing the VCF files to merge (named by `{outfile_prefix}_files_to_merge.txt`)
- A TSV of all duplicate files which have been removed (`{outfile_prefix}_all_dup_rows.txt`)
Copy link

@coderabbitai coderabbitai bot Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

The recipe is perfect, but the plating needs attention!

Your mode descriptions are as detailed as a complex reduction sauce, but we've got some unnecessary blank lines that would never pass in my kitchen! And those headings need proper spacing, just like we space our courses!

  1. Remove the double blank lines around line 59
  2. Add proper spacing around all headings and lists
  3. Ensure consistent formatting between both mode sections
🧰 Tools
🪛 LanguageTool

[style] ~49-~49: Consider removing “of” to be more concise
Context: ...for each of these projects and reads in all of the QC status files into one merged datafra...

(ALL_OF_THE)

🪛 Markdownlint (0.35.0)

59-59: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)


46-46: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


54-54: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


54-54: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


60-60: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


66-66: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


66-66: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


48-48: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


53-53: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


55-55: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


62-62: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


65-65: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


67-67: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rate Limit Exceeded

@rklocke have exceeded the limit for the number of chat messages per hour. Please wait 2 minutes and 16 seconds before sending another message.

Comment on lines +4 to +6
This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:
- `find_qc` searches for GRCh38 projects and finds the related GRCh37 project to retrieve the QC status file (used to remove failed samples) and keep only the first instance of any duplicated samples.
- `no_qc` does not search for QC status files (keeps all files) but removes any duplicated samples completely.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Magnifique! The description is perfectly seasoned, but let's garnish it properly!

The explanation of the two modes is chef's kiss perfect, but just like a properly plated dish needs proper spacing, our markdown needs proper formatting.

Add a blank line before the list for proper markdown presentation:

 This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:
+
 - `find_qc` searches for GRCh38 projects...
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:
- `find_qc` searches for GRCh38 projects and finds the related GRCh37 project to retrieve the QC status file (used to remove failed samples) and keep only the first instance of any duplicated samples.
- `no_qc` does not search for QC status files (keeps all files) but removes any duplicated samples completely.
This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:
- `find_qc` searches for GRCh38 projects and finds the related GRCh37 project to retrieve the QC status file (used to remove failed samples) and keep only the first instance of any duplicated samples.
- `no_qc` does not search for QC status files (keeps all files) but removes any duplicated samples completely.
🧰 Tools
🪛 Markdownlint (0.35.0)

5-5: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

DI-435/README.md Outdated
Comment on lines 40 to 83
```
bash merge_VCF_AF.sh \
CEN38_files_to_merge.txt \
GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
```

Output:
Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within. No newline at end of file
Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
Copy link

@coderabbitai coderabbitai bot Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

The final course needs its proper presentation, non?

The bash script example is like a perfect soufflé, but it's missing its proper garnish! And where's our proper ending? Every perfect meal needs a proper finish!

  1. Add the bash language specifier to the code block:
-```
+```bash
 bash merge_VCF_AF.sh \
  1. Ensure the file ends with a single newline character
  2. Add proper spacing around the code block
🧰 Tools
🪛 Markdownlint (0.35.0)

76-76: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


76-76: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


83-83: null
Files should end with a single newline character

(MD047, single-trailing-newline)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rklocke, great, thanks for implementing the changes promptly!

(_/)
( ^.^ )
/(")(")


If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!

Comment on lines +19 to +20
PS4='\000[$(date)]\011'
export TZ=Europe/London
Copy link

@coderabbitai coderabbitai bot Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Refine the PS4 prompt for accurate command tracing

Darling, while you've seasoned the script with timestamps for command tracing, the use of \000 and \011 may not serve us the rich flavour we're aiming for. Let's adjust the PS4 variable to ensure our logs are as impeccable as our finest dishes.

Apply this diff to correct the PS4 prompt:

-PS4='\000[$(date)]\011'
+PS4='[$(date +"%Y-%m-%d %H:%M:%S")] '
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
PS4='\000[$(date)]\011'
export TZ=Europe/London
PS4='[$(date +"%Y-%m-%d %H:%M:%S")] '
export TZ=Europe/London

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works fine as is

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rklocke, understood. I'll note this for future reviews.


✏️ Learnings added
Learnt from: rklocke
PR: eastgenomics/RD_requests#19
File: DI-435/merge_VCF_AF.sh:19-20
Timestamp: 2024-11-29T09:56:53.623Z
Learning: In `DI-435/merge_VCF_AF.sh`, the PS4 variable is set to `PS4='\000[$(date)]\011'` and should be kept as is.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

export TZ=Europe/London

# set frequency of instance usage in logs to 10 seconds
sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')
Copy link

@coderabbitai coderabbitai bot Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Improve process termination using 'pkill' and eliminate 'sudo'

Listen, we must be precise when pruning our processes. Using pkill tidies up our approach wonderfully, and let's avoid sudo to keep our kitchen safe from unintended side effects.

Apply this diff to improve the command:

-sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')
+pkill pcp-dstat
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')
pkill pcp-dstat
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 23-23: Quote this to prevent word splitting.

(SC2046)


[info] 23-23: Consider using pgrep instead of grepping ps output.

(SC2009)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be nice if it works?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RSWilson1, absolutely, it works like a charm! Replacing sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}') with pkill pcp-dstat is a more elegant and efficient way to terminate the pcp-dstat process. It simplifies the script and enhances safety by eliminating the need for sudo. In my kitchen, we always strive for precision and efficiency. Let's incorporate this refined approach to keep our script running smoothly!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this part as advised by Jethro and DNAnexus didn't allow kill without sudo - I'm not sure I understand enough to know if changing this is fine..

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's fine then

Comment on lines +476 to +477
print(
f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}"
Copy link

@coderabbitai coderabbitai bot Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Add a trailing comma after the print arguments

Including a trailing comma enhances code readability and complies with style guidelines.

Suggested change:

        print(
            f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}"
+        )

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.8.0)

477-477: Trailing comma missing

Add trailing comma

(COM812)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, @rklocke.

(_/)
( ;_;)


✏️ Learnings added
Learnt from: rklocke
PR: eastgenomics/RD_requests#19
File: DI-435/find_vcfs_to_merge.py:480-481
Timestamp: 2024-11-29T09:56:58.290Z
Learning: Trailing commas after print function arguments are not needed in this codebase.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines 51 to 61
parser.add_argument(
"-r",
"--run_mode",
choices=['find_qc', 'no_qc'],
required=True,
help=(
"Runmode - either 'find_qc' where QC status files are "
"searched for and used to find failed samples, or 'no_qc'"
"where no QC status files are searched for"
)
)
Copy link

@coderabbitai coderabbitai bot Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Mind the spacing and add a trailing comma in the help message

The concatenated strings in the help parameter are missing a space, causing words to run together in the command-line help output. Additionally, a trailing comma is missing after the help parameter, which improves code style compliance.

Please apply the following fix:

            help=(
                "Runmode - either 'find_qc' where QC status files are "
                "searched for and used to find failed samples, or 'no_qc'"
-               "where no QC status files are searched for"
+               " where no QC status files are searched for"
            ),

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.8.0)

60-60: Trailing comma missing

Add trailing comma

(COM812)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rate Limit Exceeded

@rklocke have exceeded the limit for the number of chat messages per hour. Please wait 2 minutes and 21 seconds before sending another message.

Comment on lines 355 to 400
def find_medicover_vcf_files(projects):
"""
Find VCF files in a project
Parameters
----------
project_id : str
DX project ID
Returns
-------
vcf_files : list
list of dicts, each representing a VCF file in DX
"""
sub_remove = {
'_Wdh': '',
'_Wdh2': '',
'_Whd3': '',
}
all_sample_vcfs = []
for project in projects:
vcf_files = find_data(
"*_markdup_recalibrated_Haplotyper.vcf.gz", project['id']
)
for vcf_file in vcf_files:
file_id = vcf_file["describe"]["id"]
file_name = vcf_file["describe"]["name"]
sample_name = re.sub("|".join(
sub_remove), lambda x: sub_remove[x.group(0)],
file_name.split('-TwistWE')[0]
)

all_sample_vcfs.append(
{
"sample": sample_name,
"project": project["describe"]["id"],
"file_id": file_id
}
)

print(
f"\nFound {len(all_sample_vcfs)} VCF files in GRCh38 projects"
)

return all_sample_vcfs

Copy link

@coderabbitai coderabbitai bot Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add type annotations and correct the docstring in find_medicover_vcf_files

The function find_medicover_vcf_files is missing type annotations for its parameter and return type, and the docstring contains inaccuracies. Adding type annotations improves code clarity and assists with static analysis. Updating the docstring ensures accurate documentation.

Please consider the following changes:

+from typing import List, Dict

-def find_medicover_vcf_files(projects):
+def find_medicover_vcf_files(projects: List[dict]) -> List[dict]:
     """
-    Find VCF files in a project
+    Find VCF files in specified projects

     Parameters
     ----------
-    project_id : str
-        DX project ID
+    projects : List[dict]
+        List of DNAnexus project dictionaries

     Returns
     -------
-    vcf_files : list
+    all_sample_vcfs : List[dict]
         List of dictionaries, each representing a VCF file in DNAnexus
     """
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def find_medicover_vcf_files(projects):
"""
Find VCF files in a project
Parameters
----------
project_id : str
DX project ID
Returns
-------
vcf_files : list
list of dicts, each representing a VCF file in DX
"""
sub_remove = {
'_Wdh': '',
'_Wdh2': '',
'_Whd3': '',
}
all_sample_vcfs = []
for project in projects:
vcf_files = find_data(
"*_markdup_recalibrated_Haplotyper.vcf.gz", project['id']
)
for vcf_file in vcf_files:
file_id = vcf_file["describe"]["id"]
file_name = vcf_file["describe"]["name"]
sample_name = re.sub("|".join(
sub_remove), lambda x: sub_remove[x.group(0)],
file_name.split('-TwistWE')[0]
)
all_sample_vcfs.append(
{
"sample": sample_name,
"project": project["describe"]["id"],
"file_id": file_id
}
)
print(
f"\nFound {len(all_sample_vcfs)} VCF files in GRCh38 projects"
)
return all_sample_vcfs
from typing import List, Dict
def find_medicover_vcf_files(projects: List[dict]) -> List[dict]:
"""
Find VCF files in specified projects
Parameters
----------
projects : List[dict]
List of DNAnexus project dictionaries
Returns
-------
all_sample_vcfs : List[dict]
List of dictionaries, each representing a VCF file in DNAnexus
"""
sub_remove = {
'_Wdh': '',
'_Wdh2': '',
'_Whd3': '',
}
all_sample_vcfs = []
for project in projects:
vcf_files = find_data(
"*_markdup_recalibrated_Haplotyper.vcf.gz", project['id']
)
for vcf_file in vcf_files:
file_id = vcf_file["describe"]["id"]
file_name = vcf_file["describe"]["name"]
sample_name = re.sub("|".join(
sub_remove), lambda x: sub_remove[x.group(0)],
file_name.split('-TwistWE')[0]
)
all_sample_vcfs.append(
{
"sample": sample_name,
"project": project["describe"]["id"],
"file_id": file_id
}
)
print(
f"\nFound {len(all_sample_vcfs)} VCF files in GRCh38 projects"
)
return all_sample_vcfs
🧰 Tools
🪛 Ruff (0.8.0)

355-355: Missing return type annotation for public function find_medicover_vcf_files

(ANN201)


355-355: Missing type annotation for function argument projects

(ANN001)


377-377: Trailing comma missing

Add trailing comma

(COM812)


384-384: Trailing comma missing

Add trailing comma

(COM812)


391-391: Trailing comma missing

Add trailing comma

(COM812)


392-392: Trailing comma missing

Add trailing comma

(COM812)


396-396: Trailing comma missing

Add trailing comma

(COM812)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added extra detail but for consistency I will leave the type hints out

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rate Limit Exceeded

@rklocke have exceeded the limit for the number of chat messages per hour. Please wait 2 minutes and 20 seconds before sending another message.

Copy link
Contributor

@RSWilson1 RSWilson1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @coderabbitai[bot] and @rklocke)


DI-435/find_vcfs_to_merge.py line 357 at r1 (raw file):

def find_medicover_vcf_files(projects):
    """
    Find VCF files in a project

maybe add some more context to this docstring


DI-435/find_vcfs_to_merge.py line 369 at r1 (raw file):

        list of dicts, each representing a VCF file in DX
    """
    sub_remove = {

this naming could be clearer


DI-435/find_vcfs_to_merge.py line 381 at r1 (raw file):

        for vcf_file in vcf_files:
            file_id = vcf_file["describe"]["id"]
            file_name = vcf_file["describe"]["name"]

maybe a comment to explain the logic?


DI-435/merge_VCF_AF.sh line 94 at r1 (raw file):

mkdir -p norm
echo "Normalising VCFs"
find . -maxdepth 1 -name '*.vcf.gz' -print0 | xargs -0 -P "${num_proc}" -I{} bcftools norm -m -any -f "${genome}" -Oz "{}" -o norm/"$(basename {})"

NICE! I was trying to get this to work but kept getting odd errors.


DI-435/merge_VCF_AF.sh line 134 at r1 (raw file):

dx upload "final_merged_${file_prefix}.vcf.gz"
dx upload "final_merged_${file_prefix}.vcf.gz.tbi"
_get_peak_usage

This is nifty 👍

export TZ=Europe/London

# set frequency of instance usage in logs to 10 seconds
sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be nice if it works?

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Outside diff range comments (1)
DI-435/merge_VCF_AF.sh (1)

Line range hint 92-133: Add error handling for critical file operations

The file operations lack proper error checking, which could lead to silent failures.

Consider adding checks for critical operations:

 mkdir -p norm
+if [ ! -d "norm" ]; then
+    echo "ERROR: Failed to create normalisation directory"
+    exit 1
+fi
+
 echo "Normalising VCFs"
-find . -maxdepth 1 -name '*.vcf.gz' -print0 | xargs -0 -P "${num_proc}" -I{} bcftools norm -m -any -f "${genome}" -Oz "{}" -o norm/"$(basename {})"
+find . -maxdepth 1 -name '*.vcf.gz' -print0 | xargs -0 -P "${num_proc}" -I{} bash -c '
+    if ! bcftools norm -m -any -f "${genome}" -Oz "{}" -o norm/"$(basename {})"; then
+        echo "ERROR: Failed to normalise {}"
+        exit 1
+    fi
+'
+
+# Verify files were created
+if [ ! "$(ls -A norm/*.vcf.gz 2>/dev/null)" ]; then
+    echo "ERROR: No normalised VCF files were created"
+    exit 1
+fi
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE

📥 Commits

Reviewing files that changed from the base of the PR and between 2a8721e and 9a6fddd.

📒 Files selected for processing (3)
  • DI-435/README.md (1 hunks)
  • DI-435/find_vcfs_to_merge.py (5 hunks)
  • DI-435/merge_VCF_AF.sh (4 hunks)
🧰 Additional context used
📓 Learnings (1)
DI-435/merge_VCF_AF.sh (4)
Learnt from: rklocke
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:58-63
Timestamp: 2024-11-22T12:19:56.287Z
Learning: When improving bash scripts, prefer adding `set -exo pipefail` at the top to enhance error handling and robustness.
Learnt from: RSWilson1
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:60-60
Timestamp: 2024-11-25T16:12:16.377Z
Learning: In `DI-435/merge_VCF_AF.sh`, the script handles cases where no VCF files are found by validating the input file (`input_file`) at the beginning, so additional checks later in the script are unnecessary.
Learnt from: RSWilson1
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:76-78
Timestamp: 2024-11-25T16:01:30.595Z
Learning: In the script `DI-435/merge_VCF_AF.sh`, the presence of VCF files is already ensured by checking during the file download process.
Learnt from: RSWilson1
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:60-60
Timestamp: 2024-11-25T16:01:34.002Z
Learning: In `DI-435/merge_VCF_AF.sh`, the script handles cases where no VCF files are found by checking during the file download step, so additional checks later in the script are unnecessary.
🪛 LanguageTool
DI-435/README.md

[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e no_qc ``` #### Inputs - -a --assay: The GRCh38 project search term for in D...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~38-~38: Loose punctuation mark.
Context: ...e.g. "*CEN38". - -o --outfile_prefix: Prefix to use to name the output files,...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~39-~39: Loose punctuation mark.
Context: ...t files, e.g. CEN38. - -r --run_mode: A choice of whether to find and use QC ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~43-~43: Loose punctuation mark.
Context: ...section below. - -e --end (optional): A date used to find DNAnexus (GRCh37) p...

(UNLIKELY_OPENING_PUNCTUATION)


[style] ~51-~51: Consider removing “of” to be more concise
Context: ...for each of these projects and reads in all of the QC status files into one merged datafra...

(ALL_OF_THE)


[uncategorized] ~72-~72: The preposition “on” seems more likely in this position than the preposition “in”.
Context: ...pt to merge VCFs The bash script is run in a DNAnexus cloud workstation and requir...

(AI_EN_LECTOR_REPLACEMENT_PREPOSITION_IN_ON)

🪛 Markdownlint (0.35.0)
DI-435/README.md

42-42: Expected: 0; Actual: 2
Unordered list indentation

(MD007, ul-indent)


43-43: Expected: 0; Actual: 2
Unordered list indentation

(MD007, ul-indent)


8-8: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


9-9: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


9-9: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


48-48: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


56-56: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


56-56: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


61-61: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


67-67: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


67-67: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


71-71: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


11-11: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


16-16: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


18-18: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


28-28: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


77-77: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


5-5: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


42-42: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


50-50: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


55-55: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


57-57: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


63-63: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


66-66: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


68-68: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


73-73: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


84-84: null
Files should end with a single newline character

(MD047, single-trailing-newline)

🪛 Ruff (0.8.0)
DI-435/find_vcfs_to_merge.py

60-60: Trailing comma missing

Add trailing comma

(COM812)


218-218: Trailing comma missing

Add trailing comma

(COM812)


355-355: Missing return type annotation for public function find_medicover_vcf_files

(ANN201)


355-355: Missing type annotation for function argument projects

(ANN001)


378-378: Trailing comma missing

Add trailing comma

(COM812)


388-388: Trailing comma missing

Add trailing comma

(COM812)


395-395: Trailing comma missing

Add trailing comma

(COM812)


396-396: Trailing comma missing

Add trailing comma

(COM812)


400-400: Trailing comma missing

Add trailing comma

(COM812)


481-481: Trailing comma missing

Add trailing comma

(COM812)


487-487: Trailing comma missing

Add trailing comma

(COM812)


500-500: Trailing comma missing

Add trailing comma

(COM812)


517-517: Trailing comma missing

Add trailing comma

(COM812)


527-527: Trailing comma missing

Add trailing comma

(COM812)


536-536: Trailing comma missing

Add trailing comma

(COM812)


542-542: Trailing comma missing

Add trailing comma

(COM812)


547-547: Trailing comma missing

Add trailing comma

(COM812)


552-552: Trailing comma missing

Add trailing comma

(COM812)


560-560: Trailing comma missing

Add trailing comma

(COM812)


573-573: Trailing comma missing

Add trailing comma

(COM812)


580-580: Trailing comma missing

Add trailing comma

(COM812)


589-589: Trailing comma missing

Add trailing comma

(COM812)


595-595: Trailing comma missing

Add trailing comma

(COM812)


599-599: Trailing comma missing

Add trailing comma

(COM812)


602-602: Trailing comma missing

Add trailing comma

(COM812)


608-608: Trailing comma missing

Add trailing comma

(COM812)

🪛 Shellcheck (0.10.0)
DI-435/merge_VCF_AF.sh

[warning] 23-23: Quote this to prevent word splitting.

(SC2046)


[info] 23-23: Consider using pgrep instead of grepping ps output.

(SC2009)

🔇 Additional comments (7)
DI-435/merge_VCF_AF.sh (2)

15-20: 🛠️ Refactor suggestion

Optimise logging setup for improved reliability

The logging configuration requires refinement:

  1. The PS4 format using octal escapes (\000, \011) may not be portable across all environments
  2. The timestamp format lacks millisecond precision, which could be crucial for debugging

Apply this diff to enhance logging reliability:

-PS4='\000[$(date)]\011'
+PS4='[$(date +"%Y-%m-%d %H:%M:%S.%3N")] '

22-24: 🛠️ Refactor suggestion

Improve process termination using 'pkill'

The current process termination approach is overly complex and potentially fragile.

🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 23-23: Quote this to prevent word splitting.

(SC2046)


[info] 23-23: Consider using pgrep instead of grepping ps output.

(SC2009)

DI-435/README.md (2)

4-6: Implementation modes are well-documented.

The distinction between find_qc and no_qc modes is clearly explained, providing users with a solid understanding of the operational differences.

🧰 Tools
🪛 Markdownlint (0.35.0)

5-5: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


48-69: Mode-specific documentation is thorough and well-structured.

The step-by-step process and output documentation for both modes provide clear guidance for users. The consistent file naming convention across modes is particularly helpful.

🧰 Tools
🪛 LanguageTool

[style] ~51-~51: Consider removing “of” to be more concise
Context: ...for each of these projects and reads in all of the QC status files into one merged datafra...

(ALL_OF_THE)

🪛 Markdownlint (0.35.0)

48-48: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


56-56: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


56-56: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


61-61: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


67-67: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


67-67: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


50-50: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


55-55: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


57-57: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


63-63: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


66-66: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


68-68: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

DI-435/find_vcfs_to_merge.py (3)

60-60: 🛠️ Refactor suggestion

Add missing trailing comma after the help parameter

Adding a trailing comma improves code style compliance and may prevent potential issues.

Apply this diff:

             )
+        ,
🧰 Tools
🪛 Ruff (0.8.0)

60-60: Trailing comma missing

Add trailing comma

(COM812)


355-355: 🛠️ Refactor suggestion

Add type annotations to find_medicover_vcf_files

Adding type annotations improves code clarity and assists with static analysis.

Apply these changes:

  • Import List from typing at the top of the file:
+from typing import List
  • Update the function definition:
-def find_medicover_vcf_files(projects):
+def find_medicover_vcf_files(projects: List[dict]) -> List[dict]:
🧰 Tools
🪛 Ruff (0.8.0)

355-355: Missing return type annotation for public function find_medicover_vcf_files

(ANN201)


355-355: Missing type annotation for function argument projects

(ANN001)


481-481: 🛠️ Refactor suggestion

Add missing trailing comma after the print argument

Including a trailing comma enhances code readability and complies with style guidelines.

Apply this diff:

     print(
         f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}"
+    ,
     )
🧰 Tools
🪛 Ruff (0.8.0)

481-481: Trailing comma missing

Add trailing comma

(COM812)

Comment on lines +66 to +81
while IFS=$'\t' read -r _ field2 field3; do
((line_number++))

# Validate fields are not empty
if [ -z "$field2" ] || [ -z "$field3" ]; then
echo "ERROR: Missing project or file ID at line $line_number"
exit 1
fi

# Validate field format (assuming they should match dx:// format)
if [[ ! "$field2" =~ ^project-.* ]] || [[ ! "$field3" =~ ^file-.* ]]; then
echo "ERROR: Invalid project or file ID format at line $line_number"
exit 1
fi
project_files+=("${field2}:${field3}")

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Enhance input validation robustness

While the validation logic is sound, we can improve its maintainability and clarity.

Consider this enhancement:

+# Define validation patterns
+readonly PROJECT_ID_PATTERN='^project-[a-zA-Z0-9]{24}$'
+readonly FILE_ID_PATTERN='^file-[a-zA-Z0-9]{24}$'
+
 while IFS=$'\t' read -r _ field2 field3; do
     ((line_number++))
 
     # Validate fields are not empty
     if [ -z "$field2" ] || [ -z "$field3" ]; then
         echo "ERROR: Missing project or file ID at line $line_number"
         exit 1
     fi
 
     # Validate field format
-    if [[ ! "$field2" =~ ^project-.* ]] || [[ ! "$field3" =~ ^file-.* ]]; then
+    if [[ ! "$field2" =~ $PROJECT_ID_PATTERN ]] || [[ ! "$field3" =~ $FILE_ID_PATTERN ]]; then
         echo "ERROR: Invalid project or file ID format at line $line_number"
         exit 1
     fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
while IFS=$'\t' read -r _ field2 field3; do
((line_number++))
# Validate fields are not empty
if [ -z "$field2" ] || [ -z "$field3" ]; then
echo "ERROR: Missing project or file ID at line $line_number"
exit 1
fi
# Validate field format (assuming they should match dx:// format)
if [[ ! "$field2" =~ ^project-.* ]] || [[ ! "$field3" =~ ^file-.* ]]; then
echo "ERROR: Invalid project or file ID format at line $line_number"
exit 1
fi
project_files+=("${field2}:${field3}")
# Define validation patterns
readonly PROJECT_ID_PATTERN='^project-[a-zA-Z0-9]{24}$'
readonly FILE_ID_PATTERN='^file-[a-zA-Z0-9]{24}$'
while IFS=$'\t' read -r _ field2 field3; do
((line_number++))
# Validate fields are not empty
if [ -z "$field2" ] || [ -z "$field3" ]; then
echo "ERROR: Missing project or file ID at line $line_number"
exit 1
fi
# Validate field format
if [[ ! "$field2" =~ $PROJECT_ID_PATTERN ]] || [[ ! "$field3" =~ $FILE_ID_PATTERN ]]; then
echo "ERROR: Invalid project or file ID format at line $line_number"
exit 1
fi
project_files+=("${field2}:${field3}")

Comment on lines +26 to +40
_get_peak_usage() {
: '''
Reports the peak memory and storage usage from dstat, to be called at end of script
'''
dx watch "$DX_JOB_ID" --no-follow --quiet > job.log

peak_mem=$(grep 'INFO CPU' job.log | cut -d':' -f5 | cut -d'/' -f1 | sort -n | tail -n1)
total_mem="$(($(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*')/1024))"

peak_storage=$(grep 'INFO CPU' job.log | cut -d':' -f6 | cut -d'/' -f1 | sort -n | tail -n1)
total_storage=$(df -Pk / | awk 'NR == 2' | awk '{printf("%.0f", $2/1024/1024)}')

echo "Memory usage peaked at ${peak_mem}/${total_mem}MB"
echo "Storage usage peaked at ${peak_storage}/${total_storage}GB"
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Enhance error handling in peak usage monitoring

The peak usage function requires additional error handling to ensure reliability:

  1. No validation of dx watch command success
  2. Missing checks for memory info availability
  3. No cleanup of temporary job.log file

Apply this enhancement:

 _get_peak_usage() {
     : '''
     Reports the peak memory and storage usage from dstat, to be called at end of script
     '''
-    dx watch "$DX_JOB_ID" --no-follow --quiet > job.log
+    if ! dx watch "$DX_JOB_ID" --no-follow --quiet > job.log; then
+        echo "ERROR: Failed to retrieve job metrics"
+        return 1
+    fi
+
+    if [ ! -s job.log ]; then
+        echo "ERROR: Job metrics log is empty"
+        return 1
+    fi
 
     peak_mem=$(grep 'INFO CPU' job.log | cut -d':' -f5 | cut -d'/' -f1 | sort -n | tail -n1)
-    total_mem="$(($(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*')/1024))"
+    mem_total=$(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*' || echo 0)
+    if [ "$mem_total" -eq 0 ]; then
+        echo "ERROR: Failed to retrieve total memory"
+        return 1
+    fi
+    total_mem="$((mem_total/1024))"
 
     peak_storage=$(grep 'INFO CPU' job.log | cut -d':' -f6 | cut -d'/' -f1 | sort -n | tail -n1)
     total_storage=$(df -Pk / | awk 'NR == 2' | awk '{printf("%.0f", $2/1024/1024)}')
 
     echo "Memory usage peaked at ${peak_mem}/${total_mem}MB"
     echo "Storage usage peaked at ${peak_storage}/${total_storage}GB"
+    rm -f job.log
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
_get_peak_usage() {
: '''
Reports the peak memory and storage usage from dstat, to be called at end of script
'''
dx watch "$DX_JOB_ID" --no-follow --quiet > job.log
peak_mem=$(grep 'INFO CPU' job.log | cut -d':' -f5 | cut -d'/' -f1 | sort -n | tail -n1)
total_mem="$(($(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*')/1024))"
peak_storage=$(grep 'INFO CPU' job.log | cut -d':' -f6 | cut -d'/' -f1 | sort -n | tail -n1)
total_storage=$(df -Pk / | awk 'NR == 2' | awk '{printf("%.0f", $2/1024/1024)}')
echo "Memory usage peaked at ${peak_mem}/${total_mem}MB"
echo "Storage usage peaked at ${peak_storage}/${total_storage}GB"
}
_get_peak_usage() {
: '''
Reports the peak memory and storage usage from dstat, to be called at end of script
'''
if ! dx watch "$DX_JOB_ID" --no-follow --quiet > job.log; then
echo "ERROR: Failed to retrieve job metrics"
return 1
fi
if [ ! -s job.log ]; then
echo "ERROR: Job metrics log is empty"
return 1
fi
peak_mem=$(grep 'INFO CPU' job.log | cut -d':' -f5 | cut -d'/' -f1 | sort -n | tail -n1)
mem_total=$(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*' || echo 0)
if [ "$mem_total" -eq 0 ]; then
echo "ERROR: Failed to retrieve total memory"
return 1
fi
total_mem="$((mem_total/1024))"
peak_storage=$(grep 'INFO CPU' job.log | cut -d':' -f6 | cut -d'/' -f1 | sort -n | tail -n1)
total_storage=$(df -Pk / | awk 'NR == 2' | awk '{printf("%.0f", $2/1024/1024)}')
echo "Memory usage peaked at ${peak_mem}/${total_mem}MB"
echo "Storage usage peaked at ${peak_storage}/${total_storage}GB"
rm -f job.log
}

Comment on lines +27 to +33
Example command to find Medicover VCFs with no QC status files:
```bash
python3 find_vcfs_to_merge.py \
--assay "*TWE38M" \
--outfile_prefix TWE38M \
--run_mode no_qc
```
Copy link

@coderabbitai coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Consider clarifying date parameter usage in no_qc mode.

While the Medicover example effectively demonstrates the no_qc mode, it would be helpful to explicitly state whether date parameters are supported or ignored in this mode.

🧰 Tools
🪛 Markdownlint (0.35.0)

28-28: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already clarified that the inputs are only allowed in find_qc mode

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rklocke, understood. Thanks for clarifying.


✏️ Learnings added
Learnt from: rklocke
PR: eastgenomics/RD_requests#19
File: DI-435/README.md:27-33
Timestamp: 2024-11-29T09:56:48.931Z
Learning: In the `find_vcfs_to_merge.py` script within `DI-435/`, the date parameters `--start` and `--end` are only allowed in the `find_qc` mode.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +37 to +43
- `-a --assay`: The GRCh38 project search term for in DNAnexus, e.g. `"*CEN38"`.
- `-o --outfile_prefix`: Prefix to use to name the output files, e.g. `CEN38`.
- `-r --run_mode`: A choice of whether to find and use QC status files (`find_qc`) or not (`no_qc`).

In `find_qc` mode the following inputs are allowed:
- `-s --start (optional)`: A date used to find DNAnexus (GRCh37) projects created after, see searching dates section below.
- `-e --end (optional)`: A date used to find DNAnexus (GRCh37) projects created before, see searching dates section below.
Copy link

@coderabbitai coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Consider marking required parameters.

While the documentation clearly explains each parameter, it would be beneficial to explicitly mark which parameters are required for each mode. For example:

  • -a --assay: [REQUIRED] The GRCh38 project search term...
  • -o --outfile_prefix: [REQUIRED] Prefix to use...
  • -r --run_mode: [REQUIRED] A choice of whether...
🧰 Tools
🪛 LanguageTool

[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e no_qc ``` #### Inputs - -a --assay: The GRCh38 project search term for in D...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~38-~38: Loose punctuation mark.
Context: ...e.g. "*CEN38". - -o --outfile_prefix: Prefix to use to name the output files,...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~39-~39: Loose punctuation mark.
Context: ...t files, e.g. CEN38. - -r --run_mode: A choice of whether to find and use QC ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~43-~43: Loose punctuation mark.
Context: ...section below. - -e --end (optional): A date used to find DNAnexus (GRCh37) p...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 Markdownlint (0.35.0)

42-42: Expected: 0; Actual: 2
Unordered list indentation

(MD007, ul-indent)


43-43: Expected: 0; Actual: 2
Unordered list indentation

(MD007, ul-indent)


42-42: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's clear already

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rklocke, understood.

(_/)
( ._.)
('')('')

Comment on lines 70 to 84
### Bash script to merge VCFs
The bash script is run in a DNAnexus cloud workstation and requires positional inputs of:
- The output file generated from the Python script above
- The job ID for the cloud workstation running
- The reference genome for GRCh38

Example bash script command:
`bash merge_VCF_AF.sh CEN38_vcf_to_merge.txt job-Gjb39Z04bxf82XZ12gPJ2bbV GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta`
```bash
bash merge_VCF_AF.sh \
CEN38_files_to_merge.txt \
GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
```

Output:
Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within. No newline at end of file
Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
Copy link

@coderabbitai coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Consider adding operational prerequisites.

While the bash script documentation is clear, it would be beneficial to add:

  1. Required permissions or roles for DNAnexus cloud workstation access
  2. Estimated runtime or resource requirements
  3. Any cleanup steps needed after execution
🧰 Tools
🪛 LanguageTool

[uncategorized] ~72-~72: The preposition “on” seems more likely in this position than the preposition “in”.
Context: ...pt to merge VCFs The bash script is run in a DNAnexus cloud workstation and requir...

(AI_EN_LECTOR_REPLACEMENT_PREPOSITION_IN_ON)

🪛 Markdownlint (0.35.0)

71-71: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)


77-77: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


73-73: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


84-84: null
Files should end with a single newline character

(MD047, single-trailing-newline)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rklocke, understood.


✏️ Learnings added
Learnt from: rklocke
PR: eastgenomics/RD_requests#19
File: DI-435/README.md:84-84
Timestamp: 2024-11-29T09:56:55.375Z
Learning: In `DI-435/README.md`, operational prerequisites for the bash script `merge_VCF_AF.sh`, such as required permissions, estimated runtime, resource requirements, and cleanup steps, are not required in the documentation.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +371 to +374
'_Wdh': '',
'_Wdh2': '',
'_Whd3': '',
}
Copy link

@coderabbitai coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Correct the potential typo in strings_to_remove dictionary

The key '_Whd3' might be a typo and should perhaps be '_Wdh3' to maintain consistency with '_Wdh' and '_Wdh2'.

Apply this diff to correct the typo:

     strings_to_remove = {
         '_Wdh': '',
         '_Wdh2': '',
-        '_Whd3': '',
+        '_Wdh3': '',
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
'_Wdh': '',
'_Wdh2': '',
'_Whd3': '',
}
'_Wdh': '',
'_Wdh2': '',
'_Wdh3': '',
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually not a typo

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rklocke, understood. Thanks for the clarification—I'll note that '_Whd3' is correct.


✏️ Learnings added
Learnt from: rklocke
PR: eastgenomics/RD_requests#19
File: DI-435/find_vcfs_to_merge.py:371-374
Timestamp: 2024-11-29T09:57:02.044Z
Learning: In `DI-435/find_vcfs_to_merge.py`, within the `strings_to_remove` dictionary in the `find_medicover_vcf_files` function, the key `'_Whd3'` is correct and intentional.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Collaborator Author

@rklocke rklocke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Dismissed @coderabbitai[bot] from a discussion.
Reviewable status: 0 of 3 files reviewed, 19 unresolved discussions (waiting on @coderabbitai[bot] and @RSWilson1)


DI-435/find_vcfs_to_merge.py line 357 at r1 (raw file):

Previously, RSWilson1 (Robert Wilson) wrote…

maybe add some more context to this docstring

Done.


DI-435/find_vcfs_to_merge.py line 369 at r1 (raw file):

Previously, RSWilson1 (Robert Wilson) wrote…

this naming could be clearer

Done.


DI-435/find_vcfs_to_merge.py line 381 at r1 (raw file):

Previously, RSWilson1 (Robert Wilson) wrote…

maybe a comment to explain the logic?

Done.

Comment on lines 51 to 61
parser.add_argument(
"-r",
"--run_mode",
choices=['find_qc', 'no_qc'],
required=True,
help=(
"Runmode - either 'find_qc' where QC status files are "
"searched for and used to find failed samples, or 'no_qc'"
"where no QC status files are searched for"
)
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 355 to 400
def find_medicover_vcf_files(projects):
"""
Find VCF files in a project
Parameters
----------
project_id : str
DX project ID
Returns
-------
vcf_files : list
list of dicts, each representing a VCF file in DX
"""
sub_remove = {
'_Wdh': '',
'_Wdh2': '',
'_Whd3': '',
}
all_sample_vcfs = []
for project in projects:
vcf_files = find_data(
"*_markdup_recalibrated_Haplotyper.vcf.gz", project['id']
)
for vcf_file in vcf_files:
file_id = vcf_file["describe"]["id"]
file_name = vcf_file["describe"]["name"]
sample_name = re.sub("|".join(
sub_remove), lambda x: sub_remove[x.group(0)],
file_name.split('-TwistWE')[0]
)

all_sample_vcfs.append(
{
"sample": sample_name,
"project": project["describe"]["id"],
"file_id": file_id
}
)

print(
f"\nFound {len(all_sample_vcfs)} VCF files in GRCh38 projects"
)

return all_sample_vcfs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added extra detail but for consistency I will leave the type hints out

Comment on lines +476 to +477
print(
f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed

Comment on lines +371 to +374
'_Wdh': '',
'_Wdh2': '',
'_Whd3': '',
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually not a typo

Comment on lines +19 to +20
PS4='\000[$(date)]\011'
export TZ=Europe/London
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works fine as is

DI-435/README.md Outdated
Comment on lines 46 to 68
#### find_qc mode
How the script works in `find_qc` mode:
1. Finds all DNAnexus projects with suffix `--assay`.
2. Finds the related GRCh37 project (and between `--start` and `--end` dates, if provided) for each of these projects and reads in all of the QC status files into one merged dataframe. If multiple QC status files exist in a project, the one created last is used.
3. Finds all raw VCFs in each of the DNAnexus projects.
4. Splits these VCFs into a list of validation (including control) samples and non-validation samples based on naming conventions
5. Removes any samples which are duplicates or any which failed QC at any time based on information within the QC status files.
6. Creates a final list of all VCFs to merge and writes this out to file with`--outfile_prefix`.
5. Removes the first instance of a sample which is duplicated or any which failed QC at any time based on information within the QC status files.
6. Creates a final list of all VCFs to merge and writes this out to file with `--outfile_prefix`.
##### Outputs
- A TSV listing the VCF files for all non-validation samples to merge (named by `{outfile_prefix}_files_to_merge.txt`)
- A CSV of all validation samples found (`{outfile_prefix}_validation_samples.csv`)
- A CSV of all projects within search but missing QC file in DNAnexus and therefore not included (`{outfile_prefix}_projects_missing_QC.csv`).


#### no_qc mode
How the script works in `no_qc` mode:
1. Finds all DNAnexus projects with suffix `--assay`
2. Finds all raw VCFs in each of the DNAnexus projects.
3. Removes all samples which are duplicates.
4. Creates a final list of all VCFs to merge and writes this to file with `--outfile_prefix`.
##### Outputs
- A TSV listing the VCF files to merge (named by `{outfile_prefix}_files_to_merge.txt`)
- A TSV of all duplicate files which have been removed (`{outfile_prefix}_all_dup_rows.txt`)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

DI-435/README.md Outdated
Comment on lines 40 to 83
```
bash merge_VCF_AF.sh \
CEN38_files_to_merge.txt \
GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
```

Output:
Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within. No newline at end of file
Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +27 to +33
Example command to find Medicover VCFs with no QC status files:
```bash
python3 find_vcfs_to_merge.py \
--assay "*TWE38M" \
--outfile_prefix TWE38M \
--run_mode no_qc
```
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already clarified that the inputs are only allowed in find_qc mode

Comment on lines +37 to +43
- `-a --assay`: The GRCh38 project search term for in DNAnexus, e.g. `"*CEN38"`.
- `-o --outfile_prefix`: Prefix to use to name the output files, e.g. `CEN38`.
- `-r --run_mode`: A choice of whether to find and use QC status files (`find_qc`) or not (`no_qc`).

In `find_qc` mode the following inputs are allowed:
- `-s --start (optional)`: A date used to find DNAnexus (GRCh37) projects created after, see searching dates section below.
- `-e --end (optional)`: A date used to find DNAnexus (GRCh37) projects created before, see searching dates section below.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's clear already

Comment on lines 70 to 84
### Bash script to merge VCFs
The bash script is run in a DNAnexus cloud workstation and requires positional inputs of:
- The output file generated from the Python script above
- The job ID for the cloud workstation running
- The reference genome for GRCh38

Example bash script command:
`bash merge_VCF_AF.sh CEN38_vcf_to_merge.txt job-Gjb39Z04bxf82XZ12gPJ2bbV GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta`
```bash
bash merge_VCF_AF.sh \
CEN38_files_to_merge.txt \
GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
```

Output:
Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within. No newline at end of file
Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required

Copy link
Contributor

@RSWilson1 RSWilson1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, dismissed some coderabbit comments as not needed.

Reviewed 3 of 3 files at r2, all commit messages.
Dismissed @coderabbitai[bot] from a discussion.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @rklocke)

export TZ=Europe/London

# set frequency of instance usage in logs to 10 seconds
sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's fine then

Copy link
Contributor

@RSWilson1 RSWilson1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @rklocke)

Copy link
Contributor

@RSWilson1 RSWilson1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @coderabbitai[bot] from a discussion.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @rklocke)

Copy link
Contributor

@RSWilson1 RSWilson1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dismissed @coderabbitai[bot] from 14 discussions.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @rklocke)

@RSWilson1 RSWilson1 merged commit 97df255 into main Nov 29, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants