DI-920 Changes for Medicover internal AF VCF #19

rklocke · 2024-11-26T15:55:36Z

Add run mode choices where in mode 'no_qc' we can find Medicover VCFs and not find the QC status files in the GRCh37 project like we would for CEN or TWE
Small changes to bash script to give better logging and check memory usage etc, use fields 2 and 3 of input file (since we longer output the pandas index from the Python file) and name output file by input file
Update README for changes

This change is

Summary by CodeRabbit

New Features
- Introduced a new command-line argument --run_mode for the VCF merging script, allowing users to choose between find_qc and no_qc modes.
- Added functionality to locate and process VCF files directly.
- Enhanced logging and resource usage reporting in the merging script.
Documentation
- Updated README to clarify operational modes and provide example commands, improving user understanding of the scripts.
Bug Fixes
- Improved error handling and output messaging for project handling in the VCF processing script.
Chores
- Updated output file naming conventions for clarity and consistency.

pep8speaks · 2024-11-26T15:55:43Z

Hello @rklocke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file DI-435/find_vcfs_to_merge.py:

Line 514:80: E501 line too long (88 > 79 characters)

Comment last updated at 2024-11-29 09:42:59 UTC

coderabbitai · 2024-11-26T15:55:49Z

Walkthrough

This pull request introduces significant updates to the documentation and scripts involved in processing VCF files. The README.md now delineates two operational modes for the find_vcfs_to_merge.py script: find_qc and no_qc. The script has been enhanced with a new --run_mode parameter, allowing users to specify how VCF files are processed. Additionally, the merge_VCF_AF.sh script has been improved with better logging and memory usage reporting. These changes collectively enhance clarity, usability, and functionality.

Changes

File	Change Summary
DI-435/README.md	Updated to clarify operational modes (`find_qc`, `no_qc`), added example commands, reformatted inputs, and updated output descriptions.
DI-435/find_vcfs_to_merge.py	Added `--run_mode` parameter, restructured `main` function for mode handling, refined `get_qc_files` logic, and introduced `find_medicover_vcf_files` function.
DI-435/merge_VCF_AF.sh	Enhanced logging with timestamps, added `_get_peak_usage` function for resource reporting, modified input handling, and updated output filenames.

Possibly related PRs

Di 435 bash improvements #17: The changes in merge_VCF_AF.sh regarding the handling of VCF files and improvements in error handling are relevant as they complement the updates made in the main PR to the same script, enhancing its overall functionality and robustness.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 11

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE

📥 Commits

Reviewing files that changed from the base of the PR and between fcf4410 and 2a8721e.

📒 Files selected for processing (3)

DI-435/README.md (1 hunks)
DI-435/find_vcfs_to_merge.py (5 hunks)
DI-435/merge_VCF_AF.sh (4 hunks)

🧰 Additional context used

📓 Learnings (1)

DI-435/merge_VCF_AF.sh (1)

Learnt from: rklocke
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:58-63
Timestamp: 2024-11-22T12:19:56.287Z
Learning: When improving bash scripts, prefer adding `set -exo pipefail` at the top to enhance error handling and robustness.

🪛 LanguageTool

DI-435/README.md

[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e.g. "*CEN38". - -o --outfile_prefix: Prefix to use to name the output files,...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~38-~38: Loose punctuation mark.
Context: ...t files, e.g. CEN38. - -r --run_mode: A choice of whether to find and use QC ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~41-~41: Loose punctuation mark.
Context: ...section below. - -e --end (optional): A date used to find DNAnexus (GRCh37) p...

(UNLIKELY_OPENING_PUNCTUATION)

[style] ~49-~49: Consider removing “of” to be more concise
Context: ...for each of these projects and reads in all of the QC status files into one merged datafra...

(ALL_OF_THE)

🪛 Markdownlint (0.35.0)

DI-435/README.md

59-59: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)

8-8: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

9-9: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

9-9: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

35-35: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

46-46: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

54-54: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

54-54: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

60-60: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

66-66: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

66-66: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

70-70: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

11-11: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

16-16: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

18-18: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

28-28: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

76-76: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

5-5: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

36-36: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

48-48: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

53-53: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

55-55: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

62-62: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

65-65: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

67-67: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

72-72: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

18-18: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

28-28: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

76-76: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

83-83: null
Files should end with a single newline character

(MD047, single-trailing-newline)

🪛 Ruff (0.8.0)

DI-435/find_vcfs_to_merge.py

60-60: Trailing comma missing

Add trailing comma

(COM812)

218-218: Trailing comma missing

Add trailing comma

(COM812)

355-355: Missing return type annotation for public function find_medicover_vcf_files

(ANN201)

355-355: Missing type annotation for function argument projects

(ANN001)

377-377: Trailing comma missing

Add trailing comma

(COM812)

384-384: Trailing comma missing

Add trailing comma

(COM812)

391-391: Trailing comma missing

Add trailing comma

(COM812)

392-392: Trailing comma missing

Add trailing comma

(COM812)

396-396: Trailing comma missing

Add trailing comma

(COM812)

477-477: Trailing comma missing

Add trailing comma

(COM812)

483-483: Trailing comma missing

Add trailing comma

(COM812)

496-496: Trailing comma missing

Add trailing comma

(COM812)

513-513: Trailing comma missing

Add trailing comma

(COM812)

523-523: Trailing comma missing

Add trailing comma

(COM812)

532-532: Trailing comma missing

Add trailing comma

(COM812)

538-538: Trailing comma missing

Add trailing comma

(COM812)

543-543: Trailing comma missing

Add trailing comma

(COM812)

548-548: Trailing comma missing

Add trailing comma

(COM812)

556-556: Trailing comma missing

Add trailing comma

(COM812)

569-569: Trailing comma missing

Add trailing comma

(COM812)

576-576: Trailing comma missing

Add trailing comma

(COM812)

585-585: Trailing comma missing

Add trailing comma

(COM812)

591-591: Trailing comma missing

Add trailing comma

(COM812)

595-595: Trailing comma missing

Add trailing comma

(COM812)

598-598: Trailing comma missing

Add trailing comma

(COM812)

604-604: Trailing comma missing

Add trailing comma

(COM812)

🪛 Shellcheck (0.10.0)

DI-435/merge_VCF_AF.sh

[warning] 23-23: Quote this to prevent word splitting.

(SC2046)

[info] 23-23: Consider using pgrep instead of grepping ps output.

(SC2009)

[info] 30-30: Double quote to prevent globbing and word splitting.

(SC2086)

🔇 Additional comments (1)

DI-435/merge_VCF_AF.sh (1)

15-17: Logging output is properly directed for traceability

Excellent work directing stdout and stderr to log files. This will help us keep track of the script's execution, much like keeping a meticulous recipe book.

coderabbitai · 2024-11-26T15:58:52Z

DI-435/README.md

+```
+python3 find_vcfs_to_merge.py \
+    --assay "*CEN38" \
+    --outfile_prefix CEN38 \
+    --run_mode find_qc
+```
+Example command to find TWE GRCh38 VCFs where the GRCh37 runs were between two dates:
+```
+python3 find_vcfs_to_merge.py \
+    --assay "*TWE38" \
+    --outfile_prefix TWE38 \
+    --run_mode find_qc \
+    --start 2024-03-15 \
+    --end 2024-08-04
+```

-**Searching Dates**
-These dates restrict the projects collated to only GRCh38 projects
-which have corresponding GRCh37 projects which were created within the specified dates.
+Example command to find Medicover VCFs with no QC status files:
+```
+python3 find_vcfs_to_merge.py \
+    --assay "*TWE38M" \
+    --outfile_prefix TWE38M \
+    --run_mode no_qc
+```


🧹 Nitpick (assertive)

Ah, these examples need the proper garnish, mon ami!

Just like we label our mise en place, we must label our code blocks! They're missing their language specifiers, and the presentation needs proper spacing.

Add language specifiers and proper spacing to the code blocks:

-``` +```bash python3 find_vcfs_to_merge.py \

Apply this to all three code blocks and add blank lines before and after each block for proper presentation.

🧰 Tools

🪛 Markdownlint (0.35.0)

11-11: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

16-16: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

18-18: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

28-28: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

11-11: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

18-18: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

28-28: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

@rklocke Thanks for making the updates.

If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!

coderabbitai · 2024-11-26T15:58:52Z

DI-435/README.md

+#### Inputs
+- `-a --assay`: The GRCh38 project search term for in DNAnexus, e.g. `"*CEN38"`.
+- `-o --outfile_prefix`: Prefix to use to name the output files, e.g. `CEN38`.
+- `-r --run_mode`: A choice of whether to find and use QC status files (`find_qc`) or not (`no_qc`).
+In `find_qc` mode the following inputs are allowed:
+  - `-s --start (optional)`: A date used to find DNAnexus (GRCh37) projects created after, see searching dates section below.
+  - `-e --end (optional)`: A date used to find DNAnexus (GRCh37) projects created before, see searching dates section below.


🧹 Nitpick (assertive)

The ingredients list is perfect, but the presentation needs finesse!

Your parameter descriptions are clear as consommé, but the formatting could use a touch of refinement. Let's add proper spacing around the headings and lists, just as we maintain proper spacing between courses.

Add blank lines:

Before and after the "Inputs" heading

Before and after the parameter list

Between the main parameters and the find_qc mode specific parameters

🧰 Tools

🪛 LanguageTool

[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e.g. "*CEN38". - -o --outfile_prefix: Prefix to use to name the output files,...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~38-~38: Loose punctuation mark.
Context: ...t files, e.g. CEN38. - -r --run_mode: A choice of whether to find and use QC ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~41-~41: Loose punctuation mark.
Context: ...section below. - -e --end (optional): A date used to find DNAnexus (GRCh37) p...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 Markdownlint (0.35.0)

35-35: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

36-36: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

@rklocke, thanks for making the changes!

(^_^)/

If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!

coderabbitai · 2024-11-26T15:58:52Z

DI-435/README.md

+#### find_qc mode
+How the script works in `find_qc` mode:
 1. Finds all DNAnexus projects with suffix `--assay`.
 2. Finds the related GRCh37 project (and between `--start` and `--end` dates, if provided) for each of these projects and reads in all of the QC status files into one merged dataframe. If multiple QC status files exist in a project, the one created last is used.
 3. Finds all raw VCFs in each of the DNAnexus projects.
 4. Splits these VCFs into a list of validation (including control) samples and non-validation samples based on naming conventions
-5. Removes any samples which are duplicates or any which failed QC at any time based on information within the QC status files.
-6. Creates a final list of all VCFs to merge and writes this out to file with`--outfile_prefix`.
+5. Removes the first instance of a sample which is duplicated or any which failed QC at any time based on information within the QC status files.
+6. Creates a final list of all VCFs to merge and writes this out to file with `--outfile_prefix`.
+##### Outputs
+- A TSV listing the VCF files for all non-validation samples to merge (named by `{outfile_prefix}_files_to_merge.txt`)
+- A CSV of all validation samples found (`{outfile_prefix}_validation_samples.csv`)
+- A CSV of all projects within search but missing QC file in DNAnexus and therefore not included (`{outfile_prefix}_projects_missing_QC.csv`).
+
+
+#### no_qc mode
+How the script works in `no_qc` mode:
+1. Finds all DNAnexus projects with suffix `--assay`
+2. Finds all raw VCFs in each of the DNAnexus projects.
+3. Removes all samples which are duplicates.
+4. Creates a final list of all VCFs to merge and writes this to file with `--outfile_prefix`.
+##### Outputs
+- A TSV listing the VCF files to merge (named by `{outfile_prefix}_files_to_merge.txt`)
+- A TSV of all duplicate files which have been removed (`{outfile_prefix}_all_dup_rows.txt`)


🧹 Nitpick (assertive)

The recipe is perfect, but the plating needs attention!

Your mode descriptions are as detailed as a complex reduction sauce, but we've got some unnecessary blank lines that would never pass in my kitchen! And those headings need proper spacing, just like we space our courses!

Remove the double blank lines around line 59

Add proper spacing around all headings and lists

Ensure consistent formatting between both mode sections

🧰 Tools

🪛 LanguageTool

[style] ~49-~49: Consider removing “of” to be more concise
Context: ...for each of these projects and reads in all of the QC status files into one merged datafra...

(ALL_OF_THE)

🪛 Markdownlint (0.35.0)

59-59: Expected: 1; Actual: 2
Multiple consecutive blank lines

(MD012, no-multiple-blanks)

46-46: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

54-54: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

54-54: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

60-60: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

66-66: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

66-66: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

48-48: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

53-53: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

55-55: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

62-62: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

65-65: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

67-67: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

Rate Limit Exceeded

@rklocke have exceeded the limit for the number of chat messages per hour. Please wait 2 minutes and 16 seconds before sending another message.

coderabbitai · 2024-11-26T15:58:52Z

DI-435/README.md

+This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:
+- `find_qc` searches for GRCh38 projects and finds the related GRCh37 project to retrieve the QC status file (used to remove failed samples) and keep only the first instance of any duplicated samples.
+- `no_qc` does not search for QC status files (keeps all files) but removes any duplicated samples completely.


🧹 Nitpick (assertive)

Magnifique! The description is perfectly seasoned, but let's garnish it properly!

The explanation of the two modes is chef's kiss perfect, but just like a properly plated dish needs proper spacing, our markdown needs proper formatting.

Add a blank line before the list for proper markdown presentation:

This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented: + - `find_qc` searches for GRCh38 projects...

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:

- `find_qc` searches for GRCh38 projects and finds the related GRCh37 project to retrieve the QC status file (used to remove failed samples) and keep only the first instance of any duplicated samples.

- `no_qc` does not search for QC status files (keeps all files) but removes any duplicated samples completely.

This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:

- `find_qc` searches for GRCh38 projects and finds the related GRCh37 project to retrieve the QC status file (used to remove failed samples) and keep only the first instance of any duplicated samples.

- `no_qc` does not search for QC status files (keeps all files) but removes any duplicated samples completely.

🧰 Tools

🪛 Markdownlint (0.35.0)

5-5: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

coderabbitai · 2024-11-26T15:58:53Z

DI-435/README.md

+```
+bash merge_VCF_AF.sh \
+    CEN38_files_to_merge.txt \
+    GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
+```

 Output:
-Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
+Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.


🧹 Nitpick (assertive)

The final course needs its proper presentation, non?

The bash script example is like a perfect soufflé, but it's missing its proper garnish! And where's our proper ending? Every perfect meal needs a proper finish!

Add the bash language specifier to the code block:

-``` +```bash bash merge_VCF_AF.sh \

Ensure the file ends with a single newline character

Add proper spacing around the code block

🧰 Tools

🪛 Markdownlint (0.35.0)

76-76: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

76-76: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

83-83: null
Files should end with a single newline character

(MD047, single-trailing-newline)

@rklocke, great, thanks for implementing the changes promptly!

(_/)
( ^.^ )
/(")(")

If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!

coderabbitai · 2024-11-26T15:58:53Z

DI-435/merge_VCF_AF.sh

+PS4='\000[$(date)]\011'
+export TZ=Europe/London


🛠️ Refactor suggestion

Refine the PS4 prompt for accurate command tracing

Darling, while you've seasoned the script with timestamps for command tracing, the use of \000 and \011 may not serve us the rich flavour we're aiming for. Let's adjust the PS4 variable to ensure our logs are as impeccable as our finest dishes.

Apply this diff to correct the PS4 prompt:

-PS4='\000[$(date)]\011' +PS4='[$(date +"%Y-%m-%d %H:%M:%S")] '

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

PS4='\000[$(date)]\011'

export TZ=Europe/London

PS4='[$(date +"%Y-%m-%d %H:%M:%S")] '

export TZ=Europe/London

This works fine as is

@rklocke, understood. I'll note this for future reviews.

✏️ Learnings added

Learnt from: rklocke PR: eastgenomics/RD_requests#19 File: DI-435/merge_VCF_AF.sh:19-20 Timestamp: 2024-11-29T09:56:53.623Z Learning: In `DI-435/merge_VCF_AF.sh`, the PS4 variable is set to `PS4='\000[$(date)]\011'` and should be kept as is.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

coderabbitai · 2024-11-26T15:58:53Z

DI-435/merge_VCF_AF.sh

+export TZ=Europe/London
+
+# set frequency of instance usage in logs to 10 seconds
+sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')


🛠️ Refactor suggestion

Improve process termination using 'pkill' and eliminate 'sudo'

Listen, we must be precise when pruning our processes. Using pkill tidies up our approach wonderfully, and let's avoid sudo to keep our kitchen safe from unintended side effects.

Apply this diff to improve the command:

-sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}') +pkill pcp-dstat

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')

pkill pcp-dstat

🧰 Tools

🪛 Shellcheck (0.10.0)

[warning] 23-23: Quote this to prevent word splitting.

(SC2046)

[info] 23-23: Consider using pgrep instead of grepping ps output.

(SC2009)

This could be nice if it works?

@RSWilson1, absolutely, it works like a charm! Replacing sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}') with pkill pcp-dstat is a more elegant and efficient way to terminate the pcp-dstat process. It simplifies the script and enhances safety by eliminating the need for sudo. In my kitchen, we always strive for precision and efficiency. Let's incorporate this refined approach to keep our script running smoothly!

I added this part as advised by Jethro and DNAnexus didn't allow kill without sudo - I'm not sure I understand enough to know if changing this is fine..

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

yeah that's fine then

coderabbitai · 2024-11-26T15:58:53Z

DI-435/find_vcfs_to_merge.py

+    print(
+        f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}"


🧹 Nitpick (assertive)

Add a trailing comma after the print arguments

Including a trailing comma enhances code readability and complies with style guidelines.

Suggested change:

print( f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}" + )

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 Ruff (0.8.0)

477-477: Trailing comma missing

Add trailing comma

(COM812)

Understood, @rklocke.

(_/)
( ;_;)

✏️ Learnings added

Learnt from: rklocke PR: eastgenomics/RD_requests#19 File: DI-435/find_vcfs_to_merge.py:480-481 Timestamp: 2024-11-29T09:56:58.290Z Learning: Trailing commas after print function arguments are not needed in this codebase.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

coderabbitai · 2024-11-26T15:58:53Z

DI-435/find_vcfs_to_merge.py

+    parser.add_argument(
+        "-r",
+        "--run_mode",
+        choices=['find_qc', 'no_qc'],
+        required=True,
+        help=(
+            "Runmode - either 'find_qc' where QC status files are "
+            "searched for and used to find failed samples, or 'no_qc'"
+            "where no QC status files are searched for"
+        )
+    )


⚠️ Potential issue

Mind the spacing and add a trailing comma in the help message

The concatenated strings in the help parameter are missing a space, causing words to run together in the command-line help output. Additionally, a trailing comma is missing after the help parameter, which improves code style compliance.

Please apply the following fix:

help=( "Runmode - either 'find_qc' where QC status files are " "searched for and used to find failed samples, or 'no_qc'" - "where no QC status files are searched for" + " where no QC status files are searched for" ),

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 Ruff (0.8.0)

60-60: Trailing comma missing

Add trailing comma

(COM812)

Rate Limit Exceeded

@rklocke have exceeded the limit for the number of chat messages per hour. Please wait 2 minutes and 21 seconds before sending another message.

coderabbitai · 2024-11-26T15:58:53Z

DI-435/find_vcfs_to_merge.py

+def find_medicover_vcf_files(projects):
+    """
+    Find VCF files in a project
+
+    Parameters
+    ----------
+    project_id : str
+        DX project ID
+
+    Returns
+    -------
+    vcf_files : list
+        list of dicts, each representing a VCF file in DX
+    """
+    sub_remove = {
+        '_Wdh': '',
+        '_Wdh2': '',
+        '_Whd3': '',
+    }
+    all_sample_vcfs = []
+    for project in projects:
+        vcf_files = find_data(
+            "*_markdup_recalibrated_Haplotyper.vcf.gz", project['id']
+        )
+        for vcf_file in vcf_files:
+            file_id = vcf_file["describe"]["id"]
+            file_name = vcf_file["describe"]["name"]
+            sample_name = re.sub("|".join(
+                sub_remove), lambda x: sub_remove[x.group(0)],
+                file_name.split('-TwistWE')[0]
+            )
+
+            all_sample_vcfs.append(
+                {
+                    "sample": sample_name,
+                    "project": project["describe"]["id"],
+                    "file_id": file_id
+                }
+            )
+
+    print(
+        f"\nFound {len(all_sample_vcfs)} VCF files in GRCh38 projects"
+    )
+
+    return all_sample_vcfs
+


🛠️ Refactor suggestion

Add type annotations and correct the docstring in find_medicover_vcf_files

The function find_medicover_vcf_files is missing type annotations for its parameter and return type, and the docstring contains inaccuracies. Adding type annotations improves code clarity and assists with static analysis. Updating the docstring ensures accurate documentation.

Please consider the following changes:

+from typing import List, Dict -def find_medicover_vcf_files(projects): +def find_medicover_vcf_files(projects: List[dict]) -> List[dict]: """ - Find VCF files in a project + Find VCF files in specified projects Parameters ---------- - project_id : str - DX project ID + projects : List[dict] + List of DNAnexus project dictionaries Returns ------- - vcf_files : list + all_sample_vcfs : List[dict] List of dictionaries, each representing a VCF file in DNAnexus """

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def find_medicover_vcf_files(projects):

"""

Find VCF files in a project

Parameters

----------

project_id : str

DX project ID

Returns

-------

vcf_files : list

list of dicts, each representing a VCF file in DX

"""

sub_remove = {

'_Wdh': '',

'_Wdh2': '',

'_Whd3': '',

}

all_sample_vcfs = []

for project in projects:

vcf_files = find_data(

"*_markdup_recalibrated_Haplotyper.vcf.gz", project['id']

)

for vcf_file in vcf_files:

file_id = vcf_file["describe"]["id"]

file_name = vcf_file["describe"]["name"]

sample_name = re.sub("|".join(

sub_remove), lambda x: sub_remove[x.group(0)],

file_name.split('-TwistWE')[0]

)

all_sample_vcfs.append(

{

"sample": sample_name,

"project": project["describe"]["id"],

"file_id": file_id

}

)

print(

f"\nFound {len(all_sample_vcfs)} VCF files in GRCh38 projects"

)

return all_sample_vcfs

from typing import List, Dict

def find_medicover_vcf_files(projects: List[dict]) -> List[dict]:

"""

Find VCF files in specified projects

Parameters

----------

projects : List[dict]

List of DNAnexus project dictionaries

Returns

-------

all_sample_vcfs : List[dict]

List of dictionaries, each representing a VCF file in DNAnexus

"""

sub_remove = {

'_Wdh': '',

'_Wdh2': '',

'_Whd3': '',

}

all_sample_vcfs = []

for project in projects:

vcf_files = find_data(

"*_markdup_recalibrated_Haplotyper.vcf.gz", project['id']

)

for vcf_file in vcf_files:

file_id = vcf_file["describe"]["id"]

file_name = vcf_file["describe"]["name"]

sample_name = re.sub("|".join(

sub_remove), lambda x: sub_remove[x.group(0)],

file_name.split('-TwistWE')[0]

)

all_sample_vcfs.append(

{

"sample": sample_name,

"project": project["describe"]["id"],

"file_id": file_id

}

)

print(

f"\nFound {len(all_sample_vcfs)} VCF files in GRCh38 projects"

)

return all_sample_vcfs

🧰 Tools

🪛 Ruff (0.8.0)

355-355: Missing return type annotation for public function find_medicover_vcf_files

(ANN201)

355-355: Missing type annotation for function argument projects

(ANN001)

377-377: Trailing comma missing

Add trailing comma

(COM812)

384-384: Trailing comma missing

Add trailing comma

(COM812)

391-391: Trailing comma missing

Add trailing comma

(COM812)

392-392: Trailing comma missing

Add trailing comma

(COM812)

396-396: Trailing comma missing

Add trailing comma

(COM812)

I have added extra detail but for consistency I will leave the type hints out

Rate Limit Exceeded

@rklocke have exceeded the limit for the number of chat messages per hour. Please wait 2 minutes and 20 seconds before sending another message.

RSWilson1

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @coderabbitai[bot] and @rklocke)

DI-435/find_vcfs_to_merge.py line 357 at r1 (raw file):

def find_medicover_vcf_files(projects):
    """
    Find VCF files in a project

maybe add some more context to this docstring

DI-435/find_vcfs_to_merge.py line 369 at r1 (raw file):

        list of dicts, each representing a VCF file in DX
    """
    sub_remove = {

this naming could be clearer

DI-435/find_vcfs_to_merge.py line 381 at r1 (raw file):

        for vcf_file in vcf_files:
            file_id = vcf_file["describe"]["id"]
            file_name = vcf_file["describe"]["name"]

maybe a comment to explain the logic?

DI-435/merge_VCF_AF.sh line 94 at r1 (raw file):

mkdir -p norm
echo "Normalising VCFs"
find . -maxdepth 1 -name '*.vcf.gz' -print0 | xargs -0 -P "${num_proc}" -I{} bcftools norm -m -any -f "${genome}" -Oz "{}" -o norm/"$(basename {})"

NICE! I was trying to get this to work but kept getting odd errors.

DI-435/merge_VCF_AF.sh line 134 at r1 (raw file):

dx upload "final_merged_${file_prefix}.vcf.gz"
dx upload "final_merged_${file_prefix}.vcf.gz.tbi"
_get_peak_usage

This is nifty 👍

RSWilson1 · 2024-11-27T13:05:45Z

DI-435/merge_VCF_AF.sh

+export TZ=Europe/London
+
+# set frequency of instance usage in logs to 10 seconds
+sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')


This could be nice if it works?

coderabbitai

Actionable comments posted: 6

🧹 Outside diff range comments (1)

DI-435/merge_VCF_AF.sh (1)

Line range hint 92-133: Add error handling for critical file operations

The file operations lack proper error checking, which could lead to silent failures.

Consider adding checks for critical operations:

 mkdir -p norm
+if [ ! -d "norm" ]; then
+    echo "ERROR: Failed to create normalisation directory"
+    exit 1
+fi
+
 echo "Normalising VCFs"
-find . -maxdepth 1 -name '*.vcf.gz' -print0 | xargs -0 -P "${num_proc}" -I{} bcftools norm -m -any -f "${genome}" -Oz "{}" -o norm/"$(basename {})"
+find . -maxdepth 1 -name '*.vcf.gz' -print0 | xargs -0 -P "${num_proc}" -I{} bash -c '
+    if ! bcftools norm -m -any -f "${genome}" -Oz "{}" -o norm/"$(basename {})"; then
+        echo "ERROR: Failed to normalise {}"
+        exit 1
+    fi
+'
+
+# Verify files were created
+if [ ! "$(ls -A norm/*.vcf.gz 2>/dev/null)" ]; then
+    echo "ERROR: No normalised VCF files were created"
+    exit 1
+fi

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE

📥 Commits

Reviewing files that changed from the base of the PR and between 2a8721e and 9a6fddd.

📒 Files selected for processing (3)

DI-435/README.md (1 hunks)
DI-435/find_vcfs_to_merge.py (5 hunks)
DI-435/merge_VCF_AF.sh (4 hunks)

🧰 Additional context used

📓 Learnings (1)

DI-435/merge_VCF_AF.sh (4)

Learnt from: rklocke
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:58-63
Timestamp: 2024-11-22T12:19:56.287Z
Learning: When improving bash scripts, prefer adding `set -exo pipefail` at the top to enhance error handling and robustness.

Learnt from: RSWilson1
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:60-60
Timestamp: 2024-11-25T16:12:16.377Z
Learning: In `DI-435/merge_VCF_AF.sh`, the script handles cases where no VCF files are found by validating the input file (`input_file`) at the beginning, so additional checks later in the script are unnecessary.

Learnt from: RSWilson1
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:76-78
Timestamp: 2024-11-25T16:01:30.595Z
Learning: In the script `DI-435/merge_VCF_AF.sh`, the presence of VCF files is already ensured by checking during the file download process.

Learnt from: RSWilson1
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:60-60
Timestamp: 2024-11-25T16:01:34.002Z
Learning: In `DI-435/merge_VCF_AF.sh`, the script handles cases where no VCF files are found by checking during the file download step, so additional checks later in the script are unnecessary.

🪛 LanguageTool

DI-435/README.md

[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e no_qc ``` #### Inputs - -a --assay: The GRCh38 project search term for in D...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~38-~38: Loose punctuation mark.
Context: ...e.g. "*CEN38". - -o --outfile_prefix: Prefix to use to name the output files,...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~39-~39: Loose punctuation mark.
Context: ...t files, e.g. CEN38. - -r --run_mode: A choice of whether to find and use QC ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~43-~43: Loose punctuation mark.
Context: ...section below. - -e --end (optional): A date used to find DNAnexus (GRCh37) p...

(UNLIKELY_OPENING_PUNCTUATION)

[style] ~51-~51: Consider removing “of” to be more concise
Context: ...for each of these projects and reads in all of the QC status files into one merged datafra...

(ALL_OF_THE)

[uncategorized] ~72-~72: The preposition “on” seems more likely in this position than the preposition “in”.
Context: ...pt to merge VCFs The bash script is run in a DNAnexus cloud workstation and requir...

(AI_EN_LECTOR_REPLACEMENT_PREPOSITION_IN_ON)

🪛 Markdownlint (0.35.0)

DI-435/README.md

42-42: Expected: 0; Actual: 2
Unordered list indentation

(MD007, ul-indent)

43-43: Expected: 0; Actual: 2
Unordered list indentation

(MD007, ul-indent)

8-8: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

9-9: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

9-9: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

48-48: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

56-56: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

56-56: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

61-61: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

67-67: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

67-67: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

71-71: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

11-11: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

16-16: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

18-18: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

28-28: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

77-77: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

5-5: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

42-42: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

50-50: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

55-55: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

57-57: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

63-63: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

66-66: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

68-68: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

73-73: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

84-84: null
Files should end with a single newline character

(MD047, single-trailing-newline)

🪛 Ruff (0.8.0)

DI-435/find_vcfs_to_merge.py

60-60: Trailing comma missing

Add trailing comma

(COM812)

218-218: Trailing comma missing

Add trailing comma

(COM812)

355-355: Missing return type annotation for public function find_medicover_vcf_files

(ANN201)

355-355: Missing type annotation for function argument projects

(ANN001)

378-378: Trailing comma missing

Add trailing comma

(COM812)

388-388: Trailing comma missing

Add trailing comma

(COM812)

395-395: Trailing comma missing

Add trailing comma

(COM812)

396-396: Trailing comma missing

Add trailing comma

(COM812)

400-400: Trailing comma missing

Add trailing comma

(COM812)

481-481: Trailing comma missing

Add trailing comma

(COM812)

487-487: Trailing comma missing

Add trailing comma

(COM812)

500-500: Trailing comma missing

Add trailing comma

(COM812)

517-517: Trailing comma missing

Add trailing comma

(COM812)

527-527: Trailing comma missing

Add trailing comma

(COM812)

536-536: Trailing comma missing

Add trailing comma

(COM812)

542-542: Trailing comma missing

Add trailing comma

(COM812)

547-547: Trailing comma missing

Add trailing comma

(COM812)

552-552: Trailing comma missing

Add trailing comma

(COM812)

560-560: Trailing comma missing

Add trailing comma

(COM812)

573-573: Trailing comma missing

Add trailing comma

(COM812)

580-580: Trailing comma missing

Add trailing comma

(COM812)

589-589: Trailing comma missing

Add trailing comma

(COM812)

595-595: Trailing comma missing

Add trailing comma

(COM812)

599-599: Trailing comma missing

Add trailing comma

(COM812)

602-602: Trailing comma missing

Add trailing comma

(COM812)

608-608: Trailing comma missing

Add trailing comma

(COM812)

🪛 Shellcheck (0.10.0)

DI-435/merge_VCF_AF.sh

[warning] 23-23: Quote this to prevent word splitting.

(SC2046)

[info] 23-23: Consider using pgrep instead of grepping ps output.

(SC2009)

🔇 Additional comments (7)

DI-435/merge_VCF_AF.sh (2)

15-20: 🛠️ Refactor suggestion

Optimise logging setup for improved reliability

The logging configuration requires refinement:

The PS4 format using octal escapes (\000, \011) may not be portable across all environments
The timestamp format lacks millisecond precision, which could be crucial for debugging

Apply this diff to enhance logging reliability:

-PS4='\000[$(date)]\011'
+PS4='[$(date +"%Y-%m-%d %H:%M:%S.%3N")] '

22-24: 🛠️ Refactor suggestion

Improve process termination using 'pkill'

The current process termination approach is overly complex and potentially fragile.

🧰 Tools

🪛 Shellcheck (0.10.0)

[warning] 23-23: Quote this to prevent word splitting.

(SC2046)

[info] 23-23: Consider using pgrep instead of grepping ps output.

(SC2009)

DI-435/README.md (2)

4-6: Implementation modes are well-documented.

The distinction between find_qc and no_qc modes is clearly explained, providing users with a solid understanding of the operational differences.

🧰 Tools

🪛 Markdownlint (0.35.0)

5-5: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

48-69: Mode-specific documentation is thorough and well-structured.

The step-by-step process and output documentation for both modes provide clear guidance for users. The consistent file naming convention across modes is particularly helpful.

🧰 Tools

🪛 LanguageTool

[style] ~51-~51: Consider removing “of” to be more concise
Context: ...for each of these projects and reads in all of the QC status files into one merged datafra...

(ALL_OF_THE)

🪛 Markdownlint (0.35.0)

48-48: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

56-56: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

56-56: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

61-61: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

67-67: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

67-67: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

50-50: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

55-55: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

57-57: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

63-63: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

66-66: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

68-68: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

DI-435/find_vcfs_to_merge.py (3)

60-60: 🛠️ Refactor suggestion

Add missing trailing comma after the help parameter

Adding a trailing comma improves code style compliance and may prevent potential issues.

Apply this diff:

             )
+        ,

🧰 Tools

🪛 Ruff (0.8.0)

60-60: Trailing comma missing

Add trailing comma

(COM812)

355-355: 🛠️ Refactor suggestion

Add type annotations to find_medicover_vcf_files

Adding type annotations improves code clarity and assists with static analysis.

Apply these changes:

Import List from typing at the top of the file:

+from typing import List

Update the function definition:

-def find_medicover_vcf_files(projects):
+def find_medicover_vcf_files(projects: List[dict]) -> List[dict]:

🧰 Tools

🪛 Ruff (0.8.0)

355-355: Missing return type annotation for public function find_medicover_vcf_files

(ANN201)

355-355: Missing type annotation for function argument projects

(ANN001)

481-481: 🛠️ Refactor suggestion

Add missing trailing comma after the print argument

Including a trailing comma enhances code readability and complies with style guidelines.

Apply this diff:

     print(
         f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}"
+    ,
     )

🧰 Tools

🪛 Ruff (0.8.0)

481-481: Trailing comma missing

Add trailing comma

(COM812)

coderabbitai · 2024-11-29T09:46:24Z

DI-435/merge_VCF_AF.sh

+while IFS=$'\t' read -r _ field2 field3; do
+    ((line_number++))
+
+    # Validate fields are not empty
+    if [ -z "$field2" ] || [ -z "$field3" ]; then
+        echo "ERROR: Missing project or file ID at line $line_number"
+        exit 1
+    fi
+
+    # Validate field format (assuming they should match dx:// format)
+    if [[ ! "$field2" =~ ^project-.* ]] || [[ ! "$field3" =~ ^file-.* ]]; then
+        echo "ERROR: Invalid project or file ID format at line $line_number"
+        exit 1
+    fi
+    project_files+=("${field2}:${field3}")
+


🧹 Nitpick (assertive)

Enhance input validation robustness

While the validation logic is sound, we can improve its maintainability and clarity.

Consider this enhancement:

+# Define validation patterns +readonly PROJECT_ID_PATTERN='^project-[a-zA-Z0-9]{24}$' +readonly FILE_ID_PATTERN='^file-[a-zA-Z0-9]{24}$' + while IFS=$'\t' read -r _ field2 field3; do ((line_number++)) # Validate fields are not empty if [ -z "$field2" ] || [ -z "$field3" ]; then echo "ERROR: Missing project or file ID at line $line_number" exit 1 fi # Validate field format - if [[ ! "$field2" =~ ^project-.* ]] || [[ ! "$field3" =~ ^file-.* ]]; then + if [[ ! "$field2" =~ $PROJECT_ID_PATTERN ]] || [[ ! "$field3" =~ $FILE_ID_PATTERN ]]; then echo "ERROR: Invalid project or file ID format at line $line_number" exit 1 fi

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

while IFS=$'\t' read -r _ field2 field3; do

((line_number++))

# Validate fields are not empty

if [ -z "$field2" ] || [ -z "$field3" ]; then

echo "ERROR: Missing project or file ID at line $line_number"

exit 1

fi

# Validate field format (assuming they should match dx:// format)

if [[ ! "$field2" =~ ^project-.* ]] || [[ ! "$field3" =~ ^file-.* ]]; then

echo "ERROR: Invalid project or file ID format at line $line_number"

exit 1

fi

project_files+=("${field2}:${field3}")

# Define validation patterns

readonly PROJECT_ID_PATTERN='^project-[a-zA-Z0-9]{24}$'

readonly FILE_ID_PATTERN='^file-[a-zA-Z0-9]{24}$'

while IFS=$'\t' read -r _ field2 field3; do

((line_number++))

# Validate fields are not empty

if [ -z "$field2" ] || [ -z "$field3" ]; then

echo "ERROR: Missing project or file ID at line $line_number"

exit 1

fi

# Validate field format

if [[ ! "$field2" =~ $PROJECT_ID_PATTERN ]] || [[ ! "$field3" =~ $FILE_ID_PATTERN ]]; then

echo "ERROR: Invalid project or file ID format at line $line_number"

exit 1

fi

project_files+=("${field2}:${field3}")

coderabbitai · 2024-11-29T09:46:24Z

DI-435/merge_VCF_AF.sh

+_get_peak_usage() {
+    : '''
+    Reports the peak memory and storage usage from dstat, to be called at end of script
+    '''
+    dx watch "$DX_JOB_ID" --no-follow --quiet > job.log
+
+    peak_mem=$(grep 'INFO CPU' job.log | cut -d':' -f5 | cut -d'/' -f1 | sort -n | tail -n1)
+    total_mem="$(($(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*')/1024))"
+
+    peak_storage=$(grep 'INFO CPU' job.log | cut -d':' -f6 | cut -d'/' -f1 | sort -n | tail -n1)
+    total_storage=$(df -Pk / | awk 'NR == 2' | awk '{printf("%.0f", $2/1024/1024)}')
+
+    echo "Memory usage peaked at ${peak_mem}/${total_mem}MB"
+    echo "Storage usage peaked at ${peak_storage}/${total_storage}GB"
+}


🛠️ Refactor suggestion

Enhance error handling in peak usage monitoring

The peak usage function requires additional error handling to ensure reliability:

No validation of dx watch command success

Missing checks for memory info availability

No cleanup of temporary job.log file

Apply this enhancement:

_get_peak_usage() { : ''' Reports the peak memory and storage usage from dstat, to be called at end of script ''' - dx watch "$DX_JOB_ID" --no-follow --quiet > job.log + if ! dx watch "$DX_JOB_ID" --no-follow --quiet > job.log; then + echo "ERROR: Failed to retrieve job metrics" + return 1 + fi + + if [ ! -s job.log ]; then + echo "ERROR: Job metrics log is empty" + return 1 + fi peak_mem=$(grep 'INFO CPU' job.log | cut -d':' -f5 | cut -d'/' -f1 | sort -n | tail -n1) - total_mem="$(($(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*')/1024))" + mem_total=$(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*' || echo 0) + if [ "$mem_total" -eq 0 ]; then + echo "ERROR: Failed to retrieve total memory" + return 1 + fi + total_mem="$((mem_total/1024))" peak_storage=$(grep 'INFO CPU' job.log | cut -d':' -f6 | cut -d'/' -f1 | sort -n | tail -n1) total_storage=$(df -Pk / | awk 'NR == 2' | awk '{printf("%.0f", $2/1024/1024)}') echo "Memory usage peaked at ${peak_mem}/${total_mem}MB" echo "Storage usage peaked at ${peak_storage}/${total_storage}GB" + rm -f job.log }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

_get_peak_usage() {

: '''

Reports the peak memory and storage usage from dstat, to be called at end of script

'''

dx watch "$DX_JOB_ID" --no-follow --quiet > job.log

peak_mem=$(grep 'INFO CPU' job.log | cut -d':' -f5 | cut -d'/' -f1 | sort -n | tail -n1)

total_mem="$(($(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*')/1024))"

peak_storage=$(grep 'INFO CPU' job.log | cut -d':' -f6 | cut -d'/' -f1 | sort -n | tail -n1)

total_storage=$(df -Pk / | awk 'NR == 2' | awk '{printf("%.0f", $2/1024/1024)}')

echo "Memory usage peaked at ${peak_mem}/${total_mem}MB"

echo "Storage usage peaked at ${peak_storage}/${total_storage}GB"

}

_get_peak_usage() {

: '''

Reports the peak memory and storage usage from dstat, to be called at end of script

'''

if ! dx watch "$DX_JOB_ID" --no-follow --quiet > job.log; then

echo "ERROR: Failed to retrieve job metrics"

return 1

fi

if [ ! -s job.log ]; then

echo "ERROR: Job metrics log is empty"

return 1

fi

peak_mem=$(grep 'INFO CPU' job.log | cut -d':' -f5 | cut -d'/' -f1 | sort -n | tail -n1)

mem_total=$(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*' || echo 0)

if [ "$mem_total" -eq 0 ]; then

echo "ERROR: Failed to retrieve total memory"

return 1

fi

total_mem="$((mem_total/1024))"

peak_storage=$(grep 'INFO CPU' job.log | cut -d':' -f6 | cut -d'/' -f1 | sort -n | tail -n1)

total_storage=$(df -Pk / | awk 'NR == 2' | awk '{printf("%.0f", $2/1024/1024)}')

echo "Memory usage peaked at ${peak_mem}/${total_mem}MB"

echo "Storage usage peaked at ${peak_storage}/${total_storage}GB"

rm -f job.log

}

coderabbitai · 2024-11-29T09:46:24Z

DI-435/README.md

+Example command to find Medicover VCFs with no QC status files:
+```bash
+python3 find_vcfs_to_merge.py \
+    --assay "*TWE38M" \
+    --outfile_prefix TWE38M \
+    --run_mode no_qc
+```


🧹 Nitpick (assertive)

Consider clarifying date parameter usage in no_qc mode.

While the Medicover example effectively demonstrates the no_qc mode, it would be helpful to explicitly state whether date parameters are supported or ignored in this mode.

🧰 Tools

🪛 Markdownlint (0.35.0)

28-28: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

Already clarified that the inputs are only allowed in find_qc mode

@rklocke, understood. Thanks for clarifying.

✏️ Learnings added

Learnt from: rklocke PR: eastgenomics/RD_requests#19 File: DI-435/README.md:27-33 Timestamp: 2024-11-29T09:56:48.931Z Learning: In the `find_vcfs_to_merge.py` script within `DI-435/`, the date parameters `--start` and `--end` are only allowed in the `find_qc` mode.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

coderabbitai · 2024-11-29T09:46:24Z

DI-435/README.md

+- `-a --assay`: The GRCh38 project search term for in DNAnexus, e.g. `"*CEN38"`.
+- `-o --outfile_prefix`: Prefix to use to name the output files, e.g. `CEN38`.
+- `-r --run_mode`: A choice of whether to find and use QC status files (`find_qc`) or not (`no_qc`).
+
+In `find_qc` mode the following inputs are allowed:
+  - `-s --start (optional)`: A date used to find DNAnexus (GRCh37) projects created after, see searching dates section below.
+  - `-e --end (optional)`: A date used to find DNAnexus (GRCh37) projects created before, see searching dates section below.


🧹 Nitpick (assertive)

Consider marking required parameters.

While the documentation clearly explains each parameter, it would be beneficial to explicitly mark which parameters are required for each mode. For example:

-a --assay: [REQUIRED] The GRCh38 project search term...

-o --outfile_prefix: [REQUIRED] Prefix to use...

-r --run_mode: [REQUIRED] A choice of whether...

🧰 Tools

🪛 LanguageTool

[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e no_qc ``` #### Inputs - -a --assay: The GRCh38 project search term for in D...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~38-~38: Loose punctuation mark.
Context: ...e.g. "*CEN38". - -o --outfile_prefix: Prefix to use to name the output files,...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~39-~39: Loose punctuation mark.
Context: ...t files, e.g. CEN38. - -r --run_mode: A choice of whether to find and use QC ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~43-~43: Loose punctuation mark.
Context: ...section below. - -e --end (optional): A date used to find DNAnexus (GRCh37) p...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 Markdownlint (0.35.0)

42-42: Expected: 0; Actual: 2
Unordered list indentation

(MD007, ul-indent)

43-43: Expected: 0; Actual: 2
Unordered list indentation

(MD007, ul-indent)

42-42: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

It's clear already

@rklocke, understood.

(_/)
( ._.)
('')('')

coderabbitai · 2024-11-29T09:46:24Z

DI-435/README.md

 ### Bash script to merge VCFs
 The bash script is run in a DNAnexus cloud workstation and requires positional inputs of:
 - The output file generated from the Python script above
- The job ID for the cloud workstation running
 - The reference genome for GRCh38

 Example bash script command:
-`bash merge_VCF_AF.sh CEN38_vcf_to_merge.txt job-Gjb39Z04bxf82XZ12gPJ2bbV GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta`
+```bash
+bash merge_VCF_AF.sh \
+    CEN38_files_to_merge.txt \
+    GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
+```

 Output:
-Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
+Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.


🧹 Nitpick (assertive)

Consider adding operational prerequisites.

While the bash script documentation is clear, it would be beneficial to add:

Required permissions or roles for DNAnexus cloud workstation access

Estimated runtime or resource requirements

Any cleanup steps needed after execution

🧰 Tools

🪛 LanguageTool

[uncategorized] ~72-~72: The preposition “on” seems more likely in this position than the preposition “in”.
Context: ...pt to merge VCFs The bash script is run in a DNAnexus cloud workstation and requir...

(AI_EN_LECTOR_REPLACEMENT_PREPOSITION_IN_ON)

🪛 Markdownlint (0.35.0)

71-71: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines

(MD022, blanks-around-headings)

77-77: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

73-73: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

84-84: null
Files should end with a single newline character

(MD047, single-trailing-newline)

Not required

@rklocke, understood.

✏️ Learnings added

Learnt from: rklocke PR: eastgenomics/RD_requests#19 File: DI-435/README.md:84-84 Timestamp: 2024-11-29T09:56:55.375Z Learning: In `DI-435/README.md`, operational prerequisites for the bash script `merge_VCF_AF.sh`, such as required permissions, estimated runtime, resource requirements, and cleanup steps, are not required in the documentation.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

coderabbitai · 2024-11-29T09:46:25Z

DI-435/find_vcfs_to_merge.py

+        '_Wdh': '',
+        '_Wdh2': '',
+        '_Whd3': '',
+    }


⚠️ Potential issue

Correct the potential typo in strings_to_remove dictionary

The key '_Whd3' might be a typo and should perhaps be '_Wdh3' to maintain consistency with '_Wdh' and '_Wdh2'.

Apply this diff to correct the typo:

strings_to_remove = { '_Wdh': '', '_Wdh2': '', - '_Whd3': '', + '_Wdh3': '', }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

'_Wdh': '',

'_Wdh2': '',

'_Whd3': '',

}

'_Wdh': '',

'_Wdh2': '',

'_Wdh3': '',

}

Actually not a typo

@rklocke, understood. Thanks for the clarification—I'll note that '_Whd3' is correct.

✏️ Learnings added

Learnt from: rklocke PR: eastgenomics/RD_requests#19 File: DI-435/find_vcfs_to_merge.py:371-374 Timestamp: 2024-11-29T09:57:02.044Z Learning: In `DI-435/find_vcfs_to_merge.py`, within the `strings_to_remove` dictionary in the `find_medicover_vcf_files` function, the key `'_Whd3'` is correct and intentional.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

rklocke

Dismissed @coderabbitai[bot] from a discussion.
Reviewable status: 0 of 3 files reviewed, 19 unresolved discussions (waiting on @coderabbitai[bot] and @RSWilson1)

DI-435/find_vcfs_to_merge.py line 357 at r1 (raw file):

Previously, RSWilson1 (Robert Wilson) wrote…

maybe add some more context to this docstring

Done.

DI-435/find_vcfs_to_merge.py line 369 at r1 (raw file):

Previously, RSWilson1 (Robert Wilson) wrote…

this naming could be clearer

Done.

DI-435/find_vcfs_to_merge.py line 381 at r1 (raw file):

Previously, RSWilson1 (Robert Wilson) wrote…

maybe a comment to explain the logic?

Done.

rklocke · 2024-11-29T09:56:16Z

DI-435/find_vcfs_to_merge.py

+    parser.add_argument(
+        "-r",
+        "--run_mode",
+        choices=['find_qc', 'no_qc'],
+        required=True,
+        help=(
+            "Runmode - either 'find_qc' where QC status files are "
+            "searched for and used to find failed samples, or 'no_qc'"
+            "where no QC status files are searched for"
+        )
+    )


rklocke · 2024-11-29T09:56:16Z

DI-435/find_vcfs_to_merge.py

+def find_medicover_vcf_files(projects):
+    """
+    Find VCF files in a project
+
+    Parameters
+    ----------
+    project_id : str
+        DX project ID
+
+    Returns
+    -------
+    vcf_files : list
+        list of dicts, each representing a VCF file in DX
+    """
+    sub_remove = {
+        '_Wdh': '',
+        '_Wdh2': '',
+        '_Whd3': '',
+    }
+    all_sample_vcfs = []
+    for project in projects:
+        vcf_files = find_data(
+            "*_markdup_recalibrated_Haplotyper.vcf.gz", project['id']
+        )
+        for vcf_file in vcf_files:
+            file_id = vcf_file["describe"]["id"]
+            file_name = vcf_file["describe"]["name"]
+            sample_name = re.sub("|".join(
+                sub_remove), lambda x: sub_remove[x.group(0)],
+                file_name.split('-TwistWE')[0]
+            )
+
+            all_sample_vcfs.append(
+                {
+                    "sample": sample_name,
+                    "project": project["describe"]["id"],
+                    "file_id": file_id
+                }
+            )
+
+    print(
+        f"\nFound {len(all_sample_vcfs)} VCF files in GRCh38 projects"
+    )
+
+    return all_sample_vcfs
+


I have added extra detail but for consistency I will leave the type hints out

rklocke · 2024-11-29T09:56:16Z

DI-435/find_vcfs_to_merge.py

+    print(
+        f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}"


rklocke · 2024-11-29T09:56:16Z

DI-435/find_vcfs_to_merge.py

+        '_Wdh': '',
+        '_Wdh2': '',
+        '_Whd3': '',
+    }


Actually not a typo

rklocke · 2024-11-29T09:56:17Z

DI-435/merge_VCF_AF.sh

+PS4='\000[$(date)]\011'
+export TZ=Europe/London


This works fine as is

rklocke · 2024-11-29T09:56:18Z

DI-435/README.md

+#### find_qc mode
+How the script works in `find_qc` mode:
 1. Finds all DNAnexus projects with suffix `--assay`.
 2. Finds the related GRCh37 project (and between `--start` and `--end` dates, if provided) for each of these projects and reads in all of the QC status files into one merged dataframe. If multiple QC status files exist in a project, the one created last is used.
 3. Finds all raw VCFs in each of the DNAnexus projects.
 4. Splits these VCFs into a list of validation (including control) samples and non-validation samples based on naming conventions
-5. Removes any samples which are duplicates or any which failed QC at any time based on information within the QC status files.
-6. Creates a final list of all VCFs to merge and writes this out to file with`--outfile_prefix`.
+5. Removes the first instance of a sample which is duplicated or any which failed QC at any time based on information within the QC status files.
+6. Creates a final list of all VCFs to merge and writes this out to file with `--outfile_prefix`.
+##### Outputs
+- A TSV listing the VCF files for all non-validation samples to merge (named by `{outfile_prefix}_files_to_merge.txt`)
+- A CSV of all validation samples found (`{outfile_prefix}_validation_samples.csv`)
+- A CSV of all projects within search but missing QC file in DNAnexus and therefore not included (`{outfile_prefix}_projects_missing_QC.csv`).
+
+
+#### no_qc mode
+How the script works in `no_qc` mode:
+1. Finds all DNAnexus projects with suffix `--assay`
+2. Finds all raw VCFs in each of the DNAnexus projects.
+3. Removes all samples which are duplicates.
+4. Creates a final list of all VCFs to merge and writes this to file with `--outfile_prefix`.
+##### Outputs
+- A TSV listing the VCF files to merge (named by `{outfile_prefix}_files_to_merge.txt`)
+- A TSV of all duplicate files which have been removed (`{outfile_prefix}_all_dup_rows.txt`)


rklocke · 2024-11-29T09:56:18Z

DI-435/README.md

+```
+bash merge_VCF_AF.sh \
+    CEN38_files_to_merge.txt \
+    GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
+```

 Output:
-Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
+Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.


rklocke · 2024-11-29T09:56:18Z

DI-435/README.md

+Example command to find Medicover VCFs with no QC status files:
+```bash
+python3 find_vcfs_to_merge.py \
+    --assay "*TWE38M" \
+    --outfile_prefix TWE38M \
+    --run_mode no_qc
+```


Already clarified that the inputs are only allowed in find_qc mode

rklocke · 2024-11-29T09:56:19Z

DI-435/README.md

+- `-a --assay`: The GRCh38 project search term for in DNAnexus, e.g. `"*CEN38"`.
+- `-o --outfile_prefix`: Prefix to use to name the output files, e.g. `CEN38`.
+- `-r --run_mode`: A choice of whether to find and use QC status files (`find_qc`) or not (`no_qc`).
+
+In `find_qc` mode the following inputs are allowed:
+  - `-s --start (optional)`: A date used to find DNAnexus (GRCh37) projects created after, see searching dates section below.
+  - `-e --end (optional)`: A date used to find DNAnexus (GRCh37) projects created before, see searching dates section below.


It's clear already

rklocke · 2024-11-29T09:56:19Z

DI-435/README.md

 ### Bash script to merge VCFs
 The bash script is run in a DNAnexus cloud workstation and requires positional inputs of:
 - The output file generated from the Python script above
- The job ID for the cloud workstation running
 - The reference genome for GRCh38

 Example bash script command:
-`bash merge_VCF_AF.sh CEN38_vcf_to_merge.txt job-Gjb39Z04bxf82XZ12gPJ2bbV GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta`
+```bash
+bash merge_VCF_AF.sh \
+    CEN38_files_to_merge.txt \
+    GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
+```

 Output:
-Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
+Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.


Not required

RSWilson1

Looks good, dismissed some coderabbit comments as not needed.

Reviewed 3 of 3 files at r2, all commit messages.
Dismissed @coderabbitai[bot] from a discussion.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @rklocke)

RSWilson1 · 2024-11-29T11:15:54Z

DI-435/merge_VCF_AF.sh

+export TZ=Europe/London
+
+# set frequency of instance usage in logs to 10 seconds
+sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')


yeah that's fine then

RSWilson1

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @rklocke)

RSWilson1

Dismissed @coderabbitai[bot] from a discussion.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @rklocke)

RSWilson1

Dismissed @coderabbitai[bot] from 14 discussions.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @rklocke)

rklocke added 7 commits November 12, 2024 16:00

Update script for TWE runs and for Medicover no_qc runmode

2f9e690

Update writing out dups

7658a3f

Merge TWE changes into branch

b57215e

Update for finding Medicover VCFs and update merging script

f74539d

Merge bash script, add Medicover finding, update README

4a58018

Update bash script and Python script to remove index from output CSV

b9e2804

Small formatting changes

2a8721e

rklocke requested a review from RSWilson1 November 26, 2024 15:55

coderabbitai bot reviewed Nov 26, 2024

View reviewed changes

RSWilson1 requested changes Nov 27, 2024

View reviewed changes

PR changes

9a6fddd

coderabbitai bot reviewed Nov 29, 2024

View reviewed changes

rklocke commented Nov 29, 2024

View reviewed changes

RSWilson1 approved these changes Nov 29, 2024

View reviewed changes

RSWilson1 merged commit 97df255 into main Nov 29, 2024
4 checks passed

	sudo kill $(ps aux \| grep pcp-dstat \| head -n1 \| awk '{print $2}')
	pkill pcp-dstat

		print(
		f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}"

DI-920 Changes for Medicover internal AF VCF #19

DI-920 Changes for Medicover internal AF VCF #19

Conversation

rklocke commented Nov 26, 2024 • edited by coderabbitai bot Loading

Summary by CodeRabbit

pep8speaks commented Nov 26, 2024 • edited Loading

Comment last updated at 2024-11-29 09:42:59 UTC

coderabbitai bot commented Nov 26, 2024 • edited Loading

Walkthrough

Changes

Possibly related PRs

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

Rate Limit Exceeded

coderabbitai bot Nov 26, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot Nov 27, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

Rate Limit Exceeded

coderabbitai bot Nov 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

Rate Limit Exceeded

RSWilson1 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

coderabbitai bot Nov 29, 2024

rklocke commented Nov 26, 2024 •

edited by coderabbitai bot

Loading

pep8speaks commented Nov 26, 2024 •

edited

Loading

coderabbitai bot commented Nov 26, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

coderabbitai bot Nov 26, 2024 •

edited

Loading

coderabbitai bot Nov 26, 2024 •

edited

Loading

coderabbitai bot Nov 26, 2024 •

edited

Loading

coderabbitai bot Nov 26, 2024 •

edited

Loading

coderabbitai bot Nov 26, 2024 •

edited

Loading

coderabbitai bot Nov 26, 2024 •

edited

Loading

coderabbitai bot Nov 26, 2024 •

edited

Loading

coderabbitai bot Nov 26, 2024 •

edited

Loading

coderabbitai bot Nov 26, 2024 •

edited

Loading

coderabbitai bot Nov 29, 2024 •

edited

Loading

coderabbitai bot Nov 29, 2024 •

edited

Loading

coderabbitai bot Nov 29, 2024 •

edited

Loading

coderabbitai bot Nov 29, 2024 •

edited

Loading