DI-920 Changes for Medicover internal AF VCF #19
Conversation
Hello @rklocke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-11-29 09:42:59 UTC |
Walkthrough
This pull request introduces significant updates to the documentation and scripts involved in processing VCF files. The
Changes
Possibly related PRs
Actionable comments posted: 11
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
📒 Files selected for processing (3)
DI-435/README.md (1 hunks)
DI-435/find_vcfs_to_merge.py (5 hunks)
DI-435/merge_VCF_AF.sh (4 hunks)
🧰 Additional context used
📓 Learnings (1)
DI-435/merge_VCF_AF.sh (1)
Learnt from: rklocke
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:58-63
Timestamp: 2024-11-22T12:19:56.287Z
Learning: When improving bash scripts, prefer adding `set -exo pipefail` at the top to enhance error handling and robustness.
🪛 LanguageTool
DI-435/README.md
[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e.g. "*CEN38"
. - -o --outfile_prefix
: Prefix to use to name the output files,...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~38-~38: Loose punctuation mark.
Context: ...t files, e.g. CEN38
. - -r --run_mode
: A choice of whether to find and use QC ...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~41-~41: Loose punctuation mark.
Context: ...section below. - -e --end (optional)
: A date used to find DNAnexus (GRCh37) p...
(UNLIKELY_OPENING_PUNCTUATION)
[style] ~49-~49: Consider removing “of” to be more concise
Context: ...for each of these projects and reads in all of the QC status files into one merged datafra...
(ALL_OF_THE)
🪛 Markdownlint (0.35.0)
DI-435/README.md
59-59: Expected: 1; Actual: 2
Multiple consecutive blank lines
(MD012, no-multiple-blanks)
8-8: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
9-9: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
9-9: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
35-35: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
46-46: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
54-54: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
54-54: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
60-60: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
66-66: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
66-66: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
70-70: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
11-11: null
Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
16-16: null
Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
18-18: null
Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
28-28: null
Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
76-76: null
Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
5-5: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
36-36: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
48-48: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
53-53: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
55-55: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
62-62: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
65-65: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
67-67: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
72-72: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
11-11: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)
18-18: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)
28-28: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)
76-76: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)
83-83: null
Files should end with a single newline character
(MD047, single-trailing-newline)
🪛 Ruff (0.8.0)
DI-435/find_vcfs_to_merge.py
60-60: Trailing comma missing
Add trailing comma
(COM812)
218-218: Trailing comma missing
Add trailing comma
(COM812)
355-355: Missing return type annotation for public function find_medicover_vcf_files
(ANN201)
355-355: Missing type annotation for function argument projects
(ANN001)
377-377: Trailing comma missing
Add trailing comma
(COM812)
384-384: Trailing comma missing
Add trailing comma
(COM812)
391-391: Trailing comma missing
Add trailing comma
(COM812)
392-392: Trailing comma missing
Add trailing comma
(COM812)
396-396: Trailing comma missing
Add trailing comma
(COM812)
477-477: Trailing comma missing
Add trailing comma
(COM812)
483-483: Trailing comma missing
Add trailing comma
(COM812)
496-496: Trailing comma missing
Add trailing comma
(COM812)
513-513: Trailing comma missing
Add trailing comma
(COM812)
523-523: Trailing comma missing
Add trailing comma
(COM812)
532-532: Trailing comma missing
Add trailing comma
(COM812)
538-538: Trailing comma missing
Add trailing comma
(COM812)
543-543: Trailing comma missing
Add trailing comma
(COM812)
548-548: Trailing comma missing
Add trailing comma
(COM812)
556-556: Trailing comma missing
Add trailing comma
(COM812)
569-569: Trailing comma missing
Add trailing comma
(COM812)
576-576: Trailing comma missing
Add trailing comma
(COM812)
585-585: Trailing comma missing
Add trailing comma
(COM812)
591-591: Trailing comma missing
Add trailing comma
(COM812)
595-595: Trailing comma missing
Add trailing comma
(COM812)
598-598: Trailing comma missing
Add trailing comma
(COM812)
604-604: Trailing comma missing
Add trailing comma
(COM812)
🪛 Shellcheck (0.10.0)
DI-435/merge_VCF_AF.sh
[warning] 23-23: Quote this to prevent word splitting.
(SC2046)
[info] 23-23: Consider using pgrep instead of grepping ps output.
(SC2009)
[info] 30-30: Double quote to prevent globbing and word splitting.
(SC2086)
🔇 Additional comments (1)
DI-435/merge_VCF_AF.sh (1)
15-17: Logging output is properly directed for traceability
Excellent work directing `stdout` and `stderr` to log files. This will help us keep track of the script's execution, much like keeping a meticulous recipe book.
DI-435/README.md
Outdated
```
python3 find_vcfs_to_merge.py \
--assay "*CEN38" \
--outfile_prefix CEN38 \
--run_mode find_qc
```
Example command to find TWE GRCh38 VCFs where the GRCh37 runs were between two dates:
```
python3 find_vcfs_to_merge.py \
--assay "*TWE38" \
--outfile_prefix TWE38 \
--run_mode find_qc \
--start 2024-03-15 \
--end 2024-08-04
```

**Searching Dates**
These dates restrict the projects collated to only GRCh38 projects
which have corresponding GRCh37 projects which were created within the specified dates.
Example command to find Medicover VCFs with no QC status files:
```
python3 find_vcfs_to_merge.py \
--assay "*TWE38M" \
--outfile_prefix TWE38M \
--run_mode no_qc
```
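The date restriction described under Searching Dates can be pictured as a simple filter on creation time. The following is only a minimal sketch, not the script's own code: the `created` key (assumed here to hold milliseconds since the epoch) and the helper name `within_dates` are hypothetical, and the matching of GRCh38 projects to their GRCh37 counterparts is left out.

```python
from datetime import datetime, timezone


def within_dates(projects, start=None, end=None):
    """Keep projects whose creation time falls between start and end.

    `projects` is a list of dicts with a hypothetical `created` key holding
    milliseconds since the epoch; `start` and `end` are YYYY-MM-DD strings,
    or None when that bound is not applied.
    """
    def to_ms(date_string):
        parsed = datetime.strptime(date_string, "%Y-%m-%d")
        return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)

    lower = to_ms(start) if start else None
    upper = to_ms(end) if end else None

    kept = []
    for project in projects:
        created = project["created"]
        if lower is not None and created < lower:
            continue
        if upper is not None and created > upper:
            continue
        kept.append(project)
    return kept


# Hypothetical projects, for illustration only
example = [
    {"name": "run_a", "created": 1709251200000},  # 2024-03-01 UTC
    {"name": "run_b", "created": 1714521600000},  # 2024-05-01 UTC
]
kept_names = [p["name"] for p in within_dates(example, "2024-03-15", "2024-08-04")]
print(kept_names)
```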
🧹 Nitpick (assertive)
Ah, these examples need the proper garnish, mon ami!
Just like we label our mise en place, we must label our code blocks! They're missing their language specifiers, and the presentation needs proper spacing.
Add language specifiers and proper spacing to the code blocks:
-```
+```bash
python3 find_vcfs_to_merge.py \
Apply this to all three code blocks and add blank lines before and after each block for proper presentation.
🧰 Tools
🪛 Markdownlint (0.35.0)
11-11: null
Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
16-16: null
Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
18-18: null
Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
28-28: null
Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
11-11: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)
18-18: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)
28-28: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)
Done.
DI-435/README.md
Outdated
#### Inputs
- `-a --assay`: The GRCh38 project search term for in DNAnexus, e.g. `"*CEN38"`.
- `-o --outfile_prefix`: Prefix to use to name the output files, e.g. `CEN38`.
- `-r --run_mode`: A choice of whether to find and use QC status files (`find_qc`) or not (`no_qc`).
In `find_qc` mode the following inputs are allowed:
- `-s --start (optional)`: A date used to find DNAnexus (GRCh37) projects created after, see searching dates section below.
- `-e --end (optional)`: A date used to find DNAnexus (GRCh37) projects created before, see searching dates section below.
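The inputs listed above map naturally onto an `argparse` parser. This is a hedged sketch rather than the script's actual parser: only the `--run_mode` definition (its `choices` and `required=True`) appears in the PR; marking `--assay` and `--outfile_prefix` as required, and the help strings, are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the README inputs (illustrative only)."""
    parser = argparse.ArgumentParser(description="Find VCFs to merge (sketch)")
    # required=True on the next two arguments is an assumption
    parser.add_argument("-a", "--assay", required=True,
                        help="GRCh38 project search term, e.g. '*CEN38'")
    parser.add_argument("-o", "--outfile_prefix", required=True,
                        help="Prefix used to name the output files")
    # choices and required=True match the snippet shown in the PR
    parser.add_argument("-r", "--run_mode", choices=["find_qc", "no_qc"],
                        required=True, help="Whether to use QC status files")
    # Optional date bounds, described in the README for find_qc mode
    parser.add_argument("-s", "--start", default=None,
                        help="Only GRCh37 projects created after this date")
    parser.add_argument("-e", "--end", default=None,
                        help="Only GRCh37 projects created before this date")
    return parser


args = build_parser().parse_args([
    "--assay", "*TWE38",
    "--outfile_prefix", "TWE38",
    "--run_mode", "find_qc",
    "--start", "2024-03-15",
    "--end", "2024-08-04",
])
print(args)
```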
🧹 Nitpick (assertive)
The ingredients list is perfect, but the presentation needs finesse!
Your parameter descriptions are clear as consommé, but the formatting could use a touch of refinement. Let's add proper spacing around the headings and lists, just as we maintain proper spacing between courses.
Add blank lines:
- Before and after the "Inputs" heading
- Before and after the parameter list
- Between the main parameters and the `find_qc` mode specific parameters
🧰 Tools
🪛 LanguageTool
[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e.g. "*CEN38"
. - -o --outfile_prefix
: Prefix to use to name the output files,...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~38-~38: Loose punctuation mark.
Context: ...t files, e.g. CEN38
. - -r --run_mode
: A choice of whether to find and use QC ...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~41-~41: Loose punctuation mark.
Context: ...section below. - -e --end (optional)
: A date used to find DNAnexus (GRCh37) p...
(UNLIKELY_OPENING_PUNCTUATION)
🪛 Markdownlint (0.35.0)
35-35: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
36-36: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
Done.
DI-435/README.md
Outdated
#### find_qc mode
How the script works in `find_qc` mode:
1. Finds all DNAnexus projects with suffix `--assay`.
2. Finds the related GRCh37 project (and between `--start` and `--end` dates, if provided) for each of these projects and reads in all of the QC status files into one merged dataframe. If multiple QC status files exist in a project, the one created last is used.
3. Finds all raw VCFs in each of the DNAnexus projects.
4. Splits these VCFs into a list of validation (including control) samples and non-validation samples based on naming conventions
-5. Removes any samples which are duplicates or any which failed QC at any time based on information within the QC status files.
-6. Creates a final list of all VCFs to merge and writes this out to file with`--outfile_prefix`.
+5. Removes the first instance of a sample which is duplicated or any which failed QC at any time based on information within the QC status files.
+6. Creates a final list of all VCFs to merge and writes this out to file with `--outfile_prefix`.
##### Outputs
- A TSV listing the VCF files for all non-validation samples to merge (named by `{outfile_prefix}_files_to_merge.txt`)
- A CSV of all validation samples found (`{outfile_prefix}_validation_samples.csv`)
- A CSV of all projects within search but missing QC file in DNAnexus and therefore not included (`{outfile_prefix}_projects_missing_QC.csv`).

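Step 2 of `find_qc` mode says that when a project holds several QC status files, the one created last is used. A minimal sketch of that selection follows; the nested `describe`/`created` layout and the file IDs are assumptions made for illustration, not taken from the script.

```python
def latest_qc_file(qc_files):
    """Return the most recently created file dict, or None for an empty list.

    Each entry is assumed to look like {"describe": {"id": ..., "created": ...}},
    with `created` as a number that grows over time (hypothetical layout).
    """
    if not qc_files:
        return None
    return max(qc_files, key=lambda qc_file: qc_file["describe"]["created"])


# Hypothetical QC status files found in one project
found = [
    {"describe": {"id": "file-aaa", "created": 1710000000000}},
    {"describe": {"id": "file-bbb", "created": 1720000000000}},
]
chosen = latest_qc_file(found)
print(chosen["describe"]["id"])
```

A project with no QC status file returns `None` here, which is where a caller could record it for the `{outfile_prefix}_projects_missing_QC.csv` output.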
#### no_qc mode
How the script works in `no_qc` mode:
1. Finds all DNAnexus projects with suffix `--assay`
2. Finds all raw VCFs in each of the DNAnexus projects.
3. Removes all samples which are duplicates.
4. Creates a final list of all VCFs to merge and writes this to file with `--outfile_prefix`.
##### Outputs
- A TSV listing the VCF files to merge (named by `{outfile_prefix}_files_to_merge.txt`)
- A TSV of all duplicate files which have been removed (`{outfile_prefix}_all_dup_rows.txt`)
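The duplicate handling in `no_qc` mode (every copy of a duplicated sample is dropped) can be sketched as below. This is an illustration under assumptions, not the script's implementation: the function name `split_duplicates` is hypothetical and the example rows only borrow the `sample`, `project` and `file_id` keys shown in the PR.

```python
from collections import Counter


def split_duplicates(sample_vcfs):
    """Split rows into samples seen exactly once and all rows of duplicated samples."""
    counts = Counter(row["sample"] for row in sample_vcfs)
    unique_rows = [row for row in sample_vcfs if counts[row["sample"]] == 1]
    dup_rows = [row for row in sample_vcfs if counts[row["sample"]] > 1]
    return unique_rows, dup_rows


# Hypothetical rows, for illustration only
rows = [
    {"sample": "S1", "project": "project-x", "file_id": "file-1"},
    {"sample": "S2", "project": "project-x", "file_id": "file-2"},
    {"sample": "S1", "project": "project-y", "file_id": "file-3"},
]
to_merge, removed = split_duplicates(rows)
print([row["file_id"] for row in to_merge])
print([row["file_id"] for row in removed])
```

In this picture `to_merge` would feed `{outfile_prefix}_files_to_merge.txt` and `removed` would feed `{outfile_prefix}_all_dup_rows.txt`.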
🧹 Nitpick (assertive)
The recipe is perfect, but the plating needs attention!
Your mode descriptions are as detailed as a complex reduction sauce, but we've got some unnecessary blank lines that would never pass in my kitchen! And those headings need proper spacing, just like we space our courses!
- Remove the double blank lines around line 59
- Add proper spacing around all headings and lists
- Ensure consistent formatting between both mode sections
🧰 Tools
🪛 LanguageTool
[style] ~49-~49: Consider removing “of” to be more concise
Context: ...for each of these projects and reads in all of the QC status files into one merged datafra...
(ALL_OF_THE)
🪛 Markdownlint (0.35.0)
59-59: Expected: 1; Actual: 2
Multiple consecutive blank lines
(MD012, no-multiple-blanks)
46-46: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
54-54: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
54-54: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
60-60: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
66-66: Expected: 1; Actual: 0; Above
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
66-66: Expected: 1; Actual: 0; Below
Headings should be surrounded by blank lines
(MD022, blanks-around-headings)
48-48: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
53-53: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
55-55: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
62-62: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
65-65: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
67-67: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
Done.
This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:
- `find_qc` searches for GRCh38 projects and finds the related GRCh37 project to retrieve the QC status file (used to remove failed samples) and keep only the first instance of any duplicated samples.
- `no_qc` does not search for QC status files (keeps all files) but removes any duplicated samples completely.
🧹 Nitpick (assertive)
Magnifique! The description is perfectly seasoned, but let's garnish it properly!
The explanation of the two modes is chef's kiss perfect, but just like a properly plated dish needs proper spacing, our markdown needs proper formatting.
Add a blank line before the list for proper markdown presentation:
This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:
+
- `find_qc` searches for GRCh38 projects...
🧰 Tools
🪛 Markdownlint (0.35.0)
5-5: null
Lists should be surrounded by blank lines
(MD032, blanks-around-lists)
DI-435/README.md
Outdated
```
bash merge_VCF_AF.sh \
CEN38_files_to_merge.txt \
GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
```

Output:
-Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
+Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
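The updated naming rule (output named from the input files_to_merge txt file) can be illustrated in a few lines. The real logic lives in the bash script `merge_VCF_AF.sh`; this Python version and the helper name `merged_output_names` are only an illustrative assumption of how the names relate.

```python
from pathlib import Path


def merged_output_names(files_to_merge):
    """Derive the merged VCF and index names from the input list's file name."""
    stem = Path(files_to_merge).stem  # file name without the .txt extension
    vcf_name = f"final_merged_{stem}.vcf.gz"
    return vcf_name, f"{vcf_name}.tbi"


vcf_name, index_name = merged_output_names("CEN38_files_to_merge.txt")
print(vcf_name)
print(index_name)
```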
🧹 Nitpick (assertive)
The final course needs its proper presentation, non?
The bash script example is like a perfect soufflé, but it's missing its proper garnish! And where's our proper ending? Every perfect meal needs a proper finish!
- Add the bash language specifier to the code block:
-```
+```bash
bash merge_VCF_AF.sh \
- Ensure the file ends with a single newline character
- Add proper spacing around the code block
🧰 Tools
🪛 Markdownlint (0.35.0)
76-76: null
Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
76-76: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)
83-83: null
Files should end with a single newline character
(MD047, single-trailing-newline)
Done.
PS4='\000[$(date)]\011'
export TZ=Europe/London
🛠️ Refactor suggestion
Refine the PS4 prompt for accurate command tracing
Darling, while you've seasoned the script with timestamps for command tracing, the use of `\000` and `\011` may not serve us the rich flavour we're aiming for. Let's adjust the `PS4` variable to ensure our logs are as impeccable as our finest dishes.
Apply this diff to correct the `PS4` prompt:
-PS4='\000[$(date)]\011'
+PS4='[$(date +"%Y-%m-%d %H:%M:%S")] '
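For readers unfamiliar with the escapes in the original `PS4` value: `\000` and `\011` are octal character codes, for NUL and horizontal tab respectively. Python string literals happen to accept the same octal notation, so the mapping can be checked quickly. This is only a sanity check of the character codes and says nothing about how bash renders the trace prompt.

```python
# Octal escapes as written in the original PS4 value
nul = "\000"  # octal 000: the NUL character
tab = "\011"  # octal 011: decimal 9, the horizontal tab

print(repr(nul), repr(tab))
print(ord(nul), ord(tab))
```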
This works fine as is
@rklocke, understood. I'll note this for future reviews.
✏️ Learnings added
Learnt from: rklocke
PR: eastgenomics/RD_requests#19
File: DI-435/merge_VCF_AF.sh:19-20
Timestamp: 2024-11-29T09:56:53.623Z
Learning: In `DI-435/merge_VCF_AF.sh`, the PS4 variable is set to `PS4='\000[$(date)]\011'` and should be kept as is.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
export TZ=Europe/London

# set frequency of instance usage in logs to 10 seconds
sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')
🛠️ Refactor suggestion
Improve process termination using 'pkill' and eliminate 'sudo'
Listen, we must be precise when pruning our processes. Using `pkill` tidies up our approach wonderfully, and let's avoid `sudo` to keep our kitchen safe from unintended side effects.
Apply this diff to improve the command:
-sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')
+pkill pcp-dstat
🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 23-23: Quote this to prevent word splitting.
(SC2046)
[info] 23-23: Consider using pgrep instead of grepping ps output.
(SC2009)
This could be nice if it works?
@RSWilson1, absolutely, it works like a charm! Replacing `sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')` with `pkill pcp-dstat` is a more elegant and efficient way to terminate the `pcp-dstat` process. It simplifies the script and enhances safety by eliminating the need for `sudo`. In my kitchen, we always strive for precision and efficiency. Let's incorporate this refined approach to keep our script running smoothly!
I added this part as advised by Jethro and DNAnexus didn't allow kill without sudo - I'm not sure I understand enough to know if changing this is fine..
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
yeah that's fine then
print(
    f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}"
🧹 Nitpick (assertive)
Add a trailing comma after the print arguments
Including a trailing comma enhances code readability and complies with style guidelines.
Suggested change:
print(
f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}"
+ )
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.8.0)
477-477: Trailing comma missing
Add trailing comma
(COM812)
Not needed
Understood, @rklocke.
(_/)
( ;_;)
✏️ Learnings added
Learnt from: rklocke
PR: eastgenomics/RD_requests#19
File: DI-435/find_vcfs_to_merge.py:480-481
Timestamp: 2024-11-29T09:56:58.290Z
Learning: Trailing commas after print function arguments are not needed in this codebase.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
DI-435/find_vcfs_to_merge.py
Outdated
parser.add_argument(
    "-r",
    "--run_mode",
    choices=['find_qc', 'no_qc'],
    required=True,
    help=(
        "Runmode - either 'find_qc' where QC status files are "
        "searched for and used to find failed samples, or 'no_qc'"
        "where no QC status files are searched for"
    )
)
Mind the spacing and add a trailing comma in the help message
The concatenated strings in the `help` parameter are missing a space, causing words to run together in the command-line help output. Additionally, a trailing comma is missing after the `help` parameter, which improves code style compliance.
Please apply the following fix:
help=(
"Runmode - either 'find_qc' where QC status files are "
"searched for and used to find failed samples, or 'no_qc'"
- "where no QC status files are searched for"
+ " where no QC status files are searched for"
),
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.8.0)
60-60: Trailing comma missing
Add trailing comma
(COM812)
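The spacing issue raised above comes from Python joining adjacent string literals with nothing in between. A short sketch using the help text from the PR shows the difference; the variable names `broken` and `fixed` are just for illustration.

```python
# Adjacent string literals are concatenated exactly as written,
# so a missing space at the seam makes two words run together.
broken = (
    "searched for and used to find failed samples, or 'no_qc'"
    "where no QC status files are searched for"
)

# Adding a leading space to the second literal fixes the seam.
fixed = (
    "searched for and used to find failed samples, or 'no_qc'"
    " where no QC status files are searched for"
)

print(broken)
print(fixed)
```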
Done.
DI-435/find_vcfs_to_merge.py
Outdated
def find_medicover_vcf_files(projects):
    """
    Find VCF files in a project
    Parameters
    ----------
    project_id : str
        DX project ID
    Returns
    -------
    vcf_files : list
        list of dicts, each representing a VCF file in DX
    """
    sub_remove = {
        '_Wdh': '',
        '_Wdh2': '',
        '_Whd3': '',
    }
    all_sample_vcfs = []
    for project in projects:
        vcf_files = find_data(
            "*_markdup_recalibrated_Haplotyper.vcf.gz", project['id']
        )
        for vcf_file in vcf_files:
            file_id = vcf_file["describe"]["id"]
            file_name = vcf_file["describe"]["name"]
            sample_name = re.sub("|".join(
                sub_remove), lambda x: sub_remove[x.group(0)],
                file_name.split('-TwistWE')[0]
            )

            all_sample_vcfs.append(
                {
                    "sample": sample_name,
                    "project": project["describe"]["id"],
                    "file_id": file_id
                }
            )

    print(
        f"\nFound {len(all_sample_vcfs)} VCF files in GRCh38 projects"
    )

    return all_sample_vcfs
🛠️ Refactor suggestion
Add type annotations and correct the docstring in find_medicover_vcf_files
The function `find_medicover_vcf_files` is missing type annotations for its parameter and return type, and the docstring contains inaccuracies. Adding type annotations improves code clarity and assists with static analysis. Updating the docstring ensures accurate documentation.
Please consider the following changes:
+from typing import List, Dict
-def find_medicover_vcf_files(projects):
+def find_medicover_vcf_files(projects: List[dict]) -> List[dict]:
"""
- Find VCF files in a project
+ Find VCF files in specified projects
Parameters
----------
- project_id : str
- DX project ID
+ projects : List[dict]
+ List of DNAnexus project dictionaries
Returns
-------
- vcf_files : list
+ all_sample_vcfs : List[dict]
List of dictionaries, each representing a VCF file in DNAnexus
"""
🧰 Tools
🪛 Ruff (0.8.0)
355-355: Missing return type annotation for public function find_medicover_vcf_files (ANN201)
355-355: Missing type annotation for function argument projects (ANN001)
377, 384, 391, 392, 396: Trailing comma missing; add trailing comma (COM812)
I have added extra detail but for consistency I will leave the type hints out
Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @coderabbitai[bot] and @rklocke)
DI-435/find_vcfs_to_merge.py
line 357 at r1 (raw file):
def find_medicover_vcf_files(projects): """ Find VCF files in a project
maybe add some more context to this docstring
DI-435/find_vcfs_to_merge.py
line 369 at r1 (raw file):
list of dicts, each representing a VCF file in DX """ sub_remove = {
this naming could be clearer
DI-435/find_vcfs_to_merge.py
line 381 at r1 (raw file):
for vcf_file in vcf_files: file_id = vcf_file["describe"]["id"] file_name = vcf_file["describe"]["name"]
maybe a comment to explain the logic?
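The substitution step queried above could be illustrated roughly as follows. This is a minimal sketch rather than the PR's code: `strip_suffixes` is a hypothetical helper and the file names are made up for illustration. It also shows one point that may be worth checking when dictionary keys are joined into a regex alternation: Python tries alternatives left to right, so a shorter key listed before a longer key sharing its prefix (`_Wdh` before `_Wdh2`) wins the match and leaves the trailing character behind.

```python
import re

# Suffix keys copied from the PR; each maps to its replacement (empty string).
suffixes = {
    "_Wdh": "",
    "_Wdh2": "",
    "_Whd3": "",
}


def strip_suffixes(file_name, replacements):
    """Take the text before '-TwistWE', then strip any known suffix from it."""
    stem = file_name.split("-TwistWE")[0]
    # Try longer keys first so '_Wdh2' is matched before its prefix '_Wdh'.
    ordered = sorted(replacements, key=len, reverse=True)
    pattern = "|".join(re.escape(key) for key in ordered)
    return re.sub(pattern, lambda m: replacements[m.group(0)], stem)


# Hypothetical file names, for illustration only.
name = "SAMPLE1_Wdh2-TwistWE_markdup_recalibrated_Haplotyper.vcf.gz"
print(strip_suffixes(name, suffixes))               # SAMPLE1
# With the keys joined in insertion order, '_Wdh' matches first:
print(re.sub("_Wdh|_Wdh2|_Whd3", "", "SAMPLE1_Wdh2"))  # SAMPLE12
```

Whether this ordering matters in practice depends on the real sample names, which are not shown in the PR.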
DI-435/merge_VCF_AF.sh
line 94 at r1 (raw file):
mkdir -p norm echo "Normalising VCFs" find . -maxdepth 1 -name '*.vcf.gz' -print0 | xargs -0 -P "${num_proc}" -I{} bcftools norm -m -any -f "${genome}" -Oz "{}" -o norm/"$(basename {})"
NICE! I was trying to get this to work but kept getting odd errors.
DI-435/merge_VCF_AF.sh
line 134 at r1 (raw file):
dx upload "final_merged_${file_prefix}.vcf.gz" dx upload "final_merged_${file_prefix}.vcf.gz.tbi" _get_peak_usage
This is nifty 👍
export TZ=Europe/London

# set frequency of instance usage in logs to 10 seconds
sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')
This could be nice if it works?
Actionable comments posted: 6
🧹 Outside diff range comments (1)
DI-435/merge_VCF_AF.sh (1)
Line range hint 92-133: Add error handling for critical file operations
The file operations lack proper error checking, which could lead to silent failures.
Consider adding checks for critical operations:
 mkdir -p norm
+if [ ! -d "norm" ]; then
+    echo "ERROR: Failed to create normalisation directory"
+    exit 1
+fi
+
 echo "Normalising VCFs"
-find . -maxdepth 1 -name '*.vcf.gz' -print0 | xargs -0 -P "${num_proc}" -I{} bcftools norm -m -any -f "${genome}" -Oz "{}" -o norm/"$(basename {})"
+find . -maxdepth 1 -name '*.vcf.gz' -print0 | xargs -0 -P "${num_proc}" -I{} bash -c '
+    if ! bcftools norm -m -any -f "${genome}" -Oz "{}" -o norm/"$(basename {})"; then
+        echo "ERROR: Failed to normalise {}"
+        exit 1
+    fi
+'
+
+# Verify files were created
+if [ ! "$(ls -A norm/*.vcf.gz 2>/dev/null)" ]; then
+    echo "ERROR: No normalised VCF files were created"
+    exit 1
+fi
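For comparison, the same "fail loudly" idea can be sketched in Python instead of bash. This is only an illustration under stated assumptions: `build_norm_command` and `normalise_all` are hypothetical helpers, not part of the repository, and the default of four workers is arbitrary. The `bcftools norm` flags are the ones already used in the script.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_norm_command(vcf, genome, out_dir):
    """Build the bcftools norm command for one VCF (same flags as the bash script)."""
    out_path = Path(out_dir) / Path(vcf).name
    return [
        "bcftools", "norm", "-m", "-any", "-f", str(genome),
        "-Oz", str(vcf), "-o", str(out_path),
    ]


def normalise_all(vcfs, genome, out_dir="norm", num_proc=4):
    """Normalise VCFs in parallel and raise if any bcftools call fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = [build_norm_command(vcf, genome, out_dir) for vcf in vcfs]
    with ThreadPoolExecutor(max_workers=num_proc) as pool:
        # check=True raises CalledProcessError on a non-zero exit status,
        # and consuming the map results re-raises it in the caller.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    outputs = sorted(Path(out_dir).glob("*.vcf.gz"))
    if not outputs:
        raise RuntimeError("No normalised VCF files were created")
    return outputs
```

Passing arguments as a list avoids the shell quoting questions that arise when variables such as `${genome}` are placed inside a single-quoted `bash -c` string.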
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
📒 Files selected for processing (3)
DI-435/README.md (1 hunks)
DI-435/find_vcfs_to_merge.py (5 hunks)
DI-435/merge_VCF_AF.sh (4 hunks)
🧰 Additional context used
📓 Learnings (1)
DI-435/merge_VCF_AF.sh (4)
Learnt from: rklocke
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:58-63
Timestamp: 2024-11-22T12:19:56.287Z
Learning: When improving bash scripts, prefer adding `set -exo pipefail` at the top to enhance error handling and robustness.
Learnt from: RSWilson1
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:60-60
Timestamp: 2024-11-25T16:12:16.377Z
Learning: In `DI-435/merge_VCF_AF.sh`, the script handles cases where no VCF files are found by validating the input file (`input_file`) at the beginning, so additional checks later in the script are unnecessary.
Learnt from: RSWilson1
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:76-78
Timestamp: 2024-11-25T16:01:30.595Z
Learning: In the script `DI-435/merge_VCF_AF.sh`, the presence of VCF files is already ensured by checking during the file download process.
Learnt from: RSWilson1
PR: eastgenomics/RD_requests#17
File: DI-435/merge_VCF_AF.sh:60-60
Timestamp: 2024-11-25T16:01:34.002Z
Learning: In `DI-435/merge_VCF_AF.sh`, the script handles cases where no VCF files are found by checking during the file download step, so additional checks later in the script are unnecessary.
🪛 LanguageTool
DI-435/README.md
[uncategorized] ~37-~37: Loose punctuation mark. Context: ...e no_qc ``` #### Inputs - -a --assay: The GRCh38 project search term for in D... (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~38-~38: Loose punctuation mark. Context: ...e.g. "*CEN38". - -o --outfile_prefix: Prefix to use to name the output files,... (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~39-~39: Loose punctuation mark. Context: ...t files, e.g. CEN38. - -r --run_mode: A choice of whether to find and use QC ... (UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~43-~43: Loose punctuation mark. Context: ...section below. - -e --end (optional): A date used to find DNAnexus (GRCh37) p... (UNLIKELY_OPENING_PUNCTUATION)
[style] ~51-~51: Consider removing “of” to be more concise. Context: ...for each of these projects and reads in all of the QC status files into one merged datafra... (ALL_OF_THE)
[uncategorized] ~72-~72: The preposition “on” seems more likely in this position than the preposition “in”. Context: ...pt to merge VCFs The bash script is run in a DNAnexus cloud workstation and requir... (AI_EN_LECTOR_REPLACEMENT_PREPOSITION_IN_ON)
🪛 Markdownlint (0.35.0)
DI-435/README.md
42, 43: Unordered list indentation; Expected: 0; Actual: 2 (MD007, ul-indent)
8 (below), 9 (above, below), 48 (below), 56 (above, below), 61 (below), 67 (above, below), 71 (below): Headings should be surrounded by blank lines; Expected: 1; Actual: 0 (MD022, blanks-around-headings)
11, 16, 18, 28, 77: Fenced code blocks should be surrounded by blank lines (MD031, blanks-around-fences)
5, 42, 50, 55, 57, 63, 66, 68, 73: Lists should be surrounded by blank lines (MD032, blanks-around-lists)
84: Files should end with a single newline character (MD047, single-trailing-newline)
🪛 Ruff (0.8.0)
DI-435/find_vcfs_to_merge.py
355-355: Missing return type annotation for public function find_medicover_vcf_files (ANN201)
355-355: Missing type annotation for function argument projects (ANN001)
60, 218, 378, 388, 395, 396, 400, 481, 487, 500, 517, 527, 536, 542, 547, 552, 560, 573, 580, 589, 595, 599, 602, 608: Trailing comma missing; add trailing comma (COM812)
🪛 Shellcheck (0.10.0)
DI-435/merge_VCF_AF.sh
[warning] 23-23: Quote this to prevent word splitting. (SC2046)
[info] 23-23: Consider using pgrep instead of grepping ps output. (SC2009)
🔇 Additional comments (7)
DI-435/merge_VCF_AF.sh (2)
15-20: 🛠️ Refactor suggestion
Optimise logging setup for improved reliability
The logging configuration requires refinement:
- The PS4 format using octal escapes (\000, \011) may not be portable across all environments
- The timestamp format lacks millisecond precision, which could be crucial for debugging
Apply this diff to enhance logging reliability:
-PS4='\000[$(date)]\011'
+PS4='[$(date +"%Y-%m-%d %H:%M:%S.%3N")] '
22-24: 🛠️ Refactor suggestion
Improve process termination using 'pkill'
The current process termination approach is overly complex and potentially fragile.
DI-435/README.md (2)
4-6: Implementation modes are well-documented.
The distinction between `find_qc` and `no_qc` modes is clearly explained, providing users with a solid understanding of the operational differences.
48-69: Mode-specific documentation is thorough and well-structured.
The step-by-step process and output documentation for both modes provide clear guidance for users. The consistent file naming convention across modes is particularly helpful.
DI-435/find_vcfs_to_merge.py (3)
60-60: 🛠️ Refactor suggestion
Add missing trailing comma after the `help` parameter
Adding a trailing comma improves code style compliance and may prevent potential issues.
Apply this diff:
)
+ ,
355-355: 🛠️ Refactor suggestion
Add type annotations to `find_medicover_vcf_files`
Adding type annotations improves code clarity and assists with static analysis.
Apply these changes:
- Import `List` from `typing` at the top of the file:
+from typing import List
- Update the function definition:
-def find_medicover_vcf_files(projects):
+def find_medicover_vcf_files(projects: List[dict]) -> List[dict]:
481-481: 🛠️ Refactor suggestion
Add missing trailing comma after the `print` argument
Including a trailing comma enhances code readability and complies with style guidelines.
Apply this diff:
print(
f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}"
+ ,
)
while IFS=$'\t' read -r _ field2 field3; do
    ((line_number++))

    # Validate fields are not empty
    if [ -z "$field2" ] || [ -z "$field3" ]; then
        echo "ERROR: Missing project or file ID at line $line_number"
        exit 1
    fi

    # Validate field format (assuming they should match dx:// format)
    if [[ ! "$field2" =~ ^project-.* ]] || [[ ! "$field3" =~ ^file-.* ]]; then
        echo "ERROR: Invalid project or file ID format at line $line_number"
        exit 1
    fi
    project_files+=("${field2}:${field3}")
🧹 Nitpick (assertive)
Enhance input validation robustness
While the validation logic is sound, we can improve its maintainability and clarity.
Consider this enhancement:
+# Define validation patterns
+readonly PROJECT_ID_PATTERN='^project-[a-zA-Z0-9]{24}$'
+readonly FILE_ID_PATTERN='^file-[a-zA-Z0-9]{24}$'
+
while IFS=$'\t' read -r _ field2 field3; do
((line_number++))
# Validate fields are not empty
if [ -z "$field2" ] || [ -z "$field3" ]; then
echo "ERROR: Missing project or file ID at line $line_number"
exit 1
fi
# Validate field format
- if [[ ! "$field2" =~ ^project-.* ]] || [[ ! "$field3" =~ ^file-.* ]]; then
+ if [[ ! "$field2" =~ $PROJECT_ID_PATTERN ]] || [[ ! "$field3" =~ $FILE_ID_PATTERN ]]; then
echo "ERROR: Invalid project or file ID format at line $line_number"
exit 1
fi
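The stricter check suggested above can also be sketched in Python, which makes it easy to try against sample lines. This is a minimal sketch, not code from the PR: `parse_line` is a hypothetical helper, and the assumption that IDs are a prefix followed by exactly 24 alphanumeric characters is taken from the suggested patterns rather than verified here.

```python
import re

# Patterns mirror the ones proposed in the review (24 alphanumeric characters assumed).
PROJECT_ID_PATTERN = re.compile(r"^project-[a-zA-Z0-9]{24}$")
FILE_ID_PATTERN = re.compile(r"^file-[a-zA-Z0-9]{24}$")


def parse_line(line, line_number):
    """Return 'project-xxx:file-xxx' for one tab-separated line, or raise ValueError."""
    fields = line.rstrip("\n").split("\t")
    # The first column is ignored, as in the bash loop; columns 2 and 3 are required.
    if len(fields) < 3 or not fields[1] or not fields[2]:
        raise ValueError(f"Missing project or file ID at line {line_number}")
    project_id, file_id = fields[1], fields[2]
    if not PROJECT_ID_PATTERN.match(project_id) or not FILE_ID_PATTERN.match(file_id):
        raise ValueError(f"Invalid project or file ID format at line {line_number}")
    return f"{project_id}:{file_id}"
```

Anchoring both ends of the pattern rejects truncated IDs that the original `^project-.*` check would accept.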
_get_peak_usage() {
    : '''
    Reports the peak memory and storage usage from dstat, to be called at end of script
    '''
    dx watch "$DX_JOB_ID" --no-follow --quiet > job.log

    peak_mem=$(grep 'INFO CPU' job.log | cut -d':' -f5 | cut -d'/' -f1 | sort -n | tail -n1)
    total_mem="$(($(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*')/1024))"

    peak_storage=$(grep 'INFO CPU' job.log | cut -d':' -f6 | cut -d'/' -f1 | sort -n | tail -n1)
    total_storage=$(df -Pk / | awk 'NR == 2' | awk '{printf("%.0f", $2/1024/1024)}')

    echo "Memory usage peaked at ${peak_mem}/${total_mem}MB"
    echo "Storage usage peaked at ${peak_storage}/${total_storage}GB"
}
🛠️ Refactor suggestion
Enhance error handling in peak usage monitoring
The peak usage function requires additional error handling to ensure reliability:
- No validation of dx watch command success
- Missing checks for memory info availability
- No cleanup of temporary job.log file
Apply this enhancement:
_get_peak_usage() {
: '''
Reports the peak memory and storage usage from dstat, to be called at end of script
'''
- dx watch "$DX_JOB_ID" --no-follow --quiet > job.log
+ if ! dx watch "$DX_JOB_ID" --no-follow --quiet > job.log; then
+ echo "ERROR: Failed to retrieve job metrics"
+ return 1
+ fi
+
+ if [ ! -s job.log ]; then
+ echo "ERROR: Job metrics log is empty"
+ return 1
+ fi
peak_mem=$(grep 'INFO CPU' job.log | cut -d':' -f5 | cut -d'/' -f1 | sort -n | tail -n1)
- total_mem="$(($(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*')/1024))"
+ mem_total=$(grep MemTotal /proc/meminfo | grep --only-matching '[0-9]*' || echo 0)
+ if [ "$mem_total" -eq 0 ]; then
+ echo "ERROR: Failed to retrieve total memory"
+ return 1
+ fi
+ total_mem="$((mem_total/1024))"
peak_storage=$(grep 'INFO CPU' job.log | cut -d':' -f6 | cut -d'/' -f1 | sort -n | tail -n1)
total_storage=$(df -Pk / | awk 'NR == 2' | awk '{printf("%.0f", $2/1024/1024)}')
echo "Memory usage peaked at ${peak_mem}/${total_mem}MB"
echo "Storage usage peaked at ${peak_storage}/${total_storage}GB"
+ rm -f job.log
}
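The two calculations that the suggestion guards (total memory from `/proc/meminfo`, and the maximum of a column of values) can be sketched in Python to show the empty-input cases explicitly. This is an illustration only: `total_memory_mb` and `peak_value` are hypothetical helpers, and the exact layout of the `INFO CPU` log lines is not reproduced here because it is not shown in the PR.

```python
import re


def total_memory_mb(meminfo_text):
    """Return MemTotal in MB from /proc/meminfo-style text, or None if absent."""
    match = re.search(r"^MemTotal:\s+(\d+)\s*kB", meminfo_text, flags=re.MULTILINE)
    if match is None:
        return None
    # /proc/meminfo reports kB; integer division mirrors the bash $(( ... / 1024 )).
    return int(match.group(1)) // 1024


def peak_value(values):
    """Return the largest numeric value from an iterable of strings, or None if empty."""
    numbers = [float(value) for value in values if value.strip()]
    # Equivalent in spirit to `sort -n | tail -n1`, but explicit about empty input.
    return max(numbers) if numbers else None


sample = "MemTotal:       16384000 kB\nMemFree:         1024000 kB\n"
print(total_memory_mb(sample))           # 16000
print(peak_value(["512", "2048", "1024"]))  # 2048.0
```

Returning `None` for missing data lets the caller decide whether to report an error or skip the usage summary.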
Example command to find Medicover VCFs with no QC status files:
```bash
python3 find_vcfs_to_merge.py \
--assay "*TWE38M" \
--outfile_prefix TWE38M \
--run_mode no_qc
```
🧹 Nitpick (assertive)
Consider clarifying date parameter usage in `no_qc` mode.
While the Medicover example effectively demonstrates the `no_qc` mode, it would be helpful to explicitly state whether date parameters are supported or ignored in this mode.
Already clarified that the inputs are only allowed in find_qc mode
@rklocke, understood. Thanks for clarifying.
✏️ Learnings added
Learnt from: rklocke
PR: eastgenomics/RD_requests#19
File: DI-435/README.md:27-33
Timestamp: 2024-11-29T09:56:48.931Z
Learning: In the `find_vcfs_to_merge.py` script within `DI-435/`, the date parameters `--start` and `--end` are only allowed in the `find_qc` mode.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
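One way such a mode-dependent rule could be enforced is sketched below with `argparse`. This is a hedged illustration, not the script's actual argument handling: the option names follow the README, but marking `--assay`, `--outfile_prefix` and `--run_mode` as required, and rejecting dates with `parser.error`, are assumptions made for the example.

```python
import argparse


def parse_args(argv=None):
    """Parse CLI options; dates are only accepted when run_mode is find_qc."""
    parser = argparse.ArgumentParser(description="Find VCFs to merge (sketch)")
    parser.add_argument("-a", "--assay", required=True)
    parser.add_argument("-o", "--outfile_prefix", required=True)
    parser.add_argument(
        "-r", "--run_mode", required=True, choices=["find_qc", "no_qc"]
    )
    parser.add_argument("-s", "--start", default=None)
    parser.add_argument("-e", "--end", default=None)
    args = parser.parse_args(argv)
    # Assumed behaviour: reject date filters outside find_qc mode
    # (parser.error prints usage and exits with status 2).
    if args.run_mode == "no_qc" and (args.start or args.end):
        parser.error("--start and --end are only allowed in find_qc mode")
    return args


args = parse_args(["-a", "*TWE38M", "-o", "TWE38M", "-r", "no_qc"])
print(args.run_mode)  # no_qc
```

Failing at parse time keeps a stray `--start` from being silently ignored in `no_qc` mode.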
- `-a --assay`: The GRCh38 project search term for in DNAnexus, e.g. `"*CEN38"`.
- `-o --outfile_prefix`: Prefix to use to name the output files, e.g. `CEN38`.
- `-r --run_mode`: A choice of whether to find and use QC status files (`find_qc`) or not (`no_qc`).

In `find_qc` mode the following inputs are allowed:
- `-s --start (optional)`: A date used to find DNAnexus (GRCh37) projects created after, see searching dates section below.
- `-e --end (optional)`: A date used to find DNAnexus (GRCh37) projects created before, see searching dates section below.
🧹 Nitpick (assertive)
Consider marking required parameters.
While the documentation clearly explains each parameter, it would be beneficial to explicitly mark which parameters are required for each mode. For example:
- `-a --assay`: [REQUIRED] The GRCh38 project search term...
- `-o --outfile_prefix`: [REQUIRED] Prefix to use...
- `-r --run_mode`: [REQUIRED] A choice of whether...
It's clear already
@rklocke, understood.
(_/)
( ._.)
('')('')
### Bash script to merge VCFs
The bash script is run in a DNAnexus cloud workstation and requires positional inputs of:
- The output file generated from the Python script above
- The job ID for the cloud workstation running
- The reference genome for GRCh38

Example bash script command:
`bash merge_VCF_AF.sh CEN38_vcf_to_merge.txt job-Gjb39Z04bxf82XZ12gPJ2bbV GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta`
```bash
bash merge_VCF_AF.sh \
CEN38_files_to_merge.txt \
GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
```

Output:
Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
🧹 Nitpick (assertive)
Consider adding operational prerequisites.
While the bash script documentation is clear, it would be beneficial to add:
- Required permissions or roles for DNAnexus cloud workstation access
- Estimated runtime or resource requirements
- Any cleanup steps needed after execution
Not required
@rklocke, understood.
✏️ Learnings added
Learnt from: rklocke
PR: eastgenomics/RD_requests#19
File: DI-435/README.md:84-84
Timestamp: 2024-11-29T09:56:55.375Z
Learning: In `DI-435/README.md`, operational prerequisites for the bash script `merge_VCF_AF.sh`, such as required permissions, estimated runtime, resource requirements, and cleanup steps, are not required in the documentation.
'_Wdh': '',
'_Wdh2': '',
'_Whd3': '',
}
Correct the potential typo in `strings_to_remove` dictionary
The key `'_Whd3'` might be a typo and should perhaps be `'_Wdh3'` to maintain consistency with `'_Wdh'` and `'_Wdh2'`.
Apply this diff to correct the typo:
strings_to_remove = {
'_Wdh': '',
'_Wdh2': '',
- '_Whd3': '',
+ '_Wdh3': '',
}
Actually not a typo
@rklocke, understood. Thanks for the clarification; I'll note that `'_Whd3'` is correct.
✏️ Learnings added
Learnt from: rklocke
PR: eastgenomics/RD_requests#19
File: DI-435/find_vcfs_to_merge.py:371-374
Timestamp: 2024-11-29T09:57:02.044Z
Learning: In `DI-435/find_vcfs_to_merge.py`, within the `strings_to_remove` dictionary in the `find_medicover_vcf_files` function, the key `'_Whd3'` is correct and intentional.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Dismissed @coderabbitai[bot] from a discussion.
Reviewable status: 0 of 3 files reviewed, 19 unresolved discussions (waiting on @coderabbitai[bot] and @RSWilson1)
DI-435/find_vcfs_to_merge.py
line 357 at r1 (raw file):
Previously, RSWilson1 (Robert Wilson) wrote…
maybe add some more context to this docstring
Done.
DI-435/find_vcfs_to_merge.py
line 369 at r1 (raw file):
Previously, RSWilson1 (Robert Wilson) wrote…
this naming could be clearer
Done.
DI-435/find_vcfs_to_merge.py
line 381 at r1 (raw file):
Previously, RSWilson1 (Robert Wilson) wrote…
maybe a comment to explain the logic?
Done.
DI-435/find_vcfs_to_merge.py
Outdated
parser.add_argument(
    "-r",
    "--run_mode",
    choices=['find_qc', 'no_qc'],
    required=True,
    help=(
        "Runmode - either 'find_qc' where QC status files are "
        "searched for and used to find failed samples, or 'no_qc'"
        "where no QC status files are searched for"
    )
)
Done.
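For reference, a minimal sketch of how a `--run_mode` choice could gate the date options, given that the README only allows `--start` and `--end` in `find_qc` mode. The `parse_args` helper, the `required=True` on `--assay` and `--outfile_prefix`, and the final `parser.error` check are assumptions for illustration and are not taken from this PR.

```python
import argparse


def parse_args(argv=None):
    """Parse arguments; --start/--end are only accepted in find_qc mode."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-a", "--assay", required=True)
    parser.add_argument("-o", "--outfile_prefix", required=True)
    parser.add_argument(
        "-r", "--run_mode", choices=["find_qc", "no_qc"], required=True
    )
    parser.add_argument("-s", "--start", default=None)
    parser.add_argument("-e", "--end", default=None)
    args = parser.parse_args(argv)

    # Assumed check, not from the PR: reject date filters in no_qc mode
    if args.run_mode == "no_qc" and (args.start or args.end):
        parser.error("--start and --end are only allowed in find_qc mode")
    return args


args = parse_args(["-a", "*TWE38M", "-o", "TWE38M", "-r", "no_qc"])
print(args.run_mode)
```

Using `parser.error` keeps the failure consistent with argparse's own handling of an invalid `choices` value: a usage message on stderr and exit code 2.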
DI-435/find_vcfs_to_merge.py
Outdated
def find_medicover_vcf_files(projects):
    """
    Find VCF files in a project

    Parameters
    ----------
    project_id : str
        DX project ID

    Returns
    -------
    vcf_files : list
        list of dicts, each representing a VCF file in DX
    """
    sub_remove = {
        '_Wdh': '',
        '_Wdh2': '',
        '_Whd3': '',
    }
    all_sample_vcfs = []
    for project in projects:
        vcf_files = find_data(
            "*_markdup_recalibrated_Haplotyper.vcf.gz", project['id']
        )
        for vcf_file in vcf_files:
            file_id = vcf_file["describe"]["id"]
            file_name = vcf_file["describe"]["name"]
            sample_name = re.sub("|".join(
                sub_remove), lambda x: sub_remove[x.group(0)],
                file_name.split('-TwistWE')[0]
            )

            all_sample_vcfs.append(
                {
                    "sample": sample_name,
                    "project": project["describe"]["id"],
                    "file_id": file_id
                }
            )

    print(
        f"\nFound {len(all_sample_vcfs)} VCF files in GRCh38 projects"
    )

    return all_sample_vcfs
I have added extra detail but for consistency I will leave the type hints out
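A small sketch of how the suffix substitution in the outdated hunk above behaves. Python regex alternation tries alternatives left to right, so joining the dictionary keys in insertion order gives `_Wdh|_Wdh2|_Whd3`, and `_Wdh` wins before `_Wdh2` is tried. The file name below is hypothetical, and `clean_longest_first` is an assumed variant for comparison, not something taken from the PR; whether real Medicover file names can hit this case is not confirmed by the thread.

```python
import re

# Suffix keys as quoted in the diff above
sub_remove = {"_Wdh": "", "_Wdh2": "", "_Whd3": ""}


def clean_as_written(file_name):
    """Mirror the substitution in the diff: alternation in dict order."""
    pattern = "|".join(sub_remove)
    return re.sub(
        pattern, lambda m: sub_remove[m.group(0)],
        file_name.split("-TwistWE")[0]
    )


def clean_longest_first(file_name):
    """Variant (an assumption, not in the PR): try longer suffixes first."""
    keys = sorted(sub_remove, key=len, reverse=True)
    pattern = "|".join(re.escape(k) for k in keys)
    return re.sub(pattern, "", file_name.split("-TwistWE")[0])


# Hypothetical file name, for illustration only
name = "S1_Wdh2-TwistWE_markdup_recalibrated_Haplotyper.vcf.gz"
print(clean_as_written(name))     # S12 (the trailing "2" is left behind)
print(clean_longest_first(name))  # S1
```

If sample names can carry the `_Wdh2` suffix directly before `-TwistWE`, ordering the alternatives longest first may be worth considering.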
print(
    f"\n{len(b38_projects)} GRCh38 projects found:\n\t{projects_to_print}"
Not needed
'_Wdh': '',
'_Wdh2': '',
'_Whd3': '',
}
Actually not a typo
PS4='\000[$(date)]\011'
export TZ=Europe/London
This works fine as is
DI-435/README.md
Outdated
#### find_qc mode
How the script works in `find_qc` mode:
1. Finds all DNAnexus projects with suffix `--assay`.
2. Finds the related GRCh37 project (and between `--start` and `--end` dates, if provided) for each of these projects and reads in all of the QC status files into one merged dataframe. If multiple QC status files exist in a project, the one created last is used.
3. Finds all raw VCFs in each of the DNAnexus projects.
4. Splits these VCFs into a list of validation (including control) samples and non-validation samples based on naming conventions
5. Removes any samples which are duplicates or any which failed QC at any time based on information within the QC status files.
6. Creates a final list of all VCFs to merge and writes this out to file with`--outfile_prefix`.
5. Removes the first instance of a sample which is duplicated or any which failed QC at any time based on information within the QC status files.
6. Creates a final list of all VCFs to merge and writes this out to file with `--outfile_prefix`.
##### Outputs
- A TSV listing the VCF files for all non-validation samples to merge (named by `{outfile_prefix}_files_to_merge.txt`)
- A CSV of all validation samples found (`{outfile_prefix}_validation_samples.csv`)
- A CSV of all projects within search but missing QC file in DNAnexus and therefore not included (`{outfile_prefix}_projects_missing_QC.csv`).

#### no_qc mode
How the script works in `no_qc` mode:
1. Finds all DNAnexus projects with suffix `--assay`
2. Finds all raw VCFs in each of the DNAnexus projects.
3. Removes all samples which are duplicates.
4. Creates a final list of all VCFs to merge and writes this to file with `--outfile_prefix`.
##### Outputs
- A TSV listing the VCF files to merge (named by `{outfile_prefix}_files_to_merge.txt`)
- A TSV of all duplicate files which have been removed (`{outfile_prefix}_all_dup_rows.txt`)
Done.
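As an illustration of the `no_qc` de-duplication step described in the README hunk above, the sketch below splits records into those kept for merging and those reported as duplicates. It assumes that "removes all samples which are duplicates" means every record whose sample name occurs more than once is dropped. The `split_duplicates` helper and the example records are hypothetical, and the script itself may implement this differently (for example with a dataframe).

```python
from collections import Counter


def split_duplicates(rows):
    """Return (unique_rows, duplicate_rows) keyed on the 'sample' field."""
    counts = Counter(row["sample"] for row in rows)
    unique = [row for row in rows if counts[row["sample"]] == 1]
    dups = [row for row in rows if counts[row["sample"]] > 1]
    return unique, dups


# Hypothetical records in the shape built by find_medicover_vcf_files
rows = [
    {"sample": "S1", "project": "project-A", "file_id": "file-1"},
    {"sample": "S2", "project": "project-A", "file_id": "file-2"},
    {"sample": "S1", "project": "project-B", "file_id": "file-3"},
]
to_merge, all_dup_rows = split_duplicates(rows)
print([r["file_id"] for r in to_merge])      # ['file-2']
print([r["file_id"] for r in all_dup_rows])  # ['file-1', 'file-3']
```

The two lists correspond to the `{outfile_prefix}_files_to_merge.txt` and `{outfile_prefix}_all_dup_rows.txt` outputs named in the README.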
DI-435/README.md
Outdated
```
bash merge_VCF_AF.sh \
  CEN38_files_to_merge.txt \
  GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
```

Output:
Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
No newline at end of file
Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
Done.
Example command to find Medicover VCFs with no QC status files:
```bash
python3 find_vcfs_to_merge.py \
  --assay "*TWE38M" \
  --outfile_prefix TWE38M \
  --run_mode no_qc
```
Already clarified that the inputs are only allowed in find_qc mode
- `-a --assay`: The GRCh38 project search term for in DNAnexus, e.g. `"*CEN38"`.
- `-o --outfile_prefix`: Prefix to use to name the output files, e.g. `CEN38`.
- `-r --run_mode`: A choice of whether to find and use QC status files (`find_qc`) or not (`no_qc`).

In `find_qc` mode the following inputs are allowed:
- `-s --start (optional)`: A date used to find DNAnexus (GRCh37) projects created after, see searching dates section below.
- `-e --end (optional)`: A date used to find DNAnexus (GRCh37) projects created before, see searching dates section below.
It's clear already
### Bash script to merge VCFs
The bash script is run in a DNAnexus cloud workstation and requires positional inputs of:
- The output file generated from the Python script above
- The job ID for the cloud workstation running
- The reference genome for GRCh38

Example bash script command:
`bash merge_VCF_AF.sh CEN38_vcf_to_merge.txt job-Gjb39Z04bxf82XZ12gPJ2bbV GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta`
```bash
bash merge_VCF_AF.sh \
  CEN38_files_to_merge.txt \
  GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
```

Output:
Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
No newline at end of file
Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
Not required
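The updated README wording names the merged VCF after the input list rather than the job ID. A Python illustration of that naming rule is sketched below; `merged_output_names` is a hypothetical helper, and the actual `merge_VCF_AF.sh` is a bash script whose implementation is not shown in this thread.

```python
from pathlib import Path


def merged_output_names(files_to_merge):
    """Derive merged VCF and index names from the input list's file name."""
    stem = Path(files_to_merge).stem  # drops the final ".txt"
    vcf = f"final_merged_{stem}.vcf.gz"
    return vcf, f"{vcf}.tbi"


print(merged_output_names("CEN38_files_to_merge.txt"))
```

For the README's example input this yields `final_merged_CEN38_files_to_merge.vcf.gz` and its `.tbi` index, matching the names quoted in the updated Output section.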
Looks good, dismissed some coderabbit comments as not needed.
Reviewed 3 of 3 files at r2, all commit messages.
Dismissed @coderabbitai[bot] from a discussion.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @rklocke)
export TZ=Europe/London

# set frequency of instance usage in logs to 10 seconds
sudo kill $(ps aux | grep pcp-dstat | head -n1 | awk '{print $2}')
yeah that's fine then
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @rklocke)
Dismissed @coderabbitai[bot] from a discussion.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @rklocke)
Dismissed @coderabbitai[bot] from 14 discussions.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @rklocke)
Summary by CodeRabbit
New Features
- `--run_mode` for the VCF merging script, allowing users to choose between `find_qc` and `no_qc` modes.
Documentation
Bug Fixes
Chores