Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DI-920 Changes for Medicover internal AF VCF #19

Merged
merged 8 commits into from
Nov 29, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
83 changes: 62 additions & 21 deletions DI-435/README.md
Original file line number Diff line number Diff line change
@@ -1,43 +1,84 @@
## Find and merge VCFs for creation of a 38 POP AF VCF

### Description
This folder holds code to create an internal GRCh38 POP AF VCF based for [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435).
This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:
- `find_qc` searches for GRCh38 projects and finds the related GRCh37 project to retrieve the QC status file (used to remove failed samples) and keep only the first instance of any duplicated samples.
- `no_qc` does not search for QC status files (keeps all files) but removes any duplicated samples completely.
Comment on lines +4 to +6
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Magnifique! The description is perfectly seasoned, but let's garnish it properly!

The explanation of the two modes is chef's kiss perfect, but just like a properly plated dish needs proper spacing, our markdown needs proper formatting.

Add a blank line before the list for proper markdown presentation:

 This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:
+
 - `find_qc` searches for GRCh38 projects...
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:
- `find_qc` searches for GRCh38 projects and finds the related GRCh37 project to retrieve the QC status file (used to remove failed samples) and keep only the first instance of any duplicated samples.
- `no_qc` does not search for QC status files (keeps all files) but removes any duplicated samples completely.
This folder holds code to create an internal GRCh38 POP AF VCF [EBH-435](https://cuhbioinformatics.atlassian.net/browse/DI-435). There are 2 modes implemented:
- `find_qc` searches for GRCh38 projects and finds the related GRCh37 project to retrieve the QC status file (used to remove failed samples) and keep only the first instance of any duplicated samples.
- `no_qc` does not search for QC status files (keeps all files) but removes any duplicated samples completely.
🧰 Tools
🪛 Markdownlint (0.35.0)

5-5: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)


### Python script to find VCFs
The Python script `find_vcfs_to_merge.py` takes inputs:
- `-a --assay`: The project prefix to look for in DNAnexus, e.g. `"*CEN38"`.
- `-o --outfile_prefix`: What to name the output TSV file which lists VCF files to merge.
- `-s --start (optional)`: A date used to find DNAnexus (GRCh37) projects created after, see searching dates section below.
- `-e --end (optional)`: A date used to find DNAnexus (GRCh37) projects created before, see searching dates section below.
#### Usage
Example command to find all CEN GRCh38 VCFs for all runs:
```bash
python3 find_vcfs_to_merge.py \
--assay "*CEN38" \
--outfile_prefix CEN38 \
--run_mode find_qc
```
Example command to find TWE GRCh38 VCFs where the GRCh37 runs were between two dates:
```bash
python3 find_vcfs_to_merge.py \
--assay "*TWE38" \
--outfile_prefix TWE38 \
--run_mode find_qc \
--start 2024-03-15 \
--end 2024-08-04
```

**Searching Dates**
These dates restrict the projects collated to only GRCh38 projects
which have corresponding GRCh37 projects which were created within the specified dates.
Example command to find Medicover VCFs with no QC status files:
```bash
python3 find_vcfs_to_merge.py \
--assay "*TWE38M" \
--outfile_prefix TWE38M \
--run_mode no_qc
```
Comment on lines +27 to +33
Copy link

@coderabbitai coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Consider clarifying date parameter usage in no_qc mode.

While the Medicover example effectively demonstrates the no_qc mode, it would be helpful to explicitly state whether date parameters are supported or ignored in this mode.

🧰 Tools
🪛 Markdownlint (0.35.0)

28-28: null
Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already clarified that the inputs are only allowed in find_qc mode

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rklocke, understood. Thanks for clarifying.


✏️ Learnings added
Learnt from: rklocke
PR: eastgenomics/RD_requests#19
File: DI-435/README.md:27-33
Timestamp: 2024-11-29T09:56:48.931Z
Learning: In the `find_vcfs_to_merge.py` script within `DI-435/`, the date parameters `--start` and `--end` are only allowed in the `find_qc` mode.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


Example Python script command:
`python3 find_vcfs_to_merge.py --assay "*CEN38" --end 2024-05-03 --outfile_prefix CEN38`
#### Inputs

Output:
- A TSV listing the VCF files for all non-validation samples to merge (named by `{outfile_prefix}_files_to_merge.txt`)
- A CSV of all validation samples found (`{outfile_prefix}_validation_samples.csv`)
- A CSV of all projects within search but missing QC file in DNAnexus and therefore not included (`{outfile_prefix}_projects_missing_QC.csv`).
- `-a --assay`: The GRCh38 project search term for in DNAnexus, e.g. `"*CEN38"`.
- `-o --outfile_prefix`: Prefix to use to name the output files, e.g. `CEN38`.
- `-r --run_mode`: A choice of whether to find and use QC status files (`find_qc`) or not (`no_qc`).

In `find_qc` mode the following inputs are allowed:
- `-s --start (optional)`: A date used to find DNAnexus (GRCh37) projects created after, see searching dates section below.
- `-e --end (optional)`: A date used to find DNAnexus (GRCh37) projects created before, see searching dates section below.
Comment on lines +37 to +43
Copy link

@coderabbitai coderabbitai bot Nov 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Consider marking required parameters.

While the documentation clearly explains each parameter, it would be beneficial to explicitly mark which parameters are required for each mode. For example:

  • -a --assay: [REQUIRED] The GRCh38 project search term...
  • -o --outfile_prefix: [REQUIRED] Prefix to use...
  • -r --run_mode: [REQUIRED] A choice of whether...
🧰 Tools
🪛 LanguageTool

[uncategorized] ~37-~37: Loose punctuation mark.
Context: ...e no_qc ``` #### Inputs - -a --assay: The GRCh38 project search term for in D...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~38-~38: Loose punctuation mark.
Context: ...e.g. "*CEN38". - -o --outfile_prefix: Prefix to use to name the output files,...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~39-~39: Loose punctuation mark.
Context: ...t files, e.g. CEN38. - -r --run_mode: A choice of whether to find and use QC ...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~43-~43: Loose punctuation mark.
Context: ...section below. - -e --end (optional): A date used to find DNAnexus (GRCh37) p...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 Markdownlint (0.35.0)

42-42: Expected: 0; Actual: 2
Unordered list indentation

(MD007, ul-indent)


43-43: Expected: 0; Actual: 2
Unordered list indentation

(MD007, ul-indent)


42-42: null
Lists should be surrounded by blank lines

(MD032, blanks-around-lists)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's clear already

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rklocke, understood.

(_/)
( ._.)
('')('')


How the script works:
**Searching Dates**
These dates restrict the projects collated to only GRCh38 projects which have corresponding GRCh37 projects which were created within the specified dates.

#### find_qc mode
How the script works in `find_qc` mode:
1. Finds all DNAnexus projects with suffix `--assay`.
2. Finds the related GRCh37 project (and between `--start` and `--end` dates, if provided) for each of these projects and reads in all of the QC status files into one merged dataframe. If multiple QC status files exist in a project, the one created last is used.
3. Finds all raw VCFs in each of the DNAnexus projects.
4. Splits these VCFs into a list of validation (including control) samples and non-validation samples based on naming conventions
5. Removes any samples which are duplicates or any which failed QC at any time based on information within the QC status files.
6. Creates a final list of all VCFs to merge and writes this out to file with`--outfile_prefix`.
5. Removes the first instance of a sample which is duplicated or any which failed QC at any time based on information within the QC status files.
6. Creates a final list of all VCFs to merge and writes this out to file with `--outfile_prefix`.
##### Outputs
- A TSV listing the VCF files for all non-validation samples to merge (named by `{outfile_prefix}_files_to_merge.txt`)
- A CSV of all validation samples found (`{outfile_prefix}_validation_samples.csv`)
- A CSV of all projects within search but missing QC file in DNAnexus and therefore not included (`{outfile_prefix}_projects_missing_QC.csv`).

#### no_qc mode
How the script works in `no_qc` mode:
1. Finds all DNAnexus projects with suffix `--assay`
2. Finds all raw VCFs in each of the DNAnexus projects.
3. Removes all samples which are duplicates.
4. Creates a final list of all VCFs to merge and writes this to file with `--outfile_prefix`.
##### Outputs
- A TSV listing the VCF files to merge (named by `{outfile_prefix}_files_to_merge.txt`)
- A TSV of all duplicate files which have been removed (`{outfile_prefix}_all_dup_rows.txt`)

### Bash script to merge VCFs
The bash script is run in a DNAnexus cloud workstation and requires positional inputs of:
- The output file generated from the Python script above
- The job ID for the cloud workstation running
- The reference genome for GRCh38

Example bash script command:
`bash merge_VCF_AF.sh CEN38_vcf_to_merge.txt job-Gjb39Z04bxf82XZ12gPJ2bbV GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta`
```bash
bash merge_VCF_AF.sh \
CEN38_files_to_merge.txt \
GRCh38_GIABv3_no_alt_analysis_set_maskedGRC_decoys_MAP2K3_KMT2C_KCNJ18_noChr.fasta
```

Output:
Merged VCF file and index named from the job ID given, i.e. `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz` and `final_merged_job-Gjb39Z04bxf82XZ12gPJ2bbV.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
Merged VCF file and index named from the input files_to_merge txt file, i.e. `final_merged_CEN38_files_to_merge.vcf.gz` and `final_merged_CEN38_files_to_merge.vcf.gz.tbi` which are both uploaded to the DNAnexus project the cloud workstation is running within.
Loading