
Add check for duplicate records to MainVcfQc #705

Open · wants to merge 31 commits into main
Conversation

@kjaisingh (Collaborator) commented Aug 9, 2024

This PR addresses Issue #701.

Description

  • Includes script identify_duplicates.py, which takes a VCF file as input and produces two separate TSVs: (1) duplicate counts aggregated by category, and (2) duplicate records tagged with their category.
  • Includes script merge_duplicates.py, which merges the per-VCF count and record TSVs produced by identify_duplicates.py into single aggregated TSVs.
  • Outputs two new files, agg_duplicate_records.tsv and agg_duplicate_counts.tsv, containing the aggregated results.
  • Invokes both scripts through two separate tasks in MainVcfQc.wdl:
    • IdentifyDuplicates is a distributed task that runs in parallel across all VCFs and executes once SubsetVcfBySamplesList completes. It currently accepts an optional parameter, identify_duplicates_custom: a path to a script to run in place of identify_duplicates.py.
    • MergeDuplicates executes once all IdentifyDuplicates tasks complete and merges their per-VCF outputs into the aggregated TSVs. It currently accepts an optional parameter, merge_duplicates_custom: a path to a script to run in place of merge_duplicates.py.
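As a rough, hypothetical sketch of the identify_duplicates.py interface described above (the PR's actual categorization logic is more involved; here duplicates are detected only by exact CHROM/POS match):

```python
# Hypothetical sketch, NOT the PR's implementation: groups VCF records by
# (CHROM, POS) and reports exact-position duplicates under one category.
import argparse
from collections import defaultdict


def identify_duplicates(vcf_path, fout_prefix):
    # Collect record IDs per (chromosome, position) site.
    by_site = defaultdict(list)
    with open(vcf_path) as f:
        for line in f:
            if line.startswith("#"):
                continue  # skip header lines
            fields = line.rstrip("\n").split("\t")
            chrom, pos, record_id = fields[0], fields[1], fields[2]
            by_site[(chrom, pos)].append(record_id)

    counts = defaultdict(int)
    # One TSV of tagged duplicate records...
    with open(f"{fout_prefix}_duplicate_records.tsv", "w") as rec_out:
        for (chrom, pos), ids in by_site.items():
            if len(ids) > 1:
                counts["exact_position"] += 1
                rec_out.write(f"exact_position\t{','.join(ids)}\n")
    # ...and one TSV of counts aggregated by category.
    with open(f"{fout_prefix}_duplicate_counts.tsv", "w") as cnt_out:
        for category in sorted(counts):
            cnt_out.write(f"{category}\t{counts[category]}\n")
    return dict(counts)


def main():
    # Mirrors the --vcf/--fout CLI shown in the Testing section below.
    parser = argparse.ArgumentParser()
    parser.add_argument("--vcf", required=True)
    parser.add_argument("--fout", required=True)
    args = parser.parse_args()
    identify_duplicates(args.vcf, args.fout)
```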

Testing

  • This Terra job shows an example run of the pipeline for the 1KGP cohort with this change.
  • The TSV files 1kgp_2batch_test_cohort.agg_duplicate_counts.tsv and 1kgp_2batch_test_cohort.agg_duplicate_records.tsv in the linked GCS folder contain example outputs from the 1KGP cohort.
  • The Python scripts themselves can also be tested locally through the following CLI commands, where ex1.vcf and ex2.vcf are sample VCF files:
    • python src/sv-pipeline/scripts/identify_duplicates.py --vcf ex1.vcf --fout ex1
    • python src/sv-pipeline/scripts/identify_duplicates.py --vcf ex2.vcf --fout ex2
    • python src/sv-pipeline/scripts/merge_duplicates.py \
      --records ex1_duplicate_records.tsv ex2_duplicate_records.tsv \
      --counts ex1_duplicate_counts.tsv ex2_duplicate_counts.tsv \
      --fout agg
      
  • Validated all WDLs with womtool.
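The merge step exercised by the CLI commands above can be sketched roughly as follows (a hypothetical simplification, assuming the count TSVs are two-column category/count files and the record TSVs are simply concatenated; the real script may differ):

```python
# Hypothetical sketch, NOT the PR's implementation: sums per-VCF count TSVs
# and concatenates per-VCF record TSVs into single aggregated outputs, so
# that --fout agg yields agg_duplicate_counts.tsv / agg_duplicate_records.tsv.
from collections import defaultdict


def merge_duplicates(count_paths, record_paths, fout_prefix):
    # Sum counts per category across all input count TSVs.
    totals = defaultdict(int)
    for path in count_paths:
        with open(path) as f:
            for line in f:
                category, n = line.rstrip("\n").split("\t")
                totals[category] += int(n)
    with open(f"{fout_prefix}_duplicate_counts.tsv", "w") as out:
        for category in sorted(totals):
            out.write(f"{category}\t{totals[category]}\n")
    # Concatenate the tagged record TSVs in input order.
    with open(f"{fout_prefix}_duplicate_records.tsv", "w") as out:
        for path in record_paths:
            with open(path) as f:
                out.writelines(f)
    return dict(totals)
```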

Pre-Merge Changes Required

  • Remove automated Dockstore image sync for development branch.
  • Remove custom script parameters for IdentifyDuplicates and MergeDuplicates.

@kjaisingh linked an issue Aug 9, 2024 that may be closed by this pull request
@kjaisingh marked this pull request as ready for review August 9, 2024 14:29
@kjaisingh self-assigned this Aug 12, 2024
@kjaisingh added the enhancement (New feature or request) label Sep 24, 2024
@mwalker174 (Collaborator) left a comment:

This looks great, I have just a few minor suggestions.

@@ -150,6 +150,7 @@ workflows:
filters:
branches:
- main
- kj/701_vcf_qc_duplicates
Collaborator:
Suggested change
- kj/701_vcf_qc_duplicates

Collaborator Author:
Thanks for the pointer - I've added a to-do in the PR description accordingly; I was planning to remove this once the other changes are good to go. If you'd prefer it removed before final approval, happy to do that as well - let me know.

Comment on lines 20 to 21
File? identify_duplicates_custom
File? merge_duplicates_custom
Collaborator:
Suggested change
File? identify_duplicates_custom
File? merge_duplicates_custom
File? identify_duplicates_script
File? merge_duplicates_script

Collaborator Author:
Updated accordingly - removed the custom script parameters.

# Size comparison
if rec1[1] == rec2[1]:
counts['INS 100%'] += 1
f_records.write(f"INS 100%\t{rec1[0]},{rec2[0]}\n")
Collaborator:
Let's get rid of whitespace and non-alphanumeric characters for more robust parsing:

ins_size_similarity_100
ins_size_similarity_50
ins_size_similarity_0
ins_alt_identical
ins_alt_same_subtype
ins_alt_different_subtype

Collaborator Author:
Updated accordingly.
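For reference, the PR's actual category names above were chosen by hand, but labels like the originals can be reduced to parsing-safe keys with a small sanitizer (a hypothetical sketch, not code from the PR):

```python
# Hypothetical helper: normalize a human-readable category label to a key
# containing only lowercase alphanumerics and underscores, per the review
# suggestion that keys avoid whitespace and non-alphanumeric characters.
import re


def normalize_category(label):
    # Lowercase, drop '%', then collapse any run of other non-alphanumeric
    # characters (spaces, punctuation) into a single underscore.
    label = label.lower().replace("%", "")
    return re.sub(r"[^a-z0-9]+", "_", label).strip("_")
```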

Comment on lines 72 to 74
exact_matches = defaultdict(list)
for key, record_id in exact_buffer:
exact_matches[key].append(record_id)
Collaborator:
Could use groupby() here
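The groupby() alternative would require sorting the buffer by key first, since itertools.groupby only groups consecutive items. A minimal sketch with hypothetical data:

```python
# Sketch of the reviewer's groupby() suggestion, using made-up buffer entries.
from itertools import groupby
from operator import itemgetter

exact_buffer = [("A", "id1"), ("B", "id3"), ("A", "id2")]

# groupby only merges adjacent items, so sort by key before grouping.
exact_matches = {
    key: [record_id for _, record_id in group]
    for key, group in groupby(sorted(exact_buffer, key=itemgetter(0)),
                              key=itemgetter(0))
}
# exact_matches == {"A": ["id1", "id2"], "B": ["id3"]}
```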

Collaborator Author:
Updated accordingly - thanks for the tip.

Comment on lines 900 to 901
# File default_script = "/src/sv-pipeline/scripts/merge_duplicates.py"
# File active_script = select_first([custom_script, default_script])
Collaborator:
Should re-enable the docker script

Collaborator Author:
Updated accordingly - removed the custom script.

Comment on lines 956 to 957
# File default_script = "/src/sv-pipeline/scripts/merge_duplicates.py"
# File active_script = select_first([custom_script, default_script])
Collaborator:
And here

Collaborator Author:
Updated accordingly - removed the custom script.


RuntimeAttr runtime_default = object {
mem_gb: 3.75,
disk_gb: 2 + ceil(size(vcf, "GiB")),
Collaborator:
Suggested change
disk_gb: 2 + ceil(size(vcf, "GiB")),
disk_gb: 10 + ceil(size(vcf, "GiB")),

2 GB is pretty small. VM disk speed scales with disk size, so we usually set it to at least 10 GB.

Collaborator Author:
Updated accordingly - thanks for the tip.

kjaisingh and others added 2 commits September 25, 2024 13:56
Co-authored-by: Mark Walker <markw@broadinstitute.org>
Labels: enhancement (New feature or request)
Successfully merging this pull request may close these issues.

Add check for duplicate records to MainVcfQc
2 participants