Add check for duplicate records to MainVcfQc #705
This looks great, I have just a few minor suggestions.
.github/.dockstore.yml
Outdated
@@ -150,6 +150,7 @@ workflows:
filters:
branches:
- main
- kj/701_vcf_qc_duplicates
- kj/701_vcf_qc_duplicates
Thanks for the pointer - I've added a to-do in the PR description accordingly, was planning to remove after other changes are good to go. If it's preferred to leave this out before final approval, happy to do that as well - let me know.
wdl/MainVcfQc.wdl
Outdated
File? identify_duplicates_custom
File? merge_duplicates_custom
File? identify_duplicates_custom
File? merge_duplicates_custom
File? identify_duplicates_script
File? merge_duplicates_script
Updated accordingly - removed custom script.
# Size comparison
if rec1[1] == rec2[1]:
    counts['INS 100%'] += 1
    f_records.write(f"INS 100%\t{rec1[0]},{rec2[0]}\n")
Let's get rid of whitespace and non-alphanumeric characters for more robust parsing:
ins_size_similarity_100
ins_size_similarity_50
ins_size_similarity_0
ins_alt_identical
ins_alt_same_subtype
ins_alt_different_subtype
Updated accordingly.
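To illustrate the parsing concern raised above, here is a minimal sketch. The record IDs and the exact line layout (label, tab, comma-separated IDs) are assumptions made for illustration, loosely modeled on the `f_records.write(...)` call in the diff.

```python
# Hypothetical record lines: a category label, a tab, then comma-separated IDs.
old_line = "INS 100%\tvar_1,var_2\n"
new_line = "ins_size_similarity_100\tvar_1,var_2\n"


def parse_whitespace(line):
    """Split on any run of whitespace, as str.split() does by default."""
    return line.split()


old_fields = parse_whitespace(old_line)
new_fields = parse_whitespace(new_line)

# The space inside "INS 100%" yields an extra field; the snake_case label does not.
print(old_fields)  # ['INS', '100%', 'var_1,var_2']
print(new_fields)  # ['ins_size_similarity_100', 'var_1,var_2']
```

A parser that splits strictly on tabs would handle both forms, but downstream tools that split on general whitespace would only handle the second.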
exact_matches = defaultdict(list)
for key, record_id in exact_buffer:
    exact_matches[key].append(record_id)
Could use groupby() here
Updated accordingly - thanks for the tip.
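For reference, a minimal sketch of what the `groupby()` version of this grouping might look like. The contents of `exact_buffer` and the shape of the key are hypothetical, and this is not necessarily the code that was committed. Note that `itertools.groupby` only merges adjacent items, so the buffer is sorted by key first.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical buffer of (key, record_id) pairs; the key layout is an assumption.
exact_buffer = [
    (("chr1", 1000, "INS"), "rec_a"),
    (("chr1", 2000, "DEL"), "rec_b"),
    (("chr1", 1000, "INS"), "rec_c"),
]


def group_exact_matches(buffer):
    """Group record IDs by key using itertools.groupby."""
    # groupby only merges runs of adjacent equal keys, so sort by key first.
    ordered = sorted(buffer, key=itemgetter(0))
    return {
        key: [record_id for _, record_id in group]
        for key, group in groupby(ordered, key=itemgetter(0))
    }


exact_matches = group_exact_matches(exact_buffer)

# Keys that map to more than one record ID are the exact duplicates.
duplicates = {key: ids for key, ids in exact_matches.items() if len(ids) > 1}
print(duplicates)
```

If the buffer is already ordered by key (for example, because the VCF is sorted and the key starts with the position), the `sorted()` call can be dropped.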
wdl/MainVcfQc.wdl
Outdated
# File default_script = "/src/sv-pipeline/scripts/merge_duplicates.py"
# File active_script = select_first([custom_script, default_script])
Should re-enable the docker script
Updated accordingly - removed the custom script.
wdl/MainVcfQc.wdl
Outdated
# File default_script = "/src/sv-pipeline/scripts/merge_duplicates.py"
# File active_script = select_first([custom_script, default_script])
And here
Updated accordingly - removed the custom script.
wdl/MainVcfQc.wdl
Outdated
RuntimeAttr runtime_default = object {
mem_gb: 3.75,
disk_gb: 2 + ceil(size(vcf, "GiB")),
disk_gb: 2 + ceil(size(vcf, "GiB")),
disk_gb: 10 + ceil(size(vcf, "GiB")),
2 GB is pretty small. VM disk speed scales with disk size, so we usually set this to at least 10.
Updated accordingly - thanks for the tip.
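As a quick sanity check on the arithmetic, here is a rough Python mirror of the `disk_gb` expression. The helper name and the example file size are made up for illustration; the only part taken from the diff is the `base + ceil(size in GiB)` formula.

```python
import math

GIB = 1024 ** 3  # bytes per GiB


def disk_gb(vcf_size_bytes, base_gb=10):
    """Mirror `base + ceil(size(vcf, "GiB"))` from the runtime block."""
    return base_gb + math.ceil(vcf_size_bytes / GIB)


# Hypothetical VCF of roughly 2.3 GiB.
example_size = int(2.3 * GIB)
print(disk_gb(example_size, base_gb=2))  # 5, with the original base of 2
print(disk_gb(example_size))             # 13, with the suggested base of 10
```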
Co-authored-by: Mark Walker <markw@broadinstitute.org>
…stitute/gatk-sv into kj/701_vcf_qc_duplicates
This PR addresses Issue #701.
Description
identify_duplicates.py that takes a VCF file as input and produces separate TSVs for (1) duplicate counts aggregated by category and (2) duplicate records tagged to a category.
merge_duplicates.py that aggregates multiple count and record TSVs produced by identify_duplicates.py into aggregated TSVs.
agg_duplicate_records.tsv and agg_duplicate_counts.tsv that represent the aggregated TSVs.
MainVcfQc.wdl:
identify_duplicates.py.
merge_duplicates.py.
Testing
1kgp_2batch_test_cohort.agg_duplicate_counts.tsv and 1kgp_2batch_test_cohort.agg_duplicate_records.tsv in the linked GCS folder contain example outputs from the 1KGP cohort.
ex1.vcf and ex2.vcf are sample VCF files:
python src/sv-pipeline/scripts/identify_duplicates.py --vcf ex1.vcf --fout ex1
python src/sv-pipeline/scripts/identify_duplicates.py --vcf ex2.vcf --fout ex2
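For readers who want a feel for the aggregation step that merge_duplicates.py is described as performing on the count TSVs, here is a minimal sketch. It is not the script's actual code: the two-column `category<TAB>count` layout, the function name, and the example values are all assumptions for illustration.

```python
import io
from collections import Counter


def merge_counts(handles):
    """Sum per-category counts across several count TSVs.

    Assumes each non-empty line is `category<TAB>count`; the real
    file layout may differ.
    """
    totals = Counter()
    for handle in handles:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            category, count = line.split("\t")
            totals[category] += int(count)
    return totals


# Hypothetical stand-ins for the per-VCF count TSVs (e.g. from ex1 and ex2).
ex1_counts = io.StringIO("ins_alt_identical\t2\nins_size_similarity_100\t1\n")
ex2_counts = io.StringIO("ins_alt_identical\t3\n")

merged = merge_counts([ex1_counts, ex2_counts])
for category in sorted(merged):
    print(f"{category}\t{merged[category]}")
```

The record TSVs could presumably be merged by simple concatenation, since each line already carries its category label.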
Pre-Merge Changes Required