
Add check for duplicate records to MainVcfQc #705

Open · wants to merge 31 commits into main
Conversation

@kjaisingh (Collaborator) commented Aug 9, 2024

This PR addresses Issue #701.

Description

  • Includes script identify_duplicates.py, which takes a VCF file as input and produces two separate TSVs: (1) duplicate counts aggregated by category, and (2) duplicate records tagged with their category.
  • Includes script merge_duplicates.py, which merges the per-VCF count and record TSVs produced by identify_duplicates.py into single aggregated TSVs.
  • Outputs two new files, agg_duplicate_records.tsv and agg_duplicate_counts.tsv, containing the aggregated results.
  • Invokes both scripts through two separate tasks in MainVcfQc.wdl:
    • IdentifyDuplicates is a distributed task that runs in parallel across all VCFs and executes once SubsetVcfBySamplesList completes. It currently accepts an optional parameter, identify_duplicates_custom: a path to a script to run in place of identify_duplicates.py.
    • MergeDuplicates executes once all IdentifyDuplicates tasks complete and merges their per-VCF outputs into the aggregated TSVs. It currently accepts an optional parameter, merge_duplicates_custom: a path to a script to run in place of merge_duplicates.py.
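As a rough, hypothetical sketch of the identify_duplicates.py interface described above (the PR's actual categorization logic is more involved; here duplicates are detected only by exact CHROM/POS match):

```python
# Hypothetical sketch, NOT the PR's implementation: groups VCF records by
# (CHROM, POS) and reports exact-position duplicates under one category.
import argparse
from collections import defaultdict


def identify_duplicates(vcf_path, fout_prefix):
    # Collect record IDs per (chromosome, position) site.
    by_site = defaultdict(list)
    with open(vcf_path) as f:
        for line in f:
            if line.startswith("#"):
                continue  # skip header lines
            fields = line.rstrip("\n").split("\t")
            chrom, pos, record_id = fields[0], fields[1], fields[2]
            by_site[(chrom, pos)].append(record_id)

    counts = defaultdict(int)
    # One TSV of tagged duplicate records...
    with open(f"{fout_prefix}_duplicate_records.tsv", "w") as rec_out:
        for (chrom, pos), ids in by_site.items():
            if len(ids) > 1:
                counts["exact_position"] += 1
                rec_out.write(f"exact_position\t{','.join(ids)}\n")
    # ...and one TSV of counts aggregated by category.
    with open(f"{fout_prefix}_duplicate_counts.tsv", "w") as cnt_out:
        for category in sorted(counts):
            cnt_out.write(f"{category}\t{counts[category]}\n")
    return dict(counts)


def main():
    # Mirrors the --vcf/--fout CLI shown in the Testing section below.
    parser = argparse.ArgumentParser()
    parser.add_argument("--vcf", required=True)
    parser.add_argument("--fout", required=True)
    args = parser.parse_args()
    identify_duplicates(args.vcf, args.fout)
```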

Testing

  • This Terra job shows an example run of the pipeline for the 1KGP cohort with this change.
  • The TSV files 1kgp_2batch_test_cohort.agg_duplicate_counts.tsv and 1kgp_2batch_test_cohort.agg_duplicate_records.tsv in the linked GCS folder contain example outputs from the 1KGP cohort.
  • The Python scripts themselves can also be tested locally through the following CLI commands, where ex1.vcf and ex2.vcf are sample VCF files:
    • python src/sv-pipeline/scripts/identify_duplicates.py --vcf ex1.vcf --fout ex1
    • python src/sv-pipeline/scripts/identify_duplicates.py --vcf ex2.vcf --fout ex2
    • python src/sv-pipeline/scripts/merge_duplicates.py \
      --records ex1_duplicate_records.tsv ex2_duplicate_records.tsv \
      --counts ex1_duplicate_counts.tsv ex2_duplicate_counts.tsv \
      --fout agg
      
  • Validated all WDLs with womtool.
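The merge step exercised by the CLI commands above can be sketched roughly as follows (a hypothetical simplification, assuming the count TSVs are two-column category/count files and the record TSVs are simply concatenated; the real script may differ):

```python
# Hypothetical sketch, NOT the PR's implementation: sums per-VCF count TSVs
# and concatenates per-VCF record TSVs into single aggregated outputs, so
# that --fout agg yields agg_duplicate_counts.tsv / agg_duplicate_records.tsv.
from collections import defaultdict


def merge_duplicates(count_paths, record_paths, fout_prefix):
    # Sum counts per category across all input count TSVs.
    totals = defaultdict(int)
    for path in count_paths:
        with open(path) as f:
            for line in f:
                category, n = line.rstrip("\n").split("\t")
                totals[category] += int(n)
    with open(f"{fout_prefix}_duplicate_counts.tsv", "w") as out:
        for category in sorted(totals):
            out.write(f"{category}\t{totals[category]}\n")
    # Concatenate the tagged record TSVs in input order.
    with open(f"{fout_prefix}_duplicate_records.tsv", "w") as out:
        for path in record_paths:
            with open(path) as f:
                out.writelines(f)
    return dict(totals)
```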

Pre-Merge Changes Required

  • Remove automated Dockstore image sync for development branch.
  • Remove custom script parameters for IdentifyDuplicates and MergeDuplicates.

@kjaisingh linked an issue Aug 9, 2024 that may be closed by this pull request
@kjaisingh marked this pull request as ready for review August 9, 2024 14:29
@kjaisingh self-assigned this Aug 12, 2024
@kjaisingh added the enhancement (New feature or request) label Sep 24, 2024
@mwalker174 (Collaborator) left a comment:

This looks great, I have just a few minor suggestions.

@@ -150,6 +150,7 @@ workflows:
filters:
branches:
- main
- kj/701_vcf_qc_duplicates
Collaborator:
Suggested change
- kj/701_vcf_qc_duplicates

Collaborator Author:
Thanks for the pointer - I've added a to-do in the PR description accordingly; I was planning to remove this once the other changes are good to go. If you'd prefer it removed before final approval, happy to do that as well - let me know.

Comment on lines 20 to 21
File? identify_duplicates_custom
File? merge_duplicates_custom
Collaborator:
Suggested change
File? identify_duplicates_custom
File? merge_duplicates_custom
File? identify_duplicates_script
File? merge_duplicates_script

Collaborator Author:
Updated accordingly - removed the custom script parameters.

# Size comparison
if rec1[1] == rec2[1]:
counts['INS 100%'] += 1
f_records.write(f"INS 100%\t{rec1[0]},{rec2[0]}\n")
Collaborator:
Let's get rid of whitespace and non-alphanumeric characters for more robust parsing:

ins_size_similarity_100
ins_size_similarity_50
ins_size_similarity_0
ins_alt_identical
ins_alt_same_subtype
ins_alt_different_subtype

Collaborator Author:
Updated accordingly.
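For reference, the PR's actual category names above were chosen by hand, but labels like the originals can be reduced to parsing-safe keys with a small sanitizer (a hypothetical sketch, not code from the PR):

```python
# Hypothetical helper: normalize a human-readable category label to a key
# containing only lowercase alphanumerics and underscores, per the review
# suggestion that keys avoid whitespace and non-alphanumeric characters.
import re


def normalize_category(label):
    # Lowercase, drop '%', then collapse any run of other non-alphanumeric
    # characters (spaces, punctuation) into a single underscore.
    label = label.lower().replace("%", "")
    return re.sub(r"[^a-z0-9]+", "_", label).strip("_")
```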

Comment on lines 72 to 74
exact_matches = defaultdict(list)
for key, record_id in exact_buffer:
exact_matches[key].append(record_id)
Collaborator:
Could use groupby() here
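The groupby() alternative would require sorting the buffer by key first, since itertools.groupby only groups consecutive items. A minimal sketch with hypothetical data:

```python
# Sketch of the reviewer's groupby() suggestion, using made-up buffer entries.
from itertools import groupby
from operator import itemgetter

exact_buffer = [("A", "id1"), ("B", "id3"), ("A", "id2")]

# groupby only merges adjacent items, so sort by key before grouping.
exact_matches = {
    key: [record_id for _, record_id in group]
    for key, group in groupby(sorted(exact_buffer, key=itemgetter(0)),
                              key=itemgetter(0))
}
# exact_matches == {"A": ["id1", "id2"], "B": ["id3"]}
```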

Collaborator Author:
Updated accordingly - thanks for the tip.

Comment on lines 900 to 901
# File default_script = "/src/sv-pipeline/scripts/merge_duplicates.py"
# File active_script = select_first([custom_script, default_script])
Collaborator:
Should re-enable the docker script

Collaborator Author:
Updated accordingly - removed the custom script.

Comment on lines 956 to 957
# File default_script = "/src/sv-pipeline/scripts/merge_duplicates.py"
# File active_script = select_first([custom_script, default_script])
Collaborator:
And here

Collaborator Author:
Updated accordingly - removed the custom script.


RuntimeAttr runtime_default = object {
mem_gb: 3.75,
disk_gb: 2 + ceil(size(vcf, "GiB")),
Collaborator:
Suggested change
disk_gb: 2 + ceil(size(vcf, "GiB")),
disk_gb: 10 + ceil(size(vcf, "GiB")),

2 GB is pretty small. VM disk speed scales with disk size, so we usually set it to at least 10 GB.

Collaborator Author:
Updated accordingly - thanks for the tip.

kjaisingh and others added 2 commits September 25, 2024 13:56
Co-authored-by: Mark Walker <markw@broadinstitute.org>
Labels: enhancement (New feature or request)
Successfully merging this pull request may close these issues.

Add check for duplicate records to MainVcfQc
2 participants