2.0.0-beta.3 #349

Merged
merged 22 commits into from
Aug 9, 2024

@nebfield nebfield commented Aug 2, 2024

Changelog

Important fix: Fix splitting duplicated variant IDs across multiple scoring files

Background

  • The MATCH_COMBINE step writes new scoring files for input to plink2 --score
  • When plink2 encounters the same variant ID on multiple rows of a scoring file, it ignores the duplicates and warns about them
  • This only happens when the same variant ID has different effect alleles on different rows
    • A variant ID with the same effect allele and weights in multiple score columns is fine: the scores are simply calculated in parallel
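The splitting rule described above can be sketched in Python as follows. This is an illustrative sketch only, not the actual MATCH_COMBINE implementation: the function name `split_scoring_rows` and the `(variant_id, effect_allele, score_name, weight)` row layout are assumptions made for the example.

```python
from collections import defaultdict


def split_scoring_rows(rows):
    """Assign rows to scoring files so no file repeats a variant ID.

    rows: iterable of (variant_id, effect_allele, score_name, weight).
    This layout is hypothetical, chosen only for illustration.
    Returns a list of dicts, one per scoring file, each mapping
    variant_id -> (effect_allele, {score_name: weight}).
    """
    # Rows that share a variant ID and effect allele become one row
    # with several score columns (calculated in parallel).
    merged = defaultdict(dict)
    for variant_id, effect_allele, score_name, weight in rows:
        merged[(variant_id, effect_allele)][score_name] = weight

    # The n-th distinct effect allele of a variant ID goes to the n-th file.
    alleles_seen = defaultdict(list)
    files = []
    for (variant_id, effect_allele), weights in merged.items():
        index = len(alleles_seen[variant_id])
        alleles_seen[variant_id].append(effect_allele)
        if index == len(files):
            files.append({})
        files[index][variant_id] = (effect_allele, weights)
    return files
```

With this layout, a variant ID that carries two different effect alleles produces two scoring files, while a variant with one effect allele shared by several scores stays on a single row of the first file.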

Example

When PGS000039, PGS000040, and PGS000041 are calculated in parallel, some variants have different effect alleles at the same coordinates, for example:

22:40682469:T:C with effect allele T (PGS000041_hmPOS_GRCh38)
22:40682469:T:C with effect allele C (PGS000039_hmPOS_GRCh38)
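Conflicts like this one can be found by grouping effect alleles by variant ID. The snippet below is a minimal sketch; the helper name `conflicting_variant_ids` and the `(variant_id, effect_allele)` pair layout are assumptions for illustration and are not part of the pipeline.

```python
from collections import defaultdict


def conflicting_variant_ids(rows):
    """Return variant IDs that appear with more than one effect allele.

    rows: iterable of (variant_id, effect_allele) pairs (hypothetical layout).
    """
    alleles = defaultdict(set)
    for variant_id, effect_allele in rows:
        alleles[variant_id].add(effect_allele)
    return {vid: sorted(found) for vid, found in alleles.items() if len(found) > 1}


# The two rows from the example above
rows = [
    ("22:40682469:T:C", "T"),  # PGS000041_hmPOS_GRCh38
    ("22:40682469:T:C", "C"),  # PGS000039_hmPOS_GRCh38
]
print(conflicting_variant_ids(rows))  # {'22:40682469:T:C': ['C', 'T']}
```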

Impact

In versions v2.0.0-beta, beta.1, and beta.2, the duplicated variant is written to the same scoring file and ignored by plink2, so it doesn't contribute to the final calculated PGS.

In all v2.0.0-alpha versions and beta.3 a second scoring file is correctly written containing the other allele (additional alleles create extra scoring files automatically within the updated MATCH_COMBINE process). We have also updated the software tests to ensure this error doesn't occur in future releases.

This problem is more likely to happen when larger scores are calculated in parallel. The more scores are calculated together, the more likely it is that a variant ID appears with different effect alleles and is ignored during the score calculation stage.

While the overall impact on the final score is likely to be small, we encourage users to upgrade to beta.3, especially if they calculate larger scores in parallel.

How do I know if my data are affected?

$ cd work/71/35fa3c977993b71d5a85fb6721e8c3 # cd to a scoring process directory 
$ comm -3 <(sort hgdp_22_additive_0.sscore.vars) <(zcat hgdp_22_additive_0.scorefile.gz | tail -n +2 | cut -f 1 | sort)
	22:40682469:T:C

The output lists one variant that is present in the scoring file but missing from the scored variants. This check is now included in the scoring module.
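The same comparison can be sketched in Python. This is only an illustration of the check, not the code used in the scoring module: the function name `missing_scored_variants` is hypothetical, and the sketch assumes, as the shell command does, that the `.sscore.vars` file holds one variant ID per line and that the scoring file is gzipped, tab-separated, with a header row and variant IDs in the first column. Unlike `comm -3`, it reports only variants that are in the scoring file but were not scored.

```python
import gzip


def missing_scored_variants(vars_path, scorefile_path):
    """Return variant IDs in the scoring file that are absent from the scored list."""
    with open(vars_path) as handle:
        scored = {line.strip() for line in handle if line.strip()}
    with gzip.open(scorefile_path, "rt") as handle:
        next(handle, None)  # skip the header row
        expected = {line.split("\t", 1)[0].strip() for line in handle if line.strip()}
    return sorted(expected - scored)
```

An empty list means every variant in the scoring file was used; any returned ID was dropped during scoring.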

Other fixes

@nebfield nebfield linked an issue Aug 6, 2024 that may be closed by this pull request
@nebfield nebfield marked this pull request as ready for review August 6, 2024 14:46
@nebfield nebfield requested a review from smlmbrt August 6, 2024 15:05

smlmbrt commented Aug 6, 2024

Going to run the new release on UKB overnight and test out some things.

@nebfield nebfield linked an issue Aug 7, 2024 that may be closed by this pull request
@smlmbrt smlmbrt left a comment

The new version now runs correctly on a copy of UKB (Cambridge cluster) and on a local dataset (single-sample) with the fraposa_update. The log is also correct for the scores.


smlmbrt commented Aug 8, 2024

Waiting for bioconda/bioconda-recipes#49916 before merging.

@nebfield nebfield merged commit 96fbb23 into main Aug 9, 2024
121 checks passed
@nebfield nebfield deleted the match-fix branch August 9, 2024 06:33