Inconsistent ancestry assignments in v2.0.0-beta vs. alpha #343
Comments
Instead, I tried this in
|
I tried this in:
Perhaps the issue occurred before pgsc_calc v2.0.0-alpha.6 and after pgsc_calc v2.0.0-alpha.3. I will try these versions:
|
Hi @AWS-crafter, take a look at the related discussion here: #333. If you're running the calculator on an individual sample, it may be unstable with the MAF & genotype missingness thresholds. Or are you running it on many samples? |
For these tests, I am using a single sample. Why would MAF or missingness affect the results for a single sample? I would assume they are calculated either based on the available samples (so all sites in the single sample would have 0% missingness), or the reference panel, which also has just about every site in the sample.
|
To find out what is causing this, I would like to change this file: Can I do this by editing the below line in pgsc_calc's conf/modules.config to point to the modified version of pygscatalog, since I am using Singularity?
In other words, is this where pgsc_calc "looks" for pygscatalog? Alternatively, I can change pgsc_calc's modules/local/ancestry/intersect_variants.nf to edit output file(s) to change all variants to PCA_ELIGIBLE, right after the pgscatalog-intersect step. |
For single-sample variants, MAF is either 0 (homozygous genotype) or 0.5 (heterozygous genotype), since a reference panel isn't used. The current behavior is to filter out all homozygous variants (the vast majority) and use only heterozygous variants for PCA. Even when I run multiple samples, this will filter out many variants if too many happen to be homozygous. This could cause issues if all samples are of the same ancestry group but the reference panel is multi-ancestry, or if the sample size is small. Setting the MAF threshold to 0.0 will not fix this, because the comparison is strict: with maf_filter = 0.0, a MAF of 0.0 is not > 0.0. Potential solutions:
|
I will fix the MAF filter to allow >= 0 and make it possible to remove it altogether. |
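For readers following along, the single-sample MAF behavior discussed above can be illustrated with a minimal sketch. This is not the pygscatalog implementation: the function names, the dosage encoding (0/1/2 alternate alleles, `None` for missing), and the variant IDs are all assumptions made for illustration. It only shows why a strict `>` comparison drops every homozygous site in a single sample, while `>=` keeps them.

```python
from typing import Dict, List, Optional


def minor_allele_frequency(dosages: List[Optional[int]]) -> float:
    """MAF computed from the target samples only (no reference panel).

    Each dosage is the alternate-allele count (0, 1 or 2); None is a missing call.
    """
    called = [d for d in dosages if d is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def eligible_variants(
    genotypes: Dict[str, List[Optional[int]]],
    maf_threshold: float,
    inclusive: bool,
) -> List[str]:
    """Return IDs of variants that pass the MAF filter.

    inclusive=False mimics a strict '>' comparison; inclusive=True uses '>='.
    """
    kept = []
    for variant_id, dosages in genotypes.items():
        maf = minor_allele_frequency(dosages)
        passes = maf >= maf_threshold if inclusive else maf > maf_threshold
        if passes:
            kept.append(variant_id)
    return kept


# Hypothetical single-sample genotypes: one dosage per variant.
single_sample = {
    "var_hom_ref": [0],  # homozygous reference, MAF 0.0
    "var_het": [1],      # heterozygous, MAF 0.5
    "var_hom_alt": [2],  # homozygous alternate, MAF 0.0
}

strict = eligible_variants(single_sample, maf_threshold=0.0, inclusive=False)
relaxed = eligible_variants(single_sample, maf_threshold=0.0, inclusive=True)
print(strict)   # only the heterozygous variant survives
print(relaxed)  # all three variants survive
```

With a threshold of 0.0, the strict comparison keeps only the heterozygous site, which matches the behavior reported in this thread; the inclusive comparison keeps all sites.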
For future reference the container references are set here: Lines 37 to 43 in 0f33b4c
(The full string is Changing code that's running inside containers can be a little tricky, you'd need to:
A simpler approach would be to try the conda profile and edit the code that gets installed directly to But we'll do a release soon to fix the intersect issues 😅 |
The changes are great. Will switching the pgsc_calc to the dev branch and/or switching the container to ghcr.io/pgscatalog/pygscatalog:pgscatalog-utils-1.2.0-singularity fix this issue? I have figured out how to point pgsc_calc to a local container. |
The dev branch has the changes included now. You should be able to disable filtering by setting |
Thanks! Setting --maf_target to 0.0 (no geno_miss_target modification) fixed the issue, and ancestry assignments are now correct. |
Actually, in my comment above, I had changed nextflow.config to set maf_target to 0.0. I tried again with the flag on the run command, without modifying nextflow.config, and it reverted to the problematic behavior (incorrect ancestry assignment). This was on version dev-f77eae1. Perhaps I need to change the flag name to pca_maf_target (70471cf)? Okay yes, reading further, I see it is supposed to be the new, more descriptive name (f77eae1). I will try it with the new flag name instead. |
Sorry, I changed it to have a more descriptive name because I was getting confused, and I merged the change last night. I was able to replicate and fix the problem with the new parameter. I'm still considering whether to remove the MAF filtering as the default, because it may cause more problems than it fixes. I still think the missingness filter is sensible and does not cause problems. |
The new parameters are available in v2.0.0-beta.2 🥳 |
Description of the bug
I switched from using v1 to the v2.0.0-beta.1 version. I ran pgsc_calc on an exclusively European sample, and all of the MostSimilarPop values are AFR. Looking closer, pgsc_calc is assigning AFR as the most similar but there is still a minority percentage of the other ancestries. Besides this it runs normally.
Command used and terminal output
nextflow run pgscatalog/pgsc_calc
Relevant files
No response
System information
AWS, pgsc_calc v2.0.0-beta.1.