Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent ancestry assignments in v2.0.0-beta vs. alpha #343

Closed
AWS-crafter opened this issue Jul 27, 2024 · 14 comments
Closed

Inconsistent ancestry assignments in v2.0.0-beta vs. alpha #343

AWS-crafter opened this issue Jul 27, 2024 · 14 comments
Labels
bug Something isn't working documentation Improvements or additions to documentation user-query User queries & requests

Comments

@AWS-crafter
Copy link

AWS-crafter commented Jul 27, 2024

Description of the bug

I switched from using v1 to the v2.0.0-beta.1 version. I ran pgsc_calc on an exclusively European sample, and all of the MostSimilarPop values are AFR. Looking closer, pgsc_calc is assigning AFR as the most similar but there is still a minority percentage of the other ancestries. Besides this it runs normally.

Command used and terminal output

nextflow run pgscatalog/pgsc_calc

Relevant files

No response

System information

AWS, pgsc_calc v2.0.0-beta.1.

@AWS-crafter AWS-crafter added the bug Something isn't working label Jul 27, 2024
@AWS-crafter
Copy link
Author

AWS-crafter commented Jul 27, 2024

Instead, I tried this in

  • pgsc_calc v2.0.0-beta (2024-06-19), and
  • pgsc_calc v2.0.0-alpha.6 (2024-05-24), with the same results.

@AWS-crafter
Copy link
Author

I tried this in:

  • pgsc_calc version: pgsc_calc v2.0.0-alpha.5 (2024-03-19) [failed to complete, re-ran, and failed again]
  • pgsc_calc version: pgsc_calc v2.0.0-alpha.3 (2023-10-02) [this seems to have good results].

Perhaps the issue occurred before pgsc_calc v2.0.0-alpha.6 and after pgsc_calc v2.0.0-alpha.3.

I will try these versions:

  • v2.0.0-alpha.6
  • v2.0.0-alpha.5
  • v2.0.0-alpha.4
  • v2.0.0-alpha.3

@smlmbrt
Copy link
Member

smlmbrt commented Jul 28, 2024

Hi @AWS-crafter, take a look at the related discussion here: #333

If you're running the calculator on an individual sample it may be unstable with the MAF & genotype missingness threshold? Or are you running it on many samples?

@smlmbrt smlmbrt added user-query User queries & requests and removed bug Something isn't working labels Jul 28, 2024
@AWS-crafter
Copy link
Author

For these tests, I am using a single sample. Why would MAF or missingness affect the results for a single sample? I would assume they are calculated either based on the available samples (so all sites in the single sample would have 0% missingness), or the reference panel, which also has just about every site in the sample.

  • pgsc_calc v2.0.0-alpha.3 (2023-10-02) [Appears normal. Tried on three samples, one sample per run.]
  • pgsc_calc v2.0.0-alpha.5 (2024-03-19) [Did not complete.]
  • pgsc_calc v2.0.0-alpha.6 (2024-05-24) [Incorrect ancestry assignment.]

@AWS-crafter
Copy link
Author

To find out what is causing this, I would like to change this file:
https://github.com/PGScatalog/pygscatalog/blob/main/pgscatalog.match/src/pgscatalog/match/cli/intersect_cli.py to simply make all variants PCA_ELIGIBLE.

Can I do this by editing the below line in pgsc_calc's conf/modules.config to point to the modified version of pygscatalog, since I am using Singularity?

conf/modules.config ext.singularity = 'oras://ghcr.io/pgscatalog/pygscatalog'

In other words, is this where pgsc_calc "looks" for pygscatalog?

Alternatively, I can change pgsc_calc's modules/local/ancestry/intersect_variants.nf to edit output file(s) to change all variants to PCA_ELIGIBLE, right after the pgscatalog-intersect step.

@AWS-crafter
Copy link
Author

For single sample variants, MAF is either 0 (homozygous genotype) or 0.5 (heterozygous genotype), since a reference panel isn't used. The current behavior is to filter out all homozygous variants (the vast majority) and only use heterozygous variants for PCA.

Even when I run multiple samples, this will filter out many variants if too many happen to be homozygous. This could cause issues if all samples are of the same ancestry group, but the reference panel is multi-ancestry, or if the sample size is small.

Settting the MAF threshold to 0.0 will not fix this, since 0.0 is not > maf_filter = 0.0.

Potential solutions:

  • Use the reference panel for minor allele frequency at positions.
  • Conditional on small sample size, such as a single sample, remove the MAF criteria altogether (or set the filter to greater than or equal to 0.0).

@smlmbrt
Copy link
Member

smlmbrt commented Jul 29, 2024

I will fix the MAF filter to allow >= 0 and make it possible to remove it altogether.

@nebfield
Copy link
Member

To find out what is causing this, I would like to change this file: https://github.com/PGScatalog/pygscatalog/blob/main/pgscatalog.match/src/pgscatalog/match/cli/intersect_cli.py to simply make all variants PCA_ELIGIBLE.

Can I do this by editing the below line in pgsc_calc's conf/modules.config to point to the modified version of pygscatalog, since I am using Singularity?

conf/modules.config ext.singularity = 'oras://ghcr.io/pgscatalog/pygscatalog'

In other words, is this where pgsc_calc "looks" for pygscatalog?

For future reference the container references are set here:

withLabel: pgscatalog_utils {
ext.conda = "$projectDir/environments/pgscatalog_utils/environment.yml"
ext.docker = 'ghcr.io/pgscatalog/pygscatalog'
ext.singularity = 'oras://ghcr.io/pgscatalog/pygscatalog'
ext.docker_version = ':pgscatalog-utils-1.1.2'
ext.singularity_version = ':pgscatalog-utils-1.1.2-singularity'
}

(The full string is oras://ghcr.io/pgscatalog/pygscatalog:pgscatalog-utils-1.1.2-singularity)

Changing code that's running inside containers can be a little tricky, you'd need to:

  1. clone the pygscatalog repository
  2. make changes to the python package
  3. rebuild the singularity image and update the container references

A simpler approach would be to try the conda profile and edit the code that gets installed directly to work/envs/...

But we'll do a release soon to fix the intersect issues 😅

@smlmbrt smlmbrt added documentation Improvements or additions to documentation bug Something isn't working labels Jul 29, 2024
@smlmbrt smlmbrt changed the title Wrong ancestry assignments in v2.0.0 Wrong ancestry assignments in v2.0.0-beta Jul 29, 2024
@smlmbrt smlmbrt changed the title Wrong ancestry assignments in v2.0.0-beta Inconsistent ancestry assignments in v2.0.0-beta vs. alpha Jul 29, 2024
@AWS-crafter
Copy link
Author

AWS-crafter commented Jul 29, 2024

The changes are great. Will switching the pgsc_calc to the dev branch and/or switching the container to ghcr.io/pgscatalog/pygscatalog:pgscatalog-utils-1.2.0-singularity fix this issue? I have figured out how to point pgsc_calc to a local container.

@nebfield nebfield mentioned this issue Jul 30, 2024
@nebfield
Copy link
Member

nebfield commented Jul 30, 2024

The changes are great. Will switching the pgsc_calc to the dev branch and/or switching the container to ghcr.io/pgscatalog/pygscatalog:pgscatalog-utils-1.2.0-singularity fix this issue? I have figured out how to point pgsc_calc to a local container.

The dev branch has the changes included now. You should be able to disable filtering by setting --maf_target to 0 and --geno_miss_target to 1 when calling nextflow run pgscatalog/pgsc_calc -r dev ...

@AWS-crafter
Copy link
Author

Thanks! Setting --maf_target to 0.0 (no geno_miss_target modification) fixed the issue, and ancestry assignments are now correct.

@AWS-crafter
Copy link
Author

AWS-crafter commented Jul 31, 2024

Actually, in my above comment, I changed nextflow.config to have maf_target with 0.0. I tried again with the flag on the run command without modifying nextflow.config and it reverted to the problematic behavior (incorrect ancestry assignment). This was on version dev-f77eae1. Perhaps I need to change the flag name to pca_maf_target (70471cf)?

Okay yes, reading further, I see it is supposed to be the new more descriptive name. f77eae1

I will try it with the new flag name instead.

@smlmbrt
Copy link
Member

smlmbrt commented Jul 31, 2024

Sorry, I changed it to have a more descriptive name because I was getting confused and merged the change last night. I was able to replicate and fix the problem with the new parameter. I'm still considering whether to remove the MAF filtering as default because it may cause more problems than it fixes, I still think the missingness filter is sensible and does not cause problems.

@nebfield
Copy link
Member

The new parameters are available in v2.0.0-beta.2 🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation user-query User queries & requests
Projects
None yet
Development

No branches or pull requests

3 participants