pheno 137 #20

Closed
jennysjaarda opened this issue Sep 9, 2019 · 5 comments

Comments

@jennysjaarda

Hello,

I was just wondering if you know why all variants in the file located here: wget https://www.dropbox.com/s/qz4bu9lffse7q3l/137.gwas.imputed_v3.female.tsv.bgz?dl=0 -O 137.gwas.imputed_v3.female.tsv.bgz, are low confidence variants, while in the male and all_sex version there are millions of "low_confidence_variant==FALSE".

Thanks in advance,
Jenny

@howrigan
Collaborator

howrigan commented Sep 12, 2019

Hi Jenny,

Thank you for calling this to our attention - after some digging around, it turns out that the issue with the low confidence flag is a bug in the code that only affects phenotypes with 3 or 4 categories. Fortunately, the summary statistics themselves are not compromised, but the low confidence flag ('low_confidence_variant' column) and the corresponding expected AC ('expected_min_category_minor_AC' column) are incorrect, as they are based not on the sample size (N) of the smallest category, but on the N of missing values.

We are going to update the summary statistics to fix the low confidence flag, which affects ~147 categorical phenotypes. In the meantime, I describe below how you can manually recalculate the expected_min_category_minor_AC and update the low_confidence_variant flag using the phenotype summary file, with 137.gwas.imputed_v3.female.tsv.bgz as the example.

In the case of 137.gwas.imputed_v3.female.tsv.bgz, the missing N was 21 subjects, which would require a SNP at > 50% MAF to hit the 25 allele count cutoff. That is impossible (MAF caps out at 50%), which is why all SNPs are deemed low confidence. The only reason there are "low_confidence_variant==FALSE" entries in the male and both_sexes sumstat files is that their missing N is 32 and 53, respectively.
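
As a rough check of that arithmetic, here is a minimal Python sketch. The function name and the labels are illustrative only and are not taken from the pipeline code; the cutoff of 25 and the missing N values are the ones quoted above.

```python
# Minimum minor allele frequency needed to reach the allele-count cutoff
# when the expected count is (incorrectly) based on n_missing.
AC_CUTOFF = 25  # expected minor allele count threshold from the README


def min_maf_for_cutoff(n, cutoff=AC_CUTOFF):
    """Smallest minor_AF for which minor_AF * n * 2 reaches the cutoff."""
    return cutoff / (2 * n)


# Missing N values quoted in this thread for phenotype 137.
for label, n_missing in [("female", 21), ("male", 32), ("both_sexes", 53)]:
    maf = min_maf_for_cutoff(n_missing)
    reachable = maf <= 0.5  # MAF cannot exceed 50%
    print(f"{label}: n_missing={n_missing}, required MAF >= {maf:.3f}, reachable={reachable}")
```

With N = 21 the required MAF is above 0.5, so no variant can pass, whereas N = 32 and N = 53 leave a window of common variants that can.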

From looking at the female phenotype summary file (phenotypes.female.tsv.gz), here is the breakdown of phenotype 137:

phenotype: 137
description: Number of treatments/medications taken
variable_type: ordinal
source: phesant
n_non_missing: 194153
n_missing: 21
n_controls: NA
n_cases: NA
PHESANT_transformation:
137_0|| INTEGER || CONTINUOUS || >20% IN ONE CATEGORY || Split into three bins: 0: <1, 1: [1,3), 2: >=3 ||cat N: 98337, 124631, 138173 || CAT-ORD || order: 0|1|2 || num categories: 3 ||
notes: Number of treatments (medications) entered

Here is the one-liner I used to grab this info:
zless -S phenotypes.female.tsv.gz | grep 'phenotype\|medications\ taken'

The key piece is to look at the PHESANT transformation, where it collapses the values into three category bins (Split into three bins: 0: <1, 1: [1,3), 2: >=3), and shows the resulting sample counts of each bin (cat N: 98337, 124631, 138173).
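
If you have many phenotypes to check, the per-category counts can be pulled out of the PHESANT_transformation string programmatically. The sketch below is one way to do it in Python; the helper name is hypothetical, and it assumes the field always appears as "cat N:" followed by comma-separated integers, as in the example above.

```python
import re


def smallest_category_n(phesant_transformation):
    """Return the smallest per-category sample count from the 'cat N:' field."""
    # Assumption: counts follow 'cat N:' as comma-separated integers.
    match = re.search(r"cat N:\s*([\d,\s]+)", phesant_transformation)
    if match is None:
        raise ValueError("no 'cat N:' field found")
    counts = [int(tok) for tok in match.group(1).split(",") if tok.strip()]
    return min(counts)


# PHESANT_transformation string for phenotype 137, copied from the summary above.
example = (
    "137_0|| INTEGER || CONTINUOUS || >20% IN ONE CATEGORY || "
    "Split into three bins: 0: <1, 1: [1,3), 2: >=3 ||"
    "cat N: 98337, 124631, 138173 || CAT-ORD || order: 0|1|2 || num categories: 3 ||"
)
print(smallest_category_n(example))
```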

From here, we can deduce that the smallest category (Nsmallest_cat), those with 0 treatments/medications taken, has 98337 samples. Using this count and the 'minor_AF' column in the sumstat file, we can re-calculate the 'expected_min_category_minor_AC' like so:

expected_min_category_minor_AC = minor_AF * Nsmallest_cat * 2

So taking a quick look at the first SNPs in the sumstats:
zless 137.gwas.imputed_v3.female.tsv | head | awk '{print $1,$2,$3,$4,$5}' | column -t

variant        minor_allele  minor_AF     expected_min_category_minor_AC  low_confidence_variant
1:15791:C:T    T             0.00000e+00  0.00000e+00                     true
1:69487:G:A    A             5.44345e-06  2.28625e-04                     true
1:69569:T:C    C             1.83512e-04  7.70751e-03                     true
1:139853:C:T   T             5.38286e-06  2.26080e-04                     true
1:692794:CA:C  C             1.10975e-01  4.66096e+00                     true
1:693731:A:G   G             1.16217e-01  4.88111e+00                     true
1:707522:G:C   C             9.77612e-02  4.10597e+00                     true
1:717587:G:A   A             1.56931e-02  6.59109e-01                     true
1:723329:A:T   T             1.72507e-03  7.24529e-02                     true

For variant 1:69487:G:A,
the expected_min_category_minor_AC is 5.44345e-06 * 98337 * 2 = 1.070585, making it a low confidence variant

But for variant 1:69569:T:C,
the expected_min_category_minor_AC is 1.83512e-04 * 98337 * 2 = 36.09204, which is above 25, so it shouldn't be considered a low confidence variant based on allele count. However, because the MAF is below 0.001, it would still be flagged as low confidence based on allele frequency.
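
The two worked examples above can be reproduced with a few lines of Python. This is only a sketch of the formula as described in this comment; the function and constant names are made up for illustration.

```python
# Smallest category count for phenotype 137, taken from the 'cat N' field.
N_SMALLEST_CAT = 98337


def expected_min_category_minor_ac(minor_af, n_smallest_cat):
    """expected_min_category_minor_AC = minor_AF * Nsmallest_cat * 2."""
    return minor_af * n_smallest_cat * 2


# minor_AF values copied from the first rows of the sumstat file shown above.
for variant, minor_af in [("1:69487:G:A", 5.44345e-06), ("1:69569:T:C", 1.83512e-04)]:
    ac = expected_min_category_minor_ac(minor_af, N_SMALLEST_CAT)
    print(f"{variant}: expected_min_category_minor_AC = {ac:.4f}")
```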

In the README, our rule of thumb was as follows:

Flag indicating low confidence results based on the following heuristics:

  • Case/control phenotypes: expected_case_minor_AC < 25 or minor_AF < 0.001.
  • Categorical phenotypes with less than 5 categories: expected_min_category_minor_AC < 25 or minor_AF < 0.001.
  • Quantitative phenotypes: minor_AF < 0.001.
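
For reference, these heuristics can be written as a small Python function. This is a sketch of the rule of thumb as worded above, not the code used to generate the files; the function name and the `kind` labels are assumptions made for the example.

```python
def is_low_confidence(minor_af, kind, expected_minor_ac=None,
                      ac_cutoff=25, af_cutoff=0.001):
    """Apply the README heuristics to one variant.

    kind is a hypothetical label: 'case_control', 'categorical'
    (fewer than 5 categories) or 'quantitative'.
    """
    if kind not in ("case_control", "categorical", "quantitative"):
        raise ValueError(f"unknown phenotype kind: {kind}")
    # The allele-frequency filter applies to every phenotype type.
    if minor_af < af_cutoff:
        return True
    if kind == "quantitative":
        return False
    # Case/control and categorical phenotypes also need enough expected
    # minor alleles in the cases or in the smallest category.
    if expected_minor_ac is None:
        raise ValueError("expected_minor_ac is required for this phenotype kind")
    return expected_minor_ac < ac_cutoff


# Variant 1:692794:CA:C from the table above, with the recalculated count.
minor_af = 1.10975e-01
print(is_low_confidence(minor_af, "categorical", minor_af * 98337 * 2))
```

Both cutoffs are keyword arguments so that they can be relaxed when looking at rarer SNPs.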

Obviously, we've left it up to the analyst using the results files to override our suggested cutoffs and look at rarer SNPs, but now you can accurately re-calculate the columns without having to wait for us to update the sumstat files :)

@jennysjaarda
Author

jennysjaarda commented Sep 26, 2019 via email

@howrigan
Collaborator

howrigan commented Oct 9, 2019

Hi Jenny,

Thanks for your patience here - it's taken a while to fully flesh out the problem, as not all phenotypes with 3 categories were affected, only a subset of them. I've added the list of the 111 phenotypes (with filenames) to the GitHub repository, and you can download it here:
https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/GWAS_list_low_confidence_filter_update.txt.gz

I've also added an R script that you can use to update the files here:
https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/Rapid_GWAS_low_confidence_filter_update.R

Usage instructions are given at the beginning of the script.

While I listed all files affected, there are a number of files where there is no need to update the filter, as the MAF < .001 filter is more restrictive than the smallest category filter. These files are listed as FALSE in the "tsv_requires_update" column. I would remove these lines from the list, along with any other files that you didn't download, then run the R script on your specified list.

I am currently running an update to summary statistics, but wanted to get this out so users could update the summary statistics on their own downloaded files. Please let us know here if you run into any more issues!

@jennysjaarda
Author

Thanks a lot for all your help! I just wanted to clarify one thing. On the readme page of this repository you indicate that the files have been updated in the manifest - were they updated with the same name or some name with a new version # appended to the name?

@howrigan
Collaborator

They were updated with the same name and wget command. We will probably add another field to the Manifest indicating that the file was edited at a more recent date, but I would use the updated file list for now to find which GWAS are updated.
