pheno 137 #20

jennysjaarda · 2019-09-09T08:53:27Z

Hello,

I was just wondering if you know why all variants in the file located here: wget https://www.dropbox.com/s/qz4bu9lffse7q3l/137.gwas.imputed_v3.female.tsv.bgz?dl=0 -O 137.gwas.imputed_v3.female.tsv.bgz, are low confidence variants, while in the male and all_sex version there are millions of "low_confidence_variant==FALSE".

Thanks in advance,
Jenny

howrigan · 2019-09-12T21:47:07Z

Hi Jenny,

Thank you for calling this to our attention. After some digging around, it turns out that the issue with the low confidence flag is a bug in the code that only affects phenotypes with 3 or 4 categories. Fortunately, the summary statistics themselves are not compromised, but the low confidence flag ('low_confidence_variant' column) and the corresponding expected AC ('expected_min_category_minor_AC' column) are incorrect, as they were based not on the sample size (N) of the smallest category but on the N of missing values.

We are going to update the summary statistics to fix the low confidence flag, which affects ~147 categorical phenotypes. In the meantime, I describe below how you can manually recalculate the expected_min_category_minor_AC and update the low_confidence_variant flag using the phenotype summary file, with 137.gwas.imputed_v3.female.tsv.bgz as the example.

In the case of 137.gwas.imputed_v3.female.tsv.bgz, the missing N was 21 subjects, so a SNP would need a MAF above 50% to reach the allele count cutoff of 25. That is impossible (MAF caps out at 50%), which is why all SNPs are deemed low confidence. The only reason there are "low_confidence_variant==FALSE" entries in the male and both_sexes sumstat files is that their missing N is 32 and 53, respectively.
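As a quick check of that arithmetic, here is a minimal Python sketch. The helper name `min_maf_for_cutoff` is made up for illustration; the cutoff of 25 is the README heuristic quoted further down, and the N values are the missing counts mentioned above.

```python
def min_maf_for_cutoff(n_samples, ac_cutoff=25):
    """Smallest minor allele frequency whose expected minor allele count
    (2 * N * MAF) reaches the cutoff for a group of n_samples people."""
    return ac_cutoff / (2 * n_samples)


# Missing N for the three sumstat files, as given in this thread.
for label, n in [("female", 21), ("male", 32), ("both_sexes", 53)]:
    maf = min_maf_for_cutoff(n)
    reachable = maf <= 0.5  # a minor allele frequency cannot exceed 0.5
    print(f"{label}: N={n}, required MAF >= {maf:.3f}, reachable={reachable}")
```

With N = 21 the required MAF is above 0.5, so no variant can pass; with N = 32 or 53 it falls below 0.5, so common variants can.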

From looking at the female phenotype summary file (phenotypes.female.tsv.gz), here is the breakdown of phenotype 137:

phenotype: 137
description: Number of treatments/medications taken
variable_type: ordinal
source: phesant
n_non_missing: 194153
n_missing: 21
n_controls: NA
n_cases: NA
PHESANT_transformation:
137_0|| INTEGER || CONTINUOUS || >20% IN ONE CATEGORY || Split into three bins: 0: <1, 1: [1,3), 2: >=3 ||cat N: 98337, 124631, 138173 || CAT-ORD || order: 0|1|2 || num categories: 3 ||
notes: Number of treatments (medications) entered

Here is the one-liner I used to grab this info:
zless -S phenotypes.female.tsv.gz | grep 'phenotype\|medications\ taken'

The key piece is the PHESANT transformation, which collapses the values into three category bins (Split into three bins: 0: <1, 1: [1,3), 2: >=3) and shows the resulting sample counts of each bin (cat N: 98337, 124631, 138173).
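If you need to do this for many phenotypes, the smallest category count can be pulled out of the PHESANT_transformation string programmatically. The following is only a sketch: the function name `smallest_category_n` is hypothetical, and it assumes every affected phenotype uses the same "cat N: a, b, c" layout as the phenotype 137 string above.

```python
import re


def smallest_category_n(phesant_transformation):
    """Return the smallest per-category sample count found after 'cat N:'
    in a PHESANT_transformation string, or None if no counts are present."""
    match = re.search(r"cat N:\s*([\d,\s]+)", phesant_transformation)
    if match is None:
        return None
    counts = [int(tok) for tok in match.group(1).split(",") if tok.strip()]
    return min(counts) if counts else None


# The PHESANT_transformation string for phenotype 137, copied from above.
example = (
    "137_0|| INTEGER || CONTINUOUS || >20% IN ONE CATEGORY || "
    "Split into three bins: 0: <1, 1: [1,3), 2: >=3 ||cat N: 98337, 124631, 138173 "
    "|| CAT-ORD || order: 0|1|2 || num categories: 3 ||"
)
print(smallest_category_n(example))
```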

From here, we can deduce that the smallest category (Nsmallest_cat), those with 0 treatments/medications taken, has 98337 samples. With this count and the 'minor_AF' column in the sumstat file, we can re-calculate the 'expected_min_category_minor_AC' like so:

expected_min_category_minor_AC = minor_AF * Nsmallest_cat * 2

So taking a quick look at the first SNPs in the sumstats:
zless 137.gwas.imputed_v3.female.tsv | head | awk '{print $1,$2,$3,$4,$5}' | column -t

variant        minor_allele  minor_AF     expected_min_category_minor_AC  low_confidence_variant
1:15791:C:T    T             0.00000e+00  0.00000e+00                     true
1:69487:G:A    A             5.44345e-06  2.28625e-04                     true
1:69569:T:C    C             1.83512e-04  7.70751e-03                     true
1:139853:C:T   T             5.38286e-06  2.26080e-04                     true
1:692794:CA:C  C             1.10975e-01  4.66096e+00                     true
1:693731:A:G   G             1.16217e-01  4.88111e+00                     true
1:707522:G:C   C             9.77612e-02  4.10597e+00                     true
1:717587:G:A   A             1.56931e-02  6.59109e-01                     true
1:723329:A:T   T             1.72507e-03  7.24529e-02                     true

For variant 1:69487:G:A,
the expected_min_category_minor_AC is 5.44345e-06 * 98337 * 2 = 1.070585, making it a low confidence variant.

But for variant 1:69569:T:C,
the expected_min_category_minor_AC is 1.83512e-04 * 98337 * 2 = 36.09204, which is above 25, so it shouldn't be considered a low confidence variant based on allele count. However, because its MAF is below 0.001, it would still be flagged as low confidence based on allele frequency.

In the README, our rule of thumb was as follows:

Flag indicating low confidence results based on the following heuristics:

Case/control phenotypes: expected_case_minor_AC < 25 or minor_AF < 0.001.
Categorical phenotypes with less than 5 categories: expected_min_category_minor_AC < 25 or minor_AF < 0.001.
Quantitative phenotypes: minor_AF < 0.001.
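Putting the formula and the categorical heuristic together, a recalculation for a single variant might look like the sketch below. The function name `recompute_flag` and its parameter names are made up for illustration; the cutoffs are the README values and 98337 is the smallest category count for phenotype 137 in females.

```python
def recompute_flag(minor_af, n_smallest_cat, ac_cutoff=25, af_cutoff=0.001):
    """Recompute expected_min_category_minor_AC and the low confidence flag
    for a categorical phenotype with fewer than 5 categories."""
    expected_ac = minor_af * n_smallest_cat * 2
    low_confidence = expected_ac < ac_cutoff or minor_af < af_cutoff
    return expected_ac, low_confidence


N_SMALLEST_CAT = 98337  # smallest 'cat N' for phenotype 137 (female)

# (variant, minor_AF) pairs taken from the sumstat rows shown above.
for variant, minor_af in [
    ("1:69487:G:A", 5.44345e-06),
    ("1:69569:T:C", 1.83512e-04),
    ("1:692794:CA:C", 1.10975e-01),
]:
    expected_ac, low_conf = recompute_flag(minor_af, N_SMALLEST_CAT)
    print(variant, f"{expected_ac:.5e}", str(low_conf).lower())
```

The same function could be applied row by row to a downloaded sumstat file once the smallest category count for that phenotype is known.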

Obviously, we've left it up to the analyst using the results files to override our suggested cutoffs and look at rarer SNPs, but now you can accurately re-calculate the columns without having to wait for us to update the sumstat files :)

jennysjaarda · 2019-09-26T08:38:44Z

Hi Daniel,

Sorry for the late reply, and thank you very much for the detailed explanation! Do you know when you expect to release the updated summary stats with the proper low_confidence_flag? I see how to redefine these flags in theory, however I have downloaded many, many files and am using this flag, so I would like to quickly update them where necessary. I fear it would take me too long to write a script to find those phenotypes which were split into 3 or 4 bins, grab the lowest n, and then recalculate the expected_min_category_minor_AC and low_confidence_variant. It isn't a huge rush. If you already have such a script, I am happy to rerun rather than wait for you to re-process :)

Best,
Jenny


howrigan · 2019-10-09T16:07:47Z

Hi Jenny,

Thanks for your patience here. It's taken a while to fully flesh out the problem, as not all phenotypes with 3 categories were affected, only a subset of them. I've added the list of the 111 phenotypes (with filenames) to the GitHub repository, and you can download it here:
https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/GWAS_list_low_confidence_filter_update.txt.gz

I've also added an R script that you can use to update the files here:
https://github.com/Nealelab/UK_Biobank_GWAS/blob/master/Rapid_GWAS_low_confidence_filter_update.R

Usage instructions are given at the beginning of the script.

While I listed all files affected, there are a number of files where there is no need to update the filter, as the MAF < .001 filter is more restrictive than the smallest category filter. These files are listed as FALSE in the "tsv_requires_update" column. I would remove these lines from the list, along with any other files that you didn't download, then run the Rscript on your specified list.
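Trimming the list could be sketched in Python as below. This is an illustration under assumptions, not a description of the actual file: it assumes the decompressed list is tab-delimited with a header row, that the flag values are the strings TRUE and FALSE, and that the filenames sit in a column called `filename` (a hypothetical name; adjust it to the real header). The two demo rows are invented placeholders.

```python
import csv
import io


def rows_to_update(lines, downloaded, filename_col="filename"):
    """Keep rows whose tsv_requires_update is TRUE and whose file was downloaded.

    lines: iterable of text lines (header first), assumed tab-delimited.
    downloaded: set of filenames present locally.
    filename_col: name of the filename column (hypothetical default).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    keep = []
    for row in reader:
        needs_update = row["tsv_requires_update"].strip().upper() == "TRUE"
        if needs_update and row[filename_col] in downloaded:
            keep.append(row)
    return keep


# Invented two-row list for illustration only; not real entries from the list file.
demo = io.StringIO(
    "filename\ttsv_requires_update\n"
    "example_A.tsv.bgz\tTRUE\n"
    "example_B.tsv.bgz\tFALSE\n"
)
kept = rows_to_update(demo, downloaded={"example_A.tsv.bgz", "example_B.tsv.bgz"})
print([row["filename"] for row in kept])
```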

I am currently running an update to summary statistics, but wanted to get this out so users could update the summary statistics on their own downloaded files. Please let us know here if you run into any more issues!

jennysjaarda · 2019-10-23T21:40:09Z

Thanks a lot for all your help! I just wanted to clarify one thing. On the readme page of this repository you indicate that the files have been updated in the manifest - were they updated with the same name or some name with a new version # appended to the name?

howrigan · 2019-10-25T14:54:02Z

They were updated with the same name and wget command. We will probably add another field to the Manifest indicating that the file was edited at a more recent date, but I would use the updated file list for now to find which GWAS are updated.

howrigan closed this as completed Jan 17, 2020
