pheno 137 #20
Comments
Hi Jenny,

Thank you for calling this to our attention. After some digging around, it turns out that the issue with the low confidence flag is a bug in the code that only affects phenotypes with 3 or 4 categories. Fortunately, the summary statistics themselves are not compromised, but the low confidence flag ('low_confidence_variant' column) and the corresponding expected AC ('expected_min_category_minor_AC' column) are incorrect, as they are based not on the sample size (N) of the smallest category but on the N of missing values.

We are going to update the summary statistics to fix the low confidence flag, which affects ~147 categorical phenotypes. In the meantime, I describe below how you can manually recalculate the expected_min_category_minor_AC and update the low_confidence_variant flag using the phenotype summary file, with 137.gwas.imputed_v3.female.tsv.bgz as the example.

In the case of 137.gwas.imputed_v3.female.tsv.bgz, the missing N was 21 subjects, which would require a SNP at > 50% MAF to hit the 25 allele count cutoff. That is impossible (MAF caps out at 50%), which is why all SNPs are deemed low confidence. The only reason there are "low_confidence_variant==FALSE" entries in the male and both_sexes sumstat files is that their missing N is 32 and 53, respectively.

From looking at the female phenotype summary file (phenotypes.female.tsv.gz), here is the breakdown of phenotype 137:
Here is the one-liner I used to grab this info:

The key piece is to look at the PHESANT transformation, where it collapses the values into three category bins (Split into three bins: 0: <1, 1: [1,3), 2: >=3) and shows the resulting sample counts of each bin (cat N: 98337, 124631, 138173).

From here, we can deduce that the smallest category (Nsmallest_cat), those with 0 treatments/medications taken, has 98337 samples. Using this count, we can recalculate the 'expected_min_category_minor_AC' from the 'minor_AF' column in the sumstat file like so:

expected_min_category_minor_AC = minor_AF * Nsmallest_cat * 2

So taking a quick look at the first SNPs in the sumstats:
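For variant 1:69487:G:A, the expected_min_category_minor_AC is 5.44345e-06 * 98337 * 2 = 1.070585, making it a low confidence variant.

But for variant 1:69569:T:C, the expected_min_category_minor_AC is 1.83512e-04 * 98337 * 2 = 36.09204, which is above 25, so it clears the allele count cutoff.

In the README, our rule of thumb was as follows:

Flag indicating low confidence results based on the following heuristics:
- Case/control phenotypes: expected_case_minor_AC < 25 or minor_AF < 0.001.
- Categorical phenotypes with less than 5 categories: expected_min_category_minor_AC < 25 or minor_AF < 0.001.
- Quantitative phenotypes: minor_AF < 0.001.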
Obviously, we've left it up to the analyst using the results files to override our suggested cutoffs and look at rarer SNPs, but now you can accurately recalculate the columns without having to wait for us to update the sumstat files :)
Hi Daniel,
Sorry for the late reply and thank you very much for the detailed
explanation! Do you know when you expect to release the updated summary
stats with the proper low_confidence_flag? I see how to redefine these
flags in theory, however I have downloaded many, many files and am using
this flag, so I would like to quickly update them where necessary. I fear
it would take me too long to write a script to find those phenotypes which
were split into 3 or 4 bins, grab the lowest n, and then recalculate
the expected_min_category_minor_AC
and low_confidence_variant. It isn't a huge rush. If you already have such
a script, I am happy to rerun rather than wait for you to re-process :)
Best,
Jenny
…On Thu, Sep 12, 2019 at 11:47 PM Daniel P Howrigan ***@***.***> wrote:
Hi Jenny,
Thank you for calling this to our attention - after some digging around,
it turns out that the issue with the low confidence flag is a bug in the
code that only affects phenotypes with 3 or 4 categories. Fortunately, the
summary statistics themselves are not compromised, but the low confidence
flag ('low_confidence_variant' column) and corresponding expected AC
('expected_min_category_minor_AC' column) are incorrect, as they are based not on
the sample size (N) of the smallest category but on the N of missing values.
We are going to update the summary statistics to fix the low confidence
flag, which affects ~147 categorical phenotypes. In the meantime, I
describe below how you can manually recalculate the
expected_min_category_minor_AC and update the low_confidence_variant flag
using the phenotype summary file, with
137.gwas.imputed_v3.female.tsv.bgz as the example.
In the case of 137.gwas.imputed_v3.female.tsv.bgz, the missing N was 21
subjects, which would require a SNP at > 50% MAF to hit the 25 allele count
cutoff, which is impossible (MAF caps out at 50%), and why all SNPs are
deemed low confidence. The only reason there are
"low_confidence_variant==FALSE" entries in the male and both_sexes sumstat
files is that their missing N is 32 and 53, respectively.
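The argument above can be checked with a few lines of arithmetic. This is only a sketch: the function name `min_maf_to_pass` is made up for illustration, and it assumes the flag was computed as minor_AF * N * 2 against the cutoff of 25, with the missing N (21, 32, 53) wrongly standing in for the smallest category N.

```python
AC_CUTOFF = 25  # allele count cutoff quoted in the thread


def min_maf_to_pass(n, cutoff=AC_CUTOFF):
    """Smallest minor_AF for which minor_AF * n * 2 reaches the cutoff.

    Returns None when that frequency would exceed 0.5, i.e. when no
    variant can pass because a minor allele frequency caps out at 50%.
    """
    maf = cutoff / (2 * n)
    return maf if maf <= 0.5 else None


# Missing N values reported in the thread for phenotype 137.
for label, n_missing in [("female", 21), ("male", 32), ("both_sexes", 53)]:
    print(label, n_missing, min_maf_to_pass(n_missing))
```

With N = 21 the required frequency is above 0.5, so every variant is flagged; with N = 32 or 53 only fairly common variants can clear the cutoff.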
From looking at the female phenotype summary file
(phenotypes.female.tsv.gz), here is the breakdown of phenotype 137:
phenotype: 137
description: Number of treatments/medications taken
variable_type: ordinal
source: phesant
n_non_missing: 194153
n_missing: 21
n_controls: NA
n_cases: NA
PHESANT_transformation:
137_0|| INTEGER || CONTINUOUS || >20% IN ONE CATEGORY || Split into three bins: 0: <1, 1: [1,3), 2: >=3 ||cat N: 98337, 124631, 138173 || CAT-ORD || order: 0|1|2 || num categories: 3 ||
notes: Number of treatments (medications) entered
Here is the one-liner I used to grab this info:
zless -S phenotypes.female.tsv.gz | grep 'phenotype\|medications\ taken'
The key piece is to look at the PHESANT transformation, where it collapses
the values into three category bins (Split into three bins: 0: <1, 1:
[1,3), 2: >=3), and shows the resulting sample counts of each bin (cat N:
98337, 124631, 138173)
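For anyone who wants to pull the bin counts out programmatically, a minimal sketch follows. It assumes the PHESANT_transformation field always carries a "cat N: a, b, c" segment formatted like the one shown above; the helper name `category_counts` is hypothetical and not part of any released script.

```python
import re

# PHESANT_transformation string for phenotype 137, copied from the
# phenotype summary file excerpt above.
PHESANT_LINE = (
    "137_0|| INTEGER || CONTINUOUS || >20% IN ONE CATEGORY || "
    "Split into three bins: 0: <1, 1: [1,3), 2: >=3 ||"
    "cat N: 98337, 124631, 138173 || CAT-ORD || order: 0|1|2 || "
    "num categories: 3 ||"
)


def category_counts(transformation):
    """Return the per-bin sample counts listed after 'cat N:'."""
    match = re.search(r"cat N:\s*([\d,\s]+)", transformation)
    if match is None:
        raise ValueError("no 'cat N:' segment found")
    return [int(part) for part in match.group(1).split(",") if part.strip()]


counts = category_counts(PHESANT_LINE)
print(counts, "smallest:", min(counts))
```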
From here, we can deduce that the smallest category (Nsmallest_cat), those
with 0 treatments/medications taken, has 98337 samples. Using this count,
we can re-calculate the 'expected_min_category_minor_AC' with this value
using the 'minor_AF' column in the sumstat file like so:
expected_min_category_minor_AC = minor_AF * Nsmallest_cat * 2
So taking a quick look at the first SNPs in the sumstats:
zless 137.gwas.imputed_v3.female.tsv | head | awk '{print $1,$2,$3,$4,$5}' | column -t
variant minor_allele minor_AF expected_min_category_minor_AC low_confidence_variant
1:15791:C:T T 0.00000e+00 0.00000e+00 true
1:69487:G:A A 5.44345e-06 2.28625e-04 true
1:69569:T:C C 1.83512e-04 7.70751e-03 true
1:139853:C:T T 5.38286e-06 2.26080e-04 true
1:692794:CA:C C 1.10975e-01 4.66096e+00 true
1:693731:A:G G 1.16217e-01 4.88111e+00 true
1:707522:G:C C 9.77612e-02 4.10597e+00 true
1:717587:G:A A 1.56931e-02 6.59109e-01 true
1:723329:A:T T 1.72507e-03 7.24529e-02 true
For variant 1:69487:G:A,
the expected_min_category_minor_AC is 5.44345e-06 * 98337 * 2 = 1.070585,
making it a low confidence variant
But for variant 1:69569:T:C,
the expected_min_category_minor_AC is 1.83512e-04 * 98337 * 2 = 36.09204,
which is above 25, so it clears the allele count cutoff (its minor_AF is
still below 0.001, so the full README heuristic below would keep it flagged).
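The two hand calculations above can be reproduced as follows. This is a sketch of the formula as stated in this thread (minor_AF * Nsmallest_cat * 2), not the code used to generate the released files; the function name is chosen for illustration.

```python
N_SMALLEST_CAT = 98337  # smallest bin count quoted above for phenotype 137
AC_CUTOFF = 25


def expected_min_category_minor_ac(minor_af, n_smallest_cat):
    """Expected minor allele count in the smallest category."""
    return minor_af * n_smallest_cat * 2


# minor_AF values taken from the sumstat excerpt above.
for variant, minor_af in [("1:69487:G:A", 5.44345e-06),
                          ("1:69569:T:C", 1.83512e-04)]:
    ac = expected_min_category_minor_ac(minor_af, N_SMALLEST_CAT)
    status = "below" if ac < AC_CUTOFF else "at or above"
    print(f"{variant}\t{ac:.6f}\t{status} the AC cutoff")
```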
In the README, our rule of thumb was as follows:
Flag indicating low confidence results based on the following heuristics:
- Case/control phenotypes: expected_case_minor_AC < 25 or minor_AF <
0.001.
- Categorical phenotypes with less than 5 categories:
expected_min_category_minor_AC < 25 or minor_AF < 0.001.
- Quantitative phenotypes: minor_AF < 0.001.
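As a sketch, the categorical rule in the list above can be written as a single predicate. The function and parameter names here are illustrative only. Note that the rule is an OR of two conditions, so a variant such as 1:69569:T:C, whose recalculated expected AC is above 25, is still flagged because its minor_AF is below 0.001.

```python
def is_low_confidence(minor_af, n_smallest_cat, ac_cutoff=25, af_cutoff=0.001):
    """README heuristic for categorical phenotypes with < 5 categories:
    expected_min_category_minor_AC < 25 or minor_AF < 0.001."""
    expected_ac = minor_af * n_smallest_cat * 2
    return expected_ac < ac_cutoff or minor_af < af_cutoff


N_SMALLEST_CAT = 98337  # smallest bin count quoted above

# minor_AF values taken from the sumstat excerpt above.
print(is_low_confidence(1.83512e-04, N_SMALLEST_CAT))  # AC passes, AF does not
print(is_low_confidence(1.56931e-02, N_SMALLEST_CAT))  # both checks pass
print(is_low_confidence(1.10975e-01, 21))              # the buggy N of 21
```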
Obviously, we've left it up to the analyst using the results files to
override our suggested cutoffs and look at rarer SNPs, but now you can
accurately re-calculate the columns without having to wait for us to update
the sumstat files :)
Hi Jenny,

Thanks for your patience here. It has taken a while to fully flesh out the problem, as not all phenotypes with 3 categories were affected, only a subset of them.

I've added the list of the 111 phenotypes (with filenames) to the GitHub and you can download here:

I've also added an R script that you can use to update the files here:

Usage instructions are given at the beginning of the script.

While I listed all files affected, there are a number of files where there is no need to update the filter, as the MAF < .001 filter is more restrictive than the smallest category filter. These files are listed as FALSE in the "tsv_requires_update" column. I would remove these lines from the list, along with any other files that you didn't download, then run the Rscript on your specified list.

I am currently running an update to the summary statistics, but wanted to get this out so users could update the summary statistics in their own downloaded files. Please let us know here if you run into any more issues!
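The list-trimming step described above could look roughly like this. It is a sketch under assumptions: the layout of the published list is not shown in this thread, so a tab-separated file with a `filename` column is assumed (only `tsv_requires_update` is named above), and the example rows are placeholders rather than real entries.

```python
import csv
import io

# Placeholder rows; the real list and its exact columns may differ.
LIST_TEXT = (
    "filename\ttsv_requires_update\n"
    "example_A.female.tsv.bgz\tTRUE\n"
    "example_B.male.tsv.bgz\tFALSE\n"
    "example_C.both_sexes.tsv.bgz\tTRUE\n"
)


def files_to_update(list_text, downloaded):
    """Keep files marked TRUE in tsv_requires_update that were downloaded."""
    reader = csv.DictReader(io.StringIO(list_text), delimiter="\t")
    return [
        row["filename"]
        for row in reader
        if row["tsv_requires_update"].strip().upper() == "TRUE"
        and row["filename"] in downloaded
    ]


# Hypothetical set of files present locally.
downloaded = {"example_A.female.tsv.bgz", "example_B.male.tsv.bgz"}
print(files_to_update(LIST_TEXT, downloaded))
```

The resulting names would then be what gets passed to the R script, following the instructions at the top of that script.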
Thanks a lot for all your help! I just wanted to clarify one thing. On the readme page of this repository you indicate that the files have been updated in the manifest - were they updated with the same name or some name with a new version # appended to the name?
They were updated with the same name and wget command. We will probably add another field to the Manifest indicating that the file was edited at a more recent date, but I would use the updated file list for now to find which GWAS are updated.
Hello,
I was just wondering if you know why all variants in the file located here: wget https://www.dropbox.com/s/qz4bu9lffse7q3l/137.gwas.imputed_v3.female.tsv.bgz?dl=0 -O 137.gwas.imputed_v3.female.tsv.bgz, are low confidence variants, while in the male and all_sex version there are millions of "low_confidence_variant==FALSE".
Thanks in advance,
Jenny