[enhancement] how to obtain the number of cases in the summary statistics output for binary phenotypes? #348

Closed
freeseek opened this issue Oct 15, 2022 · 5 comments

@freeseek

When I run regenie step 2 with option --bt for binary phenotypes, the output contains the following columns:

#CHROM
GENPOS
ID
ALLELE0
ALLELE1
A1FREQ
INFO
N
TEST
BETA
SE
CHISQ
LOG10P
EXTRA

The N column contains the total number of samples in the association test. However, it would be useful to also have the number of cases in the output table: from the total number of samples and the number of cases you can compute the effective sample size, which is useful when running a fixed-effect meta-analysis weighted by sample size.
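For readers who want the arithmetic spelled out, here is a minimal sketch of one common definition of effective sample size for a case/control study, 4 / (1/cases + 1/controls). The thread does not say which definition is intended, so the formula and the function name `effective_sample_size` are assumptions made for illustration.

```python
def effective_sample_size(n_total: int, n_cases: int) -> float:
    """Effective sample size under the assumed convention 4 / (1/cases + 1/controls)."""
    n_controls = n_total - n_cases
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("need at least one case and one control")
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


# Made-up counts: same total N, balanced versus unbalanced case/control split.
balanced = effective_sample_size(10_000, 5_000)
unbalanced = effective_sample_size(10_000, 500)
print(balanced, unbalanced)
```

Under this convention a balanced study keeps its full size, while an unbalanced one contributes less weight, which is why N alone is not enough.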

I don't necessarily expect this to be a mandatory field in the output, but at least it would be nice to have an option to include this information. I often convert the output of regenie into GWAS-VCF using the following map:

#CHROM -> #CHROM
GENPOS -> POS
ID -> ID
ALLELE0 -> REF
ALLELE1 -> ALT
N -> NS
INFO -> SI
BETA -> ES
SE -> SE
LOG10P -> LP
A1FREQ -> AF

But the GWAS-VCF format also has an NC field that could host the number of cases.
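The map above can be written down as a lookup table. This is only a sketch of the renaming step, not a full GWAS-VCF writer: the helper name `rename_row` and the example values are made up, and the input is assumed to be whitespace-delimited with the header shown earlier.

```python
# regenie column -> GWAS-VCF field, copied from the map above.
REGENIE_TO_GWAS_VCF = {
    "#CHROM": "#CHROM", "GENPOS": "POS", "ID": "ID",
    "ALLELE0": "REF", "ALLELE1": "ALT", "N": "NS", "INFO": "SI",
    "BETA": "ES", "SE": "SE", "LOG10P": "LP", "A1FREQ": "AF",
}


def rename_row(header, values):
    """Return {gwas_vcf_field: value}; columns without a mapping are dropped."""
    return {
        REGENIE_TO_GWAS_VCF[name]: value
        for name, value in zip(header, values)
        if name in REGENIE_TO_GWAS_VCF
    }


header = ("#CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ INFO N "
          "TEST BETA SE CHISQ LOG10P EXTRA").split()
# Illustrative values only, not real regenie output.
values = "1 12345 rs1 A G 0.25 0.98 1000 ADD 0.1 0.05 4.0 1.3 NA".split()
row = rename_row(header, values)
print(row)
```

Nothing in the default columns can populate NC, which is the gap this request is about.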

Separately from the main request, would there be interest in allowing regenie to generate summary statistics outputs as GWAS-VCF files? If regenie had the option to output summary statistics as binary GWAS-VCF files, this would also allow keeping the statistics as 32-bit floats without having to round and lose the least significant digits.

@joellembatchou
Collaborator

joellembatchou commented Oct 18, 2022

Hi,

I have made a note to output the number of cases/controls for BTs when using --af-cc in future releases. In the meantime, can you try the more detailed summary statistics output with --htp as in #227?

Cheers,
Joelle

@freeseek
Author

I see, I did not know about option --htp which, as explained in:

Will generate the following columns:

Name -> ID
Chr -> #CHROM
Pos -> POS
Ref -> REF
Alt -> ALT
Trait
Cohort
Model
Effect
LCI_Effect
UCI_Effect
Pval -> 10^(-LP)
AAF -> AF
Num_Cases -> NC
Cases_Ref
Cases_Het
Cases_Alt
Num_Controls -> NS-NC
Controls_Ref
Controls_Het
Controls_Alt
Info:REGENIE_BETA -> ES
Info:REGENIE_SE -> SE
Info:INFO -> SI
Info:MAC -> AC (= Cases_Het + 2 * Cases_Alt + Controls_Het + 2 * Controls_Alt)
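As a sketch of how the count fields in this list combine: each heterozygote contributes one alternate allele and each alternate homozygote contributes two. The function name `gwas_vcf_counts` and the numbers are hypothetical, and the --htp count columns are assumed to have been parsed into numbers already.

```python
def gwas_vcf_counts(htp):
    """Derive NS, NC and the alternate allele count AC from --htp count columns."""
    nc = htp["Num_Cases"]
    ns = nc + htp["Num_Controls"]
    ac = (htp["Cases_Het"] + 2 * htp["Cases_Alt"]
          + htp["Controls_Het"] + 2 * htp["Controls_Alt"])
    return {"NS": ns, "NC": nc, "AC": ac}


# Made-up genotype counts for a single variant.
example = {
    "Num_Cases": 100, "Cases_Ref": 60, "Cases_Het": 30, "Cases_Alt": 10,
    "Num_Controls": 900, "Controls_Ref": 700, "Controls_Het": 180, "Controls_Alt": 20,
}
counts = gwas_vcf_counts(example)
print(counts)
```

Computed this way, AC is the alternate allele count, which matches MAC only when the alternate allele is the minor one.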

One drawback is that the p-value is no longer expressed as -log10(p), which makes retrieving LP impossible for very low p-values. It is also a bit odd that the last four key statistics are all packed into a single Info column. If Num_Cases -> NC could be retrieved from the default output without the option --htp, then it would also be nice if something like Info:MAC -> AC could be output (ideally the alternate allele count rather than the minor allele count), possibly when something like the --af-cc option is used, as Weinberg suggested in #227. AC is also a field defined in the GWAS-VCF specification.
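A small sketch of why LP cannot always be recovered from Pval: once a p-value is held as a double-precision float, anything below roughly 1e-308 collapses to zero and its logarithm is undefined. How regenie itself prints such values under --htp is not stated in this thread, so the input strings below are assumptions, and `lp_from_pval` is a hypothetical helper.

```python
import math


def lp_from_pval(pval_text):
    """Return -log10(p), or None when p is missing, invalid, or underflowed to zero."""
    try:
        p = float(pval_text)
    except ValueError:
        return None  # e.g. "NA"
    if not (0.0 < p <= 1.0):
        return None  # zero after underflow, NaN, or out of range
    return -math.log10(p)


ordinary = lp_from_pval("1e-8")     # representable, so LP is recoverable
underflow = lp_from_pval("1e-400")  # parses to 0.0, so LP is lost
missing = lp_from_pval("NA")
print(ordinary, underflow, missing)
```

A LOG10P column avoids the problem because the logarithm itself is an ordinary-sized number.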

@bgorm

bgorm commented Oct 26, 2022

I wanted to second @freeseek's request -- for directly typed genotypes (WGS, array) where there is per-SNP missingness, it is non-trivial to get the case and control numbers intersecting with the called genotypes post-hoc, so this would add value.

@joellembatchou
Collaborator

Hi,

Thank you both for your valuable feedback. The columns N_CASES and N_CONTROLS will be added when using --af-cc in the next REGENIE release (should be within a couple of weeks).

Cheers,
Joelle

@joellembatchou
Collaborator

This is now available in v3.2.2 (released here).

Cheers,
Joelle
