George Kanoungi, Michael Nothnagel, Tim Becker, Dmitriy Drichel. "The exhaustive genomic scan approach, with an application to rare-variant association analysis" https://www.biorxiv.org/content/10.1101/571752v1
git clone https://github.com/ddrichel/GECS.git
cd GECS
make
GECS implements our novel approach to exhaustive region-based association analysis of rare variants in genomic case-control studies. The main idea of the exhaustive scan is to compute test statistics for all contiguous subsequences of larger sequences, such as human chromosomes. The exhaustive one-dimensional scan can be regarded as analogous to Kulldorff's two-dimensional spatial scan (Kulldorff, 1996).
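
To make the notion of "all contiguous subsequences" concrete, the following minimal Python sketch enumerates every contiguous bin over a sequence of variants. It is a brute-force illustration only, not the GECS algorithm; the function name `contiguous_bins` is hypothetical and introduced here for illustration.

```python
from typing import Iterator, List, Tuple


def contiguous_bins(n_variants: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) index pairs for every contiguous bin."""
    for start in range(n_variants):
        for end in range(start, n_variants):
            yield start, end


# A toy sequence of 4 rare variants has 4 * 5 / 2 = 10 contiguous bins.
bins: List[Tuple[int, int]] = list(contiguous_bins(4))
print(len(bins))
print(bins[:4])
```

The number of bins grows quadratically, n(n+1)/2 for n variants, which is why naive enumeration does not scale to whole chromosomes.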
GECS is written in C++ and implements an ultrafast algorithm that enables testing all genomic subsequences for association using the collapsing test of Li and Leal (2008). The application is scalable and suitable for analyses of large whole-exome and whole-genome studies.
GECS is distributed under the GPL3 license. It uses the ALGLIB C++ library (http://www.alglib.net/) to compute the p-values for Pearson's chi-squared test.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Execute some examples of gecs included in the bash file, applied to an example data set. In the examples, you will try different combinations of parameter specifications, each aimed at a specific analysis goal.
- Example 1: You will perform a single-marker analysis and use 999 permutations to correct for multiple testing. Note that there is no need to specify the minor allele frequency threshold (MAFT) or the threshold on the number of carriers (NCT).
- Example 2: You will perform the variable binning approach instead of the single-marker analysis. Here you must specify the parameter MAFT or the equivalent parameter NCT, as in Example 3.
- Example 4: In this example, a conflict between the values of NCT and MAFT is resolved by always using NCT when it is specified.
- Example 5: The SMA approach is applied to chromosome 22 of a simulated data set with a sample size of 1000.
The usage of gecs is:
gecs <~/path/to/file.param>
gecs --help will present a brief description of the keywords in the parameter file.
Keywords in the parameter file *.param :
KEY | <Input> | Default | Description | Notes |
---|---|---|---|---|
BFILE | <string> | none | Prefix of the plink binary file | Specifies the common prefix of the binary plink files (*.bed, *.bim, and *.fam) |
SINGLEMARKER | <bool> | 0 | Whether the single-marker analysis (SMA) should be performed instead of the region-based exhaustive scan | If SINGLEMARKER is set to 1, the standard single-marker analysis (SMA) will be performed and all variants will be included in the analysis, regardless of their minor allele frequencies. If SINGLEMARKER is set to 0, the exhaustive scan approach will be performed, and the threshold on the minor allele frequency (or on the number of carriers per variant) needs to be specified. |
NCT | <int> | none | "Rareness" threshold: max. number of carriers per variant | Either NCT or MAFT has to be specified for the exhaustive scan approach |
MAFT | <double> | none | Minor allele frequency threshold for rare variants | Ignored if SINGLEMARKER is set to 1, otherwise either NCT or MAFT has to be specified |
PERMUTATIONS | <int> | 999 | Number of permutations for correction of multiple testing | In order to control the family-wise error rate (FWER) at the 5% level, gecs performs a full analysis of each permutation replicate (random reassignment of case-control labels). The smallest p-value of each replicate is written to an output file. The full list of smallest p-values is used to correct for multiple testing |
CORRECTED_P | <bool> | 0 | Whether the permutation-adjusted p-values will be written to output, together with the 95% CI (Wilson score interval) | Useful if the data set can be analyzed in a single run. Adjustment is performed for a single "rareness" threshold |
PTHRESHOLD | <double> | 1 | Max. nominal p-value for bins to be written to output | With this parameter you can restrict the output to bins with p-values below a specific threshold. This feature is useful for controlling the size of output files |
ALLBINS | <bool> | 0 | Whether locally not-distinct bins should be written to output (useful for plotting and testing) | Use with caution and only with small input files, otherwise the output can become extremely large. No correction for multiple testing is possible in this mode (PERMUTATIONS will be set to 0) |
OR | <bool> | 0 | Whether the odds ratios should be calculated | |
ORCONFIDENCE | <bool> | 0 | Whether the 95% confidence interval should be calculated for odds ratios | |
OUTPUT | <string> | Prefix of the input file | Prefix of output files | |
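
The interactions between keywords described in the table can be summarized in a short Python sketch. This is an illustration of the documented rules only, not code from GECS: the function `resolve_parameters`, the dictionary representation, and the assumption that BFILE is mandatory are all hypothetical.

```python
from typing import Any, Dict

# Defaults as listed in the keyword table (None stands for "none").
DEFAULTS: Dict[str, Any] = {
    "SINGLEMARKER": 0, "NCT": None, "MAFT": None, "PERMUTATIONS": 999,
    "CORRECTED_P": 0, "PTHRESHOLD": 1.0, "ALLBINS": 0, "OR": 0,
    "ORCONFIDENCE": 0,
}


def resolve_parameters(user: Dict[str, Any]) -> Dict[str, Any]:
    """Apply defaults and the documented precedence rules (illustrative)."""
    if "BFILE" not in user:
        # Assumption: BFILE has no default, so we treat it as required.
        raise ValueError("BFILE must be specified")
    params = {**DEFAULTS, **user}
    # OUTPUT defaults to the prefix of the input file.
    params.setdefault("OUTPUT", params["BFILE"])
    if params["SINGLEMARKER"]:
        # SMA: MAFT is effectively 0.5; NCT and ALLBINS are ignored.
        params["MAFT"], params["NCT"], params["ALLBINS"] = 0.5, None, 0
        return params
    if params["NCT"] is None and params["MAFT"] is None:
        raise ValueError("either NCT or MAFT is required for the scan")
    if params["NCT"] is not None:
        params["MAFT"] = None  # NCT takes precedence over MAFT
    if params["ALLBINS"]:
        params["PERMUTATIONS"] = 0  # no multiple-testing correction
    return params


resolved = resolve_parameters({"BFILE": "data", "NCT": 5, "MAFT": 0.01})
print(resolved["NCT"], resolved["MAFT"], resolved["OUTPUT"])
```
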
GECS provides two major modes for conducting association analysis: for single markers (SMA) and for all possible bins (subsequences of contiguous markers below a specified "rareness" threshold) in the genetic data set. In the single-marker analysis, all variants are considered, regardless of their frequencies. Permutation of the case-control labels is used to correct for multiple testing. If PERMUTATIONS is set to 0, no correction for multiple testing will be performed and only nominal p-values will be computed.
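
The permutation procedure can be sketched as follows: shuffle the case-control labels, rerun the full analysis, and keep the smallest p-value of each replicate. This is a minimal Python illustration under stated assumptions, not the GECS implementation; `min_p_null_distribution` and `toy_scan` are hypothetical names, and `toy_scan` is a stand-in for a full analysis on two made-up markers.

```python
import math
import random
from typing import Callable, List, Sequence


def min_p_null_distribution(
    is_case: Sequence[int],
    scan: Callable[[Sequence[int]], List[float]],
    n_permutations: int = 999,
    seed: int = 0,
) -> List[float]:
    """Return the sorted smallest p-values, one per permutation replicate."""
    rng = random.Random(seed)
    shuffled = list(is_case)  # work on a copy; the input stays untouched
    min_pvals = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # random reassignment of case-control labels
        min_pvals.append(min(scan(shuffled)))
    return sorted(min_pvals)


def toy_scan(is_case: Sequence[int]) -> List[float]:
    """Hypothetical stand-in: one chi-squared p-value (1 df) per marker."""
    carriers_per_marker = [[1, 1, 1, 0, 0, 0, 0, 1], [0, 1, 0, 1, 0, 1, 0, 0]]
    pvals = []
    for carrier in carriers_per_marker:
        a = sum(1 for c, y in zip(carrier, is_case) if c and y)
        b = sum(1 for c, y in zip(carrier, is_case) if not c and y)
        c0 = sum(1 for c, y in zip(carrier, is_case) if c and not y)
        d = sum(1 for c, y in zip(carrier, is_case) if not c and not y)
        denom = (a + b) * (c0 + d) * (a + c0) * (b + d)
        chi2 = (a + b + c0 + d) * (a * d - b * c0) ** 2 / denom if denom else 0.0
        pvals.append(math.erfc(math.sqrt(chi2 / 2.0)))
    return pvals


labels = [1, 1, 1, 1, 0, 0, 0, 0]
null_min_p = min_p_null_distribution(labels, toy_scan, n_permutations=99, seed=1)
print(len(null_min_p), null_min_p[0])
```

Keeping only the minimum per replicate is what makes the resulting list usable for family-wise error control: it approximates the null distribution of the best result in a full analysis.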
If SINGLEMARKER is set to 1, only the single-marker analysis (SMA) will be performed. The SMA can be regarded as a special case of the collapsing test COLL (Li and Leal, 2008) without a maximum "rareness" threshold and with bin size of 1. If SINGLEMARKER is set to 1, MAFT will be effectively set to 0.5, and the NCT and ALLBINS parameters will be ignored. For SMA, the COLL test is identical to Pearson's chi-squared test under the dominant model.
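
As an illustration of the relationship between COLL and the single-marker test, the sketch below collapses a bin into a carrier indicator and applies Pearson's chi-squared test to the resulting 2x2 table. It is a plain Python sketch, not GECS code; the function name `collapsing_test`, the toy genotypes, and the absence of a continuity correction are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def collapsing_test(
    genotypes: Sequence[Sequence[int]],  # one row per sample, minor-allele counts
    is_case: Sequence[int],              # 1 = case, 0 = control
    start: int,
    end: int,                            # inclusive bin boundaries
) -> Tuple[float, float]:
    """Return (chi-squared statistic, p-value) for the bin [start, end]."""
    # A sample is a carrier if it has at least one minor allele in the bin.
    carrier = [any(g > 0 for g in row[start:end + 1]) for row in genotypes]
    a = sum(1 for c, y in zip(carrier, is_case) if c and y)          # case carriers
    b = sum(1 for c, y in zip(carrier, is_case) if not c and y)      # case non-carriers
    c0 = sum(1 for c, y in zip(carrier, is_case) if c and not y)     # control carriers
    d = sum(1 for c, y in zip(carrier, is_case) if not c and not y)  # control non-carriers
    denom = (a + b) * (c0 + d) * (a + c0) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no evidence either way
    chi2 = (a + b + c0 + d) * (a * d - b * c0) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


genotypes = [[1, 0], [0, 1], [1, 0], [0, 0],   # four cases
             [0, 0], [0, 0], [0, 0], [0, 1]]   # four controls
is_case = [1, 1, 1, 1, 0, 0, 0, 0]
chi2_bin, p_bin = collapsing_test(genotypes, is_case, 0, 1)  # bin of two markers
chi2_sma, p_sma = collapsing_test(genotypes, is_case, 0, 0)  # bin size 1 (SMA)
print(chi2_bin, p_bin)
print(chi2_sma, p_sma)
```

With a bin of size 1, the carrier indicator is exactly the dominant-model coding of the genotype, which is why the two tests coincide.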
- A text file [*_singlemarker.txt], which includes the results of association tests for all included markers.
- Two files [*_singlemarker.pvals] and [*_singlemarker.srt.pvals] for the correction of multiple testing based on the null model.
The final corrected alpha will be reported with other information about the analysis in the [*gecs_nct<nct>.log] file.
In the default case (SINGLEMARKER=0), the parameters NCT and MAFT serve the same purpose: determining the rareness threshold for the analysis.
- If both NCT and MAFT are specified, NCT overrides MAFT; otherwise, specifying only one of the two parameters is sufficient.
- The ALLBINS parameter is set to 0 by default, in which case only distinct bins are considered. Setting ALLBINS to 1 is useful only when scanning small regions for plotting purposes.
- A text file [*gecs<nct>.txt], which includes the results of association tests for all genomic subsequences.
- Two files [*gecs<nct>.pvals] and [*gecs<nct>.srt.pvals] for the correction of multiple testing based on the null model.
The correction for multiple testing is achieved by permutation. The final corrected alpha will be reported in the [*gecs_nct<nct>.log] file, along with additional information.
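
One common way to turn a sorted list of per-permutation minimum p-values into a corrected alpha and a permutation-adjusted p-value with a Wilson score interval is sketched below in Python. The exact estimators and quantile convention used by GECS are not stated here, so the choices in this sketch (the k-th smallest value with k = floor(alpha * (n + 1)), and the proportion k/n as the adjusted p-value) are assumptions, and all function names are hypothetical.

```python
import bisect
import math
from typing import Sequence, Tuple


def corrected_alpha(sorted_min_p: Sequence[float], alpha: float = 0.05) -> float:
    """Nominal threshold that keeps the FWER at roughly alpha (assumed convention)."""
    k = math.floor(alpha * (len(sorted_min_p) + 1) + 1e-9)
    if k < 1:
        raise ValueError("too few permutations for this alpha")
    return sorted_min_p[k - 1]  # k-th smallest minimum p-value


def wilson_interval(k: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """95% Wilson score interval for the proportion k / n."""
    p_hat = k / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def adjusted_p(sorted_min_p: Sequence[float], nominal_p: float):
    """Share of replicates whose minimum p-value is <= the nominal p-value."""
    n = len(sorted_min_p)
    k = bisect.bisect_right(sorted_min_p, nominal_p)
    return k / n, wilson_interval(k, n)


# Made-up null distribution: 999 evenly spaced minimum p-values.
null_min_p = [i / 1000 for i in range(1, 1000)]
alpha_corr = corrected_alpha(null_min_p)
p_adj, (ci_low, ci_high) = adjusted_p(null_min_p, 0.0105)
print(alpha_corr, p_adj, ci_low, ci_high)
```

With 999 permutations, alpha * (n + 1) is exactly 50 at the 5% level, which is one reason permutation counts of the form 10^k - 1 are convenient.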
The keyword "VT" in a parameter file will activate the variable-threshold analysis if set to 1. The algorithm is described in the supplemental information of the manuscript, section d). The VT method iterates through the different NCT thresholds in addition to the genomic positions. The implementation should be considered experimental and used with caution, as the computational requirements can be very large and we conducted only a limited amount of testing. To reproduce our testing, recompile the application with DEBUG=1 set in the scan.cpp file and run the script DATA/VB_test.sh from the GECS parent directory.
George Kanoungi, Dmitriy Drichel
This project is licensed under the GNU General Public License v3.0 - see the LICENSE.md file for details.
This project was supported by the German Research Foundation grant BE 38/28/9-1. The funding organization did not have any influence on the design, conduct or conclusions of the study.
Please cite: George Kanoungi, Michael Nothnagel, Tim Becker, Dmitriy Drichel. "The exhaustive genomic scan approach, with an application to rare-variant association analysis" (2019) https://www.biorxiv.org/content/10.1101/571752v1