There are two versions of DBSLMM: the tuning version and the deterministic version. The tuning version examines three different heritability choices and requires a validation data to tune the heritability hyper-parameter. The deterministic version uses one heritability estimate and directly fit the model in the training data without a separate validation data. Both versions requires a reference data to compute the SNP correlation matrix.
For binary traits, especially for psychiatry diseases (i.e. major depression), DBSLMM usually outperforms some existing PGS construction methods.
Method Paper: Yang S#, Zhou X. Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets. Am J Hum Genet. 2020 May 7;106(5):679-693.
Bechmarking Paper: Yang S#*, Zhou X. PGS-server: accuracy, robustness and transferability of polygenic score methods for biobank scale studies. Brief Bioinform. 2022 Mar 10;23(2):bbac039.
Database paper: Cao C, Zhang S, Wang J, Tian M, Ji X, Huang D*, Yang S*, Gu N. PGS-Depot: a comprehensive resource for polygenic scores constructed by summary statistics based methods. Nucleic Acids Res. 2024 Jan 5;52(D1):D963-D971.
Webserver paper: Yang S#*, Ye X#, Ji X#, Li Z, Tian M, Huang P, Cao C. PGSFusion streamlines polygenic score construction and epidemiological applications in biobank-scale cohorts. bioRxiv.
Psycho-metabolic nexus: Guo X, Feng Y, Ji X, Jia N, Mainaiti A, Lai J, Wang Z*, Yang S*, Hu S*. Shared genetic architecture and bidirectional clincial risks with psycho-metabolic nexus. EBioMedicine. (in press)
PTRS for Stroke: Cai X, Li H, Cao X, Ma X, Zhu W, Xu L, Yang S, Yu R, Huang P. Integrating transcriptomic and polygenic risk scores to enhance predictive accuracy for ischemic stroke subtypes. Hum Genet. 2024 Nov 18. doi: 10.1007/s00439-024-02717-7.
We update the software/DBSLMM.R
and software/TUNE.R
.
- integrates bigsnpr to estiamte the heritability, select large effect SNPs by clumping, and calculate the
$\hat{y}$ for the validation set - provides 'map_hm3_plus.rds' and corresponding PLINK files for three ancestries from 1000 Genomes Project
- fits the whole genomoe for three DBSLMM models with one R script
- provides the online server PGSFsuion constructing PGS and performing epidemiological application in UKBB
- provides the online version https://www.pgs-server.com/.
- fixes the bug of model fiting
- fits DBSLMM in different heritabiliies with one clumping
- uses a script to make the usage of DBSLMM-tuning version more easier.
- validates the flods of heritability: 0.8, 1 and 1.2
- fits the LMM, when the chromosome without any large effect SNPs
- fits the external validation all by R code.
We use one R script to construct PGS
In this version, we use DBSLMM_script.sh
to make the usage of DBSLMM more easier. v0.3 needs valg
to be the whole genome plink file. By parameter -m
, we give two different model assumptions, including DBSLMM and LMM. Specially, LMM model assumption only have the default version, because it do not need to tune -h2f
.
PACK_DIR=/your/path/
SPLITCHR=${PACK_DIR}DBSLMM/software/SPLITCHR.R
PLINK=${PACK_DIR}plink_1.9/plink
DBSLMM=${PACK_DIR}DBSLMM/DBSLMM_script.sh
# Split into chromosome
DATADIR=/your/path/example_data/
summ=${DATADIR}all/summary
Rscript ${SPLITCHR} --summary ${summ}.assoc.txt
# Set parameters
BLOCK=${PACK_DIR}DBSLMM/LDblock/chr
herit=/your/path/herit.log
index=r2
thread=1
outpath=${DATADIR}output/
# DBSLMM tuning version (without covariates)
type=t
valg=${DATADIR}val/valid
valp=${DATADIR}val/valid_pheno.txt
model=DBSLMM
sh ${DBSLMM} -D ${PACK_DIR} -p ${PLINK} -B ${BLOCK} -s ${summ} -H ${herit} -m ${model} -G ${valg} -P ${valp}\
-l ${col} -T ${type} -i ${index} -t ${thread} -o ${outpath}
# DBSLMM automatic version (without covariates)
type=d
refp=${DATADIR}val/val
model=DBSLMM
sh ${DBSLMM} -D ${PACK_DIR} -p ${PLINK} -B ${BLOCK} -s ${summ} -H ${herit} -m ${model} -G ${refp}\
-T ${type} -i ${index} -t ${thread} -o ${outpath}
# LMM version
type=d
refp=${DATADIR}val/val
model=LMM
sh ${DBSLMM} -D ${PACK_DIR} -p ${PLINK} -B ${BLOCK} -s ${summ} -H ${herit} -m ${model} -G ${refp}\
-T ${type} -i ${index} -t ${thread} -o ${outpath}
If the user wants to change the fold of heritability, you can revise the setting in DBSLMM_script.sh
.
You should use the output file of ldsc
as -H
parameter of DBSLMM
.
The download link of dbslmm
is https://drive.google.com/file/d/1eAbEyhF8rO_faOFL3jqRo9LmfgJNRH6K/view?usp=sharing.
We should use reference panel to construct SNP correlation matrix. You can treat 1000 Genome Project as reference panel. We also need the block information. The code is MAKELD.R
.
mkld=/your/path/MAKELD.R
# construct SNP corrlation matrix for each chromosome
for chr in `seq 1 22`
do
bfile=/your/path/chr${chr}
blockfile=/your/path/chr${chr}.bed
plink=/your/path/plink-1.9
outpath=/your/path/outpath/
Rscript ${mkld} --bfile ${bfile} --blockfile ${blockfile} --plink ${plink} --outpath ${outpath} --chr ${chr}
done
The format of extsumm
is the same as that of LDSC
.
SNP N Z A1 A2
rs1983865 253135 3.79310344827586 T C
rs1983864 251364 -4.51612903225806 T G
rs12411954 253213 0.413793103448276 T C
rs7077266 250092 -0.92 T G
We should use the standardized SNP effect size. You have two options. First, the variant ID is read from column 1, an allele code is read from the following column, the standardized effect size associated with the named allele is read from the column after the allele column. E.g.
esteff=/your/path/esteff.txt,1,2,3
Second, the variant ID is read from column 1, an allele code is read from the following column, the standardized effect size associated with the named allele is read from the column after the allele column, the MAF with the named allele is read from the last column. E.g.
esteff=/your/path/esteff.txt,1,2,3,4
The example of EXTERNAL.R
is as following:
external=/your/path/EXTERNAL.R
extsumm=/your/path/extsumm.ldsc.gz
esteff=/your/path/esteff.txt,1,2,3
LDpath=/your/path/LDpath/
outpath=/your/path/outpath/
Rscript ${external} --extsumm ${extsumm} --esteff ${esteff} --LDpath ${LDpath} --outpath ${outpath}