Skip to content

scikit-learn wrapper for generalized linear mixed model methods in R

License

Notifications You must be signed in to change notification settings

stanbiryukov/sklearn-GLMM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

sklearn-GLMM

scikit-learn wrapper for generalized linear mixed model methods in R

This is a lightweight wrapper that enables fitting generalized linear multivariate multilevel models from python via R. It easily enables fitting bayesian models via brms calls: https://github.com/paul-buerkner/brms

It is flexible enough to extend to other R based models and is designed to be as compatible with scikit-learn syntax as possible.

It's specifically built against rpy2==2.8.6 to enable both python2 and 3 support.

Installation:

pip install git+https://github.com/stanbiryukov/sklearn-GLMM

Under the hood, the class relies on the R libraries listed below, so for it to work properly they must be installed in the R environment linked to rpy2. R's pacman will attempt to install any missing libraries but rstan usually requires system specific configuration.

library(pacman)
library(rstan)
library(parallel)
library(brms)
library(lme4)
library(data.table)
library(dplyr)
library(merTools)
library(pbmcapply)

Demonstrate by example:

# df is your pandas dataframe with predictors and target column
import numpy as np
import pandas as pd

df = pd.read_csv('https://stats.idre.ucla.edu/stat/data/hdp.csv')
df = df.apply(lambda x: pd.factorize(x)[0] if np.issubdtype(x.dtype, np.number) is False else x) # factorize some columns

from pyGLMM import skGLMM
from sklearn.preprocessing import FunctionTransformer, StandardScaler
ml = skGLMM(x_scaler = StandardScaler(), y_scaler = FunctionTransformer(validate=True), 
            r_call = "brm(remission ~ IL6 + CRP + CancerStage + LengthofStay + Experience + (1 | DID), data = df, family = bernoulli(), algorithm='sampling', iter = 1000, chains = 4, cores = 4)")

ml.fit(df[['IL6', 'CRP', 'CancerStage', 'LengthofStay', 'Experience', 'DID']], df['remission'])
phat = ml.predict(df[['IL6', 'CRP', 'CancerStage', 'LengthofStay', 'Experience', 'DID']])
# Support for parallel prediction and sampling from posterior for BRMS and LME4 models
phatdraws = ml.predict(df[['IL6', 'CRP', 'CancerStage', 'LengthofStay', 'Experience', 'DID']], n_draws=1000, parallel=True)

# Also returns R style summary
ml.summary()
 Family: bernoulli 
  Links: mu = logit 
Formula: remission ~ IL6 + CRP + CancerStage + LengthofStay + Experience + (1 | DID) 
   Data: df (Number of observations: 8525) 
Samples: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup samples = 2000

Group-Level Effects: 
~DID (Number of levels: 407) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.99      0.10     1.80     2.20 1.00      401      746

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept       -1.48      0.10    -1.68    -1.28 1.00      259      615
IL6             -0.15      0.03    -0.21    -0.09 1.00     2717     1470
CRP             -0.06      0.03    -0.12    -0.00 1.00     1867     1228
CancerStage     -0.29      0.03    -0.35    -0.22 1.00     1935     1258
LengthofStay    -0.31      0.03    -0.38    -0.25 1.00     1905     1377
Experience       0.49      0.12     0.25     0.73 1.02      130      136

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).