
How to handle high-dimensional covariates? (or just fitting a propensity model) #78

Open
adeldaoud opened this issue Oct 18, 2021 · 3 comments


@adeldaoud

I am unable to locate documentation on whether tmle3 can handle user-specified propensity scores, or on how to fit a propensity score model within tmle3 prior to fitting the outcome model. From what I can gather from the Handbook (https://tlverse.org/tlverse-handbook/tmle3.html#tmle), tmle3 fits a propensity score model implicitly, based on the user-defined structural causal model.

If it does fit a propensity score implicitly, it seems that tmle3 struggles with high-dimensional data. We are in a setting where we use text for causal inference, and thus we are working with a high-dimensional sparse document-term matrix (10,000 cases and 130,000 covariates). When I fit this model (it takes about 6 h), along the lines of the code below (I am happy to provide the simulated data if you wish to experiment), the average treatment effect is way off: the simulated ATE is 5.5, but tmle3 produces an ATE of -128.

Any ideas of how to proceed?

```r
# 4. "Moderate cleaning", with text controls ----

tdm <- imf.tdm.moderate.cleaning.df

# gather tabular data
imf.meta1 <- imf.meta
imf.meta2 <- cbind(imf.meta1, tdm)

# imf.meta contains the tabular covariates "C1", "C2", "C3";
# imf.meta2 additionally contains the TDM
nodes <- list(
  W = c("C1", "C2", "C3", names(tdm)),
  A = "T",
  Y = "Y"
)

lrnr_glm_fast <- make_learner(Lrnr_glm_fast)
lrnr_mean <- make_learner(Lrnr_mean)
learner_list <- list(Y = lrnr_mean, A = lrnr_glm_fast)

# make a new copy to deal with data.table weirdness
imf.meta3 <- data.table::copy(imf.meta2)

# estimate
start_time <- Sys.time()
tmle_fit_from_spec <- tmle3(tmle_TSM_all(), imf.meta3, nodes, learner_list)
end_time <- Sys.time()
end_time - start_time
```

@jeremyrcoyle
Collaborator

tmle3 trains the sl3 learners specified in the learner_list argument on the data in the tmle_task. You can specify any sl3 learners you like; if they are pretrained, tmle3 will not train them again. See the examples here for how to specify such learners: http://tlverse.org/tlverse-handbook/sl3.html. We don't have a way of automatically determining appropriate learners for you.
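For example, a minimal sketch of specifying a learner per node (the specific learners here are placeholders mirroring your snippet, not a recommendation; Lrnr_sl and Lrnr_nnls are standard sl3 constructors, but check argument names against your installed version):

```r
library(sl3)

# Any sl3 learner (or super learner) can be supplied per node; tmle3 will
# train whatever it is handed, unless the learner is already trained.
sl_demo <- Lrnr_sl$new(
  learners = list(Lrnr_mean$new(), Lrnr_glm_fast$new()),
  metalearner = Lrnr_nnls$new()
)

# Outcome regression learner under "Y", propensity score learner under "A"
learner_list <- list(Y = sl_demo, A = sl_demo)
```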

You've specified a mean learner for the outcome model and a GLM learner for the propensity score model. Neither is appropriate for your use case. Because you're not attempting to learn the outcome model (mean model), you're in effect generating a type of IPTW estimate (i.e., no double robustness from TMLE). Because you're attempting to learn the propensity score with a GLM, you would expect to get a dense model for the propensity score, which seems unlikely to be effective in a high-dimensional setting. What's more, a GLM isn't really defined on problems with more covariates than observations, so it's unsurprising that it behaves poorly.

Basically, whatever kinds of models you would generally use for this type of data would be appropriate candidates for the outcome and propensity score ensembles. In a high-dimensional use case, you would want learners that are able to find and return a sparse model; things like Lrnr_glmnet, Lrnr_xgboost, or Lrnr_ranger would probably be effective. Adding a pre-screening or preprocessing step to reduce dimensionality (I haven't done NLP in a very long time, but people used to do things like SVD or ICA) would also be appropriate and should dramatically reduce computation time.
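For instance, a rough sketch of what such a library could look like in sl3 (Lrnr_glmnet, Lrnr_xgboost, Lrnr_ranger, Lrnr_pca, and Pipeline are all sl3 components, but the tuning values below are arbitrary placeholders; verify argument names against your sl3 version):

```r
library(sl3)

# Sparse-model candidates suited to p >> n
lrnr_lasso  <- Lrnr_glmnet$new(alpha = 1)        # L1-penalized regression
lrnr_xgb    <- Lrnr_xgboost$new(nrounds = 200)   # gradient-boosted trees
lrnr_ranger <- Lrnr_ranger$new()                 # random forest

# Dimension reduction as a preprocessing step, chained via a pipeline:
# project the document-term matrix onto leading principal components,
# then fit a fast GLM on those components.
lrnr_pca_glm <- Pipeline$new(Lrnr_pca$new(n_comp = 50), Lrnr_glm_fast$new())

# Combine the candidates into one super learner per nuisance parameter
sl_highdim <- Lrnr_sl$new(
  learners = list(lrnr_lasso, lrnr_xgb, lrnr_ranger, lrnr_pca_glm),
  metalearner = Lrnr_nnls$new()
)
learner_list <- list(Y = sl_highdim, A = sl_highdim)
```

The PCA pipeline here is just one way to encode the SVD-style preprocessing mentioned above; a screener or a task-specific dimension reduction would serve the same purpose.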

@adeldaoud
Author

@jeremyrcoyle many thanks for your input! I was following the tutorial in the Handbook (https://tlverse.org/tlverse-handbook/tmle3.html) and the tmle3 framework tutorial (https://tlverse.org/tmle3/articles/framework.html). Good point about using Lrnr_glmnet for the propensity model.

Both tutorials use Lrnr_mean, and I therefore assumed it is the optimal choice for the outcome model. So I guess you are saying that if I used, e.g., Lrnr_glm_fast as the outcome model, I would in fact be getting the double robustness from TMLE. Or to put it another way, what outcome model would you use to obtain double robustness?

@rachaelvp
Member

rachaelvp commented Oct 27, 2021

Hi @adeldaoud, I commented on your latest response below. What you wrote is in bold, and my responses are the text that is not bolded.

**Both tutorials use Lrnr_mean, and I therefore assumed it is the optimal choice for the outcome model.** The tutorials use Lrnr_mean only because it's fast. By no means is it assumed to be the optimal choice for the outcome model! The SL learner libraries in the tutorials are for demonstrative purposes and should not be considered an example of a good library in practice. Sorry you were confused by this.

**So I guess you are saying that if I used, e.g., Lrnr_glm_fast as the outcome model, I would in fact be getting the double robustness from TMLE.** No.

**Or to put it another way, what outcome model would you use to obtain double robustness?** Double robustness of the TMLE for the ATE requires consistent estimation of either the propensity score (PS) or the outcome regression. Whether or not that is achieved depends hugely on your dataset and on what is known about the experiment that generated it. It also depends on how you estimate the PS and the outcome regression. In applied settings (i.e., non-simulation settings where the underlying truth is unknown), it's impossible to know whether the estimated PS or the estimated outcome regression achieved consistency, so it's impossible to know whether double robustness is satisfied in applied settings like yours. Fortunately, the analyst can be conscientious about how they estimate the PS and the outcome regression in order to do as good a job as possible for their data, and presumably the better they do, the more the TMLE's double robustness will kick in. Using the super learner (SL) to estimate the PS and the outcome regression in the TMLE is an excellent first step, as the SL allows the user to consider many diverse strategies for learning from the data. It's up to the user to specify a diverse library for their dataset; the better the library, the better the resulting SL will be. The tutorial examples you're referring to use poor SL libraries that are not nearly diverse enough and would never be used in practice. In practice (i.e., for a real analysis), the user needs to specify a diverse SL library that's appropriate for their dataset and analysis. @jeremyrcoyle provided some insight on the kinds of learners you might consider for your library:

> Basically, whatever kinds of models you would generally use for this type of data would be appropriate candidates for the outcome and propensity score ensembles. In a high-dimensional use case, you would want learners that are able to find and return a sparse model; things like Lrnr_glmnet, Lrnr_xgboost, or Lrnr_ranger would probably be effective. Adding a pre-screening or preprocessing step to reduce dimensionality (I haven't done NLP in a very long time, but people used to do things like SVD or ICA) would also be appropriate and should dramatically reduce computation time.
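To make the double robustness point concrete, here is a rough sketch of a learner_list that attempts to learn both nuisance parameters, rather than fixing the outcome model at the sample mean as in your snippet. The library shown is illustrative only, not a recommendation for your data; it assumes the standard sl3 learners named below:

```r
library(sl3)

# Illustrative super learner over sparse-model candidates, used for BOTH
# the outcome regression (Y) and the propensity score (A), so the TMLE's
# double robustness can actually come into play.
sl_Y <- Lrnr_sl$new(
  learners = list(Lrnr_glmnet$new(), Lrnr_xgboost$new(), Lrnr_ranger$new()),
  metalearner = Lrnr_nnls$new()
)
sl_A <- sl_Y$clone()

learner_list <- list(Y = sl_Y, A = sl_A)

# Same tmle3 call as in your snippet, now learning both nuisance parameters
tmle_fit <- tmle3(tmle_TSM_all(), imf.meta3, nodes, learner_list)
```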

I hope this helps. Please feel free to continue this thread with more Qs / analysis planning. We can help you build out your SL library and show you how to construct a well-specified TMLE. You seem to have an interesting and complex problem. I have a few Q's about your problem, and I included them below.

  • You mentioned you have 10,000 cases and 130,000 covariates. What is your sample size (i.e., how many independent and identically distributed observations are in your data)? Is it 10,000, or are the observations clustered/dependent?
  • What is your question of interest? For instance, you could fill in the blanks to define your question with respect to the ATE: What is the average effect of __________ on _________ when adjusting for possible confounders like ________ in the target population defined by ________________?
