
How to handle high-dimensional covariates? (or just fitting a propensity model) #78

Open
adeldaoud opened this issue Oct 18, 2021 · 3 comments


@adeldaoud

I am unable to locate documentation on whether tmle3 can handle user-specified propensity scores, or on how to fit a propensity score model within tmle3 prior to fitting the outcome model. From what I can gather from the Handbook (https://tlverse.org/tlverse-handbook/tmle3.html#tmle), tmle3 fits a propensity score model implicitly, based on the user-defined structural causal model.

If it does fit a propensity score implicitly, it seems that tmle3 struggles with high-dimensional data. We are in a setting where we use text for causal inference, and thus we are working with a high-dimensional sparse document-term matrix (10,000 cases and 130,000 covariates). When I fit this model (it takes about 6 h), along the lines of the code below (I am happy to provide the simulated data if you wish to experiment), the average treatment effect is way off: the simulated ATE is 5.5, but tmle3 produces an ATE of -128.

Any ideas of how to proceed?

```r
# 4. "Moderate cleaning", with text controls ----

tdm <- imf.tdm.moderate.cleaning.df

# gather tabular data
imf.meta1 <- imf.meta
imf.meta2 <- cbind(imf.meta1, tdm)

# imf.meta contains the tabular covariates "C1", "C2", "C3";
# imf.meta2 additionally contains the TDM
nodes <- list(
  W = c("C1", "C2", "C3", names(tdm)),
  A = "T",
  Y = "Y"
)

lrnr_glm_fast <- make_learner(Lrnr_glm_fast)
lrnr_mean <- make_learner(Lrnr_mean)
learner_list <- list(Y = lrnr_mean, A = lrnr_glm_fast)

# make a new copy to deal with data.table weirdness
imf.meta3 <- data.table::copy(imf.meta2)

# estimate
start_time <- Sys.time()
tmle_fit_from_spec <- tmle3(tmle_TSM_all(), imf.meta3, nodes, learner_list)
end_time <- Sys.time()
end_time - start_time
```

@jeremyrcoyle
Collaborator

tmle3 trains the sl3 learners specified in the learner_list argument on the data in the tmle_task. You can specify any sl3 learners you like; if they are pretrained, tmle3 will not train them again. See the examples here for how to specify such learners: http://tlverse.org/tlverse-handbook/sl3.html. We don't have a way of automatically determining appropriate learners for you.
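For example, a minimal sketch of specifying a learner per node (the specific learners here are placeholders mirroring your snippet, not a recommendation; Lrnr_sl and Lrnr_nnls are standard sl3 constructors, but check argument names against your installed version):

```r
library(sl3)

# Any sl3 learner (or super learner) can be supplied per node; tmle3 will
# train whatever it is handed, unless the learner is already trained.
sl_demo <- Lrnr_sl$new(
  learners = list(Lrnr_mean$new(), Lrnr_glm_fast$new()),
  metalearner = Lrnr_nnls$new()
)

# Outcome regression learner under "Y", propensity score learner under "A"
learner_list <- list(Y = sl_demo, A = sl_demo)
```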

You've specified a mean learner for the outcome model and a GLM learner for the propensity score model. Neither is appropriate for your use case. Because you're not attempting to learn the outcome model (mean model), you're in effect generating a type of IPTW estimate (i.e., no double robustness from TMLE). Because you're attempting to learn the propensity score with a GLM, you would expect to get a dense model for the propensity score, which seems unlikely to be effective in a high-dimensional setting. What's more, a GLM isn't really defined on problems with more covariates than observations, so it's unsurprising that it behaves poorly.

Basically, whatever kinds of models you would generally use for this type of data would be appropriate candidates for the outcome and propensity score ensembles. In a high-dimensional use case, you would want learners that are able to find and return a sparse model; things like Lrnr_glmnet, Lrnr_xgboost, or Lrnr_ranger would probably be effective. Adding a pre-screening or preprocessing step to reduce dimensionality (I haven't done NLP in a very long time, but people used to do things like SVD or ICA) would also be appropriate and should dramatically reduce computation time.
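For instance, a rough sketch of what such a library could look like in sl3 (Lrnr_glmnet, Lrnr_xgboost, Lrnr_ranger, Lrnr_pca, and Pipeline are all sl3 components, but the tuning values below are arbitrary placeholders; verify argument names against your sl3 version):

```r
library(sl3)

# Sparse-model candidates suited to p >> n
lrnr_lasso  <- Lrnr_glmnet$new(alpha = 1)        # L1-penalized regression
lrnr_xgb    <- Lrnr_xgboost$new(nrounds = 200)   # gradient-boosted trees
lrnr_ranger <- Lrnr_ranger$new()                 # random forest

# Dimension reduction as a preprocessing step, chained via a pipeline:
# project the document-term matrix onto leading principal components,
# then fit a fast GLM on those components.
lrnr_pca_glm <- Pipeline$new(Lrnr_pca$new(n_comp = 50), Lrnr_glm_fast$new())

# Combine the candidates into one super learner per nuisance parameter
sl_highdim <- Lrnr_sl$new(
  learners = list(lrnr_lasso, lrnr_xgb, lrnr_ranger, lrnr_pca_glm),
  metalearner = Lrnr_nnls$new()
)
learner_list <- list(Y = sl_highdim, A = sl_highdim)
```

The PCA pipeline here is just one way to encode the SVD-style preprocessing mentioned above; a screener or a task-specific dimension reduction would serve the same purpose.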

@adeldaoud
Author

@jeremyrcoyle many thanks for your input! I was following the tutorial in the Handbook (https://tlverse.org/tlverse-handbook/tmle3.html) and the tmle3 framework tutorial (https://tlverse.org/tmle3/articles/framework.html). Good point about using Lrnr_glmnet for the propensity model.

Both tutorials use Lrnr_mean, and I therefore assumed it is the optimal choice for the outcome model. So I guess you are saying that if I used, e.g., Lrnr_glm_fast as the outcome model, I would in fact be getting the double robustness from TMLE. Or to put it another way, what outcome model would you use to obtain double robustness?

@rachaelvp
Member

rachaelvp commented Oct 27, 2021

Hi @adeldaoud, I commented on your latest response below. What you wrote is in bold, and my responses are the text that is not bolded.

**Both tutorials use Lrnr_mean, and I therefore assumed it is the optimal choice for the outcome model.** The tutorials use Lrnr_mean only because it's fast. By no means is it assumed to be the optimal choice for the outcome model! The SL learner libraries in the tutorials are for demonstrative purposes and should not be considered an example of a good library in practice. Sorry you were confused by this.

**So I guess you are saying that if I used, e.g., Lrnr_glm_fast as the outcome model, I would in fact be getting the double robustness from TMLE.** No.

**Or to put it another way, what outcome model would you use to obtain double robustness?** Double robustness of the TMLE for the ATE requires consistent estimation of either the propensity score (PS) or the outcome regression. Whether or not that is achieved depends hugely on your dataset and on what is known about the experiment that generated it. It also depends on how you estimate the PS and the outcome regression. In applied settings (i.e., non-simulation settings where the underlying truth is unknown), it's impossible to know whether the estimated PS or the estimated outcome regression achieved consistency, so it's impossible to know whether double robustness is satisfied in applied settings like yours. Fortunately, the analyst can be conscientious about how they estimate the PS and the outcome regression in order to do as good a job as possible for their data, and presumably the better they do, the more the TMLE's double robustness will kick in. Using the super learner (SL) to estimate the PS and the outcome regression in the TMLE is an excellent first step, as the SL allows the user to consider many diverse strategies for learning from the data. It's up to the user to specify a diverse library for their dataset; the better the library, the better the resulting SL will be. The tutorial examples you're referring to use poor SL libraries that are not nearly diverse enough and would never be used in practice. In practice (i.e., for a real analysis), the user needs to specify a diverse SL library that's appropriate for their dataset and analysis. @jeremyrcoyle provided some insight on the kinds of learners you might consider for your library:

> Basically, whatever kinds of models you would generally use for this type of data would be appropriate candidates for the outcome and propensity score ensembles. In a high-dimensional use case, you would want learners that are able to find and return a sparse model; things like Lrnr_glmnet, Lrnr_xgboost, or Lrnr_ranger would probably be effective. Adding a pre-screening or preprocessing step to reduce dimensionality (I haven't done NLP in a very long time, but people used to do things like SVD or ICA) would also be appropriate and should dramatically reduce computation time.
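To make the double robustness point concrete, here is a rough sketch of a learner_list that attempts to learn both nuisance parameters, rather than fixing the outcome model at the sample mean as in your snippet. The library shown is illustrative only, not a recommendation for your data; it assumes the standard sl3 learners named below:

```r
library(sl3)

# Illustrative super learner over sparse-model candidates, used for BOTH
# the outcome regression (Y) and the propensity score (A), so the TMLE's
# double robustness can actually come into play.
sl_Y <- Lrnr_sl$new(
  learners = list(Lrnr_glmnet$new(), Lrnr_xgboost$new(), Lrnr_ranger$new()),
  metalearner = Lrnr_nnls$new()
)
sl_A <- sl_Y$clone()

learner_list <- list(Y = sl_Y, A = sl_A)

# Same tmle3 call as in your snippet, now learning both nuisance parameters
tmle_fit <- tmle3(tmle_TSM_all(), imf.meta3, nodes, learner_list)
```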

I hope this helps. Please feel free to continue this thread with more Qs / analysis planning. We can help you build out your SL library and show you how to construct a well-specified TMLE. You seem to have an interesting and complex problem. I have a few Q's about your problem, and I included them below.

  • You mentioned you have 10,000 cases and 130,000 covariates. What is your sample size (i.e., how many independent and identically distributed observations are in your data)? Is it 10,000, or are the observations clustered/dependent?
  • What is your question of interest? For instance, you could fill in the blanks to define your question with respect to the ATE: What is the average effect of __________ on _________ when adjusting for possible confounders like ________ in the target population defined by ________________?
