In causal inference, we conduct experiments where the treatment assignment is randomised and we observe the outcomes (conventionally known as Y) on the experimental units. To estimate the average treatment effect over the whole population, we take the average outcome within the treated group and compare it against the average outcome within the control group. Since the treatment assignment is randomised, the only systematic difference between the two groups is the treatment itself, so we can conclude that the resulting difference in average outcomes is the causal effect of the treatment.
However, we can expect certain subgroups within the general population to experience a greater treatment effect than others. For example, in the context of a medicinal drug used to combat an illness, the elderly group (which tends to have a weaker immune system) may have a higher incremental response rate to the drug treatment than the young adult group (which tends to have a stronger immune system).
This brings us to the topic of heterogeneous treatment effects (HTE), which implies that different subpopulation groups have varying treatment effects. By identifying that different subgroups will have different response rates to the treatment, we can perform “uplift modelling” to identify and rank the subgroups that have the best response rates and prioritise them.
Uplift (also known as incremental value) modelling is based on a generic framework using the theory of Conditional Average Treatment Effect (CATE). For a given intervention Treatment T, the incremental value is the difference in expected outcomes between T = 1 and T = 0, conditioned upon some covariates/features X.
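In the standard potential-outcomes notation, with $\tau(x)$ denoting the CATE, this reads:

$$\tau(x) = \mathbb{E}\left[\,Y \mid X = x, T = 1\,\right] - \mathbb{E}\left[\,Y \mid X = x, T = 0\,\right]$$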
The assumptions behind uplift modelling are very similar to those of a causal experiment with heterogeneous treatment effects. For a given set of covariates X, we assume conditional independence between the treatment assignment T and the potential outcomes (Y1, Y0). This is also known as the conditional exchangeability/unconfoundedness assumption.
In this article, we will discuss uplift modelling with respect to a business setting where:
- the customers are the experimental units,
- the customers’ demographic information can be captured by covariates X
- the treatment/intervention T is a business action (e.g. promotion), and
- the outcome is a binary metric of interest (e.g. customer purchase behaviour)
Based on the actions taken by customers/subjects under an intervention (e.g. a promotional offer), there is a fundamental segmentation that separates customers into the following four groups.
- The Persuadables: Customers who will respond positively because of an intervention.
- The Sure Things: Customers who would have responded positively regardless of whether they were given an intervention.
- The Lost Causes: Customers who will respond negatively regardless of whether an intervention is given.
- The Do Not Disturbs: Customers who will respond negatively because of an intervention.
The objective of any uplift modeling exercise is to identify the Persuadables and prioritise them while avoiding the rest.
There are various approaches through which we can model incremental value. In this article we cover four such techniques:
- Single Model Approach (S-Learner) [1],
- Two Model Approach (T-Learner) [1],
- X-Learner [1], and
- CATE-generating Outcome Transformation (OT) approach [2]
All the above formulations can be thought of as algorithmic frameworks to model incremental value, in which one can incorporate any typical machine learning algorithm as a base learner. These algorithmic frameworks are typically referred to as Meta-learners in the literature.
The S-Learner can be thought of as a “single” model approach: we model the outcome Y conditional on both X and T, where T is treated as just another covariate. One implication of the S-Learner is that if the set of X features is very large (or high-dimensional), estimating the causal effect of T on Y might be difficult, since T is just one out of many covariates. In the following notation, M represents a supervised learning model that predicts Y conditional on X and T.
After creating the model, the CATE for a given unit (with its corresponding set of X features) is estimated via:
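In common notation, with $M(x, t)$ denoting the single fitted model's prediction and $\hat{\tau}(x)$ the estimated CATE:

$$\hat{\tau}(x) = M(x, T=1) - M(x, T=0)$$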
The T-Learner, on the other hand, uses two models: one for the treated group (represented by M1) and one for the control group (represented by M0). Given that there are separate models for the corresponding groups, the treatment indicator T is not included as a covariate in the modelling.
After creating the two models, the CATE estimation for a given unit (with its corresponding set of X features) is estimated via:
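Using $M_1$ and $M_0$ as defined above:

$$\hat{\tau}(x) = M_1(x) - M_0(x)$$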
With the T-Learner, note that there is some data inefficiency in the modelling: we only use the Treated group data for one model and the Control group data for the other. To overcome this, we can look at the X-Learner.
The X-Learner consists of two stages each with two models. The two models in the first stage are essentially the same as the two models in the T-Learner.
Once again, note that M1 was created with the Treated group, while M0 was created using the Control group. After both are created, we use each model to predict counterfactual outcomes for the opposite group in order to calculate the intermediate values D.
- For the Control group data, we calculate the intermediate variable $D^{T=0}$ as the M1 counterfactual outcome prediction minus the observed outcome.
- For the Treated group data, we calculate the intermediate variable $D^{T=1}$ as the observed outcome minus the M0 counterfactual outcome prediction (both are formalised below).
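With $Y_i^{obs}$ denoting the observed outcome of unit $i$, these intermediate values can be written as:

$$D_i^{T=0} = M_1(X_i) - Y_i^{obs} \quad \text{(Control group)}, \qquad D_i^{T=1} = Y_i^{obs} - M_0(X_i) \quad \text{(Treated group)}$$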
Thereafter, in the second stage, two further models ($M_{11}$ and $M_{00}$) are fitted using the intermediate values $D^{T=1}$ and $D^{T=0}$ respectively:
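That is, $M_{11}$ regresses $D^{T=1}$ on $X$ using the Treated group data, and $M_{00}$ regresses $D^{T=0}$ on $X$ using the Control group data, giving two CATE estimates:

$$\hat{\tau}_1(x) = M_{11}(x), \qquad \hat{\tau}_0(x) = M_{00}(x)$$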
Subsequently, the CATE estimation for a given unit (with its corresponding X features) is shown by the following:
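Following the usual X-Learner formulation, the two second-stage estimates are combined with a weighting function $g(x)$:

$$\hat{\tau}(x) = g(x)\,\hat{\tau}_0(x) + \bigl(1 - g(x)\bigr)\,\hat{\tau}_1(x)$$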
where $g(x) \in [0, 1]$ is a weighting function, typically taken to be the propensity score model.
For the OT approach, a key insight is that we can characterize the CATE as a conditional expectation of an observed variable by transforming the outcome using the treatment indicator and the treatment assignment probability.
The CATE-generating transformation of the outcome is shown by the following formula:
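In the notation defined below, the transformation reads:

$$Y_i^{*} = Y_i^{obs} \cdot \frac{W_i}{e(X_i)} \;-\; Y_i^{obs} \cdot \frac{1 - W_i}{1 - e(X_i)}$$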
where:
- $Y_i^*$ is the transformed outcome variable to be modelled,
- $i$ indexes the experimental unit/subject,
- $Y_i^{obs}$ is the observed outcome,
- $e(X_i)$ is the treatment assignment probability, aka the propensity score, and
- $W_i$ is the treatment indicator variable.
Suppose that the unconfoundedness/conditional exchangeability assumption holds.
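Then the conditional expectation of the transformed outcome recovers the CATE; in the notation above, this standard identity reads:

$$\mathbb{E}\left[\,Y_i^{*} \mid X_i = x\,\right] = \tau(x)$$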
For a detailed mathematical proof of the above, please refer to Appendix A1 which has a handwritten breakdown of the mathematical formulation.
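As an illustration, here is a minimal numpy sketch of the transformation above (the function and variable names are ours for illustration, not from the project repo):

```python
import numpy as np

def transform_outcome(y_obs: np.ndarray, w: np.ndarray, e: np.ndarray) -> np.ndarray:
    """CATE-generating outcome transformation.

    y_obs : observed binary outcomes
    w     : treatment indicators (1 = treated, 0 = control)
    e     : treatment assignment probabilities (propensity scores)
    """
    return y_obs * w / e - y_obs * (1 - w) / (1 - e)

# Toy example with a fixed randomisation probability of 0.85 (as in the Criteo data)
y_obs = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([1.0, 1.0, 0.0, 0.0])
e = np.full(4, 0.85)
y_star = transform_outcome(y_obs, w, e)  # regress y_star on X with any base learner to estimate the CATE
```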
Evaluation of uplift models involves various metrics. Before we get into the different metrics, however, one should understand how to read a gain chart. On the x-axis, we rank the population according to the predicted treatment effects (typically from highest to lowest, left to right). On the y-axis, we then calculate the metric of interest as we accumulate more and more of the population quantiles.
Notably, the typical benchmark for comparison is a randomised model. This is represented by the roughly straight diagonal line (the solid black line above), corresponding to a policy/model that does not discriminate between population subgroups. With no prioritisation, the gain climbs steadily on average as you move across the population quantiles.
In the chart above, the green and blue lines show the gain performance for uplift models that have prioritised the population subgroups. For a given population percentage (a vertical line at an x-axis value), we can read off the graph to compare the y-values between different models and against the benchmark. This difference represents the performance lift, and it can be turned into some form of value optimisation. For example, as shown by the red line along the green curve, by targeting 40% of the population one captures about 85% of the gain achievable by targeting the whole population.
In uplift modelling, there are three common evaluation measures, namely:
- Qini curve
- Adjusted Qini curve
- Cumulative Gain curve
The Qini Curve is formulated by the following formula:
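Following PyLift's notation, with $n_{t,1}(\phi)$ and $n_{c,1}(\phi)$ the cumulative counts of positive outcomes in the Treatment and Control groups up to population fraction $\phi$, and $N_t$, $N_c$ the total group sizes, the curve is commonly written as:

$$Q(\phi) = \frac{n_{t,1}(\phi)}{N_t} - \frac{n_{c,1}(\phi)}{N_c}$$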
The population fraction $\phi$ is the fraction of the population targeted, ordered by predicted uplift (from highest to lowest); the subscripts t and c indicate the Treatment and Control groups within that targeted fraction. The numerators are the cumulative counts of positive binary outcomes in the Treatment and Control groups respectively. The denominators (represented by capital N) do not depend on the population fraction: they are the total counts of Treatment and Control units in the experiment.
Mathematically, the value represents the difference in positive outcomes between the Treatment group and the Control group (each normalised by its total group size) up to a given population quantile. The bigger the value, the bigger the treatment effect.
The intuition is that, since we ranked the population by descending predicted treatment effects, we would expect the Qini curve to rise most steeply over the earliest quantiles of the population (where the treatment effect is largest) and taper off over later quantiles (where the treatment effect is smaller).
The Adjusted Qini curve is based on the following formula:
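With $n_t(\phi)$ and $n_c(\phi)$ denoting the counts of Treated and Control units up to fraction $\phi$, the adjusted curve is commonly written as:

$$Q_{adj}(\phi) = \frac{n_{t,1}(\phi)}{N_t} - \frac{n_{c,1}(\phi)}{n_c(\phi)} \cdot \frac{n_t(\phi)}{N_t}$$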
The difference between the Qini curve and the adjusted Qini curve lies in the fraction for the Control group (the fraction being subtracted). This modified fraction adjusts the count of positive outcomes in the Control group as if the total Control group size were the same as the total Treatment group size. This is particularly applicable when the Treatment group is much smaller than the Control group in a randomised controlled trial (the rationale being that you may not want to expose a potentially harmful treatment to a large proportion of your experimental population).
There are different variants of the Cumulative Gain metric, but for this article, the formula is based on the following:
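In the same notation, the variant used here can be written as:

$$\text{CG}(\phi) = \left(\frac{n_{t,1}(\phi)}{n_t(\phi)} - \frac{n_{c,1}(\phi)}{n_c(\phi)}\right)\bigl(n_t(\phi) + n_c(\phi)\bigr)$$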
The first bracket is the difference in fractions between the Treatment and Control groups, where the numerators are the counts of positive outcomes and the denominators are the counts of Treatment/Control units within the given population fraction.
The Adjusted Qini Curve can also be reformulated in a different way to illustrate that it is similar to the Cumulative Gain, but with a different multiplier.
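Factoring out the within-fraction response rates, the adjusted Qini becomes:

$$Q_{adj}(\phi) = \left(\frac{n_{t,1}(\phi)}{n_t(\phi)} - \frac{n_{c,1}(\phi)}{n_c(\phi)}\right)\frac{n_t(\phi)}{N_t}$$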
Based on the revised formulation of the Adjusted Qini, we can clearly see the difference in the multipliers.
| Adjusted Qini multiplier | Cumulative Gains multiplier |
| --- | --- |
| $\dfrac{n_t(\phi)}{N_t}$ | $n_t(\phi) + n_c(\phi)$ |

The Adjusted Qini's multiplier is normalised by the total Treatment group size $N_t$, whereas the Cumulative Gain's multiplier uses the counts of both Treated and Control units at the given fraction.
Given that the Cumulative Gain uses both the Treated and Control populations at every fraction in its multiplier, the cumulative gains chart is generally the less biased of the two. However, the adjusted Qini can be useful when the Treated group is a much smaller percentage of the experiment than the Control group; under such a scenario, the adjusted Qini weights the Treatment group information disproportionately higher.
To incorporate the various metrics for evaluation, we have to refer back to the gain chart visualisation. As mentioned, the randomised model is the typical benchmark that is represented by the diagonal 45 degree line.
When we evaluate a model's performance relative to the random benchmark using the Qini formulation, for example, we calculate the Q coefficient, which is represented by:
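One way to express this is as the difference in area under the curve (AUC) between the model's Qini curve and the random benchmark:

$$Q = \int_0^1 Q_{\text{model}}(\phi)\,d\phi \;-\; \int_0^1 Q_{\text{random}}(\phi)\,d\phi$$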
This difference in AUC is represented by the red shading in the above figure. Note that the same can be done for Adjusted Qini or Cumulative Gain as the metric.
For more details, please refer to PyLift’s documentation on the 3 measures.
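For intuition, here is a rough numpy sketch of how the Qini curve and Q coefficient could be computed from scored data (variable names are ours; see the PyLift documentation above for the reference implementation):

```python
import numpy as np

def qini_curve(uplift_scores, y, w):
    """Qini curve Q(phi), ranking the population by predicted uplift.

    uplift_scores : predicted treatment effects, one per unit
    y             : observed binary outcomes
    w             : treatment indicators (1 = treated, 0 = control)
    """
    uplift_scores, y, w = map(np.asarray, (uplift_scores, y, w))
    order = np.argsort(-uplift_scores)      # highest predicted uplift first
    y, w = y[order], w[order]
    n_t1 = np.cumsum(y * w)                 # cumulative positives in the Treatment group
    n_c1 = np.cumsum(y * (1 - w))           # cumulative positives in the Control group
    phi = np.arange(1, len(y) + 1) / len(y)
    q = n_t1 / w.sum() - n_c1 / (1 - w).sum()
    return phi, q

def q_coefficient(phi, q):
    """Area between the model's Qini curve and the random-targeting diagonal."""
    random_line = phi * q[-1]               # straight line from (0, 0) to (1, Q(1))
    return np.trapz(q - random_line, phi)
```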
The Criteo dataset was created as a standardised benchmark dataset for research experimentation and algorithm comparison. This dataset consists of about 13 million observations.
The dataset comes with the following variables:
- f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)
- treatment: treatment group (1 = treated, 0 = control)
- conversion: whether a conversion occurred for this user (binary, label)
- visit: whether a visit occurred for this user (binary, label)
- exposure: treatment effect, whether the user has been effectively exposed (binary)
With this dataset, the treatment assignment is randomised, but the ratio between the Treated vs Control group is 85% to 15%.
Splitting by treatment assignment, we can look at the breakdown of the two outcome variables, namely "Visit" and "Conversion".
With Visit as an outcome, the response rates are
- Control group: 80105/ (80105 + 2016832) = 3.82%
- Treated group: 576824 / (576824 + 11305831) = 4.85%
With Conversion as an outcome, the response rates are:
- Control group: 4063 / (4063 + 2092874) = 0.194%
- Treated group: 36711 / (36711 + 11845944) = 0.309%.
The table below shows a big difference in the magnitude of the two outcome scenarios, where the Conversion rates are roughly 15 to 20 times smaller than the Visit rates.
| Group | Visit | Conversion |
| --- | --- | --- |
| Control | 3.82% | 0.194% |
| Treated | 4.85% | 0.309% |
For more details, you can refer to the paper “A Large Scale Benchmark for Uplift Modeling” or to this link.
Since the Criteo dataset is large, in order to have faster experimentation cycles we randomly selected 50% of the population, i.e. ~6.5 million samples. We evaluated the different uplift modelling approaches with 5-fold cross-validation on the Qini (qini), Adjusted Qini (aqini) and Cumulative Gains (cgains) metrics.
Given that the four frameworks are Meta-Learners, we used Gradient Boosting Classifier models as base learners with default hyperparameters; to keep the analysis simple, we did not perform any hyperparameter tuning.
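To make the setup concrete, here is a minimal sketch of how the S-Learner and T-Learner can be instantiated with scikit-learn gradient boosting base learners on the Criteo columns described earlier (the helper names are ours; the project notebooks may differ):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Feature columns of the Criteo uplift dataset (f0 ... f11)
features = [f"f{i}" for i in range(12)]

def fit_s_learner(df: pd.DataFrame, outcome: str = "visit"):
    """S-Learner: one model over X and T; CATE(x) = M(x, T=1) - M(x, T=0)."""
    model = GradientBoostingClassifier()  # default hyperparameters, no tuning
    model.fit(df[features + ["treatment"]], df[outcome])

    def predict_uplift(X_new: pd.DataFrame) -> np.ndarray:
        X1 = X_new[features].assign(treatment=1)
        X0 = X_new[features].assign(treatment=0)
        return model.predict_proba(X1)[:, 1] - model.predict_proba(X0)[:, 1]

    return predict_uplift

def fit_t_learner(df: pd.DataFrame, outcome: str = "visit"):
    """T-Learner: separate models per group; CATE(x) = M1(x) - M0(x)."""
    m1 = GradientBoostingClassifier().fit(df.loc[df.treatment == 1, features],
                                          df.loc[df.treatment == 1, outcome])
    m0 = GradientBoostingClassifier().fit(df.loc[df.treatment == 0, features],
                                          df.loc[df.treatment == 0, outcome])
    return lambda X_new: (m1.predict_proba(X_new[features])[:, 1]
                          - m0.predict_proba(X_new[features])[:, 1])
```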
We will analyse the results of two scenarios of treatment variable T vs different outcome variables:
- Treatment vs Visit
- Treatment vs Conversion.
As we saw in the evaluation metrics section, uplift models can be evaluated with the "qini", "aqini" and "cgains" metrics. We compared the various modelling approaches on these metrics across 5-fold cross-validation.
An example of the OT framework performance on the Treatment vs Visit outcome of a particular fold is shown below.
Notice how with the OT meta-learner model, by prioritising the top 40% of the population, we capture about 90% of the overall cumulative gain. This actually opens up options for value optimisation, but this will not be explored within this article.
In terms of 5-fold cross-validation, the following charts show the performance of the different frameworks.
(Charts: Qini, Adjusted Qini, and Cumulative Gains results across the 5 folds.)
Observations from Treatment on Visit:
- All the meta-learners perform similarly in both mean and variance on the cross-validation metrics. Note that the visualisations are all zoomed in (the y-axis does not start at 0).
- We observe that, in general, the T-Learner tends to have slightly higher variance than the rest. This could be due to the T-Learner's inherently inefficient use of data, since it fits separate models for the Treated and Control groups. Because our Control group is only 15% of the data under randomisation, this could explain the generally higher variance of the T-Learner framework.
An example of the X-Learner framework performance on the Treatment vs Conversion outcome for a particular fold is shown below.
In terms of 5-fold cross-validation, the following charts show the performance of the different frameworks.
(Charts: Qini, Adjusted Qini, and Cumulative Gains results across the 5 folds.)
Observations from Treatment on Conversion experiments:
- The OT framework performs significantly better than all other meta-learners, with a cgains of 0.181 and the smallest standard deviation (0.017) on the cross-validation metrics.
- The X-Learner is the next best, with a cgains of 0.151 and a standard deviation of 0.033.
- The S-Learner is the worst of all in terms of both mean and variance when compared to the other meta-learners. We suspect that in much lower response rate scenarios (i.e. Treatment on Conversion), the severe imbalance between the Treated and Control groups (85:15 ratio) might worsen the variance of the S-Learner estimates. We also suspect that with some form of imbalance correction (e.g. a 50:50 Treated vs Control ratio in the experimental data), the S-Learner's performance would improve considerably in terms of both mean and variance.
- Comparing the T-Learner against the OT and X-Learner frameworks, we hypothesise that its reduced performance is once again due to the data inefficiency inherent in its methodology.
A big difference between the Visit outcome scenario and the Conversion outcome scenario is that in the latter case, both the OT and X-Learner perform much better than the S- and T-Learner. Note that the Visit outcome has response rates of about 4 to 5%, while the Conversion outcome has response rates of about 0.2 to 0.3%. For the OT and X-Learner, we posit that by explicitly incorporating propensity score models in their formulation, there is some form of "weighting correction" that allows for better modelling of the incremental treatment effect, especially in low response rate scenarios. Thus, they are able to perform significantly better than the T- and S-Learner.
References to the notebooks within this project repo can be found here:
- 5 Fold Cross Validation Notebook
- CV Results Visualisation Notebook
- Example CGAINS chart for various metalearners
With the Criteo uplift dataset, all the approaches that we adopted showed similar performance for the Treatment vs Visit setting. However, we observe a stark difference in performance across the models for the Treatment vs Conversion setting. Not only does the Outcome Transformation (OT) model have the best average performance by far in that setting, it also has the lowest variance compared to the other models. At the other extreme, the S-Learner has the lowest average performance and the highest variance across the runs. We would like to investigate this variance behaviour across the modelling approaches in future articles.
Another interesting way to look at uplift modelling is deriving value from a flat A/B experiment result [3]. While a regular A/B experiment only allows us to discover treatment effects at the level of the whole population, causal inference techniques enable us to extract valuable information about the variation in responses across different subpopulations.
Future articles in this series will cover aspects like causal inference in non-randomised settings (aka observational causal inference) and uplift modelling with ROI constraints [4]. Later in the series we will explore topics at the intersection of causal inference and multi-armed bandits.
- Machine Learning for Estimating Heterogeneous Causal Effects
- Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning
- Leveraging Causal Modeling to Get More Value from Flat Experiment Results
- Free Lunch! Retrospective Uplift Modeling for Dynamic Promotions Recommendation within ROI Constraints