Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bart #539

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Bart #539

wants to merge 5 commits into from

Conversation

caioguirado
Copy link

Proposed changes

This PR proposes the implementation of Bayesian Additive Regression Trees (BART) as an additional method to the package.
The implementation allows the usage of BART for both a Classic ML problem setting and Uplift Modeling. The reson for including also the Classic ML setting was to allow easier validation of the method with synthetic data.

Currently the method works for regression and binary classification response types, and with binary treatment type.

References:
[1] Chipman et al. (2010)
[2] Hill (2011)
[3] Kapelner and Bleich (2014)
[4] Tan and Roy (2019)
BartPy

Types of changes

What types of changes does your code introduce to CausalML?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in downstream modules

Further comments

Some next steps are proposed for improvement:

  • Add parallelization:
    In the example notebook added, there's a cProfile analysis of the methods that take the most time to execute inside the fit method. Both the computation of the individual tree residuals and prediction step are the top opportunities of improvement. They also have a very similar logic. Pratola et al. proposed a way of parallelizing it.

  • Add non-binary treatment support.

  • Add multi-class classification support.

  • Add MCMC statistics report and confidence intervals for BART predictions

@jeongyoonlee
Copy link
Collaborator

Thanks for the PR for BART, @caioguirado!

I will take a look further in details, but have two comments at the moments as follows:

  • Currently, it's failing at the Lint test. Could you please run black to reformat the code?
  • The test data set used in the example notebook looks too small. Could you please use more features e.g. ~10 with the mix of informative and non-informative features?

Thanks!

if type_ == 'update_leaves':
if not root.left and not root.right:
self.leaves.append(root)
return
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me why we need to return None here (and below).

response = np.where(y.reshape(-1, 1) == 1, max_els, min_els) # Z in the paper

for j in range(len(self.trees)):
residuals = self.compute_residual(X=X, y=response, trees=self.trees, j=j) # opportunity to parallelize according to Pratola et al. (2013) (https://arxiv.org/abs/1309.1906)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what it says in the reference, but if I understand things correctly, for each tree we compute the residuals by looping over every other tree, which indeed seems wasteful. Is it not possible to store the prediction of each tree to avoid computing the same prediction over and over again?

df_res = pd.DataFrame(np.array(predictions).T, columns=columns)

# From: https://github.com/uber/causalml/blob/c42e873061eb74ec9c3ca6ea991e113b886245ae/causalml/inference/tree/uplift.pyx
df_res['recommended_treatment'] = df_res.apply(np.argmax, axis=1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily something to solve here, but we need to change the terminology of "recommended treatment", because CATE is just one factor that is relevant for whether a treatment should be recommended or not.

@t-tte
Copy link
Collaborator

t-tte commented Aug 29, 2022

Dropped some comments but overall looks great to me. The example notebook fails to render for me at the moment but I'll take a look if/when you've added more predictors as Jeong suggested.

@zhenyuz0500
Copy link
Collaborator

One of the BART features is supporting continuous treatment - it will be an excellent add to the package if this is supported. I'm wondering how much effort is needed to support continuous treatment.

Other than that, it will be great to compare BART's performance with some other models and show its advantage and value. Or maybe mention BART can support which scenarios that cannot be solved by other models in the current implementation.

@ras44
Copy link
Collaborator

ras44 commented Nov 15, 2023

@jeongyoonlee happy to take a look at this PR again too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants