CausalPy statistical backends #250
Replies: 2 comments 1 reply
-
Hey! Thanks for this write-up! Here are some comments:
My suggestion is to prioritize the Bayesian models and continue with scikit-learn (with MAPIE), at least in the mid-term :)
-
That's interesting! I don't know much about MAPIE, but my understanding is that it would help quantify uncertainty only in the case of predictions. Much of causal inference, and explicitly the key parameters in instrumental regression, requires an understanding of the uncertainty in those estimates. I'm not sure how prediction intervals help here. But if we're happy keeping the uncertainty quantification Bayesian, then maybe we just stick with scikit-learn. As for avoiding re-implementing everything, that's a fair point regarding linearmodels. I was impressed with the package and its coverage of instrumental variable regression. I agree that more explicit development of causal modelling with PyMC would be a great direction to take CausalPy. See here for an example: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=226627
Beta Was this translation helpful? Give feedback.
-
When CausalPy was conceived, the goal was for the Bayesian estimation methods to be primary, but also to offer traditional OLS-style estimation methods.
Bayesian backend
In terms of Bayesian estimation methods, we are obviously biased and are using PyMC. Insofar as I can dictate things (being lead dev), there are no plans for any additional Bayesian backends using some alternative Bayesian library. I don't see any benefit in doing that at all, only costs in terms of complexity and increased maintenance burden. As far as I'm concerned, an alternative Bayesian backend would amount to a completely different project, so others are free to take that up if they want, but as a totally separate project.
Traditional estimation backends
In terms of traditional estimation methods, we forged ahead with scikit-learn because it offers a very large array of model types and has a very well-known API. However, in #8, we raised the idea of using a traditional estimation method which also offered uncertainty quantification (e.g. confidence intervals). #8 specifically considered using the statsmodels library. Since then, in #229, @juanitorduz also proposed the linearmodels library.
Possible approaches for traditional estimation backends
I think it's time to come up with a coherent plan. There are a few issues to think about:
1. Do we actually want scikit-learn as a back-end at all?
Not having any uncertainty estimation out of the box is a limitation of scikit-learn. Adding such estimates to CausalPy would be highly appealing, but do we really want 3 backends (PyMC, scikit-learn, and another)? I think this would be pretty sub-optimal. The focus of CausalPy (at least from my perspective) should be the Bayesian estimation methods, so having that as 1/3rd of the functionality seems silly. That said, if we get a bunch of contributors who want to build that out, then why not. The only downside of dropping scikit-learn is that we lose access to some model types which are not in statsmodels or linearmodels, such as Gaussian Processes.
If we decided to get rid of it, then we could of course slow-motion deprecate that functionality rather than kill it immediately.
2. Traditional estimation back-end strategy
So the next question would come down to:
a. Do we pick a single OLS estimation backend? If so, should that be statsmodels or linearmodels, or even something else?
b. Do we actually just aim to offer OLS functionality and be ambivalent about what back-end we use to implement it? If we went down this road, then we could end up with a single OLS sub-module, but we'd utilise statsmodels or linearmodels on a model-by-model basis, whatever happened to be most useful.
What do I think?
Personally, in retrospect, scikit-learn is not that useful for CausalPy because of the lack of confidence intervals out of the box. Right now I don't have a strong preference towards statsmodels or linearmodels, but I will try to look into both in more detail, and am very eager to hear people's thoughts. In terms of OLS backends, the benefit of (a) is that we get some uniformity and predictability, and likely an easier time implementation- and coding-wise. The cost of (b) is more dependencies and the effort of working with multiple backends. So I have a mild preference towards (a), but am actually totally open to (b).
What do you think?
Tagging in particular @NathanielF, @juanitorduz, @twiecki, but we're interested to hear back from any users of CausalPy. We also want to know if you really don't care and want to vote that we focus exclusively on Bayesian functionality.