CausalPy statistical backends #250
Replies: 2 comments 1 reply
-
Hey! Thanks for this write-up! Here are some comments:
My suggestion is to prioritize the Bayesian models and continue with scikit-learn (with MAPIE), at least in the mid-term :)
-
That's interesting! I don't know much about MAPIE, but my understanding is that it would help quantify uncertainty only in the case of predictions. Much of causal inference, and explicitly the key parameters in instrumental regression, requires an understanding of the uncertainty in those estimates. I'm not sure how prediction intervals help here. But if we're happy keeping the uncertainty quantification Bayesian, then maybe we just stick with scikit-learn. As for avoiding re-implementing everything, that's a fair point regarding linearmodels. I was impressed with the package and its coverage of instrumental variable regression. I agree that more explicit development of causal modelling with PyMC would be a great direction to take CausalPy. See here for an example: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=226627
Beta Was this translation helpful? Give feedback.
-
When CausalPy was conceived, the goal was for the Bayesian estimation methods to be primary, but also to offer traditional OLS-style estimation methods.
Bayesian backend
In terms of Bayesian estimation methods, we are obviously biased and are using PyMC. Insofar as I can dictate things (being lead dev), there are no plans for any additional Bayesian backends using some alternative Bayesian library. I don't see any benefit in doing that at all, only costs in terms of complexity and increased maintenance burden. As far as I'm concerned, an alternative Bayesian backend would amount to a completely different project, so others are free to take that up if they want, but as a totally separate project.
Traditional estimation backends
In terms of traditional estimation methods, we forged ahead with scikit-learn because it offers a very large array of model types and has a very well-known API. However, in #8, we raised the idea of using a traditional estimation method which also offered uncertainty quantification (e.g. confidence intervals). #8 specifically considered using the statsmodels library. Since then, in #229, @juanitorduz also proposed the linearmodels library.
Possible approaches for traditional estimation backends
I think it's time to come up with a coherent plan. There are a few issues to think about:
1. Do we actually want scikit-learn as a back-end at all?
Not having any uncertainty estimation out of the box is a limitation of scikit-learn. Adding such estimates to CausalPy would be highly appealing, but do we really want 3 backends (PyMC, scikit-learn, and another)? I think this would be pretty sub-optimal. The focus of CausalPy (at least from my perspective) should be the Bayesian estimation methods, so having that as 1/3rd of the functionality seems silly. That said, if we get a bunch of contributors who want to build that out, then why not. The only downside of dropping scikit-learn is that we lose access to some model types which are not in statsmodels or linearmodels, such as Gaussian Processes.
If we decided to get rid of it, then we could of course slow-motion deprecate that functionality rather than kill it immediately.
2. Traditional estimation back-end strategy
So the next question would come down to:
a. Do we pick a single OLS estimation backend? If so, should that be statsmodels or linearmodels, or even something else?
b. Do we actually just aim to offer OLS functionality and be ambivalent about what back-end we use to implement it? If we went down this road, then we could end up with a single OLS sub-module, but we'd utilise statsmodels or linearmodels on a model-by-model basis, whatever happened to be most useful.
What do I think?
Personally, in retrospect, scikit-learn is not that useful for CausalPy because of the lack of confidence intervals out of the box. Right now I don't have a strong preference towards statsmodels or linearmodels, but I will try to look into both in more detail, and am very eager to hear people's thoughts. In terms of OLS backends, the benefit of (a) is that we get some uniformity and predictability, and likely an easier time implementation- and coding-wise. The cost of (b) is more dependencies and the effort of working with multiple backends. So I have a mild preference towards (a), but am actually totally open to (b).
What do you think?
Tagging in particular @NathanielF, @juanitorduz, @twiecki, but we're interested to hear back from any users of CausalPy. We also want to know if you really don't care and want to vote that we focus exclusively on Bayesian functionality.