Fix PRB intercept and CI sampling #27
Conversation
@@ -60,13 +61,13 @@ def boot_ci(
    raise ValueError("'nboot' must be a positive integer greater than 0.")
    check_inputs(estimate, sale_price)
    df = pd.DataFrame({"estimate": estimate, "sale_price": sale_price})
-   n: int = df.size
+   n: int = len(df)
df.size is the total number of elements across all of the DataFrame's columns (rows × columns), not the number of rows 🤦
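A minimal standalone sketch of the distinction (hypothetical data, not from this repo):

```python
import pandas as pd

# df.size counts every element (rows * columns); len(df) counts rows.
df = pd.DataFrame({"estimate": [100, 110, 120], "sale_price": [95, 115, 118]})
print(df.size)  # 6 -- rows * columns, double the intended sample size here
print(len(df))  # 3 -- number of rows, the sample size boot_ci() actually wants
```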
My bad for missing this as well!
@@ -122,6 +123,6 @@ def prb_ci(
    :func:`boot_ci`
    """
    prb_model = _calculate_prb(estimate, sale_price)
-   prb_ci = prb_model.conf_int(alpha=alpha)[0].tolist()
+   prb_ci = prb_model.conf_int(alpha=alpha)[1].tolist()
This index also needs to change since 0 now specifies the intercept.
[Question, non-blocking] I'm a little bit confused by this change. I get why we need to adjust the index into prb_model.params in metrics.prb(), since it makes sense for params to include the intercept, but why is the intercept the first element of the return value for conf_int()? Is there a situation in which it makes sense for a confidence interval to include the intercept of the model? Or is it just that we've manually added a constant to the model, so it gets propagated to all return values for the model's results?
You can have a confidence interval on the intercept just like on any other parameter, so it makes sense that statsmodels would return it.
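For illustration, a standalone sketch (simulated data, not this repo's model) of how statsmodels orders the rows of conf_int() when a constant is prepended with add_constant:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(size=200)

# add_constant prepends the intercept column, so parameter 0 is the
# intercept and parameter 1 is the slope, in params and conf_int alike.
fit = sm.OLS(endog=y, exog=sm.add_constant(x)).fit()
print(fit.params)                   # [~2.0, ~3.0]: intercept first, then slope
print(fit.conf_int(alpha=0.05)[1])  # CI for the slope, matching params[1]
```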
@@ -8,33 +8,33 @@ class TestCI:
    def metric(self, request):
        return request.param

-   @pt.fixture(params=[0.80, 0.90, 0.95])
+   @pt.fixture(params=[0.50, 0.20, 0.10, 0.05])
I'm big dumb and specified the alpha values as 1 - alpha, thinking "This is the range of the CI I want, i.e. 95%." I fixed the values and pulled the test results used in assessr.
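To spell out the convention behind the fix (a trivial standalone sketch): alpha is the significance level, so the coverage of the resulting interval is 1 - alpha.

```python
# alpha is the tail mass excluded from the interval, not the coverage.
for alpha in [0.50, 0.20, 0.10, 0.05]:
    print(f"alpha={alpha} -> {1 - alpha:.0%} confidence interval")
```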
@@ -18,7 +18,7 @@ def test_metric_value_is_correct(self, metric, metric_val):
    expected = {
        "cod": 17.81456901196891,
        "prd": 1.0484192615223522,
-       "prb": 0.0009470721642262903,
+       "prb": 0.0024757,
This is the output value in assessr, so now both packages match.
prb_model = sm.OLS(
    endog=lhs.to_numpy(), exog=sm.tools.tools.add_constant(rhs.to_numpy())
).fit(method="qr")
This was the crux of the PRB model issue. statsmodels doesn't add an intercept by default (unlike R's lm()). The method change to "qr" was just to match R, but doesn't seem to actually make a difference.
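A standalone sketch of the default-intercept difference (simulated data and hypothetical variable names, not this repo's code):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 5.0 + 2.0 * x + rng.normal(size=200)

# Without add_constant, sm.OLS fits a regression through the origin,
# the equivalent of R's lm(y ~ x - 1), and estimates only a slope.
through_origin = sm.OLS(y, x).fit()
print(through_origin.params)  # one parameter, biased when a true intercept exists

# With add_constant, the fit matches R's lm(y ~ x): intercept plus slope.
with_intercept = sm.OLS(y, sm.add_constant(x)).fit()
print(with_intercept.params)  # [~5.0, ~2.0]
```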
[Praise] Nice catch! This is a super annoying interface.
Some very tricky problems here, thanks for fixing 😅 I'm a little bit confused why we need to change the way we index into the return value from prb_model.conf_int(), but if you feel confident in that change, feel free to merge!
@@ -178,7 +180,7 @@ def prb(
    ap.prb(ap.ccao_sample().estimate, ap.ccao_sample().sale_price)
    """
    prb_model = _calculate_prb(estimate, sale_price)
-   prb = float(prb_model.params[0])
+   prb = float(prb_model.params[1])
[Suggestion, non-blocking] Since the order of coefficients has proven a little bit tricky, perhaps we can persist a comment explaining our choice?
-   prb = float(prb_model.params[1])
+   # Get the coefficient from the OLS model.
+   # We select element 1, since element 0 is the intercept.
+   prb = float(prb_model.params[1])
Done in 065dfcf!
Looks great to me, @jeancochrane already asked the interesting q's. Thank you both @dfsnow and @Damonamajor for catching this.
This PR fixes two bugs, one longstanding and the other introduced by #24.
- The prb() formula specified in the statsmodels OLS was missing an intercept value. statsmodels bizarrely doesn't add an intercept by default, unlike R's lm().
- The sample size (n) specified in boot_ci() was incorrectly based on df.size, rather than len(df).

Further, I screwed up the CI test results by mis-specifying the alpha value. I updated the CI tests to use (mostly) the test results from assessr.