
Fix PRB intercept and CI sampling #27

Merged: dfsnow merged 5 commits from dfsnow/fix-prb-formula into main on Nov 27, 2024

Conversation

dfsnow (Member) commented Nov 27, 2024

This PR fixes two bugs, one longstanding and the other introduced by #24.

  1. The prb() formula specified in statsmodels OLS was missing an intercept term. statsmodels bizarrely doesn't add an intercept by default, unlike R's lm().
  2. The sample size (n) specified in boot_ci() was incorrectly based on df.size, rather than len(df).

Further, I screwed up the CI test results by mis-specifying the alpha values. I updated the CI tests to use (mostly) the test results from assessr.

@@ -60,13 +61,13 @@ def boot_ci(
raise ValueError("'nboot' must be a positive integer greater than 0.")
check_inputs(estimate, sale_price)
df = pd.DataFrame({"estimate": estimate, "sale_price": sale_price})
- n: int = df.size
+ n: int = len(df)
dfsnow (Member Author):

df.size is the total number of elements in the DataFrame (rows × columns), not the number of rows 🤦
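A minimal illustration of the difference (made-up values, not from the package):

import pandas as pd

# Three rows, two columns: df.size counts every element, len(df) counts rows
df = pd.DataFrame({"estimate": [1, 2, 3], "sale_price": [4, 5, 6]})
print(df.size)   # 6 (3 rows * 2 columns)
print(len(df))   # 3 (number of rows)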

jeancochrane:

My bad for missing this as well!

@@ -122,6 +123,6 @@ def prb_ci(
:func:`boot_ci`
"""
prb_model = _calculate_prb(estimate, sale_price)
- prb_ci = prb_model.conf_int(alpha=alpha)[0].tolist()
+ prb_ci = prb_model.conf_int(alpha=alpha)[1].tolist()
dfsnow (Member Author):

This index also needs to change since 0 now specifies the intercept.

jeancochrane commented Nov 27, 2024:

[Question, non-blocking] I'm a little bit confused by this change. I get why we need to adjust the index into prb_model.params in metrics.prb(), since it makes sense for params to include the intercept, but why is the intercept the first element of the return value for conf_int()? Is there a situation in which it makes sense for a confidence interval to include the intercept of the model? Or is it just that we've manually added a constant to the model, so it gets propagated to all return values for the model's results?

dfsnow (Member Author):

You can have a confidence interval on $b_0$ as well as $b_1$; it makes perfect sense that statsmodels would return it.
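As a rough sketch on synthetic data (not the package's inputs): conf_int() returns one [lower, upper] row per fitted parameter, intercept first, so the slope's interval is row 1.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=0.1, size=100)

model = sm.OLS(endog=y, exog=sm.add_constant(x)).fit()
ci = model.conf_int(alpha=0.05)  # shape (2, 2): one row per parameter
print(ci[0])  # CI for the intercept (b0)
print(ci[1])  # CI for the slope (b1)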

@@ -8,33 +8,33 @@ class TestCI:
def metric(self, request):
return request.param

- @pt.fixture(params=[0.80, 0.90, 0.95])
+ @pt.fixture(params=[0.50, 0.20, 0.10, 0.05])
dfsnow (Member Author):

I'm big dumb and specified the alpha values as 1 - alpha, thinking "This is the range of the CI I want i.e. 95%." I fixed the values and pulled the test results used in assessr.
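For reference, alpha here is the excluded tail mass, not the coverage, so the new fixture values map to 50%, 80%, 90%, and 95% intervals respectively. For example:

prb_model.conf_int(alpha=0.05)  # 95% confidence interval
prb_model.conf_int(alpha=0.50)  # 50% confidence interval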

@@ -18,7 +18,7 @@ def test_metric_value_is_correct(self, metric, metric_val):
expected = {
"cod": 17.81456901196891,
"prd": 1.0484192615223522,
"prb": 0.0009470721642262903,
"prb": 0.0024757,
dfsnow (Member Author):

This is the output value in assessr, so now both packages match.

Comment on lines +135 to +137
prb_model = sm.OLS(
endog=lhs.to_numpy(), exog=sm.tools.tools.add_constant(rhs.to_numpy())
).fit(method="qr")
dfsnow (Member Author) commented Nov 27, 2024:

This was the crux of the PRB model issue. statsmodels doesn't add an intercept by default (unlike R's lm()). The method change to "qr" was just to match R, but doesn't seem to actually make a difference.
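A minimal sketch of the difference on synthetic data (not the package's _calculate_prb inputs):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 3.0 + 0.25 * x + rng.normal(scale=0.1, size=50)

# Without add_constant, statsmodels fits y = b1 * x (no intercept),
# the equivalent of R's lm(y ~ x - 1) rather than lm(y ~ x)
no_intercept = sm.OLS(endog=y, exog=x).fit()

# With add_constant, the fit matches R's lm(y ~ x):
# params[0] is the intercept, params[1] is the slope
with_intercept = sm.OLS(endog=y, exog=sm.add_constant(x)).fit()
print(no_intercept.params)    # one coefficient (slope only)
print(with_intercept.params)  # two coefficients: [intercept, slope]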

jeancochrane:

[Praise] Nice catch! This is a super annoying interface.

dfsnow self-assigned this Nov 27, 2024
dfsnow requested a review from jeancochrane November 27, 2024 17:33
dfsnow marked this pull request as ready for review November 27, 2024 17:33
dfsnow requested a review from wrridgeway as a code owner November 27, 2024 17:33
jeancochrane left a comment:

Some very tricky problems here, thanks for fixing 😅 I'm a little bit confused why we need to change the way we index into the return value from prb_model.conf_int(), but if you feel confident in that change, feel free to merge!


@@ -178,7 +180,7 @@ def prb(
ap.prb(ap.ccao_sample().estimate, ap.ccao_sample().sale_price)
"""
prb_model = _calculate_prb(estimate, sale_price)
- prb = float(prb_model.params[0])
+ prb = float(prb_model.params[1])

jeancochrane:

[Suggestion, non-blocking] Since the order of coefficients has proven a little bit tricky, perhaps we can persist a comment explaining our choice?

Suggested change:
- prb = float(prb_model.params[1])
+ # Get the coefficient from the OLS model.
+ # We select element 1, since element 0 is the intercept
+ prb = float(prb_model.params[1])

dfsnow (Member Author):

Done in 065dfcf!

wrridgeway (Member) left a comment:

Looks great to me, @jeancochrane already asked the interesting q's. Thank you both @dfsnow and @Damonamajor for catching this.

dfsnow merged commit 002ce64 into main Nov 27, 2024
14 checks passed
dfsnow deleted the dfsnow/fix-prb-formula branch November 27, 2024 18:50