linear regression with confidence bands #945

Merged
merged 16 commits into main from mbostock/regression
Jun 20, 2022

Conversation

mbostock
Member

@mbostock mbostock commented Jun 17, 2022

Inspired by https://observablehq.com/@toja/linear-regression-with-confidence-bands, using jStat for probability functions. Fixes #168.

[Screenshot: Screen Shot 2022-06-17 at 11 28 19 AM]

Plot.plot({
  marks: [
    Plot.dot(cars, {x: "weight (lb)", y: "economy (mpg)", r: 2}),
    Plot.linearRegressionY(cars, {x: "weight (lb)", y: "economy (mpg)", p: 0.01})
  ]
})
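
For readers who want the arithmetic behind the band, here is a minimal sketch of a least-squares fit with a pointwise confidence band around the fitted line. It is not taken from this PR's source: the function name linearRegressionBand and its signature are hypothetical, and the Student t critical value is passed in as tCritical (in practice it would come from a quantile function in a statistics library) rather than computed here.

```js
// Sketch only: least-squares fit of y on x with a pointwise confidence band.
// `linearRegressionBand` is a hypothetical helper, not part of Plot's API.
// `tCritical` is the Student t critical value for n - 2 degrees of freedom,
// supplied by the caller; this sketch does not compute it.
function linearRegressionBand(xs, ys, tCritical) {
  const n = xs.length;
  if (n !== ys.length || n < 3) throw new Error("need at least 3 paired points");

  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;

  // Sums of squares and cross-products around the means.
  let sxx = 0;
  let sxy = 0;
  for (let i = 0; i < n; ++i) {
    sxx += (xs[i] - meanX) ** 2;
    sxy += (xs[i] - meanX) * (ys[i] - meanY);
  }
  const slope = sxy / sxx;
  const intercept = meanY - slope * meanX;

  // Residual standard error with n - 2 degrees of freedom.
  let sse = 0;
  for (let i = 0; i < n; ++i) sse += (ys[i] - (intercept + slope * xs[i])) ** 2;
  const s = Math.sqrt(sse / (n - 2));

  // Returns the fitted value and band limits at a given x.
  return (x) => {
    const y = intercept + slope * x;
    const half = tCritical * s * Math.sqrt(1 / n + (x - meanX) ** 2 / sxx);
    return {y, lower: y - half, upper: y + half};
  };
}
```

The band is narrowest at the mean of x and widens toward the ends of the data, which is why the screenshot above shows an hourglass-shaped area around the line.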

TODO

  • linearRegressionX
  • remove sort: {channel: x}
  • combine the two marks into one?

@mbostock mbostock requested a review from Fil June 17, 2022 18:30
@Fil
Contributor

Fil commented Jun 17, 2022

A few remarks after minimal testing.

  • Should we throw an error, or at least warn, when p > 0.5? Or just ignore p in that case? (In any case, it should not reverse the area.)
  • We should not return a band if p is 0, null, or undefined (the current code returns a path with infinite coordinates). By the way, p: null could be the documented way to specify that the user doesn't want a band (maybe they want to draw the band below the dots, then the dots, then the line).
  • options.fill is overwritten by stroke, which prevents writing "fill: 'gray'" when we want the confidence band to be gray.
  • x extent: this PR draws the regression line over the whole width of the frame, but sometimes you want to limit it to the data's extent (or, I guess, the data extent ± a certain padding/inset) [this is what this older approach does, and it seems quite nice for this particular dataset]. It would be nice to have the option of limiting the line to the data extent (maybe with domain: "frame", "data", or [x1, x2]?), and another option to add (possibly negative) padding.

(I can explore variants in a PR to the PR, just raising them as comments for now)
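
To make the first two remarks concrete, here is a rough sketch of the kind of option handling being suggested. The helper name maybeTailProbability is hypothetical and this is not the code in the PR; it only illustrates treating null, undefined, and 0 as "no band" and rejecting values outside [0, 0.5).

```js
// Sketch only: `maybeTailProbability` is a hypothetical helper, not Plot's API.
// Returns null when no band should be drawn, otherwise the validated p.
function maybeTailProbability(p) {
  if (p == null) return null; // null or undefined: the user does not want a band
  const value = +p;
  // Written as a negated range check so that NaN is rejected too.
  if (!(value >= 0 && value < 0.5)) {
    throw new Error(`invalid p; not in [0, 0.5): ${p}`);
  }
  // p = 0 passes the range check but would give an infinitely wide band,
  // so it is treated as "no band" rather than as an error.
  return value === 0 ? null : value;
}
```

With this shape, the range named in the error message and the values actually rejected stay in agreement, and p >= 0.5 can never reverse the area.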

@mbostock
Member Author

Okay, I think I’ve done everything you’ve asked! How’s it now? 😄

@mbostock mbostock force-pushed the mbostock/regression branch from 3734eb8 to ca8cbf8 Compare June 18, 2022 13:46
@Fil
Contributor

Fil commented Jun 18, 2022

What should happen with p=0? The error message is inconsistent: "invalid p; not in [0, 0.5): 0" (at ca8cbf8). The value 0 lies inside the stated range [0, 0.5), yet it is rejected.

@Fil
Contributor

Fil commented Jun 18, 2022

In https://observablehq.com/@fil/plot-regression-945 I've added a quick comparison with ggplot2: p=0.05 (the default in this PR) corresponds to level=0.90 in ggplot2, and p=0.025 corresponds to level=0.95 (the ggplot2 default). I suggest changing the option name and default so that we get the same definitions as ggplot2.

It would also make the mark easier to document, because we wouldn't have to get into details about the Student t-test, cumulative distributions, etc. It seems easier to write "the band in which the linear relation lies with a confidence of 95%".

@Fil
Contributor

Fil commented Jun 18, 2022

I've added a bit of documentation, but I don't know how to describe the band simply in terms of p.

@mbostock
Member Author

I’ve replaced the p option with the more understandable ci option representing the confidence interval in [0, 1). This corresponds to ggplot2’s level option. (I think that “ci” is more self-describing than “level”.) Also I think there is a bug in Torben’s notebook, because a confidence interval of ci = 0.95 corresponds to the old p = 0.025, not 0.05. I’ve confirmed this with a visual comparison of ggplot2’s behavior using the mtcars notebook based on a blog post by Thomas Neitmann.
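
The correspondence between the two parameterizations discussed here can be sketched as follows. The function names are hypothetical and only illustrate the arithmetic: a two-sided confidence level ci leaves (1 - ci) / 2 in each tail, which is the quantity the old p option expressed.

```js
// Sketch only: hypothetical helpers illustrating the ci <-> p relationship
// discussed in this thread; they are not part of Plot's API.

// Convert a two-sided confidence level in [0, 1) to the per-tail probability.
function tailProbabilityFromCi(ci) {
  const value = +ci;
  if (!(value >= 0 && value < 1)) {
    throw new Error(`invalid ci; not in [0, 1): ${ci}`);
  }
  return (1 - value) / 2;
}

// Convert a per-tail probability back to the two-sided confidence level.
function ciFromTailProbability(p) {
  return 1 - 2 * p;
}
```

Under this mapping, ci = 0.95 gives a tail probability of about 0.025, and the old default of p = 0.05 gives a confidence level of about 0.90, matching the ggplot2 comparison earlier in the thread. With the renamed option, a call would look like Plot.linearRegressionY(cars, {x: "weight (lb)", y: "economy (mpg)", ci: 0.95}).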

@mbostock mbostock merged commit 6a442e3 into main Jun 20, 2022
@mbostock mbostock deleted the mbostock/regression branch June 20, 2022 23:30
@teecrow

teecrow commented Mar 24, 2023

Thanks for implementing confidence intervals - really fantastic work! As a note, to clear up any confusion (and I recognize you may have already figured this out, but in case anyone else might come across this): the confidence level C, often defaulting to 95%, corresponds to p=.05. Above the regression line, you have half your confidence interval, which is p=.025, and the same below the line. I suspect that's the cause of the uncertainty about .025 being mentioned in some places. And FWIW I think using "CI" instead of "p" was a good idea.

If it's useful for your documentation, or for anyone else who might be reading, there is an excellent paper with several different recommendations for simple but accurate ways to describe the CI:
Cumming, G., & Finch, S. (2005). Inference by Eye: Confidence Intervals and How to Read Pictures of Data. American Psychologist, 60(2), 170–180. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=79e79e30c2c0f35e1add65d0375ddc38e939385d

Development

Successfully merging this pull request may close these issues.

Linear regression?