Skip to content
This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

Histogram improvements #268

Merged
merged 8 commits into from
Sep 24, 2014
Merged

Histogram improvements #268

merged 8 commits into from
Sep 24, 2014

Conversation

wch
Copy link
Contributor

@wch wch commented Sep 17, 2014

This patchset is from @rpruim, and committed by me.

@rpruim
Copy link
Contributor

rpruim commented Sep 17, 2014

Hurrah!

Sorry for the round-about mechanism.

If you detect any bugs or just have questions, feel free to contact me.

—rjp

On Sep 17, 2014, at 2:44 PM, Winston Chang notifications@github.com wrote:

This patchset is from @rpruim, and committed by me.

You can merge this Pull Request by running

git pull https://github.com/rstudio/ggvis histograms
Or view, comment on, or merge it at:

#268

Commit Summary

new histogram stuff based on center/width/boundary
right = TRUE -> closed = "right"
File Changes

M .gitignore (1)
M NAMESPACE (3)
M R/compute_bin.R (346)
M R/layer_bars.R (2)
M R/layer_bins.R (34)
M R/scales.R (2)
A man/bin_vector.Rd (42)
M man/compute_bin.Rd (30)
M man/layer_bars.Rd (2)
M man/layer_histograms.Rd (35)
M man/scale_ordinal.Rd (2)
M tests/specs/layer.r (4)
M tests/specs/layer/freqpoly-grouped.json (6)
M tests/specs/layer/histogram.json (6)
M tests/specs/line/layer-line-nominal-x.json (76)
M tests/specs/line/layer-line.json (73)
M tests/specs/scales/datetime_hist.json (6)
M tests/testthat/test-compute-bin.r (51)
M vignettes/cookbook.Rmd (34)
M vignettes/ggvis-basics.Rmd (6)
M vignettes/overview.Rmd (2)
Patch Links:

https://github.com/rstudio/ggvis/pull/268.patch
https://github.com/rstudio/ggvis/pull/268.diff

Reply to this email directly or view it on GitHub.

@wch wch force-pushed the histograms branch 3 times, most recently from b9d349d to 719b51b Compare September 17, 2014 19:47
@wch
Copy link
Contributor Author

wch commented Sep 17, 2014

I've done some cleanup to the code. Some things remaining (I'll add more as I think of them):

  • Backward-compatibility with old binwidth and origin arguments. Even if the behavior isn't exactly the same, it should accept those arguments and print out a message/warning.
  • Fix remaining examples that use binwidth.

I don't really like the new default behavior, which is to have bars centered around the data, rather than aligned to multiples of width, starting from 0. For example:

mtcars %>%
  ggvis(x = ~wt) %>%
  layer_histograms(width = 1)

image

I think this is what most users would want:

mtcars %>%
  ggvis(x = ~wt) %>%
  layer_histograms(width = 1, boundary = 0)

image

Either that, or boundary = width/2.

@rpruim
Copy link
Contributor

rpruim commented Sep 18, 2014

I think I mostly agree. People specifying nice round values of width probably want nice round values of boundary as well.

But I don't think that either boundary=width or boundary=width/2 is always the better option. What if we took the following strategy: When only one of width and boundary (or center) is specified, then we compute the other and round it based on the value of the one specified. This assumes that if you give a nice value of width, for example, you probably want a nice value for boundary as well.

Something like this might be a reasonable first attempt:

prettify <- function( x, m, d=2 ) { (m/d) * ( (x + m/d/2) %/% (m/d)) }

We just need to watch out for setting width=0. Speaking of which, do we want to catch that situation and print a nice message or live with this:

> data.frame(x = rbinom(100, 25, .4)) %>% ggvis (x=~x) %>% layer_histograms(width=0)
 Error in seq.default(params$origin, max(x) + params$binwidth, params$binwidth) : 
  'from' cannot be NA, NaN or infinite 

@rpruim
Copy link
Contributor

rpruim commented Sep 18, 2014

Regarding backward compatibility, I guess I was thinking that the warning about the API potentially changing meant that it wasn't absolutely necessary to maintain backward compatibility at this point. An easy partial fix is to set boundary to origin when boundary is not specified but origin is. The behavior will be slightly different, but perhaps not in an annoying way. Setting width to binwidth when binwidth is specified and width is not is even easier. But doing this bloats the API until these are deprecated at some point.

@hadley
Copy link
Member

hadley commented Sep 18, 2014

I'm not terribly worried about backward compatibility for origin - it's pretty rarely used.

@rpruim
Copy link
Contributor

rpruim commented Sep 19, 2014

Last night I experimented with some prettification of width and boundary when these are auto-calculated. Based on the examples I've looked at so far, I'm pretty happy with the results. I'll take another look later today and should have have a pull-request after that. When width is specified, boundary will be either width or width/2, whichever is closer to what the old algorithm produced. When width is not specified, the calculated width is prettified to a value with 1 or 2 digits -- one if the leading digit is 1 or 2 (or the first two digits are 31) and 1 otherwise. This is roughly equivalent to a common rule about expressing standard errors used by natural scientists (and promoted by NIST, IIRC). It could be tuned to make 3.1 round to 3.0 instead, but the algorithm is more elegant the way it is currently written.

Both types of rounding are encapsulated in a new function called prettify(), so it would be easy to change the rounding behavior by tweaking that function. I suppose one could even add an argument to control the degree of prettification, but at that point, the user should probably just be setting width and boundary (or center) manually.

Because of the new rounding rules, the number of bins can vary from approximately 20 to 40. The actual number is calculated and reported along with the guessed width.

> mtcars %>% ggvis( x=~wt ) %>% layer_histograms(width=1)
> mtcars %>% ggvis( x=~wt ) %>% layer_histograms()
Guessing width = 0.2 # approximately range / 20

image
image

@wch
Copy link
Contributor Author

wch commented Sep 24, 2014

@rpruim I've done more cleaning up of the code, so if you could base your work off the rstudio/histograms branch, that would help a lot.

wch added a commit that referenced this pull request Sep 24, 2014
@wch wch merged commit dac26bc into master Sep 24, 2014
@rpruim
Copy link
Contributor

rpruim commented Sep 25, 2014

I've pulled in your stuff and merged in some of my prettification stuff. I note that for grouped data you say in the code

  # We want to use the same boundary and width across groups, so calculate
  # bin params here.

but your tests test that all the bin boundaries are the same. These are not identical conditions. In particular, if some groups have a narrower range, they will use fewer bins. Is it sufficient for the bins that overlap to be identical, or do you want each group to have identical sets of bins (with 0 counts for some groups if they don't have data in the bin)?

Based on my limited testing, the plots seem to work fine even if some layers have different bins. For now I will leave the tests failing as a reminder.

@rpruim
Copy link
Contributor

rpruim commented Sep 25, 2014

It looks like you have changed the behavior of bin_params.integer(). I had written it so that it intentionally chose integer widths and avoided having integers at the boundaries. Integer widths ensure that every bin contains the same number of integer values (avoiding artificial gaps or spikes in the histogram). Putting boundaries half-way between integers makes it unambiguous where each data value is binned, avoids biasing the histogram, and for relatively narrow ranges (especially when width = 1) makes for pleasing labeling and bin breaks. It is perhaps less pleasing when width is 10 and bins run from, say -0.5 to 9.5.

I was going to explore a rule for switching between the two approaches, but I thought I would check first: Do you just prefer integer boundaries (by default) with the data are integers?

Note: getting the boundaries to avoid integers can be achieved with boundary=0.5, but that may not be the optimal boundary.

@hadley
Copy link
Member

hadley commented Sep 25, 2014

When the data are integers, I definitely think we want boundaries half-way in between.

I think the bins should be exactly the same across groups.

@wch
Copy link
Contributor Author

wch commented Sep 25, 2014

@rpruim I made another change to the binning yesterday on the rstudio/master branch. I can see from the network graph that your master branch is off - instead of merging in rstudio/master, you should make it identical to rstudio/master with the following:

git checkout master
git reset --hard rstudio/master
git push -f origin

# Track rstudio/master in the future
git branch --set-upstream-to rstudio/master

Then in the future, when you're on your local master and do a git fetch/pull, it'll get the changes from rstudio/master. You can then base your work branches off of that.

Regarding the integer binning, I think being centered above the numbers is good. The present code does that when bin width is 1, as in:

data.frame(x=1:40) %>% ggvis(~x) %>% layer_histograms()
# Guessing width = 1 # range / 39

image

However, when the range is smaller, the behavior isn't ideal. In this example, the bins are too narrow:

data.frame(x=1:4) %>% ggvis(~x) %>% layer_histograms()
# Guessing width = 0.1 # range / 30

image

Also, when the bins are wider, they're not necessarily centered over the integer values. For example, with width 2, there would be bins for [1,3], (3,5], and so on. Bins with ranges [0.5, 2.5], (2.5, 4.5], etc. would be more centered over the values, but I'm not sure that would be preferable.

> data.frame(x=1:50) %>% ggvis(~x) %>% layer_histograms()
Guessing width = 2 # range / 25

image

Regarding the identical bins for groups, you're right that they're always exactly the same. Some groups may have more bins than others. But the bin boundaries should always be aligned. I'll add a test for that case.

@rpruim
Copy link
Contributor

rpruim commented Sep 25, 2014

Here are the two principles that I think should govern (default) binning integer data (and perhaps eventually be generalized to other granular data.)

  1. Bins should be integer width
    • This may result in many fewer than 30 bins, that doesn't bother me as much as having gaping holes. Why have a bin that can't possibly hold data?
    • Bins that are non-integer in width mean that some bins contain more integers than others, so even if there are not complete gaps, there will be bins that have more and fewer chances to be populated -- distorting the data view
  2. Boundaries should be integer + 0.5
    • Centers may or may not be integers depending on whether the width is even or odd. I am not bothered by bins that are don't have an integer at their center. Usually this is exactly what I want. For bins of width two, I want to clearly see which two integers are in each bin, so bins like (0.5, 2.5), (2.5,4.5) are perfect. (Note: I also don't have to worry about which side is closed.)

The only place I currently think one might fudge is on the integer + 0.5 boundary when the width is a bit larger than 1, but not very large. In that case, you might see in the histogram that the bins and the tick marks don't line up but are close enough that you might think they ought to line up. This is purely an aesthetic thing. When width is smaller, there are enough ticks that my proposed default looks fine. When the width is really big, you can't see a difference of 0.5 on the plot anyway (especially once the pixel size is bigger than that) so doesn't really matter what you do.

It's actually pretty hard to come up with examples in this "fudge area", so I say we just stick with the two principles above. It is easy to describe (easier than having some fudge rules thrown in), and it is based on good statistical properties of the resulting plots. Users can always set boundary=0 if they want the edges to be on integers.

My old integer binning code followed items 1 and 2. I think the revisions you did have killed that, but we could go back.

@rpruim
Copy link
Contributor

rpruim commented Sep 25, 2014

Regarding the state of my repo. I worked off of the histograms branch and then was going to update master late last night. Part way in, I realized that you had done things to binning in master too (I thought you were doing that all in the histograms branch). It was late, so I pushed off cleaning things up. But I'll do as you suggest and make my master match yours.

@rpruim
Copy link
Contributor

rpruim commented Sep 25, 2014

Regarding bins exactly the same across groups. I think this doesn't matter for the plots themselves, but it is probably a good idea -- especially if people look at the intermediate data.

I think making the bins identical means that we need to store something in our bin_params so we know how far to go. Currently we stop when the data (for a given group) stop. We could store the top boundary or the number of bins. Number of bins has the advantage that it can't be "wrong". Given any origin, width, and number of bins, you can always calculate the top boundary. But not all top boundaries are compatible with a given origin and width. So unless there are reasons to do things differently, I'd suggest storing the number of bins as well as the origin and width in bin_params. Together with closed this provides a complete description of the bins in canonical form.

@rpruim
Copy link
Contributor

rpruim commented Sep 25, 2014

I just noticed that one call to prettify() was lost when merging things last night. I just pushed that to my histograms branch. This now does some nice rounding of auto-generated widths and boundaries. The basic rule is

  • for width, use 2 sig figs for small leading digits (1, 2 and some 3's) and 1 sig fig for larger leading digits
  • one of boundary or center should be equal to width

The pretttify() function is more general than this and can be tuned to round more or less severely.

Here are some example auto-generated bin widths:

> data.frame(x= seq(1.5,18.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 0.6 # approximately range / 29
> data.frame(x= seq(1.5,28.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 1 # approximately range / 27
> data.frame(x= seq(1.5,38.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 1.2 # approximately range / 31
> data.frame(x= seq(1.5,48.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 1.6 # approximately range / 30
> data.frame(x= seq(1.5,88.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 3 # approximately range / 29
> data.frame(x= seq(1.5,98.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 4 # approximately range / 24
> data.frame(x= seq(1.5,90.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 3 # approximately range / 30
> data.frame(x= seq(1.5,94.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 3.2 # approximately range / 29

Of course, rounding width means that the number of bins won't be exactly 30, but the message indicates the approximate number of bins.

@wch wch deleted the histograms branch September 25, 2014 20:00
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants