Histogram improvements #268

wch · 2014-09-17T18:44:28Z

This patchset is from @rpruim, and committed by me.

rpruim · 2014-09-17T18:45:59Z

Hurrah!

Sorry for the round-about mechanism.

If you detect any bugs or just have questions, feel free to contact me.

—rjp

On Sep 17, 2014, at 2:44 PM, Winston Chang notifications@github.com wrote:

This patchset is from @rpruim, and committed by me.

You can merge this Pull Request by running

git pull https://github.com/rstudio/ggvis histograms
Or view, comment on, or merge it at:

#268

Commit Summary

new histogram stuff based on center/width/boundary
right = TRUE -> closed = "right"
File Changes

M .gitignore (1)
M NAMESPACE (3)
M R/compute_bin.R (346)
M R/layer_bars.R (2)
M R/layer_bins.R (34)
M R/scales.R (2)
A man/bin_vector.Rd (42)
M man/compute_bin.Rd (30)
M man/layer_bars.Rd (2)
M man/layer_histograms.Rd (35)
M man/scale_ordinal.Rd (2)
M tests/specs/layer.r (4)
M tests/specs/layer/freqpoly-grouped.json (6)
M tests/specs/layer/histogram.json (6)
M tests/specs/line/layer-line-nominal-x.json (76)
M tests/specs/line/layer-line.json (73)
M tests/specs/scales/datetime_hist.json (6)
M tests/testthat/test-compute-bin.r (51)
M vignettes/cookbook.Rmd (34)
M vignettes/ggvis-basics.Rmd (6)
M vignettes/overview.Rmd (2)
Patch Links:

https://github.com/rstudio/ggvis/pull/268.patch
https://github.com/rstudio/ggvis/pull/268.diff
—
Reply to this email directly or view it on GitHub.

wch · 2014-09-17T20:04:27Z

I've done some cleanup to the code. Some things remaining (I'll add more as I think of them):

Backward-compatibility with old binwidth and origin arguments. Even if the behavior isn't exactly the same, it should accept those arguments and print out a message/warning.
Fix remaining examples that use binwidth.

I don't really like the new default behavior, which is to have bars centered around the data, rather than aligned to multiples of width, starting from 0. For example:

mtcars %>%
  ggvis(x = ~wt) %>%
  layer_histograms(width = 1)

I think this is what most users would want:

mtcars %>%
  ggvis(x = ~wt) %>%
  layer_histograms(width = 1, boundary = 0)

Either that, or boundary = width/2.

rpruim · 2014-09-18T01:05:26Z

I think I mostly agree. People specifying nice round values of width probably want nice round values of boundary as well.

But I don't think that either boundary=width or boundary=width/2 is always the better option. What if we took the following strategy: When only one of width and boundary (or center) is specified, then we compute the other and round it based on the value of the one specified. This assumes that if you give a nice value of width, for example, you probably want a nice value for boundary as well.

Something like this might be a reasonable first attempt:

prettify <- function( x, m, d=2 ) { (m/d) * ( (x + m/d/2) %/% (m/d)) }

We just need to watch out for setting width=0. Speaking of which, do we want to catch that situation and print a nice message or live with this:

> data.frame(x = rbinom(100, 25, .4)) %>% ggvis (x=~x) %>% layer_histograms(width=0)
 Error in seq.default(params$origin, max(x) + params$binwidth, params$binwidth) : 
  'from' cannot be NA, NaN or infinite

rpruim · 2014-09-18T01:15:45Z

Regarding backward compatibility, I guess I was thinking that the warning about the API potentially changing meant that it wasn't absolutely necessary to maintain backward compatibility at this point. An easy partial fix is to set boundary to origin when boundary is not specified but origin is. The behavior will be slightly different, but perhaps not in an annoying way. Setting width to binwidth when binwidth is specified and width is not is even easier. But doing this bloats the API until these are deprecated at some point.

hadley · 2014-09-18T09:50:18Z

I'm not terribly worried about backward compatibility for origin - it's pretty rarely used.

rpruim · 2014-09-19T11:39:12Z

Last night I experimented with some prettification of width and boundary when these are auto-calculated. Based on the examples I've looked at so far, I'm pretty happy with the results. I'll take another look later today and should have have a pull-request after that. When width is specified, boundary will be either width or width/2, whichever is closer to what the old algorithm produced. When width is not specified, the calculated width is prettified to a value with 1 or 2 digits -- one if the leading digit is 1 or 2 (or the first two digits are 31) and 1 otherwise. This is roughly equivalent to a common rule about expressing standard errors used by natural scientists (and promoted by NIST, IIRC). It could be tuned to make 3.1 round to 3.0 instead, but the algorithm is more elegant the way it is currently written.

Both types of rounding are encapsulated in a new function called prettify(), so it would be easy to change the rounding behavior by tweaking that function. I suppose one could even add an argument to control the degree of prettification, but at that point, the user should probably just be setting width and boundary (or center) manually.

Because of the new rounding rules, the number of bins can vary from approximately 20 to 40. The actual number is calculated and reported along with the guessed width.

> mtcars %>% ggvis( x=~wt ) %>% layer_histograms(width=1)
> mtcars %>% ggvis( x=~wt ) %>% layer_histograms()
Guessing width = 0.2 # approximately range / 20

wch · 2014-09-24T05:00:06Z

@rpruim I've done more cleaning up of the code, so if you could base your work off the rstudio/histograms branch, that would help a lot.

Histogram improvements

rpruim · 2014-09-25T04:07:07Z

I've pulled in your stuff and merged in some of my prettification stuff. I note that for grouped data you say in the code

  # We want to use the same boundary and width across groups, so calculate
  # bin params here.

but your tests test that all the bin boundaries are the same. These are not identical conditions. In particular, if some groups have a narrower range, they will use fewer bins. Is it sufficient for the bins that overlap to be identical, or do you want each group to have identical sets of bins (with 0 counts for some groups if they don't have data in the bin)?

Based on my limited testing, the plots seem to work fine even if some layers have different bins. For now I will leave the tests failing as a reminder.

rpruim · 2014-09-25T04:32:21Z

It looks like you have changed the behavior of bin_params.integer(). I had written it so that it intentionally chose integer widths and avoided having integers at the boundaries. Integer widths ensure that every bin contains the same number of integer values (avoiding artificial gaps or spikes in the histogram). Putting boundaries half-way between integers makes it unambiguous where each data value is binned, avoids biasing the histogram, and for relatively narrow ranges (especially when width = 1) makes for pleasing labeling and bin breaks. It is perhaps less pleasing when width is 10 and bins run from, say -0.5 to 9.5.

I was going to explore a rule for switching between the two approaches, but I thought I would check first: Do you just prefer integer boundaries (by default) with the data are integers?

Note: getting the boundaries to avoid integers can be achieved with boundary=0.5, but that may not be the optimal boundary.

hadley · 2014-09-25T11:39:34Z

When the data are integers, I definitely think we want boundaries half-way in between.

I think the bins should be exactly the same across groups.

wch · 2014-09-25T14:45:19Z

@rpruim I made another change to the binning yesterday on the rstudio/master branch. I can see from the network graph that your master branch is off - instead of merging in rstudio/master, you should make it identical to rstudio/master with the following:

git checkout master
git reset --hard rstudio/master
git push -f origin

# Track rstudio/master in the future
git branch --set-upstream-to rstudio/master

Then in the future, when you're on your local master and do a git fetch/pull, it'll get the changes from rstudio/master. You can then base your work branches off of that.

Regarding the integer binning, I think being centered above the numbers is good. The present code does that when bin width is 1, as in:

data.frame(x=1:40) %>% ggvis(~x) %>% layer_histograms()
# Guessing width = 1 # range / 39

However, when the range is smaller, the behavior isn't ideal. In this example, the bins are too narrow:

data.frame(x=1:4) %>% ggvis(~x) %>% layer_histograms()
# Guessing width = 0.1 # range / 30

Also, when the bins are wider, they're not necessarily centered over the integer values. For example, with width 2, there would be bins for [1,3], (3,5], and so on. Bins with ranges [0.5, 2.5], (2.5, 4.5], etc. would be more centered over the values, but I'm not sure that would be preferable.

> data.frame(x=1:50) %>% ggvis(~x) %>% layer_histograms()
Guessing width = 2 # range / 25

Regarding the identical bins for groups, you're right that they're always exactly the same. Some groups may have more bins than others. But the bin boundaries should always be aligned. I'll add a test for that case.

rpruim · 2014-09-25T15:25:23Z

Here are the two principles that I think should govern (default) binning integer data (and perhaps eventually be generalized to other granular data.)

Bins should be integer width
- This may result in many fewer than 30 bins, that doesn't bother me as much as having gaping holes. Why have a bin that can't possibly hold data?
- Bins that are non-integer in width mean that some bins contain more integers than others, so even if there are not complete gaps, there will be bins that have more and fewer chances to be populated -- distorting the data view
Boundaries should be integer + 0.5
- Centers may or may not be integers depending on whether the width is even or odd. I am not bothered by bins that are don't have an integer at their center. Usually this is exactly what I want. For bins of width two, I want to clearly see which two integers are in each bin, so bins like (0.5, 2.5), (2.5,4.5) are perfect. (Note: I also don't have to worry about which side is closed.)

The only place I currently think one might fudge is on the integer + 0.5 boundary when the width is a bit larger than 1, but not very large. In that case, you might see in the histogram that the bins and the tick marks don't line up but are close enough that you might think they ought to line up. This is purely an aesthetic thing. When width is smaller, there are enough ticks that my proposed default looks fine. When the width is really big, you can't see a difference of 0.5 on the plot anyway (especially once the pixel size is bigger than that) so doesn't really matter what you do.

It's actually pretty hard to come up with examples in this "fudge area", so I say we just stick with the two principles above. It is easy to describe (easier than having some fudge rules thrown in), and it is based on good statistical properties of the resulting plots. Users can always set boundary=0 if they want the edges to be on integers.

My old integer binning code followed items 1 and 2. I think the revisions you did have killed that, but we could go back.

rpruim · 2014-09-25T15:27:21Z

Regarding the state of my repo. I worked off of the histograms branch and then was going to update master late last night. Part way in, I realized that you had done things to binning in master too (I thought you were doing that all in the histograms branch). It was late, so I pushed off cleaning things up. But I'll do as you suggest and make my master match yours.

rpruim · 2014-09-25T15:33:51Z

Regarding bins exactly the same across groups. I think this doesn't matter for the plots themselves, but it is probably a good idea -- especially if people look at the intermediate data.

I think making the bins identical means that we need to store something in our bin_params so we know how far to go. Currently we stop when the data (for a given group) stop. We could store the top boundary or the number of bins. Number of bins has the advantage that it can't be "wrong". Given any origin, width, and number of bins, you can always calculate the top boundary. But not all top boundaries are compatible with a given origin and width. So unless there are reasons to do things differently, I'd suggest storing the number of bins as well as the origin and width in bin_params. Together with closed this provides a complete description of the bins in canonical form.

rpruim · 2014-09-25T15:56:08Z

I just noticed that one call to prettify() was lost when merging things last night. I just pushed that to my histograms branch. This now does some nice rounding of auto-generated widths and boundaries. The basic rule is

for width, use 2 sig figs for small leading digits (1, 2 and some 3's) and 1 sig fig for larger leading digits
one of boundary or center should be equal to width

The pretttify() function is more general than this and can be tuned to round more or less severely.

Here are some example auto-generated bin widths:

> data.frame(x= seq(1.5,18.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 0.6 # approximately range / 29
> data.frame(x= seq(1.5,28.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 1 # approximately range / 27
> data.frame(x= seq(1.5,38.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 1.2 # approximately range / 31
> data.frame(x= seq(1.5,48.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 1.6 # approximately range / 30
> data.frame(x= seq(1.5,88.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 3 # approximately range / 29
> data.frame(x= seq(1.5,98.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 4 # approximately range / 24
> data.frame(x= seq(1.5,90.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 3 # approximately range / 30
> data.frame(x= seq(1.5,94.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 3.2 # approximately range / 29

Of course, rounding width means that the number of bins won't be exactly 30, but the message indicates the approximate number of bins.

wch force-pushed the histograms branch 3 times, most recently from b9d349d to 719b51b Compare September 17, 2014 19:47

wch force-pushed the histograms branch from 719b51b to a6f8851 Compare September 17, 2014 20:32

wch force-pushed the master branch from 70336bf to 62306ee Compare September 18, 2014 02:35

rpruim and others added 3 commits September 23, 2014 14:36

new histogram stuff based on center/width/boundary

0224120

right = TRUE -> closed = "right"

5681183

Revert extraneous changes and style fixes

e644c25

wch force-pushed the histograms branch from a6f8851 to 5b3d97c Compare September 24, 2014 04:44

wch added 5 commits September 23, 2014 23:57

Simplify compute_bin code

5c96ef4

Revamp compute_bin tests

542fbf8

Re-document

bd88dce

Deprecate binwidth argument

7e138ad

Replace remaining 'binwidth' with 'width'

ef55e00

wch force-pushed the histograms branch from bbfeb7b to ef55e00 Compare September 24, 2014 04:58

wch added a commit that referenced this pull request Sep 24, 2014

Merge pull request #268 from rstudio/histograms

dac26bc

Histogram improvements

wch merged commit dac26bc into master Sep 24, 2014

This was referenced Sep 24, 2014

compute_bin: need better bin alignment #275

Closed

should we do something like hist(..., include.lowest=TRUE)? should it be the default? #265

Closed

wch deleted the histograms branch September 25, 2014 20:00

wch mentioned this pull request Sep 26, 2014

histograms bins via width/center/boundary #179

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Histogram improvements #268

Histogram improvements #268

wch commented Sep 17, 2014

rpruim commented Sep 17, 2014

wch commented Sep 17, 2014

rpruim commented Sep 18, 2014

rpruim commented Sep 18, 2014

hadley commented Sep 18, 2014

rpruim commented Sep 19, 2014

wch commented Sep 24, 2014

rpruim commented Sep 25, 2014

rpruim commented Sep 25, 2014

hadley commented Sep 25, 2014

wch commented Sep 25, 2014

rpruim commented Sep 25, 2014

rpruim commented Sep 25, 2014

rpruim commented Sep 25, 2014

rpruim commented Sep 25, 2014

Histogram improvements #268

Histogram improvements #268

Conversation

wch commented Sep 17, 2014

rpruim commented Sep 17, 2014

wch commented Sep 17, 2014

rpruim commented Sep 18, 2014

rpruim commented Sep 18, 2014

hadley commented Sep 18, 2014

rpruim commented Sep 19, 2014

wch commented Sep 24, 2014

rpruim commented Sep 25, 2014

rpruim commented Sep 25, 2014

hadley commented Sep 25, 2014

wch commented Sep 25, 2014

rpruim commented Sep 25, 2014

rpruim commented Sep 25, 2014

rpruim commented Sep 25, 2014

rpruim commented Sep 25, 2014