Hurrah! Sorry for the round-about mechanism. If you detect any bugs or just have questions, feel free to contact me. —rjp On Sep 17, 2014, at 2:44 PM, Winston Chang notifications@github.com wrote:
|
b9d349d to 719b51b
I've done some cleanup to the code. Some things remaining (I'll add more as I think of them):
I don't really like the new default behavior, which is to have bars centered around the data, rather than aligned to multiples of the bin width:
mtcars %>%
  ggvis(x = ~wt) %>%
  layer_histograms(width = 1)
I think this is what most users would want:
mtcars %>%
  ggvis(x = ~wt) %>%
  layer_histograms(width = 1, boundary = 0)
Either that, or |
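To make the difference between the two calls above concrete, here is a minimal base R sketch of how a boundary setting shifts the bin edges. The helper `bin_breaks()` is hypothetical and written only for illustration; it is not the ggvis implementation, and it assumes the boundary simply fixes where one edge would fall.

```r
# Hypothetical helper, not part of ggvis: bin edges of a fixed width,
# positioned so that `boundary` would fall exactly on an edge.
bin_breaks <- function(x, width, boundary = 0) {
  rng <- range(x, na.rm = TRUE)
  lo <- boundary + floor((rng[1] - boundary) / width) * width
  hi <- boundary + ceiling((rng[2] - boundary) / width) * width
  seq(lo, hi, by = width)
}

wt <- mtcars$wt
aligned <- bin_breaks(wt, width = 1, boundary = 0)    # edges on whole numbers
shifted <- bin_breaks(wt, width = 1, boundary = 0.5)  # edges half-way between
print(aligned)
print(shifted)
```

With `boundary = 0` the edges land on multiples of the width, which is the behavior argued for in this comment.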
I think I mostly agree. People specifying nice round values of But I don't think that either Something like this might be a reasonable first attempt:
We just need to watch out for setting
|
Regarding backward compatibility, I guess I was thinking that the warning about the API potentially changing meant that it wasn't absolutely necessary to maintain backward compatibility at this point. An easy partial fix is to set |
I'm not terribly worried about backward compatibility for origin - it's pretty rarely used. |
Last night I experimented with some prettification of Both types of rounding are encapsulated in a new function called Because of the new rounding rules, the number of bins can vary from approximately 20 to 40. The actual number is calculated and reported along with the guessed width:
> mtcars %>% ggvis( x=~wt ) %>% layer_histograms(width=1)
> mtcars %>% ggvis( x=~wt ) %>% layer_histograms()
Guessing width = 0.2 # approximately range / 20 |
@rpruim I've done more cleaning up of the code, so if you could base your work off the rstudio/histograms branch, that would help a lot. |
I've pulled in your stuff and merged in some of my prettification stuff. I note that for grouped data you say in the code
but your tests test that all the bin boundaries are the same. These are not identical conditions. In particular, if some groups have a narrower range, they will use fewer bins. Is it sufficient for the bins that overlap to be identical, or do you want each group to have identical sets of bins (with 0 counts for some groups if they don't have data in the bin)? Based on my limited testing, the plots seem to work fine even if some layers have different bins. For now I will leave the tests failing as a reminder. |
It looks like you have changed the behavior of I was going to explore a rule for switching between the two approaches, but I thought I would check first: Do you just prefer integer boundaries (by default) when the data are integers? Note: getting the boundaries to avoid integers can be achieved with |
When the data are integers, I definitely think we want boundaries half-way in between. I think the bins should be exactly the same across groups. |
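A small base R sketch of the "boundaries half-way in between" idea for integer data follows. The function `default_boundary()` and its rule (0.5 for whole-number data, 0 otherwise) are assumptions made for illustration, not code from this pull request.

```r
# Hypothetical rule, not the ggvis code: whole-number data get a boundary
# of 0.5 so bin edges fall half-way between values; other data use 0.
default_boundary <- function(x) {
  if (all(x == round(x), na.rm = TRUE)) 0.5 else 0
}

x <- c(1, 2, 2, 3, 4)
width <- 1
b <- default_boundary(x)
lo <- b + floor((min(x) - b) / width) * width
hi <- b + ceiling((max(x) - b) / width) * width
breaks <- seq(lo, hi, by = width)
counts <- table(cut(x, breaks))
print(breaks)
print(counts)
```

With width 1, each integer then sits in the middle of its own bin, so no value ever lies on an edge.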
@rpruim I made another change to the binning yesterday on the rstudio/master branch. I can see from the network graph that your master branch is off - instead of merging in rstudio/master, you should make it identical to rstudio/master with the following:
Then in the future, when you're on your local master and do a git fetch/pull, it'll get the changes from rstudio/master. You can then base your work branches off of that.
Regarding the integer binning, I think being centered above the numbers is good. The present code does that when bin width is 1, as in:
data.frame(x=1:40) %>% ggvis(~x) %>% layer_histograms()
# Guessing width = 1 # range / 39
However, when the range is smaller, the behavior isn't ideal. In this example, the bins are too narrow:
data.frame(x=1:4) %>% ggvis(~x) %>% layer_histograms()
# Guessing width = 0.1 # range / 30
Also, when the bins are wider, they're not necessarily centered over the integer values. For example, with width 2, there would be bins for [1,3], (3,5], and so on. Bins with ranges [0.5, 2.5], (2.5, 4.5], etc. would be more centered over the values, but I'm not sure that would be preferable.
> data.frame(x=1:50) %>% ggvis(~x) %>% layer_histograms()
Guessing width = 2 # range / 25
Regarding the identical bins for groups, you're right that they're not always exactly the same. Some groups may have more bins than others. But the bin boundaries should always be aligned. I'll add a test for that case. |
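One way to express the "boundaries should always be aligned" condition as a check is sketched below in base R. The function `breaks_aligned()` and its tolerance argument are hypothetical; this is not the test that was added to the package.

```r
# Hypothetical check: two sets of breaks are aligned when every break is a
# whole number of widths away from a common reference break.
breaks_aligned <- function(a, b, width, tol = 1e-8) {
  offsets <- (c(a, b) - a[1]) / width
  all(abs(offsets - round(offsets)) < tol)
}

wide   <- seq(1, 6, by = 1)      # group with a wide range
narrow <- seq(2, 4, by = 1)      # fewer bins, same grid
offset <- seq(1.5, 4.5, by = 1)  # same width, shifted grid

same_grid    <- breaks_aligned(wide, narrow, width = 1)
shifted_grid <- breaks_aligned(wide, offset, width = 1)
print(c(same_grid, shifted_grid))
```

Note that the narrower group passes even though it has fewer bins, which is the distinction between "aligned" and "identical" raised earlier in the thread.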
Here are the two principles that I think should govern (default) binning integer data (and perhaps eventually be generalized to other granular data.)
The only place I currently think one might fudge is on the integer + 0.5 boundary when the width is a bit larger than 1, but not very large. In that case, you might see in the histogram that the bins and the tick marks don't line up but are close enough that you might think they ought to line up. This is purely an aesthetic thing. When the width is smaller, there are enough ticks that my proposed default looks fine. When the width is really big, you can't see a difference of 0.5 on the plot anyway (especially once the pixel size is bigger than that), so it doesn't really matter what you do.
It's actually pretty hard to come up with examples in this "fudge area", so I say we just stick with the two principles above. It is easy to describe (easier than having some fudge rules thrown in), and it is based on good statistical properties of the resulting plots. Users can always set My old integer binning code followed items 1 and 2. I think the revisions you did have killed that, but we could go back. |
Regarding the state of my repo. I worked off of the histograms branch and then was going to update master late last night. Part way in, I realized that you had done things to binning in master too (I thought you were doing that all in the histograms branch). It was late, so I pushed off cleaning things up. But I'll do as you suggest and make my master match yours. |
Regarding bins exactly the same across groups. I think this doesn't matter for the plots themselves, but it is probably a good idea -- especially if people look at the intermediate data. I think making the bins identical means that we need to store something in our |
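If identical bins per group are wanted, one possible approach is to compute the breaks once from the pooled data and count every group against them, so that empty bins show up as zero counts. The sketch below uses only base R; `shared_bin_counts()` is a hypothetical name, and nothing here reflects how ggvis stores or passes its binning parameters.

```r
# Hypothetical sketch: compute breaks once from the full data, then count
# each group against those shared breaks (empty bins become zeros).
shared_bin_counts <- function(x, group, width, boundary = 0) {
  lo <- boundary + floor((min(x) - boundary) / width) * width
  hi <- boundary + ceiling((max(x) - boundary) / width) * width
  breaks <- seq(lo, hi, by = width)
  bins <- cut(x, breaks, include.lowest = TRUE)
  table(group, bins)
}

counts <- shared_bin_counts(mtcars$wt, mtcars$cyl, width = 1)
print(counts)
```

Every row of the result has the same set of bins, so anyone inspecting the intermediate data sees one consistent grid across groups.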
I just noticed that one call to
The Here are some example auto-generated bin widths:
> data.frame(x= seq(1.5,18.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 0.6 # approximately range / 29
> data.frame(x= seq(1.5,28.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 1 # approximately range / 27
> data.frame(x= seq(1.5,38.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 1.2 # approximately range / 31
> data.frame(x= seq(1.5,48.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 1.6 # approximately range / 30
> data.frame(x= seq(1.5,88.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 3 # approximately range / 29
> data.frame(x= seq(1.5,98.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 4 # approximately range / 24
> data.frame(x= seq(1.5,90.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 3 # approximately range / 30
> data.frame(x= seq(1.5,94.9, by=0.1)) %>% ggvis(x = ~x) %>% layer_histograms( )
Guessing width = 3.2 # approximately range / 29
Of course, rounding width means that the number of bins won't be exactly 30, but the message indicates the approximate number of bins. |
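For readers who want to see the general shape of "round the width, then report the resulting bin count", here is a rough base R sketch. The function `guess_width()` and its set of nice multipliers (1, 2, 5, 10) are assumptions chosen for simplicity. The rounding rule used in this patchset is not shown in the thread and clearly differs, since it produces widths such as 1.2 and 1.6.

```r
# Hypothetical sketch, not the ggvis implementation: round range / 30 to a
# "nice" width and report how many bins that width actually yields.
# Assumes x is not constant (range > 0).
guess_width <- function(x, target_bins = 30, nice = c(1, 2, 5, 10)) {
  rng <- diff(range(x, na.rm = TRUE))
  raw <- rng / target_bins
  magnitude <- 10^floor(log10(raw))
  candidates <- nice * magnitude
  width <- candidates[which.min(abs(candidates - raw))]
  message("Guessing width = ", width,
          " # approximately range / ", round(rng / width))
  width
}

w1 <- guess_width(seq(1.5, 28.9, by = 0.1))
w2 <- guess_width(1:40)
```

Because the width is snapped to a nice value, the bin count drifts away from the target, which is why the message reports the count that results.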
This patchset is from @rpruim, and committed by me.