RFC: add Histogram types #6601

Closed
wants to merge 3 commits into from

simonbyrne
Contributor

This adds explicit types for histograms. The main advantage is that we can define methods that operate on histograms, notably plot. This approach closely mirrors existing functionality, but there are some alternatives that may be worth thinking about:

  • Use hist to create both 1d and 2d versions: this could be done using tuples (similar to the kde function in KernelDensity.jl).
  • Create a general type for an N-dimensional histogram. For example
immutable Histogram{N,E,T}
   edges::NTuple{N,E}
   weights::Array{T,N}
end
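
As a rough sketch, the proposed N-dimensional type can be written in current Julia syntax, where `struct` has replaced the `immutable` keyword. The field names come from the proposal above; the example values are illustrative assumptions only.

```julia
# Sketch of the proposed N-dimensional histogram type (modern syntax).
struct Histogram{N,E,T}
    edges::NTuple{N,E}     # one edge collection per dimension
    weights::Array{T,N}    # one weight per bin
end

# 1-d: the three edges 0, 1, 2 define two bins.
h1 = Histogram((0:2,), [1, 0])

# 2-d: 2 bins along the first dimension, 3 along the second.
h2 = Histogram((0:2, 0:3), zeros(Int, 2, 3))
```

Note that `NTuple{N,E}` requires every dimension to use the same edge type `E`, so this sketch cannot mix, say, a `UnitRange` with a `Vector` of edges.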

@johnmyleswhite
Member

+1 for a Histogram type

@StefanKarpinski
Member

+1 for a Histogram type and to @jiahao's suggestion.

@simonbyrne
Contributor Author

Okay, there's now a single Histogram type, and a single hist function, which accepts tuples as arguments for higher-order histograms.

The question is now what to do with matrix arguments: if size(X) == (m,n), should hist(X):

  • create an array of Histogram{T,1} objects for each column (this is analogous to current hist behaviour), or
  • create a single Histogram{T,n} object (similar to current hist2d behaviour)?

@simonbyrne
Contributor Author

One other thing to think about: which way should we round (i.e. should the bins be upper- or lower-inclusive)?

At the moment, we round down, i.e. with an edge vector 0:2, an observation of 1 will go into the (0,1] bin:

julia> hist([1],0:2)
Histogram{Int64,1,(UnitRange{Int64},)}((0:2,),[1,0])

This is the same as the R default (which has the option for either), but the opposite of numpy (which doesn't allow a choice).

The historical reason for this choice was that, pre-FloatRange, it reduced the number of odd things that would happen when using ranges like 0:0.1:1 (since 0.1 > 1//10). This problem can be seen in numpy:

In [8]: np.histogram([0.1,0.2,0.3,0.4],10,(0.0,1.0))
Out[8]: 
(array([0, 1, 2, 0, 1, 0, 0, 0, 0, 0]),
 array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ]))

However, this shouldn't be a problem now: either way should give reasonable answers.
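
The floating-point effect described above can be checked directly. This is a small sketch rather than code from the PR, and the edge vector is an illustrative assumption that mimics edges computed by multiplication instead of written as literals.

```julia
# The Float64 nearest to 0.1 is slightly larger than the exact rational 1/10.
above_tenth = Rational(0.1) > 1//10

# An edge computed as 3 * 0.1 ends up just above the literal 0.3 ...
edge_exceeds = 3 * 0.1 > 0.3

# ... so with left-closed bins, the observation 0.3 falls below that edge
# and is assigned to the bin that starts at 0.2 (index 3).
edges = [0.0, 0.1, 0.2, 3 * 0.1, 0.4]
bin = searchsortedlast(edges, 0.3)
```

This matches the numpy output above, where 0.2 and 0.3 are counted in the same bin.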

If we want to allow the option of either, a neat way to do this would be to add another type parameter for RoundUp/RoundDown, and then we can dispatch on searchsortedfirst/searchsortedlast as appropriate.
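
A minimal sketch of that dispatch idea, assuming Base's `RoundingMode{:Up}` and `RoundingMode{:Down}` types (the types of `RoundUp` and `RoundDown`) are used as the selector. The helper name `binindex` is hypothetical and not part of this PR.

```julia
# Hypothetical helper: choose the search function by rounding mode.
# RoundDown: an observation on an edge goes to the bin below it, i.e. (a, b].
binindex(edges, x, ::RoundingMode{:Down}) = searchsortedfirst(edges, x) - 1
# RoundUp: an observation on an edge goes to the bin above it, i.e. [a, b).
binindex(edges, x, ::RoundingMode{:Up}) = searchsortedlast(edges, x)

i_down = binindex(0:2, 1, RoundDown)   # first bin, (0, 1]
i_up   = binindex(0:2, 1, RoundUp)     # second bin, [1, 2)
```

Observations strictly inside a bin get the same index either way; only values that coincide with an edge are affected.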

@nalimilan
Member

I'm not sure it matters much, but it seems to me it's more common to use intervals closed on the left and open on the right, i.e. [a, b). This is useful when you have only positive values, as it means 0 will be included without having to artificially pass a negative value. And integer age ranges (for example) are usually defined as 20-29, 30-39, etc.

@simonbyrne
Contributor Author

I've added an optional interval argument that is either :left or :right, indicating the closed side.

However, I think it makes sense to keep the current behaviour (interval=:right) as the default, since that matches nicely with 1-based indexing:

julia> hist(1:100;interval=:right)
Histogram{Int64,1,(StepRange{Int64,Int64},)}((0:20:100,),[20,20,20,20,20],:right)

julia> hist(1:100;interval=:left)
Histogram{Int64,1,(StepRange{Int64,Int64},)}((0:20:120,),[19,20,20,20,20,1],:left)
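
The two sets of counts above can be reproduced with a small standalone counting function. This is only a sketch: `bincounts` and its `closed` keyword are hypothetical names, and the edges are passed explicitly instead of being chosen automatically as `hist` does.

```julia
# Hypothetical helper: count observations per bin for a given closed side.
function bincounts(xs, edges; closed::Symbol = :right)
    counts = zeros(Int, length(edges) - 1)
    for x in xs
        i = closed == :right ? searchsortedfirst(edges, x) - 1 :
                               searchsortedlast(edges, x)
        # Ignore observations that fall outside the edges.
        1 <= i <= length(counts) && (counts[i] += 1)
    end
    return counts
end

right_counts = bincounts(1:100, 0:20:100; closed = :right)  # [20, 20, 20, 20, 20]
left_counts  = bincounts(1:100, 0:20:120; closed = :left)   # [19, 20, 20, 20, 20, 1]
```

With left-closed bins the value 100 sits on the last edge of 0:20:100, which is why an extra bin up to 120 is needed to hold it.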

@ViralBShah
Member

+1 for moving this into StatsBase. Only downside is that many users may expect hist to be part of Base, coming from Matlab, where it is standard. I am assuming that R users get such functionality without jumping through package installation hoops.

is = if h.interval == :right
map((edge, x) -> searchsortedfirst(edge,x) - 1, h.edges, xs)
else
map(searchsortedlast, h.edges, xs)
Contributor Author

Perhaps this is an argument for having searchsortedfirst return 1 less by default? cf. #5664

@simonbyrne
Contributor Author

@ViralBShah That would perhaps be the easiest in terms of upgrade path, as we could deprecate the existing functionality gradually without breaking too much.

@nalimilan
Member

@simonbyrne As you like, though I'm not sure the comparison really has practical applications. But wouldn't it be more explicit to call the argument closed rather than interval? In R I can never remember in what direction it works.

@simonbyrne
Contributor Author

@nalimilan closed does make more sense.

@BobPortmann
Contributor

Not a big thing, but wouldn't counts make more sense than weights for the actual histogram field?

@simonbyrne
Contributor Author

@BobPortmann I had in mind that the same structure could be used for weighted observations, or could be normalised to one (which is why I didn't restrict the field to be Int).
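
As an illustration of why the field is not restricted to Int, integer counts can be normalised into floating-point weights that sum to one. The counts below are taken from the `interval=:left` example earlier in the thread; the variable names are assumptions.

```julia
# Integer counts from a histogram of 1:100 with left-closed bins.
counts = [19, 20, 20, 20, 20, 1]

# Normalising produces Float64 weights, which an Int-only field could not hold.
weights = counts ./ sum(counts)
```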

@simonbyrne
Contributor Author

So I take it that no one is opposed to moving this to StatsBase.jl?

@ViralBShah
Member

I would think so.

@johnmyleswhite
Member

+1 for putting histograms in StatsBase.jl

@lindahua
Contributor

Late to this party. Agree wholeheartedly with the move.

@prcastro
Contributor

Why isn't this closed?

@pao
Member

pao commented May 16, 2014

Because no one got to it. @simonbyrne want to do the honors, or is there something else here keeping this open?

@ViralBShah
Member

Bump.

@simonbyrne
Contributor Author

Sorry, been away. Yep, now in StatsBase.

10 participants