RFC: add Histogram types #6601

simonbyrne · 2014-04-22T13:51:16Z

This adds explicit types for histograms. The main advantage is that we can define methods that operate on histograms, notably plot. This approach closely mirrors existing functionality, but there are some alternatives that may be worth thinking about:

Use hist to create both 1d and 2d versions: this could be done using tuples (similar to the kde function in KernelDensity.jl).
Create a general type for an N-dimensional histogram. For example

immutable Histogram{N,E,T}
   edges::NTuple{N,E}
   weights::Array{T,N}
end

Methods for combining multiple histograms (for example, with distributed arrays): we could overload + for this?
Move all this out of Base into StatsBase.jl (see Histograms JuliaStats/StatsBase.jl#49)
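As a rough illustration of the N-dimensional type and the `+` overload floated above, here is a minimal sketch in current Julia syntax. It assumes `struct` in place of the `immutable` keyword, and the merge rule (identical edges, elementwise sum of weights) is a hypothetical choice, not something the PR specifies.

```julia
# Sketch only: `struct` is the modern spelling of `immutable`.
struct Histogram{N,E,T}
    edges::NTuple{N,E}      # one edge vector per dimension
    weights::Array{T,N}     # one entry per bin
end

# Hypothetical merge rule: histograms over identical edges combine by adding weights.
function Base.:+(a::Histogram{N,E,T}, b::Histogram{N,E,T}) where {N,E,T}
    a.edges == b.edges || throw(ArgumentError("histograms must share the same edges"))
    return Histogram{N,E,T}(a.edges, a.weights + b.weights)
end

h1 = Histogram((0:2,), [1, 0])
h2 = Histogram((0:2,), [2, 3])
h = h1 + h2   # weights are summed bin by bin
```

Requiring identical edges keeps the merge trivial, which is what a reduction over distributed chunks would need; rebinning mismatched edges would be a separate design question.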

johnmyleswhite · 2014-04-22T14:48:33Z

+1 for a Histogram type

StefanKarpinski · 2014-04-22T22:13:38Z

+1 for a Histogram type and to @jiahao's suggestion.

simonbyrne · 2014-04-23T10:32:03Z

Okay, there's now a single Histogram type, and a single hist function, which accepts tuples as arguments for higher-order histograms.

The question is now what to do with matrix arguments: if size(X) == (m,n), should hist(X):

create an array of Histogram{T,1} objects for each column (this is analogous to current hist behaviour), or
create a single Histogram{T,n} object (similar to current hist2d behaviour)?

simonbyrne · 2014-04-23T11:07:21Z

One other thing to think about: which way should we round? (i.e. should the bins be upper- or lower-inclusive)?

At the moment, we round down, i.e. with an edge vector 0:2, an observation of 1 will go into the first bin, (0,1]:

julia> hist([1],0:2)
Histogram{Int64,1,(UnitRange{Int64},)}((0:2,),[1,0])

This is the same as the R default (which has the option for either), but the opposite of numpy (which doesn't allow a choice).

The historical reason for this choice was that, pre-FloatRange, this reduced the number of odd things that would happen when using ranges like 0:0.1:1 (since 0.1 > 1//10). This problem can be seen in numpy:

In [8]: np.histogram([0.1,0.2,0.3,0.4],10,(0.0,1.0))
Out[8]: 
(array([0, 1, 2, 0, 1, 0, 0, 0, 0, 0]),
 array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ]))

However, this shouldn't be a problem now: either way should give reasonable answers.
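The effect in the numpy output can be mimicked with a small sketch. It assumes the edges are built by multiplying the step (`k * 0.1`), which is an illustration of inexact floating-point edges and not a claim about how numpy or Julia computes them.

```julia
# Assumption: edges computed as k * 0.1, so edges[4] is 0.30000000000000004, not 0.3.
edges = [k * 0.1 for k in 0:10]
xs = [0.1, 0.2, 0.3, 0.4]

# Left-closed bins [e[i], e[i+1]): index of the last edge <= x.
left_bins = [searchsortedlast(edges, x) for x in xs]        # [2, 3, 3, 5]

# Right-closed bins (e[i], e[i+1]]: index of the first edge >= x, minus one.
right_bins = [searchsortedfirst(edges, x) - 1 for x in xs]  # [1, 2, 3, 4]
```

With left-closed bins, 0.3 lands in the same bin as 0.2 because the computed edge sits slightly above 0.3; in 1-based terms this is the same pattern as the numpy counts. With right-closed bins, each observation gets its own bin.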

If we want to allow the option of either, a neat way to do this would be to add another type parameter for RoundUp/RoundDown, and then we can dispatch on searchsortedfirst/searchsortedlast as appropriate.
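A minimal sketch of that dispatch idea, assuming the rounding modes are passed as values and matched on their `RoundingMode` types. The function name `binindex` is hypothetical; the mapping of RoundDown to `searchsortedfirst` follows the "round down" behaviour described earlier.

```julia
# Hypothetical helper: pick the search function by dispatching on the rounding mode.
# RoundDown: bins are (e[i], e[i+1]], so an observation on an edge goes to the lower bin.
binindex(::RoundingMode{:Down}, edges, x) = searchsortedfirst(edges, x) - 1
# RoundUp: bins are [e[i], e[i+1]), so an observation on an edge goes to the upper bin.
binindex(::RoundingMode{:Up}, edges, x) = searchsortedlast(edges, x)

lower = binindex(RoundDown, 0:2, 1)  # 1
upper = binindex(RoundUp, 0:2, 1)    # 2
```

If the mode were stored as a type parameter of the histogram, the same two methods could be selected at compile time with no runtime branch.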

nalimilan · 2014-04-23T11:58:34Z

I'm not sure it matters much, but it seems to me it's more common to use intervals closed on the left and open on the right, i.e. [a; b). This is useful when you have only positive values, as it means 0 will be included without having to artificially pass a negative value. And integer age ranges (for example) are usually defined as 20-29, 30-39, etc.

simonbyrne · 2014-04-23T13:24:09Z

I've added an optional interval argument that is either :left or :right, indicating the closed side.

However I think it makes sense to keep the current behaviour (interval=:right) as default, since that matches nicely with 1-based indexing:

julia> hist(1:100;interval=:right)
Histogram{Int64,1,(StepRange{Int64,Int64},)}((0:20:100,),[20,20,20,20,20],:right)

julia> hist(1:100;interval=:left)
Histogram{Int64,1,(StepRange{Int64,Int64},)}((0:20:120,),[19,20,20,20,20,1],:left)
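The two sets of counts above can be checked with a small counting sketch. `bincounts` is a hypothetical helper written for this illustration, not the PR's `hist`, and the edges are passed in explicitly instead of being chosen automatically.

```julia
# Hypothetical helper: count observations into the bins defined by `edges`.
function bincounts(xs, edges; closed::Symbol = :right)
    counts = zeros(Int, length(edges) - 1)
    for x in xs
        i = closed == :right ? searchsortedfirst(edges, x) - 1 : searchsortedlast(edges, x)
        # Skip observations that fall outside the edge range.
        1 <= i <= length(counts) && (counts[i] += 1)
    end
    return counts
end

right_counts = bincounts(1:100, 0:20:100; closed = :right)  # [20, 20, 20, 20, 20]
left_counts  = bincounts(1:100, 0:20:120; closed = :left)   # [19, 20, 20, 20, 20, 1]
```

With left-closed bins the observation 100 sits on the last edge of 0:20:100 and would be dropped, which is why the left-closed example needs edges running to 120.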

ViralBShah · 2014-04-23T13:26:32Z

+1 for moving this into StatsBase. Only downside is that many users may expect hist to be part of Base, coming from Matlab, where it is standard. I am assuming that R users get such functionality without jumping through package installation hoops.

simonbyrne · 2014-04-23T13:37:05Z

base/statistics.jl

+    is = if h.interval == :right
+        map((edge, x) -> searchsortedfirst(edge,x) - 1, h.edges, xs)
+    else
+        map(searchsortedlast, h.edges, xs)


Perhaps this is an argument for having searchsortedfirst return 1 less by default? cf. #5664

simonbyrne · 2014-04-23T13:49:25Z

@ViralBShah That would perhaps be the easiest in terms of upgrade path, as we could deprecate the existing functionality gradually without breaking too much.

nalimilan · 2014-04-23T14:04:13Z

@simonbyrne As you like, though I'm not sure the comparison really has practical applications. But wouldn't it be more explicit to call the argument closed rather than interval? In R I can never remember in what direction it works.

simonbyrne · 2014-04-23T14:20:13Z

@nalimilan closed does make more sense.

BobPortmann · 2014-04-23T14:24:59Z

Not a big thing, but wouldn't counts make more sense than weights for the actual histogram field?

simonbyrne · 2014-04-23T14:27:41Z

@BobPortmann I had in mind that the same structure could be used for weighted observations, or could be normalised to one (which is why I didn't restrict the field to be Int).
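To make the weighted and normalised cases concrete, here is a minimal sketch. The helper name `weightedcounts` and the choice of right-closed bins are assumptions for illustration only.

```julia
# Hypothetical helper: accumulate observation weights into right-closed bins.
function weightedcounts(xs, ws, edges)
    totals = zeros(Float64, length(edges) - 1)
    for (x, w) in zip(xs, ws)
        i = searchsortedfirst(edges, x) - 1
        # Skip observations that fall outside the edge range.
        1 <= i <= length(totals) && (totals[i] += w)
    end
    return totals
end

totals = weightedcounts([1, 1, 2], [0.5, 1.5, 2.0], 0:2)  # [2.0, 2.0]
normalised = totals ./ sum(totals)                         # [0.5, 0.5]
```

Neither result is an integer count, which is the case a field restricted to Int could not represent.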

simonbyrne · 2014-04-25T08:49:53Z

So I take it that no one is opposed to moving this to StatsBase.jl?

ViralBShah · 2014-04-25T09:35:48Z

I would think so.

johnmyleswhite · 2014-04-25T18:06:37Z

+1 for putting histograms in StatsBase.jl

lindahua · 2014-05-14T16:09:58Z

Late to this party. Agree wholeheartedly with the move.

prcastro · 2014-05-16T04:11:05Z

Why isn't this closed?

pao · 2014-05-16T12:53:36Z

Because no one got to it. @simonbyrne want to do the honors, or is there something else here keeping this open?

ViralBShah · 2014-05-18T04:25:15Z

Bump.

simonbyrne · 2014-05-18T13:50:10Z

Sorry, been away. Yep, now in StatsBase.

add Histogram types

2fa1de1

create single Histogram type

7479c75

add interval option to Histogram

c24a03e

simonbyrne reviewed Apr 23, 2014

This was referenced May 7, 2014

Histogram type and hist methods JuliaStats/StatsBase.jl#61

Merged

deprecate histogram functionality #6842

Closed

simonbyrne closed this May 18, 2014

simonbyrne deleted the hist branch March 10, 2015 12:00

simonbyrne restored the hist branch March 10, 2015 12:00

simonbyrne mentioned this pull request Jul 6, 2016

Change default for "closed" argument for histograms from right to left JuliaStats/StatsBase.jl#184

Closed
