Histogram type and hist methods #61

Merged
merged 5 commits into from
May 14, 2014

simonbyrne
Member

New histogram functionality: it creates a new type Histogram, and works for arbitrary dimensions. It has been proposed to move this here, and deprecate the current hist function in base (see JuliaLang/julia#6601).

Some decisions:

  • What should happen in the case of hist(x::Matrix)? Should this be a size(x,2)-dimensional histogram? If so, should we cap it at some dimension (say 5), so people don't accidentally call it on a 100x100 matrix?
  • Should the Histogram type be mutable? The one advantage of this is that it would allow adaptive resizing when appending additional elements (in particular for streaming data).
  • It would be nice to incorporate weighted vectors.
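
To make the mutability question concrete, here is a minimal sketch of a growable 1-D histogram for streaming data. `StreamingHist` and its fields are hypothetical names used only for illustration; this is not the Histogram type added in this PR. It assumes equal-width, left-closed bins and extends the bin range whenever a new value falls outside it.

```julia
# Hypothetical illustration only: not the Histogram type from this PR.
# Equal-width bins [lo + (i-1)*width, lo + i*width), stored as a growable count vector.
mutable struct StreamingHist
    lo::Float64           # left edge of the first bin
    width::Float64        # common bin width
    counts::Vector{Int}   # counts[i] is the number of values seen in bin i
end

StreamingHist(lo::Real, width::Real) = StreamingHist(Float64(lo), Float64(width), Int[])

function Base.push!(h::StreamingHist, x::Real)
    # Bin index relative to the current left edge; may be < 1 or > length(h.counts).
    i = floor(Int, (x - h.lo) / h.width) + 1
    if i < 1
        # Grow to the left: prepend empty bins and shift the left edge.
        nnew = 1 - i
        prepend!(h.counts, zeros(Int, nnew))
        h.lo -= nnew * h.width
        i = 1
    elseif i > length(h.counts)
        # Grow to the right: append empty bins.
        append!(h.counts, zeros(Int, i - length(h.counts)))
    end
    h.counts[i] += 1
    return h
end

h = StreamingHist(0, 1)
for x in (0.5, 2.25, -1.5, 0.75)
    push!(h, x)
end
```

An immutable type could offer the same behaviour by returning a new object on each append, at the cost of reallocating the counts every time.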

@andreasnoack
Member

👍

@simonbyrne
Member Author

Alternatively, instead of hist(...), could we use fit(Histogram, ...)?

@johnmyleswhite
Member

I like fit(Histogram, ...) a lot. Standardizing on fit is something I'd really like to see pushed forward.

@simonbyrne
Member Author

Okay, I've changed the usage to fit(Histogram,...). That has the additional advantage of not causing conflicts with Base. I've also added some docs.

Unless there are any objections, I'll merge this in tonight.

simonbyrne added a commit that referenced this pull request May 14, 2014
Histogram type and hist methods
@simonbyrne simonbyrne merged commit 075908c into master May 14, 2014
@kmsquire
Contributor

I've been thinking about this, and while I'm sympathetic to the idea of standardizing on fit for many things, the concept of "fitting" a histogram seems a little foreign to me. Is there a precedent in other languages, or can someone explain how this concept is the same as, say, fitting a linear model?

@StefanKarpinski
Contributor

It seems to me that there are two related ideas – a histogram is just counting the items in bins, whereas you can also estimate what portion of a distribution falls into each bin. For the latter, fit seems applicable, whereas for the former it doesn't really.

@johnmyleswhite
Member

Binning and histograms aren't really the same thing: a histogram decides the height of a box based on both the box's width and the probability mass in the region defined by the box's width, whereas counting items in bins ignores the width of the bin. In most conventional histograms, this isn't important because all of the bins are chosen to have the same width, but in general the two concepts are distinct. True binning is much closer to what the cut function does, whereas histograms are much closer to kernel density estimates.
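
The distinction is easiest to see with unequal bin widths. The sketch below uses plain Julia (no package API is assumed) with made-up data and edges: it first counts items per bin, then divides by the sample size times the bin width, so that box areas rather than box heights carry the probability mass.

```julia
# Illustrative data and edges (assumptions, not from the PR); bins are left-closed: [e[i], e[i+1]).
x = [0.1, 0.4, 0.6, 0.9, 1.5, 2.5, 3.5, 3.9]
edges = [0.0, 1.0, 4.0]          # bin widths 1 and 3

# Plain binning: count items per bin, ignoring bin width.
counts = [count(v -> edges[i] <= v < edges[i+1], x) for i in 1:length(edges)-1]

# Histogram heights: probability mass divided by bin width.
widths = diff(edges)
density = counts ./ (length(x) .* widths)

# Box areas (height * width) sum to 1, like a density estimate.
total_area = sum(density .* widths)
```

Both bins hold four items, so the counts are equal, but the bin that is three times wider gets one third of the height.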

As for the use of fit, I think it's appropriate for use with any statistical model, although I can easily sympathize with the sense that fitting a nonparametric model feels very vague since you can't reason easily about the family of all possible histograms in the absence of a specific fit. Fitting a linear model feels a lot more concrete because the parameter set is clearly defined before observing any data.

@StefanKarpinski
Contributor

If it makes more sense to treat histogram construction as a form of non-parametric model fitting – albeit a very simple one – then I think that using fit would be just fine. You guys are the experts :-)

@johnmyleswhite
Member

Well, I'm not totally sure we're making the right decision. But if we end up using fit for things like decision trees, it seems only fair to also use it for histograms.

@lindahua
Contributor

I think we should add the hist method (as a friendlier interface to fit(Histogram, ...)) after the hist methods are removed from Base.

@simonbyrne
Member Author

I'm not 100% sure on this either, but I think it's worth trying to see how it goes. My rough idea is that Histogram represents the mathematical object, i.e. a stepped function, and by "fitting", you're finding the one that best represents the data, in some sense.
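
One way to read "Histogram as a stepped function" is as a function that can be evaluated at any point. The following is a rough sketch in plain Julia; `stepvalue`, the edges, and the heights are hypothetical and chosen only for illustration, and left-closed bins are assumed.

```julia
# Hypothetical helper: evaluate the step function defined by edges and per-bin heights.
# Bins are left-closed [edges[i], edges[i+1]); the function is 0 outside the edges.
function stepvalue(edges::AbstractVector, heights::AbstractVector, t::Real)
    i = searchsortedlast(edges, t)     # index of the last edge <= t (0 if none)
    return 1 <= i <= length(heights) ? heights[i] : zero(eltype(heights))
end

edges = [0.0, 1.0, 4.0]    # illustrative bin edges
heights = [0.5, 0.25]      # illustrative per-bin heights

inside_first = stepvalue(edges, heights, 0.5)
inside_second = stepvalue(edges, heights, 2.0)
outside = stepvalue(edges, heights, 4.0)
```

Under this reading, "fitting" means choosing the edges and heights from the data.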

I originally did plan to define hist(a...) = fit(Histogram, a...), but now I'm not so sure. In R and Matlab, hist actually does two things: it returns an object representing the mathematical construction, but also, depending on the context in which it was called, can have the side-effect of displaying a graphical representation of the object.

I'm not really sure this is a pattern we should follow, as it doesn't really sit well with the rest of Julia. One option worth considering is keeping hist as simply being a plotting function, i.e.

hist(a...) = plot(fit(Histogram,a...))

@StefanKarpinski
Contributor

> but also, depending on the context in which it was called, can have the side-effect of displaying a graphical representation of the object.

Yeah, let's not do that. It's really absurd that this is a random side-effect of computing a histogram, especially since we don't have just one standard graphics package. Each package can have plot methods that apply to Histogram objects, however. Asking the programmer to write plot(hist(a...)) is not onerous.

@lindahua
Contributor

I still think that hist(a...) = fit(Histogram, a...) is a useful convenience function.

I agree with @StefanKarpinski that functions for computation should not be entangled with those that do plotting.

@nalimilan
Member

If hist is provided as a convenience function, then why wouldn't similar functions be shipped for all types of models? Because you're more likely to use histograms repeatedly? Just trying to find out where to draw the line here.

@andreasnoack
Member

Recently I had an argument with myself in a GitHub thread on this in the context of lm, which is now deprecated. I started out in favor of keeping lm for convenience, but convinced myself that it was better not to provide a MATLAB/R convenience layer, because it would be better for users to get used to our interface instead of thinking of Julia as MATLAB/R with fast loops.

@lindahua
Contributor

A histogram is, in essence, a statistic of some sort (not a model). While I am fine with the fit(Histogram, ...) API, the hist method feels more natural to me.

@simonbyrne
Member Author

To follow up on this, I've also thought of an alternative approach which combines histograms and contingency tables: see #32

@simonbyrne simonbyrne deleted the hist branch March 15, 2015 13:42
7 participants