
alternate way to allow for multiple data points #1

Open
jscargle opened this issue Jul 5, 2018 · 2 comments

Comments

jscargle commented Jul 5, 2018

I would like to learn more about your work.
There have been some generalizations and developments in Bayesian Blocks
that have not yet been published. At some point I plan to post this to github ... real soon.

I believe there is a simple way to incorporate multiple data points,
depending on the cost function used ... for example as described in our paper
Astrophysical Journal, 764:167 (26pp), 2013
STUDIES IN ASTRONOMICAL TIME SERIES ANALYSIS. VI. BAYESIAN BLOCK REPRESENTATIONS
Jeffrey D. Scargle, Jay P. Norris, Brad Jackson, and James Chiang
For example in the "workhorse" cost function for event data, we use [eq. (19)]
N ( log N - log T )
for the cost of a block of length T containing N events. Duplicate values
of the event times just get absorbed into this function, except possibly for
some ambiguities in the definition of the beginning and ending times of the block
(i.e. definition of T).
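A minimal sketch of the block fitness described above, eq. (19) of Scargle et al. (2013): N (log N - log T) for a block of length T containing N events. This is an illustration, not code from this repo; the function name and interface are made up. It shows the point about duplicates: a repeated event time simply increments N, while T depends only on the block edges.

```python
import numpy as np

def block_cost(times, t_start, t_stop):
    """Fitness of a single block, eq. (19) of Scargle et al. (2013):
    N * (log N - log T) for N events in a block of length T.
    Duplicate event times just increase N; T comes only from the
    block's beginning and ending times."""
    times = np.asarray(times, dtype=float)
    n = times.size            # duplicates are counted individually
    t = t_stop - t_start      # block length, independent of duplicates
    if n == 0 or t <= 0.0:
        return 0.0
    return n * (np.log(n) - np.log(t))

# Three distinct events vs. the same block with one duplicated time:
print(block_cost([0.1, 0.5, 0.9], 0.0, 1.0))       # 3 * (log 3 - log 1)
print(block_cost([0.1, 0.5, 0.5, 0.9], 0.0, 1.0))  # 4 * (log 4 - log 1)
```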

I wonder how this compares to your approach.
Cheers, Jeff Scargle


janmtl commented Jul 10, 2018

Hi @jscargle,

Great to hear from you! I'm looking forward to seeing the code for the new developments. I will have to read your paper in more detail to understand the improved method but I can provide some context for the code in this repo in the meantime.

When I did this work with my collaborators at ID Analytics, we were trying to find a better way to prepare features for a machine-learning algorithm. Specifically, we had received some data that was a mixture of continuous and discrete signals contained in the same column of a given table. The values we observed were effectively sampled from two distributions: (1) a "nice" continuous distribution for which Bayesian Blocks gave a nice density estimate and (2) a discrete distribution of unknown support (i.e., we did not know a priori the possible values of this distribution).
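To make the data shape concrete, here is a hypothetical sketch of such a column (the specific values and mixture proportions are invented, not from the ID Analytics data): a continuous component mixed with slightly noisy "spikes" at discrete values whose support is not known in advance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Component (1): a "nice" continuous distribution.
continuous = rng.normal(loc=50.0, scale=10.0, size=7000)

# Component (2): a discrete distribution of unknown support.
# Each "spike" is slightly noisy rather than perfectly discrete.
spike_centers = rng.choice([12.0, 30.0, 75.0], size=3000)
spikes = spike_centers + rng.normal(scale=0.05, size=3000)

# Both components arrive mixed together in a single column.
column = np.concatenate([continuous, spikes])
```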

We tried at first to separate out the two distributions through heuristics (by using unique value counts), but we then discovered that the discrete distribution wasn't actually perfectly "discrete". Each discrete "spike" was slightly noisy and constituted a significant volume of the total density.

We developed the method in this repo through trial and error. Our criteria were (A) that the "spikes" in the input data were retained in the block representation and (B) that our ML algorithms saw an improvement from features pre-processed using this method. In the end, we achieved both (A) and (B) and saw a moderate lift from being able to capture both the discrete and continuous nature of the data in the column in question.

I would find it difficult to show that this method is useful without also publishing the dataset on which it performed so well, but as I'm sure you can understand, there is no way for us to have published credit fraud data. Sadly, this ties my hands, since I do not have a suitable dataset on which to perform a reproducible study.

Cheers,

Jan


jscargle commented Jul 10, 2018 via email
