-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python: cdfInverse results in wrong order of values on monotonic distribution with large ranges #12
Comments
This is a bug in the core t-digest code, I'll fix it on isarn/isarn-sketches#9 |
Actually it is also a bug here, since the python version of the |
Great reactivity on this @erikerlandson - I guess the same kind of clipping needs to be done on the Python |
@JonathanTaws, yes - I have the bug fixed but I used this as an opportunity to rebuild all the isarn packages with updated dependencies and also for scala 2.12 (in anticipation of spark 2.4). I expect to be finished with the remaining package revs soon. |
When using cdfInverse on a T-Digest created from a dataset, I get the following:
On this graph, I have a distribution of values with their probability on the y axis and on the x axis the actual values. I generate the value using cdfInverse, as follow:
When I dive deeper into the distribution, I can see that, even though my distribution should be monotonically increasing, I get some values in xs that are unordered, and thus I get the following result (look at 1k, just before 4k, and after 8k):
My assumption was that I would get only increasing values in my xs when generating them from the cdfInverse method, as I am increasing the value of the probability/percentile rank when looping.
A workaround for now is to generate the values, order them, and then call cdf on the ordered values, but it adds extra steps and I'm not sure this is the right method.
To give more example, here are the results of the following:
I would expect td.cdfInverse(0.629) to give a smaller value than td.cdfInverse(0.644)(as the probability of the former is smaller than the latter).
The text was updated successfully, but these errors were encountered: