Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: cdfInverse results in wrong order of values on monotonic distribution with large ranges #12

Closed
JonathanTaws opened this issue Aug 1, 2018 · 4 comments · Fixed by #15

Comments

@JonathanTaws
Copy link
Contributor

When using cdfInverse on a T-Digest created from a dataset, I get the following:
screen shot 2018-08-01 at 2 24 34 pm

On this graph, I have a distribution of values with their probability on the y axis and on the x axis the actual values. I generate the value using cdfInverse, as follow:

xs = [td.cdfInverse(i/1000.) for i in range(1001)]
ys = [i/1000. for i in range(1001)]

When I dive deeper into the distribution, I can see that, even though my distribution should be monotonically increasing, I get some values in xs that are unordered, and thus I get the following result (look at 1k, just before 4k, and after 8k):
screen shot 2018-08-01 at 2 25 09 pm

My assumption was that I would get only increasing values in my xs when generating them from the cdfInverse method, as I am increasing the value of the probability/percentile rank when looping.
A workaround for now is to generate the values, order them, and then call cdf on the ordered values, but it adds extra steps and I'm not sure this is the right method.

To give more example, here are the results of the following:

print(td.cdf(8517.442))
>> 0.6443371631522132
print(td.cdfInverse(0.629))
>> 8517.442135224697
print(td.cdfInverse(0.644))
>> 8509.811889971521

I would expect td.cdfInverse(0.629) to give a smaller value than td.cdfInverse(0.644)(as the probability of the former is smaller than the latter).

@JonathanTaws JonathanTaws changed the title Python: cdfInverse results in wrong order of values on monotonic distribution Python: cdfInverse results in wrong order of values on monotonic distribution with large ranges Aug 1, 2018
@erikerlandson
Copy link
Member

This is a bug in the core t-digest code, I'll fix it on isarn/isarn-sketches#9

@erikerlandson
Copy link
Member

Actually it is also a bug here, since the python version of the TDigest class will also have the bug. It will need to be fixed on both the jvm and python.

@JonathanTaws
Copy link
Contributor Author

Great reactivity on this @erikerlandson - I guess the same kind of clipping needs to be done on the Python TDigest?

@erikerlandson
Copy link
Member

@JonathanTaws, yes - I have the bug fixed but I used this as an opportunity to rebuild all the isarn packages with updated dependencies and also for scala 2.12 (in anticipation of spark 2.4). I expect to be finished with the remaining package revs soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants