Python: cdfInverse results in wrong order of values on monotonic distribution with large ranges #12

JonathanTaws · 2018-08-01T14:28:02Z

When using cdfInverse on a T-Digest created from a dataset, I get the following:

In this graph, the x axis shows the actual values and the y axis their probability. I generate the values using cdfInverse, as follows:

xs = [td.cdfInverse(i/1000.) for i in range(1001)]
ys = [i/1000. for i in range(1001)]

When I dive deeper into the distribution, I can see that, even though my distribution should be monotonically increasing, I get some values in xs that are unordered, and thus I get the following result (look at 1k, just before 4k, and after 8k):

My assumption was that I would get only increasing values in my xs when generating them from the cdfInverse method, as I am increasing the value of the probability/percentile rank when looping.
A workaround for now is to generate the values, order them, and then call cdf on the ordered values, but it adds extra steps and I'm not sure this is the right method.
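
The workaround described above could be sketched roughly as follows. This is only an illustration: `ordered_quantile_curve` and `StandInDigest` are hypothetical names that do not exist in the library, and the stand-in is a toy object exposing `cdf` and `cdfInverse` so that the snippet runs on its own.

```python
def ordered_quantile_curve(td, n=1000):
    # Sample the inverse CDF on an even grid of probabilities, then sort
    # so that any out-of-order values end up in ascending order.
    xs = sorted(td.cdfInverse(i / n) for i in range(n + 1))
    # Re-evaluate the CDF at the sorted values so each x has a matching y.
    ys = [td.cdf(x) for x in xs]
    return xs, ys


class StandInDigest:
    """Hypothetical stand-in: uniform on [0, 100] with one deliberate glitch."""

    def cdf(self, x):
        return min(max(x / 100.0, 0.0), 1.0)

    def cdfInverse(self, q):
        # Overshoot at q == 0.5 to mimic an out-of-order result.
        return 100.0 * q + (15.0 if q == 0.5 else 0.0)


xs, ys = ordered_quantile_curve(StandInDigest(), n=10)
```

Note that after sorting, `ys` comes from `cdf` and is no longer the original evenly spaced probability grid, which is part of why this feels like a detour rather than a fix.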

To give a further example, here are the results of the following calls:

print(td.cdf(8517.442))
>> 0.6443371631522132
print(td.cdfInverse(0.629))
>> 8517.442135224697
print(td.cdfInverse(0.644))
>> 8509.811889971521

I would expect td.cdfInverse(0.629) to give a smaller value than td.cdfInverse(0.644) (as the probability of the former is smaller than that of the latter).
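
A small check along these lines can locate such inversions in a list of quantile results. It is a minimal sketch; `find_inversions` is a hypothetical helper, and the two values are the ones printed above for 0.629 and 0.644.

```python
def find_inversions(xs):
    # Indices i where xs[i] is smaller than its predecessor, i.e. where a
    # sequence that should be non-decreasing goes backwards.
    return [i for i in range(1, len(xs)) if xs[i] < xs[i - 1]]


# Results reported above for cdfInverse(0.629) and cdfInverse(0.644).
xs = [8517.442135224697, 8509.811889971521]
inversions = find_inversions(xs)
```

Running the same helper over the full list built from `range(1001)` would list every position where the curve steps backwards.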


erikerlandson · 2018-08-01T15:16:07Z

This is a bug in the core t-digest code, I'll fix it on isarn/isarn-sketches#9

erikerlandson · 2018-08-02T15:22:50Z

Actually it is also a bug here, since the Python version of the TDigest class will also have the bug. It will need to be fixed on both the JVM and Python.

JonathanTaws · 2018-08-06T13:42:28Z

Great responsiveness on this @erikerlandson - I guess the same kind of clipping needs to be done on the Python TDigest?

erikerlandson · 2018-08-07T15:24:25Z

@JonathanTaws, yes - I have the bug fixed but I used this as an opportunity to rebuild all the isarn packages with updated dependencies and also for scala 2.12 (in anticipation of spark 2.4). I expect to be finished with the remaining package revs soon.
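
For readers wondering what "clipping" to piecewise linear outputs might look like, here is a minimal sketch. It is not the isarn implementation: `clipped_inverse_cdf`, the `centers`/`masses` representation, and the choice of placing each cluster's cumulative mass at its midpoint are all assumptions made for illustration. The idea shown is only that each interpolated result is clamped to the interval between the two neighbouring cluster centers, so results cannot go backwards as the probability increases.

```python
from bisect import bisect_left
from itertools import accumulate


def clipped_inverse_cdf(centers, masses, q):
    # centers: sorted cluster centers; masses: positive cluster masses.
    total = sum(masses)
    cum = list(accumulate(masses))
    # Assumed convention: the CDF at a center is the mass before it plus
    # half of its own mass, normalised by the total mass.
    mids = [(c - m / 2.0) / total for c, m in zip(cum, masses)]
    # Outside the interpolation range, return the end centers.
    if q <= mids[0]:
        return centers[0]
    if q >= mids[-1]:
        return centers[-1]
    # Find the segment with mids[j - 1] < q <= mids[j].
    j = bisect_left(mids, q)
    lo, hi = centers[j - 1], centers[j]
    t = (q - mids[j - 1]) / (mids[j] - mids[j - 1])
    x = lo + t * (hi - lo)
    # Clip to the segment so rounding cannot push x past a neighbour.
    return min(max(x, lo), hi)


centers = [0.0, 10.0, 20.0]
masses = [1.0, 1.0, 1.0]
curve = [clipped_inverse_cdf(centers, masses, i / 100.0) for i in range(101)]
```

With this kind of clamping, `curve` is non-decreasing by construction, which is the property the original report expected from cdfInverse.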

JonathanTaws changed the title ~~Python: cdfInverse results in wrong order of values on monotonic distribution~~ Python: cdfInverse results in wrong order of values on monotonic distribution with large ranges Aug 1, 2018

erikerlandson mentioned this issue Aug 1, 2018

cdfInverse results in wrong order of values on monotonic distribution with large ranges isarn/isarn-sketches#9

Closed

erikerlandson added a commit to erikerlandson/isarn-sketches-spark that referenced this issue Aug 7, 2018

Fixes isarn#12 - clip values to proper piecewise linear outputs

68e914b

erikerlandson mentioned this issue Aug 7, 2018

Fixes #12 - clip values to proper piecewise linear outputs #15

Merged

erikerlandson closed this as completed in #15 Aug 7, 2018

erikerlandson added a commit that referenced this issue Aug 7, 2018

Fixes #12 - clip values to proper piecewise linear outputs (#15)

a0282e0
