
Fix upper-bound bin error for auto-ranged data #459

Merged
22 commits into holoviz:master from issue-457 on Sep 30, 2017

Conversation

@jbcrail (Contributor) commented Sep 12, 2017

Fix #457

@jbednar (Member) commented Sep 13, 2017

Thanks, @jbcrail. I'm trying to decide whether this is truly the appropriate fix.

To start with, I don't think either of the behaviors shown in #457 is correct -- previously, datashader left the point just over the edge at the top right and thus not visible, while the numpy.histogram2d example had an extra row and column on one side, which doesn't seem correct either. With this PR, datashader now does the right thing:

[screenshot: the point at the upper corner is now included in the datashader output]

Given our upper-exclusive bounds policy, it achieves this result by enlarging the floating-point auto range just barely enough to ensure that all points will map into the resulting set of pixels, without changing any user-supplied range. This clearly works, but one implication is that if the user does compute a range based on finding the max and min, the max points won't be included in the plot.

One could argue that such behavior is just doing what the user asks, but we should consider whether it would be less surprising to change the definition of x_range and y_range to be upper-inclusive in general, e.g. by adding np.spacing in compute_scale_and_translate instead. That would be less likely to hide data points without the user realizing it.
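
For concreteness, here is a rough sketch of what the two options amount to; the function names and the scale/translate derivation below are illustrative only and are not datashader's actual internals:

import numpy as np

def enlarge_auto_range(lo, hi):
    # This PR's approach: pad only the auto-computed range just past the
    # maximum, so points exactly at the max fall inside the last pixel.
    return lo, hi + np.spacing(hi)

def upper_inclusive_scale(lo, hi, n_pixels):
    # The alternative suggested above: treat any range (user-supplied or
    # auto) as upper-inclusive by padding the upper bound before deriving
    # the scale/translate pair.
    hi = hi + np.spacing(hi)
    scale = n_pixels / (hi - lo)
    translate = -lo * scale
    return scale, translate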

Any opinions, @jbcrail, @philippjfr, and @hsparra?

@apiszcz commented Sep 14, 2017 via email

@jbednar (Member) commented Sep 14, 2017

Internally, datashader has to map every single datapoint to one, and only one, grid cell. That is true everywhere in the array, from the bottom left to the top right, and the only way it can be achieved for all locations, including those precisely on the border between two grid cells, is if two of the borders of each internal grid cell are inclusive and two are exclusive. If the borders were all inclusive, a datapoint on the border between two grid cells would be counted in both of them, which is clearly incorrect statistically.

Unfortunately, if that policy is applied consistently to all cells, which for good performance is the only reasonable choice, then some of the borders of the overall array will be exclusive. Having those borders be exclusive is fine as a definition, but with autoranging it is not the right overall answer, and it is confusing and surprising.

This PR works around the problem by bumping the two exclusive edges up just slightly so that they include values exactly on the edge, effectively making all of the outer borders inclusive. That avoids the problem in the autoranging case, but not when someone naively supplies bounds that they expect to be inclusive.
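
To make the edge problem concrete, here is a minimal sketch of an upper-exclusive mapping; this is illustrative only and not datashader's actual code:

def naive_bin(x, xmin, xmax, n_bins):
    # Inclusive lower edge, exclusive upper edge: every value maps to
    # exactly one bin index via truncation.
    return int((x - xmin) / (xmax - xmin) * n_bins)

naive_bin(0.5, 0.0, 1.0, 10)   # 5  -- a value on an internal border lands in exactly one bin
naive_bin(1.0, 0.0, 1.0, 10)   # 10 -- the maximum maps one past the last valid index (0..9)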

The behavior of histogram2d is not correct in this case, at least not based on the example you supplied, so we are certainly not going to match that (adding an extra dummy row and column). But I would like to address the specific problem of edge datapoints being excluded, which is what this PR does.

Memory and performance issues should be addressed in other issues, which I think have been raised already, but those will take some investigation.

@apiszcz commented Sep 14, 2017 via email

@jbcrail (Contributor, Author) commented Sep 14, 2017

@jbednar I agree that extending the upper-exclusive bounds policy to the entire array could be confusing and surprising to users. Anecdotally, I pre-calculated the min and max when generating my first Datashader examples, so I also ran into this issue. I think upper-inclusive bounds are the least surprising, which is probably why they are used in other frameworks. Here's why upper-exclusive bounds might cause confusion:

First, the x_range and y_range parameters are similar in name to Python's range, thus implying an upper-exclusive range (i.e. [a, b)). If the parameters were instead x_limit and y_limit (like other plotting frameworks) or used lists for values instead of tuples, then an upper-inclusive range (i.e. [a, b]) would be natural.

Second, even though Datashader only handles numeric values, if we were plotting dates or strings, then an upper-inclusive range would also be least surprising, e.g. plotting A to Z or 2017-Jan to 2017-Dec.

As you mentioned, I think compute_scale_and_translate is the right place to add np.spacing. When I added it to _compute_x_bounds and _compute_y_bounds, I had to adjust the tests to account for the new auto-range value. That didn't feel right, since the implementation was bleeding into the tests.

@jbednar (Member) commented Sep 14, 2017

OK, @jbcrail, can you please update the PR to add the fix in the new spot?

This reverts to exclusive ranges, both manual and auto, for all glyphs
(points/line) without reintroducing the regressions of holoviz#318, holoviz#330, and holoviz#343.

I refactored several tests to make xarray coordinate indices easier to
read and more explicit.
I also fixed extend_line to work only with floats until the line is
accepted for drawing. We were previously juggling floats and their
associated integer values, which caused incorrect mapping to the grid.

@jbednar (Member) commented Sep 28, 2017

Let me know when it's ready to review, once the tests pass...

@jbcrail (Contributor, Author) commented Sep 29, 2017

I tried two strategies to handle the uniform distribution of points, which we did not previously test.

The first strategy was to increase the upper bound of the x-axis and y-axis to the next nearest floating-point value using np.spacing. This changed the scale and translate values used when mapping to the axis, making some bins slightly larger. The details of this change were messy, and I found it hard to get all tests to pass.

The second strategy (currently used) was to adjust the point and line glyphs to decrement, using np.spacing, any data that landed on either upper bound. This made both ranges fully inclusive without changing the bin sizes. One positive side effect was that I refactored the glyph code to be clearer. In particular, the line glyph now uses only the unmapped floating-point values for range checks before mapping to integer values and drawing the line. This simplified the code and may address #464, though I haven't tested that yet.
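
For reference, here is a self-contained sketch of that second strategy; the function signature is hypothetical and the axis mappers (x_mapper/y_mapper) are omitted, so this is not the actual glyph code, which is visible in the review excerpts below:

import numpy as np

def map_onto_pixel_sketch(x, y, xmax, ymax, sx, tx, sy, ty):
    # Values exactly on an upper bound are nudged down by one float ULP
    # before truncation, so they stay inside the last pixel.
    xx = x * sx + tx
    yy = y * sy + ty
    if x == xmax:
        xx -= np.spacing(xx)
    if y == ymax:
        yy -= np.spacing(yy)
    return int(xx), int(yy)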

@jbednar (Member) left a comment

I think this seems workable. @philippjfr, it would be worth having a second opinion, given how difficult it has been to get this right.

if y == ymax:
    yy -= np.spacing(yy)
return int(xx), int(yy)

@jbednar (Member) commented Sep 29, 2017

I have not tested this, but would it be better to fix it up in the integer domain, instead of relying on np.spacing to be sufficient to bump it back by an integer?

xx = int(x_mapper(x) * sx + tx)
yy = int(y_mapper(y) * sy + ty)

return (xx-1 if x==xmax else xx,
        yy-1 if y==ymax else yy)

@jbcrail (Contributor, Author) commented

Adjusting in the integer domain fails to match the original in some cases. For instance,

x_mapper, y_mapper = lambda n: n, lambda n: n
xmax, ymax = 1.0, 1.0
sx, tx, sy, ty = 0.1, 0.1, 0.1, 0.1

given the point (1.0, 1.0), the original algorithm returns (0, 0) while the above returns (-1, -1).

@jbcrail (Contributor, Author) commented

I used this script to compare implementations:

import numpy as np


def map_onto_pixel(x, bound, s, t):
    # Current approach: nudge values on the upper bound down by one ULP
    # before truncating.
    xx = x * s + t
    if x == bound:
        xx -= np.spacing(xx)
    return int(xx)

def map_onto_pixel2(x, bound, s, t):
    # Suggested approach: truncate first, then step back one integer for
    # values exactly on the upper bound.
    xx = int(x * s + t)
    return xx-1 if x == bound else xx

def floats(origin, variance):
    # Generate `origin` plus its nearest floating-point neighbors on both sides.
    fs = [origin]
    n = origin
    for _ in range(variance):
        n = n - np.spacing(n)
        fs.append(n)
    n = origin
    for _ in range(variance):
        n = n + np.spacing(n)
        fs.append(n)
    fs.sort()
    return fs

bound = 1.0
values = [0.1 * n for n in range(10)]
for s in values:
    for t in values:
        for f in floats(bound, 10):
            a = map_onto_pixel(f, bound, s, t)
            b = map_onto_pixel2(f, bound, s, t)
            if a != b:
                print("{:.20f} {} {} {} {}".format(f, a, b, s, t))

@jbednar (Member) commented

For that script, all the cases where the two methods differ are when f==bound, which makes sense given that both versions will otherwise always return int(x * s + t). But what I can't tell is which answer (if either) is correct when they do differ. In those cases, is that even a valid actual input, given that the points are already being cropped against the specified bounds?

@jbcrail (Contributor, Author) commented

The value of the first implementation, a, is assumed to be correct, since it's the current implementation used by all of the passing tests. The script was meant to verify that the second implementation would be a drop-in replacement.

I just verified that the newer implementation passes the tests, which makes me believe more investigation may be needed to improve the test cases.

@jbednar (Member) commented

I'm not sure that's true; the test cases could well be fine. It's true that the above script finds differences, but I can't tell whether it's finding them in cases that could ever be encountered in practice, given that s and t are not arbitrary, but are presumably derived from the bounds (and the aggregate array size), and given that the points are already being cropped against the bounds. What would be helpful would be to take a case where the values are observed to differ, and then figure out whether that case could ever happen, given how s and t are actually calculated. My guess is that it couldn't ever happen, but I have not worked it through myself. And given that both approaches pass the tests, I am not at all confident that the existing approach in the PR is safer or more general.
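
As an illustration of why s and t are coupled to the bounds, here is one hypothetical way a scale/translate pair could be derived from the bounds and the output size (not necessarily how compute_scale_and_translate actually does it):

def scale_translate_sketch(lo, hi, n_pixels):
    # Choose s and t so that pixel = value * s + t maps lo -> 0 and
    # hi -> n_pixels.
    s = n_pixels / (hi - lo)
    t = -lo * s
    return s, t

# Under a derivation like this, the (s, t, bound) combinations swept
# independently by the comparison script above include cases that may never
# arise in practice, especially once cropping against the bounds is applied.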

@jbcrail (Contributor, Author) commented

I agree with your assessment. I took my modified script and created additional uniformity test cases. Both implementations produced the same results with the test suite, so I replaced the mapping with the simpler implementation.
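
As an illustration, one possible shape for such a uniformity check; the grid size and the scale/translate values here are hypothetical and are not the actual test cases added in this PR:

import numpy as np

def check_uniform_grid_fills_all_bins(n=4):
    # n evenly spaced coordinates over [0, 1], including both endpoints,
    # should occupy every cell of an n x n grid exactly once when the
    # upper bound is made inclusive via the np.spacing nudge.
    coords = np.linspace(0.0, 1.0, n)
    s, t = float(n), 0.0          # scale/translate for the unit range
    counts = np.zeros((n, n), dtype=int)
    for x in coords:
        for y in coords:
            xx, yy = x * s + t, y * s + t
            if x == 1.0:
                xx -= np.spacing(xx)
            if y == 1.0:
                yy -= np.spacing(yy)
            counts[int(yy), int(xx)] += 1
    assert (counts == 1).all()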

    xx -= np.spacing(xx)
if y == ymax:
    yy -= np.spacing(yy)
return int(xx), int(yy)

@jbednar (Member) commented

Same comment as above regarding np.spacing vs. -1.

@jbcrail (Contributor, Author) commented

Same as above.

@holoviz deleted a comment from jbcrail on Sep 29, 2017

@jbednar (Member) left a comment

I think it's ready to merge, thanks!

@jbednar merged commit 73163e7 into holoviz:master on Sep 30, 2017
@jbcrail deleted the issue-457 branch on September 30, 2017 at 22:43