
Fix upper-bound bin error for auto-ranged data #459

Merged
22 commits into holoviz:master from issue-457 on Sep 30, 2017

Conversation

@jbcrail (Contributor) commented Sep 12, 2017

Fix #457

@jbednar (Member) commented Sep 13, 2017

Thanks, @jbcrail. I'm trying to decide whether this is truly the appropriate fix.

To start with, I don't think either of the behaviors shown in #457 is correct -- previously, datashader left the point just over the edge at the top right and thus not visible, while the numpy.histogram2d example had an extra row and column on one side, which doesn't seem correct either. With this PR, datashader now does the right thing:

[screenshot: the point at the upper corner is now included in the datashader output]

Given our upper-exclusive bounds policy, it achieves this result by enlarging the floating-point auto range just barely enough to ensure that all points will map into the resulting set of pixels, without changing any user-supplied range. This clearly works, but one implication is that if the user does compute a range based on finding the max and min, the max points won't be included in the plot.

One could argue that such behavior is just doing what the user asks, but we should consider whether it would be less surprising to change the definition of x_range and y_range to be upper-inclusive in general, e.g. by adding np.spacing in compute_scale_and_translate instead. That would be less likely to hide data points without the user realizing it.
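
For concreteness, here is a rough sketch of what the two options amount to; the function names and the scale/translate derivation below are illustrative only and are not datashader's actual internals:

import numpy as np

def enlarge_auto_range(lo, hi):
    # This PR's approach: pad only the auto-computed range just past the
    # maximum, so points exactly at the max fall inside the last pixel.
    return lo, hi + np.spacing(hi)

def upper_inclusive_scale(lo, hi, n_pixels):
    # The alternative suggested above: treat any range (user-supplied or
    # auto) as upper-inclusive by padding the upper bound before deriving
    # the scale/translate pair.
    hi = hi + np.spacing(hi)
    scale = n_pixels / (hi - lo)
    translate = -lo * scale
    return scale, translate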

Any opinions, @jbcrail, @philippjfr, and @hsparra?

@apiszcz commented Sep 14, 2017 via email

@jbednar (Member) commented Sep 14, 2017

Internally, datashader has to map every single datapoint to one, and only one, grid cell. That is true everywhere in the array, from the bottom left to the top right, and the only way it can be achieved for all locations, including those precisely on the border between two grid cells, is if two of the borders of each internal grid cell are inclusive and two are exclusive. If the borders were all inclusive, a datapoint on the border between two grid cells would be counted in both of them, which is clearly incorrect statistically.

Unfortunately, if that policy is applied consistently to all cells, which for good performance is the only reasonable choice, then some of the borders of the overall array will be exclusive. Having those borders be exclusive is fine as a definition, but with autoranging it is not the right overall answer, and it is confusing and surprising.

This PR works around the problem by bumping the two exclusive edges up just slightly so that they include values exactly on the edge, effectively making all of the outer borders inclusive. That avoids the problem in the autoranging case, but not when someone naively supplies bounds that they expect to be inclusive.
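
To make the edge problem concrete, here is a minimal sketch of an upper-exclusive mapping; this is illustrative only and not datashader's actual code:

def naive_bin(x, xmin, xmax, n_bins):
    # Inclusive lower edge, exclusive upper edge: every value maps to
    # exactly one bin index via truncation.
    return int((x - xmin) / (xmax - xmin) * n_bins)

naive_bin(0.5, 0.0, 1.0, 10)   # 5  -- a value on an internal border lands in exactly one bin
naive_bin(1.0, 0.0, 1.0, 10)   # 10 -- the maximum maps one past the last valid index (0..9)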

The behavior of histogram2d is not correct in this case, at least not based on the example you supplied, so we are certainly not going to match that (adding an extra dummy row and column). But I would like to address the specific problem of edge datapoints being excluded, which is what this PR does.

Memory and performance issues should be addressed in other issues, which I think have been raised already, but those will take some investigation.

@apiszcz commented Sep 14, 2017 via email

@jbcrail (Contributor, Author) commented Sep 14, 2017

@jbednar I agree that extending the upper-exclusive bounds policy to the entire array could be confusing and surprising to users. Anecdotally, I pre-calculated the min and max when generating my first Datashader examples, so I also ran into this issue. I think upper-inclusive bounds are the least surprising, which is probably why they are used in other frameworks. Here's why upper-exclusive bounds might cause confusion:

First, the x_range and y_range parameters are similar in name to Python's range, thus implying an upper-exclusive range (i.e. [a, b)). If the parameters were instead x_limit and y_limit (like other plotting frameworks) or used lists for values instead of tuples, then an upper-inclusive range (i.e. [a, b]) would be natural.

Second, even though Datashader only handles numeric values, if we were plotting dates or strings, then an upper-inclusive range would also be least surprising, e.g. plotting A to Z or 2017-Jan to 2017-Dec.

As you mentioned, I think compute_scale_and_translate is the right place to add np.spacing. When I added it to _compute_x_bounds and _compute_y_bounds, I had to adjust the tests to account for the new auto-range value. That didn't feel right, since the implementation was bleeding into the tests.

@jbednar (Member) commented Sep 14, 2017

OK, @jbcrail, can you please update the PR to add the fix in the new spot?

This reverts to exclusive ranges, both manual and auto, for all glyphs
(points/line) without reintroducing the regressions of holoviz#318, holoviz#330, and holoviz#343.

I refactored several tests to make xarray coordinate indices easier to
read and more explicit.
I also fixed extend_line to work only with floats until the line is
accepted for drawing. We were previously juggling floats and their
associated integer values, which caused incorrect mapping to the grid.

@jbednar (Member) commented Sep 28, 2017

Let me know when it's ready to review, once the tests pass...

@jbcrail (Contributor, Author) commented Sep 29, 2017

I tried two strategies to handle the uniform distribution of points, which we did not previously test.

The first strategy was to increase the upper bound of the x-axis and y-axis to the next nearest floating-point value using np.spacing. This changed the scale and translate values used when mapping to the axis, making some bins slightly larger. The details of this change were messy, and I found it hard to get all tests to pass.

The second strategy (currently used) was to adjust the point and line glyphs to decrement, using np.spacing, any data that landed on either upper bound. This made both ranges fully inclusive without changing the bin sizes. One positive side effect was that I refactored the glyph code to be clearer. In particular, the line glyph now uses only the unmapped floating-point values for range checks before mapping to integer values and drawing the line. This simplified the code and may address #464, though I haven't tested that yet.
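
For reference, here is a self-contained sketch of that second strategy; the function signature is hypothetical and the axis mappers (x_mapper/y_mapper) are omitted, so this is not the actual glyph code, which is visible in the review excerpts below:

import numpy as np

def map_onto_pixel_sketch(x, y, xmax, ymax, sx, tx, sy, ty):
    # Values exactly on an upper bound are nudged down by one float ULP
    # before truncation, so they stay inside the last pixel.
    xx = x * sx + tx
    yy = y * sy + ty
    if x == xmax:
        xx -= np.spacing(xx)
    if y == ymax:
        yy -= np.spacing(yy)
    return int(xx), int(yy)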

@jbednar (Member) left a comment

I think this seems workable. @philippjfr, it would be worth having a second opinion, given how difficult it has been to get this right.

if y == ymax:
    yy -= np.spacing(yy)
return int(xx), int(yy)

@jbednar (Member) commented Sep 29, 2017

I have not tested this, but would it be better to fix it up in the integer domain, instead of relying on np.spacing to be sufficient to bump it back by an integer?

xx = int(x_mapper(x) * sx + tx)
yy = int(y_mapper(y) * sy + ty)

return (xx-1 if x==xmax else xx,
        yy-1 if y==ymax else yy)

@jbcrail (Contributor, Author) commented

Adjusting in the integer domain fails to match the original in some cases. For instance,

x_mapper, y_mapper = lambda n: n, lambda n: n
xmax, ymax = 1.0, 1.0
sx, tx, sy, ty = 0.1, 0.1, 0.1, 0.1

given the point (1.0, 1.0), the original algorithm returns (0, 0) while the above returns (-1, -1).

@jbcrail (Contributor, Author) commented

I used this script to compare implementations:

import numpy as np


def map_onto_pixel(x, bound, s, t):
    # Current approach: nudge values on the upper bound down by one ULP
    # before truncating.
    xx = x * s + t
    if x == bound:
        xx -= np.spacing(xx)
    return int(xx)

def map_onto_pixel2(x, bound, s, t):
    # Suggested approach: truncate first, then step back one integer for
    # values exactly on the upper bound.
    xx = int(x * s + t)
    return xx-1 if x == bound else xx

def floats(origin, variance):
    # Generate `origin` plus its nearest floating-point neighbors on both sides.
    fs = [origin]
    n = origin
    for _ in range(variance):
        n = n - np.spacing(n)
        fs.append(n)
    n = origin
    for _ in range(variance):
        n = n + np.spacing(n)
        fs.append(n)
    fs.sort()
    return fs

bound = 1.0
values = [0.1 * n for n in range(10)]
for s in values:
    for t in values:
        for f in floats(bound, 10):
            a = map_onto_pixel(f, bound, s, t)
            b = map_onto_pixel2(f, bound, s, t)
            if a != b:
                print("{:.20f} {} {} {} {}".format(f, a, b, s, t))

@jbednar (Member) commented

For that script, all the cases where the two methods differ are when f==bound, which makes sense given that both versions will otherwise always return int(x * s + t). But what I can't tell is which answer (if either) is correct when they do differ. In those cases, is that even a valid actual input, given that the points are already being cropped against the specified bounds?

@jbcrail (Contributor, Author) commented

The value of the first implementation, a, is assumed to be correct, since it's the current implementation used by all of the passing tests. The script was meant to verify that the second implementation would be a drop-in replacement.

I just verified that the newer implementation passes the tests, which makes me believe more investigation may be needed to improve the test cases.

@jbednar (Member) commented

I'm not sure that's true; the test cases could well be fine. It's true that the above script finds differences, but I can't tell whether it's finding them in cases that could ever be encountered in practice, given that s and t are not arbitrary, but are presumably derived from the bounds (and the aggregate array size), and given that the points are already being cropped against the bounds. What would be helpful would be to take a case where the values are observed to differ, and then figure out whether that case could ever happen, given how s and t are actually calculated. My guess is that it couldn't ever happen, but I have not worked it through myself. And given that both approaches pass the tests, I am not at all confident that the existing approach in the PR is safer or more general.
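
As an illustration of why s and t are coupled to the bounds, here is one hypothetical way a scale/translate pair could be derived from the bounds and the output size (not necessarily how compute_scale_and_translate actually does it):

def scale_translate_sketch(lo, hi, n_pixels):
    # Choose s and t so that pixel = value * s + t maps lo -> 0 and
    # hi -> n_pixels.
    s = n_pixels / (hi - lo)
    t = -lo * s
    return s, t

# Under a derivation like this, the (s, t, bound) combinations swept
# independently by the comparison script above include cases that may never
# arise in practice, especially once cropping against the bounds is applied.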

@jbcrail (Contributor, Author) commented

I agree with your assessment. I took my modified script and created additional uniformity test cases. Both implementations produced the same results with the test suite, so I replaced the mapping with the simpler implementation.
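
As an illustration, one possible shape for such a uniformity check; the grid size and the scale/translate values here are hypothetical and are not the actual test cases added in this PR:

import numpy as np

def check_uniform_grid_fills_all_bins(n=4):
    # n evenly spaced coordinates over [0, 1], including both endpoints,
    # should occupy every cell of an n x n grid exactly once when the
    # upper bound is made inclusive via the np.spacing nudge.
    coords = np.linspace(0.0, 1.0, n)
    s, t = float(n), 0.0          # scale/translate for the unit range
    counts = np.zeros((n, n), dtype=int)
    for x in coords:
        for y in coords:
            xx, yy = x * s + t, y * s + t
            if x == 1.0:
                xx -= np.spacing(xx)
            if y == 1.0:
                yy -= np.spacing(yy)
            counts[int(yy), int(xx)] += 1
    assert (counts == 1).all()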

    xx -= np.spacing(xx)
if y == ymax:
    yy -= np.spacing(yy)
return int(xx), int(yy)

@jbednar (Member) commented

Same comment as above regarding np.spacing vs. -1.

@jbcrail (Contributor, Author) commented

Same as above.

@holoviz deleted a comment from jbcrail on Sep 29, 2017

@jbednar (Member) left a comment

I think it's ready to merge, thanks!

@jbednar merged commit 73163e7 into holoviz:master on Sep 30, 2017
@jbcrail deleted the issue-457 branch on September 30, 2017 at 22:43