Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix upper-bound bin error for auto-ranged data #459

Merged
merged 22 commits into from
Sep 30, 2017
Merged
Show file tree
Hide file tree
Changes from 18 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 1 addition & 3 deletions datashader/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -54,15 +54,13 @@ def compute_scale_and_translate(self, range, n):
----------
range : tuple
A tuple representing the range ``[min, max]`` along the axis, in
data space. min is inclusive and max is exclusive.
data space. Both min and max are inclusive.
n : int
The number of bins along the axis.

Returns
-------
s, t : floats
Parameters represe

"""
start, end = map(self.mapper, range)
s = n/(end - start)
Expand Down
1 change: 1 addition & 0 deletions datashader/dask.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ def shape_bounds_st_and_axis(df, canvas, glyph):
y_range = canvas.y_range or glyph._compute_y_bounds_dask(df)
x_min, x_max, y_min, y_max = bounds = compute(*(x_range + y_range))
x_range, y_range = (x_min, x_max), (y_min, y_max)

width = canvas.plot_width
height = canvas.plot_height

Expand Down
98 changes: 50 additions & 48 deletions datashader/glyphs.py
Original file line number Diff line number Diff line change
Expand Up @@ -92,14 +92,21 @@ def _build_extend(self, x_mapper, y_mapper, info, append):
def _extend(vt, bounds, xs, ys, *aggs_and_cols):
sx, tx, sy, ty = vt
xmin, xmax, ymin, ymax = bounds

def map_onto_pixel(x, y):
xx, yy = x_mapper(x) * sx + tx, y_mapper(y) * sy + ty
if x == xmax:
xx -= np.spacing(xx)
if y == ymax:
yy -= np.spacing(yy)
return int(xx), int(yy)

Copy link
Member

@jbednar jbednar Sep 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not tested this, but would it be better to fix it up in the integer domain, instead of relying on np.spacing to be sufficient to bump it back by an integer?

xx = int(x_mapper(x) * sx + tx)
yy = int(y_mapper(y) * sy + ty)

return (xx-1 if x==xmax else xx,
        yy-1 if y==ymax else yy)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusting in the integer domain instead fails compared to the original. For instance,

x_mapper, y_mapper = lambda n: n, lambda n: n
xmax, ymax = 1.0, 1.0
sx, tx, sy, ty = 0.1, 0.1, 0.1, 0.1

given the point (1.0, 1.0), the original algorithm returns (0, 0) while the above returns (-1, -1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used this script to compare implementations:

import numpy as np


def map_onto_pixel(x, bound, s, t):
    xx = x * s + t
    if x == bound:
        xx -= np.spacing(xx)
    return int(xx)

def map_onto_pixel2(x, bound, s, t):
    xx = int(x * s + t)
    return xx-1 if x == bound else xx

def floats(origin, variance):
    fs = [origin]
    n = origin
    for _ in range(variance):
        n = n - np.spacing(n)
        fs.append(n)
    n = origin
    for _ in range(variance):
        n = n + np.spacing(n)
        fs.append(n)
    fs.sort()
    return fs

bound = 1.0
values = [0.1 * n for n in range(10)]
for s in values:
    for t in values:
        for f in floats(bound, 10):
            a = map_onto_pixel(f, bound, s, t)
            b = map_onto_pixel2(f, bound, s, t)
            if a != b:
                print("{:.20f} {} {} {} {}".format(f, a, b, s, t))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For that script, all the cases where the two methods differ are when f==bound, which makes sense given that both versions will otherwise always return int(x * s + t). But what I can't tell is which answer (if either) is correct when the two do differ. When they differ, is that meant to be a valid actual input, given that the points are being cropped against the specified bounds already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of the first implementation, a, is assumed to be the correct value since its the current implementation used by all of the passing tests. The script was meant to verify the second implementation would be a drop-in replacement.

I just verified that the newer implementation passes the tests, which makes me believe more investigation may be needed to improve the test cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that's true; the test cases could well be fine. It's true that the above script finds differences, but I can't tell whether it's finding them in cases that would ever be able to be encountered in practice, given that s and t are not arbitrary, but are presumably derived from the bounds (and the aggregate array size), and given that the points are already being cropped against the bounds. What would be helpful would be to take a case where the values are observed to differ, and then figure out whether that case could ever happen, given how s and t are actually calculated. My guess is that it couldn't ever happen, but I have not worked it through myself. And given that both approaches pass the tests, I am not at all confident that the existing approach in the PR is safer or more general.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your assessment. I took my modified script and created additional uniformity test cases. Both implementations produced the same results with the test suite, so I replaced the mapping with the simpler implementation.

for i in range(xs.shape[0]):
x = xs[i]
y = ys[i]
if (xmin <= x < xmax) and (ymin <= y < ymax):
append(i,
int(x_mapper(x) * sx + tx),
int(y_mapper(y) * sy + ty),
*aggs_and_cols)
if (xmin <= x <= xmax) and (ymin <= y <= ymax):
xi, yi = map_onto_pixel(x, y)
append(i, xi, yi, *aggs_and_cols)

def extend(aggs, df, vt, bounds):
xs = df[x_name].values
Expand Down Expand Up @@ -127,16 +134,10 @@ def _build_extend(self, x_mapper, y_mapper, info, append):
y_name = self.y

def extend(aggs, df, vt, bounds, plot_start=True):
# Scale/transform float bounds to integer space and adjust for
# exclusive upper bounds
xmin, xmax, ymin, ymax = map_onto_pixel(vt, *bounds)
mapped_bounds = (xmin, xmax - 1, ymin, ymax - 1)

xs = df[x_name].values
ys = df[y_name].values
cols = aggs + info(df)

extend_line(vt, bounds, mapped_bounds, xs, ys, plot_start, *cols)
extend_line(vt, bounds, xs, ys, plot_start, *cols)

return extend

Expand Down Expand Up @@ -169,13 +170,16 @@ def _compute_outcode(x, y, xmin, xmax, ymin, ymax):

def _build_map_onto_pixel(x_mapper, y_mapper):
@ngjit
def map_onto_pixel(vt, x0, x1, y0, y1):
def map_onto_pixel(vt, bounds, x, y):
"""Map points onto pixel grid"""
sx, tx, sy, ty = vt
return (int(x_mapper(x0) * sx + tx),
int(x_mapper(x1) * sx + tx),
int(y_mapper(y0) * sy + ty),
int(y_mapper(y1) * sy + ty))
_, xmax, _, ymax = bounds
xx, yy = x_mapper(x) * sx + tx, y_mapper(y) * sy + ty
if x == xmax:
xx -= np.spacing(xx)
if y == ymax:
yy -= np.spacing(yy)
return int(xx), int(yy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above regarding np.spacing vs. -1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.


return map_onto_pixel

Expand Down Expand Up @@ -234,9 +238,9 @@ def draw_line(x0i, y0i, x1i, y1i, i, plot_start, clipped, *aggs_and_cols):

def _build_extend_line(draw_line, map_onto_pixel):
@ngjit
def extend_line(vt, bounds, mapped_bounds, xs, ys, plot_start, *aggs_and_cols):
def extend_line(vt, bounds, xs, ys, plot_start, *aggs_and_cols):
"""Aggregate along a line formed by ``xs`` and ``ys``"""
xmin, xmax, ymin, ymax = mapped_bounds
xmin, xmax, ymin, ymax = bounds
nrows = xs.shape[0]
i = 0
while i < nrows - 1:
Expand All @@ -252,11 +256,9 @@ def extend_line(vt, bounds, mapped_bounds, xs, ys, plot_start, *aggs_and_cols):
i += 1
continue

x0i, x1i, y0i, y1i = map_onto_pixel(vt, x0, x1, y0, y1)

# Use Cohen-Sutherland to clip the segment to a bounding box
outcode0 = _compute_outcode(x0i, y0i, xmin, xmax, ymin, ymax)
outcode1 = _compute_outcode(x1i, y1i, xmin, xmax, ymin, ymax)
outcode0 = _compute_outcode(x0, y0, xmin, xmax, ymin, ymax)
outcode1 = _compute_outcode(x1, y1, xmin, xmax, ymin, ymax)

accept = False
clipped = False
Expand All @@ -268,34 +270,34 @@ def extend_line(vt, bounds, mapped_bounds, xs, ys, plot_start, *aggs_and_cols):
elif outcode0 & outcode1:
plot_start = True
break

clipped = True
outcode_out = outcode0 if outcode0 else outcode1
if outcode_out & TOP:
x = x0 + (x1 - x0) * (ymax - y0) / (y1 - y0)
y = ymax
elif outcode_out & BOTTOM:
x = x0 + (x1 - x0) * (ymin - y0) / (y1 - y0)
y = ymin
elif outcode_out & RIGHT:
y = y0 + (y1 - y0) * (xmax - x0) / (x1 - x0)
x = xmax
elif outcode_out & LEFT:
y = y0 + (y1 - y0) * (xmin - x0) / (x1 - x0)
x = xmin

if outcode_out == outcode0:
x0, y0 = x, y
outcode0 = _compute_outcode(x0, y0, xmin, xmax, ymin, ymax)
# If x0 is clipped, we need to plot the new start
plot_start = True
else:
clipped = True
outcode_out = outcode0 if outcode0 else outcode1
if outcode_out & TOP:
x = x0i + int((x1i - x0i) * (ymax - y0i) / (y1i - y0i))
y = ymax
elif outcode_out & BOTTOM:
x = x0i + int((x1i - x0i) * (ymin - y0i) / (y1i - y0i))
y = ymin
elif outcode_out & RIGHT:
y = y0i + int((y1i - y0i) * (xmax - x0i) / (x1i - x0i))
x = xmax
elif outcode_out & LEFT:
y = y0i + int((y1i - y0i) * (xmin - x0i) / (x1i - x0i))
x = xmin

if outcode_out == outcode0:
x0i, y0i = x, y
outcode0 = _compute_outcode(x0i, y0i, xmin, xmax,
ymin, ymax)
# If x0i is clipped, we need to plot the new start
plot_start = True
else:
x1i, y1i = x, y
outcode1 = _compute_outcode(x1i, y1i, xmin, xmax,
ymin, ymax)
x1, y1 = x, y
outcode1 = _compute_outcode(x1, y1, xmin, xmax, ymin, ymax)

if accept:
x0i, y0i = map_onto_pixel(vt, bounds, x0, y0)
x1i, y1i = map_onto_pixel(vt, bounds, x1, y1)
draw_line(x0i, y0i, x1i, y1i, i, plot_start, clipped, *aggs_and_cols)
plot_start = False
i += 1
Expand Down
1 change: 1 addition & 0 deletions datashader/pandas.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ def pandas_pipeline(df, schema, canvas, glyph, summary):

x_range = canvas.x_range or glyph._compute_x_bounds(df[glyph.x].values)
y_range = canvas.y_range or glyph._compute_y_bounds(df[glyph.y].values)

width = canvas.plot_width
height = canvas.plot_height

Expand Down
151 changes: 120 additions & 31 deletions datashader/tests/test_dask.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,22 +25,31 @@

ddf = dd.from_pandas(df, npartitions=3)

c = ds.Canvas(plot_width=2, plot_height=2, x_range=(0, 2), y_range=(0, 2))
c_logx = ds.Canvas(plot_width=2, plot_height=2, x_range=(1, 11),
y_range=(0, 2), x_axis_type='log')
c_logy = ds.Canvas(plot_width=2, plot_height=2, x_range=(0, 2),
y_range=(1, 11), y_axis_type='log')
c_logxy = ds.Canvas(plot_width=2, plot_height=2, x_range=(1, 11),
y_range=(1, 11), x_axis_type='log', y_axis_type='log')

coords = [np.arange(2, dtype='f8')+0.5, np.arange(2, dtype='f8')+0.5]
c = ds.Canvas(plot_width=2, plot_height=2, x_range=(0, 1), y_range=(0, 1))
c_logx = ds.Canvas(plot_width=2, plot_height=2, x_range=(1, 10),
y_range=(0, 1), x_axis_type='log')
c_logy = ds.Canvas(plot_width=2, plot_height=2, x_range=(0, 1),
y_range=(1, 10), y_axis_type='log')
c_logxy = ds.Canvas(plot_width=2, plot_height=2, x_range=(1, 10),
y_range=(1, 10), x_axis_type='log', y_axis_type='log')

axis = ds.core.LinearAxis()
lincoords = axis.compute_index(axis.compute_scale_and_translate((0, 1), 2), 2)
coords = [lincoords, lincoords]
dims = ['y', 'x']


def assert_eq(agg, b):
assert agg.equals(b)


def floats(n):
"""Returns contiguous list of floats from initial point"""
while True:
yield n
n = n + np.spacing(n)


def test_count():
out = xr.DataArray(np.array([[5, 5], [5, 5]], dtype='i4'),
coords=coords, dims=dims)
Expand All @@ -56,12 +65,12 @@ def test_count():
def test_any():
out = xr.DataArray(np.array([[True, True], [True, True]]),
coords=coords, dims=dims)
assert_eq(c.points(df, 'x', 'y', ds.any('i64')), out)
assert_eq(c.points(df, 'x', 'y', ds.any('f64')), out)
assert_eq(c.points(df, 'x', 'y', ds.any()), out)
assert_eq(c.points(ddf, 'x', 'y', ds.any('i64')), out)
assert_eq(c.points(ddf, 'x', 'y', ds.any('f64')), out)
assert_eq(c.points(ddf, 'x', 'y', ds.any()), out)
out = xr.DataArray(np.array([[True, True], [True, False]]),
coords=coords, dims=dims)
assert_eq(c.points(df, 'x', 'y', ds.any('empty_bin')), out)
assert_eq(c.points(ddf, 'x', 'y', ds.any('empty_bin')), out)


def test_sum():
Expand Down Expand Up @@ -151,18 +160,76 @@ def test_multiple_aggregates():
assert_eq(agg.i32_count, f(np.array([[5, 5], [5, 5]], dtype='i4')))


def test_auto_range_points():
n = 10
data = np.arange(n, dtype='i4')
df = pd.DataFrame({'time': np.arange(n),
'x': data,
'y': data})
ddf = dd.from_pandas(df, npartitions=3)

cvs = ds.Canvas(plot_width=n, plot_height=n)
agg = cvs.points(ddf, 'x', 'y', ds.count('time'))
sol = np.zeros((n, n), int)
np.fill_diagonal(sol, 1)
np.testing.assert_equal(agg.data, sol)

cvs = ds.Canvas(plot_width=n+1, plot_height=n+1)
agg = cvs.points(ddf, 'x', 'y', ds.count('time'))
sol = np.zeros((n+1, n+1), int)
np.fill_diagonal(sol, 1)
sol[5, 5] = 0
np.testing.assert_equal(agg.data, sol)

n = 4
data = np.arange(n, dtype='i4')
df = pd.DataFrame({'time': np.arange(n),
'x': data,
'y': data})
ddf = dd.from_pandas(df, npartitions=3)

cvs = ds.Canvas(plot_width=2*n, plot_height=2*n)
agg = cvs.points(ddf, 'x', 'y', ds.count('time'))
sol = np.zeros((2*n, 2*n), int)
np.fill_diagonal(sol, 1)
sol[[range(1, 4, 2)]] = 0
sol[[range(4, 8, 2)]] = 0
np.testing.assert_equal(agg.data, sol)

cvs = ds.Canvas(plot_width=2*n+1, plot_height=2*n+1)
agg = cvs.points(ddf, 'x', 'y', ds.count('time'))
sol = np.zeros((2*n+1, 2*n+1), int)
sol[0, 0] = 1
sol[3, 3] = 1
sol[6, 6] = 1
sol[8, 8] = 1
np.testing.assert_equal(agg.data, sol)


def test_uniform_points():
n = 101
df = pd.DataFrame({'time': np.ones(2*n, dtype='i4'),
'x': np.concatenate((np.arange(n, dtype='f8'),
np.arange(n, dtype='f8'))),
'y': np.concatenate(([0.] * n, [1.] * n))})

cvs = ds.Canvas(plot_width=10, plot_height=2, y_range=(0, 1))
agg = cvs.points(df, 'x', 'y', ds.count('time'))
sol = np.array([[10] * 9 + [11], [10] * 9 + [11]], dtype='i4')
np.testing.assert_equal(agg.data, sol)

def test_log_axis_points():
# Upper bound for scale/index of x-axis
start, end = map(np.log10, (1, 11))
s = 2/(end - start)
t = -start * s
px = np.arange(2)+0.5
logcoords = 10**((px-t)/s)
axis = ds.core.LogAxis()
logcoords = axis.compute_index(axis.compute_scale_and_translate((1, 10), 2), 2)

axis = ds.core.LinearAxis()
lincoords = axis.compute_index(axis.compute_scale_and_translate((0, 1), 2), 2)

sol = np.array([[5, 5], [5, 5]], dtype='i4')
out = xr.DataArray(sol, coords=[np.array([0.5, 1.5]), logcoords],
out = xr.DataArray(sol, coords=[lincoords, logcoords],
dims=['y', 'log_x'])
assert_eq(c_logx.points(ddf, 'log_x', 'y', ds.count('i32')), out)
out = xr.DataArray(sol, coords=[logcoords, np.array([0.5, 1.5])],
out = xr.DataArray(sol, coords=[logcoords, lincoords],
dims=['log_y', 'x'])
assert_eq(c_logy.points(ddf, 'x', 'log_y', ds.count('i32')), out)
out = xr.DataArray(sol, coords=[logcoords, logcoords],
Expand All @@ -171,11 +238,14 @@ def test_log_axis_points():


def test_line():
axis = ds.core.LinearAxis()
lincoords = axis.compute_index(axis.compute_scale_and_translate((-3., 3.), 7), 7)

df = pd.DataFrame({'x': [4, 0, -4, -3, -2, -1.9, 0, 10, 10, 0, 4],
'y': [0, -4, 0, 1, 2, 2.1, 4, 20, 30, 4, 0]})
ddf = dd.from_pandas(df, npartitions=3)
cvs = ds.Canvas(plot_width=7, plot_height=7,
x_range=(-3, 4), y_range=(-3, 4))
x_range=(-3, 3), y_range=(-3, 3))
agg = cvs.line(ddf, 'x', 'y', ds.count())
sol = np.array([[0, 0, 1, 0, 1, 0, 0],
[0, 1, 0, 0, 0, 1, 0],
Expand All @@ -184,25 +254,44 @@ def test_line():
[1, 0, 0, 0, 0, 0, 1],
[0, 2, 0, 0, 0, 1, 0],
[0, 0, 1, 0, 1, 0, 0]], dtype='i4')
out = xr.DataArray(sol, coords=[np.arange(-3., 4.)+0.5, np.arange(-3., 4.)+0.5],
out = xr.DataArray(sol, coords=[lincoords, lincoords],
dims=['y', 'x'])
assert_eq(agg, out)


def test_log_axis_line():
# Upper bound for scale/index of x-axis
start, end = map(np.log10, (1, 11))
s = 2/(end - start)
t = -start * s
px = np.arange(2)+0.5
logcoords = 10**((px-t)/s)
axis = ds.core.LogAxis()
logcoords = axis.compute_index(axis.compute_scale_and_translate((1, 10), 2), 2)

axis = ds.core.LinearAxis()
lincoords = axis.compute_index(axis.compute_scale_and_translate((0, 1), 2), 2)

sol = np.array([[5, 5], [5, 5]], dtype='i4')
out = xr.DataArray(sol, coords=[np.array([0.5, 1.5]), logcoords],
out = xr.DataArray(sol, coords=[lincoords, logcoords],
dims=['y', 'log_x'])
assert_eq(c_logx.line(ddf, 'log_x', 'y', ds.count('i32')), out)
out = xr.DataArray(sol, coords=[logcoords, np.array([0.5, 1.5])],
out = xr.DataArray(sol, coords=[logcoords, lincoords],
dims=['log_y', 'x'])
assert_eq(c_logy.line(ddf, 'x', 'log_y', ds.count('i32')), out)
out = xr.DataArray(sol, coords=[logcoords, logcoords],
dims=['log_y', 'log_x'])
assert_eq(c_logxy.line(ddf, 'log_x', 'log_y', ds.count('i32')), out)


def test_auto_range_line():
axis = ds.core.LinearAxis()
lincoords = axis.compute_index(axis.compute_scale_and_translate((-10., 10.), 5), 5)

df = pd.DataFrame({'x': [-10, 0, 10, 0, -10],
'y': [ 0, 10, 0, -10, 0]})
ddf = dd.from_pandas(df, npartitions=3)
cvs = ds.Canvas(plot_width=5, plot_height=5)
agg = cvs.line(ddf, 'x', 'y', ds.count())
sol = np.array([[0, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[2, 0, 0, 0, 1],
[0, 1, 0, 1, 0],
[0, 0, 1, 0, 0]], dtype='i4')
out = xr.DataArray(sol, coords=[lincoords, lincoords],
dims=['y', 'x'])
assert_eq(agg, out)
Loading