Support for row-oriented coordinates in Canvas.line #694

jonmmease · 2019-01-22T00:08:25Z

Overview

This PR introduces a new Canvas.lines method for rendering collections of polylines (here called "lines" for short) where each polyline corresponds to a single row in an input DataFrame. This is in contrast to the current Canvas.line method which renders a single polyline (potentially containing gaps if there are nan values) per DataFrame.

I propose that this method eventually be overloaded to support the following 3 use-cases. Note: This PR only implements (1), but I'm hopeful that the addition of (2) and (3) will not require much additional refactoring.

Each line has the same length, N, and is represented by N columns containing the x coordinates and N columns containing the y coordinates. Here's an example in the case of two-vertex line segments (N=2):

cvs.lines(df, x=['x0', 'x1'], y=['y0', 'y1'])

Each line has the same length, N, but either the x or y vertex coordinates are constant for all lines. A common use-case here would be time-series data, where the x values (the time bins) are constant across all lines and each row of the DataFrame contains only the y-coordinates of each line. This would nicely handle cases like the one brought up in Plotting many lines as separate columns takes a long time #286 (comment), without the need to concatenate and nan separate each line. Here's what the aggregation step for this example could look like:

data = ... # 10000 x 50 array representing 10000 lines over 50 time bins
cvs.lines(pd.DataFrame(data), y=list(range(50)), x_constants=np.linspace(0, 10, 50))

Note that the data numpy array is converted into a DataFrame before being passed to cvs.lines. This DataFrame will have columns with integer labels that match the column index, and so the y=list(range(50)) argument is specifying a list of the column labels where the column labels are integers. As suggested in #283, it would also be nice to allow DataShader to accept a numpy (or Dask) array directly without requiring this DataFrame conversion step. I see this as a somewhat orthogonal feature that would require support across several other parts of the API.

Each line may have a different length. In this case the x and y vertex coordinates could be stored in a pair of RaggedArray columns (Add pandas ExtensionArray for storing homogeneous ragged arrays #687).

Additional consideration: For cases 1 and 2, we should also have a way to specify whether lines are
represented by rows (as I originally had in mind writing the descriptions above) or by columns. Representing lines by rows is useful when the line lengths are manageable, but there are lots and lots of lines. Specifying lines by columns is useful when the number of lines is manageable, but each line has lots and lots of points.

Example

Drawing line segments

import numpy as np
import pandas as pd
import dask.dataframe as dd
import datashader as ds
import datashader.utils as du, datashader.transfer_functions as tf

n = 100
np.random.seed(2)

x0 = np.random.uniform(size=n)
x1 = np.random.uniform(size=n)
y0 = np.random.uniform(size=n)
y1 = np.random.uniform(size=n)

pts = np.stack((x0,x1,y0,y1)).T
df = pd.DataFrame(np.stack((x0,x1,y0,y1)).T, columns=['x0', 'x1', 'y0', 'y1'])

cvs = ds.Canvas(plot_height=400,plot_width=400, x_range=[0, 1], y_range=[0, 1])
tf.shade(cvs.lines(df, ['x0', 'x1'], ['y0', 'y1'], agg=ds.count()))

Performance

Edit: See updated performance results in #694 (comment)

Here are some performance comparisons comparing cvs.lines with the equivalent cvs.line command for the segment example above. Timing does not take into account the time to generate the nan separated DataFrame needed by cvs.line.

And here's what scaling looks like on my quad-core 2015 MBP.

So the current cvs.line approach is consistently a little bit faster in the single-threaded case. But if the conversion time were taken into account this difference would probably get washed out. And for some reason the new cvs.lines approach seems to scale a little better in the multi-threaded case (at least for 3 and 4 cores).

TODO

Evaluate unifying this functionality with the current cvs.line method by adding a new "lines-by-rows-or-columns" argument. Either way, sketch out the future API for the by-rows and by-columns cases.
Rewrite last section of http://datashader.org/user_guide/3_Timeseries.html

philippjfr · 2019-01-22T14:44:14Z

datashader/glyphs.py

+    extend_line = _build_extend_line(draw_line, map_onto_pixel)
+
+    @ngjit
+    def extend_lines_xy(vt, bounds, xs, ys, plot_start, *aggs_and_cols):


I've had a go at rewriting this by adapting the extend_line code and that seems somewhat more efficient than unpacking this format and reusing the existing codepath. Probably needs some more fixes, but here's what my 10 minutes of hacking produced:

@ngjit def extend_lines_xy(vt, bounds, xs, ys, plot_start, *aggs_and_cols): """ here xs and ys are tuples of arrays and non-empty """ xmin, xmax, ymin, ymax = bounds nrows = xs[0].shape[0] ncols = len(xs) i = 0 while i < nrows - 1: plot_start = True j = 0 while j < ncols - 1: x0 = xs[j][i] y0 = ys[j][i] x1 = xs[j+1][i] y1 = ys[j+1][i] # Use Liang-Barsky (1992) to clip the segment to a bounding box if outside_bounds(x0, y0, x1, y1, xmin, xmax, ymin, ymax): plot_start = True j += 1 continue t0, t1 = 0, 1 dx = x1 - x0 t0, t1, accept = clipt(-dx, x0 - xmin, t0, t1) if not accept: j += 1 continue t0, t1, accept = clipt(dx, xmax - x0, t0, t1) if not accept: j += 1 continue dy = y1 - y0 t0, t1, accept = clipt(-dy, y0 - ymin, t0, t1) if not accept: j += 1 continue t0, t1, accept = clipt(dy, ymax - y0, t0, t1) if not accept: j += 1 continue if t1 < 1: clipped = True x1 = x0 + t1 * dx y1 = y0 + t1 * dy if t0 > 0: # If x0 is clipped, we need to plot the new start clipped = True plot_start = True x0 = x0 + t0 * dx y0 = y0 + t0 * dy x0i, y0i = map_onto_pixel(vt, bounds, x0, y0) x1i, y1i = map_onto_pixel(vt, bounds, x1, y1) draw_line(x0i, y0i, x1i, y1i, i, plot_start, clipped, *aggs_and_cols) plot_start = False j += 1 i += 1 return extend_lines_xy

Cool! @jonmmease, can you benchmark that like you did above?

Using this on a 10M vertex trimesh wireframe using dask this renders in 0.5 seconds as opposed to 1.5 seconds with cvs.line.

Awesome, thanks for giving this a try @philippjfr! I'll incorporate this approach and add it to my benchmarks

jbednar

Looks great! I think this will be very valuable to users, and I agree with your proposed directions 2 and 3. Very nice!

Can you please also update the user guide notebook http://datashader.org/user_guide/3_Timeseries.html to suggest using this data structure when your polylines are all of the same length, instead of the approaches proposed there now?

datashader/glyphs.py

jonmmease · 2019-01-22T15:48:37Z

@jbednar for documentation I was thinking of adding a section to http://datashader.org/user_guide/4_Trajectories.html for the case where you have a collection of fixed length trajectories (just make up some more random walks).

Until we have (2) from the description above I think the last section of http://datashader.org/user_guide/3_Timeseries.html would still be a bit awkward because the user would need to create columns for the x-data that would repeat for every row. With (2) it will be a one-liner without any utility functions or work-arounds needed.

What do you think?

jbednar · 2019-01-22T16:00:40Z

Ok, sounds good. In that case please briefly mention the new approach in 3_Timeseries, for now, and once approach 2 is implemented that section can be rewritten to use the new way.

BTW, can approach 2 be approximated already in the current PR's code simply by passing the y column n numbers of times as the y_label, for n x columns? It might not be quite as fast as having code that knows the y column is shared, but wouldn't require the user to duplicate the actual column, right?

jonmmease · 2019-01-22T16:09:36Z

BTW, can approach 2 be approximated already in the current codebase simply by passing the y column n numbers of times as the y_label, for n x columns? It might not be quite as fast, but wouldn't require duplicating the actual column, right?

I'm not sure, I'll give it another look when I incorporate @philippjfr 's suggestion above. And if something looks relatively straightforward, I'll just go ahead and add it.

jbednar · 2019-01-22T16:11:00Z

Note that I'm just suggesting a (untested) way for the user to use the current code, not a change to the code...

jonmmease · 2019-01-22T23:37:49Z

Updated performance results after implementing optimization suggested by @philippjfr in bc7631c:

Performance is now basically equivalent for single threaded-case and is much faster for multi-threaded case.

jonmmease · 2019-01-22T23:52:53Z

I added a new API consideration to the main description:

Additional consideration: For cases 1 and 2, we should also have a way to specify whether lines are
represented by rows (as I originally had in mind writing the descriptions above) or by columns. Representing lines by rows is useful when the line lengths are manageable, but there are lots and lots of lines. Specifying lines by columns is useful when the number of lines is manageable, but each line has lots and lots of points.

I need to think about it a little more, but if we have an axis style argument that determines whether lines are specified by rows or columns, then we might be able to unify this functionality with the current cvs.line method, rather than introduce a new cvs.lines method.

jonmmease · 2019-01-25T00:28:58Z

For the sake of discussion, here is a potential API that would allow this new functionality to be integrated into the existing cvs.line API. The basic idea is to add a pandas-style axis argument that determines whether lines are being aggregated per column (axis=0) or per row (axis=1).

New method signature:

def line(self, source, x=None, y=None, agg=None, axis=0, x_constants=None, y_constants=None):
    ...

Example source is a DataFrame with N rows

df = pd.DataFrame({
    'A1': [...],
    'A2': [...],
    'B1': [...],
    'B2': [...],
    'A_ragged': RaggedArray(...),
    'B_ragged': RaggedArray(...)
})

One line across all rows

Here is an example of the current behavior of aggregating one line per DataFrame. The following example will produce a single length N line with coordinates df.A1 by df.B1

cvs.line(df, x='A1', y='B1')

or with axis specified explicitly

cvs.line(df, x='A1', y='B1', axis=0)

Multiple lines across all rows

An extension of the current behavior would be to allow x and y to be set to lists of columns in order to support aggregating multiple lines. The following example would produce two length N lines. The first with coordinates df.A1 by df.B1 and the second with coordinates df.A2 by df.B2.

cvs.line(df, x=['A1', 'A2'], y=['B1', 'B2'], axis=0)

Multiple lines across all rows with one shared dimension

If all of the lines share the same x coordinates, then this column could be specified as a string rather
than a list. The following example would produce two length N lines. The first with coordinates df.A1 by df.B1 and the second with coordinates df.A1 by df.B2.

cvs.line(df, x='A1', y=['B1', 'B2'], axis=0)

Many fixed length lines

The following example would aggregate one length-2 line per row, where the ith line has coordinates
[df.A1[i], df.A2[i]] by [df.B1[i], df.B2[i]]

cvs.line(df, x=['A1', 'A2'], y=['B1', 'B2'], axis=1)

Many fixed length lines with one shared dimension

The following example would aggregate one length-2 line per row, where the ith line has coordinates
[0.0, 1.0] by [df.B1[i], df.B2[i]]

cvs.line(df, x_constants=[0.0, 1.0], y=['B1', 'B2'], axis=1)

Many variable length lines

The following example would aggregate one line per row, where the ith line has coordinates df.A_ragged[i] by df.B_ragged[i].

cvs.line(df, x='A_ragged', y='B_ragged', axis=1)

Invalid API usage

In my mind the biggest drawback to this API is that some of the combinations of parameters will be invalid and it may be confusing for users to understand why.

For example, x_constants/y_constants couldn't be used with axis=0 because in this case we don't know the line length in advance.

Also, specifying RaggedArray columns would only make sense in the axis=1 case.

But, I expect that a good docstring and helpful error messages should be enough to help people find the options that they want.

What do you think?

jbednar · 2019-01-25T03:22:04Z

It looks very flexible and powerful to me!

I wonder if there is some way to avoid having to introduce x_constants and y_constants as different names. I see it's not practical to deduce that [0.0, 1.0] is a constant coordinate array rather Pandas column names, since Pandas allows numeric column names. Could use tuples instead of lists, but that seems arbitrary. Maybe if it's an array, treat it as a constant axis to broadcast, but otherwise as column names?

jonmmease · 2019-01-25T11:42:41Z

I wonder if there is some way to avoid having to introduce x_constants and y_constants as different names

Yeah, I was hoping to avoid the need for x_constants/y_constants as well. But I've struggled to come up with a natural way to distinguish column labels from numeric vectors as well. Using a numeric numpy array for this might work out alright, but then I'm not sure what to do with a numpy array of strings or a numeric pandas Series. Do you have any thoughts @philippjfr ?

jbednar · 2019-01-25T11:47:17Z

I guess to refine my proposal then, an array or Pandas series type would be treated as a constant numeric vector, while if one has a numpy array of strings, you'd have to convert it to a list of strings for it to be treated as a list of column names.

jonmmease · 2019-01-25T22:47:14Z

I guess to refine my proposal then, an array or Pandas series type would be treated as a constant numeric vector, while if one has a numpy array of strings, you'd have to convert it to a list of strings for it to be treated as a list of column names.

Works for me!

jonmmease · 2019-01-27T00:20:29Z

Ok, I added the axis argument to Canvas.line and removed Canvas.lines. So now the following two usages are supported.

One line across all rows

cvs.line(df, x='A1', y='B1')
cvs.line(df, x='A1', y='B1', axis=0)

Many fixed length lines

cvs.line(df, x=['A1', 'A2'], y=['B1', 'B2'], axis=1)

The other usages described above can be added in a follow-on PR.

jonmmease · 2019-01-28T00:46:45Z

Alright, I went ahead and implemented all of the aggregation approaches in the proposed API except for RaggedArray.

Examples from the new `Canvas.line` docstring

Define a canvas and a pandas DataFrame with 6 rows

import pandas as pd
import numpy as np
from datashader import Canvas
import datashader.transfer_functions as tf
cvs = Canvas(plot_height=400, plot_width=400)
df = pd.DataFrame({
   'A1': [1, 1.5, 2, 2.5, 3, 4],
   'A2': [1.5, 2, 3, 3.2, 4, 5],
   'B1': [10, 12, 11, 14, 13, 15],
   'B2': [11, 9, 10, 7, 8, 12],
}, dtype='float64')
df

Aggregate one line across all rows, with coordinates df.A1 by df.B1

agg = cvs.line(df, x='A1' , y='B1')
tf.shade(agg)

Aggregate two lines across all rows. The first with coordinates df.A1 by df.B1 and the second with coordinates df.A2 by df.B2

agg = cvs.line(df, x=['A1', 'A2'], y=['B1', 'B2'], axis=0)
tf.shade(agg)

Aggregate two lines across all rows where the lines share the same x coordinates. The first line will have coordinates df.A1 by df.B1 and the second will have coordinates df.A1 by df.B2.

agg = cvs.line(df, x='A1', y=['B1', 'B2'], axis=0)
tf.shade(agg)

Aggregate 6 length-2 lines, one per row, where the ith line has coordinates [df.A1[i], df.A2[i]] by [df.B1[i], df.B2[i]]

agg = cvs.line(df, x=['A1', 'A2'], y=['B1', 'B2'], axis=1)
tf.shade(agg)

Aggregate 6 length-4 lines, one per row, where the x coordinates of every line are [0, 1, 2, 3] and the y coordinates of the ith line are [df.A1[i], df.A2[i], df.B1[i], df.B2[i]].

agg = cvs.line(df, x=np.arange(4), y=['A1', 'A2', 'B1', 'B2'], axis=1)
tf.shade(agg)

Documentation

I haven't updated the documentation notebooks yet, but with these changes I'll be able to rewrite the last section of http://datashader.org/user_guide/3_Timeseries.html and remove the use of the dataframe_from_multiple_sequences utility function all together.

Performance

Updated performance results for aggregation time of length-2 line segments

jbednar · 2019-01-28T13:12:24Z

Looks great!!! Thanks for rounding that out. Seems like a very reasonable API now.

This allows subclasses to override how the DataFrame is used to compute the bounds

…h_dispatch

_compute_x_bounds -> compute_x_bounds Co-Authored-By: jonmmease <jon.mease@gmail.com>

calling existing extend_lines function.

…vas.line

jonmmease · 2019-01-30T14:08:08Z

I added the Dask tests, and rewrote the last section of the Timeseries documentation to use the new approach.

Ready for review @jbednar

jbednar

Looks fabulous! This work really helps make all the various scenarios that come up in practice be handled efficiently and conveniently. Great work. I did look at every line of the code involved, but I have to admit that my eyes glaze over a bit in some of the deep line-drawing bits, so I can't honestly say that I'd have detected any problems there if any exist. :-) But from what I could focus on, it looks good, and is ready to merge when tests pass.

jbednar · 2019-01-30T22:18:05Z

datashader/glyphs.py

+        return minval, maxval
+
+    @staticmethod
+    def maybe_expand_bounds(bounds):


Useful thing to refactor, thanks.

jbednar · 2019-01-30T22:54:01Z

I modified the last example in the user-guide notebook so that it was a random walk with a fixed starting point:

This now seems like a very reasonable synthesized approximation to something people will use in practice, i.e. overlaying huge numbers of lines with the same starting point.

jonmmease · 2019-01-31T00:03:32Z

Hmm, hitting an AppVeyor permission denied error on Python 2.7. Is this the same one we've been fighting with?

to work around conda/conda#8155

jonmmease · 2019-01-31T23:59:12Z

Merging so I can go ahead and rebase #696. Thanks for the review and doc updates @jbednar!

philippjfr reviewed Jan 22, 2019

View reviewed changes

jbednar requested changes Jan 22, 2019

View reviewed changes

datashader/glyphs.py Outdated Show resolved Hide resolved

datashader/glyphs.py Outdated Show resolved Hide resolved

jonmmease force-pushed the enh_lines_glyph branch from 47779fb to 72b2c68 Compare January 22, 2019 15:20

jonmmease mentioned this pull request Jan 24, 2019

Initial Dask trimesh support #696

Merged

jonmmease and others added 12 commits January 30, 2019 08:41

wrap _compute_*_bounds static methods with compute_*_bounds methods

0c119e6

This allows subclasses to override how the DataFrame is used to compute the bounds

Small refactor to remove the need for a specialized _PolygonLike glyp…

2d3b3b5

…h_dispatch

Refactor to extract required_columns glyph method

2accbca

Initial cvs.lines and LinesXY glyph

7ad8d5e

Simplify the cvs.lines signature and docstring for LinesXY case

81f1cc3

Add autorange support to LinesXY

6ba6b9b

Implement validate in LinesXY

b0d2932

Add dask autorange support to LinesXY

9d49e6b

Comment updates

6ba1fd4

flake8, unused imports

f3f749f

Update datashader/glyphs.py

49e0fe7

_compute_x_bounds -> compute_x_bounds Co-Authored-By: jonmmease <jon.mease@gmail.com>

Update datashader/glyphs.py

b3bb597

_compute_x_bounds -> compute_x_bounds Co-Authored-By: jonmmease <jon.mease@gmail.com>

jonmmease added 11 commits January 30, 2019 08:41

Optimize extend_lines_xy with direct implementation rather than

d6ecd14

calling existing extend_lines function.

flake8: Unused variable

2d210a0

Add glyph hash functions so that memoization behaves predictably

67194f6

Add axis argument to Canvas.line and move Canvas.lines logic into Can…

feb5381

…vas.line

Python 2 super

9f5b853

try again. Python 2 super

ecf6878

Implement additional line aggregation approaches

62162d6

doctest skip and docstring updates

24d5946

Added dask dataframe tests for line axis combinations

5060e53

Numeric column labels are converted to strings by datashape

8696a95

Update Timeseries user-guide

dc8e83f

jonmmease force-pushed the enh_lines_glyph branch from 9d78c9e to dc8e83f Compare January 30, 2019 14:05

Minor cleanup and adaptation for the timeseries nb

7e02687

jbednar approved these changes Jan 30, 2019

View reviewed changes

datashader/glyphs.py

return minval, maxval

@staticmethod

def maybe_expand_bounds(bounds):

Copy link

Member

jbednar Jan 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useful thing to refactor, thanks.

jonmmease added 4 commits January 30, 2019 19:48

Pin conda < 4.6 on AppVeyor

9886175

to work around conda/conda#8155

Move conda pin alongside pyctdev install

82b9bee

poke CI

93bd61d

Try pinning conda version after ecosystem_setup

cf1041a

jbednar changed the title ~~Add Canvas.lines for rendering a collection of lines (one per row)~~ Support for row-oriented coordinates in Canvas.line Jan 31, 2019

jonmmease merged commit 1238bfb into master Jan 31, 2019

jonmmease deleted the enh_lines_glyph branch January 31, 2019 23:59

philippjfr mentioned this pull request May 6, 2019

Roadmap: Task 6b: Optimize rendering of meshes and polygons holoviz-topics/EarthSim#85

Closed

jonmmease mentioned this pull request Aug 21, 2019

Add polygon support #181

Closed

jbednar mentioned this pull request May 24, 2021

Point antialiasing #1010

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for row-oriented coordinates in Canvas.line #694

Support for row-oriented coordinates in Canvas.line #694

jonmmease commented Jan 22, 2019 •

edited

Loading

philippjfr Jan 22, 2019 •

edited

Loading

jbednar Jan 22, 2019

philippjfr Jan 22, 2019

jonmmease Jan 22, 2019

jbednar left a comment

jonmmease commented Jan 22, 2019

jbednar commented Jan 22, 2019 •

edited

Loading

jonmmease commented Jan 22, 2019

jbednar commented Jan 22, 2019

jonmmease commented Jan 22, 2019

jonmmease commented Jan 22, 2019

jonmmease commented Jan 25, 2019

jbednar commented Jan 25, 2019 •

edited

Loading

jonmmease commented Jan 25, 2019

jbednar commented Jan 25, 2019

jonmmease commented Jan 25, 2019

jonmmease commented Jan 27, 2019

jonmmease commented Jan 28, 2019 •

edited

Loading

jbednar commented Jan 28, 2019

jonmmease commented Jan 30, 2019

jbednar left a comment

jbednar Jan 30, 2019

jbednar commented Jan 30, 2019

jonmmease commented Jan 31, 2019

jonmmease commented Jan 31, 2019

Support for row-oriented coordinates in Canvas.line #694

Support for row-oriented coordinates in Canvas.line #694

Conversation

jonmmease commented Jan 22, 2019 • edited Loading

Overview

Example

Performance

TODO

philippjfr Jan 22, 2019 • edited Loading

Choose a reason for hiding this comment

jbednar Jan 22, 2019

Choose a reason for hiding this comment

philippjfr Jan 22, 2019

Choose a reason for hiding this comment

jonmmease Jan 22, 2019

Choose a reason for hiding this comment

jbednar left a comment

Choose a reason for hiding this comment

jonmmease commented Jan 22, 2019

jbednar commented Jan 22, 2019 • edited Loading

jonmmease commented Jan 22, 2019

jbednar commented Jan 22, 2019

jonmmease commented Jan 22, 2019

jonmmease commented Jan 22, 2019

jonmmease commented Jan 25, 2019

One line across all rows

Multiple lines across all rows

Multiple lines across all rows with one shared dimension

Many fixed length lines

Many fixed length lines with one shared dimension

Many variable length lines

Invalid API usage

jbednar commented Jan 25, 2019 • edited Loading

jonmmease commented Jan 25, 2019

jbednar commented Jan 25, 2019

jonmmease commented Jan 25, 2019

jonmmease commented Jan 27, 2019

One line across all rows

Many fixed length lines

jonmmease commented Jan 28, 2019 • edited Loading

Examples from the new Canvas.line docstring

Documentation

Performance

jbednar commented Jan 28, 2019

jonmmease commented Jan 30, 2019

jbednar left a comment

Choose a reason for hiding this comment

jbednar Jan 30, 2019

Choose a reason for hiding this comment

jbednar commented Jan 30, 2019

jonmmease commented Jan 31, 2019

jonmmease commented Jan 31, 2019

jonmmease commented Jan 22, 2019 •

edited

Loading

philippjfr Jan 22, 2019 •

edited

Loading

jbednar commented Jan 22, 2019 •

edited

Loading

jbednar commented Jan 25, 2019 •

edited

Loading

jonmmease commented Jan 28, 2019 •

edited

Loading

Examples from the new `Canvas.line` docstring