Nicolas Silva edited this page Oct 14, 2016 · 9 revisions

Here is a list of some thoughts in a rough state, mostly things learned along the way.

No one-size-fits-all solution

The tessellator-based approach implemented here is meant to be used to render paths covering a large number of pixels. The idea is that the GPU is a lot better than the CPU at filling many pixels if we provide it with the information in the right form, so we go through a rather expensive tessellation process on the CPU side in order to be able to fill many pixels on the GPU at a reduced cost. For very small paths (most text in web pages, for example), tessellation is probably not going to be the best trade-off, since generating the tessellation may take as long as rasterizing the shape using a good CPU rasterizer.

Some other path rendering strategies involve less CPU-side processing, and trade it with more complex fragment shaders. It would be interesting to experiment with the various strategies. I assume they'll have different performance profiles depending on the type of shapes and the resolution at which they are being rendered.

The tessellator should be good at:

  • Working reliably on any hardware that can render a triangle, even with old versions of OpenGL and whatnot.
  • Simple integration into any GPU-based rendering pipeline, easy to batch with other types of meshes.
  • Filling a lot of pixels (large full-screen SVG drawings, etc.).
  • Less subject to driver bugs than any other approach I know of.

The main drawbacks:

  • Expensive CPU-side preparation step that needs to happen at the first frame.
  • Can generate a lot of geometry.

I doubt the tessellation approach will ever be a good fit for small text for example.

Using the GPU as a triangle-processing device or a parallel-computing device

Right now I believe that working with triangles is what yields the best performance if a lot of fragments are to be shaded. It is also the most portable approach, since GPUs have been designed to process triangles from the very beginning.

On the other hand, it is not the most elegant way to think about rendering, and it requires some non-trivial CPU work at initialization to massage vector shapes into the soup of triangles that GPUs like to eat.

I think that in the long run, GPUs will get flexible enough that an entire scanline-type path rendering algorithm could happen on the GPU in the most efficient way, relying mostly on things like compute shaders. It could already be the case on recent desktop GPUs, but for now a large part of the mobile hardware out there does not provide this kind of flexibility.

Using connectivity information in the algorithm

The first versions of the fill tessellator used the connectivity of the edges to determine the type of vertex the sweep line was processing at each step. For example, when looking at a vertex V, we'd ask the Path data structure for its previous and next vertices in order to determine the type of event (left, right, start, end, merge, split). Working this way is very simple, but it falls down in certain edge cases, such as when two start events are interleaved (TODO, illustration).

The tessellator inherited this strategy from the computational geometry book, in which the monotone polygon tessellation algorithm assumes polygons with no self-intersections and therefore does not have to handle these problematic cases.

In order to fix this, the tessellator now accumulates all edges that start at the current point and handles them all at once. This adds a fair amount of complexity, but makes it possible to consider all cases, including several pairs of edges intersecting at the same position.

Another benefit of working this way is that the tessellator is now independent from the path data structure. The input is a sorted list of edges that does not need to retain extra connectivity information, and can be constructed from an iterator or built directly with the PathBuilder interface. As a result, if a program uses the tessellator with a custom path data structure (such as a half-edge mesh), tessellating does not require re-building a particular type of path object, as long as the custom path provides an iterator of path events or SVG events.
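As a loose illustration of this input model (the names and shapes here are hypothetical, not lyon's actual API), the sweep can sort edges by their upper endpoint and then handle every edge sharing the current position in one step:

```rust
// Hypothetical sketch (not lyon's actual API): the input is a flat, sorted
// list of edges with no connectivity information; the sweep processes all
// edges that start at the current position together.

#[derive(Clone, Copy, PartialEq, Debug)]
struct Point { x: i32, y: i32 }

#[derive(Clone, Copy, Debug)]
struct Edge { upper: Point, lower: Point }

// Group edges by shared upper endpoint, in sweep order (top to bottom,
// then left to right).
fn sweep_groups(mut edges: Vec<Edge>) -> Vec<Vec<Edge>> {
    edges.sort_by_key(|e| (e.upper.y, e.upper.x));
    let mut groups: Vec<Vec<Edge>> = Vec::new();
    for e in edges {
        let starts_new_group = groups.last().map_or(true, |g| g[0].upper != e.upper);
        if starts_new_group {
            groups.push(vec![e]);
        } else {
            groups.last_mut().unwrap().push(e);
        }
    }
    groups
}

fn main() {
    let edges = vec![
        Edge { upper: Point { x: 0, y: 0 }, lower: Point { x: 1, y: 3 } },
        Edge { upper: Point { x: 5, y: 1 }, lower: Point { x: 5, y: 4 } },
        Edge { upper: Point { x: 0, y: 0 }, lower: Point { x: -2, y: 2 } },
    ];
    // The two edges starting at (0, 0) end up in the same group.
    println!("{:?}", sweep_groups(edges));
}
```

Handling a whole group at once is what lets interleaved start events and multiple intersections at the same position fall out of a single code path.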

That said, all of the other open-source tessellators (including GLU/libtess2 and Skia) that I have seen so far use a half-edge data structure and multiple passes, so it might turn out that doing this differently does not make a big difference.

Floating point precision versus integer overflows

The tessellator was initially implemented with 32-bit floating point numbers. This caused a lot of bugs in the intersection detection code, and in the way the sweep line accumulates edges at the current position by comparing positions. I would have saved a lot of time and effort if I had followed in Cairo's and Skia's footsteps and gone with integer or fixed point coordinates from the start.

TODO: a lot of war stories to tell here. Variable precision is terribly bug-prone for this type of algorithm.

Integer and fixed point number arithmetic can also have different kinds of precision issues when performing divisions and multiplications.

The problem with fixed point and integer coordinates is that it is very easy to run into overflows, because they can describe a much smaller range than floating point numbers. There's a trade-off to make between how much precision we want in the fractional part and the size of the range. Currently the tessellator works with 32-bit fixed point numbers with 16 bits for the fractional part, while the line segment intersection computation is implemented with f64 floating point numbers. There have been experiments with various fixed point precisions and plain integer arithmetic, but it's quite hard not to run into overflows with them when postponing divisions (otherwise the precision loss is worse than float math, so why bother).

The challenge, when trying not to lose any precision, is that every time the internal representations of two fixed point numbers are multiplied, we potentially double the number of bits required to hold the result, which quickly grows beyond the 64 bits we are comfortable working with. So postponing all divisions to the end of a long series of computations is not always a viable option.
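A small illustration of the issue (my own toy sketch, not the tessellator's code): with 16.16 fixed point numbers, a single multiplication already needs a 64-bit intermediate, so chaining several multiplications before dividing quickly exhausts the available bits.

```rust
// Toy 16.16 fixed point arithmetic. Multiplying two values doubles the number
// of bits needed for the raw product, so the intermediate is widened to i64
// before shifting the extra fractional bits back out.

const FRACT_BITS: u32 = 16;

fn to_fixed(v: f64) -> i32 {
    (v * (1 << FRACT_BITS) as f64) as i32
}

fn to_float(v: i32) -> f64 {
    v as f64 / (1 << FRACT_BITS) as f64
}

fn fixed_mul(a: i32, b: i32) -> i32 {
    // a * b has 32 fractional bits; shift 16 of them back out.
    ((a as i64 * b as i64) >> FRACT_BITS) as i32
}

fn main() {
    println!("1.5 * 2.0 = {}", to_float(fixed_mul(to_fixed(1.5), to_fixed(2.0))));
}
```

Multiplying two such 64-bit intermediates together (as happens when postponing the division in an intersection formula) would need 128 bits, which is where the trouble starts.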

One way to mitigate the overflow issue could be to enforce tiled tessellation, with small enough tiles that the limited fixed point range is no longer an issue. The tessellated geometry can be a lot bigger than what the range can describe, simply by adding tiles.

For reference, Skia's tessellator stores positions using 16.16 fixed point numbers and uses doubles to compute edge intersections. Since the resulting intersections are not exact, corrective measures are taken later in the algorithm to fix cases where the imprecision causes some invariants of the algorithm to be broken. This may turn out to be the best compromise (unfortunately).

Tessellator performance

Self-intersections

Initial perf measurements show that about half of the time is spent dealing with intersections. Optimizing the segment-segment intersection routine can improve the situation, but running a pre-pass that checks as quickly as possible whether there are any self-intersections, and skipping that work when there are none, may be the best approach. I think that FastUIDraw's tessellator does something along those lines.

Tessellating curves

Tessellating the Rust logo without the curves is more than twice as fast as tessellating a flattened version of the shape with tolerance 0.05. This encourages me to pursue the idea of handling curves in the sweep line rather than flattening beforehand.

Working with iterators

Path flattening used to be done by PathBuilder, which meant that we would build and store a path object containing a lot of small line segments instead of the bezier curves. If all we need to do is iterate over flattened paths, storing the flattened path is not necessary. We can store the compact, resolution-independent bezier path and flatten it on the fly while iterating, which saves memory and does not require throwing away and re-building the path when the zoom level changes.
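To sketch the idea (a simplified illustration, not lyon's flattening code, which is tolerance-based rather than fixed-step): a quadratic bezier can be flattened lazily by an iterator, so only the compact control points are ever stored.

```rust
// A simplified on-the-fly flattening iterator for a quadratic bezier curve.
// Real flattening is adaptive (driven by a tolerance threshold); fixed-step
// subdivision is used here for brevity.

#[derive(Clone, Copy, PartialEq, Debug)]
struct Point { x: f32, y: f32 }

fn lerp(a: Point, b: Point, t: f32) -> Point {
    Point { x: a.x + (b.x - a.x) * t, y: a.y + (b.y - a.y) * t }
}

// Yields the endpoints of `n` line segments approximating the curve, lazily,
// without ever storing the flattened path.
fn flatten_quadratic(from: Point, ctrl: Point, to: Point, n: u32)
    -> impl Iterator<Item = Point>
{
    (1..=n).map(move |i| {
        let t = i as f32 / n as f32;
        // De Casteljau evaluation at parameter t.
        lerp(lerp(from, ctrl, t), lerp(ctrl, to, t), t)
    })
}

fn main() {
    let from = Point { x: 0.0, y: 0.0 };
    let ctrl = Point { x: 1.0, y: 2.0 };
    let to = Point { x: 2.0, y: 0.0 };
    for p in flatten_quadratic(from, ctrl, to, 4) {
        println!("{:?}", p);
    }
}
```

Changing the zoom level then only means iterating again with a different step count (or tolerance), instead of rebuilding a stored flattened path.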

Overdraw

It would be interesting to render all overlapping shapes in one pass and only write to the frame buffer once, a bit like WebRender does. This would probably be a bit complicated to set up with a tessellator: the tessellated geometry would contain all paths, and if one of the paths starts animating, the whole shape would need to be tessellated again. To do this we'd need to go back to using a path representation that stores the connectivity (a half-edge mesh, probably) and tessellate everything at once. It would certainly be more expensive on the CPU side (first frame), and cheaper on the GPU side (all frames) as long as all paths stay together.

Or simply do the good old depth pre-pass trick for opaque paths (I suspect this will play better with MSAA than vertex AA, though).

Other rendering approaches, like vector textures, would probably play better with the idea of minimal overdraw (maybe).

Tiled virtual textures

There is a growing trend of splitting rendering into small fixed-size tiles in modern game engines. See, for example, idTech 6's shadow atlas, its texture-space shading of particles, and more generally sparse virtual textures; lots of other engines use similar tricks. FastUIDraw also uses a tiled atlas for its image cache.

There are some nice advantages of organizing image data this way:

  • Managing the atlas memory becomes very easy and space-efficient, regardless of the sizes of the elements in the atlas.
  • Culling fixed-size tiles is easier than culling variable-sized elements.
  • If there is a lot of per-tile work to do on the CPU, it is often possible to do it in parallel.
  • It becomes easy to only allocate atlas memory for the visible tiles of a resource.
  • Managing invalidation and partial updates in the atlas is simpler. idTech 6's shadow atlas doesn't recompute tiles that have not changed since the previous frame.

One downside is that most implementations have to use an indirection texture, adding an extra texture lookup. It's also important to be careful about sampling artifacts at tile boundaries (a lot of implementations seem to add 1px of padding around each tile).

For 2d rendering, this decoupling between an object's total size and the portion that is actually rendered is particularly interesting. If a big object is several times larger than the viewport, all of the memory and work going into processing content that is off the viewport is wasted. Working at the tile level makes culling easier and removes the object size from the equation.
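A rough sketch of that last point (a hypothetical helper, assuming non-negative coordinates and a 256-pixel tile size): culling only depends on the viewport, not on the size of the object.

```rust
// With fixed-size tiles, finding what to render reduces to computing the
// inclusive range of tile indices overlapping the viewport. Assumes
// non-negative coordinates for brevity.

const TILE_SIZE: i32 = 256;

// Returns ((first_tile_x, last_tile_x), (first_tile_y, last_tile_y)).
fn visible_tiles(x: i32, y: i32, w: i32, h: i32) -> ((i32, i32), (i32, i32)) {
    (
        (x / TILE_SIZE, (x + w - 1) / TILE_SIZE),
        (y / TILE_SIZE, (y + h - 1) / TILE_SIZE),
    )
}

fn main() {
    // A 600x300 viewport at (100, 0) touches tiles 0..=2 in x and 0..=1 in y,
    // no matter how large the underlying object is.
    println!("{:?}", visible_tiles(100, 0, 600, 300));
}
```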

Font-rs

source code: https://github.com/google/font-rs

Font-rs renders on the CPU but is interesting (and could be ported to the GPU with some compute shader trickery). It is a beautiful illustration of the "simpler is faster" rule of thumb. Instead of sorting edges from top to bottom and doing a sweep-line type of algorithm, font-rs renders each segment immediately into a buffer (the accumulation buffer) which contains the signed contributions of the outline. All texels in the accumulation buffer that don't intersect the outline are equal to 0, and texels that touch the outline have a positive or negative value depending on the direction of the outline. The accumulation buffer is then integrated in scanline order with a prefix sum (for each pixel, the integrated value is the sum of its own value and the previous values). All pixels with a non-zero integrated value are inside the shape. Font-rs differs from FreeType in that the latter uses a sparse representation for the signed contributions instead of writing them into an image, and FreeType has to sort its edges beforehand.
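A toy CPU-side sketch of the scheme just described (with binary winding values instead of font-rs's antialiased fractional coverage, and hypothetical names): edges deposit signed values in an accumulation buffer, and a per-scanline prefix sum recovers the fill.

```rust
// Each edge crossing deposits a signed winding contribution (+1 or -1) at a
// texel of the accumulation buffer; no sorting is needed.
fn accumulate(width: usize, height: usize,
              crossings: &[((usize, usize), i32)]) -> Vec<i32> {
    let mut acc = vec![0; width * height];
    for &((x, y), w) in crossings {
        acc[y * width + x] += w;
    }
    acc
}

// Integrate in scanline order with a prefix sum: non-zero means the pixel is
// inside the shape.
fn integrate(acc: &[i32], width: usize) -> Vec<i32> {
    let mut out = vec![0; acc.len()];
    for row in 0..acc.len() / width {
        let mut sum = 0;
        for x in 0..width {
            sum += acc[row * width + x];
            out[row * width + x] = sum;
        }
    }
    out
}

fn main() {
    // An 8x2 buffer with one filled span per row.
    let acc = accumulate(8, 2, &[((2, 0), 1), ((6, 0), -1), ((1, 1), 1), ((5, 1), -1)]);
    println!("{:?}", integrate(&acc, 8));
}
```

Note how the two passes never look at more than one texel at a time, which is what makes the approach so friendly to parallel hardware.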

Font-rs is a good fit for parallelism because it doesn't sort its edges (it draws them directly into the accumulation buffer) and doesn't have to do a sweep-line type of traversal like FreeType (or lyon's tessellator), which is very sequential in nature, and the integration step is very easy to do in parallel.

The cost of the image representation of the accumulation buffer goes up (both in processing time and memory usage) with the resolution of the texture. Fonts being typically very small, this approach works very well for them. For very large vector graphics, this approach has some interesting challenges (especially when ported to the GPU). I suspect that to get good performance out of it, one would want to batch several paths in the same accumulation buffer atlas to avoid alternating between the accumulation and integration phases.

It would be fun to try separating the output target into tiles, rendering the contributions of all paths for each tile into an atlas accumulation texture, and then integrating all of the paths at once.