Add optional character joiner #1460

princjef · 2018-05-19T22:05:11Z

The xterm.js renderer currently renders all text cell by cell, which prevents font ligatures from being rendered in fonts like Fira Code. This PR addresses part of #958 by modifying the interface of the xterm.js renderer to allow a "character joiner" to be present during the rendering of foreground text.

When present, the joiner is called with each input text sequence that has styles which can be joined (i.e. has the same foreground color and flags). The function returns an array of tuples, where each tuple holds a start and end index of a subrange that should be rendered together. Here's an example for a function which joins -> sequences:

joiner('a -> b -> c') // [[2, 4], [7, 9]]

Any ranges returned by the joiner will be rendered together by the renderer. While this is not terribly useful all by itself, it enables font ligature support when paired with a joiner that understands ligatures for the current font (which will be provided as part of a separate addon).
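For illustration, a joiner for the `->` example above could look roughly like the sketch below. The name `arrowJoiner` and the `JoinedRange` alias are assumptions made for this example and are not part of the PR; the end index is treated as exclusive, which matches the `[[2, 4], [7, 9]]` output shown earlier.

```typescript
// Hypothetical joiner: returns [start, end) index pairs for every "->" in the text.
type JoinedRange = [number, number];

function arrowJoiner(text: string): JoinedRange[] {
  const ranges: JoinedRange[] = [];
  let index = text.indexOf('->');
  while (index !== -1) {
    // The end index is exclusive, so a two-character match spans [index, index + 2).
    ranges.push([index, index + 2]);
    index = text.indexOf('->', index + 2);
  }
  return ranges;
}

const arrowRanges = arrowJoiner('a -> b -> c'); // [[2, 4], [7, 9]]
```

A real ligature joiner would consult the font rather than hard-code a sequence, but the contract (text in, sorted index pairs out) stays the same.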

The character joiner can be added and removed by calling registerCharacterJoiner() and deregisterCharacterJoiner() on the terminal instance, respectively.

Some questions/things to consider:

The current interface requires that registerCharacterJoiner() be called after terminal.open() so that the renderer instance is present. No error is thrown currently if the renderer is not present. Is there an alternative approach that is preferred for this kind of setup?
Only one character joiner is allowed to be registered at a time. Subsequent calls to registerCharacterJoiner() will override the previously registered joiner. While additional joiners could theoretically be supported by combining any joined ranges from the various joiners, it seemed like more complexity than it was worth, especially since I can't see a scenario where someone would need such a setup.
I'm not 100% confident that double-width characters are handled properly, so I would especially appreciate any suggestions/pointers regarding that logic.
Is there a recommended way to test the logic in the renderer? I have manually verified the code with a variety of inputs, but I can't find tests of the existing logic that I could build on for this change.
I'm open to name suggestions to replace "character joiner." I just wanted to make the terminology generic enough so that people don't think of it as being just for font ligatures.
When a range of characters is combined, I am currently passing Infinity as the code to the character drawing function to avoid caching. I also considered passing -1 to avoid any future interpretation as an actual character, but I see several places where comparisons like code < 256 are made, so it seems like the rest of that codebase assumes that the code is always 0 or greater. I'm open to changing it and updating the necessary places if people want me to.

jerch · 2018-05-20T12:40:55Z

@princjef Nice, this would also be a good entry point for grapheme support in the future, maybe as an addon. Can you tell if the ligature detection will be able to spot most grapheme clusters too (since most end up being rendered differently) or if we have to build those blocks beforehand?

Some notes on grapheme support:
Basic idea is to check whether two consecutive codepoints can be split without altering the perceived character meaning (algorithm is here: http://unicode.org/reports/tr29/). Those non-breaking characters can stack; in some ancient writing systems a cluster can be more than 5 chars long. To decide whether two chars can be split, only the previous and the following char are needed.
Nowadays this affects some Indian scripts and, I think, Thai and Korean (not sure about the last two). I'm also not sure whether modern Chinese and Japanese are free of those constructions. Note that umlauts in German and French can also be built from several chars as grapheme clusters, but those are already handled in wcwidth atm.

princjef · 2018-05-20T16:55:46Z

@jerch I'm not very familiar with grapheme clusters, but from what I can tell the principles are similar in certain cases. If the graphemes are rendered as ligatures in the font, it's likely that they would use very similar logic to programming ligatures and could potentially be covered with the same logic/addon. However, the link you sent also mentions this:

Display of Grapheme Clusters. Grapheme clusters are not the same as ligatures. For example, the grapheme cluster “ch” in Slovak is not normally a ligature and, conversely, the ligature “fi” is not a grapheme cluster. Default grapheme clusters do not necessarily reflect text display. For example, the sequence <f, i> may be displayed as a single glyph on the screen, but would still be two grapheme clusters.

So in that regard, I think that this feature will work for the purposes of rendering grapheme clusters that have special rendering as a unit within the current font. However, the implementation in this PR does not treat such constructs as single units for purposes of selection or deletion as described in the link you provided. In the case of programming ligatures (or font ligatures of other types), the character joining is purely a visual treatment, so such behavior doesn't make sense.

Perhaps there would need to be a companion function to handle the selection/deletion properties of the grapheme clusters, since it sounds like that behavior doesn't necessarily map 1-1 to a rendered ligature.

Tyriar · 2018-05-20T17:49:12Z

@jerch I was thinking we would leverage the eventual ligature support for these languages as well. Could we do that based on unicode ranges? For example all adjacent characters from U+0E00-U+0E7F (Thai) are drawn consecutively. If this is the approach we take maybe we should allow multiple character joiners and have a built in one to handle i18n? This would also mirror how link handlers work.
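A range-based joiner along these lines might look like the sketch below. It is only an illustration of the idea, not code from the PR: `thaiJoiner` and `isThai` are hypothetical names, and joining every run of adjacent Thai characters is a simplification of what real grapheme handling would need.

```typescript
type JoinedRange = [number, number];

// Hypothetical helper: true when the UTF-16 code unit at `i` lies in U+0E00-U+0E7F.
function isThai(text: string, i: number): boolean {
  const code = text.charCodeAt(i);
  return code >= 0x0e00 && code <= 0x0e7f;
}

// Returns one [start, end) range per run of two or more adjacent Thai characters.
function thaiJoiner(text: string): JoinedRange[] {
  const ranges: JoinedRange[] = [];
  let start = -1;
  // Iterate one step past the end so a run touching the end of the text is closed.
  for (let i = 0; i <= text.length; i++) {
    const inBlock = i < text.length && isThai(text, i);
    if (inBlock && start === -1) {
      start = i;
    } else if (!inBlock && start !== -1) {
      // A single character is already drawn on its own, so only longer runs are joined.
      if (i - start > 1) {
        ranges.push([start, i]);
      }
      start = -1;
    }
  }
  return ranges;
}
```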

However, the implementation in this PR does not treat such constructs as single units for purposes of selection or deletion

I think selection will probably be fine still, it might look a little off (for only RTL languages?) but we can handle that later. We don't have to worry about deletion as it's handled by the shell.

Here's the closest issue tracking that atm #701, we would probably want to support grapheme clusters for LTR languages first.

Tyriar · 2018-05-20T17:50:55Z

@princjef I'm in the process of adding a way to swap the renderer out #1432, these character joiners only apply to the canvas renderer which makes the API a little confusing. Perhaps calling it out that it only applies to the canvas renderer in the jsdoc is enough?

@bgw any thoughts on how it integrates with TextRenderLayer?

jerch · 2018-05-20T18:19:32Z

I was thinking we would leverage the eventual ligature support for these languages as well. Could we do that based on unicode ranges? For example all adjacent characters from U+0E00-U+0E7F (Thai) are drawn consecutively. If this is the approach we take maybe we should allow multiple character joiners and have a built in one to handle i18n? This would also mirror how link handlers work.

Yes, multiple character joiners should do the trick. Grapheme support should also cover all mandatory ligatures (even the emoji color joiners are defined as graphemes), but not the "nice to have" alternatives. Not sure yet if we need the full algo or can get away with some cheaper range assumptions. The algo is kinda expensive since every char needs to be evaluated twice (as next and as previous) with much more workload on every eval step than the old wcwidth. 😞

Edit:
Ah about the character joiner - would be good to get rid of joiners that join stuff that got split beforehand (kinda #791 related).

princjef · 2018-05-21T15:30:38Z

If this is the approach we take maybe we should allow multiple character joiners and have a built in one to handle i18n? This would also mirror how link handlers work.

@Tyriar I think that makes sense. I almost implemented it in the first pass just so it aligned with the link matchers. Having a second use case for the joiner seems like sufficient reason to add the extra logic.

these character joiners only apply to the canvas renderer which makes the API a little confusing. Perhaps calling it out that it only applies to the canvas renderer in the jsdoc is enough?

Because of how different the canvas and DOM are from a technology perspective, I would expect that more features for only one or the other will pop up over time. Maybe we can segment these kinds of capabilities in a way that more strongly indicates which renderer they're used for. Perhaps namespacing it with something like this.canvasRenderer.registerCharacterJoiner() (though that raises the question of what that canvasRenderer object contains and whether it exists if you're not using the canvas renderer). Another option is to put it in the name: this.registerCanvasCharacterJoiner()

Ah about the character joiner - would be good to get rid of joiners that join stuff that got split beforehand

@jerch can you elaborate on what you mean by this?

Tyriar · 2018-05-21T15:39:00Z

Because of how different the canvas and DOM are from a technology perspective, I would expect that more features for only one or the other will pop up over time. Maybe we can segment these kinds of capabilities in a way that more strongly indicates which renderer they're used for.

Well the DOM renderer is intentionally barebones. I'm thinking we just add a disclaimer in the jsdoc of rendererType (added in my PR) that ligatures are not supported in the DOM renderer.

jerch · 2018-05-21T15:49:48Z

@jerch can you elaborate on what you mean by this?

This is just a little rant in my own direction, for this reason: at an early stage the terminal splits the chars into the cell model, just to join them back together later. This can be optimized if done right at the early stage. It's partly my fault that the early stage does that.

princjef · 2018-06-02T17:33:46Z

Updated with support for multiple character joiners.

Tyriar · 2018-06-09T10:21:33Z

@princjef sorry for the delay in looking at this as I've been a bit busy. I just wanted a status update on this, would you say this PR is good to go from your perspective and just needs to react to feedback? Trying to budget time for it 😄

princjef · 2018-06-12T03:48:12Z

@Tyriar no worries I know how it goes 😄

I'm happy with it in its current state. The main place that I'd love some extra attention for feedback is the treatment of double width/zero width/etc. characters when determining the width of the replacement. I'm not familiar with all of the ins and outs there so I'm sure there are some bugs.

Let me know if it would be helpful to see the usage in the xterm-ligature-support plugin. I've written it all up and tested it (sans some polish) but wanted to hold off throwing it in the repo until I had a valid version of xterm.js to point at.

Tyriar · 2018-06-12T08:39:29Z

@princjef great, I'll try to get some time to look at this over the next couple of weeks (depending on my other priorities).

Tyriar · 2018-06-20T08:16:34Z

src/renderer/Renderer.ts

+    };
+
+    this._renderLayers.forEach(l => {
+      if (l.registerCharacterJoiner) {


Moving registerCharacterJoiner into BaseRenderLayer with a default no-op implementation means you don't need this if plus you'll get strong typing.

As a result of moving things into a CharacterJoinerRegistry, there is no longer a need to register the joiners directly with the layers. The registry is just passed to the constructor of the TextRenderLayer now.

Tyriar · 2018-06-20T08:17:58Z

src/renderer/Renderer.ts

+  public registerCharacterJoiner(handler: CharacterJoinerHandler): number {
+    const joiner: ICharacterJoiner = {
+      id: this._nextJoinerId++,
+      handler: handler


This can simply be handler

Tyriar · 2018-06-20T08:24:33Z

src/renderer/TextRenderLayer.ts

@@ -52,6 +54,7 @@ export class TextRenderLayer extends BaseRenderLayer {
    terminal: ITerminal,
    firstRow: number,
    lastRow: number,
+    foreground: boolean,


How about if joiners get passed in instead of marking as foreground? That way it can be passed in when foreground is drawn and null for background?

I like that. Feels cleaner.

Tyriar · 2018-06-20T08:30:59Z

src/renderer/TextRenderLayer.ts

@@ -190,7 +242,7 @@ export class TextRenderLayer extends BaseRenderLayer {
        } else {
          this._ctx.fillStyle = this._colors.foreground.css;
        }
-        this.fillBottomLineAtCells(x, y);
+        this.fillBottomLineAtCells(x, y, width);
        this._ctx.restore();
      }
      this.drawChar(


I think code can be infinite here if characters are being joined? I'm not totally sure how that will affect drawing/caching

Yeah the trouble with the joined characters is that there isn't one code that defines the whole sequence. I did a pass through that logic and the code didn't seem to have any bearing on uncached drawing. Since the code seems to be primarily used for determining caching, I made it infinity to avoid clashing with any existing characters and to steer clear of the range that is actually cached at the moment. If the dynamic character atlas eventually expands to try to cache all character codes, we could definitely end up with a problem down the line.

I think the most reasonable alternative is to pass -1 for the code, as no valid character will ever have that code. We can then use that for cache control and essentially ignore any negative codes. I didn't use -1 initially because I saw at least one or two checks that didn't handle negative character codes and wanted to keep my changes as localized as possible at least initially.
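The trade-off between the two sentinel values can be seen in a tiny sketch. `isCacheableCode` is a hypothetical stand-in for the kind of `code < 256` comparison mentioned in this thread, not the actual check in the renderer.

```typescript
// Hypothetical stand-in for a `code < 256` style check used to decide on caching.
function isCacheableCode(code: number): boolean {
  return code < 256;
}

// Infinity never passes a "less than" comparison, so joined text skips the cache
// without any other code needing to change.
const infinityIsCached = isCacheableCode(Infinity); // false

// -1 passes the same comparison, so every such check would need an extra
// `code >= 0` guard before -1 could be used as the sentinel.
const negativeIsCached = isCacheableCode(-1); // true
```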

Caching of ligatures is definitely possible but would require something more expressive than a single number to identify the characters drawn. I think we can punt on that part of it for the time being.

Tyriar · 2018-06-20T08:38:28Z

src/renderer/TextRenderLayer.ts

+      currentIndex++;
+    }
+
+    // Process any trailing ranges


Can you give a high level explanation of how this ranges/sub-ranges thing works? It's not totally clear yet.

I guess I'm confused about the use of range vs subrange

They mean the same thing relative to the text for the line being processed, though the naming I chose is definitely confusing as I read it back.

A range/subrange is a consecutive sequence of characters in the input text represented as a start and end index. In the context of this method (this._getJoinedCharacters), the range that I'm talking about with "trailing ranges" and so on is the start and end index of a sequence of characters within the input text that have the same foreground color and attributes (background is specifically excluded because it doesn't affect joining characters).

The ranges returned by this._getSubRanges() are zero or more start/end index pairs contained within the range mentioned above that represent the locations of the ligatures that were found.

For example, let's say I'm processing a line "a -> b -> c -> d", where "->" forms a ligature in the font and "a -> b -> " is a different color than "c -> d".

The initial ranges identified as same-styled (via rangeStartIndex and currentIndex) would be 0-10 and 10-16 (corresponding with "a -> b -> " and "c -> d")

We call this._getSubRanges("a -> b -> c -> d", 0, 10), and find ligatures at subranges 2-4 and 7-9, returned as [[2, 4], [7, 9]]

We then call this._getSubRanges("a -> b -> c -> d", 10, 16), and find one ligature at the subrange 12-14, returned as [[12, 14]]

The 'subranges' returned are all combined into the final array of ranges of characters that we should join [[2, 4], [7, 9], [12, 14]], which is returned by this._getJoinedCharacters()

The principal reason for breaking the line into chunks when passing to the joiner is to cache more effectively, as the subsequences are more likely to be present for multiple renders than a full line. It also has the side benefit that at the end of this._getJoinedCharacters() we know any ligatures we find are valid because the ranges passed to the joiner have the same style.
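The walkthrough above can be condensed into a short sketch. This is an approximation written for this explanation, not the code in the PR: `getJoinedRanges` and the per-character `styles` array are hypothetical, and the sketch passes each same-styled chunk to the joiner as a substring rather than passing the full text with start and end indexes.

```typescript
type JoinedRange = [number, number];
type Joiner = (text: string) => JoinedRange[];

// Hypothetical input: the line text plus one style key per character
// (standing in for foreground color and flags; background is ignored).
function getJoinedRanges(line: string, styles: string[], joiner: Joiner): JoinedRange[] {
  const result: JoinedRange[] = [];
  let rangeStart = 0;
  for (let i = 1; i <= line.length; i++) {
    // Close the current same-styled range at a style change or at the end of the line.
    if (i === line.length || styles[i] !== styles[rangeStart]) {
      const chunk = line.substring(rangeStart, i);
      for (const range of joiner(chunk)) {
        // Joiner indexes are relative to the chunk, so shift them back to line indexes.
        result.push([range[0] + rangeStart, range[1] + rangeStart]);
      }
      rangeStart = i;
    }
  }
  return result;
}

// Example joiner for "->", as in the walkthrough.
const joinArrows: Joiner = (text) => {
  const ranges: JoinedRange[] = [];
  for (let i = text.indexOf('->'); i !== -1; i = text.indexOf('->', i + 2)) {
    ranges.push([i, i + 2]);
  }
  return ranges;
};

// "a -> b -> " is styled "r" (10 characters) and "c -> d" is styled "b" (6 characters).
const lineStyles = 'rrrrrrrrrrbbbbbb'.split('');
const joined = getJoinedRanges('a -> b -> c -> d', lineStyles, joinArrows);
// joined is [[2, 4], [7, 9], [12, 14]]
```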

Hopefully that clears things up (albeit in a long-winded way)

The subranges concept was confusing me as I refactored so I renamed it to joinedRanges, which better describes the concept

princjef · 2018-06-27T04:56:49Z

@Tyriar let me know what you think about the character code stuff and if you think any changes should be made to the range/subrange parts. Once everything has been resolved I'll rebase and push a new version.

Tyriar · 2018-07-03T18:52:20Z

I rebased it myself as some pretty significant changes happened (the major one being that src/Terminal.ts now routes everything through src/public/Terminal.ts, because of #1507). I also added empty implementations to DomRenderer, which needs to implement IRenderer.

Tyriar

A few more comments, it looks like this should be relatively safe to merge in and iterate on once these comments are resolved since joinedRanges.length is checked for the majority of new code paths.

Tyriar · 2018-07-03T18:59:50Z

typings/xterm.d.ts

+     * (exclusive) indexes of ranges that should be rendered as a single unit.
+     * @return The ID of the new joiner, this can be used to deregister
+     */
+    registerCharacterJoiner(handler: (text: string) => [number, number][]): number;


I think the API looks good overall and just have some suggestions on the documentation.

It should mention that performance is extremely important as this will be run every single time a line is rendered as I understand it.

It should mention how multiple registered character joiners interact, graphemes will eventually be built in using this same API in a similar way to how web links are supported out of the box so this is an important detail.
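One possible interaction policy is sketched below: the ranges from all joiners are collected and any that overlap are merged into a single larger range. This policy is an assumption for illustration, since the thread does not spell out the rule at this point, and `joinAll` is a hypothetical name.

```typescript
type JoinedRange = [number, number];
type Joiner = (text: string) => JoinedRange[];

function joinAll(text: string, joiners: Joiner[]): JoinedRange[] {
  // Collect a copy of every range from every joiner, then sort by start index.
  const all: JoinedRange[] = [];
  for (const joiner of joiners) {
    for (const range of joiner(text)) {
      all.push([range[0], range[1]]);
    }
  }
  all.sort((a, b) => a[0] - b[0] || a[1] - b[1]);

  const merged: JoinedRange[] = [];
  for (const range of all) {
    const last = merged[merged.length - 1];
    if (last && range[0] < last[1]) {
      // Overlap: grow the previous range instead of adding a new one.
      last[1] = Math.max(last[1], range[1]);
    } else {
      merged.push(range);
    }
  }
  return merged;
}
```

Because this work would run for every rendered line, a real implementation would likely want to avoid the extra allocations and sorting where it can.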

Tyriar · 2018-07-03T20:35:45Z

src/renderer/TextRenderLayer.ts

@@ -79,6 +84,41 @@ export class TextRenderLayer extends BaseRenderLayer {
          continue;
        }

+        // Just in case we ended up in the middle of a range, lop off any


Would this happen only when the character joiner returned invalid results? Could we move this validation into _getJoinedCharacters if it's needed?

The only way to end up in here is if a joined range somehow starts in the middle of a sequence of characters that is handled as a group (basically if it starts with a zero-width character). This seems exceedingly unlikely to happen but I included it just in case. If we're not worried about that case I can just remove it from here.

The alternative would be to pass the understanding of widths and overlaps into the underlying character joiner logic, but I'm pretty sure I'd have to do another pass over the full sequence of characters to compute it there in what is already a pretty hot code path.

Decided to translate the ranges to cell ranges, which guarantees validity and removes a bunch of logic here, so this is no longer needed. Removed.

Tyriar · 2018-07-03T20:41:32Z

src/renderer/TextRenderLayer.ts

        const code: number = <number>charData[CHAR_DATA_CODE_INDEX];
-        const char: string = charData[CHAR_DATA_CHAR_INDEX];
+        let char: string = charData[CHAR_DATA_CHAR_INDEX];


Let's rename this to chars and make a note that it can include a single character for a single cell, or multiple cells worth of characters when a character joiner says to

Tyriar · 2018-07-03T20:42:31Z

src/renderer/TextRenderLayer.ts

+        callback(
+          char.length === 1 ? code : Infinity,
+          char,
+          char.length + width - 1,


I think char.length will break for emojis as they often have more than 1 character for a single emoji.

It could also lead to weird behavior if there are multiple double width characters throughout the joined range

I now have the joiner registry convert the string ranges back to cell ranges so that things like double width characters and emojis are accounted for. I've added tests for each and it seems to be working as expected.

I'm admittedly still a bit confused by the way emojis work. It appears that you end up with a >1 length string for the emoji itself (with width 1) that is always followed by a space with width 1, since the emoji character is treated as single width. Is this accurate?

I tested emojis/fullwidth characters with my actual xterm-ligature-support code and things seem to be fine for fullwidth characters in all cases I tested. There are no ligatures in any fonts I'm aware of that contain such characters so I just tested ligatures before/after/in between. Emojis always seem to render fine, but if a ligature comes after the emoji, it is not rendered properly. This appears to be because the parsing logic of opentype.js treats the emoji + space as a single character rather than using javascript string lengths. It's a fringe case with low impact, so I'll look into it later.

I'm admittedly still a bit confused by the way emojis work. It appears that you end up with a >1 length string for the emoji itself (with width 1) that is always followed by a space with width 1, since the emoji character is treated as single width. Is this accurate?

Yes this is correct, emojis can be multiple "characters" (char.length > 1). We also render them as single width (width===1), they will probably be double width at some point (width===2) but there's another issue for that, so whenever emoji stuff comes up I want to make sure we're not making a mistake assuming that width is always 1.
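The string-to-cell conversion discussed in this thread could be sketched as follows. The `Cell` shape and `stringRangeToCellRange` are hypothetical and simplified: the sketch assumes that the line text is the concatenation of each cell's characters, that a width-2 cell is followed by exactly one empty zero-width cell, and that an emoji is a single width-1 cell holding a string of length 2.

```typescript
interface Cell {
  chars: string; // may be empty (zero-width cell) or longer than 1 (emoji)
  width: number; // 0, 1 or 2 terminal columns
}
type CellRange = [number, number];

// Converts a [start, end) range of string indexes into a [start, end) range of cells.
function stringRangeToCellRange(cells: Cell[], range: [number, number]): CellRange | null {
  const start = range[0];
  const end = range[1];
  let offset = 0; // string index at which the current cell's characters begin
  let cellStart = -1;
  let cellEnd = -1;
  cells.forEach((cell, i) => {
    const next = offset + cell.chars.length;
    // A non-empty cell belongs to the range if its characters overlap [start, end).
    if (cell.chars.length > 0 && next > start && offset < end) {
      if (cellStart === -1) {
        cellStart = i;
      }
      // Adding the width makes a double-width cell also cover its trailing empty cell.
      cellEnd = i + cell.width;
    }
    offset = next;
  });
  return cellStart === -1 ? null : [cellStart, cellEnd];
}
```

Working in cell indexes means the renderer does not have to reason about string lengths at draw time, which is where `char.length` went wrong for emoji and double-width characters.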

Tyriar · 2018-07-03T20:45:23Z

src/renderer/TextRenderLayer.ts

@@ -260,6 +325,78 @@ export class TextRenderLayer extends BaseRenderLayer {
    return overlaps;
  }

+  private _getJoinedCharacters(terminal: ITerminal, row: number): [number, number][] {


I think we should keep some CharacterJoinerRegistry or something which encapsulates as much complexity around character joining as possible. I'm concerned about all this extra code being added to TextRenderLayer as it's already quite long/complex. We could also move MergeRanges.ts into this file.

Great suggestion. I've made the change which allows this to get cleaned up a lot and makes the joining code much more testable. It's also led to a couple of other structural changes as a result of the fact that the joiners are tracked by the registry rather than the individual render layers.
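As a rough picture of what such a registry tracks, here is a minimal sketch. `SketchJoinerRegistry` is a hypothetical class written for this explanation; only the incrementing ID and the `{ id, handler }` shape come from the diff quoted earlier in the review, and the boolean return from deregistration is an assumption.

```typescript
type JoinedRange = [number, number];
type CharacterJoinerHandler = (text: string) => JoinedRange[];

interface ICharacterJoiner {
  id: number;
  handler: CharacterJoinerHandler;
}

class SketchJoinerRegistry {
  private _joiners: ICharacterJoiner[] = [];
  private _nextJoinerId = 0;

  // Stores the handler and returns an ID that can be used to deregister it later.
  public registerCharacterJoiner(handler: CharacterJoinerHandler): number {
    const joiner: ICharacterJoiner = { id: this._nextJoinerId++, handler };
    this._joiners.push(joiner);
    return joiner.id;
  }

  // Removes the joiner with the given ID; returns whether anything was removed.
  public deregisterCharacterJoiner(joinerId: number): boolean {
    const before = this._joiners.length;
    this._joiners = this._joiners.filter(j => j.id !== joinerId);
    return this._joiners.length < before;
  }

  public joinerCount(): number {
    return this._joiners.length;
  }
}
```

Keeping the joiners in one object like this is what lets a single registry be handed to the text layer instead of registering each joiner with every render layer.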

Tyriar · 2018-07-03T20:49:36Z

src/renderer/TextRenderLayer.ts

@@ -116,7 +156,19 @@ export class TextRenderLayer extends BaseRenderLayer {
          }
        }

-        callback(code, char, width, x, y, fg, bg, flags);
+        callback(
+          char.length === 1 ? code : Infinity,


This will change the code that emojis pass in when no character joiners are registered.

I now set the code to Infinity directly in the logic where I detect a range, so this problem goes away.

While writing tests I discovered that zero-width cells are given a code of null, so that is another potential option for joined ranges

princjef · 2018-07-08T18:46:32Z

Made all of the recommended changes and also cleaned up the logic to convert the string ranges returned by the joiners to cell ranges so that there is proper support for fullwidth characters and such. This should resolve most of the residual edge cases around character widths and string lengths.

Tyriar

Looks great, I say let's merge after 3.5 goes out.

Tyriar · 2018-07-11T00:01:56Z

🎉 Thanks @princjef! Let me know how progress on the addon goes and if you have feedback on the API.

princjef · 2018-07-11T02:07:02Z

Sounds good. The addon is all written and tested with the latest code from this PR. I'll clean the repo files up and put together a PR over there so you can take a look.

princjef mentioned this pull request May 19, 2018

Support font ligatures #958

Open

Tyriar assigned bgw and Tyriar May 20, 2018

Tyriar mentioned this pull request May 20, 2018

Buffer performance improvements #791

Closed

add ability to register/deregister character joiner

eabb25c

princjef force-pushed the character-joiner branch from 27d099d to eabb25c Compare June 2, 2018 17:33

kieferrm mentioned this pull request Jun 11, 2018

Iteration Plan for June 2018 microsoft/vscode#51483

Closed

52 tasks

Tyriar reviewed Jun 20, 2018

Tyriar mentioned this pull request Jun 27, 2018

Why does clicking some files open on a comment and some not? microsoft/vscode-pull-request-github#49

Closed

Tyriar closed this Jun 27, 2018

Tyriar reopened this Jun 27, 2018

This was referenced Jun 27, 2018

Code snippets are not colored when they are marked up with language microsoft/vscode-pull-request-github#54

Closed

Only the first code snippet in a review is shown microsoft/vscode-pull-request-github#55

Closed

Tyriar and others added 2 commits July 3, 2018 11:50

Merge remote-tracking branch 'origin/master' into pr/princjef/1460

6d9d2a3

Merge branch 'master' into character-joiner

5e3ebbd

Tyriar requested changes Jul 3, 2018

Factor logic into registry, make matches cell-based and document

63b3a6f

princjef force-pushed the character-joiner branch from 0adf426 to 63b3a6f Compare July 8, 2018 18:31

Merge branch 'master' into character-joiner

1bad71f

Tyriar added this to the 3.6.0 milestone Jul 9, 2018

Tyriar approved these changes Jul 9, 2018

Inline private member to ctor

d4b507d

Tyriar merged commit a52aab4 into xtermjs:master Jul 11, 2018

princjef deleted the character-joiner branch July 11, 2018 02:07

princjef mentioned this pull request Jul 11, 2018

Initial implementation xtermjs/xterm-addon-ligatures#1

Merged
