Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional character joiner #1460

Merged
merged 6 commits into from
Jul 11, 2018
Merged

Conversation

princjef
Copy link
Contributor

@princjef princjef commented May 19, 2018

The xterm.js renderer currently renders all text cell by cell, which prevents font ligatures from being rendered in fonts like Fira Code. This PR addresses part of #958 by modifying the interface of the xterm.js renderer to allow a "character joiner" to be present during the rendering of foreground text.

When present, the joiner is called with each input text sequence that has styles which can be joined (i.e. has the same foreground color and flags). The function returns an array of tuples, where each tuple holds a start and end index of a subrange that should be rendered together. Here's an example for a function which joins -> sequences:

joiner('a -> b -> c') // [[2, 4], [7, 9]]

Any ranges returned by the joiner will be rendered together by the renderer. While this is not terribly useful all by itself, it enables font ligature support when paired with a joiner that understands ligatures for the current font (which will be provided as part of a separate addon).

The character joiner can be added and removed by calling registerCharacterJoiner() and deregisterCharacterJoiner() on the terminal instance, respectively.

Some questions/things to consider:

  • The current interface requires that registerCharacterJoiner() be called after terminal.open() so that the renderer instance is present. No error is thrown currently if the renderer is not present. Is there an alternative approach that is preferred for this kind of setup?
  • Only one character joiner is allowed to be registered at a time. Subsequent calls to registerCharacterJoiner() will override the previously registered joiner. While additional joiners could theoretically be supported by combining any joined ranges from the various joiners, it seemed like more complexity than it was worth, especially since I can't see a scenario where someone would need such a setup.
  • I'm not 100% confident that double-width characters are handled properly, so I would especially appreciate any suggestions/pointers regarding that logic.
  • Is there a recommended way to test the logic in the renderer? I have manually verified the code with a variety of inputs, but I can't find tests of the existing logic that I could build on for this change.
  • I'm open to name suggestions to replace "character joiner." I just wanted to make the terminology generic enough so that people don't think of it as being just for font ligatures.
  • When a range of characters is combined, I am currently passing Infinity as the code to the character drawing function to avoid caching. I also considered passing -1 to avoid any future interpretation as an actual character, but I see several places where comparisons like code < 256 are made, so it seems like the rest of that codebase makes an assumption that the code is always 0 or greater. I'm open to changing it and updating the necessary places if people want me to

@jerch
Copy link
Member

jerch commented May 20, 2018

@princjef Nice, this would also be a good entry point for grapheme support in the future, maybe as an addon. Can you tell if the ligature detection will be able to spot most grapheme clusters too (since most end up being rendered differently) or if we have to build those blocks beforehand?

Some notes on grapheme support:
Basic idea is to check whether two consequtive codepoints can be split without altering the perceived character meaning (algorithm is here: http://unicode.org/reports/tr29/). Those non breaking characters can be stacking, in some ancient letter systems they can be >5 chars long. To decide whether they can be split or should not be split only the previous and the following char is needed.
Nowadays affected by this are some Indian systems and I think Thai and Korean (not sure about the last two). Also not sure, if modern Chinese and Japanese are free of those constructions. Note that umlauts in German and French also can be build from several chars as grapheme clusters, but those are already handled in wcwidth atm.

@princjef
Copy link
Contributor Author

@jerch I'm not very familiar with grapheme clusters but from what i can tell the principles are similar in certain cases. If the graphemes are rendered as ligatures in the font it's likely that they would use very similar logic to programming ligatures and could potentially be covered with the same logic/addon. However, the link you sent also mentions this:

Display of Grapheme Clusters. Grapheme clusters are not the same as ligatures. For example, the grapheme cluster “ch” in Slovak is not normally a ligature and, conversely, the ligature “fi” is not a grapheme cluster. Default grapheme clusters do not necessarily reflect text display. For example, the sequence <f, i> may be displayed as a single glyph on the screen, but would still be two grapheme clusters.

So in that regard, I think that this feature will work for the purposes of rendering grapheme clusters that have special rendering as a unit within the current font. However, the implementation in this PR does not treat such constructs as single units for purposes of selection or deletion as described in the link you provided. In the case of programming ligatures (or font ligatures of other types), the character joining is purely a visual treatment, so such behavior doesn't make sense.

Perhaps there would need to be a companion function to handle the selection/deletion properties of the grapheme clusters, since it sounds like that behavior doesn't necessarily map 1-1 to a rendered ligature

@Tyriar
Copy link
Member

Tyriar commented May 20, 2018

@jerch I was thinking we would leverage the eventual ligature support for these languages as well. Could we do that based on unicode ranges? For example all adjacent characters from U+0E00-U+0E7F (Thai) are drawn consecutively. If this is the approach we take maybe we should allow multiple character joiners and have a built in one to handle i18n? This would also mirror how link handlers work.

However, the implementation in this PR does not treat such constructs as single units for purposes of selection or deletion

I think selection will probably be fine still, it might look a little off (for only RTL languages?) but we can handle that later. We don't have to worry about deletion as it's handled by the shell.


Here's the closest issue tracking that atm #701, we would probably want to support grapheme clusters for LTR languages first.

@Tyriar Tyriar assigned bgw and Tyriar May 20, 2018
@Tyriar
Copy link
Member

Tyriar commented May 20, 2018

@princjef I'm in the process of adding a way to swap the renderer out #1432, these character joiners only apply to the canvas renderer which makes the API a little confusing. Perhaps calling it out that it only applies to the canvas renderer in the jsdoc is enough?

@bgw any thoughts on how it integrates with TextRenderLayer?

@jerch
Copy link
Member

jerch commented May 20, 2018

I was thinking we would leverage the eventual ligature support for these languages as well. Could we do that based on unicode ranges? For example all adjacent characters from U+0E00-U+0E7F (Thai) are drawn consecutively. If this is the approach we take maybe we should allow multiple character joiners and have a built in one to handle i18n? This would also mirror how link handlers work.

Yes multiple character joiners should do the trick. Grapheme support should also cover all mandatory ligatures (even the emoji color joiners are defined as graphemes), but not the "nice to have" alternatives. Not sure yet if we need the full algo or can get away with some cheaper range assumptions. The algo is kinda expensive since every char needs to be evaluate twice (as next and as previous) with much more workload on every eval step than the old wcwidth. 😞

Edit:
Ah about the character joiner - would be good to get rid of joiners that join stuff that got split beforehand (kinda #791 related).

@princjef
Copy link
Contributor Author

If this is the approach we take maybe we should allow multiple character joiners and have a built in one to handle i18n? This would also mirror how link handlers work.

@Tyriar I think that makes sense. I almost implemented it in the first pass just so it aligned with the link matchers. Having a second use case for the joiner seems like sufficient reason to add the extra logic.

these character joiners only apply to the canvas renderer which makes the API a little confusing. Perhaps calling it out that it only applies to the canvas renderer in the jsdoc is enough?

Because of how different the canvas and DOM are from a technology perspective, I would expect that more features for only one or the other will pop up over time. Maybe we can segment these kinds of capabilities in a way that more strongly indicates which renderer they're used for. Perhaps namespacing it with something like this.canvasRenderer.registerCharacterJoiner() (though that raises the question of what that canvasRenderer object contains and whether it exists if you're not using the canvas renderer). Another option is to put it in the name: this.registerCanvasCharacterJoiner()

Ah about the character joiner - would be good to get rid of joiners that join stuff that got split beforehand

@jerch can you elaborate on what you mean by this?

@Tyriar
Copy link
Member

Tyriar commented May 21, 2018

Because of how different the canvas and DOM are from a technology perspective, I would expect that more features for only one or the other will pop up over time. Maybe we can segment these kinds of capabilities in a way that more strongly indicates which renderer they're used for.

Well the DOM renderer is intentionally barebones. I'm thinking we just add a disclaimer in the jsdoc of rendererType (added in my PR) that ligatures are not supported in the DOM renderer.

@jerch
Copy link
Member

jerch commented May 21, 2018

@jerch can you elaborate on what you mean by this?

This is just a little rant into my own direction for this reason: at an early stage the terminal splits the chars into the cell model - just to join them later back together. Can be optimized if done right at the early stage. Its partly my fault that the early stage does that.

@princjef princjef force-pushed the character-joiner branch from 27d099d to eabb25c Compare June 2, 2018 17:33
@princjef
Copy link
Contributor Author

princjef commented Jun 2, 2018

Updated with support for mulitple character joiners

@Tyriar
Copy link
Member

Tyriar commented Jun 9, 2018

@princjef sorry for the delay in looking at this as I've been a bit busy. I just wanted a status update on this, would you say this PR is good to go from your perspective and just needs to react to feedback? Trying to budget time for it 😄

@princjef
Copy link
Contributor Author

princjef commented Jun 12, 2018

@Tyriar no worries I know how it goes 😄

I'm happy with it in its current state. The main place that I'd love some extra attention for feedback is the treatment of double width/zero width/etc. characters when determining the width of the replacement. I'm not familiar with all of the ins and outs there so I'm sure there are some bugs.

Let me know if it would be helpful to see the usage in the xterm-ligature-support plugin. I've written it all up and tested it (sans some polish) but wanted to hold off throwing it in the repo until I had a valid version of xterm.js to point at.

@Tyriar
Copy link
Member

Tyriar commented Jun 12, 2018

@princjef great, I'll try get some time to look at this over the next couple of weeks (depending on my other priorities).

};

this._renderLayers.forEach(l => {
if (l.registerCharacterJoiner) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving registerCharacterJoiner into BaseRenderLayer with a default no-op implementation means you don't need this if plus you'll get strong typing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a result of moving things into a CharacterJoinerRegistry, there is no longer a need to register the joiners directly with the layers. The registry is just passed to the constructor of the TextRenderLayer now

public registerCharacterJoiner(handler: CharacterJoinerHandler): number {
const joiner: ICharacterJoiner = {
id: this._nextJoinerId++,
handler: handler
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can simply be handler

@@ -52,6 +54,7 @@ export class TextRenderLayer extends BaseRenderLayer {
terminal: ITerminal,
firstRow: number,
lastRow: number,
foreground: boolean,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about if joiners get passed in instead of marking as foreground? That way it can be passed in when foreground is drawn and null for background?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like that. feels cleaner

@@ -190,7 +242,7 @@ export class TextRenderLayer extends BaseRenderLayer {
} else {
this._ctx.fillStyle = this._colors.foreground.css;
}
this.fillBottomLineAtCells(x, y);
this.fillBottomLineAtCells(x, y, width);
this._ctx.restore();
}
this.drawChar(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think code can be infinite here if characters are being joined? I'm not totally sure how that will affect drawing/caching

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the trouble with the joined characters is that there isn't one code that defines the whole sequence. I did a pass through that logic and the code didn't seem to have any bearing on uncached drawing. Since the code seems to be primarily used for determining caching, I made it infinity to avoid clashing with any existing characters and to steer clear of the range that is actually cached at the moment. If the dynamic character atlas eventually expands to try to cache all character codes, we could definitely end up with a problem down the line.

I think the most reasonable alternative is to pass -1 for the code, as no valid character will ever have that code. We can then use that for cache control and essentially ignore any negative codes. I didn't use -1 initially because I saw at least one or two checks that didn't handle negative character codes and wanted to keep my changes as localized as possible at least initially.

Caching of ligatures is definitely possible but would require something more expressive than a single number to identify the characters drawn. I think we can punt on that part of it for the time being.

currentIndex++;
}

// Process any trailing ranges
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give a high level explanation of how this ranges/sub-ranges thing works? It's not totally clear yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm confused about the use of range vs subrange

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They mean the same thing relative to the text for the line being processed, though the naming I chose is definitely confusing as I read it back.

A range/subrange is a consecutive sequence of characters in the input text represented as a start and end index. In the context of this method (this._getJoinedCharacters), the range that I'm talking about with "trailing ranges" and so on is the start and end index of a sequence of characters within the input text that have the same foreground color and attributes (background is specifically excluded because it doesn't affect joining characters).

The ranges returned by this._getSubRanges() are zero or more start/end index pairs contained within the range mentioned above that represent the locations of the ligatures that were found.

For example, lets say I'm processing a line "a -> b -> c -> d", where "->" forms a ligature in the font and "a -> b -> " is a different color than "c -> d".

  • The initial ranges identified as same-styled (via rangeStartIndex and currentIndex) would be 0-10 and 10-16 (corresponding with "a -> b -> " and "c -> d")
  • We call this._getSubRanges("a -> b -> c -> d", 0, 10), and find ligatures at subranges 2-4 and 7-9, returned as [[2, 4], [7, 9]]
  • We then call this._getSubRanges("a -> b -> c -> d", 10, 16), and find one ligature at the subrange 12-14, returned as [[12, 14]]
  • The 'subranges' returned are all combined into the final array of ranges of characters that we should join [[2, 4], [7, 9], [12, 14]], which is returned by this._getJoinedCharacters()

The principal reason for breaking the line into chunks when passing to the joiner is to cache more effectively, as the subsequences are more likely to be present for multiple renders than a full line. It also has the side benefit that at the end of this._getJoinedCharacters() we know any ligatures we find are valid because the ranges passed to the joiner have the same style.

Hopefully that clears things up (albeit in a long-winded way)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subranges concept was confusing me as I refactored so I renamed it to joinedRanges, which better describes the concept

@princjef
Copy link
Contributor Author

@Tyriar let me know what you think about the character code stuff and if you think any changes should be made to the range/subrange parts. once everything has been resolved I'll rebase and push a new version

@Tyriar
Copy link
Member

Tyriar commented Jul 3, 2018

I rebased it myself as some pretty significant changes happened (there is src/Terminal.ts now routes everything through src/public/Terminal.ts being the major one (because of #1507). I also added empty implementations to DomRenderer which needs to implement IRenderer

Copy link
Member

@Tyriar Tyriar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments, it looks like this should be relatively safe to merge in and iterate on once these comments are resolved since joinedRanges.length is checked for the majority of new code paths.

* (exclusive) indexes of ranges that should be rendered as a single unit.
* @return The ID of the new joiner, this can be used to deregister
*/
registerCharacterJoiner(handler: (text: string) => [number, number][]): number;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the API looks good overall and just have some suggestions on the documenation.

  • It should mention that performance is extremely important as this will be run every single time a line is rendered as I understand it.
  • It should mention how multiple registered character joiners interact, graphemes will eventually be built in using this same API in a similar way to how web links are supported out of the box so this is an important detail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@@ -79,6 +84,41 @@ export class TextRenderLayer extends BaseRenderLayer {
continue;
}

// Just in case we ended up in the middle of a range, lop off any
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this happen only when the character joiner returned invalid results? Could we move this validation into _getJoinedCharacters if it's needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way to end up in here is if a joined range somehow starts in the middle of a sequence of characters that is handled as a group (basically if it starts with a zero-width character). This seems exceedingly unlikely to happen but I included it just in case. If we're not worried about that case I can just remove it from here.

The alternative would be to pass the understanding of widths and overlaps into the underlying character joiner logic, but I'm pretty sure I'd have to do another pass over the full sequence of characters to compute it there in what is already a pretty hot code path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to translate the ranges to cell ranges, which guarantees validity and remove a bunch of logic here so this is no longer needed. removed

const code: number = <number>charData[CHAR_DATA_CODE_INDEX];
const char: string = charData[CHAR_DATA_CHAR_INDEX];
let char: string = charData[CHAR_DATA_CHAR_INDEX];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this to chars and make a note that it can include a single character for a single cell, or multiple cells worth of characters when a character joiner says to

callback(
char.length === 1 ? code : Infinity,
char,
char.length + width - 1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think char.length will break for emojis as they often have more than 1 character for a single emoji.

It could also lead to weird behavior if there are multiple double width characters throughout the joined range

Copy link
Contributor Author

@princjef princjef Jul 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now have the joiner registry convert the string ranges back to cell ranges so that things like double width characters and emojis are accounted for. I've added tests for each and it seems to be working as expected.

I'm admittedly still a bit confused by the way emojis work. It appears that you end up with a >1 length string for the emoji itself (with width 1) that is always followed by a space with width 1, since the emoji character is treated as single width. Is this accurate?

I tested emojis/fullwidth characters with my actual xterm-ligature-support code and things seem to be fine for fullwidth characters in all cases I tested. There are no ligatures in any fonts I'm aware of that contain such characters so I just tested ligatures before/after/in between. Emojis always seem to render fine, but if a ligature comes after the emoji, it is not rendered properly. This appears to be because the parsing logic of opentype.js treats the emoji + space as a single character rather than using javascript string lengths. It's a fringe case with low impact, so I'll look into it later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm admittedly still a bit confused by the way emojis work. It appears that you end up with a >1 length string for the emoji itself (with width 1) that is always followed by a space with width 1, since the emoji character is treated as single width. Is this accurate?

Yes this is correct, emojis can be multiple "characters" (char.length > 1). We also render them as single width (width===1), they will probably be double width at some point (width===2) but there's another issue for that, so whenever emoji stuff comes up I want to make sure we're not making a mistake assuming that width is always 1.

@@ -260,6 +325,78 @@ export class TextRenderLayer extends BaseRenderLayer {
return overlaps;
}

private _getJoinedCharacters(terminal: ITerminal, row: number): [number, number][] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep some CharacterJoinerRegistry or something which encapsulates as much complexity around character joining as possible. I'm concerned about all this extra code being added to TextRenderLayer as it's already quite long/complex. We could also move MergeRanges.ts into this file.

Copy link
Contributor Author

@princjef princjef Jul 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion. I've made the change which allows this to get cleaned up a lot and makes the joining code much more testable. It's also led to a couple of other structural changes as a result of the fact that the joiners are tracked by the registry rather than the individual render layers.

@@ -116,7 +156,19 @@ export class TextRenderLayer extends BaseRenderLayer {
}
}

callback(code, char, width, x, y, fg, bg, flags);
callback(
char.length === 1 ? code : Infinity,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will change the code that emojis pass in when no character joiners are registered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now set the code to infinity directly in the logic where i detect a range so this problem goes away.

While writing tests I discovered that zero-width cells are given a code of null, so that is another potential option for joined ranges

@princjef princjef force-pushed the character-joiner branch from 0adf426 to 63b3a6f Compare July 8, 2018 18:31
@princjef
Copy link
Contributor Author

princjef commented Jul 8, 2018

Made all of the recommended changes and also cleaned up the logic to convert the string ranges returned by the joiners to cell ranges so that there is proper support for fullwidth characters and such. This should resolve most of the residual edge cases around character widths and string lengths.

@Tyriar Tyriar added this to the 3.6.0 milestone Jul 9, 2018
Copy link
Member

@Tyriar Tyriar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, I say let's merge after 3.5 goes out.

@Tyriar Tyriar merged commit a52aab4 into xtermjs:master Jul 11, 2018
@Tyriar
Copy link
Member

Tyriar commented Jul 11, 2018

🎉 Thanks @princjef! Let me know how progress on the addon goes and if you have feedback on the API.

@princjef
Copy link
Contributor Author

Sounds good. The addon is all written and tested with the latest code from this PR. I'll clean the repo files up and put together a PR over there so you can take a look.

@princjef princjef deleted the character-joiner branch July 11, 2018 02:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants