VT Unicode Core Specification #1178

christianparpart · 2021-09-01T20:15:19Z

Maintainer

EDIT: Title changed to reflect the discussion we ended up with. Also see https://github.com/contour-terminal/terminal-unicode-core for the outcome.

Currently all (or almost all?) terminals that handle VS15/VS16 do not change the width to narrow/wide, but leave it at the base value of the preceding Unicode codepoint.

I think that is wrong. I.e. if the visual is changing, then the cursor should be moved accordingly.

But to keep that backwards compatible with all other applications / TEs, I think a soft migration should take place, and I think that could best be done via DECRQM / DECSM / DECRM and a free DEC mode number to query/set/unset proper Unicode width handling (including VS15/VS16).

I wonder what @jerch (or others) think about this idea? :-)

jerch · 2021-09-01T20:49:31Z

My idea on those things? Broken for our terminal world, because a "maybe render it like this" is never gonna work for appside (or a multiplexer). It breaks the promised runwidth idea of wcwidth, thus appside cannot reliably "paint" its screen anymore.

How to fix? The only way I see here is to extend the wcwidth promise by further speccing out how these shall behave. Since appside will not know on its own whether the terminal can render the shiny big glyph or just the ugly text representation, this render caps dependency needs to be addressed by either:

1. extending the terminal interface, so a terminal can announce / appside can ask how a certain (compound) character will be rendered, or
2. making fixed promises, like: switching the output representation will never change the runwidth (that's what you think is wrong)

For 1. I see the problem that it would involve a lot of back-and-forth communication between terminal and app, which is always cumbersome. Thus I tend more towards 2., but that's more restrictive and will lead to poor output in edge cases.

Btw all new shiny compound features of unicode have that problem, they simply left the width issue to the renderer to decide. It totally screws up the separated render idea of cmdline apps, where the terminal works as the screen.

Closing the rant above: In the long term imho the only solution is to give up the wcwidth promise and let things flow more freely. But that's hard to swallow, at least for canvas-like curses apps, because it questions the grid mechanics per se.

j4james · 2021-09-02T00:15:38Z

Currently all (or almost all?) terminals which do handle VS15/VS16 do not change the width to narrow/wide but leave it to their base value or the preceding Unicode codepoint.

I think that is wrong. I.e. if the visual is changing, then the cursor should be moved accordingly.

Let's say the cursor is positioned in the bottom right corner of the screen, and the next character you receive is wide by default. It doesn't fit in that cell, so you're forced to wrap to the next line and scroll the screen. Then you receive another character, which is a variation selector, indicating that the previous character was actually meant to be narrow. How do you recover from that situation?

Technically the character is meant to be narrow, and thus should have fit on the previous line, and the screen should never have scrolled. But there's no way you can now go back and undo all of that. So this leaves you with a situation where the character looks narrow, but the number of cells it occupies may be 1 or 2, depending on where/when it is output.

I can't seem to access gitlab now, but I'm sure this has all been discussed before in terminal-wg, and I don't think anyone proposed a workable solution to the problem. It wasn't a matter of getting everyone to agree - it's just that there wasn't any solution that could be agreed to (at least that was my recollection).

jerch · 2021-09-02T12:40:18Z

@christianparpart Further note that those runwidth decisions cannot be made from a unicode version flag alone, as unicode explicitly left that up to the output system. E.g. take a compound family emoji like 👨‍👩‍👧‍👦 - both the multi-glyph and the compound-glyph representation are valid from a unicode perspective, but for a terminal it means it can either render it as 👪 spanning 2 cells, or as 👨👩👧👦 spanning 8 cells.
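
To make the two candidate runwidths concrete, here is a minimal Python sketch. It assumes that a single glyph's width follows its East Asian Width property (2 cells for W/F, else 1) and that the fallback rendering simply drops the ZWJ joiners. The function names are made up for illustration and are not part of any proposal.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER

def glyph_width(ch: str) -> int:
    """Assumed width rule: East Asian Wide/Fullwidth take 2 cells, the rest 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def candidate_runwidths(sequence: str) -> tuple:
    """Return (compound_width, separate_width) for a non-empty ZWJ sequence."""
    parts = [p for p in sequence.split(ZWJ) if p]
    # Compound rendering: one glyph, sized like its first component.
    compound = glyph_width(parts[0][0])
    # Fallback rendering: every component drawn on its own.
    separate = sum(glyph_width(p[0]) for p in parts)
    return compound, separate

# man ZWJ woman ZWJ girl ZWJ boy
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(candidate_runwidths(family))
```

An app that only knows the codepoints cannot tell which of the two values the terminal will pick, which is the whole problem.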

Since we have no proposed solution for this yet, let's make one. Rough idea:

1. extend runwidth expectations with a default behavior
   Ideally we find a lowest common denominator here, which will work across most terminals / output systems. For the example above I suggest going with 8 cells as default. Note that this is really hard work to lay out, as it basically means going through all the compound and variation rules and finding & defining the default behavior that fits most systems.
2. extend terminal interface with a "render as you please" mode in DECSET or SM, plus a sequence to ask for runwidths
   By default a terminal would do what is described under 1. With the new mode an app can tell the terminal to freely render things. Now it is up to the terminal to use the compound or the sequential glyphs, whatever it is capable of. The additional sequence allows the app to ask the terminal how wide it would render that emoji above, so it can account for that in its screen layouts. The sequence needs a bit of thinking; prolly asking for multiple chars at once is a good idea to reduce the request-response round trips between app and terminal.

Edit:
The sequence could be shaped like DECRQSS request-response cycling:

request:    DCS <TBD>  👨‍👩‍👧‍👦  ;  🏁  ;  ...  ST       // up to 16 (32|64?) entries
response:   CSI         8  ;   2  ; ...  <TBD>    // runwidths in CSI params
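
For illustration, the request/response shape above could be handled on appside roughly like this. This is only a sketch: the `<TBD>` parts are not decided anywhere in this thread, so `REQUEST_INTRO` and `RESPONSE_FINAL` are hypothetical placeholders, as is the limit of 16 entries.

```python
ESC = "\x1b"
DCS = ESC + "P"
ST = ESC + "\\"
CSI = ESC + "["

REQUEST_INTRO = "$w"   # hypothetical stand-in for the request <TBD>
RESPONSE_FINAL = "w"   # hypothetical stand-in for the response <TBD>
MAX_ENTRIES = 16       # assumed limit, the thread leaves this open

def build_width_query(clusters):
    """Build the DCS request asking for the runwidth of each entry."""
    if not 0 < len(clusters) <= MAX_ENTRIES:
        raise ValueError("need between 1 and %d entries" % MAX_ENTRIES)
    if any(";" in c for c in clusters):
        raise ValueError("';' is the separator and cannot be measured as text")
    return DCS + REQUEST_INTRO + ";".join(clusters) + ST

def parse_width_response(response, expected):
    """Parse 'CSI w1 ; w2 ; ... <final>' into a list of runwidths."""
    if not (response.startswith(CSI) and response.endswith(RESPONSE_FINAL)):
        raise ValueError("not a runwidth response")
    body = response[len(CSI):-len(RESPONSE_FINAL)]
    widths = [int(p) for p in body.split(";")]
    if len(widths) != expected:
        raise ValueError("response does not match the request")
    return widths

query = build_width_query(["\U0001F3C1", "abc"])
widths = parse_width_response(CSI + "2;3" + RESPONSE_FINAL, expected=2)
```

Batching several entries into one request keeps the number of round trips low, which is the point of the separator.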

j4james · 2021-09-02T13:23:42Z

The sequence could be shaped like DECRQSS request-response cycling:

+1 to this. The only thing I'd add, is that it might be useful to follow the pattern of some of the other string-based queries that have a parameter to choose between pure text and hex encoding (for the characters you're wanting to test). See for example DECLBAN and DECDMAC. Although as long as you don't need to test the semicolon separator, or any of the control characters, that's probably not necessary. Worst case you could add that later.
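
A hex option could look like the following sketch. The exact encoding is an assumption here (hex pairs of the UTF-8 bytes); nothing in this thread fixes it, and the helper names are made up.

```python
def to_hex_payload(cluster: str) -> str:
    """Encode one entry as uppercase hex pairs of its UTF-8 bytes (assumed format)."""
    return cluster.encode("utf-8").hex().upper()

def from_hex_payload(payload: str) -> str:
    """Decode an entry that was produced by to_hex_payload."""
    return bytes.fromhex(payload).decode("utf-8")

# The separator itself becomes measurable once it is hex encoded.
encoded_separator = to_hex_payload(";")
encoded_flag = to_hex_payload("\U0001F3C1")
```

With this, the payload only contains `0-9` and `A-F` plus the separator, so it also passes through parsers that reject UTF-8 or control characters inside a sequence.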

jerch · 2021-09-02T13:55:08Z

The only thing I'd add, is that it might be useful to follow the pattern of some of the other string-based queries that have a parameter to choose between pure text and hex encoding (for the characters you're wanting to test). See for example DECLBAN and DECDMAC.

Yes, that's indeed important, as not all parsers might allow UTF-8 within the sequence payload.

Although as long as you don't need to test the semicolon separator, or any of the control characters, that's probably not necessary. Worst case you could add that later.

My thinking with the separator was that it would allow handling arbitrary unicode content, thus also multiple chars from complicated writing systems at once. The returned value would then rather denote the whole runwidth of that "phrase". (Works a bit like a wcswidth DB request and would allow skipping those halfway broken wcwidth impls seen in some C libs.) For this I think we cannot make it collision-free without an explicit separator (maybe I miss something).

j4james · 2021-09-02T14:12:17Z

My thinking with the separator was, that it would allow to handle arbitrary unicode content, thus also multiple chars from complicated scripting systems at once.

Yeah, that seemed like a sensible approach to me. And as I said, I don't think the choice of separator is likely to be a problem, because it's presumably not something anyone would want to measure. But if they did, and we had the hex option, then they could still use the hex representation for it.

jerch · 2021-09-02T14:42:58Z

Some more thinking about such a mode extension:

Apps operating on the normal scroll buffer prolly don't care about individual runwidths, so maybe this new mode could be set as default on that buffer. Not so for apps on the alternate buffer: they normally have a strict idea about the screen layout, so I think that mode cannot be set as default there without breaking many "canvas apps". They can still use it to their advantage if they set the mode explicitly and do the sequence belly dance.

Roughly this leads to this scheme for default settings:

normal buffer - left to decide by the terminal itself (not sure, might cause frictions, then unset)
alternate buffer - always unset at beginning
buffer switching - always flip to the target buffer's default
multiplexer detached - only unset can give reliable results
multiplexer attached - might be able to derive a common setting from sub terminals, fallback to unset
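
The scheme above can be written down as a small decision function. This is just a sketch of one possible reading: the names are made up, and "derive a common setting" is interpreted here as "set only if every attached terminal has it set".

```python
from typing import Sequence

def default_mode(buffer: str, terminal_prefers_set: bool = False) -> bool:
    """Default of the 'render as you please' mode when entering a buffer."""
    if buffer == "alternate":
        return False                  # always unset at beginning
    if buffer == "normal":
        return terminal_prefers_set   # left to the terminal to decide
    raise ValueError("unknown buffer: %r" % buffer)

def multiplexer_mode(client_modes: Sequence[bool]) -> bool:
    """Mode a multiplexer can promise to the apps running inside it."""
    # Detached (no clients): only unset gives reliable results.
    # Attached: assumed rule, set only if all sub terminals agree.
    return len(client_modes) > 0 and all(client_modes)
```

Buffer switching then simply calls `default_mode()` for the target buffer.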

Note that this mode doesn't care about the rendered glyph in the end. It is just a promise about the runwidth taken. Technically a terminal capable of compound glyphs can render the family emoji from above in unset mode like:

compound/left-aligned:   |👨‍👩‍👧‍👦 | | | | | | |
compound/centered:       | | | |👨‍👩‍👧‍👦 | | | |
compound/right-aligned:  | | | | | | |👨‍👩‍👧‍👦 |
single glyphs:           |👨 |👩  |👧 |👦 |

Wow, that's really hard to align here, hope you get the idea. To me only single glyphs make sense in unset mode, as that does not screw up output too much. But I think that should be left to terminal devs to decide.

Another problem directly arising from this is grapheme segmentation, and how to deal with those "spreaded compounds" at line ends. Currently I tend to treat them as non-breakable if the segmentation algo says so. This means we would get "ragged-right" line ends with early wrap-around, where more than 2 cells can stack up to one perceived character (flags for example must not break and would take 4 cells). Note that this is really hard to achieve: it means that all terminals would have to revamp their combining cell logic to support more than 2 cells at once, same with the cursor advance. So while this feels "more natural" to me, it might be way over what can be done / asked for.

(@christianparpart I hope I didn't go too far off topic, as you only asked about the variation selectors. Imho to properly handle those we need to talk about the fundamental handling of newer unicode concepts in the terminal first.)

j4james · 2021-09-02T16:49:22Z

normal buffer - left to decide by the terminal itself (not sure, might cause frictions, then unset)

I suppose it's up to the terminal to decide, but personally I'd expect this to be unset by default. Even in a normal buffer, people tend to do all sorts of fancy things with emojis in their prompts, and if you get the width wrong, those layouts are likely to break.

Currently I tend to treat them as non-breakable, if the segmentation algo says so, means we would get "ragged-right" line ends with early wrap-around

That seems sensible to me. You're already getting ragged-right when using emojis and ideographic languages, so this isn't much different is it? I don't know about the technical side of things, so maybe it's more complicated than I'm imagining, but I expect this issue will come up anyway if/when terminals try to support some of the more complicated writing systems, where characters can't reasonably fit in 2 cells.

jerch · 2021-09-02T18:13:13Z

That seems sensible to me. You're already getting ragged-right when using emojis and ideographic languages, so this isn't much different is it? I don't know about the technical side of things, so maybe it's more complicated than I'm imagining, but I expect this issue will come up anyway if/when terminals try to support some of the more complicated writing systems, where characters can't reasonably fit in 2 cells.

@j4james
True, it basically extends the current CJK behavior with 2 cells to n cells for grapheme clusters. I am not deep enough into unicode to tell whether this will help peeps with foreign writing systems or not. Still I see a chance here to fix most of the problems above while sticking to the grid idea (until unicode brings the next absurd shenanigans haha). So yes, I am up for that.

But will we convince term devs to adopt a much more complicated grid model? Grapheme clusters? Arbitrarily "long" cells? There it is again - the chicken-and-egg problem of serious everyday infrastructure. 😸

Edit:
Ok, let's put more cluster-related problems on the table. What about the cursor? How shall a cluster be addressable by cursor movements? Always as one single char? How to construct/edit during input?
Some ideas:

A cluster spanning multiple cells is treated as one big cell during cursor marking, but leaves the individual movement underneath intact (you'd have to go 8 cells backwards to get to the cell right before the family emoji). Again similar to CJK handling.
Input prolly already needs an IME helper in the first place. So clusters are likely to get added at once, not by individual codepoint input. If things still flow in as separate codepoints, the terminal would have to revise the old output after adding the incoming chars to that (prolly new!) cluster. This only works reliably in consecutive input mode, not for overwriting existing cells after cursor jumps. This is also directly related to the question of where "to park" a cluster in the buffer (all in the first cell? spread across the taken cells?). To me, putting a cluster into the first cell where it occurred seems easier to grasp and handle for follow-up writes and cursor moves (but this also means that the terminal might have to do late wrapping when the cluster starts to overflow at the margin).
Editing though is more tricky to solve - what should be rendered after one BS+erase, or when something got removed from the middle, or makes the cluster break apart? I have no good answer to that yet; maybe the terminal wants to black out the cluster in question in normal output and bring up the IME helper again instead. Idk if such a behavior is feasible at all. It touches all those problems we already have in richtext editors for more complicated char input. Maybe we can learn from them to some degree. A simple straightforward implementation would do the same as with CJK (cluster in first cell, touching the cell area with erase actions deletes the cluster).

christianparpart · 2021-09-02T20:00:21Z

Maintainer Author

Closing the rant above: In long term imho the only solution is to give up the wcwidth promise, and let things flow more freely.

My stance is that wcwidth should be deprecated (recommended not to be used) and wcswidth be used instead (assuming that wcswidth is aware of grapheme clusters and VS15/VS16 as maybe-future-defined of course :) ).

Lets say the cursor is positioned in the bottom right corner of the screen, and the next character you receive is wide by default.
[...] How do you recover from that situation?

Okay, @j4james, you caught me off guard here. I did not think of that case yet. In other words: I assumed that the base character would cause an auto-wrap, and a following VS15 would indeed change the width from X (say 2) back to 1, but without moving back to the prior cursor position. Now that I think of it, moving back is technically possible and should not be too expensive: at least in my case, I keep track of the previous grid coordinate, so unwrapping would be possible with some additional logic when U+FE0E is received and the previous text character changed coordinate due to auto-wrap. I think that's not a deal breaker as long as it's well defined somewhere.

I can't seem to access gitlab now, but I'm sure this has all been discussed before in terminal-wg

OT: thinking positive here: we have a non-prejudicial clean-room discussion then :-)

[... ...] but for a terminal it means it can either render it as family spanning 2 cells, or as manwomangirlboy spanning 8 cells.

I think with a well defined DECRQM-mode (or similar) this can be very well defined, too. I may be dreaming a little bit too much of a perfect world here.
When I was reading TR#51, it did indeed state that ZWJ emoji can "alternatively" be rendered with each emoji individually.
Now, to me, that reads like a compromise so that unicode-rendering implementations without ZWJ support are still "conforming". Alacritty for example does render that emoji 👪 as 4 individual (even non-colored) emoji. But Alacritty gets a lot of emoji wrong, so I'd consider that TE as non-supporting, and it can easily be distinguished from those that do expose support for proper ZWJ emoji rendering (for example via DECRQM; not saying it has to be that way, it's convenient though :) ).

WRT your proposals, I think I ruled out proposal number 1 with that idea. Number 2 is interesting and actually something I already had in mind too. I dismissed it as impractical (IMHO).
Then one may ask when an app would use that (I'm eyeing notcurses here, for example). I think notcurses (or other apps/libs) would then use that VT sequence with just one or at most a few sequences, to determine whether ZWJ sequences are rendered the legacy way or (as I'd call it) the proper way. That's binary, and may again fit more efficiently into an RQM-style test that (on success) would give the guarantee of following some "spec" that is well defined somewhere publicly on the net.

To your DECRQM-style VT sequence, continuing to think about that idea: I cannot remember off the top of my head, but I am sure there are DCS requests that also respond with DCS, so there is no need to allocate another CSI response.

See for example DECLBAN and DECDMAC.

I am sure there are other CSI sequences that even expect a textual parameter in its decimal representation. With a quick check I found DECFRA, which I have recently implemented - I actually thought there were more than just DECFRA :) .

Some more thinking about such a mode extension:

I'm glad that idea was picked up by you at least :-D ... Well, I am not sure a TE needs to have that flag based on buffer. Because the canvas-style app can use DECRQM to read the current state before going to work and just restore that upon application exit.
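
A sketch of that query/restore dance, using the standard DECRQM request (`CSI ? Ps $ p`) and DECRPM reply (`CSI ? Ps ; Pm $ y`, where Pm is 1 for set and 2 for reset). The mode number is a free parameter here, since no number has been picked in this thread, and the helper names are made up.

```python
import re

CSI = "\x1b["

def decrqm(mode: int) -> str:
    """Request the state of a DEC private mode."""
    return "%s?%d$p" % (CSI, mode)

def decset(mode: int) -> str:
    return "%s?%dh" % (CSI, mode)

def decrst(mode: int) -> str:
    return "%s?%dl" % (CSI, mode)

_DECRPM = re.compile(r"\x1b\[\?(\d+);(\d+)\$y")

def parse_decrpm(reply: str) -> tuple:
    """Return (mode, state); state 0 means the mode is not recognized."""
    match = _DECRPM.fullmatch(reply)
    if match is None:
        raise ValueError("not a DECRPM reply")
    return int(match.group(1)), int(match.group(2))

def restore_sequence(mode: int, state: int) -> str:
    """Sequence that puts the mode back to what it was at startup."""
    if state == 1:
        return decset(mode)
    if state == 2:
        return decrst(mode)
    return ""  # unknown or permanently fixed: nothing to restore
```

A canvas app would send `decrqm(mode)` at startup, remember the parsed state, enable the mode, and write `restore_sequence(mode, state)` on exit.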

Multiplexers are always a special case that I, at least, would not like to forget about either. I think they won't have any issue, because they can test the connecting client TEs for support too. If the connecting TEs do not expose support, or a multiplexer even has multiple client TEs connected with varying support, then the multiplexer can inject a space character after those emoji / double-width characters for the TEs that do not support this spec.

Note that this mode doesnt care about the rendered glyph in the end. It is just a promise about the taken runwidth

I hope I'm not being too blunt here. But I think with runwidth you actually mean how many grid cells a given grapheme cluster will occupy, right? So yeah, such a spec would be about:

support detectability (e.g. DECRQM)
mode enable/disable (enabling gives the cursor-movement guarantees and proper rendering of emoji)

Mind, I am still strongly against the idea of supporting the alternate emoji representation (e.g. 4 individual emoji instead of a family emoji).

Another problem directly arising from this is grapheme segmentation, and how to deal with those "spreaded compounds" at line ends. Currently I tend to treat them as non-breakable, if the segmentation algo says so

In my implementation I am strictly adhering to the grapheme cluster segmentation algorithm. So consecutive characters that are by definition unbreakable will always end up in the same grid cell. That is a guarantee that should then be made by such a fictional spec, too (if that mode is enabled).

@christianparpart I hope I didnt go too much offtopic, as you only asked for the variation selectors

No you didn't. It's kind of a connected topic anyway. However, my whole point here was (and is) to take care of complex emoji with ZWJ and VS15/VS16 overrides, their cursor movement implications and display representations. I think that's small enough to not fear the talked-to-death syndrome.

p.s.: I didn't manage to process ALL posts yet, will resume later, but maybe we end up productively? ;-)

christianparpart · 2021-09-02T20:20:33Z

Maintainer Author

but I expect this issue will come up anyway if/when terminals try to support some of the more complicated writing systems, where characters can't reasonably fit in 2 cells.

Isn't it that all codepoints are mapped with an east asian width that can at most be Wide, which is interpreted as 2 grid cells? So it can AFAIU never exceed 2 grid cells.

What about the cursor? How shall a cluster be addressable from cursor movements?

A grapheme cluster occupies one grid cell, so no need to change cursor semantics. A grapheme cluster may cause the cursor to move 2 columns forward instead of just 1. So technically you can address the empty grid cell that was jumped over earlier. I do not see an issue here, because it's no simpler (nor more complex) than with DECDHL / DECDWL (double-height / double-width lines).

How to construct/edit during input?

That's the job of the application, no need to spec that out.

jerch · 2021-09-02T21:24:07Z

My stance is that wcwidth should be deprecated (recommended not to be used) and wcswidth be used instead (assuming that wcswidth is aware of grapheme clusters and VS15/VS16 as maybe-future-defined of course :) ).

Agreed. To my understanding the whole wcwidth idea is flaky if provided by standard system libs. I have a small hope that a very fundamental definition, in combination with a sequence as described above for the more complicated things, would do in the end. It would make all those wrong-wcwidth-table issues obsolete. If in doubt, ask the terminal. A dream would come true 🍭

I think with a well defined DECRQM-mode (or similar) this can be very well defined, too. I may be dreaming a little bit too much of a perfect world here.

Agreed.

When I was reading the TR#51 it was indeed stating that ZWJ emoji can "alternatively" be rendered with each emoji individually.
Now, to me, that reads like a compromise to ZWJ-non-supporting unicode-rendering implementations still "conforming".

Yes, there are many parts in unicode phrased either vaguely as maybes, or directly as "left to the output system". That is not helpful for us at all, thus we have to do the dirty job of some "after-speccing".

To your DECRQM-style VT sequence, continueing to think about that idea: I cannot remember out of my head, but I am sure there are DCS that do also respond with DCS, so no need to allocate another CSI-response.

Well, I don't really care whether the response changes the sequence realm or not; a DCS ofc would be more free in its payload format. The problem I see here: it's a bit more involved to parse DCS correctly, while CSI is pretty darn simple (note we are talking here about appside digesting those responses, not the terminal, which should get that right in the first place). Furthermore most apps do a lousy job of reading back data from the terminal. I think we should make sure that the response never exceeds POSIX's minimal PIPE_BUF size (imho defined as 512 bytes), otherwise the OS might chunkify things and the app/script goes bonkers.
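
To show why CSI replies are easy to digest even when they arrive in pieces, here is a minimal sketch of an appside reader. It assumes the standard CSI byte layout (parameter bytes 0x30-0x3F, intermediate bytes 0x20-0x2F, one final byte 0x40-0x7E); the class name is made up, and anything that is not a complete CSI sequence before a match is simply dropped in this sketch.

```python
import re

# ESC [ , then parameter bytes, intermediate bytes, and one final byte.
_CSI_REPLY = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")

class CsiReplyReader:
    """Collects chunks read from the terminal and yields complete CSI replies."""

    def __init__(self):
        self._buffer = ""

    def feed(self, chunk: str) -> list:
        self._buffer += chunk
        replies = []
        consumed = 0
        for match in _CSI_REPLY.finditer(self._buffer):
            replies.append(match.group(0))
            consumed = match.end()
        # Keep a possibly incomplete tail for the next chunk.
        self._buffer = self._buffer[consumed:]
        return replies

reader = CsiReplyReader()
first = reader.feed("\x1b[8;")   # reply split by the OS, nothing complete yet
second = reader.feed("2w")       # the rest arrives with the next read
```

An app that instead does a single `read()` and expects the whole reply is exactly the kind of reader that breaks once the reply gets chunked.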

I'm glad that idea was picked up by you at least :-D ... Well, I am not sure a TE needs to have that flag based on buffer. Because the canvas-style app can use DECRQM to read the current state before going to work and just restore that upon application exit.

I don't think so either, but I think we should get the defaults straight to not break half of the curses world just for proper emojis. By making unset the default on the alternate buffer, a canvas app not supporting the new mode does not have to ask the terminal (it prolly isn't even aware that it could ask for that mode), and will just keep working as before.

I hope I'm not acting too blunt here. But I think with runwith you actually mean how many grid cells a given grapheme cluster will occupy, right?

Yep, lol idk if there is a better English term for it, it's the German "Laufweite".

Mind, I am still strongly against the idea of supporting the alternate emoji representation (e.g. 4 individual emoji instead of a family emoji).

Well, that's what I was trying to depict with those different alignment ideas above. I also don't think terminals should be pushed into those single glyphs if they can just render the compound thingy well. Still, the cell/cursor advance should follow the basic promise if the new mode is unset. If the new mode is set, do as you please. That's the idea.

In my implementation I am strictly adhering to the grapheme cluster segmentation algorithm. So consecutive characters that are by definition unbreakable will always end up into the same grid cell. That is a guarantee that should then be made by such a fictional spec, too (if that mode is enabled).

Yepp, sounds good to me. I don't think such a spec would need any claims about where to store things, still the behavior must be clearly laid out. E.g. consecutive data, even across several chunks, will feed into the same cluster if the segmentation algo says so. After cursor jumps or any other in-between data I think that should not be the case; instead cluster data should be treated as "broken", effectively overwriting previous cell content (which makes sense in terms of the unicode data stream: following default unicode breaking rules, a cluster should not magically continue after some terminal sequence in between). I am stating this explicitly here, as it is quite easy to overlook during "print handling" in the terminal.
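
A toy model of that rule, as a sketch only. The segmentation check is a heavily simplified stand-in for the real grapheme cluster algorithm (UAX #29): it only knows ZWJ and combining marks (which include the variation selectors), and it ignores cursor positions entirely. All names are made up.

```python
import unicodedata

ZWJ = "\u200D"

def extends_cluster(cluster: str, ch: str) -> bool:
    """Simplified stand-in for UAX #29: does ch continue the given cluster?"""
    if ch == ZWJ or unicodedata.category(ch) in ("Mn", "Me", "Mc"):
        return True                # joiners, combining marks, VS15/VS16
    return cluster.endswith(ZWJ)   # anything right after a ZWJ joins it

class ClusterWriter:
    def __init__(self):
        self.cells = []            # one entry per grapheme cluster
        self._open = False         # may the last cluster still grow?

    def write_text(self, chunk: str) -> None:
        for ch in chunk:
            if self._open and extends_cluster(self.cells[-1], ch):
                self.cells[-1] += ch
            else:
                self.cells.append(ch)
                self._open = True

    def control_sequence(self) -> None:
        """Any cursor jump or other sequence breaks the current cluster."""
        self._open = False

consecutive = ClusterWriter()
consecutive.write_text("\U0001F468\u200D")   # first chunk ends with a ZWJ
consecutive.write_text("\U0001F469")         # second chunk continues it

interrupted = ClusterWriter()
interrupted.write_text("\U0001F468\u200D")
interrupted.control_sequence()               # e.g. a cursor movement
interrupted.write_text("\U0001F469")         # starts a fresh cluster
```

The `_open` flag is the part that is easy to forget: it has to survive across chunks but must be cleared by every non-print action.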

Isn't it that all codepoints are mapped with an east asian width that can at most be Wide, which is interpreted as 2 grid cells? So it can AFAIU never exceed 2 grid cells.

To my understanding, in some writing systems like Indian languages there are clustering constellations that might lead to weird quarter/half widths, stacking up to some bigger thingy in the end. In a normal word processor those are dealt with by the font renderer from the font glyphs and their composition/ligature hints. We kinda have no easy way to do that in an (offscreen) terminal, plus we really don't want that (depending on font caps? haha). Therefore I think we need to get them specced in a certain way. Here it would be good to have someone with more experience in those scripts on board. I am not that one, so read my comment as hearsay.

christianparpart · 2021-09-02T21:46:17Z

Maintainer Author

After cursor jumps or any other in between data I think that should not be the case, but instead treat cluster data as "broken" effectively overwriting previous cell content (which makes sense in terms of unicode data stream, a cluster should not continue after some terminal sequence in between following default unicode breaking rules).

Exactly. That is what I meant with consecutive and that is how I implemented it.

I do not think that a minimal spec must include any definitions for the case that the mode is disabled (or better: not enabled). Because IMHO that's the whole point: to get a well defined environment that you can access with this mode enabled. If it is not enabled, then the app must not expect anything, as the behavior is as undefined as it is today. I care about a well defined environment that is surely not enabled by default (backwards compatibility.....), but if enabled we have all those guarantees we talked about so far.

I do not know whether such a minimal spec would need to take care of weird scripts with regard to East Asian Width, as in the end we talk about this here in order to get emoji rendering and its cursor positioning right.

Sure, more could be defined and included in such a mode. But I fear that we then run into the rabbit hole where we will not finish the idea.

What do you think?

j4james · 2021-09-03T22:58:21Z

I do keep track of the previous grid coordinate, so that unwrapping would be possible with some additional logic when U+FE0E is received and previous text character changed coordinate due to auto-wrap.

But how do you "unscroll" the screen when the wrapping happens on the last line? And bear in mind that the scrolling may have occurred within margins, in which case the line that scrolled off the top would have been erased completely, so it's not like you can just go back in the scrollback buffer.

And even if you did something where you kept a record of the last line scrolled, so you could unwind that as well, this doesn't seem to me like a workable solution, because whenever the unwind occurs, the screen is going to jump as it scrolls up and down.

The only solution that I thought might be reasonable, was something like a delayed wrap. So if you write a wide character on the last column of the page (and assuming it was capable of being narrowed), then you don't actually wrap immediately, but just display half the character (or maybe the narrow version, or nothing at all). Then when you receive the next character, either it's going to shrink and can be left where it is, or it's definitely wide and you can then safely trigger the wrap.
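
The delayed wrap idea can be sketched as a tiny cursor model. This only tracks the column and the number of wraps, assumes that width follows the East Asian Width property, and treats VS15 as narrowing (which is the premise being discussed here, not an agreed rule). The class and method names are made up.

```python
import unicodedata

VS15 = "\uFE0E"

def base_width(ch: str) -> int:
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

class DelayedWrapRow:
    def __init__(self, columns: int):
        self.columns = columns
        self.col = 0          # next free column
        self.wraps = 0        # how often we wrapped to a new line
        self.pending = None   # wide char parked at the last column

    def _resolve_pending(self, narrowed: bool) -> None:
        if self.pending is None:
            return
        if narrowed:
            self.col += 1     # fits into the one remaining cell
        else:
            self.wraps += 1   # definitely wide: wrap now
            self.col = 2
        self.pending = None

    def feed(self, ch: str) -> None:
        if ch == VS15:
            # A VS15 after an already placed char is ignored in this sketch.
            self._resolve_pending(narrowed=True)
            return
        self._resolve_pending(narrowed=False)
        width = base_width(ch)
        if self.col + width > self.columns:
            if width == 2 and self.col == self.columns - 1:
                self.pending = ch   # do not wrap yet, wait for the next char
                return
            self.wraps += 1
            self.col = 0
        self.col += width

narrowed = DelayedWrapRow(columns=4)
for ch in "abc\U0001F600" + VS15:
    narrowed.feed(ch)

wide = DelayedWrapRow(columns=4)
for ch in "abc\U0001F600d":
    wide.feed(ch)
```

In the first case the character stays on the line and nothing ever scrolls; in the second the wrap happens one character late, which is the visual cost of this approach.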

I don't particularly like that solution either, but if I absolutely had to support width-changing variants, that seemed like the least worst option to me.

Also note that wrapping is only one example of the problems you get with width-changing variants. Another case to consider is when Insert/Replace Mode is set (i.e. you're inserting) and you write out a wide character that pushes two cells off the right edge of the screen. Then you receive a variant selector which narrows that character, so now you need to undo one column of the insert, and somehow recover one of the characters that had been pushed off screen.

christianparpart · 2021-09-04T05:07:14Z

Maintainer Author

@j4james ooh right. I forgot about margins. Sorry.

I remember I once checked against web browsers, and it turns out that (I tested with Chrome) VS15 does indeed change the presentation to text (try with any emoji) but it keeps the width of "Wide", which would be great for us. I forgot that all I wanted is a user experience convergence, so web emoji and terminal emoji should behave equally.
I think we all may have had a temporary misunderstanding? IIRC VS15/VS16 is about changing presentation (colored vs text of emoji).

With that in mind, the only problem I see might be the copyright symbol. I think that by default has width 1 but can have VS16 applied too, so it does grow. There may be other symbols like that. But for the grow case I think we can all agree on a workable solution.

Did I miss anything?

Trying to recap a small checklist of potential spec requirements:

consecutively (!) written non-breakable codepoints will always end up in the same grid cell, leading to a grapheme cluster aware TE.
emoji symbols are always rendered as a square (as required by TR51), implying an East Asian Width of Wide (2 grid cells), and requiring compound (ZWJ) emoji to always be rendered as compound emoji. The alternate rendering of ZWJ emoji is therefore considered invalid / not supported.
VS16 upgrades symbols to emoji presentation, leading to width 2, and potentially reflowing that symbol to the next line if at the right margin with AutoWrap on.
VS15 changes the presentation from emoji to text but retains the width of 2. (This matches web browser behavior too.)
emoji symbols, regardless of variation selectors (15/16), will move the cursor visually next to them, i.e. by 2 columns instead of 1.
emoji written with the cursor at the right margin and with AutoWrap on will first trigger AutoWrap and then write the emoji character into the grid (aligns with CJK).
emoji written at the right margin with AutoWrap off will have only their first half rendered.
All of the above must be adhered to if the TBD (DEC) mode is on. Otherwise the behavior is as undefined as it is today.
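
To make the right-margin items of the checklist a bit more concrete, here is a minimal Python sketch of how a TE might place a width-2 cluster. The `Screen` class, its fields and `write_cluster` are hypothetical names made up for illustration, not taken from any existing terminal. The model also ignores scrolling, left/right margins other than the screen edge, and the pending-wrap flag that real terminals keep.

```python
from dataclasses import dataclass, field


@dataclass
class Screen:
    """Toy grid model (hypothetical): only tracks the cursor and written cells."""
    columns: int
    autowrap: bool = True
    row: int = 0
    col: int = 0  # 0-based; col == columns means "just past the last column"
    cells: dict = field(default_factory=dict)  # (row, col) -> cluster text

    def write_cluster(self, text: str, width: int) -> None:
        """Write one grapheme cluster that occupies `width` (1 or 2) cells."""
        if self.col + width > self.columns:
            if self.autowrap:
                # Not enough room: wrap first, then write (same as for CJK).
                self.row += 1
                self.col = 0
            else:
                # AutoWrap off: stay in the last column; a wide cluster
                # would only show its first half there.
                self.col = self.columns - 1
        self.cells[(self.row, self.col)] = text
        # Advance by the cluster width, never beyond the right edge.
        self.col = min(self.col + width, self.columns)


screen = Screen(columns=4)
for ch in "abc":
    screen.write_cluster(ch, 1)
screen.write_cluster("\U0001F600", 2)  # does not fit in the one remaining cell
print(screen.row, screen.col)
```

With AutoWrap on, the emoji lands in the first two cells of the next line and the last cell of the previous line stays empty. With `autowrap=False` it is written into the last column of the same line.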

Does this sound like it could convince other TE devs?

What did we miss in this list? What do you think? :)

j4james · 2021-09-04T17:17:32Z

j4james
Sep 4, 2021

What needs to be done in order to get WT buying in?

There's an issue in the WT tracker (microsoft/terminal#8000) where they've been discussing support for more advanced features of Unicode, as well as complex scripts. Initially I didn't think it was a good idea, because I assumed they would just break existing applications, but this mode idea of yours seems like it would be a solution to that problem (and if not a mode, then possibly the cluster measuring sequence that was discussed earlier).

But the first thing would be to decide whether you think your ideas align with what they're planning. There's much detail in that issue, but broadly speaking you can probably tell if you're likely to be in agreement with them or not. If you are in agreement, then maybe leave a note there describing your plans for the mode, and see whether they'll be interested in collaborating. Personally I'm in favour of the idea, but I'm just a contributor there - I can't speak for the WT team.

On the plus side, there are people at MS that are genuine experts on the subject, which would be helpful in covering areas of Unicode that you may not know about. The down side is that you may have to wait some time before they're ready to agree to anything.

christianparpart · 2021-09-04T17:51:23Z

christianparpart
Sep 4, 2021
Maintainer Author

Thanks @j4james. I'll keep you posted.

christianparpart · 2021-09-04T19:27:34Z

christianparpart
Sep 4, 2021
Maintainer Author

@jerch @j4james I'd like to kindly ask you to read https://github.com/contour-terminal/terminal-unicode-core/releases/tag/v0.1.0_prerelease_1 and maybe give some feedback on it. I hope I addressed everything. We can use this document (its source code / git repo) as the basis for the current state of the discussion.

I'll try to keep it up to date, and that is the document I'd like to forward to microsoft/terminal#8000 once we've found a consensus that at least all of us are comfortable with, so we can get more feedback from others.

j4james · 2021-09-04T21:31:06Z

j4james
Sep 4, 2021

That looks good to me. I was expecting it to be more complicated, but if everything else is covered by the linked Unicode documentation then that's brilliant.

Answers to some of the questions in the sidebar:

For the Unicode version issue, I'd be happy to ignore it until it becomes a problem. We may be worrying about something that never happens again.
For feature detection, I think it's better not to even mention DA1. While I'm in favour of DA1 for feature detection in general, I'd rather reserve it for features that can't be detected in any other way, so it doesn't get overloaded unnecessarily.
Regarding skipped grid cells in the emoji section, I'm really not sure whether that needs to be explicit. I'm happy to leave that an open question for now and see what others have to say.

Minor nit regarding references: when you say "as described in 9", it would be a little clearer if you referenced the actual document name, e.g. "as described in UTS 29" (with a [9] reference link following that).

I also think some of the wording could possibly be made clearer, but that's something that can be polished later, once you've got feedback on the actual substance of the spec.

jerch · 2021-09-05T19:08:28Z

jerch
Sep 5, 2021

Wow, that draft is pretty on point and I am impressed, that you got it sorted that short. And I like it for being that short and concise. 👍

For the Unicode version issue, I'd be happy to ignore it until it becomes a problem. We may be worrying about something that never happens again.

Yepp, I feel the same way here. If you care about different unicode version rules, maybe just point out in an additional sentence, that this was made with rules for unicode 11-13 in mind. That way we know later on, where the relevance might get thin again, because unicode introduced some new fancy stuff with 14+ or so.

For feature detection, I think it's better not to even mention DA1. While I'm in favour of DA1 for feature detection in general, I'd rather reserve it for features that can't be detected in any other way, so it doesn't get overloaded unnecessarily.

Agreed. Currently I also would not mess with DA1, furthermore stating something important like feature detection as maybe again ("The DA1 could be extended to also indicate support") is not helpful for a spec like thingy (either tells peeps to do that, so apps can grow confidence to find it there, or dont mention at all. I lean towards "dont mention" for now). Furthermore doubled feature reporting is awkward and will just lead to implementation/request frictions later on, so I am good with "request it exactly this way, period".
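
For reference, "request it exactly this way" via DECRQM could look roughly like the Python sketch below. DECRQM for a private mode is `CSI ? Ps $ p`, and the terminal answers with DECRPM, `CSI ? Ps ; Pm $ y`, where `Pm` is 0 (not recognized), 1 (set), 2 (reset), 3 (permanently set) or 4 (permanently reset). The mode number is still TBD in this thread, so `HYPOTHETICAL_MODE` is only a placeholder, and the function names are made up for illustration.

```python
import re

# Placeholder only: the actual DEC mode number is still TBD in this discussion.
HYPOTHETICAL_MODE = 9999

# DECRPM reply: CSI ? Ps ; Pm $ y
_DECRPM = re.compile(r"\x1b\[\?(\d+);(\d)\$y")


def decrqm(mode: int) -> str:
    """Build the DECRQM request (CSI ? Ps $ p) for a private mode."""
    return f"\x1b[?{mode}$p"


def parse_decrpm(reply: str):
    """Return (mode, status) from a DECRPM reply, or None if it does not parse."""
    match = _DECRPM.fullmatch(reply)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported(reply: str, mode: int) -> bool:
    """A mode counts as supported if the terminal reports any status except 0."""
    parsed = parse_decrpm(reply)
    return parsed is not None and parsed[0] == mode and parsed[1] != 0


print(repr(decrqm(HYPOTHETICAL_MODE)))
print(is_supported("\x1b[?9999;2$y", HYPOTHETICAL_MODE))
```

An application would write the request to the terminal, read the reply with a timeout, and treat a missing or status-0 reply as "not supported".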

Regarding skipped grid cells in the emoji section, I'm really not sure whether that needs to be explicit. I'm happy to leave that an open question for now and see what others have to say.

Agreed. And if in doubt, well TEs prolly gonna do what they already do for CJK. So most likely there is no issue from that at all.

Note: above I said something about pictogram and SGR handling - what I meant there was to make clear how SGR attributes would apply to pictograms. Should a TE make attempts to underline a pictogram? BG color applied? What about FG? Bold? Thin? While I have a personal stance here, I also think it does not need to be specced out in detail, but maybe encourage TEs to apply them in a sensible way. What really wins here - idk yet myself. (Prolly color masking from FG is way too much, but BG/underline etc. makes total sense to me)

About the performance considerations:
I would not put something like that into the main document, as thats not part of the "spec". If at all, maybe into some addendum for implementation hints/details.

christianparpart · 2021-09-05T20:08:32Z

christianparpart
Sep 5, 2021
Maintainer Author

Thx guys for the feedback. I will integrate that and can hopefully give news ASAP; currently short on time. :)

christianparpart · 2021-09-06T09:08:13Z

christianparpart
Sep 6, 2021
Maintainer Author

https://github.com/contour-terminal/terminal-unicode-core/releases/tag/v0.1.0_prerelease_2

This now has your feedback integrated. I hope I did not miss anything, but ping me if so, or if we can improve on anything else. :)

jerch · 2021-09-06T09:27:34Z

jerch
Sep 6, 2021

I wonder, if regional indicator (RI, country flags) should also be standardized by this? When I was initially dealing with grapheme rules, I found them to be more tricky, but cannot remember why (was it because of stacking and right margin handling? Idk...)

Edit: Oh right, it was because of their 1+next rule, had kinda troubles to get their boundaries right for multiple flags in a row...
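
A minimal Python sketch of that "1+next" pairing, in case it helps: regional indicators (U+1F1E6..U+1F1FF) are consumed two at a time, and a leftover odd one stands alone. This only covers the RI part of the cluster rules and treats every other codepoint as its own cluster, which is a simplification for illustration. The function names are made up.

```python
def is_regional_indicator(ch: str) -> bool:
    """True for REGIONAL INDICATOR SYMBOL LETTER A..Z (U+1F1E6..U+1F1FF)."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def pair_regional_indicators(text: str) -> list:
    """Split text into clusters, pairing consecutive RIs two at a time.

    Simplification: anything that is not an RI becomes its own cluster,
    so combining marks, ZWJ sequences etc. are NOT handled here.
    """
    clusters = []
    i = 0
    while i < len(text):
        if (
            is_regional_indicator(text[i])
            and i + 1 < len(text)
            and is_regional_indicator(text[i + 1])
        ):
            clusters.append(text[i:i + 2])  # one flag = exactly two RIs
            i += 2
        else:
            clusters.append(text[i])  # lone RI or any other codepoint
            i += 1
    return clusters


# "DE" + "FR" written back to back: four RIs, two flags.
flags = "\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7"
print(len(pair_regional_indicators(flags)))
```

The tricky part with single-codepoint input is that the TE has to remember whether the previous RI already closed a pair, otherwise a row of flags gets split at the wrong boundaries.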

christianparpart · 2021-09-06T10:05:51Z

christianparpart
Sep 6, 2021
Maintainer Author

@jerch on the other hand, country flags are working just fine on my end with the above rules plus proper text shaping (maybe that is what can mess some TE devs up).

Because most TEs don't do any proper text shaping at all but only render per text character. Kitty does some tricks manually to get, for example, ZWJ emoji working. I chose to trust harfbuzz more than my own code.

I can do some additional tests later though.

jerch · 2021-09-06T10:13:06Z

jerch
Sep 6, 2021

@christianparpart Hmm yeah, the rules prolly cover RI just fine. Well it was more an issue on my end, how I ended up building the carry for cluster additions during single codepoint input (choosing a subpar abstraction).

christianparpart · 2021-09-06T10:21:11Z

christianparpart
Sep 6, 2021
Maintainer Author

@christianparpart Hmm yeah, the rules prolly cover RI just fine. Well it was more an issue on my end, how I ended up building the carry for cluster additions during single codepoint input (choosing a subpar abstraction).

If of interest, we could do that implementation details / recommendations addendum that covers some helpful insights on how to implement

grapheme cluster segmentation
emoji presentation segmentation
proper text shaping in the context of a terminal

We could also propose a C API for the unicode (not the Text shaping part) and a reference implementation. I think you still remember my RFC to https://github.com/contour-terminal/libunicode/blob/master/src/unicode/capi.h

jerch · 2021-09-06T10:25:58Z

jerch
Sep 6, 2021

If of interest we could do that implementation Details / recommendations addendum that covers some helpful insights on how to implement

Yes that would be good, as it would be valuable information to get things done (to me the scattered resources were more of a problem than the spec stuff itself).

christianparpart · 2021-09-06T11:01:30Z

christianparpart
Sep 6, 2021
Maintainer Author

Yes that would be good,

Okay. I will create that addendum based off my terminal text stack document then, as soon as I have some more dedicated time tonight or tomorrow night, and notify you guys then.

j4james · 2023-05-09T20:14:58Z

j4james
May 9, 2023

I came across an old discussion of the VS15/VS16 selectors in the VTE issue tracker the other day (see issue 2317), and they highlighted something in the Unicode spec which I hadn't noticed before: namely that it doesn't actually recommend VS15 changing the width.

Quoting from UAX11 East Asian Width:

UTS51 emoji presentation sequences behave as though they were East Asian Wide, regardless of their assigned East_Asian_Width property value.

And an emoji presentation sequence is defined as an emoji character followed by VS16 (for the official definition see here and here).

So that recommendation is clearly suggesting that VS16 would make a narrow emoji wide, but there isn't an equivalent recommendation saying a text presentation sequence should be narrow. That implies that they don't expect VS15 to have any effect on the width.
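
As a rough Python sketch of that reading: the width of a cluster comes from the East Asian Width of its base character, VS16 forces it to wide, and VS15 is deliberately ignored. `cluster_width` is a made-up helper, and the sketch assumes that any cluster containing VS16 is a valid emoji presentation sequence, which a real implementation would have to check against the emoji variation sequence data.

```python
import unicodedata

VS15 = "\uFE0E"  # text presentation selector
VS16 = "\uFE0F"  # emoji presentation selector


def cluster_width(cluster: str) -> int:
    """Column width of a non-empty grapheme cluster (simplified sketch).

    Assumption: a VS16 anywhere in the cluster marks a valid emoji
    presentation sequence; no check against the emoji data files is done.
    """
    base = cluster[0]
    width = 2 if unicodedata.east_asian_width(base) in ("W", "F") else 1
    if VS16 in cluster:
        width = 2  # emoji presentation sequences behave as East Asian Wide
    # VS15 is intentionally not handled: it never narrows the cluster.
    return width


print(cluster_width("\u00A9"))             # copyright sign, text default
print(cluster_width("\u00A9" + VS16))      # widened by VS16
print(cluster_width("\U0001F600" + VS15))  # stays wide despite VS15
```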

I know we reached the same conclusion here anyway, but I thought it was nice to know that the Unicode specs are in agreement on that point.

christianparpart · 2023-06-07T13:21:11Z

christianparpart
Jun 7, 2023
Maintainer Author

Thanks, @j4james. And sorry for the late response!

it doesn't actually recommend VS15 changing the width.

Yeah, I settled with that myself now. IIRC, I had some discussion on VS15 not changing width recently (past few months) with someone and it made sense to leave it, while VS16 should indeed increase the width (as in: ensure it's wide).

I will make sure the VT Unicode Core Spec I drafted reflects that ASAP (and also make sure we finish this ticket here). :-)

Have a sunny day,
Christian.
