Multi-width unicode characters are not supported #8

Closed
muesli4 opened this issue Dec 26, 2019 · 15 comments

@muesli4
Owner

muesli4 commented Dec 26, 2019

Unicode has some characters that occupy different widths even in a monospace font (multiples of the base width is my guess). In that case, cell formatting comes out wrong because it assumes that all characters have the same width.

It is unclear how one could determine the width of a unicode character. Sometimes it even seems to depend on the locale.

This can be fixed solely within the Cell type class because the algorithms rely only on that. Cutting within a character is an issue. In that case it is possible to replace it with spaces (in the drop functions). Unfortunately, all operations now require linear time.
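As a rough illustration of the idea above (replacing a character that gets cut in half with spaces), here is a minimal sketch. It does not use the library's actual Cell class; `charWidth`, `visibleLength` and `dropLeftCols` are hypothetical names, and the width table is a crude approximation covering only a few East Asian ranges.

```haskell
import Data.Char (ord)

-- | Hypothetical, approximate width estimate: 2 columns for a few
-- well-known East Asian wide ranges, 1 column otherwise. A real table
-- would be derived from EastAsianWidth.txt.
charWidth :: Char -> Int
charWidth c
  | inRange 0x1100 0x115F = 2 -- Hangul Jamo initial consonants
  | inRange 0x2E80 0xA4CF = 2 -- CJK radicals through Yi (approximate)
  | inRange 0xAC00 0xD7A3 = 2 -- Hangul syllables
  | inRange 0xF900 0xFAFF = 2 -- CJK compatibility ideographs
  | inRange 0xFF01 0xFF60 = 2 -- fullwidth forms
  | otherwise             = 1
  where
    n = ord c
    inRange lo hi = n >= lo && n <= hi

-- | Width of a string in terminal columns (linear time).
visibleLength :: String -> Int
visibleLength = sum . map charWidth

-- | Drop @n@ columns from the left. If the cut falls inside a wide
-- character, the remaining half is replaced with a space so that the
-- result still has the expected number of columns.
dropLeftCols :: Int -> String -> String
dropLeftCols n s
  | n <= 0 = s
dropLeftCols _ [] = []
dropLeftCols n (c : cs)
  | w <= n    = dropLeftCols (n - w) cs
  | otherwise = replicate (w - n) ' ' ++ cs
  where
    w = charWidth c
```

Dropping one column from a string of two wide characters then yields a space followed by the second character, which keeps the column count consistent at the cost of losing the cut character.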

@ony

ony commented Jan 2, 2020

@simonmichael says in this comment:

As it says: "From Pandoc." I guess Pandoc gets it from the official Unicode standard.

Looks like Pandoc extracted charWidth to doclayout but still uses hard-coded values without providing information about their origin. They should come from the Unicode standard, but which version?

@simonmichael, I think both libraries (this and hledger) can benefit from doclayout, but it is not yet on Stackage.

@muesli4
Owner Author

muesli4 commented Jan 4, 2020

What I read is that those things are not completely standardized. With different locales there is ambiguity for some characters and it also depends on the font.

Another solution would be to use https://github.com/JuliaStrings/utf8proc/blob/20672dba69bf463be22f6c9c216d858c9d116bb6/utf8proc.h#L646 but that adds utf8proc as dependency.

@ony

ony commented Jan 5, 2020

Another solution would be to use https://github.com/JuliaStrings/utf8proc/blob/20672dba69bf463be22f6c9c216d858c9d116bb6/utf8proc.h#L646 but that adds utf8proc as dependency.

This falls under the same category: "parse the Unicode report". First they convert EastAsianWidth.txt to CharWidths.txt with this parser, and then they generate a table in C code.
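To make the "parse the Unicode report" approach concrete, here is a small sketch of reading EastAsianWidth.txt lines into ranges. It is not the utf8proc parser; the type and function names are hypothetical. It assumes the usual data file layout of `START..END;CLASS # comment` (or a single code point), and tolerates spaces around the semicolon.

```haskell
import Data.Char (isSpace)
import Data.Maybe (mapMaybe)
import Numeric (readHex)

-- | One entry of the data file: an inclusive code point range and its
-- East Asian Width class (e.g. "W", "F", "A", "N", "Na", "H").
data WidthRange = WidthRange
  { rangeStart :: Int
  , rangeEnd   :: Int
  , widthClass :: String
  } deriving (Eq, Show)

-- | Parse a single line; comment-only and blank lines give Nothing.
parseLine :: String -> Maybe WidthRange
parseLine raw =
  case break (== ';') (takeWhile (/= '#') raw) of
    (codes, ';' : cls) -> do
      let (lo, rest) = break (== '.') (trim codes)
      start <- hex lo
      end   <- if null rest then Just start else hex (dropWhile (== '.') rest)
      let cls' = trim cls
      if null cls' then Nothing else Just (WidthRange start end cls')
    _ -> Nothing
  where
    trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

    hex :: String -> Maybe Int
    hex s = case readHex s of
      [(n, "")] -> Just n
      _         -> Nothing

-- | Keep only the ranges that are unambiguously two columns wide
-- (classes W and F); ambiguous class A is deliberately left out.
wideRanges :: String -> [WidthRange]
wideRanges = filter isWide . mapMaybe parseLine . lines
  where
    isWide r = widthClass r `elem` ["W", "F"]
```

Leaving out class "A" (ambiguous) is a design choice in this sketch: those are the characters whose width depends on context, which is the ambiguity discussed elsewhere in this thread.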

@ony

ony commented Jan 5, 2020

What I read is that those things are not completely standardized. With different locales there is ambiguity for some characters and it also depends on the font.

This is "self-driving car"... When many applications do not adhere to a standard, this is called absence of a standard. This way applications start to add quirks at the points of contact between each other instead of following a common interface.
By giving up on conforming to Unicode recommendations you contribute to that.

@hasufell

I guess I have this problem in ghcup:

ghcup-table-layout

code is here: https://gitlab.haskell.org/haskell/ghcup-hs/-/blob/master/app/ghcup/Main.hs#L1411-1458

@muesli4
Owner Author

muesli4 commented Sep 22, 2020

@hasufell Are you using multi-width characters? Because it doesn't seem that way. It seems this is caused by the backend-specific control characters (see #4). Please have a look at the documentation of the Formatted type. You may be able to write an instance of System.Console.Pretty that uses Formatted. Unfortunately, in the implementation the format instructions are not separate from the text. But this is necessary to measure the text width. However, it should be relatively easy to refactor this in the library.

If your problem is not related to multi-width characters and the Formatted type does not solve your problem, would you be so kind as to open a new issue?

edit: I just noticed that there is #11 which may be relevant to your use-case. You could also put the values on different lines, then the per-cell color is not an issue.

@muesli4
Owner Author

muesli4 commented Sep 22, 2020

What I read is that those things are not completely standardized. With different locales there is ambiguity for some characters and it also depends on the font.

This is "self-driving car"... When many applications do not adhere to a standard, this is called absence of a standard. This way applications start to add quirks at the points of contact between each other instead of following a common interface.
By giving up on conforming to Unicode recommendations you contribute to that.

@ony Trust me, I want to adhere to the standard as much as possible. In fact, that is the main reason why I do not want to adopt it yet for the default instance. If you can show me that there is a standardized way to determine the character width of unicode characters, I will be the first to accept it. That doesn't mean we can't write an instance at all in the meantime. Contributions are welcome and I'm happy to work on this together or provide any support that is necessary.

@hasufell

Please have a look at the documentation of the Formatted type. You may be able to write an instance of System.Console.Pretty that uses Formatted. Unfortunately, in the implementation the format instructions are not separate from the text. But this is necessary to measure the text width. However, it should be relatively easy to refactor this in the library.

Sorry, I can't really follow this or how to fix it.

You could also put the values on different lines, then the per-cell color is not an issue.

That's not a possibility

@simonmichael

@muesli4, FWIW: there are some helpers (charWidth, strWidth, textWidth, stripAnsi) in hledger-lib which could give inspiration for this and #11.
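For readers wondering what such a helper involves, here is a minimal sketch of stripping terminal escape sequences before measuring. It is not the hledger-lib implementation; `stripCsi` and `visibleWidth` are hypothetical names, and only CSI sequences (ESC `[` ... final byte), which cover the usual color codes, are handled.

```haskell
-- | A CSI sequence ends with a "final byte" in the range '@'..'~'.
isFinalByte :: Char -> Bool
isFinalByte c = c >= '@' && c <= '~'

-- | Remove CSI escape sequences such as "\ESC[31m" (set red) or
-- "\ESC[0m" (reset). Other escape sequences are left untouched.
stripCsi :: String -> String
stripCsi ('\ESC' : '[' : rest) =
  stripCsi (drop 1 (dropWhile (not . isFinalByte) rest))
stripCsi (c : cs) = c : stripCsi cs
stripCsi []       = []

-- | Number of characters that remain visible after stripping. This
-- still counts every character as one column, so it would have to be
-- combined with a per-character width function for wide characters.
visibleWidth :: String -> Int
visibleWidth = length . stripCsi
```

With this, a colored cell such as `"\ESC[31mred\ESC[0m"` would be measured as three columns rather than twelve, which is the kind of misalignment described in the earlier comments.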

@hasufell

hasufell commented Sep 22, 2020

Yes, the functions @simonmichael describes work well. I dropped my use of table-layout and reimplemented simple row-column padding with that: https://gitlab.haskell.org/haskell/ghcup-hs/-/commit/40a1cc98c6ea7eb06eeca7a37915a5075451420b#c84b8cca7fc11e84e49df98e5e56e35d46791361_1560_1558

@ony

ony commented Sep 27, 2020

@ony .... If you can show me that there is a standardized way to determine the character width of unicode characters, I will be the first to accept it. ...

I think I gave some links already in my comment to simonmichael/hledger#905 . Related standards:

  • Unicode Standard Annex #11 for East Asian Width, which defines relative width in fonts, and the table EastAsianWidth.txt associated with it.
  • The C function wcwidth, which is part of the POSIX.1-2001 and POSIX.1-2008 standards and gives some meaning to the relation between "width" and terminal columns. It should be available on any system that promises conformance, including Linux, FreeBSD, Windows, MacOS. For Haskell we have the unmaintained wcwidth package and, as far as I know, no other Haskell standard libraries that provide this information.
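As an illustration of the second item, here is a minimal sketch of calling the C function wcwidth directly through the FFI, without the wcwidth package. It assumes a POSIX system whose C library provides `wcwidth`; the wrapper name `terminalWidth` is hypothetical. The result depends on the process's LC_CTYPE locale, which is why the wrapper stays in IO.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Data.Char (ord)
import Foreign.C.Types (CInt (..), CWchar (..))

-- Binding to POSIX wcwidth(3): int wcwidth(wchar_t wc);
foreign import ccall unsafe "wchar.h wcwidth"
  c_wcwidth :: CWchar -> IO CInt

-- | Number of terminal columns for a character according to the C
-- library, or Nothing if wcwidth reports -1 (for example a control
-- character, or a character unknown in the current locale).
terminalWidth :: Char -> IO (Maybe Int)
terminalWidth c = do
  w <- c_wcwidth (fromIntegral (ord c))
  pure (if w < 0 then Nothing else Just (fromIntegral w))
```

A caveat of this approach is that the answer reflects the Unicode version and locale tables of the installed C library, not of the terminal emulator that eventually renders the text.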

I agree that there is no clear standard "Unicode for terminals".
But if you look into my comment where I traced the origins of code similar to what @hasufell adopted from hledger, which in turn was adopted from pandoc, you'll see how others try to adhere to Unicode and bypass wcwidth.
Since this library is quite generic and needs to pad with spaces to align to specific columns, I thought it might want to implement it properly or, better, spin off a dependency that provides a terminal-specific interpretation of Unicode, or help in reviving Haskell bindings to wcwidth.

P.S. This cross-repo thread started with cheese 🧀 (part of Unicode 8) in someone's financial report.
P.P.S. My memory tells me that I also went through the Julia language, but it is not mentioned in my comment :( . Anyway, check JuliaStrings/utf8proc#114 for an example where they refer to EastAsianWidth.txt.
P.P.P.S. @hasufell, tight/tough world of Haskell developers/enthusiasts using Exherbo Linux.

@ony

ony commented Sep 28, 2020

There is more hardship with "ZERO WIDTH JOINER" (ZWJ), which may turn 5 "characters" into a single glyph if both the terminal and the font support it. To make it predictable we may want to strip ZWJ.
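A tiny sketch of the stripping idea, with the hypothetical name `stripZwj`: removing U+200D means the joined sequence is always rendered and measured as its separate components, regardless of terminal and font support.

```haskell
-- | Remove every ZERO WIDTH JOINER (U+200D) from a string.
stripZwj :: String -> String
stripZwj = filter (/= '\x200D')

-- | Example sequence: man, ZWJ, woman, ZWJ, girl. That is five code
-- points which a supporting terminal and font may draw as one glyph.
familyEmoji :: String
familyEmoji = "\x1F468\x200D\x1F469\x200D\x1F467"
```

After `stripZwj familyEmoji`, three code points remain, each of which can be measured on its own.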

@Xitian9
Collaborator

Xitian9 commented Aug 31, 2021

Would you be open to a PR incorporating the functionality @ony and @simonmichael suggested above? Is the blocker here just the work needed to do it, or are there other considerations?

@muesli4
Owner Author

muesli4 commented Sep 2, 2021

I am sorry for the delay with this issue, I simply do not have a lot of time at the moment. However, I created a type class Cell that was intended to be used for this purpose. The functionality may be implemented as a (parametrized) newtype for now either in this library or another one. Perhaps, it is better to provide it as another library, then we do not need to change the dependencies (if there are any).

Would you be open to a PR incorporating the functionality @ony and @simonmichael suggested above?

I am happy to accept pull requests. However, it would be good if you could give an idea of your implementation.

Is the blocker here just the work needed to do it, or are there other considerations?

When I was looking into this I read that there is not really a standard and it seemed more like a hack to me that sometimes works and other times not. But I may be wrong and I don't exactly remember. But then again, if we provide this as an opt-in feature I see no problem at all.

@muesli4
Owner Author

muesli4 commented Sep 27, 2021

I manually added the changes from the pull request.
