Proposal: Select by character or display indices #2724
This affects kak-lsp too: kakoune-lsp/kakoune-lsp#98

"character" is kind of a hazy concept in the world of Unicode, and the official documentation tries to avoid it. Some common alternatives include (a sketch after this list shows how they diverge on the same string):

- Kakoune's "UTF-8 bytes" coordinates are great for other tools that use UTF-8 internally, but not great for tools that use other encodings.
- "code units" coordinates would likewise be great for tools that use UTF-16 internally, but not great for tools that use UTF-8. They also wouldn't be great for tools that use UTF-32, but the only difference would be astral-plane characters (U+10000 and above), which aren't that common, so maybe it would still be worthwhile.
- "codepoints" is a nice encoding-agnostic system, but for selections it has the same problem as the current "UTF-8 bytes": since codepoints are not fixed-size in most encodings, you still need the original buffer content to convert codepoint coordinates into the UTF-8 bytes or code units your program can work with. Also, if these selection coordinates are visible in the UI, it would be weird that they don't match the columns the user actually sees.
- "grapheme clusters" is encoding-agnostic, but has the same "needs the whole buffer content" issue as UTF-8 bytes, code units, and codepoints. It also requires a copy of the Unicode character database to determine the edge of each cluster, which can make for Fun Times when (for example) Kakoune is using data from libc, and a plugin is using data from the JRE, and the two data sources disagree. It does make more sense in the UI, though.
- "character cells" is basically like "grapheme clusters", except that it's not strictly speaking defined by the Unicode standard. Unicode defines particular character-cell widths for some characters, but not all, so implementors have to guess. See the comment at the top of Markus Kuhn's wcwidth() implementation.
Hello,

So first, one thing to keep in mind is that Kakoune does not require the buffer content to be Unicode: it will interpret it as UTF-8, but should tolerate non-UTF-8 content (although there are no strong guarantees on what it will do with it). That means that giving anything other than byte offsets is going to be ambiguous for non-UTF-8 buffer contents, as there is no well-defined method I know of for handling invalid UTF-8 text.

I think we can all agree that, in retrospect, the choice of UTF-16 by Java and Windows was a mistake; UTF-16 lost, UTF-8 won. I would be tempted to say that tools using another encoding are the ones that need fixing: if your language of choice's strings enforce UTF-16 (or any encoding, including UTF-8), maybe using strings to store the buffer contents is not a good idea and you should be using a byte array instead... Unfortunately, that does not solve the status quo; the language-server-protocol is going to be using UTF-16 for the time being.

Internally, Kakoune actually uses 3 different horizontal coordinates: bytes, characters, and columns. We could relatively easily expose any of those, but as described by @Screwtapello, none is entirely satisfying by itself.

One additional complication is timestamping. Kakoune accepts input that does not match the current buffer state (say, for a ranges-highlighter input); it updates those coordinates using the buffer changes vector (which tracks modifications made to the buffer), but those changes only give information at the byte level. Supporting anything else would mean storing not only the count of bytes added/removed, but also the count of columns and/or the count of codepoints... I am not really looking forward to that.

All that to say, I have no real solution to this problem. I am unconvinced any of the alternatives to the status quo is significantly better, and I think the status quo is the most robust solution, because it's the only one which is unambiguous. But that means we cannot solve the mismatched-encoding problem from Kakoune's side, and other tools need to be fixed to stop enforcing an encoding on arbitrary bytes (frankly, there is no technical reason to do that; LSP uses UTF-16 because the designers were lazy and exposed a VSCode implementation detail to the world).
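To illustrate the timestamping point above, here is a minimal sketch, assuming a hypothetical `Change` record that mirrors the byte-level bookkeeping described (illustrative names, not Kakoune's actual structures). A stale byte coordinate can be replayed through such changes, but a codepoint or column delta was never recorded, so those coordinates cannot be updated the same way:

```rust
#[derive(Clone, Copy)]
struct Change {
    at: usize,          // byte offset where the change happened
    byte_delta: isize,  // bytes added (positive) or removed (negative)
}

// Bring a byte coordinate taken at an older timestamp up to date by
// replaying every change recorded since then.
fn update_byte_coord(coord: usize, changes: &[Change]) -> usize {
    changes.iter().fold(coord, |c, ch| {
        if ch.at <= c {
            // Shift the coordinate, clamping to the change position so a
            // deletion spanning the coordinate lands at its start.
            ((c as isize + ch.byte_delta).max(ch.at as isize)) as usize
        } else {
            c
        }
    })
}

fn main() {
    // Coordinate 20, after inserting 3 bytes at 5 and deleting 4 bytes at 10.
    let changes = [Change { at: 5, byte_delta: 3 }, Change { at: 10, byte_delta: -4 }];
    assert_eq!(update_byte_coord(20, &changes), 19);
    println!("updated coordinate: {}", update_byte_coord(20, &changes));
}
```

Doing the same for codepoint or column coordinates would require each `Change` to also carry codepoint and column deltas, which is exactly the extra bookkeeping described above.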
Oh, I skipped part of my reaction, which is: alright, I get that, for binary-file reasons, we'll want byte coordinates.
OK... updated proposal:
2 and 3 are what I need, minimally, to make
To fully fix
I've changed the title to be more current.

I've started working on this. Currently, it looks like this: Column values given to
Using suffixes instead of a
The biggest complication to implementing this is when
I hope you mean "codepoints" here.
I feel like it should be pretty important for external, asynchronous processes like parinfer-rust and kak-lsp, since the user might have done more stuff between the message being sent and the response being received. Does

Yes
Expansions such as `%val{selection_desc}` expose columns as byte indices within the line, and commands such as `select` take selections in terms of byte index within the line. This makes integrations hard. In my experience, byte-index columns aren't useful outside of Kakoune. So far, I have Unicode issues with `parinfer-rust` that I will have to fix by converting coordinates to characters, and I have Unicode issues with my new selection tool based on this as well. The selection tool is written in Clojure, and therefore uses Java strings, so reading the file produces UTF-16 characters anyway; the UTF-8 byte index can't be used directly for converting, and if a character has multiple possible representations, I'll have to guess which one occurred in order to interpret the bytes.

Further, converting these requires extra content from the buffer. If I have a `$kak_selection` that starts past column 1, I need to know the entire text of the starting line in addition to the text of the selection I probably want. We don't have access to the text of the buffer outside of selections unless we run some keys in draft mode and save the value to an option.
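Here is a hedged sketch of the conversion work this implies, using 0-based offsets for simplicity (Kakoune's actual columns are 1-based). Both helper names are hypothetical, and both need the whole line's text up to the column, not just the selection's own text:

```rust
// Convert a byte column into a codepoint index by counting the
// codepoints that start before the byte column.
fn byte_col_to_char_col(line: &str, byte_col: usize) -> usize {
    line.char_indices().take_while(|(i, _)| *i < byte_col).count()
}

// Convert a byte column into the UTF-16 code-unit index a Java/Clojure
// string would need. Assumes byte_col falls on a character boundary;
// slicing panics otherwise.
fn byte_col_to_utf16_col(line: &str, byte_col: usize) -> usize {
    line[..byte_col].encode_utf16().count()
}

fn main() {
    let line = "(naïve🎉 foo)";
    // 11 bytes precede the space: "(" (1) + "naïve" (6) + "🎉" (4).
    assert_eq!(byte_col_to_char_col(line, 11), 7);
    assert_eq!(byte_col_to_utf16_col(line, 11), 8);
}
```

Note that neither conversion is possible from the selection text alone; the line prefix must be fetched from the buffer first, which is the access problem described above.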
So, I propose that byte indices not be exposed for scripting. Instead, all expansions and commands would use character indices. The user could still split diacritics and other compound characters, but could never land partway through a multi-byte character encoding.
No matter how it is sliced, this would be a bit of work. There are a couple of ways to do it. One is to audit everything at the user-interface layer (commands and expansions) and ensure they provide the right data. Another is to encapsulate this at the buffer layer and not expose byte indices to the rest of Kakoune.

IMHO, both add internal complexity, and both reduce interface complexity. The latter contains the complexity better, but is harder. It would be nice to figure out how to break it down into smaller steps.
(Note: It would be possible to make parallel expansions that use char indices - some already exist - and add options to commands like `select`. I'm proposing the big change here because I think it'll be better for Kakoune.)