Convert positions from LSP coordinates to Kakoune coordinates #98

Screwtapello · 2018-10-08T06:51:19Z

The LSP specification (version 3.0) says:

A position inside a document (see Position definition below) is expressed as a zero-based line and character offset. The offsets are based on a UTF-16 string representation. So a string of the form a𐐀b the character offset of the character a is 0, the character offset of 𐐀 is 1 and the character offset of b is 3 since 𐐀 is represented using two code units in UTF-16.

Meanwhile, Kakoune uses one-based line and character offsets, and seems to count 1 for every kind of character, including basic ASCII, Basic Multilingual Plane characters, astral plane characters like emoji, and individual combining characters.

Currently kak-lsp converts positions by adding 1 (converting from zero-based to one-based), but does not account for the difference between codepoints and UTF-16 code units.
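To make the mismatch concrete, here is a small Python sketch that computes the UTF-16 code unit offset of each codepoint in the spec's a𐐀b example. The helper name `utf16_offsets` is made up for illustration and is not part of kak-lsp or Kakoune.

```python
def utf16_offsets(line: str) -> list[int]:
    """Return the UTF-16 code unit offset of each codepoint in `line`."""
    offsets = []
    units = 0
    for ch in line:
        offsets.append(units)
        # Codepoints above U+FFFF need a surrogate pair, i.e. two code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return offsets


# "a𐐀b": 𐐀 is U+10400, an astral-plane character.
print(utf16_offsets("a\U00010400b"))  # [0, 1, 3]
```

The codepoint offsets of the same string are 0, 1, 2, so adding 1 to the LSP offset of b gives 4 where a codepoint-counting consumer would expect 3.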

ul · 2018-10-08T07:14:26Z

Good catch!

ul · 2018-10-08T09:45:18Z

Do you have any ideas on how to fix it efficiently? It looks like a generic solution requires kak-lsp to track and analyze the contents of open buffers =(

@mawww Do you know of anything already implemented on the Kakoune side that might help with such a conversion?

mawww · 2018-10-08T22:40:44Z

Argh, Microsoft, again? I thought we were friends... More seriously, it bothers me that they cannot let UTF-16 die in the MS world. UTF-8 won; everybody uses UTF-8 except to access the Win32 API... It's even stupider as the text documents themselves are expected to be transferred as UTF-8. Frankly, I view this as a bug in the LSP spec, and ideally we should lobby them to fix it, but I doubt this will get fixed anytime soon...

Kakoune uses 0-based byte coordinates for selections internally, and exposes them as 1-based byte coordinates to the external world (because user-side lines/columns are traditionally 1-based, as seen in compiler error messages for example).

I would find it really ugly for kak-lsp to have to store the buffer content itself just for that case. An alternate solution (that I am not really happy with either) would be to have a way to specify UTF-16 based coordinates to Kakoune (say :select -utf16 ...), and handle the ugly details there.

The best alternative remains to remind the LSP spec writer that there were 3 sane alternatives (utf8 byte coordinates, column coordinates or codepoint coordinates) and for some strange/historical reason they went with another one...

Yeah, I am a bit annoyed at you Microsoft 😄

Edit: Here is the discussion on the lsp side: microsoft/language-server-protocol#376

Screwtapello · 2018-10-08T23:41:37Z

(to be fair to Microsoft, I'm guessing this particular API decision comes from VS Code being written in JavaScript, whose spec requires UTF-16 strings, not particularly the Win32/Cocoa/Java APIs)

Screwtapello · 2018-10-08T23:54:53Z

As discussed on IRC, kak-lsp wouldn't necessarily need to cache the entire document: if you had a list of the offsets at which astral-plane characters appear, you could take each LSP coordinate and binary-search in the list to see how many astral-plane characters appear before it, and subtract that number from the offset to find the codepoint offset.
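The binary-search idea could look roughly like the following Python sketch. It assumes the list holds, per line, the UTF-16 offsets at which astral-plane characters start; both function names are hypothetical, and the list is built from the line text here only to keep the example self-contained.

```python
from bisect import bisect_left


def astral_utf16_offsets(line: str) -> list[int]:
    """UTF-16 offsets at which astral-plane characters start (sorted)."""
    offsets = []
    units = 0
    for ch in line:
        if ord(ch) > 0xFFFF:
            offsets.append(units)
            units += 2  # surrogate pair
        else:
            units += 1
    return offsets


def utf16_to_codepoint(offset: int, astral: list[int]) -> int:
    """Convert a UTF-16 code unit offset to a codepoint offset."""
    # Each astral character starting before `offset` contributed one
    # extra code unit, so subtract how many of them precede it.
    return offset - bisect_left(astral, offset)


astral = astral_utf16_offsets("a\U00010400b")
print(astral)                         # [1]
print(utf16_to_codepoint(3, astral))  # 2
```

Note that the result is a codepoint offset, not a byte offset, so this alone does not produce the byte-based columns that Kakoune selections use.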

As for finding astral-plane characters, some quick investigation with Python:

>>> "\uffff".encode('utf-8')
b'\xef\xbf\xbf'
>>> "\U00010000".encode('utf-8')
b'\xf0\x90\x80\x80'
>>> "\U0010FFFF".encode('utf-8')
b'\xf4\x8f\xbf\xbf'

... suggests that, in valid UTF-8, any byte whose value is >= 0xf0 is the initial byte of an astral-plane character. That should be pretty easy to search for, without having to transcode anything to UTF-16 and count code units.
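As a minimal sketch of that search, assuming the input is valid UTF-8 (where lead bytes 0xf0 through 0xf4 start four-byte sequences), the scan can be a single pass over the raw bytes. The function name `astral_byte_offsets` is invented for this example.

```python
def astral_byte_offsets(data: bytes) -> list[int]:
    """Byte offsets of the lead bytes of astral-plane characters.

    Assumes `data` is valid UTF-8; continuation bytes are always
    below 0xC0, so only lead bytes can be >= 0xF0.
    """
    return [i for i, b in enumerate(data) if b >= 0xF0]


print(astral_byte_offsets("a\U00010400b".encode("utf-8")))  # [1]
print(astral_byte_offsets("\uffff".encode("utf-8")))        # []
```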

> Kakoune uses 0-based byte coordinates for selections internally

Wait, so the line:column indicator in the status-bar (which seems to count codepoints) is unrelated to the line.column syntax used in ranges and selections? That seems... misleading.

mawww · 2018-10-09T02:26:33Z

> Wait, so the line:column indicator in the status-bar (which seems to count codepoints) is unrelated to the line.column syntax used in ranges and selections? That seems... misleading.

Ranges and selections use <line>.<byte since line start>, while the indication given in the status line is <line>:<column since line start>. Not sure if that is misleading or not. Both are displayed 1-based, while internally they are stored 0-based (the byte ones, that is; we do not store column information).
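Putting the two coordinate systems side by side, a conversion from an LSP position (0-based line, UTF-16 code unit offset) to a Kakoune-style <line>.<byte> coordinate (both 1-based) might be sketched as below. This assumes the converter has the text of the line available, which is exactly the cost debated in this thread; `lsp_to_kakoune` is a hypothetical name, not an existing kak-lsp or Kakoune function.

```python
def lsp_to_kakoune(line_text: str, lsp_line: int, lsp_character: int) -> tuple[int, int]:
    """Map an LSP position to a 1-based (line, byte column) pair."""
    units = 0        # UTF-16 code units consumed so far
    byte_offset = 0  # UTF-8 bytes consumed so far
    for ch in line_text:
        if units >= lsp_character:
            break
        units += 2 if ord(ch) > 0xFFFF else 1
        byte_offset += len(ch.encode("utf-8"))
    # Both components are exposed 1-based on the Kakoune side.
    return lsp_line + 1, byte_offset + 1


# b in "a𐐀b" is at LSP character 3; it starts at byte 5 (0-based) in UTF-8.
print(lsp_to_kakoune("a\U00010400b", 0, 3))  # (1, 6)
```

An offset pointing past the end of the line is clamped to one past the last byte in this sketch; a real implementation would need to decide how to treat such input.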

krobelus · 2022-12-30T16:05:41Z

fixed by fb972fc (Use UTF-16 code unit offsets instead of code point offsets, as per LSP, 2022-09-03)

ul added the bug and high priority labels Oct 8, 2018

ul added the help wanted label Oct 8, 2018

Screwtapello mentioned this issue Oct 9, 2018

Be specific about the units of ranges and cursor positions. mawww/kakoune#2484

Merged

Screwtapello mentioned this issue Feb 6, 2019

Proposal: Select by character or display indices mawww/kakoune#2724

Closed

mawww mentioned this issue Feb 15, 2019

UTF-8 mode clangd/clangd#3

Closed

ul mentioned this issue Apr 6, 2019

The best way to select and highlight based on character coordinates mawww/kakoune#2839

Closed

ul added a commit that referenced this issue Apr 22, 2019

wip better character offset handling, ref #191, #98, #40

cb626b4

ul added a commit that referenced this issue Apr 22, 2019

wip better character offset handling, ref #191, #98, #40

3fc9b7e

krobelus closed this as completed Dec 30, 2022
