LanguageServer.jl uses characters for position rather than UTF-16 codeunits #401

non-Jedi · 2019-10-16T20:22:00Z

As per the quote in #400:

A position inside a document (see Position definition below) is expressed as a zero-based line and character offset. The offsets are based on a UTF-16 string representation. So a string of the form a𐐀b the character offset of the character a is 0, the character offset of 𐐀 is 1 and the character offset of b is 3 since 𐐀 is represented using two code units in UTF-16.

The character offset should be in terms of UTF-16 codeunits. As far as
I can tell, LanguageServer.jl only uses UTF-8 internally and works in
terms of characters (codepoints) rather than codeunits. eglot works around
this
but not all editors might. I have no idea how VSCode behaves.

So for e.g. the file:

𐐀𐐀𐐀="hello"
𐐀𐐀𐐀𐐀=𐐀𐐀𐐀

Asking for line 1 position 6 should show the hover for 𐐀𐐀𐐀𐐀 since
the 7th UTF-16 codeunit is still within that variable. Instead it
shows the hover for 𐐀𐐀𐐀:

client-request (id:105) Wed Oct 16 16:29:55 2019:
(:jsonrpc "2.0" :id 105 :method "textDocument/hover" :params
          (:textDocument
           (:uri "file:///home/adam/tmp/test.jl")
           :position
           (:line 1 :character 6)))

server-reply (id:105) Wed Oct 16 16:29:55 2019:
(:id 105 :jsonrpc "2.0" :result
     (:contents
      [(:language "julia" :value "𐐀𐐀𐐀 = \"hello\"")]))

There's some discussion about the awkwardness of using UTF-16 code units at microsoft/language-server-protocol#376 and a survey of other implementations at https://github.com/Avi-D-coder/lsp-range-unit-survey.

The text was updated successfully, but these errors were encountered:

davidanthoff · 2019-10-25T22:51:24Z

Wow, that is insane!

Does that mean we should use some UTF16 string type internally in the LS? Or is there some easy way to convert indices?

davidanthoff · 2019-10-26T00:04:07Z

@StefanKarpinski do you have advice what the best strategy here is? Right now we hold a copy of each source file as a julia String instance in the language server, i.e. a UTF-8 encoded copy. But apparently we need to use UTF-16 codepoint offsets in the protocol to communicate with VS Code... So we will receive positions that we need to convert from UTF-16 codepoint offsets into UTF-8 codepoints, and the other way around as well, when we send positions via the protocol.

My gut feeling is that we want to stick with the String type from julia for our storage, and then maybe convert indices? Do you know of julia code around somewhere that does that?

davidanthoff · 2019-10-26T00:10:51Z

I found the following two links in the discussion of this problem on the LS repo that supposedly contain code that convert indices in the way we need it. One in Typescript: https://github.com/NeekSandhu/onigasm/blob/master/src/OnigString.ts. And another one in C++: https://github.com/atom/node-oniguruma/blob/9e3334b4fbe50752ec672fed29c48fc583e44485/src/onig-string.cc#L32-L85.

non-Jedi · 2019-10-26T00:28:03Z

Since the "Windows languages" use UTF-16 internally, I think that the LS protocol was designed with the intention of representing the document as UTF-16 in all the clients/servers, insane as that is when an encoding-agnostic approach like unicode codepoints already exists. https://github.com/JuliaStrings/LegacyStrings.jl includes a type for UTF-16 if you want to go down that route.

StefanKarpinski · 2019-10-26T13:22:19Z

But apparently we need to use UTF-16 codepoint offsets in the protocol to communicate with VS Code... So we will receive positions that we need to convert from UTF-16 codepoint offsets into UTF-8 codepoints, and the other way around as well, when we send positions via the protocol.

Oh wow, that's unfortunate. UTF-16, the worst of all worlds encoding. I guess no one has told Microsoft that UTF-8 won? 😝

My gut feeling is that we want to stick with the String type from julia for our storage, and then maybe convert indices? Do you know of julia code around somewhere that does that?

I think that instinct is on point. I don't have any code handy for this, but if you can describe what translation you need, I could try to write something—shouldn't be too tricky. Do you want utf16index(s, i) where i is a UTF-8 code unit index or where i is a character index? Either one is going to be O(n) but shouldn't require transcoding the entire string, just scanning and counting.

davidanthoff · 2019-10-27T01:22:01Z

We probably need @ZacLN's input here on what exactly we need, not clear to me right now. I think (but not sure) a lot of times position info is transmitted as a line/column combo, so what we probably need is to convert only those relative column offsets between the two index systems...

non-Jedi · 2019-10-30T20:48:36Z

Do you want utf16index(s, i) where i is a UTF-8 code unit index or where i is a character index?

LanguageServer.jl currently uses character index (same as codepoint, right?) for everything, so that's going to be easiest to convert to/from.

StefanKarpinski · 2019-10-31T02:31:30Z

So you need to scan from the start of the string, counting characters up to the character index and add 1 for BMP characters and 2 for non-BMP characters to compute the UTF-16 code unit index.

XVilka · 2019-11-21T05:05:50Z

Would be nice to stick to the UTF-8 where available.
Most self-respecting LSP implementations prefer UTF-8 instead of the UTF-16 if both server and client support this: https://github.com/Avi-D-coder/lsp-range-unit-survey

davidanthoff · 2019-12-14T20:49:56Z

I'm closing this issue because we now have a PR that is tracking things.

davidanthoff assigned davidanthoff and ZacLN Oct 25, 2019

davidanthoff added the bug label Oct 25, 2019

davidanthoff added this to the v0.7.0 milestone Oct 25, 2019

davidanthoff mentioned this issue Nov 15, 2019

Interpret Position offset as UTF16 #419

Closed

davidanthoff modified the milestones: v0.7.0, v1.0.1 Dec 14, 2019

davidanthoff closed this as completed Dec 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LanguageServer.jl uses characters for position rather than UTF-16 codeunits #401

LanguageServer.jl uses characters for position rather than UTF-16 codeunits #401

non-Jedi commented Oct 16, 2019 •

edited

Loading

davidanthoff commented Oct 25, 2019

davidanthoff commented Oct 26, 2019

davidanthoff commented Oct 26, 2019

non-Jedi commented Oct 26, 2019 •

edited

Loading

StefanKarpinski commented Oct 26, 2019 •

edited

Loading

davidanthoff commented Oct 27, 2019

non-Jedi commented Oct 30, 2019

StefanKarpinski commented Oct 31, 2019

XVilka commented Nov 21, 2019

davidanthoff commented Dec 14, 2019

LanguageServer.jl uses characters for position rather than UTF-16 codeunits #401

LanguageServer.jl uses characters for position rather than UTF-16 codeunits #401

Comments

non-Jedi commented Oct 16, 2019 • edited Loading

davidanthoff commented Oct 25, 2019

davidanthoff commented Oct 26, 2019

davidanthoff commented Oct 26, 2019

non-Jedi commented Oct 26, 2019 • edited Loading

StefanKarpinski commented Oct 26, 2019 • edited Loading

davidanthoff commented Oct 27, 2019

non-Jedi commented Oct 30, 2019

StefanKarpinski commented Oct 31, 2019

XVilka commented Nov 21, 2019

davidanthoff commented Dec 14, 2019

non-Jedi commented Oct 16, 2019 •

edited

Loading

non-Jedi commented Oct 26, 2019 •

edited

Loading

StefanKarpinski commented Oct 26, 2019 •

edited

Loading