Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LanguageServer.jl uses characters for position rather than UTF-16 codeunits #401

Closed
non-Jedi opened this issue Oct 16, 2019 · 10 comments
Closed
Assignees
Labels

Comments

@non-Jedi
Copy link
Member

non-Jedi commented Oct 16, 2019

As per the quote in #400:

A position inside a document (see Position definition below) is expressed as a zero-based line and character offset. The offsets are based on a UTF-16 string representation. So a string of the form a𐐀b the character offset of the character a is 0, the character offset of 𐐀 is 1 and the character offset of b is 3 since 𐐀 is represented using two code units in UTF-16.

The character offset should be in terms of UTF-16 codeunits. As far as
I can tell, LanguageServer.jl only uses UTF-8 internally and works in
terms of characters (codepoints) rather than codeunits. eglot works around
this

but not all editors might. I have no idea how VSCode behaves.

So for e.g. the file:

𐐀𐐀𐐀="hello"
𐐀𐐀𐐀𐐀=𐐀𐐀𐐀

Asking for line 1 position 6 should show the hover for 𐐀𐐀𐐀𐐀 since
the 7th UTF-16 codeunit is still within that variable. Instead it
shows the hover for 𐐀𐐀𐐀:

client-request (id:105) Wed Oct 16 16:29:55 2019:
(:jsonrpc "2.0" :id 105 :method "textDocument/hover" :params
          (:textDocument
           (:uri "file:///home/adam/tmp/test.jl")
           :position
           (:line 1 :character 6)))

server-reply (id:105) Wed Oct 16 16:29:55 2019:
(:id 105 :jsonrpc "2.0" :result
     (:contents
      [(:language "julia" :value "𐐀𐐀𐐀 = \"hello\"")]))

There's some discussion about the awkwardness of using UTF-16 code units at microsoft/language-server-protocol#376 and a survey of other implementations at https://github.com/Avi-D-coder/lsp-range-unit-survey.

@davidanthoff
Copy link
Member

Wow, that is insane!

Does that mean we should use some UTF16 string type internally in the LS? Or is there some easy way to convert indices?

@davidanthoff
Copy link
Member

@StefanKarpinski do you have advice what the best strategy here is? Right now we hold a copy of each source file as a julia String instance in the language server, i.e. a UTF-8 encoded copy. But apparently we need to use UTF-16 codepoint offsets in the protocol to communicate with VS Code... So we will receive positions that we need to convert from UTF-16 codepoint offsets into UTF-8 codepoints, and the other way around as well, when we send positions via the protocol.

My gut feeling is that we want to stick with the String type from julia for our storage, and then maybe convert indices? Do you know of julia code around somewhere that does that?

@davidanthoff
Copy link
Member

I found the following two links in the discussion of this problem on the LS repo that supposedly contain code that convert indices in the way we need it. One in Typescript: https://github.com/NeekSandhu/onigasm/blob/master/src/OnigString.ts. And another one in C++: https://github.com/atom/node-oniguruma/blob/9e3334b4fbe50752ec672fed29c48fc583e44485/src/onig-string.cc#L32-L85.

@non-Jedi
Copy link
Member Author

non-Jedi commented Oct 26, 2019

Since the "Windows languages" use UTF-16 internally, I think that the LS protocol was designed with the intention of representing the document as UTF-16 in all the clients/servers, insane as that is when an encoding-agnostic approach like unicode codepoints already exists. https://github.com/JuliaStrings/LegacyStrings.jl includes a type for UTF-16 if you want to go down that route.

@StefanKarpinski
Copy link

StefanKarpinski commented Oct 26, 2019

But apparently we need to use UTF-16 codepoint offsets in the protocol to communicate with VS Code... So we will receive positions that we need to convert from UTF-16 codepoint offsets into UTF-8 codepoints, and the other way around as well, when we send positions via the protocol.

Oh wow, that's unfortunate. UTF-16, the worst of all worlds encoding. I guess no one has told Microsoft that UTF-8 won? 😝

My gut feeling is that we want to stick with the String type from julia for our storage, and then maybe convert indices? Do you know of julia code around somewhere that does that?

I think that instinct is on point. I don't have any code handy for this, but if you can describe what translation you need, I could try to write something—shouldn't be too tricky. Do you want utf16index(s, i) where i is a UTF-8 code unit index or where i is a character index? Either one is going to be O(n) but shouldn't require transcoding the entire string, just scanning and counting.

@davidanthoff
Copy link
Member

We probably need @ZacLN's input here on what exactly we need, not clear to me right now. I think (but not sure) a lot of times position info is transmitted as a line/column combo, so what we probably need is to convert only those relative column offsets between the two index systems...

@non-Jedi
Copy link
Member Author

Do you want utf16index(s, i) where i is a UTF-8 code unit index or where i is a character index?

LanguageServer.jl currently uses character index (same as codepoint, right?) for everything, so that's going to be easiest to convert to/from.

@StefanKarpinski
Copy link

So you need to scan from the start of the string, counting characters up to the character index and add 1 for BMP characters and 2 for non-BMP characters to compute the UTF-16 code unit index.

@XVilka
Copy link

XVilka commented Nov 21, 2019

Would be nice to stick to the UTF-8 where available.
Most self-respecting LSP implementations prefer UTF-8 instead of the UTF-16 if both server and client support this: https://github.com/Avi-D-coder/lsp-range-unit-survey

@davidanthoff davidanthoff modified the milestones: v0.7.0, v1.0.1 Dec 14, 2019
@davidanthoff
Copy link
Member

I'm closing this issue because we now have a PR that is tracking things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants