LanguageServer.jl uses characters for position rather than UTF-16 codeunits #401
Comments
Wow, that is insane! Does that mean we should use some UTF16 string type internally in the LS? Or is there some easy way to convert indices?
@StefanKarpinski do you have advice what the best strategy here is? Right now we hold a copy of each source file as a julia My gut feeling is that we want to stick with the
I found the following two links in the discussion of this problem on the LS repo that supposedly contain code that converts indices in the way we need. One in TypeScript: https://github.com/NeekSandhu/onigasm/blob/master/src/OnigString.ts. And another one in C++: https://github.com/atom/node-oniguruma/blob/9e3334b4fbe50752ec672fed29c48fc583e44485/src/onig-string.cc#L32-L85.
Since the "Windows languages" use UTF-16 internally, I think that the LS protocol was designed with the intention of representing the document as UTF-16 in all the clients/servers, insane as that is when an encoding-agnostic approach like Unicode codepoints already exists. https://github.com/JuliaStrings/LegacyStrings.jl includes a type for UTF-16 if you want to go down that route.
Oh wow, that's unfortunate. UTF-16, the worst of all worlds encoding. I guess no one has told Microsoft that UTF-8 won? 😝
I think that instinct is on point. I don't have any code handy for this, but if you can describe what translation you need, I could try to write something; it shouldn't be too tricky. Do you want
We probably need @ZacLN's input here on what exactly we need; it's not clear to me right now. I think (but am not sure) that position info is often transmitted as a line/column combo, so what we probably need is to convert only those relative column offsets between the two index systems...
LanguageServer.jl currently uses character indices (same as codepoints, right?) for everything, so that's going to be easiest to convert to/from.
So you need to scan from the start of the string, counting characters up to the character index, adding 1 for each BMP character and 2 for each non-BMP character, to compute the UTF-16 code unit index.
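The scan described above can be sketched as follows. This is only an illustration in Python, not code from LanguageServer.jl: the function names `utf16_units` and `char_index_to_utf16` are hypothetical, and it assumes 0-based offsets within a single line, as LSP positions use.

```python
def utf16_units(ch: str) -> int:
    """Number of UTF-16 code units needed for one code point:
    1 inside the BMP (<= U+FFFF), 2 (a surrogate pair) outside it."""
    return 2 if ord(ch) > 0xFFFF else 1


def char_index_to_utf16(line: str, char_index: int) -> int:
    """Convert a 0-based code point offset within `line` into a
    0-based UTF-16 code unit offset (hypothetical helper)."""
    if not 0 <= char_index <= len(line):
        raise IndexError("char_index out of range")
    # Scan from the start of the line, summing the width of every
    # character that precedes the requested index.
    return sum(utf16_units(ch) for ch in line[:char_index])
```

For a line such as `a𐐀b`, the character `b` sits at character offset 2 but at UTF-16 offset 3, because `𐐀` (U+10400) lies outside the BMP and takes two code units.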
Would be nice to stick to UTF-8 where available.
I'm closing this issue because we now have a PR that is tracking things.
As per the quote in #400:
The character offset should be in terms of UTF-16 codeunits. As far as I can tell, LanguageServer.jl only uses UTF-8 internally and works in terms of characters (codepoints) rather than codeunits. eglot works around this but not all editors might. I have no idea how VSCode behaves.
So for e.g. the file:
Asking for line 1 position 6 should show the hover for 𐐀𐐀𐐀𐐀 since the 7th UTF-16 codeunit is still within that variable. Instead it shows the hover for 𐐀𐐀𐐀.
There's some discussion about the awkwardness of using UTF-16 code units at microsoft/language-server-protocol#376 and a survey of other implementations at https://github.com/Avi-D-coder/lsp-range-unit-survey.
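To make the mismatch concrete, here is a small Python sketch of the reverse mapping a server would need for incoming positions. The sample line `𐐀𐐀𐐀𐐀+𐐀𐐀𐐀` is an assumption made up for illustration (the file from the original report is not reproduced here), and `utf16_to_char_index` is a hypothetical helper, not part of LanguageServer.jl.

```python
def utf16_to_char_index(line: str, utf16_offset: int) -> int:
    """Map a 0-based UTF-16 code unit offset to the 0-based index of
    the code point that contains it (hypothetical helper)."""
    if utf16_offset < 0:
        raise IndexError("utf16_offset out of range")
    units = 0  # UTF-16 code units consumed so far
    for i, ch in enumerate(line):
        width = 2 if ord(ch) > 0xFFFF else 1  # surrogate pair outside the BMP
        if utf16_offset < units + width:
            return i
        units += width
    if utf16_offset == units:
        return len(line)  # position just past the end of the line
    raise IndexError("utf16_offset out of range")


# Hypothetical line: a four-character identifier, '+', then a three-character one.
line = "\U00010400" * 4 + "+" + "\U00010400" * 3

# Treating offset 6 as a UTF-16 code unit offset lands in the first identifier ...
inside_first = utf16_to_char_index(line, 6) < 4
# ... while treating 6 directly as a character index lands in the second one.
inside_second = 5 <= 6 < len(line)
```

With this line, UTF-16 offset 6 resolves to character index 3, the last character of the four-character identifier, whereas reading 6 as a character index points into the three-character identifier, which matches the behaviour described in the report.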