fix utf16 offset handling #427
Conversation
I didn't review the part about error marking; we should probably first figure out whether my point about the other places in the code is right or not.
src/document.jl
Outdated
@@ -60,6 +60,9 @@ function get_offset(doc::Document, line::Integer, character::Integer)
offset = line_offsets[line + 1]
while character > 0
offset = nextind(doc._content, offset)
if nextind(get_text(doc), offset) - offset > 2
So this seems like the wrong cut-off? UTF-8 uses more than 2 bytes for anything above U+0800, but for UTF-16 the cutoff from 1 to 2 code units is 0x010000, I believe. So I think we need to capture the case where a character is at or above 0x010000, right? Also, wouldn't it be easier to directly check the value of the char, like I did at https://github.com/julia-vscode/LanguageServer.jl/pull/419/files#diff-434665f0bad6ba3f5bf14654c456fa35R61?
@StefanKarpinski, can you help? I feel on very shaky ground whenever it comes to anything UTF :)
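To make the cutoffs concrete (a sketch for illustration, not part of the PR): UTF-8 changes byte length at U+0080 and U+0800, while UTF-16 only goes from one code unit to two (a surrogate pair) at U+10000, so a byte-length check against 2 catches the wrong characters:

```julia
# Compare UTF-8 byte lengths with UTF-16 code-unit counts for sample chars.
for c in ('a', 'é', '€', '𝔸')   # U+0061, U+00E9, U+20AC, U+1D538
    utf8_bytes  = ncodeunits(string(c))           # bytes in UTF-8
    utf16_units = UInt32(c) >= 0x010000 ? 2 : 1   # code units in UTF-16
    println(c, ": UTF-8 = ", utf8_bytes, " bytes, UTF-16 = ", utf16_units, " unit(s)")
end
```

Note that a 3-byte UTF-8 character like '€' is still a single UTF-16 code unit, which is why the UTF-8 byte count cannot stand in for the UTF-16 width.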
Yes, I think 0x010000 is right.
Gotcha, I've altered it to use an IOBuffer and explicitly check at those cutoffs.
src/document.jl
Outdated
while offset > ind
ind = nextind(doc._content, ind)
while offset >= ind
if nextind(get_text(doc), ind) - ind > 2
Same story here: I don't think more than 2 code units in UTF-8 is the right cut-off, if I understand UTF-8 and UTF-16 correctly.
For info -
Can you explain a bit more clearly what the expectations on the inputs and outputs are in terms of encoding and what's computed?
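For context (this comes from the Language Server Protocol specification, not from this thread): a Position's character field counts UTF-16 code units on the line, while Julia Strings are UTF-8 and byte-indexed. A minimal sketch of the forward conversion, with hypothetical names and no bounds checking, might look like:

```julia
# Hypothetical helper: convert a UTF-16 character offset within a line
# into a 1-based byte index into a UTF-8 encoded Julia String.
function utf16_to_byteindex(line_text::AbstractString, character::Integer)
    offset = 1
    while character > 0
        # Chars at or above U+10000 occupy two UTF-16 code units.
        character -= UInt32(line_text[offset]) >= 0x010000 ? 2 : 1
        offset = nextind(line_text, offset)
    end
    return offset
end
```

For example, `utf16_to_byteindex("a𝔸b", 3)` returns 6, the byte index of 'b', because '𝔸' consumes two UTF-16 code units but four UTF-8 bytes.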
Hm, do we really need to use an IOBuffer here? It is really only used to get index positions; shouldn't we be able to do that with nextind as well?
I'll try to fix the tests on master so that we can actually run our tests on this.
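A sketch of what the nextind-only reverse conversion could look like (illustrative names, assuming `offset` falls on a character boundary):

```julia
# Count the UTF-16 code units that precede byte index `offset` in `text`.
function byteindex_to_utf16(text::AbstractString, offset::Integer)
    character = 0
    ind = 1
    while ind < offset
        # A char at or above U+10000 contributes a surrogate pair (2 units).
        character += UInt32(text[ind]) >= 0x010000 ? 2 : 1
        ind = nextind(text, ind)
    end
    return character
end
```

This walks forward with nextind only, with no intermediate buffer, which is the shape the comment above is asking about.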
c = read(io, Char)
if UInt32(c) >= 0x010000
char += 1
end
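A self-contained version of the IOBuffer loop in the diff above (the surrounding names and termination condition are guesses at the PR's context, not its exact code):

```julia
# Count UTF-16 code units from the start of `text` up to the character
# at 1-based byte index `byte_offset`, reading a Char at a time.
function utf16_units_before(text::AbstractString, byte_offset::Integer)
    io = IOBuffer(text)
    char = 0
    while position(io) < byte_offset - 1   # position(io) is 0-based
        c = read(io, Char)
        char += 1
        if UInt32(c) >= 0x010000
            char += 1   # surrogate pair contributes an extra code unit
        end
    end
    return char
end
```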
Shouldn't we be able to somehow reuse the functions for converting to and from offsets here as well, so that we don't have to handle this manually here?
Alright, let's merge this. I think we can still try to come up with a more elegant version later on.
Super, I'd rather keep it obvious (to me) how it works in case further issues arise immediately.
Yep, that seems most important :) So I'd say you merge this here, and then we try to fix the tests on master.
Yep, sounds like a plan.
I think this should do it; it supersedes #419. I think before, we were essentially treating the character index as UTF-32.
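A small illustration of the bug as described: counting code points (effectively UTF-32, one unit per character) disagrees with UTF-16 counting as soon as a character at or above U+10000 appears:

```julia
# The offset of 'x' in "𝔸x" depends on the encoding you count in.
s = "𝔸x"
codepoint_offset = length(s) - 1                 # UTF-32-style count: 1
utf16_offset = UInt32('𝔸') >= 0x010000 ? 2 : 1   # UTF-16 count: 2 (surrogate pair)
println(codepoint_offset, " vs ", utf16_offset)
```

So a client that sends UTF-16 offsets would point one unit past where code-point counting expects, which is the off-by-one class of error this PR addresses.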
Thanks to @non-Jedi for finding this issue along with all the background ...