Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix utf16 offset handling #427

Merged
merged 5 commits into from
Dec 15, 2019
Merged

fix utf16 offset handling #427

merged 5 commits into from
Dec 15, 2019

Conversation

ZacLN
Copy link
Contributor

@ZacLN ZacLN commented Dec 4, 2019

I think this should do it, supersedes #419. I think before we were essentially treating the character index as UTF32.

Thanks to @non-Jedi for finding this issue along with all the background ...

@ZacLN ZacLN requested a review from davidanthoff December 4, 2019 16:06
Copy link
Member

@davidanthoff davidanthoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't review the part about error marking, we should probably first figure out whether my point about the other places in the code is right or not.

src/document.jl Outdated
@@ -60,6 +60,9 @@ function get_offset(doc::Document, line::Integer, character::Integer)
offset = line_offsets[line + 1]
while character > 0
offset = nextind(doc._content, offset)
if nextind(get_text(doc), offset) - offset > 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this seems the wrong cut-off? UTF-8 uses more than 2 codepoints for anything above U+0800, but for UTF-16 the cutoff from 1 to 2 codepoints is 0x010000, I believe. So I think we need to capture the case where a character is above 0x010000, right? Also, wouldn't it be easier to directly check for the value of the char, like I did at https://github.com/julia-vscode/LanguageServer.jl/pull/419/files#diff-434665f0bad6ba3f5bf14654c456fa35R61?

@StefanKarpinski, can you help? I feel on very shaky ground whenever it comes to anything UTF :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think 0x010000 is right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, I've altered it to use an IOBuffer and explicitly check at those cutoffs

src/document.jl Outdated
while offset > ind
ind = nextind(doc._content, ind)
while offset >= ind
if nextind(get_text(doc), ind) - ind > 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same story here, I don't think having more than 2 codepoints in UTF-8 is the right cut-off, if I understand UTF-8 and UTF-16 correctly.

@ZacLN
Copy link
Contributor Author

ZacLN commented Dec 15, 2019

For info - mark_errors uses exactly the same approach as get_position_at but handles (sorted) batches of offsets

@StefanKarpinski
Copy link

Can you explain a bit more clearly what the expectations on the inputs and outputs are in terms of encoding and what's computed?

Copy link
Member

@davidanthoff davidanthoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, do we really need to use an IOBuffer here? It is really only used to get index positions, shouldn't we be able to do that with nextind as well?

I'll try to fix the tests on master so that we can actually run our tests on this.

src/document.jl Show resolved Hide resolved
src/document.jl Show resolved Hide resolved
c = read(io, Char)
if UInt32(c) >= 0x010000
char += 1
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we be able to somehow reuse the functions for converting to and from offsets here as well, so that we don't have to handle this manually here?

Copy link
Member

@davidanthoff davidanthoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, lets merge this. I think we can still try to come up with a more elegant version later on.

src/document.jl Show resolved Hide resolved
src/document.jl Show resolved Hide resolved
@ZacLN
Copy link
Contributor Author

ZacLN commented Dec 15, 2019

Super, I'd rather keep it obvious (to me) how it works incase further issues arise immediately

@davidanthoff
Copy link
Member

I'd rather keep it obvious (to me) how it works incase further issues arise immediately

Yep, that seems most important :)

So I'd say you merge this here, and then we try to fix the tests on master? And then I tag new versions of all packages and release a first release candidate?

@ZacLN
Copy link
Contributor Author

ZacLN commented Dec 15, 2019

Yep sounds like a plan

@ZacLN ZacLN merged commit b1cc3c5 into master Dec 15, 2019
@ZacLN ZacLN deleted the utf16handling branch December 15, 2019 23:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants