fix utf16 offset handling #427
Conversation
I didn't review the part about error marking; we should probably first figure out whether my point about the other places in the code is right or not.
src/document.jl
Outdated
@@ -60,6 +60,9 @@ function get_offset(doc::Document, line::Integer, character::Integer)
offset = line_offsets[line + 1]
while character > 0
offset = nextind(doc._content, offset)
if nextind(get_text(doc), offset) - offset > 2
So this seems like the wrong cut-off? UTF-8 uses more than 2 bytes for anything above U+0800, but for UTF-16 the cutoff from 1 to 2 code units is 0x010000, I believe. So I think we need to capture the case where a character is at or above 0x010000, right? Also, wouldn't it be easier to directly check the value of the char, like I did at https://github.com/julia-vscode/LanguageServer.jl/pull/419/files#diff-434665f0bad6ba3f5bf14654c456fa35R61?
@StefanKarpinski, can you help? I feel on very shaky ground whenever it comes to anything UTF :)
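To make the cutoffs concrete (a sketch for illustration, not part of the PR): UTF-8 changes byte length at U+0080 and U+0800, while UTF-16 only goes from one code unit to two (a surrogate pair) at U+10000, so a byte-length check against 2 catches the wrong characters:

```julia
# Compare UTF-8 byte lengths with UTF-16 code-unit counts for sample chars.
for c in ('a', 'é', '€', '𝔸')   # U+0061, U+00E9, U+20AC, U+1D538
    utf8_bytes  = ncodeunits(string(c))           # bytes in UTF-8
    utf16_units = UInt32(c) >= 0x010000 ? 2 : 1   # code units in UTF-16
    println(c, ": UTF-8 = ", utf8_bytes, " bytes, UTF-16 = ", utf16_units, " unit(s)")
end
```

Note that a 3-byte UTF-8 character like '€' is still a single UTF-16 code unit, which is why the UTF-8 byte count cannot stand in for the UTF-16 width.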
Yes, I think 0x010000 is right.
Gotcha, I've altered it to use an IOBuffer and explicitly check at those cutoffs.
src/document.jl
Outdated
while offset > ind
ind = nextind(doc._content, ind)
while offset >= ind
if nextind(get_text(doc), ind) - ind > 2
Same story here: I don't think more than 2 code units in UTF-8 is the right cut-off, if I understand UTF-8 and UTF-16 correctly.
For info -
Can you explain a bit more clearly what the expectations on the inputs and outputs are in terms of encoding and what's computed?
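For context (this comes from the Language Server Protocol specification, not from this thread): a Position's character field counts UTF-16 code units on the line, while Julia Strings are UTF-8 and byte-indexed. A minimal sketch of the forward conversion, with hypothetical names and no bounds checking, might look like:

```julia
# Hypothetical helper: convert a UTF-16 character offset within a line
# into a 1-based byte index into a UTF-8 encoded Julia String.
function utf16_to_byteindex(line_text::AbstractString, character::Integer)
    offset = 1
    while character > 0
        # Chars at or above U+10000 occupy two UTF-16 code units.
        character -= UInt32(line_text[offset]) >= 0x010000 ? 2 : 1
        offset = nextind(line_text, offset)
    end
    return offset
end
```

For example, `utf16_to_byteindex("a𝔸b", 3)` returns 6, the byte index of 'b', because '𝔸' consumes two UTF-16 code units but four UTF-8 bytes.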
Hm, do we really need to use an IOBuffer here? It is really only used to get index positions; shouldn't we be able to do that with nextind as well?
I'll try to fix the tests on master so that we can actually run our tests on this.
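A sketch of what the nextind-only reverse conversion could look like (illustrative names, assuming `offset` falls on a character boundary):

```julia
# Count the UTF-16 code units that precede byte index `offset` in `text`.
function byteindex_to_utf16(text::AbstractString, offset::Integer)
    character = 0
    ind = 1
    while ind < offset
        # A char at or above U+10000 contributes a surrogate pair (2 units).
        character += UInt32(text[ind]) >= 0x010000 ? 2 : 1
        ind = nextind(text, ind)
    end
    return character
end
```

This walks forward with nextind only, with no intermediate buffer, which is the shape the comment above is asking about.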
c = read(io, Char)
if UInt32(c) >= 0x010000
char += 1
end
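A self-contained version of the IOBuffer loop in the diff above (the surrounding names and termination condition are guesses at the PR's context, not its exact code):

```julia
# Count UTF-16 code units from the start of `text` up to the character
# at 1-based byte index `byte_offset`, reading a Char at a time.
function utf16_units_before(text::AbstractString, byte_offset::Integer)
    io = IOBuffer(text)
    char = 0
    while position(io) < byte_offset - 1   # position(io) is 0-based
        c = read(io, Char)
        char += 1
        if UInt32(c) >= 0x010000
            char += 1   # surrogate pair contributes an extra code unit
        end
    end
    return char
end
```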
Shouldn't we be able to somehow reuse the functions for converting to and from offsets here as well, so that we don't have to handle this manually here?
Alright, let's merge this. I think we can still try to come up with a more elegant version later on.
Super, I'd rather keep it obvious (to me) how it works in case further issues arise immediately.
Yep, that seems most important :) So I'd say you merge this here, and then we try to fix the tests on master.
Yep, sounds like a plan.
I think this should do it; it supersedes #419. I think before, we were essentially treating the character index as UTF-32.
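A small illustration of the bug as described: counting code points (effectively UTF-32, one unit per character) disagrees with UTF-16 counting as soon as a character at or above U+10000 appears:

```julia
# The offset of 'x' in "𝔸x" depends on the encoding you count in.
s = "𝔸x"
codepoint_offset = length(s) - 1                 # UTF-32-style count: 1
utf16_offset = UInt32('𝔸') >= 0x010000 ? 2 : 1   # UTF-16 count: 2 (surrogate pair)
println(codepoint_offset, " vs ", utf16_offset)
```

So a client that sends UTF-16 offsets would point one unit past where code-point counting expects, which is the off-by-one class of error this PR addresses.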
Thanks to @non-Jedi for finding this issue along with all the background ...