properly handle LSP position encoding #5711
Conversation
- offset_encoding: OffsetEncoding::Utf8,
+ offset_encoding: OffsetEncoding::Utf16,
This is the core fix. Since we hardcode the encoding right now, only Utf16 actually gets used (that was implemented correctly for the most part), so the Utf8 fix doesn't matter anymore, but I still kept it (and the other cleanup) around as future-proofing for when we can update to LSP 3.17.
Wasn't the idea here that we ask for UTF-8 instead of UTF-16 if possible? Based on the clangd negotiation proposal that was merged into the spec. We should prefer UTF-8 if it can be negotiated, then UTF-16 as fallback.
Yeah, exactly, but that proposal is only available in LSP 3.17, which lsp_types doesn't support yet.
Right now we don't negotiate anything (offset_encoding is only set here and never changed) and just assume UTF-8, which is incorrect.
Once lsp_types updates to 3.17 (there is a PR, but it's not merged yet) I am happy to add that. For now, using UTF-16 seems like the right thing to me, though, as almost all LSPs only support that (and according to the standard it's mandatory to use UTF-16 when not otherwise negotiated).
Also, from a performance standpoint, UTF-32 is actually faster, as we use char indices. Whether we convert UTF-8 or UTF-16 to UTF-32 (char index) with ropey makes little difference. Ropey does some additional counting, but we need to traverse the chunk (so at most 4KB) in either case. The UTF-8, UTF-16, and UTF-32 positions at the start of each chunk are always maintained inside the rope anyway.
So actually, the spec mandates UTF-16 (thanks Microsoft..) but a lot of implementations do it wrong and just use UTF-8 offsets
Hmm, that is unfortunate, but switching to UTF-16 can't be causing too many problems: right now on master our UTF-8 implementation is accidentally a UTF-32 implementation (we treat the character offset as a char offset instead of a byte offset), so Helix currently behaves incorrectly in either case.
I think we should probably stick to the standard. We could add an option to languages.toml to work around incorrect language servers. The lsp_types update isn't far away either, so that will be the proper solution then.
Just to clarify: this is just about how the character field of the Position struct is treated. The actual text sent back and forth is always UTF-8 (that is standard compliant, and it's unfortunate that the offset encoding doesn't match the actual data, but oh well).
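For illustration, here is a std-only sketch of that interpretation of the character field (the name utf16_offset_to_char_idx is made up for this example; Helix's real conversion runs on a ropey RopeSlice):

```rust
/// Convert a UTF-16 code-unit offset within `line` into a char index.
/// Hypothetical helper for illustration, not Helix's actual function.
fn utf16_offset_to_char_idx(line: &str, utf16_offset: usize) -> Option<usize> {
    let mut remaining = utf16_offset;
    for (char_idx, c) in line.chars().enumerate() {
        if remaining == 0 {
            return Some(char_idx);
        }
        let width = c.len_utf16();
        if remaining < width {
            return None; // the offset points inside a surrogate pair
        }
        remaining -= width;
    }
    // An offset equal to the line's UTF-16 length maps to one past the last char.
    (remaining == 0).then(|| line.chars().count())
}

fn main() {
    // '😀' occupies two UTF-16 code units, so 'b' sits at UTF-16 offset 3
    // but char index 2.
    assert_eq!(utf16_offset_to_char_idx("a😀b", 3), Some(2));
    assert_eq!(utf16_offset_to_char_idx("a😀b", 1), Some(1));
    assert_eq!(utf16_offset_to_char_idx("abc", 4), None); // past end of line
    println!("ok");
}
```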
I am pretty sure that most LSPs support UTF-16 as a fallback, though. For example, for clangd:
https://clangd.llvm.org/extensions.html#utf-8-offsets
New client capability: offsetEncoding : string[]:
Lists the encodings the client supports, in preference order. It SHOULD include "utf-16". If not present, it is assumed to be ["utf-16"]
Otherwise these LSPs would not work with VSCode at all (which always uses UTF-16, AFAIK). A language server that doesn't work with VSCode seems like an exception to me.
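The negotiation that extension describes could be sketched like this (hypothetical names and signature, not Helix's actual API):

```rust
/// Choose the offset encoding from what the server advertises, preferring
/// UTF-8 and falling back to the mandatory UTF-16. Illustrative sketch only.
fn negotiate_encoding(server_encodings: Option<&[&str]>) -> &'static str {
    // Per the clangd extension, a missing capability means ["utf-16"].
    let offered = server_encodings.unwrap_or(&["utf-16"]);
    for preferred in ["utf-8", "utf-16"] {
        if offered.contains(&preferred) {
            return preferred;
        }
    }
    "utf-16" // mandated fallback when nothing else matches
}

fn main() {
    assert_eq!(negotiate_encoding(Some(&["utf-8", "utf-16"])), "utf-8");
    assert_eq!(negotiate_encoding(Some(&["utf-16"])), "utf-16");
    assert_eq!(negotiate_encoding(None), "utf-16");
    println!("ok");
}
```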
test_case!("", (0, 1) => Some(0));
test_case!("", (1, 0) => None);
test_case!("\n\n", (0, 0) => Some(0));
test_case!("\n\n", (1, 0) => Some(1));
- test_case!("\n\n", (1, 1) => Some(2));
+ test_case!("\n\n", (1, 1) => Some(1));
test_case!("\n\n", (2, 0) => Some(2));
test_case!("\n\n", (3, 0) => None);
test_case!("test\n\n\n\ncase", (4, 3) => Some(11));
test_case!("test\n\n\n\ncase", (4, 4) => Some(12));
- test_case!("test\n\n\n\ncase", (4, 5) => None);
+ test_case!("test\n\n\n\ncase", (4, 5) => Some(12));
The changes in this test reflect that the offset given by Position::character is now capped to the line length (before the line terminator).
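That capping behavior can be sketched std-only (cap_to_line_len is a hypothetical helper for illustration, not the PR's code):

```rust
/// Cap an LSP character offset to the line's length in chars, excluding
/// the line terminator. Hypothetical std-only helper, not Helix's code.
fn cap_to_line_len(line: &str, character: usize) -> usize {
    let len = line.trim_end_matches(&['\r', '\n'][..]).chars().count();
    character.min(len)
}

fn main() {
    assert_eq!(cap_to_line_len("case\n", 10), 4); // capped before the '\n'
    assert_eq!(cap_to_line_len("case\n", 3), 3);  // in-range offsets pass through
    assert_eq!(cap_to_line_len("\n", 1), 0);      // empty line caps to 0
    println!("ok");
}
```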
Likely fixes #3286 too.
Likely fixes #2547 too, although that issue is missing the info needed to reproduce it reliably. But since editing text after emojis is involved, I am quite certain that the problem was also caused by the UTF-32/UTF-16 mismatch.
Also fixes #5809.
Some small comments but otherwise looks good to me
@@ -203,6 +203,13 @@ pub fn line_end_char_index(slice: &RopeSlice, line: usize) -> usize {
        .unwrap_or(0)
  }

+ pub fn line_end_byte_index(slice: &RopeSlice, line: usize) -> usize {
+     slice.line_to_byte(line + 1)
Is there any risk of going over the line count here with the + 1?
Hmm, this is something I didn't consider, and a good point.
I just copied the line_end_char_index function and converted it to return a byte index. For this specific use case it's fine, since ropey allows a one-past-the-end index but LSP doesn't (and that's enforced at the call site of this function at the start of lsp_pos_to_pos, so any newly added calls here can never lead to an out-of-bounds panic). We should probably check all existing call sites of line_end_char_index, but that would be separate from this PR as it's an existing function.
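To make the bounds concern concrete, here is a std-only analogue of the helper (ropey is not used; line_starts and the panic-on-overflow behavior are illustrative assumptions, not the PR's code):

```rust
/// Start byte offset of every line, plus a final one-past-the-end entry,
/// mirroring how a rope can answer line_to_byte(line + 1).
/// Std-only sketch; ropey itself is not used here.
fn line_starts(text: &str) -> Vec<usize> {
    let mut starts = vec![0];
    for (i, b) in text.bytes().enumerate() {
        if b == b'\n' {
            starts.push(i + 1);
        }
    }
    starts.push(text.len()); // the extra "one past the end" entry
    starts
}

/// End byte of `line` is the start of `line + 1`. Asking for a line past
/// the last one panics here, which is exactly the bounds risk raised above.
fn line_end_byte_index(text: &str, line: usize) -> usize {
    line_starts(text)[line + 1]
}

fn main() {
    let text = "test\ncase";
    assert_eq!(line_end_byte_index(text, 0), 5); // just past "test\n"
    assert_eq!(line_end_byte_index(text, 1), 9); // end of "case"
    println!("ok");
}
```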
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
Fixes #4791
Fixes #3286
Fixes #2547
Fixes #5809
I started to thoroughly investigate the char_pos <-> lsp_pos conversion in an effort to fix #4791. Initially I just started with the suggestion by @the-mikdavis of capping the character offset to the line length. While this is only mentioned offhand in the LSP standard, it's still part of the standard and seems a good precaution against crashing from broken LSPs. This fixed the original panic but caused panics and weird formatting bugs for slightly different JSON files. So I investigated further and found a pretty major bug: Helix implemented the UTF-8 encoding incorrectly.

Currently, Helix treats the Position::character field as a char offset in UTF-8 mode and as a UTF-16 offset in UTF-16 mode. This is incorrect: in UTF-8 mode the Position::character field corresponds to a UTF-8 byte offset, so what we had really implemented was UTF-32. From the LSP spec:

Fixing the UTF-8 encoding, however, only made the problems worse. The reason was simply that it was incorrect for us to use UTF-8 encoding at all. The lsp-types crate and Helix (and many LSP servers) don't support LSP 3.17 yet, so this option is not available to us. Instead we just hardcoded the encoding to UTF-8. However, we should be hard-coding Helix to use UTF-16 instead. To quote the standard:

The only reason this hasn't caused many more problems yet is that these two bugs somewhat cancel out. Specifically, we treated UTF-8 like UTF-32. Characters that require 2 UTF-16 code units instead of just 1 are rare, so problems were rare and these bugs essentially canceled out.

TLDR:
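The "cancel out" effect can be made concrete with a std-only sketch (offsets is a hypothetical helper for illustration): char offsets and UTF-16 offsets agree for every character in the Basic Multilingual Plane, while UTF-8 byte offsets diverge as soon as any non-ASCII text appears.

```rust
/// Report the (char, UTF-8 byte, UTF-16 code unit) offsets of `target`
/// in `line`. Hypothetical helper for illustration only.
fn offsets(line: &str, target: char) -> (usize, usize, usize) {
    let char_off = line.chars().position(|c| c == target).unwrap();
    let byte_off = line.find(target).unwrap();
    let utf16_off: usize = line
        .chars()
        .take_while(|&c| c != target)
        .map(char::len_utf16)
        .sum();
    (char_off, byte_off, utf16_off)
}

fn main() {
    // 'é' is in the BMP: char and UTF-16 offsets agree, UTF-8 bytes differ.
    assert_eq!(offsets("aéb", 'b'), (2, 3, 2));
    // '😀' needs a surrogate pair: now UTF-16 diverges from char offsets too.
    assert_eq!(offsets("a😀b", 'b'), (2, 5, 3));
    println!("ok");
}
```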