properly handle LSP position encoding #5711

pascalkuthe · 2023-01-27T23:21:33Z

Fixes #4791
Fixes #3286
Fixes #2547
Fixes #5809

I started to thoroughly investigate the char_pos <-> lsp_pos conversion in an effort to fix #4791. Initially I just started with the suggestion by @the-mikdavis of capping the character offset to the line length. While this is only mentioned offhand in the LSP standard, it's still part of the standard and seems a good precaution against crashing from broken LSPs.

This fixed the original panic but caused panics and weird formatting bugs for slightly different json files. So I investigated further and found a pretty mayor bug: Helix implemented the UTF-8 encoding incorrectly:

Currently, helix treat the Position::character field as a char offset in Utf-8 mode and as utf16 offset in UTF-16 mode. This is incorrect, the Position::character field corresponds to UTF-8 byte, what we had really implemented was UTF-32.

From the LSP spec:

When describing positions the protocol needs to specify how offsets (specifically character offsets) should be interpreted. The corresponding PositionEncodingKind is negotiated between the client and the server during initialization.

/**
* A type indicating how positions are encoded,
* specifically what column offsets mean.
*
* @since 3.17.0
*/
export type PositionEncodingKind = string;

/**
* A set of predefined position encoding kinds.
*
* @since 3.17.0
*/
export namespace PositionEncodingKind {

/**
* Character offsets count UTF-8 code units (e.g bytes).
*/
export const UTF8: PositionEncodingKind = 'utf-8';

/**
* Character offsets count UTF-16 code units.
*
* This is the default and must always be supported
* by servers
*/
export const UTF16: PositionEncodingKind = 'utf-16';

/**
* Character offsets count UTF-32 code units.
*
* Implementation note: these are the same as Unicode code points,
* so this `PositionEncodingKind` may also be used for an
* encoding-agnostic representation of character offsets.
*/
export const UTF32: PositionEncodingKind = 'utf-32';
}

Fixing the UTF-8 encoding however only made the problems worse. The reason was simply that it was incorrect for us to use UTF-8 encoding at all. The lsp-types crate and helix (and many LSP servers) don't support LSP 3.17 yet so this option is not available to us. Instead we just hardcoded the encoding to UTF-8. However, we should be hard-coding helix to use utf-16 instead. To quote the standard:

Prior to 3.17 the offsets were always based on a UTF-16 string representation. So in a string of the form a𐐀b the character offset of the character a is 0, the character offset of 𐐀 is 1 and the character offset of b is 3 since 𐐀 is represented using two code units in UTF-16. Since 3.17 clients and servers can agree on a different string encoding representation (e.g. UTF-8). The client announces it’s supported encoding via the client capability general.positionEncodings. The value is an array of position encodings the client supports, with decreasing preference (e.g. the encoding at index 0 is the most preferred one). To stay backwards compatible the only mandatory encoding is UTF-16 represented via the string utf-16. The server can pick one of the encodings offered by the client and signals that encoding back to the client via the initialize result’s property capabilities.positionEncoding. If the string value utf-16 is missing from the client’s capability general.positionEncodings servers can safely assume that the client supports UTF-16. If the server omits the position encoding in its initialize result the encoding defaults to the string value utf-16.

The only reason this hasn't caused much more problems yet is that these two bugs somewhat cancel out. Specifically we treated UTF-8 like UTF-32. Characters that require 2 UTF-16 codepoints instead of just 1 are rare and therefore problems were rare and these bugs essentially canceled out.

TLDR:

Helix uses UTF-8 encoding for LSP offsets
It should be using UTF-16 (first bug fixed here)
The UTF-8 encoding was accidently a UTF-32 encoding (second bug fixed here), which is the same as UTF-16 for commonly used chars
Added some minor cleanups and comments to be closer to the LSP spec

pascalkuthe · 2023-01-27T23:27:13Z

helix-lsp/src/client.rs

-            offset_encoding: OffsetEncoding::Utf8,
+            offset_encoding: OffsetEncoding::Utf16,


This is the core fix. As we harcode the encoding right now so only Utf16 get actually used (that was implemented correctly for the most part) so the UTf8 fix doesn't matter anymore but I still kept it (and the other cleanup) around as futureproofing when we can update to LSP 3.17

Wasn't the idea here that we ask for UTF-8 instead of UTF-16 if possible? Based on the clangd negotiation proposal that was merged into the spec. We should prefer UTF-8 if it can be negotiated, then UTF-16 as fallback.

Yeah exactly but that proposal is only available in LSP 3.17 which lsp_types doesn't support yet.
Right now we don't negotiate anything (offset_encoding is only set here and never changed) and just assume UTF-8 which is incorrect.

Once lsp_types updates to 3.17 (there is a PR but not yet merged) I am happy to add that. For now using UTF-16 seems the right thing to me tough as almost all LSPs only support that (and it's the mandatory to use UTf-16 when not otherwise negotiated according to the standard)

Also from a performance standpoint UTF-32 is actually faster as we use chard indecies. Whether we convert UTf-8 or UTF-16 to UTF-32 (char index) with ropey makes little difference. Ropey does some additional counting but we need to transverse the chunk (so at most 4KB) in either case. The UTF-8, UTF-16 and UTF-32 position at the start of each chunk are always maintained inside the rope anyway

So actually, the spec mandates UTF-16 (thanks Microsoft..) but a lot of implementations do it wrong and just use UTF-8 offsets

Hmm that is unfortunate but switching to UTF-16 can't be causing too many problems as right now on master our UTF-8 implementation is actually a UTF-32 implementation by accident (we treat the characters offset as char offsets instead of byte offsets) so right now helix behaves incorrectly for either case.

I think we should probably stick to the standard. We could add an option to langauges.toml to work around incorrect LS. The lsp_types update isn't far away either so that will be the proper solution then.

Just to clarify: This is just about how the characters field of the Position struct is treated. The actual text send back and forth is always UTF-8 (that is standard compliant and it's stupid that the offset encoding doesn't match the actual data but oh well)

I am pretty sure that most LSPs support utf-16 as a fallback tough. For example for clangd:

https://clangd.llvm.org/extensions.html#utf-8-offsets

New client capability: offsetEncoding : string[]:

Lists the encodings the client supports, in preference order. It SHOULD include "utf-16". If not present, it is assumed to be ["utf-16"]

Otherwise these LSPs would not work with vscode at all (which always uses UTF-16 AFAIK). An LS that doesn't work with VSCode seems like an exception to me.

pascalkuthe · 2023-01-27T23:40:59Z

helix-lsp/src/lib.rs

+        test_case!("", (0, 1) => Some(0));
        test_case!("", (1, 0) => None);
        test_case!("\n\n", (0, 0) => Some(0));
        test_case!("\n\n", (1, 0) => Some(1));
-        test_case!("\n\n", (1, 1) => Some(2));
+        test_case!("\n\n", (1, 1) => Some(1));
        test_case!("\n\n", (2, 0) => Some(2));
        test_case!("\n\n", (3, 0) => None);
        test_case!("test\n\n\n\ncase", (4, 3) => Some(11));
        test_case!("test\n\n\n\ncase", (4, 4) => Some(12));
-        test_case!("test\n\n\n\ncase", (4, 5) => None);
+        test_case!("test\n\n\n\ncase", (4, 5) => Some(12));


changes in this test reflect that the offset by Position::character is now capped to the line length (before the line terminator)

pascalkuthe · 2023-01-28T23:34:49Z

Likely fixes #3286 too ~~but I haven't tested that yet~~. I was now able to test, this PR indeed fixes that problem too.

pascalkuthe · 2023-01-30T00:07:20Z

Likely fixes #2547 too altough that PR is missing info to reproduce the issue reliably. But since editing text after emoojis is involved I am quite certain that the problem was also caused by UTF-32/UTF-16 missmatch

pascalkuthe · 2023-02-03T17:57:35Z

Also fixes #5809

the-mikedavis

Some small comments but otherwise looks good to me

the-mikedavis · 2023-02-08T23:04:42Z

helix-core/src/line_ending.rs

@@ -203,6 +203,13 @@ pub fn line_end_char_index(slice: &RopeSlice, line: usize) -> usize {
            .unwrap_or(0)
 }

+pub fn line_end_byte_index(slice: &RopeSlice, line: usize) -> usize {
+    slice.line_to_byte(line + 1)


Is there any risk of going over the line count here with the + 1?

Hmm this is something I didn't consider and a good point.

I just copied the line_end_char_idx function and converted it to return a byte index. For this specific usecase it's fine since ropey allows one past the end index but LSP doesn't (and that's enforced at the callsite of this function at the start of lsp_pos_to_pos so any newly added calls here can never lead to an out of bounds panic). We should probably check all existing call sites of line_end_char_idx but that would be seprerste from this PR as it's an existing function.

helix-lsp/src/lib.rs

Co-authored-by: Michael Davis <mcarsondavis@gmail.com>

pascalkuthe added 2 commits January 28, 2023 00:00

properly handle LSP position encoding

ea38284

add debug assertion to Transaction::change

2bba3a6

pascalkuthe added C-bug Category: This is a bug A-language-server Area: Language server client E-medium Call for participation: Experience needed to fix: Medium / intermediate S-waiting-on-review Status: Awaiting review from a maintainer. labels Jan 27, 2023

pascalkuthe commented Jan 27, 2023

View reviewed changes

David-Else mentioned this pull request Jan 28, 2023

LSP Autocompletion does not insert, it overwrites #4380

Closed

This was referenced Jan 30, 2023

Helix crashed on "Undo change" (u) [undo] #5733

Closed

Incorrect underlining of errors and wrong results from code action #5809

Closed

This was referenced Feb 7, 2023

Crash when formatting document #4673

Closed

Crash on write, open buffers corrupted, files overridden, #4422

Closed

archseer requested a review from the-mikedavis February 8, 2023 07:44

the-mikedavis reviewed Feb 8, 2023

View reviewed changes

Apply suggestions from code review

9eacb45

Co-authored-by: Michael Davis <mcarsondavis@gmail.com>

the-mikedavis approved these changes Feb 9, 2023

View reviewed changes

archseer approved these changes Feb 9, 2023

View reviewed changes

archseer merged commit 7ebcf4e into helix-editor:master Feb 9, 2023

pascalkuthe deleted the fix_lsp_encoding branch February 9, 2023 09:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

properly handle LSP position encoding #5711

properly handle LSP position encoding #5711

pascalkuthe commented Jan 27, 2023 •

edited

Loading

pascalkuthe Jan 27, 2023 •

edited

Loading

archseer Feb 1, 2023

pascalkuthe Feb 1, 2023 •

edited

Loading

pascalkuthe Feb 1, 2023 •

edited

Loading

archseer Feb 1, 2023

pascalkuthe Feb 1, 2023 •

edited

Loading

pascalkuthe Feb 1, 2023 •

edited

Loading

pascalkuthe Jan 27, 2023

pascalkuthe commented Jan 28, 2023 •

edited

Loading

pascalkuthe commented Jan 30, 2023

pascalkuthe commented Feb 3, 2023

the-mikedavis left a comment

the-mikedavis Feb 8, 2023

pascalkuthe Feb 8, 2023 •

edited

Loading

		offset_encoding: OffsetEncoding::Utf8,
		offset_encoding: OffsetEncoding::Utf16,

properly handle LSP position encoding #5711

properly handle LSP position encoding #5711

Conversation

pascalkuthe commented Jan 27, 2023 • edited Loading

pascalkuthe Jan 27, 2023 • edited Loading

Choose a reason for hiding this comment

archseer Feb 1, 2023

Choose a reason for hiding this comment

pascalkuthe Feb 1, 2023 • edited Loading

Choose a reason for hiding this comment

pascalkuthe Feb 1, 2023 • edited Loading

Choose a reason for hiding this comment

archseer Feb 1, 2023

Choose a reason for hiding this comment

pascalkuthe Feb 1, 2023 • edited Loading

Choose a reason for hiding this comment

pascalkuthe Feb 1, 2023 • edited Loading

Choose a reason for hiding this comment

pascalkuthe Jan 27, 2023

Choose a reason for hiding this comment

pascalkuthe commented Jan 28, 2023 • edited Loading

pascalkuthe commented Jan 30, 2023

pascalkuthe commented Feb 3, 2023

the-mikedavis left a comment

Choose a reason for hiding this comment

the-mikedavis Feb 8, 2023

Choose a reason for hiding this comment

pascalkuthe Feb 8, 2023 • edited Loading

Choose a reason for hiding this comment

pascalkuthe commented Jan 27, 2023 •

edited

Loading

pascalkuthe Jan 27, 2023 •

edited

Loading

pascalkuthe Feb 1, 2023 •

edited

Loading

pascalkuthe Feb 1, 2023 •

edited

Loading

pascalkuthe Feb 1, 2023 •

edited

Loading

pascalkuthe Feb 1, 2023 •

edited

Loading

pascalkuthe commented Jan 28, 2023 •

edited

Loading

pascalkuthe Feb 8, 2023 •

edited

Loading