Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

properly handle LSP position encoding #5711

Merged
merged 3 commits into from
Feb 9, 2023

Conversation

pascalkuthe
Copy link
Member

@pascalkuthe pascalkuthe commented Jan 27, 2023

Fixes #4791
Fixes #3286
Fixes #2547
Fixes #5809

I started to thoroughly investigate the char_pos <-> lsp_pos conversion in an effort to fix #4791. Initially I just started with the suggestion by @the-mikdavis of capping the character offset to the line length. While this is only mentioned offhand in the LSP standard, it's still part of the standard and seems a good precaution against crashing from broken LSPs.

This fixed the original panic but caused panics and weird formatting bugs for slightly different json files. So I investigated further and found a pretty mayor bug: Helix implemented the UTF-8 encoding incorrectly:

Currently, helix treat the Position::character field as a char offset in Utf-8 mode and as utf16 offset in UTF-16 mode. This is incorrect, the Position::character field corresponds to UTF-8 byte, what we had really implemented was UTF-32.

From the LSP spec:

When describing positions the protocol needs to specify how offsets (specifically character offsets) should be interpreted. The corresponding PositionEncodingKind is negotiated between the client and the server during initialization.

/**
* A type indicating how positions are encoded,
* specifically what column offsets mean.
*
* @since 3.17.0
*/
export type PositionEncodingKind = string;

/**
* A set of predefined position encoding kinds.
*
* @since 3.17.0
*/
export namespace PositionEncodingKind {

/**
* Character offsets count UTF-8 code units (e.g bytes).
*/
export const UTF8: PositionEncodingKind = 'utf-8';

/**
* Character offsets count UTF-16 code units.
*
* This is the default and must always be supported
* by servers
*/
export const UTF16: PositionEncodingKind = 'utf-16';

/**
* Character offsets count UTF-32 code units.
*
* Implementation note: these are the same as Unicode code points,
* so this `PositionEncodingKind` may also be used for an
* encoding-agnostic representation of character offsets.
*/
export const UTF32: PositionEncodingKind = 'utf-32';
}

Fixing the UTF-8 encoding however only made the problems worse. The reason was simply that it was incorrect for us to use UTF-8 encoding at all. The lsp-types crate and helix (and many LSP servers) don't support LSP 3.17 yet so this option is not available to us. Instead we just hardcoded the encoding to UTF-8. However, we should be hard-coding helix to use utf-16 instead. To quote the standard:

Prior to 3.17 the offsets were always based on a UTF-16 string representation. So in a string of the form a𐐀b the character offset of the character a is 0, the character offset of 𐐀 is 1 and the character offset of b is 3 since 𐐀 is represented using two code units in UTF-16. Since 3.17 clients and servers can agree on a different string encoding representation (e.g. UTF-8). The client announces it’s supported encoding via the client capability general.positionEncodings. The value is an array of position encodings the client supports, with decreasing preference (e.g. the encoding at index 0 is the most preferred one). To stay backwards compatible the only mandatory encoding is UTF-16 represented via the string utf-16. The server can pick one of the encodings offered by the client and signals that encoding back to the client via the initialize result’s property capabilities.positionEncoding. If the string value utf-16 is missing from the client’s capability general.positionEncodings servers can safely assume that the client supports UTF-16. If the server omits the position encoding in its initialize result the encoding defaults to the string value utf-16.

The only reason this hasn't caused much more problems yet is that these two bugs somewhat cancel out. Specifically we treated UTF-8 like UTF-32. Characters that require 2 UTF-16 codepoints instead of just 1 are rare and therefore problems were rare and these bugs essentially canceled out.

TLDR:

  • Helix uses UTF-8 encoding for LSP offsets
  • It should be using UTF-16 (first bug fixed here)
  • The UTF-8 encoding was accidently a UTF-32 encoding (second bug fixed here), which is the same as UTF-16 for commonly used chars
  • Added some minor cleanups and comments to be closer to the LSP spec

@pascalkuthe pascalkuthe added C-bug Category: This is a bug A-language-server Area: Language server client E-medium Call for participation: Experience needed to fix: Medium / intermediate S-waiting-on-review Status: Awaiting review from a maintainer. labels Jan 27, 2023
offset_encoding: OffsetEncoding::Utf8,
offset_encoding: OffsetEncoding::Utf16,
Copy link
Member Author

@pascalkuthe pascalkuthe Jan 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core fix. As we harcode the encoding right now so only Utf16 get actually used (that was implemented correctly for the most part) so the UTf8 fix doesn't matter anymore but I still kept it (and the other cleanup) around as futureproofing when we can update to LSP 3.17

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't the idea here that we ask for UTF-8 instead of UTF-16 if possible? Based on the clangd negotiation proposal that was merged into the spec. We should prefer UTF-8 if it can be negotiated, then UTF-16 as fallback.

Copy link
Member Author

@pascalkuthe pascalkuthe Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah exactly but that proposal is only available in LSP 3.17 which lsp_types doesn't support yet.
Right now we don't negotiate anything (offset_encoding is only set here and never changed) and just assume UTF-8 which is incorrect.

Once lsp_types updates to 3.17 (there is a PR but not yet merged) I am happy to add that. For now using UTF-16 seems the right thing to me tough as almost all LSPs only support that (and it's the mandatory to use UTf-16 when not otherwise negotiated according to the standard)

Copy link
Member Author

@pascalkuthe pascalkuthe Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also from a performance standpoint UTF-32 is actually faster as we use chard indecies. Whether we convert UTf-8 or UTF-16 to UTF-32 (char index) with ropey makes little difference. Ropey does some additional counting but we need to transverse the chunk (so at most 4KB) in either case. The UTF-8, UTF-16 and UTF-32 position at the start of each chunk are always maintained inside the rope anyway

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So actually, the spec mandates UTF-16 (thanks Microsoft..) but a lot of implementations do it wrong and just use UTF-8 offsets

Copy link
Member Author

@pascalkuthe pascalkuthe Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm that is unfortunate but switching to UTF-16 can't be causing too many problems as right now on master our UTF-8 implementation is actually a UTF-32 implementation by accident (we treat the characters offset as char offsets instead of byte offsets) so right now helix behaves incorrectly for either case.

I think we should probably stick to the standard. We could add an option to langauges.toml to work around incorrect LS. The lsp_types update isn't far away either so that will be the proper solution then.

Just to clarify: This is just about how the characters field of the Position struct is treated. The actual text send back and forth is always UTF-8 (that is standard compliant and it's stupid that the offset encoding doesn't match the actual data but oh well)

Copy link
Member Author

@pascalkuthe pascalkuthe Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure that most LSPs support utf-16 as a fallback tough. For example for clangd:

https://clangd.llvm.org/extensions.html#utf-8-offsets

New client capability: offsetEncoding : string[]:

Lists the encodings the client supports, in preference order. It SHOULD include "utf-16". If not present, it is assumed to be ["utf-16"]

Otherwise these LSPs would not work with vscode at all (which always uses UTF-16 AFAIK). An LS that doesn't work with VSCode seems like an exception to me.

Comment on lines +646 to +655
test_case!("", (0, 1) => Some(0));
test_case!("", (1, 0) => None);
test_case!("\n\n", (0, 0) => Some(0));
test_case!("\n\n", (1, 0) => Some(1));
test_case!("\n\n", (1, 1) => Some(2));
test_case!("\n\n", (1, 1) => Some(1));
test_case!("\n\n", (2, 0) => Some(2));
test_case!("\n\n", (3, 0) => None);
test_case!("test\n\n\n\ncase", (4, 3) => Some(11));
test_case!("test\n\n\n\ncase", (4, 4) => Some(12));
test_case!("test\n\n\n\ncase", (4, 5) => None);
test_case!("test\n\n\n\ncase", (4, 5) => Some(12));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes in this test reflect that the offset by Position::character is now capped to the line length (before the line terminator)

@pascalkuthe
Copy link
Member Author

pascalkuthe commented Jan 28, 2023

Likely fixes #3286 too but I haven't tested that yet. I was now able to test, this PR indeed fixes that problem too.

@pascalkuthe
Copy link
Member Author

Likely fixes #2547 too altough that PR is missing info to reproduce the issue reliably. But since editing text after emoojis is involved I am quite certain that the problem was also caused by UTF-32/UTF-16 missmatch

@pascalkuthe
Copy link
Member Author

Also fixes #5809

Copy link
Member

@the-mikedavis the-mikedavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small comments but otherwise looks good to me

@@ -203,6 +203,13 @@ pub fn line_end_char_index(slice: &RopeSlice, line: usize) -> usize {
.unwrap_or(0)
}

pub fn line_end_byte_index(slice: &RopeSlice, line: usize) -> usize {
slice.line_to_byte(line + 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any risk of going over the line count here with the + 1?

Copy link
Member Author

@pascalkuthe pascalkuthe Feb 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this is something I didn't consider and a good point.

I just copied the line_end_char_idx function and converted it to return a byte index. For this specific usecase it's fine since ropey allows one past the end index but LSP doesn't (and that's enforced at the callsite of this function at the start of lsp_pos_to_pos so any newly added calls here can never lead to an out of bounds panic). We should probably check all existing call sites of line_end_char_idx but that would be seprerste from this PR as it's an existing function.

helix-lsp/src/lib.rs Outdated Show resolved Hide resolved
helix-lsp/src/lib.rs Outdated Show resolved Hide resolved
Co-authored-by: Michael Davis <mcarsondavis@gmail.com>
@archseer archseer merged commit 7ebcf4e into helix-editor:master Feb 9, 2023
@pascalkuthe pascalkuthe deleted the fix_lsp_encoding branch February 9, 2023 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-language-server Area: Language server client C-bug Category: This is a bug E-medium Call for participation: Experience needed to fix: Medium / intermediate S-waiting-on-review Status: Awaiting review from a maintainer.
Projects
None yet
3 participants