properly handle LSP position encoding #5711

Merged · 3 commits · Feb 9, 2023
7 changes: 7 additions & 0 deletions helix-core/src/line_ending.rs
@@ -203,6 +203,13 @@ pub fn line_end_char_index(slice: &RopeSlice, line: usize) -> usize {
.unwrap_or(0)
}

pub fn line_end_byte_index(slice: &RopeSlice, line: usize) -> usize {
    slice.line_to_byte(line + 1)
        - get_line_ending(&slice.line(line))
            .map(|le| le.as_str().len())
            .unwrap_or(0)
}

Member (reviewer):

Is there any risk of going over the line count here with the `+ 1`?

@pascalkuthe (Member, Author) — Feb 8, 2023:

Hmm, this is something I didn't consider, and a good point.

I just copied the `line_end_char_index` function and converted it to return a byte index. For this specific use case it's fine, since ropey allows an index one past the end but LSP doesn't (and that's enforced at the call site of this function, at the start of `lsp_pos_to_pos`, so any newly added calls here can never lead to an out-of-bounds panic). We should probably check all existing call sites of `line_end_char_index`, but that would be separate from this PR as it's an existing function.
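As a rough illustration of what `line_end_byte_index` computes, here is a stand-alone sketch over a plain `&str` (a hypothetical helper, not the ropey-backed implementation; it sidesteps the out-of-bounds question discussed above by clamping to the end of the text):

```rust
// Stand-in for `line_end_byte_index` on a plain &str: returns the byte
// index of the end of `line`, *before* its terminator, handling the
// '\n', "\r\n" and '\r' endings the LSP spec names.
fn line_end_byte_index(text: &str, line: usize) -> usize {
    let bytes = text.as_bytes();
    let mut current_line = 0;
    let mut i = 0;
    while i < bytes.len() {
        let term_len = match bytes[i] {
            b'\r' if bytes.get(i + 1) == Some(&b'\n') => 2,
            b'\r' | b'\n' => 1,
            _ => {
                i += 1;
                continue;
            }
        };
        if current_line == line {
            // End of the requested line, before its terminator.
            return i;
        }
        i += term_len;
        current_line += 1;
    }
    // The requested line is the terminator-less last line (or past the end).
    text.len()
}

fn main() {
    println!("{}", line_end_byte_index("test\ncase", 0)); // 4
    println!("{}", line_end_byte_index("a\r\nb", 1)); // 4
}
```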

/// Fetches line `line_idx` from the passed rope slice, sans any line ending.
pub fn line_without_line_ending<'a>(slice: &'a RopeSlice, line_idx: usize) -> RopeSlice<'a> {
let start = slice.line_to_char(line_idx);
5 changes: 5 additions & 0 deletions helix-core/src/transaction.rs
@@ -481,6 +481,11 @@ impl Transaction {
for (from, to, tendril) in changes {
// Verify ranges are ordered and not overlapping
debug_assert!(last <= from);
// Verify ranges are well-formed
debug_assert!(
    from <= to,
    "Edit end must not come before its start (expected {from} <= {to})"
);

// Retain from last "to" to current "from"
changeset.retain(from - last);
2 changes: 1 addition & 1 deletion helix-lsp/src/client.rs
@@ -104,7 +104,7 @@ impl Client {
server_tx,
request_counter: AtomicU64::new(0),
capabilities: OnceCell::new(),
-            offset_encoding: OffsetEncoding::Utf8,
+            offset_encoding: OffsetEncoding::Utf16,
@pascalkuthe (Member, Author) — Jan 27, 2023:

This is the core fix. We hardcode the encoding right now, so only UTF-16 actually gets used (and that was implemented correctly for the most part), which means the UTF-8 fix doesn't matter anymore. I still kept it (and the other cleanup) around as future-proofing for when we can update to LSP 3.17.

Member (reviewer):

Wasn't the idea here that we ask for UTF-8 instead of UTF-16 if possible, based on the clangd negotiation proposal that was merged into the spec? We should prefer UTF-8 if it can be negotiated, then fall back to UTF-16.

@pascalkuthe (Member, Author) — Feb 1, 2023:

Yeah, exactly, but that proposal is only available in LSP 3.17, which lsp_types doesn't support yet. Right now we don't negotiate anything (offset_encoding is only set here and never changed) and just assume UTF-8, which is incorrect.

Once lsp_types updates to 3.17 (there is a PR, but it's not yet merged) I am happy to add that. For now, using UTF-16 seems like the right thing to me, though, as almost all language servers only support that (and the standard mandates using UTF-16 when nothing else is negotiated).

@pascalkuthe (Member, Author) — Feb 1, 2023:

Also, from a performance standpoint, UTF-32 is actually faster, as we use char indices. Whether we convert UTF-8 or UTF-16 to UTF-32 (char indices) with ropey makes little difference: ropey does some additional counting, but we need to traverse the chunk (so at most 4 KB) in either case. The UTF-8, UTF-16, and UTF-32 positions at the start of each chunk are always maintained inside the rope anyway.

Member (reviewer):

So actually, the spec mandates UTF-16 (thanks, Microsoft...), but a lot of implementations do it wrong and just use UTF-8 offsets.

@pascalkuthe (Member, Author) — Feb 1, 2023:

Hmm, that is unfortunate, but switching to UTF-16 can't cause too many problems: right now on master our UTF-8 implementation is actually a UTF-32 implementation by accident (we treat the character offsets as char offsets instead of byte offsets), so right now helix behaves incorrectly in either case.

I think we should probably stick to the standard. We could add an option to languages.toml to work around incorrect language servers. The lsp_types update isn't far away either, so that will be the proper solution then.

Just to clarify: this is only about how the `character` field of the `Position` struct is treated. The actual text sent back and forth is always UTF-8 (that is standard-compliant, and it's unfortunate that the offset encoding doesn't match the actual data, but oh well).

@pascalkuthe (Member, Author) — Feb 1, 2023:

I am pretty sure that most language servers support UTF-16 as a fallback, though. For example, for clangd:

https://clangd.llvm.org/extensions.html#utf-8-offsets

> New client capability: offsetEncoding: string[]:
> Lists the encodings the client supports, in preference order. It SHOULD include "utf-16". If not present, it is assumed to be ["utf-16"].

Otherwise these language servers would not work with VSCode at all (which always uses UTF-16, AFAIK). A language server that doesn't work with VSCode seems like an exception to me.
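To make the difference between the encodings concrete, here is a small stand-alone check (plain std Rust, no helix APIs) of how far the three unit counts drift apart for a flag emoji:

```rust
// The same text measured in the three offset encodings an LSP peer may
// use. "🇺🇸" (two regional-indicator scalars) shows all three disagreeing.
fn main() {
    let flag = "🇺🇸";
    let utf8_units = flag.len(); // bytes
    let utf16_units = flag.encode_utf16().count();
    let utf32_units = flag.chars().count(); // ropey's char indices
    println!("UTF-8: {utf8_units}, UTF-16: {utf16_units}, UTF-32: {utf32_units}");
    // → UTF-8: 8, UTF-16: 4, UTF-32: 2
}
```

So a `Position::character` pointing just past the flag is 8, 4, or 2 depending on the negotiated encoding — interpreting one as another silently corrupts edits, which is what this PR fixes.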

config,
req_timeout,

112 changes: 94 additions & 18 deletions helix-lsp/src/lib.rs
@@ -57,6 +57,7 @@ pub enum OffsetEncoding {

pub mod util {
use super::*;
use helix_core::line_ending::{line_end_byte_index, line_end_char_index};
use helix_core::{diagnostic::NumberOrString, Range, Rope, Selection, Tendril, Transaction};

/// Converts a diagnostic in the document to [`lsp::Diagnostic`].
@@ -117,7 +118,7 @@

/// Converts [`lsp::Position`] to a position in the document.
///
-    /// Returns `None` if position exceeds document length or an operation overflows.
+    /// Returns `None` if `position.line` is out of bounds or an overflow occurs.
pub fn lsp_pos_to_pos(
doc: &Rope,
pos: lsp::Position,
@@ -128,22 +129,58 @@ pub mod util {
return None;
}

-        match offset_encoding {
+        // We need to be careful here to fully comply with the LSP spec.
+        // Two relevant quotes from the spec:
+        //
+        // https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#position
+        // > If the character value is greater than the line length it defaults back
+        // > to the line length.
+        //
+        // https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
+        // > To ensure that both client and server split the string into the same
+        // > line representation the protocol specifies the following end-of-line sequences:
+        // > '\n', '\r\n' and '\r'. Positions are line end character agnostic.
+        // > So you can not specify a position that denotes \r|\n or \n| where | represents the character offset.
+        //
+        // This means that while the line must be in bounds, the `character`
+        // must be capped to the end of the line.
+        // Note that the end of the line here is **before** the line terminator,
+        // so we must use `line_end_char_index` instead of `doc.line_to_char(pos_line + 1)`.
+        //
+        // FIXME: Helix does not fully comply with the LSP spec for line terminators.
+        // The LSP standard requires that line terminators are ['\n', '\r\n', '\r'].
+        // Without the unicode-linebreak feature, the '\r' terminator is not handled by helix.
+        // With the unicode-linebreak feature, helix recognizes multiple extra line break chars,
+        // which means that positions will be decoded/encoded incorrectly in their presence.
+
+        let line = match offset_encoding {
             OffsetEncoding::Utf8 => {
-                let line = doc.line_to_char(pos_line);
-                let pos = line.checked_add(pos.character as usize)?;
-                if pos <= doc.len_chars() {
-                    Some(pos)
-                } else {
-                    None
-                }
+                let line_start = doc.line_to_byte(pos_line);
+                let line_end = line_end_byte_index(&doc.slice(..), pos_line);
+                line_start..line_end
             }
             OffsetEncoding::Utf16 => {
-                let line = doc.line_to_char(pos_line);
-                let line_start = doc.char_to_utf16_cu(line);
-                let pos = line_start.checked_add(pos.character as usize)?;
-                doc.try_utf16_cu_to_char(pos).ok()
+                // TODO: directly translate the line index to a char idx.
+                // Ropey can do this just as easily as UTF-8 byte translation,
+                // but the functions are just missing.
+                // Translate to char first and then UTF-16 as a workaround.
+                let line_start = doc.line_to_char(pos_line);
+                let line_end = line_end_char_index(&doc.slice(..), pos_line);
+                doc.char_to_utf16_cu(line_start)..doc.char_to_utf16_cu(line_end)
             }
         };

+        // The LSP spec demands that the offset is capped to the end of the line
+        let pos = line
+            .start
+            .checked_add(pos.character as usize)
+            .unwrap_or(line.end)
+            .min(line.end);
+
+        // TODO: prefer UTF-32/char indices to avoid this step
+        match offset_encoding {
+            OffsetEncoding::Utf8 => doc.try_byte_to_char(pos).ok(),
+            OffsetEncoding::Utf16 => doc.try_utf16_cu_to_char(pos).ok(),
+        }
}
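A stand-alone sketch of the capping rule the hunk above implements, assuming UTF-8 (byte) offsets over a plain `&str` (`lsp_pos_to_offset` is a hypothetical helper; the real code works on a `Rope`, handles UTF-16 too, and uses ropey's line accounting, so edge cases around a missing trailing newline may differ):

```rust
// An out-of-bounds `line` is an error (None); an oversized `character`
// clamps to the line length, measured *before* the line terminator.
fn lsp_pos_to_offset(text: &str, line: usize, character: usize) -> Option<usize> {
    let mut lines = text.split_inclusive('\n');
    let mut line_start = 0;
    for _ in 0..line {
        // `None` here means `line` is past the end of the document.
        line_start += lines.next()?.len();
    }
    let rest = lines.next().unwrap_or("");
    // Line length before the terminator, per the spec.
    let line_len = rest.trim_end_matches(&['\r', '\n'][..]).len();
    Some(line_start + character.min(line_len))
}

fn main() {
    // Mirrors the updated test cases: character 5 on "case" caps to 4.
    println!("{:?}", lsp_pos_to_offset("test\n\n\n\ncase", 4, 5)); // Some(12)
    println!("{:?}", lsp_pos_to_offset("", 1, 0)); // None
}
```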

@@ -158,8 +195,8 @@ pub mod util {
match offset_encoding {
OffsetEncoding::Utf8 => {
let line = doc.char_to_line(pos);
-                let line_start = doc.line_to_char(line);
-                let col = pos - line_start;
+                let line_start = doc.line_to_byte(line);
+                let col = doc.char_to_byte(pos) - line_start;

lsp::Position::new(line as u32, col as u32)
}
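The effect of the fixed arm can be sketched with a hypothetical stand-alone helper (plain `&str`, not a helix API) that reports byte columns rather than char columns:

```rust
// Convert a char index into an LSP (line, character) pair where
// `character` counts UTF-8 *bytes*, mirroring the corrected arm above.
fn char_to_lsp_utf8(text: &str, char_idx: usize) -> (u32, u32) {
    // Byte offset corresponding to the char index (or end of text).
    let byte = text
        .char_indices()
        .nth(char_idx)
        .map_or(text.len(), |(b, _)| b);
    let before = &text[..byte];
    let line = before.matches('\n').count();
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    (line as u32, (byte - line_start) as u32)
}

fn main() {
    // 'é' is 2 bytes, so char index 2 sits at byte column 3.
    println!("{:?}", char_to_lsp_utf8("aé\nb", 2)); // (0, 3)
}
```

For `"aé\nb"` at char index 2 this reports column 3 (the byte offset), where the old code would have reported the char offset 2 — the accidental-UTF-32 behavior discussed above.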
@@ -606,16 +643,55 @@ mod tests {
}

     test_case!("", (0, 0) => Some(0));
-    test_case!("", (0, 1) => None);
+    test_case!("", (0, 1) => Some(0));
     test_case!("", (1, 0) => None);
     test_case!("\n\n", (0, 0) => Some(0));
     test_case!("\n\n", (1, 0) => Some(1));
-    test_case!("\n\n", (1, 1) => Some(2));
+    test_case!("\n\n", (1, 1) => Some(1));
     test_case!("\n\n", (2, 0) => Some(2));
     test_case!("\n\n", (3, 0) => None);
     test_case!("test\n\n\n\ncase", (4, 3) => Some(11));
     test_case!("test\n\n\n\ncase", (4, 4) => Some(12));
-    test_case!("test\n\n\n\ncase", (4, 5) => None);
+    test_case!("test\n\n\n\ncase", (4, 5) => Some(12));
Comment on lines +646 to +655 — @pascalkuthe (Member, Author):

The changes in these tests reflect that the offset given by `Position::character` is now capped to the line length (before the line terminator).

test_case!("", (u32::MAX, u32::MAX) => None);
}

#[test]
fn emoji_format_gh_4791() {
    use lsp_types::{Position, Range, TextEdit};

    let edits = vec![
        TextEdit {
            range: Range {
                start: Position { line: 0, character: 1 },
                end: Position { line: 1, character: 0 },
            },
            new_text: "\n ".to_string(),
        },
        TextEdit {
            range: Range {
                start: Position { line: 1, character: 7 },
                end: Position { line: 2, character: 0 },
            },
            new_text: "\n ".to_string(),
        },
    ];

    let mut source = Rope::from_str("[\n\"🇺🇸\",\n\"🎄\",\n]");

    let transaction = generate_transaction_from_edits(&source, edits, OffsetEncoding::Utf8);
    assert!(transaction.apply(&mut source));
}
}
}