properly handle LSP position encoding #5711
@@ -104,7 +104,7 @@ impl Client {
             server_tx,
             request_counter: AtomicU64::new(0),
             capabilities: OnceCell::new(),
-            offset_encoding: OffsetEncoding::Utf8,
+            offset_encoding: OffsetEncoding::Utf16,
This is the core fix. We hardcode the encoding right now, so only UTF-16 is used.

Wasn't the idea here that we ask for UTF-8 instead of UTF-16 if possible? Based on the clangd negotiation proposal that was merged into the spec. We should prefer UTF-8 if it can be negotiated, then UTF-16 as a fallback.

Yeah, exactly, but that proposal is only available in LSP 3.17, which isn't supported yet. Once it is, we can negotiate the encoding and prefer UTF-8.

Also, from a performance standpoint, UTF-32 is actually faster, as we use char indices. Whether we convert UTF-8 or UTF-16 to UTF-32 (char index) with ropey makes little difference. Ropey does some additional counting, but we need to traverse the chunk (so at most 4 KB) in either case. The UTF-8, UTF-16 and UTF-32 positions at the start of each chunk are always maintained inside the rope anyway.

So actually, the spec mandates UTF-16 (thanks, Microsoft...), but a lot of implementations do it wrong and just use UTF-8 offsets.

Hmm, that is unfortunate, but switching to UTF-16 can't be causing too many problems, as right now on master our UTF-8 implementation is actually a UTF-32 implementation by accident (we treat the `character` offset as a char index). I think we should probably stick to the standard. We could add an option to languages.toml to work around incorrect language servers. Just to clarify: this is just about the offset encoding the client assumes by default.

I am pretty sure that most language servers support UTF-16 as a fallback though. For example, for clangd: https://clangd.llvm.org/extensions.html#utf-8-offsets

Otherwise these language servers would not work with VSCode at all (which always uses UTF-16 AFAIK). A language server that doesn't work with VSCode seems like an exception to me.
             config,
             req_timeout,
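As an editorial illustration of the negotiation discussed in the thread above: under LSP 3.17 the server reports a position encoding in its capabilities, and a client can prefer UTF-8 while falling back to UTF-16 when nothing better is offered. The sketch below is hypothetical — the local `OffsetEncoding` enum, `negotiate_offset_encoding`, and the `server_encoding` string are illustrative stand-ins, not helix or lsp-types API.

// Hypothetical sketch (not helix code): pick the best encoding the server offers,
// preferring UTF-8, with UTF-16 as the spec-mandated fallback.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum OffsetEncoding {
    Utf8,
    Utf16,
}

fn negotiate_offset_encoding(server_encoding: Option<&str>) -> OffsetEncoding {
    match server_encoding {
        // Use UTF-8 byte offsets when the server advertises them (e.g. clangd).
        Some("utf-8") => OffsetEncoding::Utf8,
        // Otherwise fall back to UTF-16, the only encoding every server must support.
        _ => OffsetEncoding::Utf16,
    }
}

fn main() {
    assert_eq!(negotiate_offset_encoding(Some("utf-8")), OffsetEncoding::Utf8);
    assert_eq!(negotiate_offset_encoding(Some("utf-16")), OffsetEncoding::Utf16);
    assert_eq!(negotiate_offset_encoding(None), OffsetEncoding::Utf16);
}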
@@ -57,6 +57,7 @@ pub enum OffsetEncoding {

 pub mod util {
     use super::*;
+    use helix_core::line_ending::{line_end_byte_index, line_end_char_index};
     use helix_core::{diagnostic::NumberOrString, Range, Rope, Selection, Tendril, Transaction};

     /// Converts a diagnostic in the document to [`lsp::Diagnostic`].
@@ -117,7 +118,7 @@ pub mod util {

     /// Converts [`lsp::Position`] to a position in the document.
     ///
-    /// Returns `None` if position exceeds document length or an operation overflows.
+    /// Returns `None` if position.line is out of bounds or an overflow occurs
     pub fn lsp_pos_to_pos(
         doc: &Rope,
         pos: lsp::Position,
@@ -128,22 +129,58 @@ pub mod util {
             return None;
         }

-        match offset_encoding {
+        // We need to be careful here to fully comply with the LSP spec.
+        // Two relevant quotes from the spec:
+        //
+        // https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#position
+        // > If the character value is greater than the line length it defaults back
+        // > to the line length.
+        //
+        // https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
+        // > To ensure that both client and server split the string into the same
+        // > line representation the protocol specifies the following end-of-line sequences:
+        // > ‘\n’, ‘\r\n’ and ‘\r’. Positions are line end character agnostic.
+        // > So you can not specify a position that denotes \r|\n or \n| where | represents the character offset.
+        //
+        // This means that while the line must be in bounds, the `character`
+        // must be capped to the end of the line.
+        // Note that the end of the line here is **before** the line terminator,
+        // so we must use `line_end_char_index` instead of `doc.line_to_char(pos_line + 1)`.
+        //
+        // FIXME: Helix does not fully comply with the LSP spec for line terminators.
+        // The LSP standard requires that line terminators are ['\n', '\r\n', '\r'].
+        // With the unicode-linebreak feature disabled, the `\r` terminator is not handled by helix.
+        // With the unicode-linebreak feature enabled, helix recognizes multiple extra line break chars,
+        // which means that positions will be decoded/encoded incorrectly in their presence.
+        let line = match offset_encoding {
             OffsetEncoding::Utf8 => {
-                let line = doc.line_to_char(pos_line);
-                let pos = line.checked_add(pos.character as usize)?;
-                if pos <= doc.len_chars() {
-                    Some(pos)
-                } else {
-                    None
-                }
+                let line_start = doc.line_to_byte(pos_line);
+                let line_end = line_end_byte_index(&doc.slice(..), pos_line);
+                line_start..line_end
             }
             OffsetEncoding::Utf16 => {
-                let line = doc.line_to_char(pos_line);
-                let line_start = doc.char_to_utf16_cu(line);
-                let pos = line_start.checked_add(pos.character as usize)?;
-                doc.try_utf16_cu_to_char(pos).ok()
+                // TODO directly translate line index to char-idx
+                // ropey can do this just as easily as utf-8 byte translation
+                // but the functions are just missing.
+                // Translate to char first and then utf-16 as a workaround
+                let line_start = doc.line_to_char(pos_line);
+                let line_end = line_end_char_index(&doc.slice(..), pos_line);
+                doc.char_to_utf16_cu(line_start)..doc.char_to_utf16_cu(line_end)
             }
-        }
+        };
+
+        // The LSP spec demands that the offset is capped to the end of the line
+        let pos = line
+            .start
+            .checked_add(pos.character as usize)
+            .unwrap_or(line.end)
+            .min(line.end);
+
+        // TODO prefer UTF32/char indices to avoid this step
+        match offset_encoding {
+            OffsetEncoding::Utf8 => doc.try_byte_to_char(pos).ok(),
+            OffsetEncoding::Utf16 => doc.try_utf16_cu_to_char(pos).ok(),
+        }
     }
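To make the clamping behaviour above concrete, here is a small standalone sketch that mirrors the same idea using only ropey. It is not the helix-lsp implementation; `utf16_lsp_pos_to_char` and its arguments are illustrative stand-ins for `lsp_pos_to_pos` with `OffsetEncoding::Utf16`.

use ropey::Rope;

fn utf16_lsp_pos_to_char(doc: &Rope, lsp_line: usize, lsp_character: usize) -> Option<usize> {
    // A line past the end of the document is rejected outright.
    if lsp_line >= doc.len_lines() {
        return None;
    }
    let line_start_char = doc.line_to_char(lsp_line);
    let line_end_char = line_start_char + doc.line(lsp_line).len_chars();
    // Step back over a trailing line terminator so the line end sits *before* it.
    let line_text = doc.line(lsp_line).to_string();
    let term_chars = if line_text.ends_with("\r\n") {
        2
    } else if line_text.ends_with('\n') || line_text.ends_with('\r') {
        1
    } else {
        0
    };
    let start_cu = doc.char_to_utf16_cu(line_start_char);
    let end_cu = doc.char_to_utf16_cu(line_end_char - term_chars);
    // Per the spec, an out-of-range `character` is clamped to the line end, not rejected.
    let pos_cu = start_cu
        .checked_add(lsp_character)
        .unwrap_or(end_cu)
        .min(end_cu);
    doc.try_utf16_cu_to_char(pos_cu).ok()
}

fn main() {
    let doc = Rope::from_str("a😀b\n");
    // '😀' is two UTF-16 code units, so character 3 lands on 'b' (char index 2) ...
    assert_eq!(utf16_lsp_pos_to_char(&doc, 0, 3), Some(2));
    // ... and an oversized column clamps to just before the '\n' (char index 3).
    assert_eq!(utf16_lsp_pos_to_char(&doc, 0, 100), Some(3));
}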
@@ -158,8 +195,8 @@ pub mod util {
         match offset_encoding {
             OffsetEncoding::Utf8 => {
                 let line = doc.char_to_line(pos);
-                let line_start = doc.line_to_char(line);
-                let col = pos - line_start;
+                let line_start = doc.line_to_byte(line);
+                let col = doc.char_to_byte(pos) - line_start;

                 lsp::Position::new(line as u32, col as u32)
             }
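A short editorial illustration of why the column computation in the hunk above changes: for `OffsetEncoding::Utf8` the LSP `character` must count bytes within the line, so a multi-byte character shifts the column. The sketch below uses only ropey; `utf8_line_col` is a hypothetical stand-in for the UTF-8 branch of the reverse conversion shown above, not helix code.

use ropey::Rope;

// Illustrative stand-in: the LSP `character` for UTF-8 is a *byte* offset
// within the line, not a char offset.
fn utf8_line_col(doc: &Rope, char_pos: usize) -> (usize, usize) {
    let line = doc.char_to_line(char_pos);
    // old (buggy) column: char_pos - doc.line_to_char(line), i.e. a char offset
    // corrected column: byte distance from the start of the line
    let line_start = doc.line_to_byte(line);
    let col = doc.char_to_byte(char_pos) - line_start;
    (line, col)
}

fn main() {
    let doc = Rope::from_str("héllo\n");
    // 'é' occupies two UTF-8 bytes, so the char at index 2 ('l') sits at byte column 3,
    // while the old char-based computation would have reported column 2.
    assert_eq!(utf8_line_col(&doc, 2), (0, 3));
}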
@@ -606,16 +643,55 @@ mod tests {
         }

         test_case!("", (0, 0) => Some(0));
-        test_case!("", (0, 1) => None);
+        test_case!("", (0, 1) => Some(0));
         test_case!("", (1, 0) => None);
         test_case!("\n\n", (0, 0) => Some(0));
         test_case!("\n\n", (1, 0) => Some(1));
-        test_case!("\n\n", (1, 1) => Some(2));
+        test_case!("\n\n", (1, 1) => Some(1));
         test_case!("\n\n", (2, 0) => Some(2));
         test_case!("\n\n", (3, 0) => None);
         test_case!("test\n\n\n\ncase", (4, 3) => Some(11));
         test_case!("test\n\n\n\ncase", (4, 4) => Some(12));
-        test_case!("test\n\n\n\ncase", (4, 5) => None);
+        test_case!("test\n\n\n\ncase", (4, 5) => Some(12));
         test_case!("", (u32::MAX, u32::MAX) => None);
     }

Comment on lines +646 to +655: the changes in this test reflect that an out-of-range `character` offset is now clamped to the end of the line instead of returning `None`.
|
||
#[test] | ||
fn emoji_format_gh_4791() { | ||
use lsp_types::{Position, Range, TextEdit}; | ||
|
||
let edits = vec![ | ||
TextEdit { | ||
range: Range { | ||
start: Position { | ||
line: 0, | ||
character: 1, | ||
}, | ||
end: Position { | ||
line: 1, | ||
character: 0, | ||
}, | ||
}, | ||
new_text: "\n ".to_string(), | ||
}, | ||
TextEdit { | ||
range: Range { | ||
start: Position { | ||
line: 1, | ||
character: 7, | ||
}, | ||
end: Position { | ||
line: 2, | ||
character: 0, | ||
}, | ||
}, | ||
new_text: "\n ".to_string(), | ||
}, | ||
]; | ||
|
||
let mut source = Rope::from_str("[\n\"🇺🇸\",\n\"🎄\",\n]"); | ||
|
||
let transaction = generate_transaction_from_edits(&source, edits, OffsetEncoding::Utf8); | ||
assert!(transaction.apply(&mut source)); | ||
} | ||
} |
Is there any risk of going over the line count here with the `+ 1`?

Hmm, this is something I didn't consider and a good point. I just copied the `line_end_char_idx` function and converted it to return a byte index. For this specific use case it's fine, since ropey allows an index one past the end but LSP doesn't (and that's enforced at the call site of this function at the start of `lsp_pos_to_pos`, so any newly added calls here can never lead to an out-of-bounds panic). We should probably check all existing call sites of `line_end_char_idx`, but that would be separate from this PR, as it's an existing function.
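For readers following this exchange, a helper in the spirit of `line_end_byte_index` looks roughly like the hypothetical sketch below (a re-creation for illustration, not the helix-core source). The `+ 1` asks ropey for the start of the *next* line, which ropey accepts even for the index one past the last line, so callers only need `line < len_lines()` — exactly the bound that `lsp_pos_to_pos` checks before calling it.

use ropey::RopeSlice;

// Hypothetical re-creation (not the helix-core implementation) of a
// `line_end_byte_index`-style helper, to show where the `+ 1` comes in.
fn line_end_byte_index_sketch(slice: &RopeSlice, line: usize) -> usize {
    // Start of the next line; ropey allows `line + 1 == slice.len_lines()`
    // (one past the last line), so callers only need `line < slice.len_lines()`.
    let next_line_start = slice.line_to_byte(line + 1);
    // Step back over the terminator so the index points *before* '\n' / "\r\n".
    let text = slice.line(line).to_string();
    if text.ends_with("\r\n") {
        next_line_start - 2
    } else if text.ends_with('\n') {
        next_line_start - 1
    } else {
        next_line_start
    }
}

fn main() {
    let doc = ropey::Rope::from_str("ab\ncd");
    assert_eq!(line_end_byte_index_sketch(&doc.slice(..), 0), 2); // before the '\n'
    assert_eq!(line_end_byte_index_sketch(&doc.slice(..), 1), 5); // last line, no terminator
}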