Clarify measurement unit in CompletionsRequest.column #285
The base should be determined by the initial handshake. I believe column numbers, here and throughout the protocol, should be treated as UTF-8 code units. @weinand @roblourens thoughts?
There is no such concept as "UTF-8 code points". There is only a Unicode code point, which is an integer in the range 0 to 0x10_FFFF. Such a code point has a name and a plethora of other properties. It does not have an encoding, though. UTF-8 is one possible encoding; it encodes a single Unicode code point as a sequence of numbers in the range 0 to 0xFF. Each Unicode encoding is based on a code unit: an 8-bit byte for UTF-8, a 16-bit integer for UTF-16, and a 32-bit integer for UTF-32.
Mixing the abstract numbers with their encoding doesn't make sense. Instead, choose an explicit measurement unit, e.g. Unicode code points, UTF-8 code units, or UTF-16 code units.
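To make the difference between these units concrete, here is a small Python sketch (the sample string is my own; U+1D11E lies outside the BMP, so UTF-16 needs a surrogate pair for it):

```python
# One string, three answers, depending on the measurement unit chosen.
line = "a\U0001D11Eb"  # "a", MUSICAL SYMBOL G CLEF (U+1D11E), "b"

codepoints = len(line)                            # Python strings count code points
utf8_units = len(line.encode("utf-8"))            # UTF-8 code units are bytes
utf16_units = len(line.encode("utf-16-le")) // 2  # each UTF-16 code unit is 2 bytes

print(codepoints, utf8_units, utf16_units)  # 3 6 4
```

A column pointing just past the G clef would thus be 2, 5, or 3 depending on which unit the protocol means.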
Thanks for the correction.
@connor4312 we should use the same as LSP.
On LSP:
I suspect that implementors are doing things differently here. It looks like VS Code is using UTF-16 columns in the REPL at least, judging by completion behavior. Encoding is relevant since, as in this case, an individual character might span more than a single "column" depending on its width and the encoding.
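Display width is yet another axis, independent of all three code-unit counts. A hedged Python illustration (U+6F22 is an arbitrary example of an East Asian wide character):

```python
import unicodedata

ch = "\u6f22"  # 漢: one code point, one UTF-16 code unit, three UTF-8 bytes
print(len(ch), len(ch.encode("utf-16-le")) // 2, len(ch.encode("utf-8")))  # 1 1 3

# Most terminals nevertheless render it two cells wide:
print(unicodedata.east_asian_width(ch))  # "W" == East Asian Wide
```

So "column" as a visual concept and "column" as a string offset can disagree in both directions.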
Doing what LSP does is super controversial. And not ideal.
vscode's implementation would be however the editor describes columns, which I think would be using UTF-16 code units since that's what you get in JS. Maybe we need this in the spec.
If we're okay spec'ing something different than what VS Code currently implements in DAP, I would prefer to normalize on UTF-8 units as a more 'modern' and conceptually simpler standard.
If we're going to change it, we would have to add the same infrastructure for a client/server agreeing on what to use, and that doesn't seem worth it.
In the context of the Gedankenexperiment: instead of using a numeric column, the request could carry the text in front of the cursor position directly.
That's an elegant approach. For further symmetry, it might be possible to split the text into the parts before and after the cursor position.
That could help for completions, but every request that talks about columns could still have the confusion, right?
Of course, the "thought experiment" was not meant as a suggestion to change "column" to "prefixText" (because that would be a breaking change and we would have to introduce new "capabilities"). I tried to explain that if a string is passed instead of a numeric position, the measurement-unit question disappears.
Long story short, we've discussed this before and it was stated that it was UTF-16 code units: I asked this before and @weinand said it's the same as LSP, which is UTF-16 code units: #91 (comment). I raised this at the time: puremourning/vimspector#96. Fortunately, I haven't actually implemented vimspector/issue/96, so I'm open to changing it, but obviously I can't speak for other clients.
Kinda, if and only if we can define what
A practical approach is indeed to use the number of codepoints in some specific encoding. As the LSP maintainers have found, this quickly becomes a religious debate. The challenge of specifying it in terms of codepoints is that the whole thing gets complicated when you look at combining marks, grapheme clusters and oh my. My personal preference is codepoints because my client happens to be written in Python, but "utf-8 code units" (byte offset into utf-8 encoded version of the line) seems a popular choice (modulo loud dissenters).
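The "byte offset into the UTF-8 encoded version of the line" can be derived from an in-memory string without hand-rolling any encoder. A Python sketch (the function name `utf8_column` is my own, not from any DAP library):

```python
def utf8_column(line: str, codepoint_index: int) -> int:
    """Byte offset of `codepoint_index` in the UTF-8 encoding of `line`."""
    # Encode only the prefix; its byte length is the UTF-8 column.
    return len(line[:codepoint_index].encode("utf-8"))

# "é" occupies two UTF-8 code units, so the byte column runs ahead
# of the code point index:
print(utf8_column("héllo", 2))  # 3
```

The inverse direction (byte column to string index) is where clients must be careful not to land in the middle of a multi-byte sequence.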
@puremourning, as pointed out by @rillig above, there is no concept of an "encoding-specific code point".
Yes, 'Number of code units in some specific encoding' is what I should have written.
Since the issue at hand asks to "Clarify measurement unit in CompletionsRequest.column", I first did an assessment of the status quo.

Assessment:

The description of CompletionsRequest.column speaks of a character position but does not name a measurement unit. Since there is no mention of "encodings" or "code units" anywhere in the DAP, we can assume that the original intent of "character" was a high-level concept like "visible character". In Unicode terminology this would probably translate to "code points" or to the even more abstract "grapheme clusters". However, real-world DAP implementations (both clients and debug adapters) don't implement this high-level concept. Instead, many of them are written in programming languages (JS, TS, C#, Java) where in-memory strings are ordered sequences of 16-bit unsigned integer values (UTF-16 code units). It is therefore natural that these implementations interpret "character positions" as offsets into sequences of UTF-16 code units.

Please note that UTF-16 is only the in-memory encoding (which is what matters when implementing DAP). Sending DAP JSON over a communication mechanism (or storing it on disk) will most likely use UTF-8, but that decoding is typically done on a different layer and yields in-memory strings in the native encoding of the underlying programming language (e.g. UTF-16).

Bottom line:

Today the "de facto" measurement unit of "column" properties in DAP is UTF-16 code units.

Proposal:

The clarification need is not confined to CompletionsRequest.column; it applies to all "column" properties throughout the protocol. The documentation for these DAP elements will be updated to reflect the "Bottom line" statement above. Since the measurement unit of "column" properties was not clearly specified before, it is unclear whether existing clients or debug adapters need to update their implementations. No implementation changes are planned for VS Code and its embedded js-debug. Whether a configurable measurement unit for "column" properties is needed can be discussed in an independent feature request.

What do you think?
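If the spec settles on UTF-16 code units, a client written in a language whose strings are code-point based (e.g. Python) would translate an incoming column roughly like this (a sketch under that assumption, not code from any shipped client):

```python
def utf16_to_codepoint_index(line: str, utf16_col: int) -> int:
    """Map a column counted in UTF-16 code units to a Python string index."""
    units = 0
    for i, ch in enumerate(line):
        if units >= utf16_col:
            return i
        # Code points above the BMP take a surrogate pair (2 units) in UTF-16.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(line)

# "a" is 1 unit, the G clef (U+1D11E) is 2 units, so UTF-16 column 3
# lands on the code point at index 2 ("b"):
print(utf16_to_codepoint_index("a\U0001D11Eb", 3))  # 2
```

The conversion is linear in the line length, which is usually acceptable since DAP columns refer to single lines.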
Is the column measured in:
Is 'column' even a good word, or should it rather be 'index'?
Is this column 1-based (depending on the initial handshake), or is it always 0-based?
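On the last question: the DAP initialize request already negotiates the base via the `linesStartAt1`/`columnsStartAt1` flags, which default to true when absent. A minimal sketch of the conversion an adapter might apply (the helper name is my own):

```python
def to_wire_column(index0: int, columns_start_at_1: bool = True) -> int:
    """Convert an internal 0-based index to the negotiated wire base.

    `columns_start_at_1` mirrors the columnsStartAt1 flag from the DAP
    initialize request; per the spec it is assumed true when missing.
    """
    return index0 + 1 if columns_start_at_1 else index0

print(to_wire_column(0))         # 1
print(to_wire_column(0, False))  # 0
```

So the base is handshake-dependent, but the measurement unit is the part the handshake currently says nothing about.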