Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search Addon: Fix length calculation of wide unicode chars #3236

Merged
merged 9 commits into from
Dec 22, 2021

Conversation

gera2ld
Copy link
Contributor

@gera2ld gera2ld commented Jan 29, 2021

Fix #1686

@gera2ld gera2ld force-pushed the master branch 2 times, most recently from 81cdaca to a57ee6b Compare January 29, 2021 06:05
Copy link
Member

@jerch jerch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for looking into this.

Some remarks from my side:

  • Could you move the changes to a separate branch? That would make testing of the PR alot easier (it is currently on master 😨).
  • Imho this still does not work correctly, repro:
    • input ¥¥¥hello ¥¥¥hello##########café éé ééécafcaféééééécafcaféééééécaf𝄞𝄞𝄞𝄞𝄞𝄞𝄞𝄞𝄞𝄞𝄞ééhello into the terminal
    • search for , é and hello
    • result: not all occurances found, some selections are totally off

Imho this needs a more fundamental rewrite - one way to approach it would be to get rid of the regexp positioning at all and search only on buffer indices. This would eliminate the string to buffer position errors. It is abit more implementation work but should run fine perfwise as well.

Another way is to stick with initial regexp search and also to translate the beginning string offset into a buffer index. This is needed as only the very first cell of a new line is guaranteed to have the same index position in the string and the buffer (index 0). Note that this also does not apply to wrapped lines (you basically have to start the buffer index translation always from the very first char of a wrapped line, which might span several buffer rows).

@gera2ld
Copy link
Contributor Author

gera2ld commented Feb 22, 2021

Sorry for the late response.

I don't think it's a good idea to get rid of the regexp positioning because in that case we can hardly search by user provided regexp.
I've added a new method to translate buffer index to string offset so that the searching process only depends on string indices and works well with regexp.

But I noticed that searching in wrapped line does not work if the line contains special chars like emojis, because the .substring here cuts the string at terminal.cols, while such emojis occupies only one column but the string length is larger than 1:

lineString += line.translateToString(!lineWrapsToNext && trimRight).substring(0, terminal.cols);

I'm wondering if the .substring can be removed safely.

@gera2ld
Copy link
Contributor Author

gera2ld commented Feb 22, 2021

* Could you move the changes to a separate branch? That would make testing of the PR alot easier

Sorry for that, but it seems not possible to change branch name now. I can create another PR if you want.

@jerch
Copy link
Member

jerch commented Feb 22, 2021

@gera2ld Yes, removing regexp would be a big loss in flexibility, its better to find a solution that keeps working with them.

[...] I'm wondering if the .substring can be removed safely.

Hmm looking at _translateBufferLineToStringWithWrap as a whole imho it does not the right thing and needs a rewrite (and yes without substring) for several reasons:

  1. A search term might be split over several wrapped lines.
  2. A wrapped line might end earlier if the next char does not fit anymore (a wide char at term.col - 1 would flow to the next line because of taking 2 cells leaving one cell empty).
  3. Any position deduction of string index <---> buffer index has to be done via the full wcwidth translation.

Esp. the last point is tricky to get done right, as any non 1:1 mapping char will create offsets - combining chars on BMP chars will create +1 offsets on the string index, while BMP wide chars will create -1 offsets (relative to buffer index). The only fixpoint in both metrics we have are real new lines at index 0 (lines that have no following isWrapped flag set).

Idea to revamp _translateBufferLineToStringWithWrap:

  • Whenever a new line is found, consume all follow-up wrapped lines into one single (unwrapped) string. Thats pretty much what it does atm, but without substring. This will also correctly deal with 1. above.
  • Special case early wrapping wide char (2. above): If the first char on the next line is a wide char and the current line is only term.cols - 1 in length, treat it as early wrapping - ignore the empty cell, otherwise count empty cell as whitespace. This leads to a faulty handling, if the empty cell was indeed meant to stay empty. We already had an issue report about that, but sadly we have currently no way to track that special edge case.
  • Whenever some match was found in the full (unwrapped) string, translate all string positions (start and end index) back by walking buffer positions + wcwidth continuing from the last saved match buffer position (minor optimization, if it was in that string) or from the very first char of the whole line. This "buffer replay" will ensure 3. above.

Sorry for that, but it seems not possible to change branch name now. I can create another PR if you want.

Hmm I think changing a PRs source branch is not possible. Guess creating a new one is the only option.

@gera2ld
Copy link
Contributor Author

gera2ld commented Feb 23, 2021

Update:

  • the string offsets of the first char in each buffer row as well as the string of the whole line is cached now, so the buffer column can be calculated easily from the beginning of the corresponding row, no matter whether it's early wrapped or for whatever reason the string length does no match the buffer columns.

@JasinYip
Copy link

Any news for this? 😢

@Tyriar
Copy link
Member

Tyriar commented Oct 25, 2021

@jerch can you take another look at this if you have the time? The issue seems to be fixed when testing and looks like the tests pass. Any other comments?

@jerch
Copy link
Member

jerch commented Oct 25, 2021

@Tyriar Yes, will see, if I can look at it this week. The problem I had was the PR pulling because of its branch name, but I can do the testing on a separate repo clone.

@jerch jerch self-assigned this Oct 25, 2021
@jerch jerch removed their assignment Nov 16, 2021
@jerch
Copy link
Member

jerch commented Nov 16, 2021

Please review someone else, as it seems I wont get down to it.

@Tyriar Tyriar changed the title fix length calculation of wide unicode chars Search Addon: Fix length calculation of wide unicode chars Dec 22, 2021
Copy link
Member

@Tyriar Tyriar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gera2ld and sorry about how long this took to merge in. I made a few minor doc related changes but since the tests pass and changes seem reasonable lgtm.

@Tyriar Tyriar added this to the 4.16.0 milestone Dec 22, 2021
@Tyriar Tyriar merged commit c22b3a9 into xtermjs:master Dec 22, 2021
@JasinYip
Copy link

This is the best news for me for Christmas and the New year, thank you all you guys fix this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Selection with search and unicode
5 participants