Counter utf-16 characters #284

Closed

mattn
Collaborator

@mattn mattn commented Feb 7, 2019

Fixes #282

@mattn
Collaborator Author

mattn commented Feb 7, 2019

Benchmark

function! s:count_utf16_code_units(str) abort
  let l:len = strchars(a:str)
  let l:i = 0
  let l:cnt = 0

  while l:i < l:len
    let l:chr = strcharpart(a:str, l:i, 1)
    if char2nr(l:chr) >= 0x10000
      let l:cnt = l:cnt + 2
    else
      let l:cnt = l:cnt + 1
    endif

    let l:i = l:i + 1
  endwhile

  return l:cnt
endfunction

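function! s:my_count_utf16_code_units(str) abort
  let l:rs = split(a:str, '\zs')
  " count() compares items literally and never evaluates an expression,
  " so use filter() to select the characters that need a surrogate pair.
  return len(l:rs) + len(filter(copy(l:rs), 'char2nr(v:val) >= 0x10000'))
endfunction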

function! s:benchmark()
  let s = 'a𐐀b'
  let n = 100000

  let start = reltime()
  for i in range(n)
    call strlen(s)
  endfor
  echo printf('strlen: %f', reltimefloat(reltime(start)))

  let start = reltime()
  for i in range(n)
    call s:count_utf16_code_units(s)
  endfor
  echo printf('s:count_utf16_code_units: %f', reltimefloat(reltime(start)))

  let start = reltime()
  for i in range(n)
    call s:my_count_utf16_code_units(s)
  endfor
  echo printf('s:my_count_utf16_code_units: %f', reltimefloat(reltime(start)))
endfunction

call s:benchmark()
strlen: 0.229879
s:count_utf16_code_units: 4.050134
s:my_count_utf16_code_units: 1.107014

@mattn mattn changed the title from "Counter utf-8 characters" to "Counter utf-16 characters" Feb 7, 2019

@natebosch natebosch left a comment

When I encountered this I also needed to replace all occurrences of str[start:end] with strcharpart(). Was that not necessary here?
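
To illustrate the point about slicing, here is a minimal sketch of how byte-indexed slicing and strcharpart() differ on the sample string used in the benchmark. It assumes 'encoding' is utf-8; g:sketch_text is a hypothetical variable name used only for this illustration.

```vim
" Sketch only; assumes 'encoding' is utf-8.
let g:sketch_text = 'a𐐀b'

" strlen() counts bytes: 1 + 4 + 1.
echo strlen(g:sketch_text)
" strchars() counts characters.
echo strchars(g:sketch_text)

" str[start:end] indexes bytes, so [1:1] is only the first byte of 𐐀
" and the comparison is false (0).
echo g:sketch_text[1:1] ==# '𐐀'
" All four bytes are needed to get the whole character back (1).
echo g:sketch_text[1:4] ==# '𐐀'

" strcharpart() indexes characters, so one character at index 1 is 𐐀 (1).
echo strcharpart(g:sketch_text, 1, 1) ==# '𐐀'
```

Any code that mixes character counts with byte slices would therefore cut multibyte characters in the middle.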

autoload/lsp/utils/diff.vim (outdated)
@mattn
Collaborator Author

mattn commented Feb 8, 2019

It is not necessary to replace every strlen call with this function. It should be used only for calculating offsets.
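
As a rough sketch of what calculating an offset could look like: count UTF-16 code units only in the text before a byte column. The function names LspSketchUtf16Len and LspSketchCharacterOffset are hypothetical and are not part of vim-lsp. The sketch assumes 'encoding' is utf-8 and does not treat combining characters separately.

```vim
" Sketch only: hypothetical names, not functions from vim-lsp.
" Assumes 'encoding' is utf-8.

" Count UTF-16 code units: one per character, plus one extra for each
" character with a code point >= 0x10000 (encoded as a surrogate pair).
function! LspSketchUtf16Len(str) abort
  let l:chars = split(a:str, '\zs')
  let l:astral = filter(copy(l:chars), 'char2nr(v:val) >= 0x10000')
  return len(l:chars) + len(l:astral)
endfunction

" Convert a 1-based byte column (as returned by col()) into a 0-based
" UTF-16 offset by measuring only the text before that column.
function! LspSketchCharacterOffset(line, bytecol) abort
  return LspSketchUtf16Len(strpart(a:line, 0, a:bytecol - 1))
endfunction

" With the cursor on 'b' (byte column 6), the text before it is 'a𐐀'.
echo LspSketchCharacterOffset('a𐐀b', 6)
```

Plain strlen() stays appropriate wherever Vim itself expects byte lengths; only values sent to or received from the server need this conversion.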

@mattn
Collaborator Author

mattn commented Feb 8, 2019

function! s:count_utf16_code_units(str) abort
  let l:len = strchars(a:str)
  let l:i = 0
  let l:cnt = 0

  while l:i < l:len
    let l:chr = strcharpart(a:str, l:i, 1)
    if char2nr(l:chr) >= 0x10000
      let l:cnt = l:cnt + 2
    else
      let l:cnt = l:cnt + 1
    endif

    let l:i = l:i + 1
  endwhile

  return l:cnt
endfunction

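function! s:my_count_utf16_code_units(str) abort
  let l:rs = split(a:str, '\zs')
  " count() compares items literally and never evaluates an expression,
  " so use filter() to select the characters that need a surrogate pair.
  return len(l:rs) + len(filter(copy(l:rs), 'char2nr(v:val) >= 0x10000'))
endfunction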

function! s:benchmark(name, n, case)
  let start = reltime()
  for i in range(a:n)
    call strlen(a:case)
  endfor
  echo printf('%s: strlen: %f', a:name, reltimefloat(reltime(start)))

  let start = reltime()
  for i in range(a:n)
    call s:count_utf16_code_units(a:case)
  endfor
  echo printf('%s: s:count_utf16_code_units: %f', a:name, reltimefloat(reltime(start)))

  let start = reltime()
  for i in range(a:n)
    call s:my_count_utf16_code_units(a:case)
  endfor
  echo printf('%s: s:my_count_utf16_code_units: %f', a:name, reltimefloat(reltime(start)))
endfunction

call s:benchmark('short string', 100000, 'a𐐀b')
call s:benchmark('long string', 10000, repeat('a𐐀b', 200))
short string: strlen: 0.249764
short string: s:count_utf16_code_units: 4.046245
short string: s:my_count_utf16_code_units: 1.054177
long string: strlen: 0.025720
long string: s:count_utf16_code_units: 60.834453
long string: s:my_count_utf16_code_units: 4.178993

@prabirshrestha
Owner

There are some interesting approaches and comments under discussion at microsoft/language-server-protocol#376 (comment)

@Avi-D-coder

Is vim-lsp currently using UTF-8/bytes, codepoints or grapheme clusters?

@prabirshrestha
Owner

There are now web workers in ducktape vim. Perf-heavy things could go there :) bobpepin/vim#1

@prabirshrestha
Owner

@mattn Any updates on this? One option I can think of is having this under a feature flag turned off by default.

@mattn
Collaborator Author

mattn commented Jul 5, 2019

Sorry for the delay, I'll fix it soon.

if g:lsp_use_utf16
  return lsp#utils#strlen(getline(a:m)[:col(a:m)])
endif
return a:m
Collaborator

I ran into strange Vim behavior when trying this branch.

In insert mode, col(a:m) does not match lsp#utils#strlen(getline(a:m)[:col(a:m)]).

In my environment, the following code fixed the mismatch.

function! lsp#utils#col(m) abort
    if g:lsp_use_utf16
        let col = lsp#utils#strlen(getline(a:m)[:col(a:m) - 1])

        if mode() ==# 'i' && col(a:m) == col('$')
            let col = col + 1
        endif

        return col
    endif
    return col(a:m)
endfunction

(This branch is a great help to me. Thanks, mattn.)

@clason
Contributor

clason commented Jul 24, 2019

@mattn Unfortunately, this PR doesn't help with the simple example from #425 (not even with @hrsh7th's change). Am I missing something?

@hrsh7th
Collaborator

hrsh7th commented Jul 25, 2019

@clason I misunderstood. My environment had another patch applied (a port of vim-lsc's diff logic), which may be what solves the UTF-8 character problems. The UTF-16 character problem still remains after applying this patch.
I'm sorry, I should have learned about character encodings and checked my environment before commenting...

@mattn
Collaborator Author

mattn commented Jul 25, 2019

If someone has already found the problem with this patch, please point it out. Or if you already have another patch, please send a new PR.

@clason
Contributor

clason commented Jul 25, 2019

@hrsh7th Could you share your "other patch"? If it works for UTF-8, that'd already be progress.

@hrsh7th
Collaborator

hrsh7th commented Jul 25, 2019

I opened a PR (#447).

It includes the other patch (a port of vim-lsc's diff logic).

@clason Please test #447.

Sorry for the confusion...

@mattn
Collaborator Author

mattn commented Jul 26, 2019

Fixed by #447

@mattn mattn closed this Jul 26, 2019
@mattn mattn deleted the count-utf16 branch July 26, 2019 07:46
Successfully merging this pull request may close these issues.

A problem for editing Non-ASCII characters
7 participants