UnicodeDecodeError with chardet 3 #225

Closed
blueyed opened this issue Jun 4, 2017 · 13 comments

Comments

@blueyed
Member

blueyed commented Jun 4, 2017

The following minimal vim file will cause an error:

scriptencoding utf-8
" :purple_heart: 💜
" set list listchars=tab:»·,trail:·,eol:¬,nbsp:_,extends:❯,precedes:❮
Traceback (most recent call last):
  File "…/Vcs/vint/.venv/bin/vint", line 11, in <module>
    load_entry_point('vim-vint', 'console_scripts', 'vint')()
  File "…/Vcs/vint/vint/__init__.py", line 11, in main
    init_cli()
  File "…/Vcs/vint/vint/bootstrap.py", line 22, in init_cli
    cli.start()
  File "…/Vcs/vint/vint/linting/cli.py", line 27, in start
    violations = self._lint_all(env, config_dict)
  File "…/Vcs/vint/vint/linting/cli.py", line 120, in _lint_all
    violations += linter.lint_file(file_path)
  File "…/Vcs/vint/vint/linting/linter.py", line 106, in lint_file
    root_ast = self._parser.parse_file(path)
  File "…/Vcs/vint/vint/ast/parsing.py", line 63, in parse_file
    decoded = bytes_seq.decode(encoding)
  File "…/Vcs/vint/.venv/lib/python3.6/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 105: character maps to <undefined>

encoding_hint in parse_file from chardet.detect(bytes_seq) is: {'encoding': 'Windows-1254', 'confidence': 0.5658124254347925, 'language': 'Turkish'}.

With chardet 2.3 it is {'encoding': 'ISO-8859-2', 'confidence': 0.6680924803464797}.

They seem to have temporarily disabled ISO-8859-2, as per the README on PyPI.
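The decode failure can be reproduced without chardet or vint. The sketch below rebuilds the three-line file from this report as UTF-8 bytes and decodes it with `cp1254` (Python's codec name for Windows-1254). `SAMPLE` and `first_decode_error` are names made up for this illustration, not part of vint.

```python
# The three-line Vim script from the report, encoded as UTF-8.
SAMPLE = (
    'scriptencoding utf-8\n'
    '" :purple_heart: 💜\n'
    '" set list listchars=tab:»·,trail:·,eol:¬,nbsp:_,extends:❯,precedes:❮\n'
).encode('utf-8')


def first_decode_error(data, encoding):
    """Return the UnicodeDecodeError raised by data.decode(encoding), or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc
    return None


if __name__ == '__main__':
    # cp1254 has no character assigned to byte 0x9d, which is the second
    # byte of the UTF-8 encoding of "❯" (e2 9d af).
    print(first_decode_error(SAMPLE, 'cp1254'))
    # The file is valid UTF-8, so no error is returned here.
    print(first_decode_error(SAMPLE, 'utf-8'))
```

The byte offset reported by the exception matches the traceback above (position 105), which suggests the detected encoding, not the file content, is the problem.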

But anyway, since scriptencoding is present, vint should use it directly, and parse_file should probably fall back to utf-8 in case of errors anyway?!

b'scriptencoding' in bytes_seq could be used here for starters.
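A minimal sketch of that suggestion, assuming the order "declared `scriptencoding`, then UTF-8, then the detector's guess". The function name `decode_vim_script` and the `detect` parameter are hypothetical; `detect` stands in for a callable shaped like `chardet.detect`, which returns a dict with an `encoding` key. This is not vint's actual code.

```python
import re

# Simplification: only the full command name is matched, not the
# abbreviated forms that Vim also accepts.
_SCRIPTENCODING = re.compile(
    rb'^[ \t]*scriptencoding[ \t]+([\w.-]+)', re.MULTILINE)


def decode_vim_script(bytes_seq, detect=None):
    """Decode bytes_seq, trying the most trustworthy encoding guess first."""
    candidates = []

    # 1. An explicit declaration in the file is the strongest hint.
    match = _SCRIPTENCODING.search(bytes_seq)
    if match:
        candidates.append(match.group(1).decode('ascii'))

    # 2. UTF-8 is strict enough that a successful decode is rarely wrong.
    candidates.append('utf-8')

    # 3. A statistical guess comes last (assumed chardet-like callable).
    if detect is not None:
        guess = detect(bytes_seq).get('encoding')
        if guess:
            candidates.append(guess)

    for encoding in candidates:
        try:
            return bytes_seq.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers encoding names Python does not know.
            continue

    # Last resort: keep going with replacement characters instead of crashing.
    return bytes_seq.decode('utf-8', errors='replace')
```

With this ordering, a wrong Windows-1254 guess from the detector is never reached for a file that declares `scriptencoding utf-8` and is valid UTF-8.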

@blueyed
Member Author

blueyed commented Jun 4, 2017

For reference: chardet/chardet#128.

@blueyed
Member Author

blueyed commented Jun 5, 2017

I think vint's parse_file should split the input file into chunks at scriptencoding lines, since those are meant to specify the encoding afterwards.
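One way to picture the chunking idea: walk the file line by line and switch the active encoding whenever a `scriptencoding` line is seen. The sketch below is an illustration under assumptions, not vint's implementation: `decode_by_chunks` is a hypothetical name, only the full command name is recognised, and only ASCII-compatible encodings (where a newline byte never occurs inside a multi-byte character) are supported.

```python
import re

# Simplification: abbreviated forms of the command are not matched.
_SCRIPTENCODING_LINE = re.compile(rb'^[ \t]*scriptencoding[ \t]+([\w.-]+)')


def decode_by_chunks(bytes_seq, default='utf-8'):
    """Decode each line with the encoding declared by the latest scriptencoding."""
    encoding = default
    decoded_lines = []
    for line in bytes_seq.splitlines(keepends=True):
        # The declaration applies to what follows it, so the current line
        # is still decoded with the encoding that was active before it.
        decoded_lines.append(line.decode(encoding))
        match = _SCRIPTENCODING_LINE.match(line)
        if match:
            encoding = match.group(1).decode('ascii')
    return ''.join(decoded_lines)
```

For example, a file whose first part is UTF-8 and whose part after `scriptencoding latin1` is Latin-1 decodes cleanly, while a single whole-file decode would fail or produce mojibake for one of the two parts.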

@pixelastic

I can confirm on my side that the same example file yields similar results:

scriptencoding utf-8
" :purple_heart: 💜
" set list listchars=tab:»·,trail:·,eol:¬,nbsp:_,extends:❯,precedes:❮
$ vint example.vim
Traceback (most recent call last):
  File "/home/tim/.local/bin/vint", line 11, in <module>
    sys.exit(main())
  File "/home/tim/.local/lib/python2.7/site-packages/vint/__init__.py", line 11, in main
    init_cli()
  File "/home/tim/.local/lib/python2.7/site-packages/vint/bootstrap.py", line 22, in init_cli
    cli.start()
  File "/home/tim/.local/lib/python2.7/site-packages/vint/linting/cli.py", line 27, in start
    violations = self._lint_all(env, config_dict)
  File "/home/tim/.local/lib/python2.7/site-packages/vint/linting/cli.py", line 120, in _lint_all
    violations += linter.lint_file(file_path)
  File "/home/tim/.local/lib/python2.7/site-packages/vint/linting/linter.py", line 106, in lint_file
    root_ast = self._parser.parse_file(path)
  File "/home/tim/.local/lib/python2.7/site-packages/vint/ast/parsing.py", line 58, in parse_file
    decoded = bytes_seq.decode(encoding)
  File "/usr/lib/python2.7/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 105: character maps to <undefined>

I'm using vint v0.3.12 on a freshly installed computer. I just tested on an older computer, and the exact same code works there. Both are running Ubuntu 16.04 and installed vint through pip. I checked the LC_* environment variables on both and they are identical. I guess that means the difference is in the installed dependencies, but I don't know enough Python to dig any deeper than that.

Hope that helps

@adrigzr

adrigzr commented Sep 16, 2017

Same here.

Traceback (most recent call last):
  File "/home/user/.local/bin/vint", line 9, in <module>
    load_entry_point('vim-vint==0.3.14', 'console_scripts', 'vint')()
  File "/home/user/.local/lib/python3.5/site-packages/vint/__init__.py", line 11, in main
    init_cli()
  File "/home/user/.local/lib/python3.5/site-packages/vint/bootstrap.py", line 22, in init_cli
    cli.start()
  File "/home/user/.local/lib/python3.5/site-packages/vint/linting/cli.py", line 27, in start
    violations = self._lint_all(env, config_dict)
  File "/home/user/.local/lib/python3.5/site-packages/vint/linting/cli.py", line 120, in _lint_all
    violations += linter.lint_file(file_path)
  File "/home/user/.local/lib/python3.5/site-packages/vint/linting/linter.py", line 106, in lint_file
    root_ast = self._parser.parse_file(path)
  File "/home/user/.local/lib/python3.5/site-packages/vint/ast/parsing.py", line 45, in parse_file
    with file_path.open(mode='rb', encoding="utf8") as f:
  File "/usr/lib/python3.5/pathlib.py", line 1151, in open
    opener=self._opener)
ValueError: binary mode doesn't take an encoding argument

Any news on this topic?
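Note that this traceback is a different failure from the original report: `Path.open` is called with both `mode='rb'` and `encoding=`, and Python rejects that combination before reading anything. A sketch of the usual pattern, reading raw bytes and decoding explicitly afterwards, follows. The helper names are hypothetical and only illustrate the point.

```python
import tempfile
from pathlib import Path


def read_vim_script(file_path, encoding='utf-8'):
    """Read raw bytes first, then decode them explicitly."""
    with file_path.open(mode='rb') as f:  # binary mode: no encoding argument
        bytes_seq = f.read()
    return bytes_seq.decode(encoding)


def binary_open_with_encoding_fails(file_path):
    """Return True if open(mode='rb', encoding=...) raises ValueError."""
    try:
        with file_path.open(mode='rb', encoding='utf8'):
            return False
    except ValueError:
        return True


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / 'example.vim'
        path.write_bytes('" ❯\n'.encode('utf-8'))
        print(binary_open_with_encoding_fails(path))
        print(read_vim_script(path))
```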

@unphased

Yeah I have some fancy comments in my vimrc and it totally kills vint

@pixelastic

Hey @Kuniwak, I saw you committed a fix for this issue recently. Thanks a lot for that!

Do you have an ETA for the next pip release of vint that would include the fix?

Thanks again

@Kuniwak
Member

Kuniwak commented Nov 10, 2017

These commits are in review.

I'm going to ship them when the review is finished.

Kuniwak added a commit that referenced this issue Nov 11, 2017
* WIP

* Make debugging easy for fix encoding bugs

* Fix encoding problem that is #225 #242

* More simple implementation for bytes compatible

* Make more simple

* Remove debugging code

* It is a classmethod, not instance method

* Add a test case for suddn EOF

* Rename to the correct name

* Care multiple scriptencoding

* Fix a problem about debug_hint overwriting

* Care single line scriptencoding

* decoding error is not a RuntimeError but Exception

* More debug_hint

* Fix a problem about missing last char

* Change Chardet priority

* Revert "WIP"

This reverts commit 1fb7dfc.

* Split files

* Try to resolve module name conflict

* Cosmetic changes

* Compose strategies to decoding_strategy
@Kuniwak
Member

Kuniwak commented Nov 12, 2017

This bugfix shipped in v0.3.15.
Please try it.

@blueyed
Member Author

blueyed commented Nov 12, 2017

Will try.

However, 0.3.15 still seems to print some leftover debug info:

% vint -e t.vim
{'0:408': {'composed_strategies': ['DecodingStrategyForEmpty',
                                   'DecodingStrategyByScriptencoding',
                                   'DecodingStrategyForUTF8',
                                   'DecodingStrategyByChardet'],
           'empty': 'false',
           'scriptencoding': 'None',
           'scriptencoding_error': '`scriptencoding` is not found',
           'selected_strategy': 'DecodingStrategyForUTF8',
           'utf-8': 'success'},
 'version': '3.6.3 (default, Oct 24 2017, 14:48:20) \n[GCC 7.2.0]'}
t.vim:18:1: E492: Not an editor command: foo (see ynkdir/vim-vimlparser)
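The debug output above lists several strategies tried in order, with the first successful one recorded as `selected_strategy`. A rough sketch of that composition pattern is shown below. Only the key names (`composed_strategies`, `selected_strategy`, `empty`, `utf-8`) are taken from the output; the function names and the whole implementation are assumptions for illustration and are not vint's actual code.

```python
def strategy_for_empty(bytes_seq, debug_hint):
    """Handle the trivial case of an empty file."""
    debug_hint['empty'] = 'false' if bytes_seq else 'true'
    return None if bytes_seq else ''


def strategy_for_utf8(bytes_seq, debug_hint):
    """Try a strict UTF-8 decode and record the outcome."""
    try:
        text = bytes_seq.decode('utf-8')
    except UnicodeDecodeError as exc:
        debug_hint['utf-8'] = 'failed: ' + str(exc)
        return None
    debug_hint['utf-8'] = 'success'
    return text


def compose(strategies):
    """Build one decoder that tries each strategy until one returns text."""
    def composed(bytes_seq, debug_hint):
        debug_hint['composed_strategies'] = [s.__name__ for s in strategies]
        for strategy in strategies:
            text = strategy(bytes_seq, debug_hint)
            if text is not None:
                debug_hint['selected_strategy'] = strategy.__name__
                return text
        raise ValueError('no strategy could decode the input')
    return composed


if __name__ == '__main__':
    decode = compose([strategy_for_empty, strategy_for_utf8])
    hint = {}
    print(decode('" ❯\n'.encode('utf-8'), hint))
    print(hint)
```

Keeping the hint dictionary separate from the decoded text makes it easy to collect diagnostics without printing them unconditionally.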

@blueyed
Member Author

blueyed commented Nov 12, 2017

The bug itself is fixed, thanks a lot for this huge improvement! 💜

@Kuniwak
Member

Kuniwak commented Nov 13, 2017

@blueyed This bug is handled in #251.

@Kuniwak
Member

Kuniwak commented Nov 13, 2017

The debug code was removed in v0.3.16. Sorry for my mistake.

@blueyed
Member Author

blueyed commented Jan 17, 2020

Closing as fixed.

@blueyed blueyed closed this as completed Jan 17, 2020