UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 132693: invalid start byte #28

Open
leosh64 opened this issue Sep 1, 2023 · 2 comments

leosh64 commented Sep 1, 2023

Getting this error during generation of embeddings:

Traceback (most recent call last):
  File "/home/user/.local/bin/sem", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/cli.py", line 84, in main
    query_func(args)
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/cli.py", line 38, in query_func
    do_query(args, model)
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/query.py", line 51, in do_query
    do_embed(args, model)
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/embed.py", line 82, in do_embed
    functions = _get_repo_functions(
  File "/home/user/.local/lib/python3.10/site-packages/semantic_code_search/embed.py", line 71, in _get_repo_functions
    file_content = f.read()
  File "/usr/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 132693: invalid start byte

after it already successfully processed quite a few files:

 27%|████████████████████████▎                                                                 | 35036/130013 [00:33<01:30, 1047.23it/s]
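For anyone else hitting this: the traceback suggests the file is opened in text mode and decoded as UTF-8, so one stray byte from a legacy encoding is enough to abort the whole run. A minimal sketch of the failure, assuming the offending file is Latin-1 encoded (an assumption; the actual encoding is not known from the report). The helper `try_decode` is hypothetical and not part of sem:

```python
# 0xb9 is the character "¹" in Latin-1. In UTF-8 that byte value can only
# appear as a continuation byte, never as the first byte of a character.
data = "x = 1  # note¹\n".encode("latin-1")


def try_decode(raw: bytes) -> str:
    """Hypothetical helper: describe whether raw decodes as UTF-8."""
    try:
        raw.decode("utf-8")
        return "ok"
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first undecodable byte.
        return f"byte {raw[exc.start]:#x} at position {exc.start}: {exc.reason}"


print(try_decode(data))
```

Files that are pure ASCII or valid UTF-8 decode fine, which would explain why thousands of files were processed before the first failure.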

leosh64 commented Sep 1, 2023

As a workaround, I just wrapped the affected lines in a try/except:

def _get_repo_functions(root, supported_file_extensions, relevant_node_types):
    functions = []
    print('Extracting functions from {}'.format(root))
    for fp in tqdm([root + '/' + f for f in os.popen('git -C {} ls-files'.format(root)).read().split('\n')]):
        if not os.path.isfile(fp):
            continue
        with open(fp, 'r') as f:
            lang = supported_file_extensions.get(fp[fp.rfind('.'):])
            if lang:
                try:
                    parser = get_parser(lang)
                    file_content = f.read()
                    tree = parser.parse(bytes(file_content, 'utf8'))
                    all_nodes = list(_traverse_tree(tree.root_node))
                    functions.extend(_extract_functions(
                        all_nodes, fp, file_content, relevant_node_types))
                except Exception as e:
                    print(f"Hit error while parsing {fp}: {e}")
    return functions

The output lists quite a lot of third-party files in my repo. Since they are third-party, I cannot update or fix them. Should sem be made robust against such files?

Maybe the requirement to have UTF-8 encoding for the files could be dropped. Ideas: https://stackoverflow.com/questions/22216076/unicodedecodeerror-utf8-codec-cant-decode-byte-0xa5-in-position-0-invalid-s
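One way the UTF-8 requirement could be relaxed, as a rough sketch only: read the file as bytes and decode with `errors="replace"`, so undecodable bytes become U+FFFD instead of raising. This assumes a lossy decode is acceptable for search purposes. `read_text_lenient` is a hypothetical helper, not an existing sem function:

```python
import os
import tempfile


def read_text_lenient(path, encoding="utf-8"):
    """Hypothetical helper: read a file, replacing undecodable bytes with U+FFFD."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(encoding, errors="replace")


# Demo on a throwaway Latin-1 file containing the byte 0xb9.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "legacy.py")
    with open(demo_path, "wb") as f:
        f.write("def f():  # note¹\n    return 1\n".encode("latin-1"))
    demo_text = read_text_lenient(demo_path)

print(demo_text)
```

Passing `errors="replace"` directly to the built-in `open()` in text mode should have much the same effect. Either way the returned string can still be handed to `bytes(file_content, 'utf8')` as in the workaround above, at the cost of the replaced characters no longer matching the file on disk.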

@nnWhisperer

Using your code, I identified the non-UTF-8 files and converted their encodings, then restarted sem. It now gets through the converted files.
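The comment above does not say how the files were converted. A minimal sketch of one possible approach, assuming the original encoding of each file is known (Latin-1 in the demo, which is an assumption). `is_utf8` and `convert_to_utf8` are hypothetical helpers, and the conversion rewrites files in place, so it is best tried on a copy first:

```python
import os
import tempfile


def is_utf8(path):
    """Hypothetical helper: True if the whole file decodes as UTF-8."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_to_utf8(path, source_encoding):
    """Hypothetical helper: re-encode a file in place from source_encoding to UTF-8."""
    with open(path, "rb") as f:
        text = f.read().decode(source_encoding)
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))


# Demo on a throwaway Latin-1 file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "legacy.py")
    with open(demo_path, "wb") as f:
        f.write("# caf\u00e9 \u00b9\n".encode("latin-1"))
    before = is_utf8(demo_path)
    convert_to_utf8(demo_path, "latin-1")
    after = is_utf8(demo_path)

print(before, after)
```

This only helps for files you are allowed to modify; for vendored third-party code, tolerant decoding on the reading side is probably the less invasive option.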
