llm embed-multi --files should handle encodings other than utf-8 #225

Closed
simonw opened this issue Sep 4, 2023 · 6 comments
Labels: bug, embeddings

simonw commented Sep 4, 2023

Got this while running this command:

llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/Dropbox/Development '**/README.md' --store

Traceback:

  File "/Users/simon/.local/share/virtualenvs/llm-cluster-AL-JPg-s/lib/python3.10/site-packages/llm/cli.py", line 1128, in embed_multi
    collection_obj.embed_multi(tuples(), store=store)
  File "/Users/simon/.local/share/virtualenvs/llm-cluster-AL-JPg-s/lib/python3.10/site-packages/llm/embeddings.py", line 164, in embed_multi
    self.embed_multi_with_metadata(
  File "/Users/simon/.local/share/virtualenvs/llm-cluster-AL-JPg-s/lib/python3.10/site-packages/llm/embeddings.py", line 188, in embed_multi_with_metadata
    batch = list(islice(iterator, batch_size))
  File "/Users/simon/.local/share/virtualenvs/llm-cluster-AL-JPg-s/lib/python3.10/site-packages/llm/embeddings.py", line 165, in <genexpr>
    ((id, text, None) for id, text in entries), store=store
  File "/Users/simon/.local/share/virtualenvs/llm-cluster-AL-JPg-s/lib/python3.10/site-packages/llm/cli.py", line 1121, in tuples
    for row in rows:
  File "/Users/simon/.local/share/virtualenvs/llm-cluster-AL-JPg-s/lib/python3.10/site-packages/click/_termui_impl.py", line 344, in generator
    for rv in self.iter:
  File "/Users/simon/.local/share/virtualenvs/llm-cluster-AL-JPg-s/lib/python3.10/site-packages/llm/cli.py", line 1087, in iterate_files
    yield {"id": str(relative), "content": path.read_text()}
  File "/Users/simon/.pyenv/versions/3.10.4/lib/python3.10/pathlib.py", line 1133, in read_text
    return f.read()
  File "/Users/simon/.pyenv/versions/3.10.4/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 522: invalid start byte
simonw added the bug and embeddings labels Sep 4, 2023

simonw commented Sep 4, 2023

In the debugger:

(Pdb) type(data)
<class 'bytes'>
(Pdb) data
b'<a href="https://hapi.dev"><img src="https://raw.githubusercontent.com/hapijs/assets/master/images/family.png" width="180px" align="right" /></a>\n\n# @hapi/address\n\n#### Validate email address and domain.\n\n**address** is part of the **hapi** ecosystem and was designed to work seamlessly with the [hapi web framework](https://hapi.dev) and its other components (but works great on its own or with other frameworks). If you are using a different web framework and find this module useful, check out [hapi](https://hapi.dev) \x96 they work even better together.\n\n### Visit the [hapi.dev](https://hapi.dev) Developer Portal for tutorials, documentation, and support\n\n## Useful resources\n\n- [Documentation and API](https://hapi.dev/family/address/)\n- [Versions status](https://hapi.dev/resources/status/#address)\n- [Project policies](https://hapi.dev/policies/)\n- [Free and commercial support options](https://hapi.dev/support/)'
(Pdb) data.decode("utf-8")
*** UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 522: invalid start byte

I think it's this:

heck out [hapi](https://hapi.dev) \x96 they work even better
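
The failing byte can be reproduced in isolation. The sketch below uses a shortened stand-in for the README bytes (the `raw` value is made up for illustration) and assumes the file was saved as Windows-1252, which the thread does not confirm. Under that assumption 0x96 is an en dash; under latin-1 the same byte decodes without error but becomes the invisible control character U+0096.

```python
# Shortened, hypothetical stand-in for the README bytes shown in the debugger.
raw = b"check out hapi \x96 they work even better together"

utf8_error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0x96 is a UTF-8 continuation byte, so it cannot begin a character.
    utf8_error = exc

# Assumption: the file was written as Windows-1252, where 0x96 is an en dash.
as_cp1252 = raw.decode("cp1252")

# latin-1 maps every byte to a code point, so it never raises,
# but 0x96 becomes the C1 control character U+0096 rather than a dash.
as_latin1 = raw.decode("latin-1")

print(utf8_error)
print(repr(as_cp1252))
print(repr(as_latin1))
```

This is why a latin-1 fallback always succeeds but does not always recover the character the author intended.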

simonw commented Sep 4, 2023

So... should the --files command only work with utf-8? What should it do if something cannot be decoded?

simonw commented Sep 4, 2023

I'm tempted to go with the cheapest option: fall back to latin-1 on a decoding error.

Another option would be to support a --encoding option which can be used to get --files to work against other encodings.
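
Both options can be combined in one helper. The following is a minimal sketch, not necessarily what the fix for this issue does: `read_text_with_fallback` is a hypothetical name, and the default encoding list of utf-8 followed by latin-1 is an assumption taken from the comment above. A caller could pass a different list to get the effect of an --encoding option.

```python
import pathlib


def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return the text of `path`, decoded with the first encoding that works.

    Hypothetical helper for illustration; `encodings` is tried in order.
    """
    data = pathlib.Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    raise ValueError(
        "Could not decode {} using {}".format(path, ", ".join(encodings))
    )
```

Because latin-1 accepts any byte sequence, the default list never reaches the `ValueError`; the error only matters when the caller supplies stricter encodings.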

simonw commented Sep 4, 2023

Code at fault:

llm/llm/cli.py

Lines 1083 to 1087 in 206e691

def iterate_files():
    for directory, pattern in files:
        for path in pathlib.Path(directory).glob(pattern):
            relative = path.relative_to(directory)
            yield {"id": str(relative), "content": path.read_text()}

simonw closed this as completed in 78a0e9b Sep 4, 2023
simonw added a commit that referenced this issue Sep 5, 2023
simonw changed the title from "llm embed-multi --files error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 522" to "llm embed-multi --files should handle encodings other than utf-8" Sep 9, 2023
simonw added this to the 0.10 milestone Sep 10, 2023
simonw added a commit that referenced this issue Sep 12, 2023

spm1001 commented Dec 29, 2024

(I love the way you've documented things like this even when you were talking to yourself. It means that if someone has a similar problem, they can find a relevant issue. I'm posting here rather than opening a new issue, but will open one if you want.)

I've been using folders of old (mostly plain text) files full of notes from meetings. They are all a bit ropey as they've been in different systems and encodings over time.

I'm sending them to a model with cat *.md | llm "Prompt" but sometimes get errors like the one below:

Traceback (most recent call last):
  File "/opt/homebrew/bin/llm", line 8, in <module>
    sys.exit(cli())
             ~~~^^
  File "/opt/homebrew/Cellar/llm/0.19.1/libexec/lib/python3.13/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/llm/0.19.1/libexec/lib/python3.13/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/opt/homebrew/Cellar/llm/0.19.1/libexec/lib/python3.13/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/opt/homebrew/Cellar/llm/0.19.1/libexec/lib/python3.13/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/llm/0.19.1/libexec/lib/python3.13/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/homebrew/Cellar/llm/0.19.1/libexec/lib/python3.13/site-packages/llm/cli.py", line 381, in prompt
    prompt = read_prompt()
  File "/opt/homebrew/Cellar/llm/0.19.1/libexec/lib/python3.13/site-packages/llm/cli.py", line 258, in read_prompt
    stdin_prompt = sys.stdin.read()
  File "<frozen codecs>", line 325, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 236464: invalid continuation byte

Where the offending line was <div>Deduping across the different hashed ID spaces, but because they are on multiple ‘networks’/dsps they can heal that fracture</div> - I think it was just the "smart" quotes around 'networks'.

I can weed them out manually but having llm handle it for me would be great. I don't think I can use the encoding option as that's in the context of embeddings only?
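
One workaround that stays outside llm is to normalise the bytes before piping them in. The sketch below is an assumption-laden illustration: `decode_lenient` and `main` are hypothetical names, and the guess that these notes contain Mac OS Roman text (where 0xD4 and 0xD5 are the curly single quotes) is inferred from the byte in the traceback, not confirmed anywhere in this thread.

```python
import sys


def decode_lenient(data, encodings=("utf-8",)):
    """Decode bytes with the first encoding that works.

    If none of them work, fall back to UTF-8 and substitute U+FFFD for
    every byte that cannot be decoded, so the call never raises.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return data.decode("utf-8", errors="replace")


def main():
    # Read raw bytes so Python does not try to decode stdin as UTF-8 itself.
    raw = sys.stdin.buffer.read()
    sys.stdout.buffer.write(decode_lenient(raw).encode("utf-8"))
```

Saved as a script that calls `main()`, this could sit between `cat *.md` and `llm` in the pipeline. Because `cat` concatenates files that may each use a different encoding, trying one legacy encoding for the whole stream can mis-decode some files; decoding each file separately is safer when the encodings are mixed.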
