Support for embedding binary files #253
Here's a prototype which was enough to get a rough CLIP embedding model working:
diff --git a/llm/cli.py b/llm/cli.py
index d352087..c46a4d0 100644
--- a/llm/cli.py
+++ b/llm/cli.py
@@ -1113,6 +1113,7 @@ def embed(collection, id, input, model, store, database, content, metadata, form
help="Encoding to use when reading --files",
multiple=True,
)
+@click.option("--binary", is_flag=True, help="Treat --files as binary data")
@click.option("--sql", help="Read input using this SQL query")
@click.option(
"--attach",
@@ -1135,6 +1136,7 @@ def embed_multi(
format,
files,
encodings,
+ binary,
sql,
attach,
prefix,
@@ -1158,6 +1160,10 @@ def embed_multi(
2. A SQL query against a SQLite database
3. A directory of files
"""
+ if binary and not files:
+ raise click.UsageError("--binary must be used with --files")
+ if binary and encodings:
+ raise click.UsageError("--binary cannot be used with --encoding")
if not input_path and not sql and not files:
raise click.UsageError("Either --sql or input path or --files is required")
@@ -1200,11 +1206,14 @@ def embed_multi(
for path in pathlib.Path(directory).glob(pattern):
relative = path.relative_to(directory)
content = None
- for encoding in encodings:
- try:
- content = path.read_text(encoding=encoding)
- except UnicodeDecodeError:
- continue
+ if binary:
+ content = path.read_bytes()
+ else:
+ for encoding in encodings:
+ try:
+ content = path.read_text(encoding=encoding)
+ except UnicodeDecodeError:
+ continue
if content is None:
# Log to stderr
click.echo(
@@ -1249,8 +1258,10 @@ def embed_multi(
for row in rows:
values = list(row.values())
id = prefix + str(values[0])
- text = " ".join(v or "" for v in values[1:])
- yield id, text
+ if binary:
+ yield id, values[1]
+ else:
+ yield id, " ".join(v or "" for v in values[1:])
# collection_obj.max_batch_size = 1
collection_obj.embed_multi(tuples(), store=store)
diff --git a/llm/embeddings.py b/llm/embeddings.py
index c25f7e1..bdaa336 100644
--- a/llm/embeddings.py
+++ b/llm/embeddings.py
@@ -336,4 +336,6 @@ class Collection:
@staticmethod
def content_hash(text: str) -> bytes:
"Hash content for deduplication. Override to change hashing behavior."
- return hashlib.md5(text.encode("utf8")).digest()
+ if isinstance(text, str):
+ text = text.encode("utf8")
+ return hashlib.md5(text).digest()
No tests yet, plus it doesn't update the type hints.
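For reference, the two behaviours the diff introduces (reading a file as raw bytes when `--binary` is set, and hashing either text or bytes) can be tried outside the CLI. This is a minimal standalone sketch, not the code in `llm` itself: `read_content`, its default encodings, and the widened `Union[str, bytes]` type hint are assumptions made for illustration. Unlike the loop in the diff, this version stops at the first encoding that decodes successfully.

```python
import hashlib
import pathlib
from typing import Optional, Sequence, Union


def read_content(
    path: pathlib.Path,
    binary: bool = False,
    encodings: Sequence[str] = ("utf-8", "latin-1"),  # assumed defaults
) -> Optional[Union[str, bytes]]:
    """Return raw bytes if binary is set, else text in the first encoding that decodes."""
    if binary:
        return path.read_bytes()
    for encoding in encodings:
        try:
            return path.read_text(encoding=encoding)
        except UnicodeDecodeError:
            continue
    # No encoding worked; the caller decides how to report it.
    return None


def content_hash(content: Union[str, bytes]) -> bytes:
    """Hash text or bytes for deduplication; text is encoded as UTF-8 first."""
    if isinstance(content, str):
        content = content.encode("utf8")
    return hashlib.md5(content).digest()
```

Because text is encoded to UTF-8 before hashing, a string and its UTF-8 bytes produce the same hash, so existing text hashes would stay stable under this change.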
Also this only updates cat IMG_3087.jpeg | llm embed -m clip --binary
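The piped command in this comment implies reading raw bytes, rather than decoded text, from standard input. A minimal sketch of that step follows; `read_stdin` and its `stream` parameter are hypothetical names used for illustration, not part of `llm`, and the UTF-8 default for text mode is an assumption.

```python
import sys
from typing import BinaryIO, Optional, Union


def read_stdin(
    binary: bool = False, stream: Optional[BinaryIO] = None
) -> Union[str, bytes]:
    """Read all input as bytes when binary is set, otherwise as UTF-8 text."""
    # sys.stdin.buffer is the underlying byte stream behind sys.stdin.
    source = stream if stream is not None else sys.stdin.buffer
    raw = source.read()
    return raw if binary else raw.decode("utf-8")
```

Passing the stream explicitly keeps the helper testable without touching the real standard input.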
Should the If yes, should it store in the Probably better to have a
Moving this work to a PR:
I want to be able to store and calculate embeddings for binary data - CLIP for images, ImageBind for audio and suchlike.
llm embed-multi and llm embed currently assume text data, see: llm embed-multi --files should handle encodings other than utf-8 #225
Plus the embeddings models are expected to work against lists of strings.
I think a --binary flag in a few places plus redefining embedding models to optionally accept binary objects in addition to strings would be good.
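One way to picture "optionally accept binary objects in addition to strings" is a base class whose subclasses declare which input types they handle. The sketch below is a toy under stated assumptions: the class names, the `supports_text` / `supports_binary` attributes, and the `embed_batch` / `embed_multi` split are illustrative and are not confirmed by this issue as the design `llm` uses. `ByteStatsModel` returns trivial two-number vectors and stands in for a real model such as CLIP.

```python
from typing import Iterable, Iterator, List, Union

Content = Union[str, bytes]


class EmbeddingModel:
    # Hypothetical capability flags; subclasses override them as needed.
    supports_text = True
    supports_binary = False

    def embed_batch(self, items: Iterable[Content]) -> Iterator[List[float]]:
        raise NotImplementedError

    def embed_multi(self, items: Iterable[Content]) -> List[List[float]]:
        """Validate every item against the model's capabilities, then embed."""
        items = list(items)
        for item in items:
            if isinstance(item, bytes) and not self.supports_binary:
                raise ValueError("This model does not accept binary content")
            if isinstance(item, str) and not self.supports_text:
                raise ValueError("This model does not accept text content")
        return list(self.embed_batch(items))


class ByteStatsModel(EmbeddingModel):
    """Toy model: embeds content as [length, byte sum modulo 256]."""

    supports_binary = True

    def embed_batch(self, items: Iterable[Content]) -> Iterator[List[float]]:
        for item in items:
            data = item.encode("utf8") if isinstance(item, str) else item
            yield [float(len(data)), float(sum(data) % 256)]
```

With this shape, a text-only model keeps working unchanged, and passing bytes to it fails early with a clear error instead of failing inside the model.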