Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for embedding binary files #253

Closed
simonw opened this issue Sep 9, 2023 · 4 comments · Fixed by #254
Closed

Support for embedding binary files #253

simonw opened this issue Sep 9, 2023 · 4 comments · Fixed by #254
Labels
embeddings enhancement New feature or request
Milestone

Comments

@simonw
Copy link
Owner

simonw commented Sep 9, 2023

I want to be able to store and calculate embeddings for binary data - CLIP for images, ImageBind for audio and suchlike.

llm embed-multi and llm embed currently assume text data, see:

Plus the embeddings models are expected to work against lists of strings.

I think a --binary flag in a few places plus redefining embedding models to optionally accept binary objects in addition to strings would be good.

@simonw simonw added enhancement New feature or request embeddings labels Sep 9, 2023
@simonw
Copy link
Owner Author

simonw commented Sep 9, 2023

Here's a prototype which was enough to get a rough CLIP embedding model working:

diff --git a/llm/cli.py b/llm/cli.py
index d352087..c46a4d0 100644
--- a/llm/cli.py
+++ b/llm/cli.py
@@ -1113,6 +1113,7 @@ def embed(collection, id, input, model, store, database, content, metadata, form
     help="Encoding to use when reading --files",
     multiple=True,
 )
+@click.option("--binary", is_flag=True, help="Treat --files as binary data")
 @click.option("--sql", help="Read input using this SQL query")
 @click.option(
     "--attach",
@@ -1135,6 +1136,7 @@ def embed_multi(
     format,
     files,
     encodings,
+    binary,
     sql,
     attach,
     prefix,
@@ -1158,6 +1160,10 @@ def embed_multi(
     2. A SQL query against a SQLite database
     3. A directory of files
     """
+    if binary and not files:
+        raise click.UsageError("--binary must be used with --files")
+    if binary and encodings:
+        raise click.UsageError("--binary cannot be used with --encoding")
     if not input_path and not sql and not files:
         raise click.UsageError("Either --sql or input path or --files is required")
 
@@ -1200,11 +1206,14 @@ def embed_multi(
                 for path in pathlib.Path(directory).glob(pattern):
                     relative = path.relative_to(directory)
                     content = None
-                    for encoding in encodings:
-                        try:
-                            content = path.read_text(encoding=encoding)
-                        except UnicodeDecodeError:
-                            continue
+                    if binary:
+                        content = path.read_bytes()
+                    else:
+                        for encoding in encodings:
+                            try:
+                                content = path.read_text(encoding=encoding)
+                            except UnicodeDecodeError:
+                                continue
                     if content is None:
                         # Log to stderr
                         click.echo(
@@ -1249,8 +1258,10 @@ def embed_multi(
             for row in rows:
                 values = list(row.values())
                 id = prefix + str(values[0])
-                text = " ".join(v or "" for v in values[1:])
-                yield id, text
+                if binary:
+                    yield id, values[1]
+                else:
+                    yield id, " ".join(v or "" for v in values[1:])
 
         # collection_obj.max_batch_size = 1
         collection_obj.embed_multi(tuples(), store=store)
diff --git a/llm/embeddings.py b/llm/embeddings.py
index c25f7e1..bdaa336 100644
--- a/llm/embeddings.py
+++ b/llm/embeddings.py
@@ -336,4 +336,6 @@ class Collection:
     @staticmethod
     def content_hash(text: str) -> bytes:
         "Hash content for deduplication. Override to change hashing behavior."
-        return hashlib.md5(text.encode("utf8")).digest()
+        if isinstance(text, str):
+            text = text.encode("utf8")
+        return hashlib.md5(text).digest()

No tests yet, plus it doesn't update the type hints.

@simonw
Copy link
Owner Author

simonw commented Sep 9, 2023

Also this only updates llm embed-multi --files - but I also want to be able to do things like this:

cat IMG_3087.jpeg | llm embed -m clip --binary

@simonw
Copy link
Owner Author

simonw commented Sep 9, 2023

Should the --store option still work?

If yes, should it store in the content column given that it's defined as text? SQLite will let us get away with it but I worry it will break things in Datasette later on.

Probably better to have a content_blob null column which can be used for storing binary data instead.

simonw added a commit that referenced this issue Sep 9, 2023
@simonw simonw mentioned this issue Sep 9, 2023
3 tasks
@simonw
Copy link
Owner Author

simonw commented Sep 9, 2023

Moving this work to a PR:

@simonw simonw added this to the 0.10 milestone Sep 10, 2023
@simonw simonw linked a pull request Sep 12, 2023 that will close this issue
3 tasks
simonw added a commit that referenced this issue Sep 12, 2023
* Binary embeddings support, refs #253
* Write binary content to content_blob, with tests - refs #253
* supports_text and supports_binary embedding validation, refs #253
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
embeddings enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant