Support for embedding binary files #253

simonw · 2023-09-09T19:15:32Z

I want to be able to store and calculate embeddings for binary data - CLIP for images, ImageBind for audio and suchlike.

llm embed-multi and llm embed currently assume text data, see:

llm embed-multi --files should handle encodings other than utf-8 #225

Plus the embedding models are expected to work against lists of strings.

I think a --binary flag in a few places plus redefining embedding models to optionally accept binary objects in addition to strings would be good.

simonw · 2023-09-09T19:16:09Z

Here's a prototype which was enough to get a rough CLIP embedding model working:

diff --git a/llm/cli.py b/llm/cli.py
index d352087..c46a4d0 100644
--- a/llm/cli.py
+++ b/llm/cli.py
@@ -1113,6 +1113,7 @@ def embed(collection, id, input, model, store, database, content, metadata, form
     help="Encoding to use when reading --files",
     multiple=True,
 )
+@click.option("--binary", is_flag=True, help="Treat --files as binary data")
 @click.option("--sql", help="Read input using this SQL query")
 @click.option(
     "--attach",
@@ -1135,6 +1136,7 @@ def embed_multi(
     format,
     files,
     encodings,
+    binary,
     sql,
     attach,
     prefix,
@@ -1158,6 +1160,10 @@ def embed_multi(
     2. A SQL query against a SQLite database
     3. A directory of files
     """
+    if binary and not files:
+        raise click.UsageError("--binary must be used with --files")
+    if binary and encodings:
+        raise click.UsageError("--binary cannot be used with --encoding")
     if not input_path and not sql and not files:
         raise click.UsageError("Either --sql or input path or --files is required")
 
@@ -1200,11 +1206,14 @@ def embed_multi(
                 for path in pathlib.Path(directory).glob(pattern):
                     relative = path.relative_to(directory)
                     content = None
-                    for encoding in encodings:
-                        try:
-                            content = path.read_text(encoding=encoding)
-                        except UnicodeDecodeError:
-                            continue
+                    if binary:
+                        content = path.read_bytes()
+                    else:
+                        for encoding in encodings:
+                            try:
+                                content = path.read_text(encoding=encoding)
+                            except UnicodeDecodeError:
+                                continue
                     if content is None:
                         # Log to stderr
                         click.echo(
@@ -1249,8 +1258,10 @@ def embed_multi(
             for row in rows:
                 values = list(row.values())
                 id = prefix + str(values[0])
-                text = " ".join(v or "" for v in values[1:])
-                yield id, text
+                if binary:
+                    yield id, values[1]
+                else:
+                    yield id, " ".join(v or "" for v in values[1:])
 
         # collection_obj.max_batch_size = 1
         collection_obj.embed_multi(tuples(), store=store)
diff --git a/llm/embeddings.py b/llm/embeddings.py
index c25f7e1..bdaa336 100644
--- a/llm/embeddings.py
+++ b/llm/embeddings.py
@@ -336,4 +336,6 @@ class Collection:
     @staticmethod
     def content_hash(text: str) -> bytes:
         "Hash content for deduplication. Override to change hashing behavior."
-        return hashlib.md5(text.encode("utf8")).digest()
+        if isinstance(text, str):
+            text = text.encode("utf8")
+        return hashlib.md5(text).digest()

No tests yet, plus it doesn't update the type hints.
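
To illustrate those two gaps, here is a minimal standalone sketch of the `content_hash` change with the type hint widened to accept both types, plus the kind of check a test might make. It is written as a plain function rather than the `Collection` static method, and the parameter name `content` is an assumption for illustration, not the name used in the codebase.

```python
import hashlib
from typing import Union


def content_hash(content: Union[str, bytes]) -> bytes:
    "Hash content for deduplication; text is encoded as UTF-8 first."
    if isinstance(content, str):
        content = content.encode("utf8")
    return hashlib.md5(content).digest()


# A string and its UTF-8 bytes hash identically; an MD5 digest is always 16 bytes
assert content_hash("hello") == content_hash(b"hello")
assert len(content_hash(b"\x89PNG\r\n")) == 16
```

One consequence of this approach is that a text item and a binary item with the same UTF-8 bytes would be treated as duplicates of each other.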

simonw · 2023-09-09T19:18:15Z

Also this only updates llm embed-multi --files - but I also want to be able to do things like this:

cat IMG_3087.jpeg | llm embed -m clip --binary
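
Supporting that would mean reading standard input as raw bytes rather than decoded text. A rough sketch of the idea, assuming a hypothetical `read_input` helper (not a function in the codebase) that reads from `sys.stdin.buffer` unless another stream is passed in:

```python
import io
import sys
from typing import BinaryIO, Optional, Union


def read_input(binary: bool, stream: Optional[BinaryIO] = None) -> Union[str, bytes]:
    "Read the whole stream, as raw bytes if binary is set, otherwise as UTF-8 text."
    if stream is None:
        stream = sys.stdin.buffer  # the byte stream underneath sys.stdin
    data = stream.read()
    return data if binary else data.decode("utf-8")


# Simulate piped input with in-memory streams
jpeg_start = b"\xff\xd8\xff\xe0"  # these bytes are not valid UTF-8
assert read_input(True, io.BytesIO(jpeg_start)) == jpeg_start
assert read_input(False, io.BytesIO(b"hello")) == "hello"
```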

simonw · 2023-09-09T20:19:40Z

Should the --store option still work?

If yes, should it store in the content column given that it's defined as text? SQLite will let us get away with it but I worry it will break things in Datasette later on.

Probably better to have a nullable content_blob column which can be used for storing binary data instead.
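
A small sketch of both points using the standard library `sqlite3` module. The table here is a simplified, hypothetical stand-in for the real embeddings schema, and `store_content` is an illustrative helper, not part of the codebase:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical, simplified schema: content stays text, content_blob is nullable
db.execute(
    "create table embeddings (id text primary key, content text, content_blob blob)"
)


def store_content(db, id, value):
    "Route bytes to content_blob and strings to content, leaving the other null."
    if isinstance(value, bytes):
        db.execute(
            "insert into embeddings (id, content_blob) values (?, ?)", (id, value)
        )
    else:
        db.execute("insert into embeddings (id, content) values (?, ?)", (id, value))


store_content(db, "note", "hello world")
store_content(db, "photo", b"\xff\xd8\xff\xe0")
# SQLite does accept bytes in a text column: the value is stored as a blob
db.execute("insert into embeddings (id, content) values (?, ?)", ("raw", b"\x00\x01"))

rows = db.execute(
    "select id, typeof(content), typeof(content_blob) from embeddings order by id"
).fetchall()
# rows == [("note", "text", "null"), ("photo", "null", "blob"), ("raw", "blob", "null")]
```

The `raw` row is the case to worry about: anything reading `content` and expecting a string would get bytes back instead.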

simonw · 2023-09-09T20:21:03Z

Moving this work to a PR:

Binary embeddings #254

* Binary embeddings support, refs #253
* Write binary content to content_blob, with tests - refs #253
* supports_text and supports_binary embedding validation, refs #253
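
The last of those could look roughly like the sketch below. Only the attribute names `supports_text` and `supports_binary` come from the commit message; the class names, the `validate` method, the default values and the error messages are assumptions for illustration, not the code from the PR.

```python
from typing import Union


class EmbeddingModel:
    # Attribute names from the commit message; these defaults are assumed
    supports_text = True
    supports_binary = False

    def validate(self, value: Union[str, bytes]) -> None:
        "Raise ValueError if this model cannot embed the given type of value."
        if isinstance(value, bytes) and not self.supports_binary:
            raise ValueError("This model does not support binary data")
        if isinstance(value, str) and not self.supports_text:
            raise ValueError("This model does not support text")


class ImageOnlyModel(EmbeddingModel):
    # Hypothetical model that accepts only binary input, such as image bytes
    supports_text = False
    supports_binary = True


ImageOnlyModel().validate(b"\xff\xd8\xff\xe0")  # accepted, returns None
EmbeddingModel().validate("some text")  # accepted, returns None
```

Checking once before embedding lets the CLI report a clear error instead of failing somewhere inside the model.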

simonw added the enhancement and embeddings labels Sep 9, 2023

simonw added a commit that referenced this issue Sep 9, 2023

'just mypy' command, refs #253

96906b9

simonw added a commit that referenced this issue Sep 9, 2023

Work in progress binary embeddings support, refs #253

d4dd847

simonw mentioned this issue Sep 9, 2023

Binary embeddings #254

Merged

3 tasks

simonw added a commit that referenced this issue Sep 10, 2023

Write binary content to content_blob, with tests - refs #253

9fd2804

simonw added this to the 0.10 milestone Sep 10, 2023

simonw added a commit that referenced this issue Sep 12, 2023

Work in progress binary embeddings support, refs #253

ed375c3

simonw added a commit that referenced this issue Sep 12, 2023

Write binary content to content_blob, with tests - refs #253

90b599c

simonw added a commit that referenced this issue Sep 12, 2023

supports_text and supports_binary embedding validation, refs #253

81051aa

simonw linked a pull request Sep 12, 2023 that will close this issue

Binary embeddings #254

Merged

3 tasks

simonw closed this as completed in #254 Sep 12, 2023

simonw added a commit that referenced this issue Sep 12, 2023

Binary embeddings (#254)

52cec13

* Binary embeddings support, refs #253
* Write binary content to content_blob, with tests - refs #253
* supports_text and supports_binary embedding validation, refs #253

simonw mentioned this issue Sep 12, 2023

Binary embeddings final cleanup #264

Closed

3 tasks
