llm embed-multi command #215

Closed
simonw opened this issue Sep 3, 2023 · 15 comments
Labels: embeddings, enhancement

simonw commented Sep 3, 2023

For embedding multiple things (semi-efficiently) in a single go.

Accepts CSV, TSV, JSON, newline-delimited JSON, or a SQLite database queried with SQL.

Maybe it could even accept a directory and a recursive glob, embedding the contents of each matching file with its filepath as the ID.
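
A few hypothetical invocations to illustrate the shape I have in mind (flag names here are provisional, not final):

llm embed-multi documentation docs.csv
llm embed-multi documentation docs.json --format json
cat rows.csv | llm embed-multi documentation -
llm embed-multi documentation --sql 'select id, title, body from articles'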

simonw added the embeddings and enhancement labels Sep 3, 2023
simonw added this to the 0.9 - embeddings milestone Sep 3, 2023

simonw commented Sep 3, 2023

It would be great if this could show a progress bar where possible - when running a SQL query or reading a file from a filepath, but not when reading from standard input.
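
A minimal sketch of that logic, assuming a rows iterable and an optional pre-computed length (the function name is illustrative):

import click

def iterate_rows(rows, expected_length=None, show_progress=True):
    # Skip the progress bar entirely when reading from stdin;
    # length=None still works, click just cannot show a percentage
    if not show_progress:
        yield from rows
        return
    with click.progressbar(
        rows, label="Embedding", show_percent=True, length=expected_length
    ) as bar:
        yield from bar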


simonw commented Sep 3, 2023

I may have solved most of this already in openai-to-sqlite or similar.


simonw commented Sep 3, 2023

I'm not sure how best to handle metadata though - the openai-to-sqlite embeddings command treated the first column of tabular data as an ID and every remaining column as text to embed.

I guess I could add a flag for saying "add these columns as metadata".
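
One possible shape for that, as a sketch - the metadata_columns parameter and the helper name are hypothetical:

def split_row(row, metadata_columns=()):
    # First column is the ID; named metadata columns are kept
    # aside as a dict; everything else becomes text to embed
    items = list(row.items())
    id = items[0][1]
    metadata = {key: value for key, value in items[1:] if key in metadata_columns}
    text = " ".join(
        str(value)
        for key, value in items[1:]
        if key not in metadata_columns and value is not None
    )
    return id, text, metadata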


simonw commented Sep 3, 2023

It should have an option to skip embedding content that already exists in the database (based on ID matches).
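
A sketch of one way to do that, assuming an embeddings table keyed by collection_id and id (the helper name and triggering flag are hypothetical):

def skip_existing(db, collection_id, tuples):
    # Collect the IDs already embedded for this collection, then
    # drop any incoming row whose ID is already present
    existing = {
        row["id"]
        for row in db.query(
            "select id from embeddings where collection_id = ?", [collection_id]
        )
    }
    for id, text in tuples:
        if id not in existing:
            yield id, text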


simonw commented Sep 3, 2023

Maybe there should be a mechanism where it keeps a hash of the content in the database table, so it can avoid re-embedding content whose hash is already present?

This is a tiny bit more expensive in terms of storage and compute, but would save a LOT of money against paid embedding APIs.


simonw commented Sep 3, 2023

I think a 16-byte MD5 digest in a BLOB column would be fine. I'm already storing the embeddings themselves, which are MUCH larger than that.

SHA-256 would be 32 bytes. I don't think there are any security concerns with using MD5 here.

I'll make the hash calculation a method on the Collection class that people can subclass and override if they need to.
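
A minimal sketch of that method - the content_hash name is an assumption:

import hashlib

class Collection:
    @staticmethod
    def content_hash(text: str) -> bytes:
        # 16-byte MD5 digest, suitable for a BLOB column; subclass
        # and override this to use a different hash
        return hashlib.md5(text.encode("utf8")).digest()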


simonw commented Sep 3, 2023

Prototype so far:

@cli.command()
@click.argument("collection")
@click.argument(
    "input_path",
    type=click.File("rb"),
    required=False,
)
@click.option(
    "--format",
    type=click.Choice(["json", "csv", "tsv"]),
)
@click.option("--sql", help="Read input using this SQL query")
@click.option(
    "--attach",
    type=(str, click.Path(file_okay=True, dir_okay=False, allow_dash=False)),
    multiple=True,
    help="Additional databases to attach - specify alias and file path",
)
@click.option("-m", "--model", help="Embedding model to use")
@click.option("--store", is_flag=True, help="Store the text itself in the database")
@click.option(
    "-d",
    "--database",
    type=click.Path(file_okay=True, allow_dash=False, dir_okay=False, writable=True),
    envvar="LLM_EMBEDDINGS_DB",
)
def embed_multi(collection, input_path, format, sql, attach, model, store, database):
    """
    Store embeddings for multiple strings at once

    Input can be CSV, TSV or a JSON list of objects.

    The first column is treated as an ID - all other columns
    are assumed to be text that should be concatenated together
    in order to calculate the embeddings.
    """
    if not input_path and not sql:
        raise click.UsageError("Either --sql or input path is required")

    if database:
        db = sqlite_utils.Database(database)
    else:
        db = sqlite_utils.Database(user_dir() / "embeddings.db")

    for alias, attach_path in attach:
        db.attach(alias, attach_path)

    collection_obj = Collection(collection, db, model_id=model)

    expected_length = None
    if sql:
        rows = db.query(sql)
        # Run a count query first so the progress bar can show a percentage
        count_sql = "select count(*) as c from ({})".format(sql)
        expected_length = next(db.query(count_sql))["c"]
    else:
        # Auto-detect
        try:
            rows, _ = rows_from_file(
                input_path, Format[format.upper()] if format else None
            )
        except json.JSONDecodeError as ex:
            raise click.ClickException(str(ex))

    with click.progressbar(
        rows, label="Embedding", show_percent=True, length=expected_length
    ) as rows:

        def tuples():
            # First column is the ID; remaining columns are
            # concatenated with spaces to form the text to embed
            for row in rows:
                values = list(row.values())
                id = values[0]
                text = " ".join(
                    str(v) if v is not None else "" for v in values[1:]
                )
                yield id, text

        collection_obj.embed_multi(tuples(), store=store)

And this extra import:

from sqlite_utils.utils import rows_from_file, Format
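
For example, given a CSV whose first column is an ID, the prototype could be exercised like this (file, alias and collection names are illustrative):

llm embed-multi documentation docs.csv --format csv --store
llm embed-multi documentation \
  --attach other other.db \
  --sql 'select id, title, body from other.articles'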


simonw commented Sep 3, 2023

I'm going to add an option to this command that recursively globs for files in a specified directory and embeds them. The ID for each file will be its relative path and filename.

Maybe like this:

llm embed-multi myfiles --files ~/docs '*.md'

This creates a myfiles collection populated with embeddings for every *.md file in that directory.
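
A sketch of the underlying iteration using pathlib's recursive glob - the helper name is hypothetical:

from pathlib import Path

def files_to_embed(directory, pattern):
    # Yield (id, text) pairs where the ID is the file's path
    # relative to the starting directory
    base = Path(directory)
    for path in base.glob(pattern):
        if path.is_file():
            yield str(path.relative_to(base)), path.read_text()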


simonw commented Sep 3, 2023

This is pretty cool!

llm embed-multi -d files.db datasette --files ../datasette/docs '**/*.rst' -m sentence-transformers/all-MiniLM-L6-v2

And now:

llm similar -d files.db datasette plugins.rst -n 3      

Returns:

{"id": "writing_plugins.rst", "score": 0.6955489467371455, "content": null, "metadata": null}
{"id": "ecosystem.rst", "score": 0.6786199995036342, "content": null, "metadata": null}
{"id": "settings.rst", "score": 0.5996245842903308, "content": null, "metadata": null}


simonw commented Sep 3, 2023

Another fun demo. I ran this against a Whisper transcript of my WordCamp talk:

llm embed-multi -d wordcamp.db wordcamp simon-wordcamp.csv -m sentence-transformers/all-MiniLM-L6-v2 --store

And then:

llm similar -d wordcamp.db wordcamp -c 'extensions to an application'
{"id": "25:15,460", "score": 0.185814186656052, "content": "25:17,460 Give me 20 ideas for WordPress plugins", "metadata": null}
{"id": "48:25,480", "score": 0.18035502628838557, "content": "48:28,080 That's this attack against applications", "metadata": null}
{"id": "48:47,880", "score": 0.1625639909226762, "content": "48:50,160 Let's say that you want to build an app that translates", "metadata": null}
{"id": "36:19,200", "score": 0.14616776408254076, "content": "36:23,920 of those things that is almost the hello world of building software on LLMs, except it's", "metadata": null}
{"id": "05:34,800", "score": 0.1336376530702213, "content": "05:38,280 architecture which is what all of these models are using today.", "metadata": null}
{"id": "17:27,700", "score": 0.12580986334304026, "content": "17:29,500 including programming libraries that you", "metadata": null}
{"id": "08:34,300", "score": 0.1192221881190461, "content": "08:37,780 suddenly this whole new avenue of abilities opens up.", "metadata": null}
{"id": "39:36,160", "score": 0.11500451840964403, "content": "39:41,280 I think we can build search for our own sites and applications on top of this semantic search", "metadata": null}
{"id": "30:25,400", "score": 0.11482474637934538, "content": "30:29,080 What kind of questions you should ask to unlock those features?", "metadata": null}
{"id": "41:56,160", "score": 0.1115482781584514, "content": "41:58,800 I'm not allowed to execute binaries you upload.", "metadata": null}


simonw commented Sep 3, 2023

And for a quick comparison I fired the same 1200 lines through OpenAI's ada-002 as well (costing less than a cent):

llm embed-multi -d wordcamp.db wordcamp-ada simon-wordcamp.csv -m ada --store

And then:

llm similar -d wordcamp.db wordcamp-ada -c 'extensions to an application'
{"id": "48:25,480", "score": 0.8200134715337003, "content": "48:28,080 That's this attack against applications", "metadata": null}
{"id": "48:41,120", "score": 0.7996586899133216, "content": "48:43,200 But what this is is an attack against the apps", "metadata": null}
{"id": "27:48,500", "score": 0.7928532182157423, "content": "27:51,200 \"that applications that aim to believably mimic humans", "metadata": null}
{"id": "48:47,880", "score": 0.7823372147826224, "content": "48:50,160 Let's say that you want to build an app that translates", "metadata": null}
{"id": "39:36,160", "score": 0.7802652994705345, "content": "39:41,280 I think we can build search for our own sites and applications on top of this semantic search", "metadata": null}
{"id": "02:06,640", "score": 0.7727559467504219, "content": "02:10,560 on turning my open source project", "metadata": null}
{"id": "18:11,420", "score": 0.7634463665945006, "content": "18:15,100 or will affect it in other ways, will inject back doors into it.", "metadata": null}
{"id": "31:11,660", "score": 0.7624166808809196, "content": "31:15,280 because the language model knows the syntax and I can then apply my sort of high-level", "metadata": null}
{"id": "33:26,420", "score": 0.762149452245869, "content": "33:32,260 The react paper and what this described was just another one of these little prompt engineering tricks", "metadata": null}
{"id": "31:36,560", "score": 0.7614831196225215, "content": "31:41,920 Well, what this adds up to is that these language models make me more ambitious with the projects", "metadata": null}

simonw added a commit that referenced this issue Sep 3, 2023
simonw mentioned this issue Sep 3, 2023
simonw added a commit that referenced this issue Sep 3, 2023

simonw commented Sep 3, 2023

I used the --files option against sqlite-utils/docs and against datasette/docs and realized that files with the same name, e.g. configuration.rst, had overridden each other.

So I'm going to add a --prefix option that sets a prefix for the IDs that are used.
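
A sketch of how the prefix could be applied, reusing the hypothetical files_to_embed() helper from above:

def with_prefix(tuples, prefix=""):
    # Prepend the prefix to each ID so files with the same name
    # from different directories no longer collide
    for id, text in tuples:
        yield prefix + id, text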

@simonw
Copy link
Owner Author

simonw commented Sep 3, 2023

Used --prefix to run this:

llm embed-multi -d docs.db docs -m sentence-transformers/all-MiniLM-L6-v2 --store \
  --prefix datasette/ --files ../datasette/docs '**/*.rst'
llm embed-multi -d docs.db docs -m sentence-transformers/all-MiniLM-L6-v2 --store \
  --prefix sqlite-utils/ --files ../sqlite-utils/docs '**/*.rst'
llm embed-multi -d docs.db docs -m sentence-transformers/all-MiniLM-L6-v2 --store \
  --prefix llm/ --files docs '**/*.md'
llm embed-multi -d docs.db docs -m sentence-transformers/all-MiniLM-L6-v2 --store \
  --prefix shot-scraper/ --files ../shot-scraper/docs '**/*.md'

Result was a 1.1MB docs.db database with embeddings and stored text for all of my documentation across four projects.
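
With everything in one collection, the prefixed IDs make it possible to search across all four projects at once - something like this (the query text is illustrative):

llm similar -d docs.db docs -c 'configure full-text search' -n 3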

simonw added a commit that referenced this issue Sep 4, 2023
simonw added a commit that referenced this issue Sep 4, 2023