llm embed-multi command #215

Closed
simonw opened this issue Sep 3, 2023 · 15 comments
Labels: embeddings, enhancement

simonw commented Sep 3, 2023

For embedding multiple things (semi-efficiently) in a single go.

Accepts CSV, TSV, JSON, newline-delimited JSON, or a SQLite database queried with SQL.

Maybe it could even accept a directory and a recursive glob, embedding the contents of each matching file with its filepath as the ID.
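
A few hypothetical invocations to illustrate the shape I have in mind (flag names here are provisional, not final):

llm embed-multi documentation docs.csv
llm embed-multi documentation docs.json --format json
cat rows.csv | llm embed-multi documentation -
llm embed-multi documentation --sql 'select id, title, body from articles'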

simonw added the embeddings and enhancement labels Sep 3, 2023
simonw added this to the 0.9 - embeddings milestone Sep 3, 2023

simonw commented Sep 3, 2023

It would be great if this could show a progress bar where possible - when running a SQL query or reading a file from a filepath, but not when reading from standard input.
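
A minimal sketch of that logic, assuming a rows iterable and an optional pre-computed length (the function name is illustrative):

import click

def iterate_rows(rows, expected_length=None, show_progress=True):
    # Skip the progress bar entirely when reading from stdin;
    # length=None still works, click just cannot show a percentage
    if not show_progress:
        yield from rows
        return
    with click.progressbar(
        rows, label="Embedding", show_percent=True, length=expected_length
    ) as bar:
        yield from bar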


simonw commented Sep 3, 2023

I may have solved most of this already in openai-to-sqlite or similar.


simonw commented Sep 3, 2023

I'm not sure how best to handle metadata though - the openai-to-sqlite embeddings command treated the first column of tabular data as an ID and every remaining column as text to embed.

I guess I could add a flag for saying "add these columns as metadata".
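
One possible shape for that, as a sketch - the metadata_columns parameter and the helper name are hypothetical:

def split_row(row, metadata_columns=()):
    # First column is the ID; named metadata columns are kept
    # aside as a dict; everything else becomes text to embed
    items = list(row.items())
    id = items[0][1]
    metadata = {key: value for key, value in items[1:] if key in metadata_columns}
    text = " ".join(
        str(value)
        for key, value in items[1:]
        if key not in metadata_columns and value is not None
    )
    return id, text, metadata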


simonw commented Sep 3, 2023

It should have an option to skip embedding content that already exists in the database (based on ID matches).
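
A sketch of one way to do that, assuming an embeddings table keyed by collection_id and id (the helper name and triggering flag are hypothetical):

def skip_existing(db, collection_id, tuples):
    # Collect the IDs already embedded for this collection, then
    # drop any incoming row whose ID is already present
    existing = {
        row["id"]
        for row in db.query(
            "select id from embeddings where collection_id = ?", [collection_id]
        )
    }
    for id, text in tuples:
        if id not in existing:
            yield id, text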


simonw commented Sep 3, 2023

Maybe there should be a mechanism where it keeps a hash of the content in the database table, so it can avoid re-embedding content whose hash is already present?

This is a tiny bit more expensive in terms of storage and compute, but would save a LOT of money against paid embedding APIs.


simonw commented Sep 3, 2023

I think a 16-byte MD5 digest in a BLOB column would be fine. I'm already storing the embeddings themselves, which are MUCH larger than that.

SHA-256 would be 32 bytes. I don't think there are any security concerns with using MD5 here.

I'll make the hash calculation a method on the Collection class that people can subclass and override if they need to.
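
A minimal sketch of that method - the content_hash name is an assumption:

import hashlib

class Collection:
    @staticmethod
    def content_hash(text: str) -> bytes:
        # 16-byte MD5 digest, suitable for a BLOB column; subclass
        # and override this to use a different hash
        return hashlib.md5(text.encode("utf8")).digest()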


simonw commented Sep 3, 2023

Prototype so far:

@cli.command()
@click.argument("collection")
@click.argument(
    "input_path",
    type=click.File("rb"),
    required=False,
)
@click.option(
    "--format",
    type=click.Choice(["json", "csv", "tsv"]),
)
@click.option("--sql", help="Read input using this SQL query")
@click.option(
    "--attach",
    type=(str, click.Path(file_okay=True, dir_okay=False, allow_dash=False)),
    multiple=True,
    help="Additional databases to attach - specify alias and file path",
)
@click.option("-m", "--model", help="Embedding model to use")
@click.option("--store", is_flag=True, help="Store the text itself in the database")
@click.option(
    "-d",
    "--database",
    type=click.Path(file_okay=True, allow_dash=False, dir_okay=False, writable=True),
    envvar="LLM_EMBEDDINGS_DB",
)
def embed_multi(collection, input_path, format, sql, attach, model, store, database):
    """
    Store embeddings for multiple strings at once

    Input can be CSV, TSV or a JSON list of objects.

    The first column is treated as an ID - all other columns
    are assumed to be text that should be concatenated together
    in order to calculate the embeddings.
    """
    if not input_path and not sql:
        raise click.UsageError("Either --sql or input path is required")

    if database:
        db = sqlite_utils.Database(database)
    else:
        db = sqlite_utils.Database(user_dir() / "embeddings.db")

    for alias, attach_path in attach:
        db.attach(alias, attach_path)

    collection_obj = Collection(collection, db, model_id=model)

    expected_length = None
    if sql:
        rows = db.query(sql)
        # Run a count query first so the progress bar can show a percentage
        count_sql = "select count(*) as c from ({})".format(sql)
        expected_length = next(db.query(count_sql))["c"]
    else:
        # Auto-detect
        try:
            rows, _ = rows_from_file(
                input_path, Format[format.upper()] if format else None
            )
        except json.JSONDecodeError as ex:
            raise click.ClickException(str(ex))

    with click.progressbar(
        rows, label="Embedding", show_percent=True, length=expected_length
    ) as rows:

        def tuples():
            # First column is the ID; remaining columns are
            # concatenated with spaces to form the text to embed
            for row in rows:
                values = list(row.values())
                id = values[0]
                text = " ".join(
                    str(v) if v is not None else "" for v in values[1:]
                )
                yield id, text

        collection_obj.embed_multi(tuples(), store=store)

And this extra import:

from sqlite_utils.utils import rows_from_file, Format
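
For example, given a CSV whose first column is an ID, the prototype could be exercised like this (file, alias and collection names are illustrative):

llm embed-multi documentation docs.csv --format csv --store
llm embed-multi documentation \
  --attach other other.db \
  --sql 'select id, title, body from other.articles'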


simonw commented Sep 3, 2023

I'm going to add an option to this command that recursively globs for files in a specified directory and embeds them. The ID for each file will be its relative path and filename.

Maybe like this:

llm embed-multi myfiles --files ~/docs '*.md'

This creates a myfiles collection populated with embeddings for every *.md file in that directory.
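
A sketch of the underlying iteration using pathlib's recursive glob - the helper name is hypothetical:

from pathlib import Path

def files_to_embed(directory, pattern):
    # Yield (id, text) pairs where the ID is the file's path
    # relative to the starting directory
    base = Path(directory)
    for path in base.glob(pattern):
        if path.is_file():
            yield str(path.relative_to(base)), path.read_text()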


simonw commented Sep 3, 2023

This is pretty cool!

llm embed-multi -d files.db datasette --files ../datasette/docs '**/*.rst' -m sentence-transformers/all-MiniLM-L6-v2

And now:

llm similar -d files.db datasette plugins.rst -n 3      

Returns:

{"id": "writing_plugins.rst", "score": 0.6955489467371455, "content": null, "metadata": null}
{"id": "ecosystem.rst", "score": 0.6786199995036342, "content": null, "metadata": null}
{"id": "settings.rst", "score": 0.5996245842903308, "content": null, "metadata": null}


simonw commented Sep 3, 2023

Another fun demo. I ran this against a Whisper transcript of my WordCamp talk:

llm embed-multi -d wordcamp.db wordcamp simon-wordcamp.csv -m sentence-transformers/all-MiniLM-L6-v2 --store

And then:

llm similar -d wordcamp.db wordcamp -c 'extensions to an application'
{"id": "25:15,460", "score": 0.185814186656052, "content": "25:17,460 Give me 20 ideas for WordPress plugins", "metadata": null}
{"id": "48:25,480", "score": 0.18035502628838557, "content": "48:28,080 That's this attack against applications", "metadata": null}
{"id": "48:47,880", "score": 0.1625639909226762, "content": "48:50,160 Let's say that you want to build an app that translates", "metadata": null}
{"id": "36:19,200", "score": 0.14616776408254076, "content": "36:23,920 of those things that is almost the hello world of building software on LLMs, except it's", "metadata": null}
{"id": "05:34,800", "score": 0.1336376530702213, "content": "05:38,280 architecture which is what all of these models are using today.", "metadata": null}
{"id": "17:27,700", "score": 0.12580986334304026, "content": "17:29,500 including programming libraries that you", "metadata": null}
{"id": "08:34,300", "score": 0.1192221881190461, "content": "08:37,780 suddenly this whole new avenue of abilities opens up.", "metadata": null}
{"id": "39:36,160", "score": 0.11500451840964403, "content": "39:41,280 I think we can build search for our own sites and applications on top of this semantic search", "metadata": null}
{"id": "30:25,400", "score": 0.11482474637934538, "content": "30:29,080 What kind of questions you should ask to unlock those features?", "metadata": null}
{"id": "41:56,160", "score": 0.1115482781584514, "content": "41:58,800 I'm not allowed to execute binaries you upload.", "metadata": null}


simonw commented Sep 3, 2023

And for a quick comparison I fired the same 1200 lines through OpenAI's ada-002 as well (costing less than a cent):

llm embed-multi -d wordcamp.db wordcamp-ada simon-wordcamp.csv -m ada --store

And then:

llm similar -d wordcamp.db wordcamp-ada -c 'extensions to an application'
{"id": "48:25,480", "score": 0.8200134715337003, "content": "48:28,080 That's this attack against applications", "metadata": null}
{"id": "48:41,120", "score": 0.7996586899133216, "content": "48:43,200 But what this is is an attack against the apps", "metadata": null}
{"id": "27:48,500", "score": 0.7928532182157423, "content": "27:51,200 \"that applications that aim to believably mimic humans", "metadata": null}
{"id": "48:47,880", "score": 0.7823372147826224, "content": "48:50,160 Let's say that you want to build an app that translates", "metadata": null}
{"id": "39:36,160", "score": 0.7802652994705345, "content": "39:41,280 I think we can build search for our own sites and applications on top of this semantic search", "metadata": null}
{"id": "02:06,640", "score": 0.7727559467504219, "content": "02:10,560 on turning my open source project", "metadata": null}
{"id": "18:11,420", "score": 0.7634463665945006, "content": "18:15,100 or will affect it in other ways, will inject back doors into it.", "metadata": null}
{"id": "31:11,660", "score": 0.7624166808809196, "content": "31:15,280 because the language model knows the syntax and I can then apply my sort of high-level", "metadata": null}
{"id": "33:26,420", "score": 0.762149452245869, "content": "33:32,260 The react paper and what this described was just another one of these little prompt engineering tricks", "metadata": null}
{"id": "31:36,560", "score": 0.7614831196225215, "content": "31:41,920 Well, what this adds up to is that these language models make me more ambitious with the projects", "metadata": null}

simonw added a commit that referenced this issue Sep 3, 2023
simonw mentioned this issue Sep 3, 2023
simonw added a commit that referenced this issue Sep 3, 2023

simonw commented Sep 3, 2023

I used the --files option against sqlite-utils/docs and against datasette/docs and realized that files with the same name, e.g. configuration.rst, had overridden each other.

So I'm going to add a --prefix option that sets a prefix for the IDs that are used.
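
A sketch of how the prefix could be applied, reusing the hypothetical files_to_embed() helper from above:

def with_prefix(tuples, prefix=""):
    # Prepend the prefix to each ID so files with the same name
    # from different directories no longer collide
    for id, text in tuples:
        yield prefix + id, text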

@simonw
Copy link
Owner Author

simonw commented Sep 3, 2023

Used --prefix to run this:

llm embed-multi -d docs.db docs -m sentence-transformers/all-MiniLM-L6-v2 --store \
  --prefix datasette/ --files ../datasette/docs '**/*.rst'
llm embed-multi -d docs.db docs -m sentence-transformers/all-MiniLM-L6-v2 --store \
  --prefix sqlite-utils/ --files ../sqlite-utils/docs '**/*.rst'
llm embed-multi -d docs.db docs -m sentence-transformers/all-MiniLM-L6-v2 --store \
  --prefix llm/ --files docs '**/*.md'
llm embed-multi -d docs.db docs -m sentence-transformers/all-MiniLM-L6-v2 --store \
  --prefix shot-scraper/ --files ../shot-scraper/docs '**/*.md'

Result was a 1.1MB docs.db database with embeddings and stored text for all of my documentation across four projects.
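
With everything in one collection, the prefixed IDs make it possible to search across all four projects at once - something like this (the query text is illustrative):

llm similar -d docs.db docs -c 'configure full-text search' -n 3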

simonw added a commit that referenced this issue Sep 4, 2023
simonw added a commit that referenced this issue Sep 4, 2023