llm embed-multi command #215

For embedding multiple things (semi-efficiently) in a single go.

Accepts CSV or TSV or JSON or nl-JSON, or a SQLite database with a SQL query.

Maybe even accepts a directory and a recursive glob, which will cause it to embed the contents of those files with the filepath as the ID for each one.
Would be great if this could show a progress bar, where possible - so when doing a SQL query or reading a file from a filepath, but not when reading from standard input.
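A minimal sketch of that idea using `click.progressbar` (the helper function and its arguments are illustrative, not from the issue): when the total row count is known a percentage bar can be shown, and when reading from stdin there is no known length so the bar is skipped.

```python
import click

def embed_rows(rows, expected_length=None):
    if expected_length is None:
        # Reading from standard input: total is unknown, skip the bar
        for row in rows:
            ...  # embed each row here
    else:
        # A known length lets click render a percentage progress bar
        with click.progressbar(
            rows, label="Embedding", show_percent=True, length=expected_length
        ) as bar:
            for row in bar:
                ...  # embed each row here
```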
I may have solved most of this already in `openai-to-sqlite`.
I'm going to lift most of the implementation from there: https://github.com/simonw/openai-to-sqlite/blob/main/README.md#json-csv-and-tsv
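The relevant helper there is `rows_from_file()` from `sqlite_utils`, which can either be told the format explicitly or sniff it from the content. A quick sketch of how it is called (the file name is made up):

```python
from sqlite_utils.utils import rows_from_file, Format

# Pass Format.CSV / Format.TSV / Format.JSON explicitly, or pass None
# as the format to let rows_from_file() detect it from the content
with open("items.csv", "rb") as fp:
    rows, detected_format = rows_from_file(fp, Format.CSV)
    for row in rows:
        # Each row is a dict keyed by column name
        print(row)
```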
Not sure how best to solve for metadata though. I guess I could have a flag for saying 'add these columns as metadata'.
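A rough sketch of what that flag could do (the `metadata_columns` argument and helper name are hypothetical, not something decided in the issue):

```python
def split_row(row, metadata_columns=()):
    # First column is the ID; columns named in metadata_columns become
    # metadata, everything else is concatenated into the text to embed
    items = list(row.items())
    id = items[0][1]
    text_parts = []
    metadata = {}
    for key, value in items[1:]:
        if key in metadata_columns:
            metadata[key] = value
        else:
            text_parts.append(str(value or ""))
    return id, " ".join(text_parts), metadata
```

For example, `split_row({"id": 1, "title": "Hi", "url": "https://example.com"}, metadata_columns={"url"})` would return `(1, "Hi", {"url": "https://example.com"})`.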
It should have an option for skipping embedding stuff that already exists in the database (based on ID matches).
Maybe there should be a mechanism where it can keep a hash of the content in the database table, such that it can avoid duplicate lookups of already-embedded content based on that hash? This is a tiny bit more expensive in terms of storage and compute, but would save a LOT of money against paid embedding APIs.
I think a 16 byte md5 in a BLOB column would be fine. I'm already storing the embeddings themselves, which are MUCH larger than that. sha256 would be 32 bytes. I don't think there are any security concerns here for using MD5. I'll make the hash calculation a method on the `Collection` class.
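A minimal sketch of that hash-based de-duplication, assuming an `embeddings` table with `collection_id`, `id` and `content_hash` columns (illustrative names, not the actual schema):

```python
import hashlib

def content_hash(text):
    # 16-byte md5 digest stored in a BLOB column - used purely for
    # spotting duplicate content, not for anything security sensitive
    return hashlib.md5(text.encode("utf8")).digest()

def needs_embedding(db, collection_id, id, text):
    # Skip the (potentially paid) embedding API call if this ID is
    # already stored with an identical content hash
    row = next(
        db.query(
            "select content_hash from embeddings"
            " where collection_id = ? and id = ?",
            [collection_id, str(id)],
        ),
        None,
    )
    return row is None or row["content_hash"] != content_hash(text)
```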
Prototype so far:

```python
@cli.command()
@click.argument("collection")
@click.argument(
    "input_path",
    type=click.File("rb"),
    required=False,
)
@click.option(
    "--format",
    type=click.Choice(["json", "csv", "tsv"]),
)
@click.option("--sql", help="Read input using this SQL query")
@click.option(
    "--attach",
    type=(str, click.Path(file_okay=True, dir_okay=False, allow_dash=False)),
    multiple=True,
    help="Additional databases to attach - specify alias and file path",
)
@click.option("-m", "--model", help="Embedding model to use")
@click.option("--store", is_flag=True, help="Store the text itself in the database")
@click.option(
    "-d",
    "--database",
    type=click.Path(file_okay=True, allow_dash=False, dir_okay=False, writable=True),
    envvar="LLM_EMBEDDINGS_DB",
)
def embed_multi(collection, input_path, format, sql, attach, model, store, database):
    """
    Store embeddings for multiple strings at once

    Input can be CSV, TSV or a JSON list of objects.

    The first column is treated as an ID - all other columns
    are assumed to be text that should be concatenated together
    in order to calculate the embeddings.
    """
    if not input_path and not sql:
        raise click.UsageError("Either --sql or input path is required")
    if database:
        db = sqlite_utils.Database(database)
    else:
        db = sqlite_utils.Database(user_dir() / "embeddings.db")
    for alias, attach_path in attach:
        db.attach(alias, attach_path)
    collection_obj = Collection(collection, db, model_id=model)
    expected_length = None
    if sql:
        rows = db.query(sql)
        count_sql = "select count(*) as c from ({})".format(sql)
        expected_length = next(db.query(count_sql))["c"]
    else:
        # Auto-detect the input format unless --format was specified
        try:
            rows, _ = rows_from_file(
                input_path, Format[format.upper()] if format else None
            )
        except json.JSONDecodeError as ex:
            raise click.ClickException(str(ex))
    with click.progressbar(
        rows, label="Embedding", show_percent=True, length=expected_length
    ) as rows:

        def tuples():
            for row in rows:
                values = list(row.values())
                id = values[0]
                # Concatenate every column after the ID, coercing
                # non-string values (e.g. from JSON input) to strings
                text = " ".join(str(v or "") for v in values[1:])
                yield id, text

        collection_obj.embed_multi(tuples(), store=store)
```

And this extra import:

```python
from sqlite_utils.utils import rows_from_file, Format
```
I'm going to add an option to this command where it can recursively glob search for files in a specified directory and embed those. The ID for each file will be its path and filename. Maybe like this:

```
llm embed-multi myfiles --files ~/docs '*.md'
```

Which creates a `myfiles` collection containing an embedding for every matching file, keyed by its path.
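A sketch of how the `--files` option might turn a directory and glob pattern into (id, text) pairs (the helper name is illustrative, not the actual implementation):

```python
from pathlib import Path

def files_to_embed(directory, pattern):
    # Match files under directory ('**/*.md' recurses); the path
    # relative to the directory becomes the ID, the file contents
    # become the text to embed
    root = Path(directory)
    for path in sorted(root.glob(pattern)):
        if path.is_file():
            yield str(path.relative_to(root)), path.read_text()
```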
This is pretty cool!

```
llm embed-multi -d files.db datasette --files ../datasette/docs '**/*.rst' -m sentence-transformers/all-MiniLM-L6-v2
```

And now:

```
llm similar -d files.db datasette plugins.rst -n 3
```

Returns the three most similar documents.
Another fun demo. I ran this against a Whisper transcript of my WordCamp talk:

```
llm embed-multi -d wordcamp.db wordcamp simon-wordcamp.csv -m sentence-transformers/all-MiniLM-L6-v2 --store
```

And then:

```
llm similar -d wordcamp.db wordcamp -c 'extensions to an application'
```
And for a quick comparison I fired the same 1200 lines through OpenAI's `ada` embedding model:

```
llm embed-multi -d wordcamp.db wordcamp-ada simon-wordcamp.csv -m ada --store
```

And then:

```
llm similar -d wordcamp.db wordcamp-ada -c 'extensions to an application'
```
I'm going to add a `--prefix` option, prepended to the ID of each item, so that docs from multiple projects can live in the same collection without ID clashes.
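A one-line sketch of what `--prefix` would do to the (id, text) pairs before they are stored (hypothetical helper, not the actual implementation):

```python
def apply_prefix(id_text_pairs, prefix=None):
    # Prepend the prefix (e.g. "datasette/") to every ID so multiple
    # projects can share one collection without ID collisions
    for id, text in id_text_pairs:
        yield (prefix or "") + str(id), text
```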
Used it like this:

```
llm embed-multi -d docs.db docs -m sentence-transformers/all-MiniLM-L6-v2 --store \
  --prefix datasette/ --files ../datasette/docs '**/*.rst'
llm embed-multi -d docs.db docs -m sentence-transformers/all-MiniLM-L6-v2 --store \
  --prefix sqlite-utils/ --files ../sqlite-utils/docs '**/*.rst'
llm embed-multi -d docs.db docs -m sentence-transformers/all-MiniLM-L6-v2 --store \
  --prefix llm/ --files docs '**/*.md'
llm embed-multi -d docs.db docs -m sentence-transformers/all-MiniLM-L6-v2 --store \
  --prefix shot-scraper/ --files ../shot-scraper/docs '**/*.md'
```

Result was a 1.1MB `docs.db` file containing embeddings for the documentation of all four projects.