llm similar command for searching against embeddings #190

Closed
simonw opened this issue Aug 28, 2023 · 11 comments
Labels
embeddings enhancement New feature or request

Comments

@simonw
Owner

simonw commented Aug 28, 2023

Storing these things isn't particularly interesting if you can't then do something with them.

I'm going to add one more command: llm similar - which returns the N most similar results to the thing you pass in.

This will return the 10 most similar posts to the post with ID 1:

llm similar posts 1

Use -n 20 to change the number of results.

Or you can embed text and use it for the lookup straight away:

llm similar posts -c 'this is content'

No need to ever specify the model here because that can be looked up for the collection.

The big question: how do alternative vector search plugins come into play here?

The default is going to be a Python brute-force algorithm - but I very much want to have plugins able to add support for sqlite-vss or FAISS or Pinecone or similar.
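
As a rough illustration of what a brute-force default could look like, here is a minimal sketch that scores every stored vector against the query and keeps the best N. Cosine similarity is an assumption here (the thread does not name the metric), and `cosine_similarity`, `top_n_similar` and `skip_id` are hypothetical names for illustration, not part of llm.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_similar(query, stored, n=10, skip_id=None):
    """Score every stored vector against the query and return the n best.

    stored maps ID -> vector. skip_id (hypothetical) lets a lookup by ID
    leave the item itself out of its own results.
    """
    scored = (
        {"id": id_, "score": cosine_similarity(query, vector)}
        for id_, vector in stored.items()
        if id_ != skip_id
    )
    # nlargest avoids sorting the whole collection when n is small
    return heapq.nlargest(n, scored, key=lambda result: result["score"])
```

This is linear in the size of the collection for every query, which is the cost a sqlite-vss or FAISS plugin would be there to avoid.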

Perhaps an index can be configured against a collection, then any changes to that collection automatically trigger an update to that index.

llm configure-vector-index posts faiss /tmp/faiss-index

Alternative names:

llm vector posts faiss …
llm configure-vector posts faiss …
llm index posts faiss …

Originally posted by @simonw in #185 (comment)

@simonw simonw added enhancement New feature or request embeddings labels Aug 28, 2023
@simonw
Owner Author

simonw commented Aug 28, 2023

I'm tempted to do something clever to help support vector search engine plugins in the future.

What if the embeddings table was designed such that external indexes could easily see what had changed since the last time they ran their indexer?

Having a modified timestamp on there would solve most of this problem - engines could track what the highest modified timestamp was when they last ran, then re-index just the embeddings that have been added or updated since that date.

There's one catch though: deletions.

One mechanism that could work: when you delete a record from the embeddings table you don't delete it - you just set the embedding column to null, and the modified timestamp to now().

That way any indexing mechanisms can scan through the table for stuff modified since their last check, and delete any records with null in their embedding column.

I thought I'd need to add an is_deleted column for this, but I think null in embedding is an OK, unambiguous way to represent deletions.
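
The mechanism described above can be sketched with the standard library's sqlite3 module. This is only an illustration under stated assumptions: the three-column table, the integer timestamps, and the helper names `make_db`, `save`, `soft_delete` and `changes_since` are all hypothetical and are not the schema or API that llm ended up with.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a simplified, hypothetical embeddings table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "create table embeddings ("
        "id text primary key, embedding blob, modified integer)"
    )
    return db


def save(db, id_, embedding, now):
    """Insert or overwrite an embedding and stamp it with the modified time."""
    db.execute(
        "insert or replace into embeddings (id, embedding, modified) "
        "values (?, ?, ?)",
        (id_, embedding, now),
    )


def soft_delete(db, id_, now):
    """Keep the row, but null out the embedding and bump the modified time."""
    db.execute(
        "update embeddings set embedding = null, modified = ? where id = ?",
        (now, id_),
    )


def changes_since(db, last_seen):
    """Return (to_index, to_remove, new_mark) for rows modified after last_seen.

    Rows with a null embedding are treated as deletions. Using a strict
    "modified > ?" comparison means rows sharing the last seen timestamp
    would be missed, so a real indexer would need finer-grained timestamps
    or a tie-breaker.
    """
    to_index, to_remove, mark = {}, [], last_seen
    rows = db.execute(
        "select id, embedding, modified from embeddings "
        "where modified > ? order by modified",
        (last_seen,),
    )
    for id_, embedding, modified in rows:
        if embedding is None:
            to_remove.append(id_)
        else:
            to_index[id_] = embedding
        mark = max(mark, modified)
    return to_index, to_remove, mark
```

An external index would store the returned mark after each run and pass it back in as `last_seen` the next time.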

@simonw
Owner Author

simonw commented Sep 1, 2023

Got this working! Tested it against my TILs with this script:

#!/bin/zsh

# Define the root directory where we look for .md files
ROOT_DIR=~/Dropbox/Development/til

# Use find to locate all .md files under the root directory
find "$ROOT_DIR" -type f -name "*.md" | while read -r filepath; do
    # Extract the filename and parent directory
    filename=$(basename "$filepath")
    parent_dir=$(basename "$(dirname "$filepath")")

    # Concatenate parent directory and filename
    name_of_file="${parent_dir}/${filename}"

    # Run the llm command with the found .md file
    cat "$filepath" | llm embed tils2 "$name_of_file" -m sentence-transformers/all-MiniLM-L6-v2
    echo "$filepath"
done

Now I can do this:

llm similar tils2 hacker-news/recent-comments.md -n 4

And get back:

{"id": "clickhouse/github-explorer.md", "score": 0.38178510868203586}
{"id": "wikipedia/page-stats-api.md", "score": 0.3409910110962394}
{"id": "mastodon/export-timeline-to-sqlite.md", "score": 0.2592339212719439}
{"id": "twitter/birdwatch-sqlite.md", "score": 0.25843177027337216}

@simonw
Owner Author

simonw commented Sep 1, 2023

Still needs tests and documentation.

simonw added a commit that referenced this issue Sep 1, 2023
@simonw
Owner Author

simonw commented Sep 1, 2023

I'd like to solve this too:

@simonw
Owner Author

simonw commented Sep 1, 2023

Got this working too:

llm similar tils2 -c 'sqlite python'
{"id": "sqlite/python-sqlite-memory-to-file.md", "score": 0.643185563707015}
{"id": "python/sqlite-in-pyodide.md", "score": 0.5837638528336709}
{"id": "sqlite/build-specific-sqlite-pysqlite-macos.md", "score": 0.5825596720458623}
{"id": "spatialite/minimal-spatialite-database-in-python.md", "score": 0.5638998107621749}
{"id": "sqlite/sqlite-extensions-python-macos.md", "score": 0.5546740147081825}
{"id": "sqlite/one-line-csv-operations.md", "score": 0.5396872249298176}
{"id": "python/find-local-variables-in-exception-traceback.md", "score": 0.5313510328532719}
{"id": "python/pypy-macos.md", "score": 0.5273470993906808}
{"id": "python/os-remove-windows.md", "score": 0.519507245494884}
{"id": "sqlite/import-csv.md", "score": 0.5150227057181345}

@simonw simonw added this to the 0.9 - embeddings milestone Sep 2, 2023
@simonw
Owner Author

simonw commented Sep 2, 2023

This can now be refactored to use:

It should also grow the ability to return stored content and metadata, if both stored and requested.

@simonw
Owner Author

simonw commented Sep 2, 2023

Oh looks like I mostly refactored it already:

llm/llm/cli.py

Lines 1018 to 1048 in de6d257

def similar(collection, id, input, content, number, database):
    """Return top N similar IDs from a collection"""
    if not id and not content and not input:
        raise click.ClickException("Must provide content or an ID for the comparison")
    if database:
        db = sqlite_utils.Database(database)
    else:
        db = sqlite_utils.Database(user_dir() / "embeddings.db")
    if not db["embeddings"].exists():
        raise click.ClickException("No embeddings table found in database")
    collection_obj = Collection(db, collection)
    if not collection_obj.exists():
        raise click.ClickException("Collection does not exist")
    if id:
        results = collection_obj.similar_by_id(id, number)
    else:
        if not content:
            if not input:
                # Read from stdin
                input = sys.stdin
            content = input.read()
        if not content:
            raise click.ClickException("No content provided")
        results = collection_obj.similar_by_content(content, number)
    for result in results:
        click.echo(json.dumps(result))

One point of confusion: the -c/--content option is already being used to specify content to be embedded and compared with the stored data, so I can't use that option as an "and give me back the stored content" option.

@simonw
Owner Author

simonw commented Sep 2, 2023

It's very broken right now:

% llm similar tils2 python
Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/bin/llm", line 33, in <module>
    sys.exit(load_entry_point('llm', 'console_scripts', 'llm')())
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/llm/llm/cli.py", line 1036, in similar
    results = collection_obj.similar_by_id(id, number)
  File "/Users/simon/Dropbox/Development/llm/llm/embeddings.py", line 227, in similar_by_id
    raise ValueError("ID not found")
ValueError: ID not found
% llm similar tils2 -c python
Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/bin/llm", line 33, in <module>
    sys.exit(load_entry_point('llm', 'console_scripts', 'llm')())
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/llm/llm/cli.py", line 1045, in similar
    results = collection_obj.similar_by_content(content, number)
AttributeError: 'Collection' object has no attribute 'similar_by_content'. Did you mean: 'similar_by_vector'?

@simonw
Owner Author

simonw commented Sep 2, 2023

Even more broken if you ask for a collection ID that does not exist:

% llm similar til2 python    
Traceback (most recent call last):
  File "/Users/simon/Dropbox/Development/llm/llm/embeddings.py", line 60, in id
    row = next(rows)
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/simon/Dropbox/Development/llm/llm/__init__.py", line 116, in get_embedding_model
    return aliases[name]
KeyError: None

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/bin/llm", line 33, in <module>
    sys.exit(load_entry_point('llm', 'console_scripts', 'llm')())
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/llm/llm/cli.py", line 1031, in similar
    collection_obj = Collection(db, collection)
  File "/Users/simon/Dropbox/Development/llm/llm/embeddings.py", line 34, in __init__
    self._id = self.id()
  File "/Users/simon/Dropbox/Development/llm/llm/embeddings.py", line 71, in id
    "model": self.model().model_id,
  File "/Users/simon/Dropbox/Development/llm/llm/embeddings.py", line 42, in model
    self._model = llm.get_embedding_model(self._model_id)
  File "/Users/simon/Dropbox/Development/llm/llm/__init__.py", line 118, in get_embedding_model
    raise UnknownModelError("Unknown model: " + name)

@simonw
Owner Author

simonw commented Sep 2, 2023

The standard input trick doesn't work:

% echo 'computer science' | llm similar quotations -i -
Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/bin/llm", line 33, in <module>
    sys.exit(load_entry_point('llm', 'console_scripts', 'llm')())
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/llm/llm/cli.py", line 1051, in similar
    content = input.read()
AttributeError: 'str' object has no attribute 'read'
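
The traceback suggests that the value passed with -i arrived as a plain string (presumably "-") and .read() was then called on it. One way to handle that case, shown here as a standalone sketch and not as the fix that landed in the repository, is to treat "-" or a missing path as standard input and open anything else as a file. `resolve_content` and its parameters are hypothetical names.

```python
import sys


def resolve_content(content=None, input_path=None, stdin=None):
    """Pick the text to embed: explicit content, then a file, then stdin.

    A path of "-", or no path at all, is treated as standard input.
    The stdin parameter exists so callers and tests can pass in any
    file-like object; it defaults to sys.stdin.
    """
    if content:
        return content
    if input_path and input_path != "-":
        with open(input_path, encoding="utf-8") as fp:
            return fp.read()
    stream = stdin if stdin is not None else sys.stdin
    return stream.read()
```

Click's own click.File parameter type also interprets "-" as standard input, which may be a simpler route inside a Click command.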

@simonw simonw closed this as completed in 4be89fa Sep 2, 2023
@simonw
Owner Author

simonw commented Sep 2, 2023

simonw added a commit that referenced this issue Sep 2, 2023