Related articles with OpenAI embeddings on SQLite vectors #280

Merged
merged 26 commits into from
Nov 17, 2024

rossta
Contributor

@rossta rossta commented Nov 15, 2024

Articles now display related articles.

Screenshot 2024-11-16 at 5 51 27 AM

In this PR, we explore the use of "embeddings" for similarity search. Embeddings are "a measure of the relatedness of text strings", as described in the OpenAI docs. We’re using OpenAI to generate an embedding for each article, storing the embeddings as vectors in SQLite using the sqlite-vec extension, and querying for related articles based on embedding "distance" across articles, where a smaller distance means greater similarity.
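
To make the "distance" idea concrete, here is a minimal plain-Ruby sketch of ranking articles by cosine distance between their embeddings. It is illustrative only and is not the code in this PR: the real query runs in SQLite through sqlite-vec, and the method names (`cosine_distance`, `related_ids`), the article slugs, and the tiny 3-dimensional vectors are all assumptions made up for the example.

```ruby
# Cosine distance between two equal-length vectors.
# 0.0 means the vectors point in the same direction (most similar).
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

# Return the ids of the `limit` nearest articles, excluding the article itself.
# `embeddings` is a Hash of id => vector (hypothetical in-memory stand-in for
# the vectors stored in SQLite).
def related_ids(embeddings, id, limit: 3)
  target = embeddings.fetch(id)
  embeddings
    .reject { |other_id, _| other_id == id }
    .sort_by { |_, vector| cosine_distance(target, vector) }
    .first(limit)
    .map(&:first)
end

# Toy vectors; real embeddings have hundreds or thousands of dimensions.
embeddings = {
  "rails-tips" => [0.9, 0.1, 0.0],
  "ruby-enumerable" => [0.8, 0.3, 0.1],
  "sqlite-vectors" => [0.1, 0.9, 0.2],
  "webpack-config" => [0.0, 0.2, 0.9]
}

puts related_ids(embeddings, "rails-tips", limit: 2).inspect
```

Sorting every row in Ruby like this only works for a toy data set. Pushing the distance calculation into the database is what lets the query stay a single lookup per article.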

Helpful resources

Though Aaron’s video uses PHP and libsql (a SQLite fork that has its own vector support), the concepts are similar and serve as inspiration for what we accomplish here.

This builds on #273 where we implemented Page topics including content analysis with OpenAI to associate articles with topics.

codecov bot commented Nov 15, 2024

Codecov Report

Attention: Patch coverage is 97.70115% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| app/models/page/sitepressed.rb | 96.15% | 1 Missing ⚠️ |
| app/models/types/vector.rb | 87.50% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| app/jobs/pages/batch_embedding_job.rb | 100.00% <100.00%> (ø) |
| app/jobs/pages/batch_upsert_pages_job.rb | 100.00% <100.00%> (ø) |
| app/jobs/pages/embedding_job.rb | 100.00% <100.00%> (ø) |
| app/models/page.rb | 100.00% <100.00%> (+2.32%) ⬆️ |
| app/models/page/similarity.rb | 100.00% <100.00%> (ø) |
| app/models/page_embedding.rb | 100.00% <100.00%> (ø) |
| app/views/components/pages/timestamp.rb | 100.00% <100.00%> (ø) |
| app/views/components/pages/topics.rb | 85.71% <100.00%> (-0.96%) ⬇️ |
| lib/tasks/pages/embeddings.rake | 100.00% <100.00%> (ø) |
| app/models/page/sitepressed.rb | 96.15% <96.15%> (ø) |

... and 1 more

... and 1 file with indirect coverage changes

It’s not likely that published posts will change significantly after
publishing, so we’ll only batch the embedding jobs for published posts
that don’t already have an embedding. This way we don’t incur redundant
API calls for every post on every deploy.
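
The selection rule in that commit message can be sketched in plain Ruby as below. This is a hedged illustration, not the job code from this PR: the `posts_needing_embedding` method and the `:slug`, `:published`, and `:embedding` keys are hypothetical, and the real implementation presumably filters with a database query rather than in memory.

```ruby
# Keep only posts that are published and have no embedding yet, so that
# re-running the batch on every deploy makes no redundant API calls.
# Each post is a Hash with hypothetical keys :slug, :published, :embedding.
def posts_needing_embedding(posts)
  posts.select { |post| post[:published] && post[:embedding].nil? }
end

posts = [
  { slug: "already-embedded", published: true, embedding: [0.1, 0.2] },
  { slug: "new-post", published: true, embedding: nil },
  { slug: "draft", published: false, embedding: nil }
]

puts posts_needing_embedding(posts).map { |post| post[:slug] }.inspect
```

Running the selection a second time after `new-post` has an embedding returns an empty list, which is what makes the batch safe to trigger on every deploy.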
Cheaper and better according to the docs
@rossta rossta merged commit f5d65c4 into main Nov 17, 2024
11 checks passed
@rossta rossta deleted the feat/embeddings branch November 17, 2024 23:36