
Commit ea58d22

docs: add explanation in rag blog and note on fixed length arrays (#8413)

lostmygithubaccount authored Feb 22, 2024
1 parent 97dc7be · commit ea58d22
Showing 2 changed files with 22 additions and 6 deletions.


docs/posts/duckdb-for-rag/index.qmd (20 changes: 18 additions & 2 deletions)
@@ -20,6 +20,12 @@ RAG!
The database must support array types and have some form of similarity metric
between arrays of numbers. Alternatively, a custom user-defined function (UDF)
can be used for the similarity metric (a rough sketch of one follows this
callout).

Calculating similarity will also be much faster if the database supports
[fixed-size arrays](https://duckdb.org/docs/sql/data_types/array), which DuckDB
recently introduced in version 0.10.0. We're still using 0.9.2 in this post,
but it'll be easy to upgrade.
:::
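
The `list_cosine_similarity` function called later in the diff is defined
elsewhere in the post. Purely as an illustration of what a custom similarity
UDF could look like (not the post's actual definition, and assuming Ibis's
Python scalar UDF decorator accepts `list[float]` type hints), a sketch might
be:

```{python}
import ibis


@ibis.udf.scalar.python
def cosine_similarity_sketch(x: list[float], y: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (pure-Python sketch)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x) ** 0.5
    norm_y = sum(b * b for b in y) ** 0.5
    return dot / (norm_x * norm_y)
```

A Python UDF like this runs row by row, which is part of why DuckDB's native
fixed-size array functions should be noticeably faster after upgrading to
0.10.0.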

[DuckDB is the default backend for Ibis](../why-duckdb/index.qmd) and makes
@@ -202,7 +208,7 @@ from largest to smallest:

```{python}
t = (
-    t.mutate(tokens_estimate=t.contents.length() // 4) # <1>
+    t.mutate(tokens_estimate=t["contents"].length() // 4) # <1>
    .order_by(ibis._["tokens_estimate"].desc()) # <2>
    .relocate("filepath", "tokens_estimate") # <3>
)
@@ -297,10 +303,12 @@ Now we can search for similar text in the documentation:

```{python}
def search_docs(text): # <1>
    """Search documentation for similar text, returning a sorted table""" # <1>
    embedding = _embed(text) # <2>
    s = (
-        t.mutate(similarity=list_cosine_similarity(t.embedding, embedding)) # <3>
+        t.mutate(similarity=list_cosine_similarity(t["embedding"], embedding)) # <3>
        .relocate("similarity") # <4>
        .order_by(ibis._["similarity"].desc()) # <5>
        .cache() # <6>
@@ -321,6 +329,14 @@ text = "where can I chat with the community about Ibis?"
search_docs(text)
```

Now that we have retrieved the most similar documentation, we can augment our
language model's input with that context prior to generating a response! In
practice, we'd probably want to set a similarity threshold and take the top `N`
results. Chunking our text into smaller pieces and selecting from those results
would also be a good idea.

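As a minimal sketch of that kind of post-filtering (the `0.80` threshold and
top `N` of 5 here are arbitrary assumptions, not values from the post):

```{python}
# hypothetical post-filtering: drop weak matches, then keep the top N results
top_results = (
    search_docs(text)
    .filter(ibis._["similarity"] > 0.80)  # assumed similarity threshold
    .limit(5)  # assumed N
)
top_results
```
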
Let's try a few more queries:

```{python}
text = "what do users say about Ibis?"
search_docs(text)
