Merge pull request #10 from zzstoatzz/sync
sync first and update docs
zzstoatzz authored Nov 4, 2024
2 parents bd4513b + a7fd007 commit a614ae3
Showing 21 changed files with 709 additions and 351 deletions.
6 changes: 5 additions & 1 deletion .github/workflows/publish-docs.yml
Original file line number Diff line number Diff line change
@@ -4,6 +4,10 @@ on:
  push:
    tags:
      - v*
    branches:
      - main
    paths:
      - "docs/**"
  workflow_dispatch:

permissions:
@@ -31,4 +35,4 @@ jobs:
cairosvg

- name: Publish docs
run: mkdocs gh-deploy --force
run: mkdocs gh-deploy --force
10 changes: 8 additions & 2 deletions README.md
@@ -6,10 +6,16 @@ pip install raggy

Read the [docs](https://zzstoatzz.github.io/raggy/)

### examples
### What is it?

A Python library for:

- scraping the web to produce rich documents
- putting these documents in vectorstores
- querying the vectorstores to find documents similar to a query

see this [example](https://github.com/zzstoatzz/raggy/blob/main/examples/refresh_vectorstore/refresh_tpuf.py) I use to refresh a chatbot that knows about `prefect`.
See this [example](https://github.com/zzstoatzz/raggy/blob/main/examples/chat_with_X/website.py) to chat with any website, or this [example](https://github.com/zzstoatzz/raggy/blob/main/examples/chat_with_X/repo.py) to chat with any GitHub repo.

### Contributing

We welcome contributions! See our [contributing guide](https://zzstoatzz.github.io/raggy/contributing) for details.
67 changes: 67 additions & 0 deletions docs/contributing.md
@@ -0,0 +1,67 @@
# Contributing to Raggy

We love your input! We want to make contributing to Raggy as easy and transparent as possible.

## Development Setup

We recommend using [uv](https://github.com/astral-sh/uv) for Python environment management and package installation:

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repo
git clone https://github.com/zzstoatzz/raggy.git
cd raggy

# Create and activate a virtual environment
uv venv

# Install in editable mode with dev dependencies
uv pip install -e ".[dev]"
```

## Running Tests

```bash
# Install test dependencies
uv pip install -e ".[test]"

# Run tests
pytest
```

## Building Documentation

```bash
# Install docs dependencies
uv pip install -e ".[docs]"

# Serve docs locally
mkdocs serve
```

## Code Style

```bash
pre-commit install
pre-commit run --all-files # happens automatically on commit
```

## Running Examples

All examples can be run using uv:

!!! question "where are the dependencies?"
    `uv` will run the example in an isolated environment using [inline script dependencies](https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies).

```bash
# Run example
uv run examples/chat_with_X/website.py
```

See our [example gallery](examples/index.md) for more details.

## Versioning

We use [Semantic Versioning](http://semver.org/). For the versions available, see the [tags on this repository](https://github.com/zzstoatzz/raggy/tags).
35 changes: 35 additions & 0 deletions docs/examples/index.md
@@ -0,0 +1,35 @@
# Example Gallery

Here are some practical examples of using `raggy` in real-world scenarios.

## Chat with Content

Ye olde "chat with your data" examples.

#### Chat with a Website

```bash
uv run examples/chat_with_X/website.py "let's chat about docs.astral.sh/uv"
```

#### Chat with a GitHub Repo

```bash
uv run examples/chat_with_X/repo.py "let's chat about astral-sh/uv"
```

## Refresh Vectorstores

A `prefect` flow to gather documents from sources of knowledge, embed them and put them in a vectorstore.

#### Refresh TurboPuffer

```bash
uv run examples/refresh_vectorstore/tpuf_namespace.py
```

#### Refresh Chroma

```bash
uv run examples/refresh_vectorstore/chroma_collection.py
```
17 changes: 2 additions & 15 deletions docs/hooks.py
@@ -1,21 +1,8 @@
import logging
import subprocess

log = logging.getLogger("mkdocs")


def on_pre_build(config, **kwargs):
    """Add a custom route to the server."""
    try:
        subprocess.run(
            [
                "npx",
                "tailwindcss",
                "-i",
                "./docs/overrides/tailwind.css",
                "-o",
                "./docs/static/css/tailwind.css",
            ]
        )
    except Exception:
        log.error("You need to install tailwindcss using npx install tailwindcss")
    """Add any pre-build hooks here."""
    pass
87 changes: 86 additions & 1 deletion docs/ingest_strategy.md
@@ -1 +1,86 @@
# Coming soon!
# Ingest Strategy

When building RAG applications, you often need to load and refresh content from multiple sources. This can involve:

- Expensive API calls
- Large document processing
- Concurrent embedding operations

We use [Prefect](https://docs.prefect.io) to handle these challenges, giving us:

- Automatic caching of expensive operations
- Concurrent processing with backpressure
- Observability and retries

Let's look at a real example that demonstrates these concepts.

## Building a Knowledge Base

```python
from datetime import timedelta

import httpx
from prefect import flow, task
from prefect.tasks import task_input_hash

from raggy.loaders.github import GitHubRepoLoader
from raggy.loaders.web import SitemapLoader
from raggy.vectorstores.tpuf import TurboPuffer


# Cache based on content changes
def get_last_modified(context, parameters):
    """Only reload if the content has changed."""
    try:
        return httpx.head(parameters["urls"][0]).headers.get("Last-Modified", "")
    except Exception:
        return None


@task(
    cache_key_fn=get_last_modified,
    cache_expiration=timedelta(hours=24),
    retries=2,
)
async def gather_documents(urls: list[str]):
    return await SitemapLoader(urls=urls).load()


@flow
async def refresh_knowledge():
    # Load from multiple sources
    documents = await gather_documents(["https://docs.prefect.io/sitemap.xml"])
    documents.extend(
        await GitHubRepoLoader(
            repo="PrefectHQ/prefect", include_globs=["README.md"]
        ).load()
    )

    # Store efficiently with concurrent embedding
    with TurboPuffer(namespace="knowledge") as tpuf:
        await tpuf.upsert_batched(
            documents,
            batch_size=100,  # tune based on document size
            max_concurrent=8,  # tune based on rate limits
        )
```

This example shows key patterns:

1. Content-aware caching (`Last-Modified` headers, commit SHAs, etc.)
2. Automatic retries for resilience
3. Concurrent processing with backpressure
4. Efficient batching of embedding operations

See the [refresh examples](https://github.com/zzstoatzz/raggy/tree/main/examples/refresh_vectorstore) for complete implementations using both Chroma and TurboPuffer.
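
To show what "concurrent processing with backpressure" means for `batch_size` and `max_concurrent`, here is a minimal standalone sketch using only `asyncio`. The `upsert_in_batches` helper and the `fake_upsert` stand-in are hypothetical; this is an illustration of the general pattern, not raggy's `upsert_batched` implementation.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence


async def upsert_in_batches(
    items: Sequence[str],
    upsert: Callable[[Sequence[str]], Awaitable[None]],
    batch_size: int = 100,
    max_concurrent: int = 8,
) -> int:
    """Send items in fixed-size batches, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    batches = [items[i : i + batch_size] for i in range(0, len(items), batch_size)]

    async def run(batch: Sequence[str]) -> None:
        # The semaphore is the backpressure: extra batches wait here.
        async with semaphore:
            await upsert(batch)

    await asyncio.gather(*(run(batch) for batch in batches))
    return len(batches)


async def main() -> tuple[int, int]:
    in_flight = 0
    peak = 0

    async def fake_upsert(batch: Sequence[str]) -> None:
        # Stand-in for an embedding + write call; tracks concurrency.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1

    docs = [f"doc-{i}" for i in range(250)]
    count = await upsert_in_batches(docs, fake_upsert, batch_size=100, max_concurrent=2)
    return count, peak


batch_count, peak_in_flight = asyncio.run(main())
print(batch_count, peak_in_flight)
```

With 250 documents and `batch_size=100` there are three batches, and the semaphore keeps no more than two of them running at once.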

## Performance Tips

For production workloads:
```python
@task(
    retries=2,
    retry_delay_seconds=[3, 60],  # wait 3s before the first retry, 60s before the second
    cache_expiration=timedelta(days=1),
    persist_result=True,  # save results to storage
)
async def gather_documents(loader):
    return await loader.load()
```

See [Prefect's documentation](https://docs.prefect.io/latest/concepts/tasks/) for more on task configuration and caching strategies.
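
Since `retry_delay_seconds` is given as a list of per-retry delays, a longer schedule can be generated rather than typed by hand. The sketch below uses a hypothetical `backoff_delays` helper (not part of Prefect or raggy) to build a capped, exponentially growing list of delays.

```python
def backoff_delays(
    retries: int,
    base: float = 3.0,
    factor: float = 2.0,
    cap: float = 60.0,
) -> list[float]:
    """Return one delay per retry, growing by `factor` and capped at `cap` seconds."""
    return [min(cap, base * factor**attempt) for attempt in range(retries)]


delays = backoff_delays(5)
print(delays)  # [3.0, 6.0, 12.0, 24.0, 48.0]
```

The resulting list could be passed wherever a list of delays is accepted, with `retries` set to the same length.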
20 changes: 20 additions & 0 deletions docs/overrides/main.html
@@ -0,0 +1,20 @@
{% extends "base.html" %}

{% block announce %}
<style>
.md-announce {
font-family: 'Roboto Mono', monospace;
background-color: var(--md-primary-fg-color);
}
.md-announce__inner {
margin: 0 auto;
padding: 0.2rem;
text-align: center;
font-weight: 300;
letter-spacing: 0.05em;
}
</style>
<a href="{{ config.extra.announcement.link }}" style="color: currentColor">
{{ config.extra.announcement.text }}
</a>
{% endblock %}
41 changes: 32 additions & 9 deletions docs/welcome/tutorial.md
@@ -16,21 +16,44 @@ print(documents[0])

## Adding documents to a vectorstore

```python
from raggy.vectorstores.tpuf import Turbopuffer
!!! note "New in 0.2.0"
    Vectorstore operations are now synchronous by default, with async batching available via `upsert_batched`.

async with Turbopuffer() as vectorstore: # uses default `raggy` namespace
    await vectorstore.upsert(documents)
```python
from raggy.vectorstores.tpuf import TurboPuffer

with TurboPuffer(namespace="my_documents") as vectorstore:
    # Synchronous operation
    vectorstore.upsert(documents)

    # Async batched usage for large document sets
    await vectorstore.upsert_batched(
        documents,
        batch_size=100,
        max_concurrent=8
    )
```

## Querying the vectorstore

```python
from raggy.vectorstores.tpuf import query_namespace

print(await query_namespace("how do I get started with raggy?"))
from raggy.vectorstores.tpuf import query_namespace, multi_query_tpuf

# Single query
result = query_namespace("how do I get started with raggy?")
print(result)

# Multiple related queries for better coverage
result = multi_query_tpuf([
    "how to install raggy",
    "basic raggy usage",
    "raggy getting started"
])
print(result)
```
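
For intuition about why several related queries can give better coverage than one, here is a toy, self-contained sketch of similarity search over hand-written vectors. Everything in it (`INDEX`, `cosine`, `top_k`, `multi_query`) is hypothetical and stands in for real embeddings and a real vectorstore; it is not how `query_namespace` or `multi_query_tpuf` are implemented.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy "embeddings": each document is a hand-written 3-dimensional vector.
INDEX = {
    "install guide": [1.0, 0.0, 0.0],
    "usage guide": [0.0, 1.0, 0.0],
    "changelog": [0.0, 0.0, 1.0],
}


def top_k(query_vector: list[float], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query vector."""
    ranked = sorted(
        INDEX, key=lambda doc: cosine(query_vector, INDEX[doc]), reverse=True
    )
    return ranked[:k]


def multi_query(query_vectors: list[list[float]], k: int = 1) -> list[str]:
    """Run each query and merge the hits, dropping duplicates but keeping order."""
    merged: list[str] = []
    for vector in query_vectors:
        for doc in top_k(vector, k):
            if doc not in merged:
                merged.append(doc)
    return merged


results = multi_query([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.8, 0.2, 0.0]])
print(results)
```

The first and third query vectors both land on the install guide, while the second surfaces the usage guide, so the merged result covers two documents that no single query would have returned together.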

## Real-world example
## Real-world examples

See [this example](https://github.com/zzstoatzz/raggy/blob/main/examples/refresh_vectorstore/refresh_tpuf.py) I use to refresh a chatbot that knows about `prefect`.
- [Chat with a GitHub repo](https://github.com/zzstoatzz/raggy/blob/main/examples/chat_with_X/repo.py)
- [Chat with a website](https://github.com/zzstoatzz/raggy/blob/main/examples/chat_with_X/website.py)
- [Refresh a vectorstore](https://github.com/zzstoatzz/raggy/blob/main/examples/refresh_vectorstore/tpuf_namespace.py)