sync first and update docs #10

Merged
merged 3 commits into from
Nov 4, 2024
6 changes: 5 additions & 1 deletion .github/workflows/publish-docs.yml
Original file line number Diff line number Diff line change
@@ -4,6 +4,10 @@ on:
  push:
    tags:
      - v*
    branches:
      - main
    paths:
      - "docs/**"
  workflow_dispatch:

permissions:
@@ -31,4 +35,4 @@ jobs:
cairosvg

- name: Publish docs
run: mkdocs gh-deploy --force
run: mkdocs gh-deploy --force
10 changes: 8 additions & 2 deletions README.md
@@ -6,10 +6,16 @@ pip install raggy

Read the [docs](https://zzstoatzz.github.io/raggy/)

### examples
### What is it?

A Python library for:

- scraping the web to produce rich documents
- putting these documents in vectorstores
- querying the vectorstores to find documents similar to a query

see this [example](https://github.com/zzstoatzz/raggy/blob/main/examples/refresh_vectorstore/refresh_tpuf.py) I use to refresh a chatbot that knows about `prefect`.
See this [example](https://github.com/zzstoatzz/raggy/blob/main/examples/chat_with_X/website.py) to chat with any website, or this [example](https://github.com/zzstoatzz/raggy/blob/main/examples/chat_with_X/repo.py) to chat with any GitHub repo.

### Contributing

We welcome contributions! See our [contributing guide](https://zzstoatzz.github.io/raggy/contributing) for details.
67 changes: 67 additions & 0 deletions docs/contributing.md
@@ -0,0 +1,67 @@
# Contributing to Raggy

We love your input! We want to make contributing to Raggy as easy and transparent as possible.

## Development Setup

We recommend using [uv](https://github.com/astral-sh/uv) for Python environment management and package installation:

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repo
git clone https://github.com/zzstoatzz/raggy.git
cd raggy

# Create and activate a virtual environment
uv venv
source .venv/bin/activate

# Install in editable mode with dev dependencies
uv pip install -e ".[dev]"
```

## Running Tests

```bash
# Install test dependencies
uv pip install -e ".[test]"

# Run tests
pytest
```

## Building Documentation

```bash
# Install docs dependencies
uv pip install -e ".[docs]"

# Serve docs locally
mkdocs serve
```

## Code Style

```bash
# Install the git hooks so checks run automatically on every commit
pre-commit install

# Run all checks against the whole repo manually
pre-commit run --all-files
```

## Running Examples

All examples can be run using uv:

!!! question "where are the dependencies?"
    `uv` will run the example in an isolated environment using [inline script dependencies](https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies).

```bash
# Run example
uv run examples/chat_with_X/website.py
```
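
For reference, inline script dependencies are declared in a specially formatted comment block at the top of a script. The sketch below is a hypothetical script, not one of the files under `examples/`; the `build_prompt` helper and the listed dependency are assumptions made for illustration.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "raggy",  # assumption: the package a real example would need
# ]
# ///
"""Hypothetical example script; `uv run` reads the comment block above."""

import sys


def build_prompt(target: str) -> str:
    """Build the opening chat message for a given site or repo."""
    return f"let's chat about {target}"


if __name__ == "__main__":
    # Fall back to a default target when no argument is given
    target = sys.argv[1] if len(sys.argv) > 1 else "docs.astral.sh/uv"
    print(build_prompt(target))
```

Running such a file with `uv run` should resolve the declared dependencies into a temporary environment, so no manual install step is needed.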

See our [example gallery](examples/index.md) for more details.

## Versioning

We use [Semantic Versioning](http://semver.org/). For the versions available, see the [tags on this repository](https://github.com/zzstoatzz/raggy/tags).
35 changes: 35 additions & 0 deletions docs/examples/index.md
@@ -0,0 +1,35 @@
# Example Gallery

Here are some practical examples of using `raggy` in real-world scenarios.

## Chat with Content

Ye olde "chat with your data" examples.

#### Chat with a Website

```bash
uv run examples/chat_with_X/website.py "let's chat about docs.astral.sh/uv"
```

#### Chat with a GitHub Repo

```bash
uv run examples/chat_with_X/repo.py "let's chat about astral-sh/uv"
```

## Refresh Vectorstores

A `prefect` flow to gather documents from sources of knowledge, embed them and put them in a vectorstore.

#### Refresh TurboPuffer

```bash
uv run examples/refresh_vectorstore/tpuf_namespace.py
```

#### Refresh Chroma

```bash
uv run examples/refresh_vectorstore/chroma_collection.py
```
17 changes: 2 additions & 15 deletions docs/hooks.py
@@ -1,21 +1,8 @@
import logging
import subprocess

log = logging.getLogger("mkdocs")


def on_pre_build(config, **kwargs):
"""Add a custom route to the server."""
try:
subprocess.run(
[
"npx",
"tailwindcss",
"-i",
"./docs/overrides/tailwind.css",
"-o",
"./docs/static/css/tailwind.css",
]
)
except Exception:
log.error("You need to install tailwindcss using npx install tailwindcss")
"""Add any pre-build hooks here."""
pass
87 changes: 86 additions & 1 deletion docs/ingest_strategy.md
@@ -1 +1,86 @@
# Coming soon!
# Ingest Strategy

When building RAG applications, you often need to load and refresh content from multiple sources. This can involve:

- Expensive API calls
- Large document processing
- Concurrent embedding operations

We use [Prefect](https://docs.prefect.io) to handle these challenges, giving us:

- Automatic caching of expensive operations
- Concurrent processing with backpressure
- Observability and retries

Let's look at a real example that demonstrates these concepts.

## Building a Knowledge Base

```python
from datetime import timedelta

import httpx
from prefect import flow, task

from raggy.loaders.github import GitHubRepoLoader
from raggy.loaders.web import SitemapLoader
from raggy.vectorstores.tpuf import TurboPuffer


# Cache based on content changes
def get_last_modified(context, parameters):
    """Only reload if the content has changed."""
    try:
        return httpx.head(parameters["urls"][0]).headers.get("Last-Modified", "")
    except Exception:
        # The HEAD request failed, so no cache key is available for this run
        return None


@task(
    cache_key_fn=get_last_modified,
    cache_expiration=timedelta(hours=24),
    retries=2,
)
async def gather_documents(urls: list[str]):
    return await SitemapLoader(urls=urls).load()


@flow
async def refresh_knowledge():
    # Load from multiple sources: the sitemap goes through the cached task,
    # which expects a list of URLs
    documents = await gather_documents(urls=["https://docs.prefect.io/sitemap.xml"])
    documents.extend(
        await GitHubRepoLoader(
            repo="PrefectHQ/prefect", include_globs=["README.md"]
        ).load()
    )

    # Store efficiently with concurrent embedding
    with TurboPuffer(namespace="knowledge") as tpuf:
        await tpuf.upsert_batched(
            documents,
            batch_size=100,  # tune based on document size
            max_concurrent=8,  # tune based on rate limits
        )
```

This example shows key patterns:

1. Content-aware caching (`Last-Modified` headers, commit SHAs, etc.)
2. Automatic retries for resilience
3. Concurrent processing with backpressure
4. Efficient batching of embedding operations
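
Patterns 3 and 4 can be illustrated with the standard library alone. The sketch below is not raggy's implementation of `upsert_batched`; it only shows one common way to combine batching with a cap on in-flight work (a simple form of backpressure) using `asyncio.Semaphore`. The names `chunk`, `process_batched`, and `fake_embed_and_store` are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


def chunk(items: Sequence[str], batch_size: int) -> list[list[str]]:
    """Split items into consecutive batches of at most batch_size."""
    return [list(items[i : i + batch_size]) for i in range(0, len(items), batch_size)]


async def process_batched(
    items: Sequence[str],
    handle_batch: Callable[[list[str]], Awaitable[int]],
    batch_size: int = 100,
    max_concurrent: int = 8,
) -> int:
    """Run handle_batch over every batch with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(batch: list[str]) -> int:
        async with semaphore:  # waits here once the limit is reached
            return await handle_batch(batch)

    results = await asyncio.gather(*(guarded(b) for b in chunk(items, batch_size)))
    return sum(results)


async def fake_embed_and_store(batch: list[str]) -> int:
    """Hypothetical stand-in for an embedding + upsert call."""
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return len(batch)


stored = asyncio.run(
    process_batched([f"doc-{i}" for i in range(250)], fake_embed_and_store)
)
print(stored)  # number of documents handled across all batches
```

Lowering `max_concurrent` trades throughput for fewer simultaneous requests, which is usually the knob to turn when an embedding API starts rate limiting.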

See the [refresh examples](https://github.com/zzstoatzz/raggy/tree/main/examples/refresh_vectorstore) for complete implementations using both Chroma and TurboPuffer.
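
As a rough illustration of content-aware caching, a cache key can be derived from whatever freshness signal the source exposes. The helper below is a hypothetical sketch, not part of raggy or Prefect: it hashes the URL together with a `Last-Modified` value and returns `None` when no such value exists, so that a caller can skip caching in that case.

```python
import hashlib
from typing import Mapping, Optional


def content_cache_key(url: str, headers: Mapping[str, str]) -> Optional[str]:
    """Return a key that changes when the resource changes, or None to skip caching.

    Assumes `headers` uses the exact name "Last-Modified"; a plain dict is
    case-sensitive, unlike the header objects HTTP clients usually return.
    """
    marker = headers.get("Last-Modified")
    if not marker:
        return None  # no freshness signal: do not reuse a cached result
    return hashlib.sha256(f"{url}|{marker}".encode()).hexdigest()


key = content_cache_key(
    "https://example.com/sitemap.xml",
    {"Last-Modified": "Mon, 04 Nov 2024 00:00:00 GMT"},
)
print(key)
```

The same idea applies to repositories by swapping the header value for a commit SHA.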

## Performance Tips

For production workloads:

```python
from datetime import timedelta

from prefect import task


@task(
    retries=2,
    retry_delay_seconds=[3, 60],  # wait 3s before the first retry, 60s before the second
    cache_expiration=timedelta(days=1),
    persist_result=True,  # save results to storage
)
async def gather_documents(loader):
    return await loader.load()
```

See [Prefect's documentation](https://docs.prefect.io/latest/concepts/tasks/) for more on task configuration and caching strategies.
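
The `retry_delay_seconds` list above gives one explicit delay per retry. If a geometric schedule is preferred, such a list could be generated rather than written by hand. The sketch below is a plain-Python illustration; `backoff_delays` and its defaults are hypothetical, and Prefect's own `exponential_backoff` helper in `prefect.tasks` may be a better fit, so check its documentation first.

```python
def backoff_delays(
    retries: int, base: float = 3.0, factor: float = 2.0, cap: float = 60.0
) -> list[float]:
    """Return one delay per retry, growing geometrically and capped at `cap`."""
    return [min(cap, base * factor**attempt) for attempt in range(retries)]


print(backoff_delays(4))  # [3.0, 6.0, 12.0, 24.0]
```

The cap keeps late retries from waiting arbitrarily long when `retries` is large.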
20 changes: 20 additions & 0 deletions docs/overrides/main.html
@@ -0,0 +1,20 @@
{% extends "base.html" %}

{% block announce %}
<style>
.md-announce {
font-family: 'Roboto Mono', monospace;
background-color: var(--md-primary-fg-color);
}
.md-announce__inner {
margin: 0 auto;
padding: 0.2rem;
text-align: center;
font-weight: 300;
letter-spacing: 0.05em;
}
</style>
<a href="{{ config.extra.announcement.link }}" style="color: currentColor">
{{ config.extra.announcement.text }}
</a>
{% endblock %}
41 changes: 32 additions & 9 deletions docs/welcome/tutorial.md
@@ -16,21 +16,44 @@ print(documents[0])

## Adding documents to a vectorstore

```python
from raggy.vectorstores.tpuf import Turbopuffer
!!! note "New in 0.2.0"
    Vectorstore operations are now synchronous by default, with async batching available via `upsert_batched`.

async with Turbopuffer() as vectorstore: # uses default `raggy` namespace
await vectorstore.upsert(documents)
```python
from raggy.vectorstores.tpuf import TurboPuffer

with TurboPuffer(namespace="my_documents") as vectorstore:
    # Synchronous operation
    vectorstore.upsert(documents)

    # Async batched usage for large document sets
    # (`await` only works inside an async context, such as an `async def` function)
    await vectorstore.upsert_batched(
        documents,
        batch_size=100,
        max_concurrent=8,
    )
```
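
Because `upsert_batched` is awaited, a plain script needs an event loop to call it. A minimal sketch of that wiring follows, using a hypothetical `FakeStore` stand-in so the snippet runs without credentials; it mirrors the sync and async method names from the tutorial but is not raggy's API.

```python
import asyncio


class FakeStore:
    """Hypothetical stand-in for a vectorstore; not raggy's API."""

    def __init__(self) -> None:
        self.docs: list[str] = []

    def __enter__(self) -> "FakeStore":
        return self

    def __exit__(self, *exc_info) -> None:
        return None

    def upsert(self, documents: list[str]) -> None:
        """Synchronous write."""
        self.docs.extend(documents)

    async def upsert_batched(self, documents: list[str], batch_size: int = 100) -> None:
        """Asynchronous write, one batch at a time."""
        for start in range(0, len(documents), batch_size):
            await asyncio.sleep(0)  # pretend to do network I/O
            self.docs.extend(documents[start : start + batch_size])


async def main() -> int:
    with FakeStore() as store:
        store.upsert(["a", "b"])  # sync call works anywhere
        await store.upsert_batched(["c", "d", "e"], batch_size=2)  # needs a coroutine
        return len(store.docs)


count = asyncio.run(main())  # start an event loop from synchronous code
print(count)
```

In a notebook or other environment that already runs an event loop, `await main()` would be used instead of `asyncio.run`.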

## Querying the vectorstore

```python
from raggy.vectorstores.tpuf import query_namespace

print(await query_namespace("how do I get started with raggy?"))
from raggy.vectorstores.tpuf import query_namespace, multi_query_tpuf

# Single query
result = query_namespace("how do I get started with raggy?")
print(result)

# Multiple related queries for better coverage
result = multi_query_tpuf([
"how to install raggy",
"basic raggy usage",
"raggy getting started"
])
print(result)
```
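
To see why several related queries can improve coverage, consider the toy sketch below. It does not reflect how `multi_query_tpuf` works internally; `search` and `multi_search` are hypothetical helpers over an in-memory dictionary that only demonstrate merging and de-duplicating hits from several phrasings.

```python
def search(corpus: dict[str, str], query: str, top_k: int = 2) -> list[str]:
    """Toy keyword search: rank document ids by the number of shared words."""
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(text.lower().split())), doc_id)
        for doc_id, text in corpus.items()
    ]
    # Highest overlap first; ties broken alphabetically by id
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:top_k]]


def multi_search(corpus: dict[str, str], queries: list[str], top_k: int = 2) -> list[str]:
    """Run each query and merge hits in first-seen order without duplicates."""
    merged: dict[str, None] = {}
    for query in queries:
        for doc_id in search(corpus, query, top_k):
            merged.setdefault(doc_id, None)
    return list(merged)


corpus = {
    "install": "how to install raggy with pip",
    "usage": "basic raggy usage with loaders and vectorstores",
    "faq": "getting started questions about raggy",
}
hits = multi_search(
    corpus, ["how to install raggy", "basic raggy usage", "raggy getting started"]
)
print(hits)
```

Here the first query alone misses the `usage` document within its top two results, while the merged result includes it.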

## Real-world example
## Real-world examples

See [this example](https://github.com/zzstoatzz/raggy/blob/main/examples/refresh_vectorstore/refresh_tpuf.py) I use to refresh a chatbot that knows about `prefect`.
- [Chat with a GitHub repo](https://github.com/zzstoatzz/raggy/blob/main/examples/chat_with_X/repo.py)
- [Chat with a website](https://github.com/zzstoatzz/raggy/blob/main/examples/chat_with_X/website.py)
- [Refresh a vectorstore](https://github.com/zzstoatzz/raggy/blob/main/examples/refresh_vectorstore/tpuf_namespace.py)