Merge pull request #10 from zzstoatzz/sync

sync first and update docs
zzstoatzz · Nov 4, 2024 · a614ae3
2 parents bd4513b + a7fd007

Showing 21 changed files with 709 additions and 351 deletions.
diff --git a/.github/workflows/publish-docs.yml b/.github/workflows/publish-docs.yml
@@ -4,6 +4,10 @@ on:
   push:
     tags:
       - v*
+    branches:
+      - main
+    paths:
+      - "docs/**"
   workflow_dispatch:
 
 permissions:
@@ -31,4 +35,4 @@ jobs:
           cairosvg
 
       - name: Publish docs
-        run: mkdocs gh-deploy --force 
+        run: mkdocs gh-deploy --force
diff --git a/README.md b/README.md
@@ -6,10 +6,16 @@ pip install raggy
 
 Read the [docs](https://zzstoatzz.github.io/raggy/)
 
-### examples
+### What is it?
+
+A Python library for:
 
 - scraping the web to produce rich documents
 - putting these documents in vectorstores
 - querying the vectorstores to find documents similar to a query
 
-see this [example](https://github.com/zzstoatzz/raggy/blob/main/examples/refresh_vectorstore/refresh_tpuf.py) I use to refresh a chatbot that knows about `prefect`.
+See this [example](https://github.com/zzstoatzz/raggy/blob/main/examples/chat_with_X/website.py) to chat with any website, or this [example](https://github.com/zzstoatzz/raggy/blob/main/examples/chat_with_X/repo.py) to chat with any GitHub repo.
+
+### Contributing
+
+We welcome contributions! See our [contributing guide](https://zzstoatzz.github.io/raggy/contributing) for details.
diff --git a/docs/contributing.md b/docs/contributing.md
@@ -0,0 +1,67 @@
+# Contributing to Raggy
+
+We love your input! We want to make contributing to Raggy as easy and transparent as possible.
+
+## Development Setup
+
+We recommend using [uv](https://github.com/astral-sh/uv) for Python environment management and package installation:
+
+```bash
+# Install uv
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# Clone the repo
+git clone https://github.com/zzstoatzz/raggy.git
+cd raggy
+
+# Create and activate a virtual environment
+uv venv
+
+# Install in editable mode with dev dependencies
+uv pip install -e ".[dev]"
+```
+
+## Running Tests
+
+```bash
+# Install test dependencies
+uv pip install -e ".[test]"
+
+# Run tests
+pytest
+```
+
+## Building Documentation
+
+```bash
+# Install docs dependencies
+uv pip install -e ".[docs]"
+
+# Serve docs locally
+mkdocs serve
+```
+
+## Code Style
+
+```bash
+pre-commit install
+pre-commit run --all-files # happens automatically on commit
+```
+
+## Running Examples
+
+All examples can be run using uv:
+
+!!! question "where are the dependencies?"
+    `uv` will run the example in an isolated environment using [inline script dependencies](https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies).
+
+```bash
+# Run example
+uv run examples/chat_with_X/website.py
+```
+
+See our [example gallery](examples/index.md) for more details.
+
+## Versioning
+
+We use [Semantic Versioning](http://semver.org/). For the versions available, see the [tags on this repository](https://github.com/zzstoatzz/raggy/tags).
diff --git a/docs/examples/index.md b/docs/examples/index.md
@@ -0,0 +1,35 @@
+# Example Gallery
+
+Here are some practical examples of using `raggy` in real-world scenarios.
+
+## Chat with Content
+
+Ye olde "chat your data" examples.
+
+#### Chat with a Website
+
+```bash
+uv run examples/chat_with_X/website.py "let's chat about docs.astral.sh/uv"
+```
+
+#### Chat with a GitHub Repo
+
+```bash
+uv run examples/chat_with_X/repo.py "let's chat about astral-sh/uv"
+```
+
+## Refresh Vectorstores
+
+A `prefect` flow to gather documents from sources of knowledge, embed them and put them in a vectorstore.
+
+#### Refresh TurboPuffer
+
+```bash
+uv run examples/refresh_vectorstore/tpuf_namespace.py
+```
+
+#### Refresh Chroma
+
+```bash
+uv run examples/refresh_vectorstore/chroma_collection.py
+```
diff --git a/docs/hooks.py b/docs/hooks.py
@@ -1,21 +1,8 @@
 import logging
-import subprocess
 
 log = logging.getLogger("mkdocs")
 
 
 def on_pre_build(config, **kwargs):
-    """Add a custom route to the server."""
-    try:
-        subprocess.run(
-            [
-                "npx",
-                "tailwindcss",
-                "-i",
-                "./docs/overrides/tailwind.css",
-                "-o",
-                "./docs/static/css/tailwind.css",
-            ]
-        )
-    except Exception:
-        log.error("You need to install tailwindcss using npx install tailwindcss")
+    """Add any pre-build hooks here."""
+    pass
diff --git a/docs/ingest_strategy.md b/docs/ingest_strategy.md
@@ -1 +1,86 @@
-# Coming soon!
+# Ingest Strategy
+
+When building RAG applications, you often need to load and refresh content from multiple sources. This can involve:
+- Expensive API calls
+- Large document processing
+- Concurrent embedding operations
+
+We use [Prefect](https://docs.prefect.io) to handle these challenges, giving us:
+
+- Automatic caching of expensive operations
+- Concurrent processing with backpressure
+- Observability and retries
+
+Let's look at a real example that demonstrates these concepts.
+
+## Building a Knowledge Base
+
+```python
+from datetime import timedelta
+import httpx
+from prefect import flow, task
+from prefect.tasks import task_input_hash
+
+from raggy.loaders.github import GitHubRepoLoader
+from raggy.loaders.web import SitemapLoader
+from raggy.vectorstores.tpuf import TurboPuffer
+
+# Cache based on content changes
+def get_last_modified(context, parameters):
+    """Only reload if the content has changed."""
+    try:
+        return httpx.head(parameters["urls"][0]).headers.get("Last-Modified", "")
+    except Exception:
+        return None
+
+@task(
+    cache_key_fn=get_last_modified,
+    cache_expiration=timedelta(hours=24),
+    retries=2,
+)
+async def gather_documents(urls: list[str]):
+    return await SitemapLoader(urls=urls).load()
+
+@flow
+async def refresh_knowledge():
+    # Load from multiple sources
+    documents = []
+    for loader in [
+        SitemapLoader(urls=["https://docs.prefect.io/sitemap.xml"]),
+        GitHubRepoLoader(repo="PrefectHQ/prefect", include_globs=["README.md"]),
+    ]:
+        documents.extend(await gather_documents(loader))
+
+    # Store efficiently with concurrent embedding
+    with TurboPuffer(namespace="knowledge") as tpuf:
+        await tpuf.upsert_batched(
+            documents,
+            batch_size=100,  # tune based on document size
+            max_concurrent=8  # tune based on rate limits
+        )
+```
+
+This example shows key patterns:
+
+1. Content-aware caching (`Last-Modified` headers, commit SHAs, etc)
+2. Automatic retries for resilience
+3. Concurrent processing with backpressure
+4. Efficient batching of embedding operations
+
+See the [refresh examples](https://github.com/zzstoatzz/raggy/tree/main/examples/refresh_vectorstore) for complete implementations using both Chroma and TurboPuffer.
+
+## Performance Tips
+
+For production workloads:
+```python
+@task(
+    retries=2,
+    retry_delay_seconds=[3, 60],  # wait 3s before the first retry, 60s before the second
+    cache_expiration=timedelta(days=1),
+    persist_result=True,  # save results to storage
+)
+async def gather_documents(loader):
+    return await loader.load()
+```
+
+See [Prefect's documentation](https://docs.prefect.io/latest/concepts/tasks/) for more on task configuration and caching strategies.
diff --git a/docs/overrides/main.html b/docs/overrides/main.html
@@ -0,0 +1,20 @@
+{% extends "base.html" %}
+
+{% block announce %}
+  <style>
+    .md-announce {
+      font-family: 'Roboto Mono', monospace;
+      background-color: var(--md-primary-fg-color);
+    }
+    .md-announce__inner {
+      margin: 0 auto;
+      padding: 0.2rem;
+      text-align: center;
+      font-weight: 300;
+      letter-spacing: 0.05em;
+    }
+  </style>
+  <a href="{{ config.extra.announcement.link }}" style="color: currentColor">
+    {{ config.extra.announcement.text }}
+  </a>
+{% endblock %} 
diff --git a/docs/welcome/tutorial.md b/docs/welcome/tutorial.md
@@ -16,21 +16,44 @@ print(documents[0])
 
 ## Adding documents to a vectorstore
 
-```python
-from raggy.vectorstores.tpuf import Turbopuffer
+!!! note "New in 0.2.0"
+    Vectorstore operations are now synchronous by default, with async batching available via `upsert_batched`.
 
-async with Turbopuffer() as vectorstore: # uses default `raggy` namespace
-    await vectorstore.upsert(documents)
+```python
+from raggy.vectorstores.tpuf import TurboPuffer
+
+with TurboPuffer(namespace="my_documents") as vectorstore:
+    # Synchronous operation
+    vectorstore.upsert(documents)
+
+    # Async batched usage for large document sets (must run inside an async function)
+    await vectorstore.upsert_batched(
+        documents,
+        batch_size=100,
+        max_concurrent=8
+    )
 ```
 
 ## Querying the vectorstore
 
 ```python
-from raggy.vectorstores.tpuf import query_namespace
-
-print(await query_namespace("how do I get started with raggy?"))
+from raggy.vectorstores.tpuf import query_namespace, multi_query_tpuf
+
+# Single query
+result = query_namespace("how do I get started with raggy?")
+print(result)
+
+# Multiple related queries for better coverage
+result = multi_query_tpuf([
+    "how to install raggy",
+    "basic raggy usage",
+    "raggy getting started"
+])
+print(result)
 ```
 
-## Real-world example
+## Real-world examples
 
-See [this example](https://github.com/zzstoatzz/raggy/blob/main/examples/refresh_vectorstore/refresh_tpuf.py) I use to refresh a chatbot that knows about `prefect`.
+- [Chat with a GitHub repo](https://github.com/zzstoatzz/raggy/blob/main/examples/chat_with_X/repo.py)
+- [Chat with a website](https://github.com/zzstoatzz/raggy/blob/main/examples/chat_with_X/website.py)
+- [Refresh a vectorstore](https://github.com/zzstoatzz/raggy/blob/main/examples/refresh_vectorstore/tpuf_namespace.py)
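The `batch_size` and `max_concurrent` arguments shown for `upsert_batched` in `docs/welcome/tutorial.md` suggest a bounded-concurrency batching pattern. The sketch below illustrates that general pattern with `asyncio.Semaphore`; it is an assumption about the technique, not raggy's actual implementation. `upsert_in_batches` and the `upsert_batch` callback are hypothetical names.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence
from typing import TypeVar

T = TypeVar("T")


async def upsert_in_batches(
    items: Sequence[T],
    upsert_batch: Callable[[Sequence[T]], Awaitable[None]],
    batch_size: int = 100,
    max_concurrent: int = 8,
) -> int:
    """Split items into batches and upsert them with at most max_concurrent in flight.

    Returns the number of batches processed.
    """
    if batch_size < 1 or max_concurrent < 1:
        raise ValueError("batch_size and max_concurrent must be positive")

    # The semaphore provides the backpressure: extra batches wait for a free slot.
    semaphore = asyncio.Semaphore(max_concurrent)
    batches = [items[i : i + batch_size] for i in range(0, len(items), batch_size)]

    async def run_one(batch: Sequence[T]) -> None:
        async with semaphore:
            await upsert_batch(batch)

    await asyncio.gather(*(run_one(batch) for batch in batches))
    return len(batches)
```

From a synchronous script this could be driven with `asyncio.run(upsert_in_batches(documents, my_upsert))`, where `my_upsert` is whatever coroutine writes one batch to the store. A larger `batch_size` means fewer requests, while a smaller `max_concurrent` stays further below provider rate limits.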