Add parameter for setting the textractor backend, closes #22
davidmezzetti committed Dec 17, 2024
1 parent 0a13057 commit d2db9ec
Showing 2 changed files with 25 additions and 2 deletions.
13 changes: 13 additions & 0 deletions README.md
@@ -115,6 +115,7 @@ The RAG application has a number of environment variables that can be set to con
| EMBEDDINGS | Embeddings database path | [neuml/txtai-wikipedia-slim](https://hf.co/NeuML/txtai-wikipedia-slim) |
| MAXLENGTH | Maximum generation length | 2048 for topics, 4096 for RAG |
| CONTEXT | RAG context size | 10 |
| TEXTBACKEND | [Text extraction backend](https://neuml.github.io/txtai/pipeline/data/filetohtml/#txtai.pipeline.FileToHTML.__init__) | available |
| DATA | Optional directory to index data from | None |
| PERSIST | Optional directory to save index updates to | None |
| TOPICSBATCH | Optional batch size for LLM topic queries | None |
@@ -151,12 +152,24 @@ docker run -d --gpus=all -it -p 8501:8501 -e LLM=gpt-4o -e OPENAI_API_KEY=your-a
docker run -d --gpus=all -it -p 8501:8501 -e EMBEDDINGS=neuml/arxiv neuml/rag
```

### Start with an empty embeddings index

```
docker run -d --gpus=all -it -p 8501:8501 -e EMBEDDINGS= neuml/rag
```

### Build an embeddings index with a local directory of files

```
docker run -d --gpus=all -it -p 8501:8501 -e DATA=/data/path -v local/path:/data/path neuml/rag
```

### Use the Docling text extraction backend

```
docker run -d --gpus=all -it -p 8501:8501 -e TEXTBACKEND=docling neuml/rag
```
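
For reference, the `TEXTBACKEND` value maps directly to the `backend` parameter of txtai's `Textractor` pipeline (see the `rag.py` change below). A minimal sketch comparing the default backend selection with an explicit Docling backend, assuming the `docling` package and at least one default extraction backend are installed; the URL is an arbitrary placeholder:

```python
from txtai.pipeline import Textractor

# Default backend selection ("available" picks the best installed backend)
default = Textractor(paragraphs=True)

# Explicit Docling backend, equivalent to running with TEXTBACKEND=docling
docling = Textractor(paragraphs=True, backend="docling")

# Placeholder document, any local path or URL works
url = "https://arxiv.org/pdf/2005.11401"

# With paragraphs=True, extraction returns a list of paragraphs
print(default(url)[:3])
print(docling(url)[:3])
```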

### Persist embeddings and cache models

14 changes: 12 additions & 2 deletions rag.py
@@ -319,6 +319,9 @@ def __init__(self):
context=self.context,
)

# Textractor instance (lazy loaded)
self.textractor = None

def load(self):
"""
Creates or loads an Embeddings instance.
@@ -425,8 +428,15 @@ def extract(self, inputs):
extracted content
"""

textractor = Textractor(paragraphs=True)
return textractor(inputs)
# Initialize textractor
if not self.textractor:
self.textractor = Textractor(
paragraphs=True,
backend=os.environ.get("TEXTBACKEND", "available"),
)

# Extract text
return self.textractor(inputs)

def infertopics(self, embeddings, start):
"""
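
Outside the diff, the change amounts to the following pattern: a lazily created `Textractor` whose backend comes from the `TEXTBACKEND` environment variable. A minimal self-contained sketch; the `DocumentExtractor` class name and the `document.pdf` path are hypothetical placeholders:

```python
import os

from txtai.pipeline import Textractor


class DocumentExtractor:
    """
    Sketch of the pattern above: the Textractor is created on first use,
    with the backend read from the TEXTBACKEND environment variable
    ("available" picks the best installed backend).
    """

    def __init__(self):
        # Textractor instance (lazy loaded)
        self.textractor = None

    def extract(self, inputs):
        # Initialize textractor on first call
        if not self.textractor:
            self.textractor = Textractor(
                paragraphs=True,
                backend=os.environ.get("TEXTBACKEND", "available"),
            )

        # Extract text
        return self.textractor(inputs)


# Usage: the variable is read on the first extraction, so set it beforehand
os.environ["TEXTBACKEND"] = "docling"
extractor = DocumentExtractor()
paragraphs = extractor.extract("document.pdf")  # placeholder path
```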
