documentation fixes (#837)
rishiraj committed Aug 14, 2024
1 parent 163bad0 commit 6ccd19d
Showing 15 changed files with 47 additions and 35 deletions.
1 change: 0 additions & 1 deletion docs/examples/index.mdx
@@ -23,7 +23,6 @@ Here are some of examples of use-cases you could accomplish with Indexify

## Invoice Extraction
- [Structured Extraction using GPT4](https://github.com/tensorlakeai/indexify/tree/main/examples/invoices/structured_extraction)
-- [Structured Extraction using a Local Model(Donut)](https://github.com/tensorlakeai/indexify/tree/main/examples/invoices/donut_invoice)


## LLM Integrations
@@ -54,7 +54,7 @@ indexify-extractor join-server

The extraction graph defines the flow of data through our entity extraction pipeline. We'll create a graph that first extracts text from PDFs, then sends that text to Mistral for entity extraction.
-Create a new Python file called `pdf_entity_extraction_pipeline.py` and add the following code:
+Create a new Python file called `setup_graph.py` and add the following code:
```python
from indexify import IndexifyClient, ExtractionGraph
@@ -83,14 +83,14 @@ Replace `'YOUR_MISTRAL_API_KEY'` with your actual Mistral API key.
You can run this script to set up the pipeline:
```bash
-python pdf_entity_extraction_pipeline.py
+python setup_graph.py
```
## Implementing the Entity Extraction Pipeline
Now that we have our extraction graph set up, we can upload files and retrieve the entities:
-Create a file `upload_and_retreive.py`
+Create a file `upload_and_retrieve.py`
```python
import json
@@ -140,7 +140,7 @@ if __name__ == "__main__":
You can run the Python script as many times as needed, or use it in an application to continue generating summaries:
```bash
-python upload_and_retreive.py
+python upload_and_retrieve.py
```
<img src="https://raw.githubusercontent.com/tensorlakeai/indexify/main/examples/llm_integrations/mistral/pdf-entity-extraction/carbon.png" width="600"/>
Expand Down
18 changes: 12 additions & 6 deletions examples/llm_integrations/openai_pdf_translation/README.md
@@ -80,7 +80,7 @@ This approach leverages GPT-4o's ability to directly process PDFs, eliminating t
### Creating the Extraction Graph (GPT-4o)
-Create a new Python file called `pdf_translation_pipeline_gpt4o.py` and add the following code:
+Create a new Python file called `setup_graph_gpt4o.py` and add the following code:
```python
from indexify import IndexifyClient, ExtractionGraph
@@ -93,8 +93,8 @@ extraction_policies:
- extractor: 'tensorlake/openai'
name: 'pdf_to_french'
input_params:
-model_name: 'gpt-4o'
-key: 'YOUR_OPENAI_API_KEY'
+model: 'gpt-4o'
+api_key: 'YOUR_OPENAI_API_KEY'
system_prompt: 'Translate the content of the following PDF from English to French. Maintain the original formatting and structure as much as possible. Provide the translation in plain text format.'
"""
@@ -158,7 +158,7 @@ This approach first extracts text from the PDF, then sends that text to GPT-3.5-
### Creating the Extraction Graph (GPT-3.5-turbo)
-Create a new Python file called `pdf_translation_pipeline_gpt35.py` and add the following code:
+Create a new Python file called `setup_graph_gpt35.py` and add the following code:
```python
from indexify import IndexifyClient, ExtractionGraph
@@ -173,8 +173,8 @@ extraction_policies:
- extractor: 'tensorlake/openai'
name: 'text_to_french'
input_params:
-model_name: 'gpt-3.5-turbo'
-key: 'YOUR_OPENAI_API_KEY'
+model: 'gpt-3.5-turbo'
+api_key: 'YOUR_OPENAI_API_KEY'
system_prompt: 'You are a professional translator. Translate the following English text to French. Maintain the original formatting and structure as much as possible.'
content_source: 'pdf_to_text'
"""
@@ -236,6 +236,12 @@ if __name__ == "__main__":
You can run either Python script to translate a PDF:
+```bash
+python setup_graph_gpt4o.py
+# or
+python setup_graph_gpt35.py
+```
```bash
python upload_and_retrieve_gpt4o.py
# or
Expand Down
@@ -10,8 +10,8 @@
- extractor: 'tensorlake/openai'
name: 'text_to_french'
input_params:
-model_name: 'gpt-3.5-turbo'
-key: 'YOUR_OPENAI_API_KEY'
+model: 'gpt-3.5-turbo'
+api_key: 'YOUR_OPENAI_API_KEY'
system_prompt: 'You are a professional translator. Translate the following English text to French. Maintain the original formatting and structure as much as possible.'
content_source: 'pdf_to_text'
"""
@@ -8,8 +8,8 @@
- extractor: 'tensorlake/openai'
name: 'pdf_to_french'
input_params:
-model_name: 'gpt-4o'
-key: 'YOUR_OPENAI_API_KEY'
+model: 'gpt-4o'
+api_key: 'YOUR_OPENAI_API_KEY'
system_prompt: 'Translate the content of the following PDF from English to French. Maintain the original formatting and structure as much as possible. Provide the translation in plain text format.'
"""

5 changes: 4 additions & 1 deletion examples/pdf/chunking/README.md
@@ -48,7 +48,7 @@ indexify-extractor join-server

The extraction graph defines the flow of data through our chunking pipeline. We'll create a graph that first extracts text from PDFs, then chunks that text using the RecursiveCharacterTextSplitter.
-Create a new Python file called `pdf_chunking_graph.py` and add the following code:
+Create a new Python file called `setup_graph.py` and add the following code:
```python
from indexify import IndexifyClient, ExtractionGraph
@@ -133,8 +133,11 @@ You can run the Python script to process a PDF and generate chunks:
python upload_and_retrieve.py
```
+Sample Page to extract chunk from:
+<img src="https://raw.githubusercontent.com/tensorlakeai/indexify/main/examples/pdf/chunking/screenshot.png" width="600"/>
+Sample Chunk extracted from page:
<img src="https://raw.githubusercontent.com/tensorlakeai/indexify/main/examples/pdf/chunking/carbon.png" width="600"/>
## Customization and Advanced Usage
6 changes: 3 additions & 3 deletions examples/pdf/image/README.md
@@ -8,7 +8,7 @@ We provide a script that downloads a PDF, uploads it to Indexify, and retrieves

| Sample Page |
|:-----------:|
-| <img src="https://docs.getindexify.ai/example_code/pdf/image/2310.06825v1_page-0004.jpg" width="600"/> |
+| <img src="https://raw.githubusercontent.com/tensorlakeai/indexify/docsupdate/examples/pdf/image/2310.06825v1_page-0004.jpg" width="600"/> |

Source: [https://arxiv.org/pdf/2310.06825.pdf](https://arxiv.org/pdf/2310.06825.pdf)

@@ -38,7 +38,7 @@ Before we begin, ensure you have the following:

## File Descriptions
1. `setup_graph.py`: This script sets up the extraction graph for converting PDFs to images.
-2. `upload_and_retreive.py`: This script downloads a PDF, uploads it to Indexify, and retrieves the extracted images.
+2. `upload_and_retrieve.py`: This script downloads a PDF, uploads it to Indexify, and retrieves the extracted images.

## Usage
1. First, run the [setup_graph.py](https://github.com/tensorlakeai/indexify/blob/main/examples/pdf/image/setup_graph.py) script to set up the extraction graph:
@@ -90,7 +90,7 @@ def get_images(pdf_path):
# Retrieve the images content
images = client.get_extracted_content(
-content_id=content_id,
+ingested_content_id=content_id,
graph_name="image_extractor",
policy_name="pdf_to_image"
)
Expand Down
19 changes: 10 additions & 9 deletions examples/pdf/image/upload_and_retrieve.py
@@ -10,33 +10,34 @@ def download_pdf(url, save_path):

def get_images(pdf_path):
client = IndexifyClient()

# Upload the PDF file
content_id = client.upload_file("image_extractor", pdf_path)


+# Wait for the extraction to complete
+client.wait_for_extraction(content_id)

# Retrieve the images content
images = client.get_extracted_content(
ingested_content_id=content_id,
graph_name="image_extractor",
-policy_name="pdf_to_image",
-blocking=True,
+policy_name="pdf_to_image"
)

return images

# Example usage
if __name__ == "__main__":
pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"
pdf_path = "reference_document.pdf"

# Download the PDF
download_pdf(pdf_url, pdf_path)

# Get images from the PDF
images = get_images(pdf_path)
for image in images:
content_id = image["id"]
with open(f"{content_id}.png", 'wb') as f:
-print("writing image ", image["id"])
-f.write(image["content"])
+f.write(image["content"])
6 changes: 3 additions & 3 deletions examples/pdf/langchain/README.md
@@ -42,7 +42,7 @@ pip install indexify indexify-langchain langchain langchain-openai

### 1. Set Up the Extraction Graph

-Create a file named `setup_extraction_graph.py`:
+Create a file named `setup_graph.py`:

```python
from indexify import IndexifyClient, ExtractionGraph
@@ -73,7 +73,7 @@ client.create_extraction_graph(extraction_graph)
Run this script to set up the extraction pipeline:

```bash
-python setup_extraction_graph.py
+python setup_graph.py
```

### 2. Implement the PDF QA System
@@ -160,7 +160,7 @@ Reference from PDF file from which answer should be generated:

You can run the Python script to process a PDF and answer questions:
```bash
-python upload_and_retreive.py
+python upload_and_retrieve.py
```
<img src="https://raw.githubusercontent.com/tensorlakeai/indexify/main/examples/pdf/langchain/carbon.png" width="600"/>

File renamed without changes.
11 changes: 7 additions & 4 deletions examples/pdf/pdf_to_markdown/README.md
@@ -41,7 +41,7 @@ indexify-extractor join-server

The extraction graph defines the flow of data through our text extraction pipeline. We'll create a graph that extracts text from PDFs using the tensorlake/marker extractor.
-Create a new Python file called `pdf_text_extraction_graph.py` and add the following code:
+Create a new Python file called `setup_graph.py` and add the following code:
```python
from indexify import IndexifyClient, ExtractionGraph
@@ -63,12 +63,12 @@ client.create_extraction_graph(extraction_graph)
Run this script to set up the pipeline:
```bash
-python pdf_text_extraction_graph.py
+python setup_graph.py
```
## Implementing the Text Extraction Pipeline
-Now that we have our extraction graph set up, we can upload files and make the pipeline extract text. Create a file `upload_and_extract.py`:
+Now that we have our extraction graph set up, we can upload files and make the pipeline extract text. Create a file `upload_and_retrieve.py`:
```python
import requests
@@ -116,11 +116,14 @@ if __name__ == "__main__":
You can run the Python script to process a PDF and extract its text:
```bash
-python upload_and_extract.py
+python upload_and_retrieve.py
```
+Sample Page to extract markdown from:
+<img src="https://raw.githubusercontent.com/tensorlakeai/indexify/main/examples/pdf/pdf_to_markdown/screenshot.png" width="600"/>
+Sample Markdown extracted from page:
<img src="https://raw.githubusercontent.com/tensorlakeai/indexify/main/examples/pdf/pdf_to_markdown/carbon.png" width="600"/>
## Customization and Advanced Usage
File renamed without changes.
