documentation fixes (#837)
rishiraj committed Aug 14, 2024
1 parent 163bad0 commit 6ccd19d
Showing 15 changed files with 47 additions and 35 deletions.
1 change: 0 additions & 1 deletion docs/examples/index.mdx
@@ -23,7 +23,6 @@ Here are some of examples of use-cases you could accomplish with Indexify

## Invoice Extraction
- [Structured Extraction using GPT4](https://github.com/tensorlakeai/indexify/tree/main/examples/invoices/structured_extraction)
-- [Structured Extraction using a Local Model(Donut)](https://github.com/tensorlakeai/indexify/tree/main/examples/invoices/donut_invoice)


## LLM Integrations
@@ -54,7 +54,7 @@ indexify-extractor join-server

The extraction graph defines the flow of data through our entity extraction pipeline. We'll create a graph that first extracts text from PDFs, then sends that text to Mistral for entity extraction.
-Create a new Python file called `pdf_entity_extraction_pipeline.py` and add the following code:
+Create a new Python file called `setup_graph.py` and add the following code:
```python
from indexify import IndexifyClient, ExtractionGraph
@@ -83,14 +83,14 @@ Replace `'YOUR_MISTRAL_API_KEY'` with your actual Mistral API key.
You can run this script to set up the pipeline:
```bash
-python pdf_entity_extraction_pipeline.py
+python setup_graph.py
```
## Implementing the Entity Extraction Pipeline
Now that we have our extraction graph set up, we can upload files and retrieve the entities:
-Create a file `upload_and_retreive.py`
+Create a file `upload_and_retrieve.py`
```python
import json
@@ -140,7 +140,7 @@ if __name__ == "__main__":
You can run the Python script as many times as needed, or use it in an application to continue generating summaries:
```bash
-python upload_and_retreive.py
+python upload_and_retrieve.py
```
<img src="https://raw.githubusercontent.com/tensorlakeai/indexify/main/examples/llm_integrations/mistral/pdf-entity-extraction/carbon.png" width="600"/>
Expand Down
18 changes: 12 additions & 6 deletions examples/llm_integrations/openai_pdf_translation/README.md
@@ -80,7 +80,7 @@ This approach leverages GPT-4o's ability to directly process PDFs, eliminating t
### Creating the Extraction Graph (GPT-4o)
-Create a new Python file called `pdf_translation_pipeline_gpt4o.py` and add the following code:
+Create a new Python file called `setup_graph_gpt4o.py` and add the following code:
```python
from indexify import IndexifyClient, ExtractionGraph
@@ -93,8 +93,8 @@ extraction_policies:
- extractor: 'tensorlake/openai'
name: 'pdf_to_french'
input_params:
-model_name: 'gpt-4o'
-key: 'YOUR_OPENAI_API_KEY'
+model: 'gpt-4o'
+api_key: 'YOUR_OPENAI_API_KEY'
system_prompt: 'Translate the content of the following PDF from English to French. Maintain the original formatting and structure as much as possible. Provide the translation in plain text format.'
"""
@@ -158,7 +158,7 @@ This approach first extracts text from the PDF, then sends that text to GPT-3.5-
### Creating the Extraction Graph (GPT-3.5-turbo)
-Create a new Python file called `pdf_translation_pipeline_gpt35.py` and add the following code:
+Create a new Python file called `setup_graph_gpt35.py` and add the following code:
```python
from indexify import IndexifyClient, ExtractionGraph
@@ -173,8 +173,8 @@ extraction_policies:
- extractor: 'tensorlake/openai'
name: 'text_to_french'
input_params:
-model_name: 'gpt-3.5-turbo'
-key: 'YOUR_OPENAI_API_KEY'
+model: 'gpt-3.5-turbo'
+api_key: 'YOUR_OPENAI_API_KEY'
system_prompt: 'You are a professional translator. Translate the following English text to French. Maintain the original formatting and structure as much as possible.'
content_source: 'pdf_to_text'
"""
@@ -236,6 +236,12 @@ if __name__ == "__main__":
You can run either Python script to translate a PDF:
+```bash
+python setup_graph_gpt4o.py
+# or
+python setup_graph_gpt35.py
+```
```bash
python upload_and_retrieve_gpt4o.py
# or
Expand Down
@@ -10,8 +10,8 @@
- extractor: 'tensorlake/openai'
name: 'text_to_french'
input_params:
-model_name: 'gpt-3.5-turbo'
-key: 'YOUR_OPENAI_API_KEY'
+model: 'gpt-3.5-turbo'
+api_key: 'YOUR_OPENAI_API_KEY'
system_prompt: 'You are a professional translator. Translate the following English text to French. Maintain the original formatting and structure as much as possible.'
content_source: 'pdf_to_text'
"""
@@ -8,8 +8,8 @@
- extractor: 'tensorlake/openai'
name: 'pdf_to_french'
input_params:
-model_name: 'gpt-4o'
-key: 'YOUR_OPENAI_API_KEY'
+model: 'gpt-4o'
+api_key: 'YOUR_OPENAI_API_KEY'
system_prompt: 'Translate the content of the following PDF from English to French. Maintain the original formatting and structure as much as possible. Provide the translation in plain text format.'
"""

5 changes: 4 additions & 1 deletion examples/pdf/chunking/README.md
@@ -48,7 +48,7 @@ indexify-extractor join-server

The extraction graph defines the flow of data through our chunking pipeline. We'll create a graph that first extracts text from PDFs, then chunks that text using the RecursiveCharacterTextSplitter.
-Create a new Python file called `pdf_chunking_graph.py` and add the following code:
+Create a new Python file called `setup_graph.py` and add the following code:
```python
from indexify import IndexifyClient, ExtractionGraph
@@ -133,8 +133,11 @@ You can run the Python script to process a PDF and generate chunks:
python upload_and_retrieve.py
```
+Sample Page to extract chunk from:
+<img src="https://raw.githubusercontent.com/tensorlakeai/indexify/main/examples/pdf/chunking/screenshot.png" width="600"/>
+Sample Chunk extracted from page:
<img src="https://raw.githubusercontent.com/tensorlakeai/indexify/main/examples/pdf/chunking/carbon.png" width="600"/>
## Customization and Advanced Usage
6 changes: 3 additions & 3 deletions examples/pdf/image/README.md
@@ -8,7 +8,7 @@ We provide a script that downloads a PDF, uploads it to Indexify, and retrieves

| Sample Page |
|:-----------:|
-| <img src="https://docs.getindexify.ai/example_code/pdf/image/2310.06825v1_page-0004.jpg" width="600"/> |
+| <img src="https://raw.githubusercontent.com/tensorlakeai/indexify/docsupdate/examples/pdf/image/2310.06825v1_page-0004.jpg" width="600"/> |

Source: [https://arxiv.org/pdf/2310.06825.pdf](https://arxiv.org/pdf/2310.06825.pdf)

@@ -38,7 +38,7 @@ Before we begin, ensure you have the following:

## File Descriptions
1. `setup_graph.py`: This script sets up the extraction graph for converting PDFs to images.
-2. `upload_and_retreive.py`: This script downloads a PDF, uploads it to Indexify, and retrieves the extracted images.
+2. `upload_and_retrieve.py`: This script downloads a PDF, uploads it to Indexify, and retrieves the extracted images.

## Usage
1. First, run the [setup_graph.py](https://github.com/tensorlakeai/indexify/blob/main/examples/pdf/image/setup_graph.py) script to set up the extraction graph:
@@ -90,7 +90,7 @@ def get_images(pdf_path):
# Retrieve the images content
images = client.get_extracted_content(
-content_id=content_id,
+ingested_content_id=content_id,
graph_name="image_extractor",
policy_name="pdf_to_image"
)
Expand Down
19 changes: 10 additions & 9 deletions examples/pdf/image/upload_and_retrieve.py
@@ -10,33 +10,34 @@ def download_pdf(url, save_path):

def get_images(pdf_path):
client = IndexifyClient()

# Upload the PDF file
content_id = client.upload_file("image_extractor", pdf_path)


+# Wait for the extraction to complete
+client.wait_for_extraction(content_id)

# Retrieve the images content
images = client.get_extracted_content(
ingested_content_id=content_id,
graph_name="image_extractor",
-policy_name="pdf_to_image",
-blocking=True,
+policy_name="pdf_to_image"
)

return images

# Example usage
if __name__ == "__main__":
pdf_url = "https://arxiv.org/pdf/2310.06825.pdf"
pdf_path = "reference_document.pdf"

# Download the PDF
download_pdf(pdf_url, pdf_path)

# Get images from the PDF
images = get_images(pdf_path)
for image in images:
content_id = image["id"]
with open(f"{content_id}.png", 'wb') as f:
-print("writing image ", image["id"])
-f.write(image["content"])
+f.write(image["content"])
6 changes: 3 additions & 3 deletions examples/pdf/langchain/README.md
@@ -42,7 +42,7 @@ pip install indexify indexify-langchain langchain langchain-openai

### 1. Set Up the Extraction Graph

-Create a file named `setup_extraction_graph.py`:
+Create a file named `setup_graph.py`:

```python
from indexify import IndexifyClient, ExtractionGraph
@@ -73,7 +73,7 @@ client.create_extraction_graph(extraction_graph)
Run this script to set up the extraction pipeline:

```bash
-python setup_extraction_graph.py
+python setup_graph.py
```

### 2. Implement the PDF QA System
@@ -160,7 +160,7 @@ Reference from PDF file from which answer should be generated:

You can run the Python script to process a PDF and answer questions:
```bash
-python upload_and_retreive.py
+python upload_and_retrieve.py
```
<img src="https://raw.githubusercontent.com/tensorlakeai/indexify/main/examples/pdf/langchain/carbon.png" width="600"/>

File renamed without changes.
11 changes: 7 additions & 4 deletions examples/pdf/pdf_to_markdown/README.md
@@ -41,7 +41,7 @@ indexify-extractor join-server

The extraction graph defines the flow of data through our text extraction pipeline. We'll create a graph that extracts text from PDFs using the tensorlake/marker extractor.
-Create a new Python file called `pdf_text_extraction_graph.py` and add the following code:
+Create a new Python file called `setup_graph.py` and add the following code:
```python
from indexify import IndexifyClient, ExtractionGraph
@@ -63,12 +63,12 @@ client.create_extraction_graph(extraction_graph)
Run this script to set up the pipeline:
```bash
-python pdf_text_extraction_graph.py
+python setup_graph.py
```
## Implementing the Text Extraction Pipeline
-Now that we have our extraction graph set up, we can upload files and make the pipeline extract text. Create a file `upload_and_extract.py`:
+Now that we have our extraction graph set up, we can upload files and make the pipeline extract text. Create a file `upload_and_retrieve.py`:
```python
import requests
@@ -116,11 +116,14 @@ if __name__ == "__main__":
You can run the Python script to process a PDF and extract its text:
```bash
-python upload_and_extract.py
+python upload_and_retrieve.py
```
+Sample Page to extract markdown from:
+<img src="https://raw.githubusercontent.com/tensorlakeai/indexify/main/examples/pdf/pdf_to_markdown/screenshot.png" width="600"/>
+Sample Markdown extracted from page:
<img src="https://raw.githubusercontent.com/tensorlakeai/indexify/main/examples/pdf/pdf_to_markdown/carbon.png" width="600"/>
## Customization and Advanced Usage
File renamed without changes.
