PDF Highlighter

A library for highlighting and annotating sentences in PDF documents using Large Language Models (LLMs). It helps users identify and emphasize the sentences that are relevant to their question or input. Works with both the OpenAI and Ollama libraries.

Use cases

  • Finding Relevant Information:

    • Highlight specific sentences in a PDF that are relevant to a user's question or input. For example, if a user asks, "What are the main findings?", the tool will highlight sentences in the PDF that answer this question.
  • Reviewing LLM-Generated Answers:

    • If a user has received an answer from an LLM based on information in a PDF, they can use this tool to highlight the exact text in the PDF that supports the LLM's answer. This helps in verifying and understanding the context of the LLM's response.

Features

  • Highlight sentences in PDF documents based on user input.
  • Optionally add comments to highlighted sentences.
  • Supports both OpenAI and Ollama language models.
  • Combine multiple PDFs into a single document with highlights and comments.
  • Classes and methods are asynchronous, allowing for non-blocking operations.

Requirements

  • Python 3.7+ (tested with 3.10.13)
  • Required Python packages (see requirements.txt)

Installation

  1. Clone the repository:

    git clone https://github.com/lasseedfast/pdf-highlighter.git
    cd pdf-highlighter
  2. Create a virtual environment and activate it:

    python -m venv venv
    source venv/bin/activate
  3. Install the required packages:

    pip install -r requirements.txt
  4. Set up environment variables:

    • Add your OpenAI API key and/or LLM model details to the .env file:
      OPENAI_API_KEY=your_openai_api_key
      LLM_MODEL=your_llm_model
      
      You can also set the model name when initializing a class: use the model parameter of the LLM class or the llm_model parameter of the Highlighter class.
  5. If using Ollama, make sure to install the Ollama server and download the model you want to use. Follow the instructions in the Ollama documentation for more details.

Usage

Command-Line Interface

You can use the command-line interface to highlight sentences in a PDF document.

Arguments

  • --user_input: The user's question or input text; sentences relevant to it are highlighted in the PDF.
  • --pdf_filename: The PDF filename to process.
  • --silent: Suppress warnings (optional).
  • --openai_key: OpenAI API key (optional if set in .env).
  • --comment: Include comments in the highlighted PDF (optional).
  • --data: Data in JSON format (fields: user_input, pdf_filename, pages) (optional).
  • --llm_model: The LLM model to use (optional if set in .env).

Example

python highlight_pdf.py --user_input "What is said about climate?" --pdf_filename "example_pdf_document.pdf" --comment --llm_model llama3.1

Note on Long PDFs

If the PDF is long, results are better when you supply the --data argument with the pdf_filename, user_input, and pages for each document. This helps the tool focus on specific parts of the document, improving the accuracy and relevance of the highlights.

Example using the data argument

python highlight_pdf.py --data '[{"user_input": "What is said about climate?", "pdf_filename": "example_pdf_document.pdf", "pages": [1, 2]}]'

Output

The highlighted PDF will be saved with _highlighted appended to the original filename.

Use in Python Code

This example demonstrates how to use the highlighter from Python code to find the text in a PDF that is relevant to the original user input/question.
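
A minimal sketch of that workflow, assuming the Highlighter class is importable from highlight_pdf and that highlight() returns the highlighted PDF as a bytes-like object (check the source for the exact module path and return type):

    import asyncio
    from highlight_pdf import Highlighter  # assumed import path

    async def main():
        # Documented parameters: comment adds LLM comments, llm_model picks the model
        highlighter = Highlighter(comment=True, llm_model="llama3.1")
        highlighted_pdf = await highlighter.highlight(
            user_input="What is said about climate?",
            pdf_filename="example_pdf_document.pdf",
        )
        # Assumption: the return value is the highlighted PDF as bytes;
        # adjust if the implementation returns a stream or writes the file itself.
        with open("example_pdf_document_highlighted.pdf", "wb") as f:
            f.write(highlighted_pdf)

    asyncio.run(main())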

Use in Python Code with ChromaDB

If you have previously used ChromaDB to query for relevant texts, you can use the tool to highlight those texts in the source PDFs based on the user input/question.
This example assumes a ChromaDB instance where the filenames and pages of the stored texts are kept as metadata.
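
A sketch of that workflow, assuming a ChromaDB collection whose metadata stores the PDF filename and page for each text (the metadata keys, the collection name, and the Highlighter import path below are assumptions):

    import asyncio
    import chromadb
    from highlight_pdf import Highlighter  # assumed import path

    question = "What is said about climate?"

    client = chromadb.PersistentClient(path="chroma_db")
    collection = client.get_collection("documents")
    results = collection.query(query_texts=[question], n_results=3)

    # Build the data structure accepted by highlight(): one entry per retrieved text,
    # with the filename and pages taken from the ChromaDB metadata (keys assumed).
    data = [
        {
            "user_input": question,
            "pdf_filename": meta["pdf_filename"],
            "pages": [meta["page"]],
        }
        for meta in results["metadatas"][0]
    ]

    async def main():
        highlighter = Highlighter(comment=True)
        highlighted_pdf = await highlighter.highlight(user_input=question, data=data)
        with open("highlighted.pdf", "wb") as f:  # assumes a bytes-like return value
            f.write(highlighted_pdf)

    asyncio.run(main())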

Streamlit Example

A Streamlit example is provided in example_streamlit_app.py to demonstrate how to use the PDF highlighter tool in a web application.

Running the Streamlit App

  1. Ensure you have installed the required packages and set up the environment variables as described in the Installation section.
  2. Install streamlit:
    pip install streamlit
  3. Run the Streamlit app:
    streamlit run example_streamlit_app.py

Streamlit App Features

  • Enter your question or input text.
  • Upload a PDF file.
  • Optionally, choose to add comments to the highlighted text.
  • Click the "Highlight PDF" button to process the PDF.
  • Preview the highlighted PDF in the sidebar.
  • Download the highlighted PDF.
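
For reference, the core pattern of such an app can be sketched as follows (a simplified sketch only; example_streamlit_app.py is the authoritative version, and the import path and return type of highlight() are assumptions):

    import asyncio
    import streamlit as st
    from highlight_pdf import Highlighter  # assumed import path

    st.title("PDF Highlighter")
    question = st.text_input("Your question")
    uploaded = st.file_uploader("Upload a PDF", type="pdf")
    add_comments = st.checkbox("Add comments to highlights")

    if st.button("Highlight PDF") and uploaded and question:
        # highlight() takes a filename, so write the upload to disk first
        with open(uploaded.name, "wb") as f:
            f.write(uploaded.getvalue())
        highlighter = Highlighter(comment=add_comments)
        highlighted = asyncio.run(
            highlighter.highlight(user_input=question, pdf_filename=uploaded.name)
        )
        # Assumption: highlight() returns the highlighted PDF as bytes
        st.download_button("Download highlighted PDF", highlighted, file_name="highlighted.pdf")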

API

Highlighter Class

Methods

  • __init__(self, silent=False, openai_key=None, comment=False, llm_model=None, llm_temperature=0, llm_system_prompt=None, llm_num_ctx=None, llm_memory=True, llm_keep_alive=3600): Initializes the Highlighter class with the given parameters.
  • async highlight(self, user_input, docs=None, data=None, pdf_filename=None): Highlights sentences in the provided PDF documents based on the user input.
  • async get_sentences_with_llm(self, text, user_input): Uses the LLM to generate sentences from the text that should be highlighted based on the user input.
  • async annotate_pdf(self, user_input: str, filename: str, pages: list = None, extend_pages: bool = False): Annotates the PDF with highlighted sentences and optional comments.
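
For example, annotate_pdf can be used to restrict highlighting to specific pages (a sketch only; the import path is assumed, and whether annotate_pdf writes the output file itself or returns it should be checked against the implementation):

    import asyncio
    from highlight_pdf import Highlighter  # assumed import path

    async def main():
        highlighter = Highlighter(comment=True, llm_model="llama3.1")
        # Only pages 1-2 are considered for highlighting
        await highlighter.annotate_pdf(
            user_input="What are the main findings?",
            filename="example_pdf_document.pdf",
            pages=[1, 2],
        )

    asyncio.run(main())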

LLM Class

Methods

  • __init__(self, openai_key=False, model=None, temperature=0, system_prompt=None, num_ctx=None, memory=True, keep_alive=3600): Initializes the LLM class with the provided parameters.
  • use_openai(self, key, model): Configures the class to use OpenAI for generating responses.
  • use_ollama(self, model): Configures the class to use Ollama for generating responses.
  • async generate(self, prompt): Asynchronously generates a response based on the provided prompt.

Note: The num_ctx parameter is set to 20000 by default, which may not be sufficient for all use cases. Adjust this value based on your specific requirements.
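
The LLM class can also be used on its own for plain prompt/response generation (a sketch, assuming the class is importable from highlight_pdf; whether the backend is chosen automatically at initialization or must be set via use_openai/use_ollama should be checked against the implementation):

    import asyncio
    from highlight_pdf import LLM  # assumed import path

    async def main():
        llm = LLM(model="llama3.1", num_ctx=8192)  # override the 20000-token default context
        llm.use_ollama(model="llama3.1")  # explicitly select the Ollama backend
        answer = await llm.generate("Summarize the main findings in one sentence.")
        print(answer)

    asyncio.run(main())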

Default Prompts

The default LLM prompts are stored in the prompts.yaml file. You can view and edit the prompts directly in this file.

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
