A library for highlighting and annotating sentences in PDF documents using Large Language Models (LLMs). It helps users identify and emphasize relevant sentences in PDF documents, and it is compatible with both the OpenAI and Ollama libraries.
- Finding Relevant Information:
  - Highlight specific sentences in a PDF that are relevant to a user's question or input. For example, if a user asks, "What are the main findings?", the tool will highlight sentences in the PDF that answer this question.
- Reviewing LLM-Generated Answers:
  - If a user has received an answer from an LLM based on information in a PDF, they can use this tool to highlight the exact text in the PDF that supports the LLM's answer. This helps in verifying and understanding the context of the LLM's response.
- Highlight sentences in PDF documents based on user input.
- Optionally add comments to highlighted sentences.
- Supports both OpenAI and Ollama language models.
- Combine multiple PDFs into a single document with highlights and comments.
- Classes and methods are asynchronous, allowing for non-blocking operations.
- Python 3.7+ (tested with 3.10.13)
- Required Python packages (see requirements.txt)
- Clone the repository:
  git clone https://github.com/lasseedfast/pdf-highlighter.git
  cd pdf-highlighter
- Create a virtual environment and activate it:
  python -m venv venv
  source venv/bin/activate
- Install the required packages:
  pip install -r requirements.txt
- Set up environment variables:
  - Add your OpenAI API key and/or LLM model details to the .env file:
    OPENAI_API_KEY=your_openai_api_key
    LLM_MODEL=your_llm_model
  - You can also set the LLM model name when initializing the LLM or Highlighter class using the model parameter.
- If using Ollama, make sure to install the Ollama server and download the model you want to use. Follow the instructions in the Ollama documentation for more details.
You can use the command-line interface to highlight sentences in a PDF document.
- --user_input: The text input from the user to highlight in the PDFs.
- --pdf_filename: The PDF filename to process.
- --silent: Suppress warnings (optional).
- --openai_key: OpenAI API key (optional if set in .env).
- --comment: Include comments in the highlighted PDF (optional).
- --data: Data in JSON format (fields: user_input, pdf_filename, pages) (optional).
- --llm_model: The LLM model to use (optional if set in .env).
python highlight_pdf.py --user_input "What is said about climate?" --pdf_filename "example_pdf_document.pdf" --comment --llm_model llama3.1
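To make the option semantics concrete, here is a minimal sketch of how a parser for these flags could be declared with argparse. It is not the project's actual parser: build_parser is a hypothetical helper, the help strings are paraphrased, and treating --silent and --comment as boolean switches and --data as JSON decoded at parse time are assumptions based on the option list above.

```python
import argparse
import json


def build_parser():
    """Declare the documented flags (hypothetical reconstruction, not the real CLI)."""
    parser = argparse.ArgumentParser(description="Highlight sentences in a PDF.")
    parser.add_argument("--user_input", help="Text input to highlight in the PDFs")
    parser.add_argument("--pdf_filename", help="PDF filename to process")
    # Assumed to be on/off switches because they take no value in the example command
    parser.add_argument("--silent", action="store_true", help="Suppress warnings")
    parser.add_argument("--comment", action="store_true", help="Include comments")
    parser.add_argument("--openai_key", help="OpenAI API key")
    # Assumption: the JSON string is decoded into Python objects while parsing
    parser.add_argument("--data", type=json.loads, help="JSON list of entries")
    parser.add_argument("--llm_model", help="LLM model to use")
    return parser


# Parse the same arguments as the example command above
args = build_parser().parse_args(
    [
        "--user_input", "What is said about climate?",
        "--pdf_filename", "example_pdf_document.pdf",
        "--comment",
        "--llm_model", "llama3.1",
    ]
)
print(args)
```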
For long PDFs, results are better when you pass --data with the user_input, pdf_filename, and pages fields. Restricting the pages helps the tool focus on specific parts of the document, improving the accuracy and relevance of the highlights.
python highlight_pdf.py --data '[{"user_input": "What is said about climate?", "pdf_filename": "example_pdf_document.pdf", "pages": [1, 2]}]'
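Writing the JSON by hand is error-prone once there are several entries, so the --data argument can be generated instead. The sketch below builds the same command from Python using only the standard library. build_command is a hypothetical helper, and the script name is taken from the example above.

```python
import json
import shlex


def build_command(entries, script="highlight_pdf.py"):
    """Return an argv list that passes the entries to the CLI through --data."""
    return ["python", script, "--data", json.dumps(entries)]


entries = [
    {
        "user_input": "What is said about climate?",
        "pdf_filename": "example_pdf_document.pdf",
        "pages": [1, 2],
    }
]

command = build_command(entries)

# Print a shell-safe version of the command for copy and paste
print(" ".join(shlex.quote(part) for part in command))
```

The argv list can also be passed directly to subprocess.run, which avoids shell quoting issues altogether.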
The highlighted PDF will be saved with _highlighted appended to the original filename.
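As an illustration of the naming rule, the sketch below derives the output name from the input name. It assumes that _highlighted is inserted before the .pdf extension, which the text above does not state explicitly. highlighted_path is a hypothetical helper, not part of the library.

```python
from pathlib import Path


def highlighted_path(pdf_filename):
    """Return the assumed output path: <name>_highlighted<extension>."""
    path = Path(pdf_filename)
    # Assumption: the suffix goes between the file stem and the extension
    return path.with_name(f"{path.stem}_highlighted{path.suffix}")


print(highlighted_path("example_pdf_document.pdf"))
```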
This example demonstrates how to use the highlight tool to understand what text in the PDF is relevant for the original user input/question.
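Because highlight is a coroutine, it has to be awaited inside an event loop. The sketch below shows that calling pattern with a stand-in class, so that it runs without the library or an LLM. FakeHighlighter is hypothetical and only mirrors the highlight signature listed in this README. The import path in the comment is an assumption, and the return value of the real method is not documented here.

```python
import asyncio


class FakeHighlighter:
    """Stand-in that mirrors the async highlight() signature (hypothetical)."""

    def __init__(self):
        self.calls = []

    async def highlight(self, user_input, docs=None, data=None, pdf_filename=None):
        # Record the call; the real method would process the PDF here
        self.calls.append({"user_input": user_input, "pdf_filename": pdf_filename})
        return None


async def main(highlighter):
    # highlight() must be awaited because it is declared async
    return await highlighter.highlight(
        user_input="What is said about climate?",
        pdf_filename="example_pdf_document.pdf",
    )


# With the real library this might look like (import path is an assumption):
#   from highlight_pdf import Highlighter
#   highlighter = Highlighter(comment=True, llm_model="llama3.1")
highlighter = FakeHighlighter()
asyncio.run(main(highlighter))
```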
If the user has previously used ChromaDB to query for relevant texts, they can use the tool to highlight the relevant text in the PDFs based on the user input/question.
This example assumes that there is a ChromaDB instance with information, and that the filenames and pages where the text is found are stored as metadata in ChromaDB.
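One way to bridge the two tools is to turn the stored metadata into the data list that highlight accepts (fields: user_input, pdf_filename, pages). The sketch below groups pages by filename using plain dictionaries, so it runs without ChromaDB. The metadata keys filename and page are assumptions about how the collection was populated, and build_highlight_data is a hypothetical helper.

```python
from collections import defaultdict


def build_highlight_data(user_input, metadatas, filename_key="filename", page_key="page"):
    """Group page numbers per PDF and return entries in the data format."""
    pages_by_file = defaultdict(set)
    for metadata in metadatas:
        pages_by_file[metadata[filename_key]].add(metadata[page_key])
    return [
        {"user_input": user_input, "pdf_filename": filename, "pages": sorted(pages)}
        for filename, pages in pages_by_file.items()
    ]


# A flat list of metadata dicts, for example one inner list of the
# "metadatas" field of a ChromaDB query result (key names are assumed)
metadatas = [
    {"filename": "report.pdf", "page": 2},
    {"filename": "report.pdf", "page": 1},
    {"filename": "notes.pdf", "page": 5},
    {"filename": "report.pdf", "page": 2},
]

data = build_highlight_data("What is said about climate?", metadatas)
print(data)
```

Duplicate pages are collapsed and each page list is sorted, so every PDF appears once with only the pages that the query matched.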
A Streamlit example is provided in example_streamlit_app.py to demonstrate how to use the PDF highlighter tool in a web application.
- Ensure you have installed the required packages and set up the environment variables as described in the Installation section.
- Install Streamlit:
  pip install streamlit
- Run the Streamlit app:
  streamlit run example_streamlit_app.py
- Enter your question or input text.
- Upload a PDF file.
- Optionally, choose to add comments to the highlighted text.
- Click the "Highlight PDF" button to process the PDF.
- Preview the highlighted PDF in the sidebar.
- Download the highlighted PDF.
- __init__(self, silent=False, openai_key=None, comment=False, llm_model=None, llm_temperature=0, llm_system_prompt=None, llm_num_ctx=None, llm_memory=True, llm_keep_alive=3600): Initializes the Highlighter class with the given parameters.
- async highlight(self, user_input, docs=None, data=None, pdf_filename=None): Highlights sentences in the provided PDF documents based on the user input.
- async get_sentences_with_llm(self, text, user_input): Uses the LLM to generate sentences from the text that should be highlighted based on the user input.
- async annotate_pdf(self, user_input: str, filename: str, pages: list = None, extend_pages: bool = False): Annotates the PDF with highlighted sentences and optional comments.
- __init__(self, openai_key=False, model=None, temperature=0, system_prompt=None, num_ctx=None, memory=True, keep_alive=3600): Initializes the LLM class with the provided parameters.
- use_openai(self, key, model): Configures the class to use OpenAI for generating responses.
- use_ollama(self, model): Configures the class to use Ollama for generating responses.
- async generate(self, prompt): Asynchronously generates a response based on the provided prompt.
Note: The num_ctx parameter is set to 20000 by default, which may not be sufficient for all use cases. Adjust this value based on your specific requirements.
The default LLM prompts are stored in the prompts.yaml file. You can view and edit the prompts directly in this file.
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.