
Add support for PDF file uploads as context for LLM queries #3638

Open · wants to merge 4 commits into base: main
Conversation

andrewwan0131 (Author):

Why are these changes needed?

These changes enable users to upload PDF files as context for LLM queries.

Changes made

  1. Added PDF file handling capabilities:

    • Implemented PDF file upload support in the web interface
    • Added PDF text extraction functionality
    • Integrated extracted PDF content as context for LLM queries
  2. Modified relevant files:

    • Updated gradio web server components to handle PDF uploads
    • Added PDF processing utilities
    • Enhanced chat protocol to include document context
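For reference, the core of the flow described above can be sketched as below. The function names, the prompt wording, and the use of pypdf are illustrative stand-ins, not the actual fastchat helpers (the PR itself uses LlamaParse):

```python
def extract_pdf_text(pdf_path: str) -> str:
    """Extract plain text from an uploaded PDF (pypdf as a stand-in parser)."""
    from pypdf import PdfReader  # assumption: the PR uses LlamaParse instead
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def wrap_query_context(query: str, document_text: str) -> str:
    """Prepend the extracted document text to the user's query as context."""
    return (
        "Use the following document as context:\n"
        f"{document_text}\n\n"
        f"Question: {query}"
    )
```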

Checks

  • I've tested the PDF upload and context integration with various document types by running Chatbot Arena locally

@infwinston (Member) left a comment:

Thanks @andrewwan0131, I left some comments!

fastchat/serve/gradio_block_arena_vision_anony.py (outdated)

post_processed_text = wrap_query_context(text, post_processed_text)

text = text[:BLIND_MODE_INPUT_CHAR_LEN_LIMIT] # Hard cut-off
Member:

We should probably avoid cutting off inputs when dealing with pdf.

Author:

Fixed. I will only cut off input for images.
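The fix described here might look like the following sketch (the constant value and the function name are assumptions, not the actual PR code):

```python
BLIND_MODE_INPUT_CHAR_LEN_LIMIT = 12000  # illustrative value

def truncate_input(text: str, file_extension: str) -> str:
    # A hard cut-off would destroy PDF context, so only non-PDF
    # (image) inputs get truncated.
    if file_extension == ".pdf":
        return text
    return text[:BLIND_MODE_INPUT_CHAR_LEN_LIMIT]  # Hard cut-off
```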

fastchat/serve/gradio_block_arena_vision_anony.py (outdated)
@@ -483,6 +562,7 @@ def build_side_by_side_vision_ui_anony(context: Context, random_questions=None):
)

with gr.Row() as button_row:
random_btn = gr.Button(value="🔮 Random Image", interactive=True)
Member:

Duplicated with the line below:

random_btn = gr.Button(value="🔮 Random Image", interactive=True)

@@ -363,6 +421,27 @@ def add_text(
for i in range(num_sides):
if "deluxe" in states[i].model_name:
hint_msg = SLOW_MODEL_MSG

if file_extension == ".pdf":
document_text = llama_parse(files[0])
Member:

Ideally, we abstract the PDF parser here. We should call it something like pdf_parse; by default it uses LlamaParse, but we can switch to other backends when needed.

@infwinston (Member) left a comment:

Thanks, left more comments.

Comment on lines +582 to +586
# if random_questions:
# global vqa_samples
# with open(random_questions, "r") as f:
# vqa_samples = json.load(f)
# random_btn = gr.Button(value="🔮 Random Image", interactive=True)
@infwinston (Member) commented on Dec 8, 2024:

We shouldn't remove these; otherwise the random_question button will break.

Author:

I'm not sure how random_question works, but the current implementation breaks when the if condition is false: I get an error that random_btn has no value.

@@ -471,10 +566,10 @@ def build_side_by_side_vision_ui_anony(context: Context, random_questions=None):
)

multimodal_textbox = gr.MultimodalTextbox(
file_types=["image"],
file_types=["file"],
Member:

why do we need to change this?

Author:

When I use ["image", "application/pdf"], it doesn't let me load PDFs, and with ["image", "pdf"] I get an error that the file is not "application/pdf". I'm still looking for a fix, but for now I've temporarily allowed all file types for testing.

Author:

I am checking whether the file is a PDF or an image in the add_text function and raising an error if it is neither.
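That check could look roughly like this; the allowed-extension set and function name are assumptions. Note this is the weaker extension-based check, which a later comment in this thread replaces with a magic-number check:

```python
import os

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".gif", ".webp"}

def validate_upload(filename: str) -> str:
    """Reject anything that is not a PDF or an image, since the Gradio
    file_types filter was relaxed to ["file"]."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext}")
    return ext
```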

Comment on lines 325 to 331
text, files = chat_input["text"], chat_input["files"]
else:
text = chat_input
images = []
files = []

images = []

Member:

This will break image input!

Author:

Fixed!

@@ -267,7 +340,7 @@ def add_text(
if states[0] is None:
assert states[1] is None

if len(images) > 0:
if len(files) > 0 and file_extension != ".pdf":
Member:

Checking the extension is not a reliable way to determine whether a file is a PDF: a file can be named "abc.pdf" but actually be a JPEG.

Author:

Got it, I changed it to a magic-number check.
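A magic-number check reads the file's leading bytes instead of trusting the extension; every PDF begins with %PDF-. A minimal sketch (not the actual PR code):

```python
def is_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF magic bytes b"%PDF-"."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```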

os.makedirs(output_dir, exist_ok=True)

pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
markdown_file_path = os.path.join(output_dir, f"{pdf_name}.md")
Member:

why do we need this?


The markdown_file_path variable is unused; pdf_name is required for the extra_info parameter of the parser. The documentation is here: https://github.com/run-llama/llama_parse?tab=readme-ov-file#using-with-file-object

Author:

We don't. It was part of my old implementation for calling LlamaParse with a subprocess. I will remove it.

result_type="markdown", # Output in Markdown format
num_workers=4, # Number of API calls for batch processing
verbose=True, # Print detailed logs
language="en" # Set language (default is English)
Member:

What if the PDF is in a different language?


This is optional; we can remove it.

Author:

I don't know whether LlamaParse can identify the language itself. Maybe we have to add language detection or a translator?
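One option, assuming the langdetect package is acceptable as a dependency: guess the language from a sample of the extracted text and fall back to English when detection fails. This is a sketch, not a tested integration:

```python
def guess_language(sample_text: str, default: str = "en") -> str:
    """Best-effort language guess for the parser's language parameter."""
    try:
        from langdetect import detect  # assumption: langdetect is installed
        return detect(sample_text)
    except Exception:
        return default  # empty/short text or missing package: use the default
```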

import nest_asyncio
from llama_parse import LlamaParse

nest_asyncio.apply() # Ensure compatibility with async environments
Member:

why do we need this?

Author:

I'm not exactly sure where it is being used, but it was part of the script for the LlamaParse API calls.

Comment on lines 249 to 260
def extract_text_from_pdf(pdf_file_path):
"""Extract text from a PDF file."""
try:
with open(pdf_file_path, 'rb') as f:
reader = PyPDF2.PdfReader(f)
pdf_text = ""
for page in reader.pages:
pdf_text += page.extract_text()
return pdf_text
except Exception as e:
logger.error(f"Failed to extract text from PDF: {e}")
return None
Member:

why do we need this function?

@PranavB-11 commented on Dec 8, 2024:

It is for unstructured extraction. It's not being used; we just copied it over from the demo.

@@ -4,12 +4,16 @@
"""

import json
import subprocess
Member:

remove
