
Add support for PDF file uploads as context for LLM queries #3638

Open · wants to merge 4 commits into base: main
Conversation

andrewwan0131 (Author):

Why are these changes needed?

These changes enable users to upload PDF files as context for LLM queries.

Changes made

  1. Added PDF file handling capabilities:

    • Implemented PDF file upload support in the web interface
    • Added PDF text extraction functionality
    • Integrated extracted PDF content as context for LLM queries
  2. Modified relevant files:

    • Updated gradio web server components to handle PDF uploads
    • Added PDF processing utilities
    • Enhanced chat protocol to include document context
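For reference, the core of the flow described above can be sketched as below. The function names, the prompt wording, and the use of pypdf are illustrative stand-ins, not the actual fastchat helpers (the PR itself uses LlamaParse):

```python
def extract_pdf_text(pdf_path: str) -> str:
    """Extract plain text from an uploaded PDF (pypdf as a stand-in parser)."""
    from pypdf import PdfReader  # assumption: the PR uses LlamaParse instead
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def wrap_query_context(query: str, document_text: str) -> str:
    """Prepend the extracted document text to the user's query as context."""
    return (
        "Use the following document as context:\n"
        f"{document_text}\n\n"
        f"Question: {query}"
    )
```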

Checks

  • I've tested the PDF upload and context integration with various document types by running Chatbot Arena locally

@infwinston (Member) left a comment:

Thanks @andrewwan0131, I left some comments!

fastchat/serve/gradio_block_arena_vision_anony.py (outdated)

post_processed_text = wrap_query_context(text, post_processed_text)

text = text[:BLIND_MODE_INPUT_CHAR_LEN_LIMIT] # Hard cut-off
Member:

We should probably avoid cutting off inputs when dealing with pdf.

Author:

Fixed. I will only cut off input for images.
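The fix described here might look like the following sketch (the constant value and the function name are assumptions, not the actual PR code):

```python
BLIND_MODE_INPUT_CHAR_LEN_LIMIT = 12000  # illustrative value

def truncate_input(text: str, file_extension: str) -> str:
    # A hard cut-off would destroy PDF context, so only non-PDF
    # (image) inputs get truncated.
    if file_extension == ".pdf":
        return text
    return text[:BLIND_MODE_INPUT_CHAR_LEN_LIMIT]  # Hard cut-off
```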

fastchat/serve/gradio_block_arena_vision_anony.py (outdated)
@@ -483,6 +562,7 @@ def build_side_by_side_vision_ui_anony(context: Context, random_questions=None):
)

with gr.Row() as button_row:
random_btn = gr.Button(value="🔮 Random Image", interactive=True)
Member:

Duplicated with the line below:

random_btn = gr.Button(value="🔮 Random Image", interactive=True)

@@ -363,6 +421,27 @@ def add_text(
for i in range(num_sides):
if "deluxe" in states[i].model_name:
hint_msg = SLOW_MODEL_MSG

if file_extension == ".pdf":
document_text = llama_parse(files[0])
Member:

Ideally, we abstract the PDF parser here. We should call it something like pdf_parse; by default it uses LlamaParse, but we can switch to other backends when needed.

@infwinston (Member) left a comment:

Thanks, left more comments.

Comment on lines +582 to +586
# if random_questions:
# global vqa_samples
# with open(random_questions, "r") as f:
# vqa_samples = json.load(f)
# random_btn = gr.Button(value="🔮 Random Image", interactive=True)
@infwinston (Member) commented on Dec 8, 2024:

We shouldn't remove these; otherwise the random_question button will break.

Author:

I'm not sure how random_question works, but the current implementation breaks when the if condition is false: I get an error that random_btn has no value.

@@ -471,10 +566,10 @@ def build_side_by_side_vision_ui_anony(context: Context, random_questions=None):
)

multimodal_textbox = gr.MultimodalTextbox(
file_types=["image"],
file_types=["file"],
Member:

why do we need to change this?

Author:

When I use ["image", "application/pdf"], it doesn't let me load PDFs, and with ["image", "pdf"] I get an error that the file is not "application/pdf". I'm still looking for a fix, but for now I've temporarily allowed all file types for testing.

Author:

I am checking whether the file is a PDF or an image in the add_text function and raising an error if it is neither.
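That check could look roughly like this; the allowed-extension set and function name are assumptions. Note this is the weaker extension-based check, which a later comment in this thread replaces with a magic-number check:

```python
import os

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".gif", ".webp"}

def validate_upload(filename: str) -> str:
    """Reject anything that is not a PDF or an image, since the Gradio
    file_types filter was relaxed to ["file"]."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext}")
    return ext
```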

Comment on lines 325 to 331
text, files = chat_input["text"], chat_input["files"]
else:
text = chat_input
images = []
files = []

images = []

Member:

This will break image input!

Author:

Fixed!

@@ -267,7 +340,7 @@ def add_text(
if states[0] is None:
assert states[1] is None

if len(images) > 0:
if len(files) > 0 and file_extension != ".pdf":
Member:

Checking the extension is not a reliable way to determine whether a file is a PDF: a file can be named "abc.pdf" but actually be a JPEG.

Author:

Got it, I changed it to a magic-number check.
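A magic-number check reads the file's leading bytes instead of trusting the extension; every PDF begins with %PDF-. A minimal sketch (not the actual PR code):

```python
def is_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF magic bytes b"%PDF-"."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```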

os.makedirs(output_dir, exist_ok=True)

pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
markdown_file_path = os.path.join(output_dir, f"{pdf_name}.md")
Member:

why do we need this?


The markdown_file_path variable is unused; pdf_name is required for the extra_info parameter of the parser. The documentation is here: https://github.com/run-llama/llama_parse?tab=readme-ov-file#using-with-file-object

Author:

We don't. It was part of my old implementation for calling LlamaParse with a subprocess. I will remove it.

result_type="markdown", # Output in Markdown format
num_workers=4, # Number of API calls for batch processing
verbose=True, # Print detailed logs
language="en" # Set language (default is English)
Member:

What if the PDF is in a different language?


This is optional; we can remove it.

Author:

I don't know whether LlamaParse can identify the language itself. Maybe we have to add language detection or a translator?
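One option, assuming the langdetect package is acceptable as a dependency: guess the language from a sample of the extracted text and fall back to English when detection fails. This is a sketch, not a tested integration:

```python
def guess_language(sample_text: str, default: str = "en") -> str:
    """Best-effort language guess for the parser's language parameter."""
    try:
        from langdetect import detect  # assumption: langdetect is installed
        return detect(sample_text)
    except Exception:
        return default  # empty/short text or missing package: use the default
```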

import nest_asyncio
from llama_parse import LlamaParse

nest_asyncio.apply() # Ensure compatibility with async environments
Member:

why do we need this?

Author:

I'm not exactly sure where it is being used, but it was part of the script for the LlamaParse API calls.

Comment on lines 249 to 260
def extract_text_from_pdf(pdf_file_path):
"""Extract text from a PDF file."""
try:
with open(pdf_file_path, 'rb') as f:
reader = PyPDF2.PdfReader(f)
pdf_text = ""
for page in reader.pages:
pdf_text += page.extract_text()
return pdf_text
except Exception as e:
logger.error(f"Failed to extract text from PDF: {e}")
return None
Member:

why do we need this function?

@PranavB-11 commented on Dec 8, 2024:

It is for unstructured extraction. It's not being used; we just copied it over from the demo.

@@ -4,12 +4,16 @@
"""

import json
import subprocess
Member:

remove
