-
Notifications
You must be signed in to change notification settings - Fork 4.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for PDF file uploads as context for LLM queries #3638
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andrewwan0131 left some comments!
|
||
post_processed_text = wrap_query_context(text, post_processed_text) | ||
|
||
text = text[:BLIND_MODE_INPUT_CHAR_LEN_LIMIT] # Hard cut-off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably avoid cutting off inputs when dealing with pdf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. I will only cut off input for images.
@@ -483,6 +562,7 @@ def build_side_by_side_vision_ui_anony(context: Context, random_questions=None): | |||
) | |||
|
|||
with gr.Row() as button_row: | |||
random_btn = gr.Button(value="🔮 Random Image", interactive=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicated with the below
random_btn = gr.Button(value="🔮 Random Image", interactive=True)
@@ -363,6 +421,27 @@ def add_text( | |||
for i in range(num_sides): | |||
if "deluxe" in states[i].model_name: | |||
hint_msg = SLOW_MODEL_MSG | |||
|
|||
if file_extension == ".pdf": | |||
document_text = llama_parse(files[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we abstract pdf parser here. We should call it something like pdf_parse
and by default it uses llamaparse, but we can switch to others when needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks more comments
# if random_questions: | ||
# global vqa_samples | ||
# with open(random_questions, "r") as f: | ||
# vqa_samples = json.load(f) | ||
# random_btn = gr.Button(value="🔮 Random Image", interactive=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't remove these.. otherwise random_question button will break
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how random_question works but the current implementation is broken when the if condition isn't true because I get an error that random_btn has no value.
@@ -471,10 +566,10 @@ def build_side_by_side_vision_ui_anony(context: Context, random_questions=None): | |||
) | |||
|
|||
multimodal_textbox = gr.MultimodalTextbox( | |||
file_types=["image"], | |||
file_types=["file"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I add ["image", "application/pdf"], it doesn't let me load pdf and if I add ["image", "pdf"], it gives me an error that it's not "application/pdf". I am still trying to find a fix but temporarily just allowed for all file types for testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am checking if the file type is pdf or image in the add_text function and raising error if not.
text, files = chat_input["text"], chat_input["files"] | ||
else: | ||
text = chat_input | ||
images = [] | ||
files = [] | ||
|
||
images = [] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will break image input!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
@@ -267,7 +340,7 @@ def add_text( | |||
if states[0] is None: | |||
assert states[1] is None | |||
|
|||
if len(images) > 0: | |||
if len(files) > 0 and file_extension != ".pdf": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a reliable implementation to check whether a file is a pdf.
a file can be called "abc.pdf" but it's a jpg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, I changed to magic number checker.
os.makedirs(output_dir, exist_ok=True) | ||
|
||
pdf_name = os.path.splitext(os.path.basename(pdf_path))[0] | ||
markdown_file_path = os.path.join(output_dir, f"{pdf_name}.md") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The markdown_file_path variable is unused, the pdf_name is required for the extra-info parameter for the parser. The documentation is here: https://github.com/run-llama/llama_parse?tab=readme-ov-file#using-with-file-object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't. It was part my old implementation for calling LlamaParse with subprocess. I will remove.
result_type="markdown", # Output in Markdown format | ||
num_workers=4, # Number of API calls for batch processing | ||
verbose=True, # Print detailed logs | ||
language="en" # Set language (default is English) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if the pdf is in different language?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is optional, we can remove this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if LlamaParse can handle identifying language. Maybe we have to use language detection or translator?
import nest_asyncio | ||
from llama_parse import LlamaParse | ||
|
||
nest_asyncio.apply() # Ensure compatibility with async environments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not exactly sure where it is being used but it was part of the script for LlamaParse API calls.
def extract_text_from_pdf(pdf_file_path): | ||
"""Extract text from a PDF file.""" | ||
try: | ||
with open(pdf_file_path, 'rb') as f: | ||
reader = PyPDF2.PdfReader(f) | ||
pdf_text = "" | ||
for page in reader.pages: | ||
pdf_text += page.extract_text() | ||
return pdf_text | ||
except Exception as e: | ||
logger.error(f"Failed to extract text from PDF: {e}") | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is for unstructured extraction, it's not being used, we just copied it over from the demo.
@@ -4,12 +4,16 @@ | |||
""" | |||
|
|||
import json | |||
import subprocess |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
Why are these changes needed?
These changes enable users to upload PDF files as context for LLM queries.
Changes made
Added PDF file handling capabilities:
Modified relevant files:
Checks