Add pdf support #71

Merged: 1 commit, Oct 21, 2023
13 changes: 12 additions & 1 deletion memgpt/utils.py
@@ -11,6 +11,7 @@
 import tiktoken
 import glob
 import sqlite3
+import fitz
 from tqdm import tqdm
 from memgpt.openai_tools import async_get_embedding_with_backoff

@@ -98,6 +99,12 @@ def read_in_chunks(file_object, chunk_size):
             break
         yield data

+def read_pdf_in_chunks(file, chunk_size):
+    doc = fitz.open(file)
+    for page in doc:
+        text = page.get_text()
+        yield text
+
 def read_in_rows_csv(file_object, chunk_size):
     csvreader = csv.reader(file_object)
     header = next(csvreader)
@@ -123,7 +130,11 @@ def total_bytes(pattern):
 def chunk_file(file, tkns_per_chunk=300, model='gpt-4'):
     encoding = tiktoken.encoding_for_model(model)
     with open(file, 'r') as f:
-        if file.endswith('.csv'):
+        if file.endswith('.pdf'):
+            lines = [l for l in read_pdf_in_chunks(file, tkns_per_chunk*8)]
+            if len(lines) == 0:
+                print(f"Warning: {file} did not have any extractable text.")
+        elif file.endswith('.csv'):
             lines = [l for l in read_in_rows_csv(f, tkns_per_chunk*8)]
         else:
             lines = [l for l in read_in_chunks(f, tkns_per_chunk*4)]
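Review note: as merged, `read_pdf_in_chunks` accepts `chunk_size` but yields one string per page, so a dense page can be much larger than the size `chunk_file` asks for. The sketch below shows one possible way to cap piece size. It is not part of this PR: `rechunk_pages` is a hypothetical helper name, it assumes `chunk_size` counts characters, and it works on plain strings standing in for what `page.get_text()` returns, so it runs without PyMuPDF.

```python
from typing import Iterable, Iterator


def rechunk_pages(page_texts: Iterable[str], chunk_size: int) -> Iterator[str]:
    """Yield pieces of at most chunk_size characters from per-page text.

    Hypothetical helper, not part of the PR. Pages with no text yield nothing.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for text in page_texts:
        # Slice each page independently so no piece spans a page boundary.
        for start in range(0, len(text), chunk_size):
            yield text[start:start + chunk_size]


# Stand-in for the strings page.get_text() would return for three pages.
pages = ["abcdefghij", "", "xyz"]
chunks = list(rechunk_pages(pages, 4))
```

Slicing per page keeps page boundaries intact, at the cost of some short trailing pieces. Whether that trade-off suits the downstream token-based chunking is left open here.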
1 change: 1 addition & 0 deletions requirements.txt
@@ -6,6 +6,7 @@ geopy
 numpy
 openai
 pybars3
+pymupdf
 python-dotenv
 pytz
 rich