[Feature Request] Please make it possible to ingest PDFs #69

Closed
dm3h opened this issue Oct 21, 2023 · 6 comments

Comments

@dm3h

dm3h commented Oct 21, 2023

Saw this request in the Discord, but it hadn't been added here yet, so I thought I would.

kevin | weaksauce.eth — Today at 08:04
Is it possible to have it ingest PDF docs

cpacker — Today at 08:09
at the moment .pdf isn't officially supported so we'd recommend converting from pdf to txt first with some OCR software (eg https://github.com/tesseract-ocr/tesseract#installing-tesseract), then follow the README examples that use .txt input
but def open an issue about this and we'll add support for it! shouldn't be too hard to automate this for you
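
For PDFs that already contain a text layer, the pdf-to-txt step described above can be sketched without OCR. This is a minimal sketch, assuming the third-party `pypdf` package is installed; `txt_path_for` and `pdf_to_txt` are hypothetical helper names, not part of MemGPT. Note that it swaps OCR for plain text extraction, so scanned PDFs would still need an OCR tool such as tesseract.

```python
from pathlib import Path


def txt_path_for(pdf_path: str) -> Path:
    """Return the .txt path that sits next to the given PDF (hypothetical helper)."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_txt(pdf_path: str) -> Path:
    """Extract the text layer of a PDF into a .txt file and return its path.

    Assumption: pypdf is installed (pip install pypdf). This is text
    extraction, not OCR, so image-only pages come out empty.
    """
    from pypdf import PdfReader  # imported lazily; third-party dependency

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for image-only pages
    pages = [page.extract_text() or "" for page in reader.pages]

    out_path = txt_path_for(pdf_path)
    out_path.write_text("\n\n".join(pages), encoding="utf-8")
    return out_path
```

The resulting .txt file could then be used with the README examples that take .txt input.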

@MrXandbadas

MrXandbadas commented Oct 21, 2023

Don't mind me, I'll just leave this here. It was in my tabs somewhere, so I thought it would be helpful:

Edit: added another link. Both use mPLUG-Owl: https://github.com/X-PLUG/mPLUG-Owl
DocOwl:
https://github.com/X-PLUG/mPLUG-DocOwl

UReader:
https://github.com/LukeForeverYoung/UReader

vivi mentioned this issue Oct 21, 2023
@cpacker
Collaborator

cpacker commented Oct 21, 2023

@dm3h @MrXandbadas just added support for PDFs in the latest commit!

#71

Let us know if you run into any issues!

cpacker closed this as completed Oct 21, 2023
@vivi
Contributor

vivi commented Oct 21, 2023

Example usage:

python3 main.py --archival_storage_files_compute_embeddings="memgpt_arxiv.pdf"  --persona=memgpt_doc

Example output:

❯ python3 main.py --archival_storage_files_compute_embeddings="memgpt_arxiv.pdf"  --persona=memgpt_doc
Running... [exit by typing '/exit']
Computing embeddings over 1 files. This will cost ~$0.02. Continue? [y/n] y
Processing file chunks: 100%|██████████████████████████████████████████████████████████| 15/15 [00:03<00:00,  3.79it/s]
Saving embeddings to archival_index_from_files_2023-10-21_01_32_13_AM_PDT-0700/embeddings.json
Saving archival storage with preloaded files to archival_index_from_files_2023-10-21_01_32_13_AM_PDT-0700/all_docs.jsonl
Saving faiss index archival_index_from_files_2023-10-21_01_32_13_AM_PDT-0700/all_docs.index
To avoid computing embeddings next time, replace --archival_storage_files_compute_embeddings=memgpt_arxiv.pdf with
         --archival_storage_faiss_path=archival_index_from_files_2023-10-21_01_32_13_AM_PDT-0700 (if your files haven't changed).
Initializing InMemoryStateManager with agent object
InMemoryStateManager.all_messages.len = 4
InMemoryStateManager.messages.len = 4
💭 Bootup sequence complete. Persona activated. Testing messaging functionality.


💭 First time visitor detected, this could be the Chad my core memory speaks of, best to give a warm reception! Let's
also learn more about him. Advance with caution, show respect, and establish rapport.
🤖 Hello there! It's a pleasure to meet you. I'm MemGPT, your digital companion. I've heard you're a Ph.D. student at
UC Berkeley, involved in computer science, that's impressive! How can I assist you today?
Enter your message: What is a virtual context?
💭 Hmm, a complex question. The term 'virtual context' can refer to several things depending on the realm of
discussion. Is it referring to virtual reality, programming, or something else? Best approach would be to clarify.
🤖 The term 'virtual context' can be used in various fields with different meanings. It's used in virtual reality,
computer programming, etc. Could you please clarify in what context or field you are inquiring about?
Enter your message: Search your archival memory
💭 Alright, let's dive into the archival memory with this query, "virtual context". There's no guarantee what we might
find, but no harm in trying.
⚡🧠 [function] updating memory with archival_memory_search:
        query: virtual context, page: 0
💭 This is interesting! The retrieved information from the archival memory refers to a paper about MemGPT and uses the
term "virtual context." It talks about managing a kind of memory hierarchy almost like an OS. Maybe I can provide Chad
with a consolidated response after searching a few more times. Let's proceed.
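
The hint printed in the log above (swap the compute-embeddings flag for the saved index path on later runs) can be sketched as a small argument rewrite. This is only an illustration: `reuse_saved_index` is a hypothetical helper, not part of MemGPT, and it assumes the flags are passed in the `--flag=value` form shown in the example.

```python
# Flag names are taken from the example command and log output above.
COMPUTE_FLAG = "--archival_storage_files_compute_embeddings"
FAISS_FLAG = "--archival_storage_faiss_path"


def reuse_saved_index(argv: list[str], index_dir: str) -> list[str]:
    """Return a copy of argv that loads a saved index instead of re-embedding.

    Hypothetical helper: only the --flag=value form is handled.
    """
    rewritten = []
    for arg in argv:
        if arg.startswith(COMPUTE_FLAG + "="):
            # Replace the whole flag, dropping the original file list
            rewritten.append(f"{FAISS_FLAG}={index_dir}")
        else:
            rewritten.append(arg)
    return rewritten


if __name__ == "__main__":
    first_run = [
        "python3",
        "main.py",
        f"{COMPUTE_FLAG}=memgpt_arxiv.pdf",
        "--persona=memgpt_doc",
    ]
    saved_dir = "archival_index_from_files_2023-10-21_01_32_13_AM_PDT-0700"
    print(" ".join(reuse_saved_index(first_run, saved_dir)))
```

As the log notes, reusing the index only makes sense if the input files haven't changed since the embeddings were computed.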

@MrXandbadas

the speed of these things is TERRIFYING!

@dm3h
Author

dm3h commented Oct 21, 2023

Wow... I cannot believe this was added so fast. Thank you so much! I will try this out tomorrow and let you know if there's anything valuable to share from the results.

@MrXandbadas

Can we ingest PDFs in real-time?
