index store size #1721
Hi Paul, there is a pull request #1706 that solves it. Had exactly the same issue. /Henrik
Closing in favor of that PR.
No problem. I am testing the postgres nodestore atm.
@HenrikPedDK how many documents did you use for testing?
I am ingesting right now. About 10k documents are going in. I am at about 1,000 now, with 11 million rows in the vectordb table :D It's quite big PDFs, from 5MB up to 400MB in size. /Henrik
Ahh ok, maybe it's the shape of the data.
But I have never gotten that far before; the index always used to crash (with the index as a file or in mongo). @HenrikPedDK keep me posted.
Just pull the PR, set it up (new docs are in the PR) and run it. Remember to delete the tables in PG before embedding again. Also, I'm using the huggingface embedding; the Ollama embedding didn't work well for me.
The single row that llama-index creates for the docstore index is an issue (agreed), after a bit of digging and some prototyping. For a database (mongo, redis, postgres, etc.) this matters because you get a performance benefit from having multiple rows in the table/store; with a single row, performance degrades once the value of that single key gets too large. The solution is to create a vector store index per ingested file, instead of using the same one for everything.
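A minimal sketch of what "one index per ingested file" could look like with llama-index (recent import paths assumed; the helper below is illustrative, not private_gpt's actual ingest code):

```python
# Illustrative only: build a separate VectorStoreIndex per file so no single
# index-store entry grows without bound. Assumes a StorageContext already wired
# to the Postgres-backed docstore/index store/vector store.
from pathlib import Path

from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex


def ingest_one_index_per_file(data_dir: str, storage_context: StorageContext) -> list[VectorStoreIndex]:
    indexes = []
    for path in sorted(Path(data_dir).glob("**/*.pdf")):
        docs = SimpleDirectoryReader(input_files=[str(path)]).load_data()
        # Each file gets its own index, so its index-store record stays small.
        index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
        indexes.append(index)
    return indexes
```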
As I am writing this I have ingested 1550 PDFs of varying size in 24 hours, and created around 1.6 million rows in the vectorDB on postgres. Ingestion is slow: I see it generating embeddings with the progress bar, it goes to 100%, and then, depending on the size of the document, it waits for up to several minutes before it continues. In the terminal I see this while it waits: Generating embeddings: 0it [00:00, ?it/s]. I am monitoring both my Postgres server and the PGPT server and there is 0 (zero) CPU/GPU activity on the PGPT server and almost nothing on postgres, so no write is happening either. So apparently it is waiting for something, depending on the size of the document/embeddings, until it suddenly starts again. Also, I noticed there is no check for already ingested files. It would be nice if ingest checked for that, or at least removed (or marked) the ingested files.
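A hedged sketch of what such a check for already-ingested files could look like, kept outside pgpt itself (the ledger file name and helper function are hypothetical, not an existing pgpt feature):

```python
# Hypothetical pre-ingestion filter: keep a small on-disk ledger of content hashes
# and only hand files to the ingest pipeline if their hash has not been seen before.
import hashlib
import json
from pathlib import Path

LEDGER = Path("ingested_hashes.json")  # illustrative location


def files_to_ingest(data_dir: str) -> list[Path]:
    seen = set(json.loads(LEDGER.read_text())) if LEDGER.exists() else set()
    todo = []
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            todo.append(path)
            seen.add(digest)
    LEDGER.write_text(json.dumps(sorted(seen)))
    return todo
```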
@dbzoo I'm not too familiar with llama-index and I don't know whether all the use cases that pgpt tries to cover need all ingestions in one index. To make my use case work I see two options:
@HenrikPedDK the dataset I have is 45,000 markdown files. I can actually ingest around 8,000 in 1h, but eventually the writing to the index becomes so slow I have to stop it.
@imartinez I think this is still an open issue.
A minor modification to private_gpt to use multiple indexes would do the trick. llama-index has support for loading multiple indexes; in fact, loading one loads them all but only returns the first. I don't think we need a new KV store to solve this. @imartinez I have some prototype code ("very small") that demonstrates the single index issue.
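For reference, a minimal sketch of that multi-index loading in llama-index (recent import paths and an existing persist location assumed; this is not the prototype code mentioned above):

```python
# Illustrative only: llama-index can load every index recorded in a storage context,
# not just the first one, which is what a multi-index private_gpt would rely on.
from llama_index.core import StorageContext, load_indices_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")  # path is illustrative

# Returns one entry per index registered in the index store.
indexes = load_indices_from_storage(storage_context)
print(f"Loaded {len(indexes)} indexes")
```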
@dbzoo would those multiple indices prevent users from querying cross-index documents?
@paul-asvb, I need clarification on whether having multiple indices would prevent users from querying cross-index documents. I'll need to conduct some investigation to confirm. My initial assumption is no, as I expect the system to aggregate and utilize all available indexes. After all, the purpose of the index store is to track all the vector indexes that can be utilized.
Best guess: The creation of the embeddings and insertion into the vector table is quick (many rows). The parallel ingest threads are blocking each other trying to update the single vector index KV in the index store. This must be an atomic operation, so thread serialization occurs on the update. With the value of the key being so large, this creates a noticeable delay. The output of the empty embedding log line feels like the completion of the index update operation of each thread as it unblocks the next in line.
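A toy illustration of that hypothesis (not private_gpt or llama-index code): when every worker rewrites one shared index-store key under a lock, the updates serialize and each rewrite copies an ever larger value, so the stall grows with index size.

```python
# Toy model of the hypothesis: one shared key, updated atomically under a lock.
# Each simulated "file" appends ~50 MB of index data; later rewrites copy an ever
# larger value, so each worker's turn takes longer than the previous one.
import threading
import time

store: dict[str, str] = {}
lock = threading.Lock()


def append_to_shared_key(chunk: str) -> None:
    with lock:  # the single index-store update: one writer at a time
        t0 = time.perf_counter()
        store["index_store/data"] = store.get("index_store/data", "") + chunk
        size_mb = len(store["index_store/data"]) / 1e6
        print(f"rewrote {size_mb:.0f} MB key in {time.perf_counter() - t0:.2f}s")


chunk = "x" * 50_000_000  # ~50 MB of new index data per simulated file
threads = [threading.Thread(target=append_to_shared_key, args=(chunk,)) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```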
I am using 2 workers and doing parallel ingestion, but they are never writing to the same row because they are each working on their own index. I don't know if one worker maybe locks the whole table while committing? Everything is going to PG.
Actually, I think it's because there's only one pipe for writing and they wait for each other (maybe what you wrote), but since it's a database that shouldn't be an issue. Maybe some change in the ingestion is needed?
@HenrikPedDK how are you creating 2 indexes? 2 pgpt instances?
@HenrikPedDK @paul-asvb I've added PR #1750 that should help you out with the large index issue by chunking updates. As a bonus, ingestion is now faster too. I also investigated a multi-index solution and have something working: ingestion time now stays linear regardless of how many documents you have ingested. The only downside is that I think query times increase, as it has to perform query fusion across indexes - something I need to look at. What is the optimum size of each index to avoid query slowness during index concat?
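A minimal sketch of what cross-index querying via fusion could look like in llama-index (recent import paths assumed; the retriever settings are illustrative and this is not the PR's implementation):

```python
# Illustrative only: load all per-file indexes and fuse their retrievers so a single
# query is answered across every index.
from llama_index.core import StorageContext, load_indices_from_storage
from llama_index.core.retrievers import QueryFusionRetriever

storage_context = StorageContext.from_defaults(persist_dir="./storage")  # illustrative path
indexes = load_indices_from_storage(storage_context)

retriever = QueryFusionRetriever(
    [index.as_retriever(similarity_top_k=4) for index in indexes],
    similarity_top_k=4,
    num_queries=1,  # no LLM query rewriting, just merge results across indexes
)
nodes = retriever.retrieve("How large are the ingested reports?")
```

The more indexes there are, the more retrievals have to be fused per query, which is where the extra query latency mentioned above would come from.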
I'll give it a try now 👍
We have been testing privateGPT with ~50,000 files with sizes from 10KB to 5MB.
WE LOVE IT.
The only problem we had is the size of the index store.
No matter which nodestore we use, the size of that file keeps on breaking privateGPT.
I'm testing the postgres nodestore atm, but it also stores everything in one big json blob.
Does anyone know why? Is there a way to prevent this?