
index store size #1721

Closed
paul-asvb opened this issue Mar 13, 2024 · 22 comments

@paul-asvb

We have been testing privateGPT with ~50,000 files ranging from 10 KB to 5 MB.

WE LOVE IT.

The only problem we have had is the size of the index store.

No matter which nodestore we use, the size of that file keeps breaking privateGPT.

I'm testing the Postgres nodestore at the moment, but it also stores the index in one big JSON blob.

Does anyone know why? Is there a way to prevent this?

@HenrikPedDK

Hi Paul

There is a pull request #1706 that solves it. Had exactly the same issue.

/Henrik

@imartinez
Collaborator

Closing in favor of that PR

@paul-asvb
Author

No problem. I am testing the Postgres nodestore at the moment.
It also writes everything into one huge JSON blob.
Over time that slows down the ingestion immensely.
Is there a way to force llama-index to store the index in a different way?

@paul-asvb
Author

@HenrikPedDK how many documents did you use for testing?

@HenrikPedDK

HenrikPedDK commented Mar 13, 2024 via email

@paul-asvb
Author

paul-asvb commented Mar 13, 2024

Ahhh, OK, maybe it's the shape of the data.
I have many small documents (Markdown text files).
I've ingested around 6,000 so far (400k rows in the docstore), and the part where the index is saved is so slow I had to stop.

@paul-asvb
Author

But I have never gotten this far before; the index always used to crash (whether stored as a file or in Mongo). @HenrikPedDK keep me posted.

@HenrikPedDK

HenrikPedDK commented Mar 13, 2024 via email

@dbzoo
Contributor

dbzoo commented Mar 13, 2024

No problem. I am testing the postgres nodestore atm. It also writes everything in one huge json blob. Over time that slows down the ingestion immensely. Is there a way to force llamaindex to store the index in a different way?

The single row that llama-index creates for the docstore index is an issue (agreed).
It slows down ingestion as the index grows, because every update has to read and rewrite an ever-larger value.
I don't know why llama-index does it this way. Seems foolish. Very foolish.
Why don't they split each JSON object into a separate row?
This will be something to investigate and raise a PR against llama-index.

After a bit of digging and some prototyping:
this is not a llama-index problem but a private_gpt one, specifically how it uses vector indices.
The crux of the problem is that only a single VectorStoreIndex is created for everything ingested.
This means a single row is created in Postgres. One row = one vector index.
(Clarification: the data_indexstore PG table tracks vector indexes and the nodes stored in them.)
On the filesystem (chroma) it does not matter whether you use one index or many; it's still pounding the same file, so you wouldn't get any benefit from moving to multiple indices.

However, for a database (Mongo, Redis, Postgres, etc.) this matters, because you get a performance benefit from having multiple rows in the table/store. With a single row, performance degrades once the value of that single key gets too large.

The solution is to create a vector store index PER ingested file, instead of reusing the same one for everything (rough sketch below).
This requires an overhaul of ingest_component.py so that it does not initialize a single VectorStoreIndex and/or reuse one already in the system when ingesting more files.
It's not quite that simple, though, because once you have many vector indices it will impact a whole bunch of other code.
This change may not play well with the rest of the architecture. The author will know better.
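
For illustration only, a minimal sketch of the per-file idea (hypothetical, not the actual ingest_component.py change), assuming a StorageContext already wired to a database-backed docstore/index store:

from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import VectorStoreIndex

def ingest_per_file(file_paths, storage_context: StorageContext):
    # One fresh VectorStoreIndex per file -> one row per file in the index store,
    # instead of a single ever-growing row shared by every ingested document.
    for path in file_paths:
        documents = SimpleDirectoryReader(input_files=[path]).load_data()
        index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
        index.storage_context.persist()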

@HenrikPedDK

HenrikPedDK commented Mar 14, 2024

As I am writing this, I have ingested 1,550 PDFs of varying sizes in 24 hours and created around 1.6 million rows in the vector DB on Postgres. Ingestion is slow: I see it generating embeddings with the progress bar, it goes to 100%, and then, for some reason depending on the size of the document, it waits for up to several minutes before it continues. In the terminal I see this while it waits:

Generating embeddings: 0it [00:00, ?it/s]
Generating embeddings: 0it [00:00, ?it/s]
Generating embeddings: 0it [00:00, ?it/s]
Generating embeddings: 0it [00:00, ?it/s]

I am monitoring both my Postgres server and the PGPT server; there is 0 (zero) CPU/GPU activity on the PGPT server and almost nothing on Postgres, so no write is happening either. So apparently it is waiting for something, depending on the size of the document/embeddings, until it suddenly starts again.

Also, I noticed there is no check for already-ingested files. It would be nice if ingest checked for them, or at least removed (or marked) the files it has ingested.

@paul-asvb
Author

@dbzoo I'm not too familiar with llama-index, and I don't know whether all the use cases pgpt tries to cover require all ingestions to live in one index. To make my use case work I see two options:

  1. Don't use the separate index/doc store. That is possible, but I then need to extract the docs directly from the vector store (I only tried it with Qdrant; it worked).
  2. Implement another index store in pgpt. That is actually the better long-term solution. I will try a prototype with a SimpleKVStore for the index first, then maybe a Postgres implementation as seen here: llama source (see the sketch below).
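
A minimal sketch of option 2, assuming llama-index's SimpleKVStore and SimpleIndexStore can back the index store (this is not privateGPT code):

from llama_index.core import StorageContext
from llama_index.core.storage.index_store import SimpleIndexStore
from llama_index.core.storage.kvstore import SimpleKVStore

kv_store = SimpleKVStore()                # simple in-memory key/value store
index_store = SimpleIndexStore(kv_store)  # index store layered on top of the KV store
storage_context = StorageContext.from_defaults(index_store=index_store)
# Indices built with this storage_context persist their metadata through the KV store.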

@paul-asvb
Author

@HenrikPedDK the dataset I have is 45,000 Markdown files. I can actually ingest around 8,000 in one hour, but eventually writing to the index becomes so slow that I have to stop it.
My theory is that the reading, parsing, appending, and rewriting of the index (the "Generating embeddings: 0it [00:00, ?it/s]" phase) is what causes this.

@dmtw

dmtw commented Mar 14, 2024

@imartinez I think this is still an open issue.

@dbzoo
Contributor

dbzoo commented Mar 14, 2024

A minor modification to private_gpt to use multiple indexes would do the trick. llama-index has support for loading multiple indexes; in fact, loading one loads them all but only returns the first. I don't think we need a new KV store to solve this.
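
As a quick, hedged sketch of that multi-index loading (assuming llama-index's load_indices_from_storage helper and an example ./storage persist directory):

from llama_index.core import StorageContext
from llama_index.core.indices import load_indices_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")  # example path
indices = load_indices_from_storage(storage_context)  # loads every persisted index
print(f"loaded {len(indices)} indices")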

@imartinez I have some prototype code (very small) that demonstrates the single-index issue.

from llama_index.core import Settings, SimpleDirectoryReader, StorageContext
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import Document
from llama_index.core.storage.docstore.postgres_docstore import PostgresDocumentStore
from llama_index.core.storage.index_store.postgres_index_store import PostgresIndexStore
from llama_index.core.indices import VectorStoreIndex, load_index_from_storage

Settings.embed_model = OllamaEmbedding( model_name="nomic-embed-text" )
postgres={
    "host": "192.168.1.106",
    "port": "5432",
    "database": "postgres",
    "user": "postgres",
    "password": "admin",
    "schema_name": "private_gpt"
}
index_store = PostgresIndexStore.from_params(**postgres)
doc_store = PostgresDocumentStore.from_params(**postgres)
# I've left the vector store on the filesystem as its location is not important to the index store issue.
storage_context = StorageContext.from_defaults(
    index_store=index_store,
    docstore=doc_store
)

def test1():
    # Creates a new ROW in the PG data_indexstore as the new VectorStoreIndex is persisted.
    document = Document(text="Mary had a little lamb",metadata={"category":"lamb"})
    index = VectorStoreIndex.from_documents([document], storage_context=storage_context)
    index.storage_context.persist()

    document = Document(text="Its fleece was white as snow",metadata={"category":"snow"})
    index = VectorStoreIndex.from_documents([document], storage_context=storage_context)
    index.storage_context.persist()


def test2():
    # The same VectorStoreIndex is reused for every Document.
    # Persisting the same index uses the same KEY and hence appends to the existing row.
    # In private_gpt this index is loaded if it's found to be present - ingest_component.py
    index = VectorStoreIndex( [], storage_context=storage_context )
        
    document = Document(text="Mary had a little lamb",metadata={"category":"lamb"})
    index.insert(document)
    index.storage_context.persist()

    document = Document(text="Its fleece was white as snow",metadata={"category":"snow"})
    index.insert(document)
    index.storage_context.persist()

@paul-asvb
Author

@dbzoo would those multiple indices prevent users from querying documents across indices?

@dbzoo
Contributor

dbzoo commented Mar 14, 2024

@paul-asvb, you're asking whether having multiple indices would prevent users from querying across them. I'll need to do some investigation to confirm, but my initial assumption is no, as I expect the system to aggregate and use all available indexes. After all, the purpose of the index store is to track all the vector indexes that can be used.
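
As a rough, hedged sketch of cross-index querying (not privateGPT code; index_a and index_b are placeholder indices sharing the same embed model), retrievers from several indices can simply be queried and their results merged; llama-index also ships a QueryFusionRetriever for a more principled fusion:

# index_a / index_b are hypothetical per-file (or per-batch) VectorStoreIndex objects.
retrievers = [
    index_a.as_retriever(similarity_top_k=5),
    index_b.as_retriever(similarity_top_k=5),
]
nodes = [n for r in retrievers for n in r.retrieve("what was white as snow?")]
nodes.sort(key=lambda n: n.score or 0.0, reverse=True)  # naive fusion by raw score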

@dbzoo
Contributor

dbzoo commented Mar 14, 2024

@HenrikPedDK

So apparently it is waiting for something, depending on the size of the document/embeddings, until it suddenly starts again

Best guess: the creation of the embeddings and their insertion into the vector table is quick (many rows). The parallel ingest threads then block each other trying to update the single vector-index key/value in the index store. That update must be an atomic operation, so the threads serialize on it, and because the value of the key is so large this creates a noticeable delay. The empty "Generating embeddings" log line looks like the completion of each thread's index update as it unblocks the next in line.

@HenrikPedDK

HenrikPedDK commented Mar 14, 2024 via email

@HenrikPedDK

Actually, I think it's because there's only one pipe for writing and the threads wait for each other (maybe what you wrote), but since it's a database that shouldn't be an issue. Maybe some change in the ingestion is needed?

@paul-asvb
Author

@HenrikPedDK how are you creating two indexes? Two pgpt instances?

@dbzoo
Contributor

dbzoo commented Mar 17, 2024

@HenrikPedDK @paul-asvb I've added PR #1750, which should help you out with the large-index issue by chunking the updates. As a bonus, ingestion is now faster too.
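
To illustrate the general chunking idea only (this is not the PR's code, and the helper names here are hypothetical): splitting one large JSON payload into bounded chunks keeps each key/value write small.

import json

def chunk_payload(payload: dict, chunk_size: int = 256_000) -> dict:
    # Serialize once, then slice into fixed-size string chunks keyed by position.
    blob = json.dumps(payload)
    return {f"chunk_{i // chunk_size}": blob[i:i + chunk_size]
            for i in range(0, len(blob), chunk_size)}

def reassemble(chunks: dict) -> dict:
    # Re-join the chunks in positional order and parse the JSON back out.
    ordered = sorted(chunks.items(), key=lambda kv: int(kv[0].split("_")[1]))
    return json.loads("".join(value for _, value in ordered))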

I also investigated a multi-index solution and have something working. With it, ingestion time stays linear regardless of how many documents you have consumed. The only downside is that query times seem to grow, since it has to perform query fusion across the indices - something I still need to look at: what is the optimum size of each index to avoid query slowness when combining them?
I want to get this PR in place first before merging that work in.

@paul-asvb
Author

I'll give it a try now 👍
