
index store size #1721

Closed
paul-asvb opened this issue Mar 13, 2024 · 22 comments

@paul-asvb

We have been testing privateGPT with ~50,000 files ranging from 10 KB to 5 MB.

WE LOVE IT.

The only problem we have had is the size of the index store.

No matter which nodestore we use, the size of that file keeps breaking privateGPT.

I'm testing the Postgres nodestore at the moment, but it also stores the index in one big JSON blob.

Does anyone know why? Is there a way to prevent this?

@HenrikPedDK

Hi Paul

There is a pull request #1706 that solves it. Had exactly the same issue.

/Henrik

@imartinez
Collaborator

Closing in favor of that PR

@paul-asvb
Author

No problem. I am testing the Postgres nodestore at the moment.
It also writes everything into one huge JSON blob.
Over time that slows down the ingestion immensely.
Is there a way to force llama-index to store the index in a different way?

@paul-asvb
Author

@HenrikPedDK how many documents did you use for testing?

@HenrikPedDK

HenrikPedDK commented Mar 13, 2024 via email

@paul-asvb
Author

paul-asvb commented Mar 13, 2024

Ahhh, OK, maybe it's the shape of the data.
I have many small documents (Markdown text files).
I've ingested around 6,000 so far (400k rows in the docstore), and the part where the index is saved is so slow I had to stop.

@paul-asvb
Author

But I have never gotten this far before; the index always used to crash (whether stored as a file or in Mongo). @HenrikPedDK keep me posted.

@HenrikPedDK

HenrikPedDK commented Mar 13, 2024 via email

@dbzoo
Contributor

dbzoo commented Mar 13, 2024

No problem. I am testing the postgres nodestore atm. It also writes everything in one huge json blob. Over time that slows down the ingestion immensely. Is there a way to force llamaindex to store the index in a different way?

The single row that llama-index creates for the docstore index is an issue (agreed).
It slows down ingestion as the index grows, because every update has to read and rewrite an ever-larger value.
I don't know why llama-index does it this way. Seems foolish. Very foolish.
Why don't they split each JSON object into a separate row?
This will be something to investigate and raise a PR against llama-index.

After a bit of digging and some prototyping:
this is not a llama-index problem but a private_gpt one, specifically how it uses vector indices.
The crux of the problem is that only a single VectorStoreIndex is created for everything ingested.
This means a single row is created in Postgres. One row = one vector index.
(Clarification: the data_indexstore PG table tracks vector indexes and the nodes stored in them.)
On the filesystem (chroma) it does not matter whether you use one index or many; it's still pounding the same file, so you wouldn't get any benefit from moving to multiple indices.

However, for a database (Mongo, Redis, Postgres, etc.) this matters, because you get a performance benefit from having multiple rows in the table/store. With a single row, performance degrades once the value of that single key gets too large.

The solution is to create a vector store index PER ingested file, instead of reusing the same one for everything (rough sketch below).
This requires an overhaul of ingest_component.py so that it does not initialize a single VectorStoreIndex and/or reuse one already in the system when ingesting more files.
It's not quite that simple, though, because once you have many vector indices it will impact a whole bunch of other code.
This change may not play well with the rest of the architecture. The author will know better.
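
For illustration only, a minimal sketch of the per-file idea (hypothetical, not the actual ingest_component.py change), assuming a StorageContext already wired to a database-backed docstore/index store:

from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core.indices import VectorStoreIndex

def ingest_per_file(file_paths, storage_context: StorageContext):
    # One fresh VectorStoreIndex per file -> one row per file in the index store,
    # instead of a single ever-growing row shared by every ingested document.
    for path in file_paths:
        documents = SimpleDirectoryReader(input_files=[path]).load_data()
        index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
        index.storage_context.persist()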

@HenrikPedDK

HenrikPedDK commented Mar 14, 2024

As I am writing this, I have ingested 1,550 PDFs of varying sizes in 24 hours and created around 1.6 million rows in the vector DB on Postgres. Ingestion is slow: I see it generating embeddings with the progress bar, it goes to 100%, and then, for some reason depending on the size of the document, it waits for up to several minutes before it continues. In the terminal I see this while it waits:

Generating embeddings: 0it [00:00, ?it/s]
Generating embeddings: 0it [00:00, ?it/s]
Generating embeddings: 0it [00:00, ?it/s]
Generating embeddings: 0it [00:00, ?it/s]

I am monitoring both my Postgres server and the PGPT server; there is 0 (zero) CPU/GPU activity on the PGPT server and almost nothing on Postgres, so no write is happening either. So apparently it is waiting for something, depending on the size of the document/embeddings, until it suddenly starts again.

Also, I noticed there is no check for already-ingested files. It would be nice if ingest checked for them, or at least removed (or marked) the files it has ingested.

@paul-asvb
Author

@dbzoo I'm not too familiar with llama-index, and I don't know whether all the use cases pgpt tries to cover require all ingestions to live in one index. To make my use case work I see two options:

  1. Don't use the separate index/doc store. That is possible, but I then need to extract the docs directly from the vector store (I only tried it with Qdrant; it worked).
  2. Implement another index store in pgpt. That is actually the better long-term solution. I will try a prototype with a SimpleKVStore for the index first, then maybe a Postgres implementation as seen here: llama source (see the sketch below).
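
A minimal sketch of option 2, assuming llama-index's SimpleKVStore and SimpleIndexStore can back the index store (this is not privateGPT code):

from llama_index.core import StorageContext
from llama_index.core.storage.index_store import SimpleIndexStore
from llama_index.core.storage.kvstore import SimpleKVStore

kv_store = SimpleKVStore()                # simple in-memory key/value store
index_store = SimpleIndexStore(kv_store)  # index store layered on top of the KV store
storage_context = StorageContext.from_defaults(index_store=index_store)
# Indices built with this storage_context persist their metadata through the KV store.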

@paul-asvb
Author

@HenrikPedDK the dataset I have is 45,000 Markdown files. I can actually ingest around 8,000 in one hour, but eventually writing to the index becomes so slow that I have to stop it.
My theory is that the reading, parsing, appending, and rewriting of the index (the "Generating embeddings: 0it [00:00, ?it/s]" phase) is what causes this.

@dmtw

dmtw commented Mar 14, 2024

@imartinez I think this is still an open issue.

@dbzoo
Contributor

dbzoo commented Mar 14, 2024

A minor modification to private_gpt to use multiple indexes would do the trick. llama-index has support for loading multiple indexes; in fact, loading one loads them all but only returns the first. I don't think we need a new KV store to solve this.
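
As a quick, hedged sketch of that multi-index loading (assuming llama-index's load_indices_from_storage helper and an example ./storage persist directory):

from llama_index.core import StorageContext
from llama_index.core.indices import load_indices_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")  # example path
indices = load_indices_from_storage(storage_context)  # loads every persisted index
print(f"loaded {len(indices)} indices")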

@imartinez I have some prototype code (very small) that demonstrates the single-index issue.

from llama_index.core import Settings, SimpleDirectoryReader, StorageContext
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import Document
from llama_index.core.storage.docstore.postgres_docstore import PostgresDocumentStore
from llama_index.core.storage.index_store.postgres_index_store import PostgresIndexStore
from llama_index.core.indices import VectorStoreIndex, load_index_from_storage

Settings.embed_model = OllamaEmbedding( model_name="nomic-embed-text" )
postgres={
    "host": "192.168.1.106",
    "port": "5432",
    "database": "postgres",
    "user": "postgres",
    "password": "admin",
    "schema_name": "private_gpt"
}
index_store = PostgresIndexStore.from_params(**postgres)
doc_store = PostgresDocumentStore.from_params(**postgres)
# I've left the vector store on the filesystem as its location is not important to the index store issue.
storage_context = StorageContext.from_defaults(
    index_store=index_store,
    docstore=doc_store
)

def test1():
    # Creates a new ROW in the PG data_indexstore as the new VectorStoreIndex is persisted.
    document = Document(text="Mary had a little lamb",metadata={"category":"lamb"})
    index = VectorStoreIndex.from_documents([document], storage_context=storage_context)
    index.storage_context.persist()

    document = Document(text="Its fleece was white as snow",metadata={"category":"snow"})
    index = VectorStoreIndex.from_documents([document], storage_context=storage_context)
    index.storage_context.persist()


def test2():
    # The same VectorStoreIndex is reused for every Document.
    # Persisting the same index uses the same KEY and hence appends to the existing row.
    # In private_gpt this index is loaded if it's found to be present - ingest_component.py
    index = VectorStoreIndex( [], storage_context=storage_context )
        
    document = Document(text="Mary had a little lamb",metadata={"category":"lamb"})
    index.insert(document)
    index.storage_context.persist()

    document = Document(text="Its fleece was white as snow",metadata={"category":"snow"})
    index.insert(document)
    index.storage_context.persist()

@paul-asvb
Author

@dbzoo would those multiple indices prevent users from querying documents across indices?

@dbzoo
Contributor

dbzoo commented Mar 14, 2024

@paul-asvb, you're asking whether having multiple indices would prevent users from querying across them. I'll need to do some investigation to confirm, but my initial assumption is no, as I expect the system to aggregate and use all available indexes. After all, the purpose of the index store is to track all the vector indexes that can be used.
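
As a rough, hedged sketch of cross-index querying (not privateGPT code; index_a and index_b are placeholder indices sharing the same embed model), retrievers from several indices can simply be queried and their results merged; llama-index also ships a QueryFusionRetriever for a more principled fusion:

# index_a / index_b are hypothetical per-file (or per-batch) VectorStoreIndex objects.
retrievers = [
    index_a.as_retriever(similarity_top_k=5),
    index_b.as_retriever(similarity_top_k=5),
]
nodes = [n for r in retrievers for n in r.retrieve("what was white as snow?")]
nodes.sort(key=lambda n: n.score or 0.0, reverse=True)  # naive fusion by raw score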

@dbzoo
Contributor

dbzoo commented Mar 14, 2024

@HenrikPedDK

So apparently it is waiting for something, depending on the size of the document/embeddings, until it suddenly starts again

Best guess: the creation of the embeddings and their insertion into the vector table is quick (many rows). The parallel ingest threads then block each other trying to update the single vector-index key/value in the index store. That update must be an atomic operation, so the threads serialize on it, and because the value of the key is so large this creates a noticeable delay. The empty "Generating embeddings" log line looks like the completion of each thread's index update as it unblocks the next in line.

@HenrikPedDK

HenrikPedDK commented Mar 14, 2024 via email

@HenrikPedDK

Actually, I think it's because there's only one pipe for writing and the threads wait for each other (maybe what you wrote), but since it's a database that shouldn't be an issue. Maybe some change in the ingestion is needed?

@paul-asvb
Author

@HenrikPedDK how are you creating two indexes? Two pgpt instances?

@dbzoo
Contributor

dbzoo commented Mar 17, 2024

@HenrikPedDK @paul-asvb I've added PR #1750, which should help you out with the large-index issue by chunking the updates. As a bonus, ingestion is now faster too.
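
To illustrate the general chunking idea only (this is not the PR's code, and the helper names here are hypothetical): splitting one large JSON payload into bounded chunks keeps each key/value write small.

import json

def chunk_payload(payload: dict, chunk_size: int = 256_000) -> dict:
    # Serialize once, then slice into fixed-size string chunks keyed by position.
    blob = json.dumps(payload)
    return {f"chunk_{i // chunk_size}": blob[i:i + chunk_size]
            for i in range(0, len(blob), chunk_size)}

def reassemble(chunks: dict) -> dict:
    # Re-join the chunks in positional order and parse the JSON back out.
    ordered = sorted(chunks.items(), key=lambda kv: int(kv[0].split("_")[1]))
    return json.loads("".join(value for _, value in ordered))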

I also investigated a multi-index solution and have something working. With it, ingestion time stays linear regardless of how many documents you have consumed. The only downside is that query times seem to grow, since it has to perform query fusion across the indices - something I still need to look at: what is the optimum size of each index to avoid query slowness when combining them?
I want to get this PR in place first before merging that work in.

@paul-asvb
Author

I'll give it a try now 👍
