Calculate pid_docid_map.values() only once in add_to_index #267

Merged: 3 commits into AnswerDotAI:main from vishalbakshi:bugfix/add_to_index, Feb 11, 2025

Conversation

@vishalbakshi (Contributor) commented Feb 11, 2025

I was building an index of 200k documents (from the UKPLab/DAPR Genomics dataset) and noticed that when using add_to_index (to build the index in batches), the process was hanging on a list comprehension at line 162 of colbert.py:

new_documents_with_ids = [
    {"content": doc, "document_id": new_pid_docid_map[pid]} 
    for pid, doc in enumerate(new_documents)
    if new_pid_docid_map[pid] not in self.pid_docid_map.values()
]

Profiling showed that the entire index-building process took 1240 seconds, and this list comprehension alone took 352 seconds.

[Screenshot: %prun profiler output for the original code]

The issue was that self.pid_docid_map.values() was being rebuilt on every iteration of the list comprehension, and each membership test was a linear scan of that view. Hoisting it out of the comprehension and computing it once as a set (so each membership check is an average O(1) hash lookup instead of an O(n) scan) brought its execution time from 352 seconds down to under 0.6 seconds.

pid_values = set(self.pid_docid_map.values())
new_documents_with_ids = [
    {"content": doc, "document_id": new_pid_docid_map[pid]} 
    for pid, doc in enumerate(new_documents)
    if new_pid_docid_map[pid] not in pid_values
]

[Screenshot: %prun profiler output after the fix]

This PR implements that change.
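
To make the asymptotics concrete: membership testing against a dict_values view is a linear scan, so the original comprehension is O(len(new_documents) × len(pid_docid_map)) overall, while set lookups are O(1) on average. Below is a minimal, self-contained timing sketch of the difference; the names and sizes are illustrative, not taken from the PR or from RAGatouille itself.

import timeit

# Hypothetical stand-in for self.pid_docid_map: 100k pids -> document ids
pid_docid_map = {pid: f"doc-{pid}" for pid in range(100_000)}

# New document ids, none of which exist in the map, forcing full scans
new_ids = [f"new-doc-{i}" for i in range(1_000)]

# Original pattern: the `if` clause re-evaluates .values() on every
# iteration, and each membership test scans the whole view (may take
# a few seconds at these sizes)
slow = timeit.timeit(
    lambda: [i for i in new_ids if i not in pid_docid_map.values()],
    number=1,
)

# Fixed pattern: build the set once; each check is then an average
# O(1) hash lookup
existing = set(pid_docid_map.values())
fast = timeit.timeit(
    lambda: [i for i in new_ids if i not in existing],
    number=1,
)

print(f"dict.values() scan: {slow:.2f}s, set lookup: {fast:.5f}s")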

This is my first PR to RAGatouille, so please let me know if additional information is needed. Thank you for maintaining such an awesome library!

For reference, here's the code I used to profile this issue in a Google Colab Pro notebook (T4 w/High-RAM instance):

!pip install datasets ragatouille
!pip uninstall -y faiss-cpu && pip install faiss-gpu-cu12

from datasets import load_dataset
from ragatouille import RAGPretrainedModel

dataset_name = "Genomics"
passages = load_dataset("UKPLab/dapr", f"{dataset_name}-corpus", split="test")
queries = load_dataset("UKPLab/dapr", f"{dataset_name}-queries", split="test")
qrels_rows = load_dataset("UKPLab/dapr", f"{dataset_name}-qrels", split="test")

RAG = RAGPretrainedModel.from_pretrained("answerdotai/answerai-colbert-small-v1")

batch_size = 100_000
batch = passages[:batch_size]

# Process the first batch separately to create the index
index_path = RAG.index(
    index_name=f"{dataset_name}_index",
    collection=batch["text"],
    document_ids=batch["_id"],
    use_faiss=True
)

# add second batch with `add_to_index`
batch = passages[batch_size : batch_size*2]

%prun -D /content/bs_100000.prof RAG.add_to_index(new_collection=batch["text"], new_document_ids=batch["_id"], use_faiss=True)
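
As an aside, the profile that %prun -D dumps to disk can be re-inspected later with the standard-library pstats module. A minimal sketch, assuming the dump path used above:

import pstats

# Load the profile dump written by %prun -D
stats = pstats.Stats("/content/bs_100000.prof")

# Print the 10 entries with the highest cumulative time
stats.sort_stats("cumulative").print_stats(10)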

Currently `if new_pid_docid_map[pid] not in self.pid_docid_map.values()` recalculates `.values()` on each iteration of the list comprehension, so I'm moving that piece out of the list comprehension and putting it in a `set`. Hoping this speeds up the `add_to_index` operation.
Removing the previous list comprehension that I commented out while testing the updated code.
@bclavie (Collaborator) commented Feb 11, 2025

Very helpful, thank you! I have no words for the shame I feel reading the values() call in a loop 😆

@bclavie bclavie enabled auto-merge (squash) February 11, 2025 01:41
@bclavie (Collaborator) commented Feb 11, 2025

(ignore the CI btw -- it's going away in a hot minute. It's not related to your PR but to getting rid of one of the worst ideas I've ever had, using poetry. We're moving very temporarily to a homebrew pyproject.toml, then uv)

@bclavie bclavie merged commit 0df18e7 into AnswerDotAI:main Feb 11, 2025
@vishalbakshi vishalbakshi deleted the bugfix/add_to_index branch February 11, 2025 16:51