Calculate pid_docid_map.values() only once in add_to_index #267

Merged: 3 commits into AnswerDotAI:main from vishalbakshi:bugfix/add_to_index, Feb 11, 2025

Conversation

@vishalbakshi (Contributor) commented Feb 11, 2025

I was building an index of 200k documents (from the UKPLab/DAPR Genomics dataset) and noticed that when using add_to_index (to build the index in batches), the process was hanging on a list comprehension at line 162 of colbert.py:

new_documents_with_ids = [
    {"content": doc, "document_id": new_pid_docid_map[pid]} 
    for pid, doc in enumerate(new_documents)
    if new_pid_docid_map[pid] not in self.pid_docid_map.values()
]

Profiling showed that the entire index-building process took 1240 seconds, and this list comprehension alone took 352 seconds.

[Screenshot: %prun profiler output for the original code]

The issue was that self.pid_docid_map.values() was being rebuilt on every iteration of the list comprehension, and each membership test was a linear scan of that view. Hoisting it out of the comprehension and computing it once as a set (so each membership check is an average O(1) hash lookup instead of an O(n) scan) brought its execution time from 352 seconds down to under 0.6 seconds.

pid_values = set(self.pid_docid_map.values())
new_documents_with_ids = [
    {"content": doc, "document_id": new_pid_docid_map[pid]} 
    for pid, doc in enumerate(new_documents)
    if new_pid_docid_map[pid] not in pid_values
]

[Screenshot: %prun profiler output after the fix]

This PR implements that change.
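
To make the asymptotics concrete: membership testing against a dict_values view is a linear scan, so the original comprehension is O(len(new_documents) × len(pid_docid_map)) overall, while set lookups are O(1) on average. Below is a minimal, self-contained timing sketch of the difference; the names and sizes are illustrative, not taken from the PR or from RAGatouille itself.

import timeit

# Hypothetical stand-in for self.pid_docid_map: 100k pids -> document ids
pid_docid_map = {pid: f"doc-{pid}" for pid in range(100_000)}

# New document ids, none of which exist in the map, forcing full scans
new_ids = [f"new-doc-{i}" for i in range(1_000)]

# Original pattern: the `if` clause re-evaluates .values() on every
# iteration, and each membership test scans the whole view (may take
# a few seconds at these sizes)
slow = timeit.timeit(
    lambda: [i for i in new_ids if i not in pid_docid_map.values()],
    number=1,
)

# Fixed pattern: build the set once; each check is then an average
# O(1) hash lookup
existing = set(pid_docid_map.values())
fast = timeit.timeit(
    lambda: [i for i in new_ids if i not in existing],
    number=1,
)

print(f"dict.values() scan: {slow:.2f}s, set lookup: {fast:.5f}s")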

This is my first PR to RAGatouille, so please let me know if additional information is needed. Thank you for maintaining such an awesome library!

For reference, here's the code I used to profile this issue in a Google Colab Pro notebook (T4 w/High-RAM instance):

!pip install datasets ragatouille
!pip uninstall -y faiss-cpu && pip install faiss-gpu-cu12

from datasets import load_dataset
from ragatouille import RAGPretrainedModel

dataset_name = "Genomics"
passages = load_dataset("UKPLab/dapr", f"{dataset_name}-corpus", split="test")
queries = load_dataset("UKPLab/dapr", f"{dataset_name}-queries", split="test")
qrels_rows = load_dataset("UKPLab/dapr", f"{dataset_name}-qrels", split="test")

RAG = RAGPretrainedModel.from_pretrained("answerdotai/answerai-colbert-small-v1")

batch_size = 100_000
batch = passages[:batch_size]

# Process the first batch separately to create the index
index_path = RAG.index(
    index_name=f"{dataset_name}_index",
    collection=batch["text"],
    document_ids=batch["_id"],
    use_faiss=True
)

# add second batch with `add_to_index`
batch = passages[batch_size : batch_size*2]

%prun -D /content/bs_100000.prof RAG.add_to_index(new_collection=batch["text"], new_document_ids=batch["_id"], use_faiss=True)
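
As an aside, the profile that %prun -D dumps to disk can be re-inspected later with the standard-library pstats module. A minimal sketch, assuming the dump path used above:

import pstats

# Load the profile dump written by %prun -D
stats = pstats.Stats("/content/bs_100000.prof")

# Print the 10 entries with the highest cumulative time
stats.sort_stats("cumulative").print_stats(10)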

Currently `if new_pid_docid_map[pid] not in self.pid_docid_map.values()` recalculates `.values()` on each iteration of the list comprehension, so I'm moving that piece out of the list comprehension and putting it in a `set`. Hoping this speeds up the `add_to_index` operation.
Removing the previous list comprehension that I commented out while testing the updated code.
@bclavie (Collaborator) commented Feb 11, 2025

Very helpful, thank you! I have no words for the shame I feel reading the values() call in a loop 😆

@bclavie bclavie enabled auto-merge (squash) February 11, 2025 01:41
@bclavie (Collaborator) commented Feb 11, 2025

(ignore the CI btw -- it's going away in a hot minute. It's not related to your PR but to getting rid of one of the worst ideas I've ever had, using poetry. We're moving very temporarily to a homebrew pyproject.toml, then uv)

@bclavie bclavie merged commit 0df18e7 into AnswerDotAI:main Feb 11, 2025
@vishalbakshi vishalbakshi deleted the bugfix/add_to_index branch February 11, 2025 16:51