
Update Wikipedia index to August 1st, 2024
s-jse authored Aug 24, 2024
1 parent ee25ff7 commit 4845662
Showing 13 changed files with 141 additions and 98 deletions.
67 changes: 51 additions & 16 deletions README.md
@@ -28,9 +28,34 @@
</p>



https://github.com/user-attachments/assets/982e8733-f7a7-468d-940c-5c96f411f527

# Table of Contents
- [Introduction](#introduction)
- [🚨 Announcements](#-announcements)
- [Installation](#installation)
- [System Requirements](#system-requirements)
- [Install Dependencies](#install-dependencies)
- [Configure the LLM of Your Choice](#configure-the-llm-of-your-choice)
- [Configure Information Retrieval](#configure-information-retrieval)
- [Option 1 (Default): Use our free rate-limited Wikipedia search API](#option-1-default-use-our-free-rate-limited-wikipedia-search-api)
- [Option 2: Download and host our Wikipedia index](#option-2-download-and-host-our-wikipedia-index)
- [Option 3: Build your own index](#option-3-build-your-own-index)
- [To build a Wikipedia index](#to-build-a-wikipedia-index)
- [To index custom documents](#to-index-custom-documents)
- [To upload a Qdrant index to 🤗 Hub:](#to-upload-a-qdrant-index-to--hub)
- [Run WikiChat in Terminal](#run-wikichat-in-terminal)
- [\[Optional\] Deploy WikiChat for Multi-user Access](#optional-deploy-wikichat-for-multi-user-access)
- [Set up Cosmos DB](#set-up-cosmos-db)
- [Run Chainlit](#run-chainlit)
- [The Free Rate-limited Wikipedia Search API](#the-free-rate-limited-wikipedia-search-api)
- [Wikipedia Preprocessing: Why is it Difficult?](#wikipedia-preprocessing-why-is-it-difficult)
- [Other Commands](#other-commands)
- [Run a Distilled Model for Lower Latency and Cost](#run-a-distilled-model-for-lower-latency-and-cost)
- [Simulate Conversations](#simulate-conversations)
- [License](#license)
- [Citation](#citation)



<!-- <hr /> -->
@@ -49,7 +49,7 @@ WikiChat uses Wikipedia and the following 7-stage pipeline to makes sure its res
Check out our paper for more details:
Sina J. Semnani, Violet Z. Yao*, Heidi C. Zhang*, and Monica S. Lam. 2023. [WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia](https://arxiv.org/abs/2305.14292). In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics.

## 🚨 **Announcements**
## 🚨 Announcements
- (August 22, 2024) WikiChat 2.0 is now available! Key updates include:
- **Multilingual Support**: By default, retrieves information from 10 different Wikipedias: 🇺🇸 English, 🇨🇳 Chinese, 🇪🇸 Spanish, 🇵🇹 Portuguese, 🇷🇺 Russian, 🇩🇪 German, 🇮🇷 Farsi, 🇯🇵 Japanese, 🇫🇷 French, and 🇮🇹 Italian.
- **Improved Information Retrieval**
@@ -126,7 +151,7 @@ Keep this environment activated for all subsequent commands.

Install Docker for your operating system by following the instructions at https://docs.docker.com/engine/install/. WikiChat uses Docker primarily for creating and serving vector databases for retrieval, specifically [🤗 Text Embedding Inference](https://github.com/huggingface/text-embeddings-inference) and [Qdrant](https://github.com/qdrant/qdrant). On recent Ubuntu versions, you can try running `inv install-docker`. For other operating systems, follow the instructions on the docker website.

WikiChat uses `invoke` (https://www.pyinvoke.org/) to add custom commands for various purposes. To see all available commands and their descriptions, run:
WikiChat uses [`invoke`](https://www.pyinvoke.org/) to add custom commands for various purposes. To see all available commands and their descriptions, run:
```
invoke --list
```
@@ -167,16 +192,16 @@ Note that locally hosted models do NOT need an API key, but you need to provide

## Configure Information Retrieval

### Option 1 (default): Use our free rate-limited Wikipedia search API
### Option 1 (Default): Use our free rate-limited Wikipedia search API
By default, WikiChat retrieves information from 10 Wikipedias via the endpoint at https://wikichat.genie.stanford.edu/search/. If you want to just try WikiChat, you do not need to modify anything.

### Option 2: Download and host our Wikipedia index
1. Download the [index](stanford-oval/wikipedia_10-languages_bge-m3_qdrant_index) from 🤗 Hub and extract it:
1. Download the [August 1, 2024 index of 10 Wikipedia languages](https://huggingface.co/datasets/stanford-oval/wikipedia_20240801_10-languages_bge-m3_qdrant_index) from 🤗 Hub and extract it:
```bash
inv download-wikipedia-index --workdir ./workdir
```

Note that this index contains ~180M vector embeddings and therefore requires a at least 800 GB of empty disk space. It uses [Qdrant's binary quantization](https://qdrant.tech/articles/binary-quantization/) to reduce RAM requirements to 55 GB without sacrificing accuracy or latency.
Note that this index contains ~180M vector embeddings and therefore requires at least 500 GB of empty disk space. It uses [Qdrant's binary quantization](https://qdrant.tech/articles/binary-quantization/) to reduce RAM requirements to 55 GB without sacrificing accuracy or latency.

2. Start a FastAPI server similar to option 1 that responds to HTTP POST requests:
```bash
inv start-retriever --embedding-model BAAI/bge-m3 --retriever-port <port number>
```

@@ -197,22 +222,34 @@ inv index-wikipedia-dump --embedding-model BAAI/bge-m3 --workdir ./workdir --la

1. Preprocess your data into a [JSON Lines](https://jsonlines.org/) file (with .jsonl or .jsonl.gz file extension) where each line has the following fields:
```json
{"content_string": "string", "article_title": "string", "full_section_title": "string", "block_type": "string", "language": "string", "last_edit_date": "string (optional)", "num_tokens": "integer (optional)"}
{"id": "integer", "content_string": "string", "article_title": "string", "full_section_title": "string", "block_type": "string", "language": "string", "last_edit_date": "string (optional)", "num_tokens": "integer (optional)"}
```
`content_string` should be the chunked text of your documents. We recommend chunking to fewer than 500 tokens, as measured by the embedding model's tokenizer. See [this](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/) for an overview of chunking methods. A minimal sketch of writing such a file is shown after this list.
`block_type` and `language` are only used to provide filtering on search results. If you do not need them, you can simply set them to `block_type=text` and `language=en`.
The script will feed `full_section_title` and `content_string` to the embedding model to create embedding vectors.

See `wikipedia_preprocessing/preprocess_html_dump.py` for details on how this is implemented for Wikipedia HTML dumps.

2. Then run the indexing command:
1. Run the indexing command:

```bash
inv index-collection --collection-path <path to preprocessed JSONL> --collection-name <name>
```

This command starts Docker containers for [🤗 Text Embedding Inference](https://github.com/huggingface/text-embeddings-inference), one per available GPU. By default, it uses the Docker image compatible with NVIDIA GPUs with the Ampere 80 architecture (e.g., A100). Support for some other GPUs is also available, but you need to choose the right image from the [available Docker images](https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#docker-images).

3. (Optional) Add a [payload index](https://qdrant.tech/documentation/concepts/payload/#payload-indexing)
```bash
python retrieval/add_payload_index.py
```
This will enable queries that filter on `language` or `block_type`. Note that for large indices, it might take several minutes for the index to become available again. A sketch of the kind of Qdrant call this script makes is shown after this list.

4. After indexing, load and use the index as in option 2. For example:
```bash
inv start-retriever --embedding-model BAAI/bge-m3 --retriever-port <port number>
curl -X POST 0.0.0.0:5100/search -H "Content-Type: application/json" -d '{"query": ["What is GPT-4?", "What is LLaMA-3?"], "num_blocks": 3}'
```
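
The snippet below is a minimal sketch of producing a JSON Lines file with the fields described in step 1. The field values, the `my_documents.jsonl` filename, and the pre-chunked `chunks` list are illustrative assumptions, not part of the repository.

```python
import json

# Hypothetical, already-chunked blocks; in practice, chunk each document to
# fewer than 500 tokens of the embedding model's tokenizer first.
chunks = [
    {
        "id": 1,                                   # unique integer per block
        "content_string": "GPT-4 is a large multimodal model ...",
        "article_title": "GPT-4",
        "full_section_title": "GPT-4 > Overview",
        "block_type": "text",                      # only used for filtering
        "language": "en",                          # only used for filtering
        "last_edit_date": "2024-08-01",            # optional
    },
]

# Write one JSON object per line (.jsonl), as expected by `inv index-collection`.
with open("my_documents.jsonl", "w", encoding="utf-8") as f:
    for block in chunks:
        f.write(json.dumps(block, ensure_ascii=False) + "\n")
```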
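
The payload-index step is implemented in `retrieval/add_payload_index.py`; the block below is only a rough sketch of the kind of Qdrant client call involved, with the collection name, host, and port as placeholder assumptions.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PayloadSchemaType

# Assumes a locally hosted Qdrant instance and a collection named "wikipedia";
# adjust these to your deployment.
client = QdrantClient(host="localhost", port=6333)

# Index the two payload fields used for filtering search results.
for field in ("language", "block_type"):
    client.create_payload_index(
        collection_name="wikipedia",
        field_name=field,
        field_schema=PayloadSchemaType.KEYWORD,
    )
```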


#### To upload a Qdrant index to 🤗 Hub:
1. Split the index into smaller parts:
@@ -256,19 +293,21 @@ Running this will start the backend and front-end servers. You can then access t



# The free Rate-limited Wikipedia search API
# The Free Rate-limited Wikipedia Search API
You can use this API endpoint for prototyping high-quality RAG systems.
See https://wikichat.genie.stanford.edu/search/redoc for the full specification.

Note that we do not provide any guarantees about this endpoint, and it is not suitable for production.
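
As a starting point, the sketch below sends the same request body as the self-hosted retriever example above. The exact schema and any additional query parameters are defined by the ReDoc specification linked above, so treat this as an untested illustration rather than a definitive client.

```python
import requests

# Same body format as the `retrieval/retriever_server.py` example in this README;
# the hosted endpoint is rate-limited and provided without guarantees.
response = requests.post(
    "https://wikichat.genie.stanford.edu/search/",
    json={"query": ["What is GPT-4?", "What is LLaMA-3?"], "num_blocks": 3},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```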


# Wikipedia Preprocessing: Why is it difficult?
Coming soon.
# Wikipedia Preprocessing: Why is it Difficult?
(Coming soon...)

We publicly release [preprocessed Wikipedia in 10 languages](https://huggingface.co/datasets/stanford-oval/wikipedia).

# Other Commands

## Run a distilled model for lower latency and cost
## Run a Distilled Model for Lower Latency and Cost
WikiChat 2.0 is not compatible with the [fine-tuned LLaMA-2 checkpoints](https://huggingface.co/collections/stanford-oval/wikichat-v10-66c580bf15e26b87d622498c) released for v1.0. Please refer to v1.0 for now.

## Simulate Conversations
@@ -282,13 +321,9 @@ Depending on the engine you are using, this might take some time. The simulated
You can also provide any of the pipeline parameters from above.
You can experiment with different user characteristics by modifying `user_characteristics` in `benchmark/user_simulator.py`.



# License
WikiChat code, models, and data are released under the Apache-2.0 license.



# Citation

If you have used code or data from this repository, please cite the following papers:
2 changes: 1 addition & 1 deletion benchmark/user_simulator.py
@@ -170,7 +170,7 @@ async def main(args):
make_parent_directories(args.output_file)
with open(args.output_file, "w") as output_file:
for idx, dlg in enumerate(all_dialogues):
if not dlg["dialogue_history"]:
if not dlg or not dlg["dialogue_history"]:
logger.error('dialog with topic "%s" failed', topics[idx])
# skip dialogs that failed
continue
2 changes: 1 addition & 1 deletion docs/search_api.md
@@ -6,7 +6,7 @@
## Description
This endpoint allows you to search the text, tables, and infoboxes of 10 Wikipedias (🇺🇸 English, 🇨🇳 Chinese, 🇪🇸 Spanish, 🇵🇹 Portuguese, 🇷🇺 Russian, 🇩🇪 German, 🇮🇷 Farsi, 🇯🇵 Japanese, 🇫🇷 French, 🇮🇹 Italian) using various query parameters.

It is currently retrieving from the Wikipedia dump of Feb 20, 2024.
It is currently retrieving from the Wikipedia dump of August 1, 2024.

The search endpoint is a hosted version of `retrieval/retriever_server.py`.
Specifically, it uses state-of-the-art multilingual vector embedding models for high-quality search results.
11 changes: 6 additions & 5 deletions pipelines/chatbot.py
@@ -242,16 +242,17 @@ async def process_refine_prompt_output(
return refined_agent_utterance.strip(), feedback

logger.error(
"Skipping refinement due to malformatted Refined response: %s",
"Skipping refinement due to malformed Refined response: %s",
refine_prompt_output,
)
return utterance_to_refine, None
else:
# There is no feedback part to the output
if refine_prompt_output.startswith("Chatbot:"):
refine_prompt_output = refine_prompt_output[
len(refine_prompt_output) :
].strip()
refine_identifiers = ["Chatbot:", "Chatbot's revised response:"]
for identifier in refine_identifiers:
if refine_prompt_output.startswith(identifier):
refine_prompt_output = refine_prompt_output[len(identifier) :].strip()
break
return refine_prompt_output, None


9 changes: 5 additions & 4 deletions pipelines/prompts/query.prompt
@@ -55,13 +55,14 @@ Yes. You search "who is Murakami the baseball player?". The year of the results


# input
Person: Did you watch the 1998 movie Shakespeare in Love?
Is it helpful to search Wikipedia? Yes. You search "the 1998 movie 'Shakespeare in Love'". The year of the results is "1998".
Person: Did you like it?
Person: آیا فیلم شکسپیر عاشق را دیده ای؟
Is it helpful to search Wikipedia? Yes. You search "شکسپیر عاشق فیلم سال ۱٩٩٨". The year of the results is "1998".
You: بله، می دانستی که جایزهٔ اسکار بهترین فیلم را گرفته؟
Person: بله. آیا فیلم را دوست داشتی؟
Is it helpful to search Wikipedia?

# output
Yes. You search "reviews for the 1998 movie 'Shakespeare in Love'". The year of the results is "none".
Yes. You search "نظرات درباره شکسپیر عاشق فیلم سال ۱٩٩٨". The year of the results is "none".


# input
4 changes: 2 additions & 2 deletions retrieval/add_payload_index.py
@@ -1,8 +1,8 @@
import argparse

import sys
from qdrant_client import QdrantClient
from qdrant_client.models import PayloadSchemaType

sys.path.insert(0, "./")
from tasks.defaults import DEFAULT_QDRANT_COLLECTION_NAME

if __name__ == "__main__":
12 changes: 6 additions & 6 deletions retrieval/qdrant_index.py
@@ -193,7 +193,7 @@ async def search(
collection_name=self.collection_name,
requests=[
SearchRequest(
vector=v,
vector=vector,
with_vector=False,
with_payload=True,
limit=k,
Expand All @@ -202,17 +202,17 @@ async def search(
Filter(
must=[ # 'must' acts like AND, 'should' acts like OR
FieldCondition(
key=k,
match=MatchAny(any=list(v)),
key=key,
match=MatchAny(any=list(value)),
)
for k, v in filters.items()
for key, value in filters.items()
]
)
if filters
else None
),
)
for v in query_embeddings
for vector in query_embeddings
],
)
logger.info("Nearest neighbor search took %.2f seconds", (time() - start_time))
@@ -296,7 +296,7 @@ def embed_queries(self, queries: list[str]):
)

logger.info(
"Embedding the query vector took %.2f seconds", (time() - start_time)
"Embedding the query into a vector took %.2f seconds", (time() - start_time)
)

return normalized_embeddings.tolist()
2 changes: 1 addition & 1 deletion retrieval/retriever_server.py
@@ -21,7 +21,7 @@

app = FastAPI(
title="Wikipedia Search API",
description="An API for retrieving information from 10 Wikipedia languages from the Wikipedia dump of Feb 20, 2024.",
description="An API for retrieving information from 10 Wikipedia languages from the Wikipedia dump of August 1, 2024.",
version="1.0.0",
docs_url="/search/docs",
redoc_url="/search/redoc",
39 changes: 22 additions & 17 deletions tasks/benchmark.py
@@ -9,14 +9,14 @@
from tasks.retrieval import get_wikipedia_collection_path


@task(pre=[load_api_keys])
@task(pre=[load_api_keys], iterable=["subset", "language"])
def simulate_users(
c,
num_dialogues, # -1 to simulate all available topics
num_turns: int,
simulation_mode: str, # passage
subset: str, # head, recent, tail
language: str, # for the topics
subset: list[str], # head, recent, tail
language: list[str],
input_file=None,
user_simulator_engine="gpt-4o",
user_temperature=1.0,
@@ -49,6 +49,9 @@ def simulate_users(
Accepts all parameters that `inv demo` accepts, plus a few additional parameters for the user simulator.
"""

if not language or not subset:
raise ValueError("Specify at least one --language and one --subset")

pipeline_flags = (
f"--pipeline {pipeline} "
f"--engine {engine} "
@@ -84,21 +87,23 @@ def simulate_users(
if enabled:
pipeline_flags += f"--{arg} "

if not input_file:
input_file = f"{subset}_articles_{language}.json"
for l in language:
for s in subset:
if not input_file:
input_file = f"{s}_articles_{l}.json"

c.run(
f"python benchmark/user_simulator.py {pipeline_flags} "
f"--num_dialogues {num_dialogues} "
f"--user_engine {user_simulator_engine} "
f"--user_temperature {user_temperature} "
f"--mode {simulation_mode} "
f"--input_file benchmark/topics/{input_file} "
f"--num_turns {num_turns} "
f"--output_file benchmark/simulated_dialogues/{pipeline}_{subset}_{language}_{engine}.txt "
f"--language {language} "
f"--no_logging"
)
c.run(
f"python benchmark/user_simulator.py {pipeline_flags} "
f"--num_dialogues {num_dialogues} "
f"--user_engine {user_simulator_engine} "
f"--user_temperature {user_temperature} "
f"--mode {simulation_mode} "
f"--input_file benchmark/topics/{input_file} "
f"--num_turns {num_turns} "
f"--output_file benchmark/simulated_dialogues/{pipeline}_{s}_{l}_{engine}.txt "
f"--language {l} "
f"--no_logging"
)


@task(iterable=["language"])
1 change: 0 additions & 1 deletion tasks/docker_utils.py
@@ -92,7 +92,6 @@ def wait_for_docker_container_to_be_ready(
Raises:
RuntimeError: If the container is not ready within the timeout period.
"""
timeout = 60
step_time = timeout // 10
elapsed_time = 0
logger.info("Waiting for container '%s' to be ready...", container.name)
12 changes: 7 additions & 5 deletions tasks/retrieval.py
@@ -186,7 +186,7 @@ def multithreaded_download(url: str, output_path: str, num_parts: int = 3) -> No
@task
def download_wikipedia_index(
c,
repo_id: str = "stanford-oval/wikipedia_10-languages_bge-m3_qdrant_index",
repo_id: str = "stanford-oval/wikipedia_20240401_10-languages_bge-m3_qdrant_index",
workdir: str = DEFAULT_WORKDIR,
num_threads: int = 8,
):
Expand All @@ -195,7 +195,7 @@ def download_wikipedia_index(
Args:
- c: Context, automatically passed by invoke.
- repo_id (str): The 🤗 hub repository ID from which to download the index files. Defaults to "stanford-oval/wikipedia_10-languages_bge-m3_qdrant_index".
- repo_id (str): The 🤗 Hub repository ID from which to download the index files.
- workdir (str): The working directory where the files will be downloaded and extracted. Defaults to DEFAULT_WORKDIR.
- num_threads (int): The number of threads to use for downloading and decompressing the files. Defaults to 8.
@@ -220,7 +220,7 @@

# Decompress and extract the files
c.run(
f"cat {part_files} | pigz -d -p {num_threads} | tar --strip-components=2 -xv -C {workdir}"
f"cat {part_files} | pigz -d -p {num_threads} | tar --strip-components=2 -xv -C {os.path.join(workdir, 'qdrant_index')}"
) # strip-components gets rid of the extra workdir/


@@ -410,14 +410,16 @@ def preprocess_wikipedia_dump(

index_dir = get_wikipedia_collection_dir(workdir, language, wikipedia_date)
input_path = os.path.join(index_dir, "articles-html.json.tar.gz")
translation_cache = os.path.join(workdir, "translation_cache.jsonl.gz")
wikidata_translation_map = os.path.join(
workdir, "wikidata_translation_map.jsonl.gz"
)

# Constructing the command with parameters
command = (
f"python wikipedia_preprocessing/preprocess_html_dump.py "
f"--input_path {input_path} "
f"--output_path {output_path} "
f"--translation_cache {translation_cache} "
f"--wikidata_translation_map {wikidata_translation_map} "
f"--language {language} "
f"--should_translate "
f"--pack_to_tokens {pack_to_tokens} "