
Update Wikipedia index to August 1st, 2024
s-jse authored Aug 24, 2024
1 parent ee25ff7 commit 4845662
Showing 13 changed files with 141 additions and 98 deletions.
67 changes: 51 additions & 16 deletions README.md
@@ -28,9 +28,34 @@
</p>



https://github.com/user-attachments/assets/982e8733-f7a7-468d-940c-5c96f411f527

# Table of Contents
- [Introduction](#introduction)
- [🚨 Announcements](#-announcements)
- [Installation](#installation)
- [System Requirements](#system-requirements)
- [Install Dependencies](#install-dependencies)
- [Configure the LLM of Your Choice](#configure-the-llm-of-your-choice)
- [Configure Information Retrieval](#configure-information-retrieval)
- [Option 1 (Default): Use our free rate-limited Wikipedia search API](#option-1-default-use-our-free-rate-limited-wikipedia-search-api)
- [Option 2: Download and host our Wikipedia index](#option-2-download-and-host-our-wikipedia-index)
- [Option 3: Build your own index](#option-3-build-your-own-index)
- [To build a Wikipedia index](#to-build-a-wikipedia-index)
- [To index custom documents](#to-index-custom-documents)
- [To upload a Qdrant index to 🤗 Hub:](#to-upload-a-qdrant-index-to--hub)
- [Run WikiChat in Terminal](#run-wikichat-in-terminal)
- [\[Optional\] Deploy WikiChat for Multi-user Access](#optional-deploy-wikichat-for-multi-user-access)
- [Set up Cosmos DB](#set-up-cosmos-db)
- [Run Chainlit](#run-chainlit)
- [The Free Rate-limited Wikipedia Search API](#the-free-rate-limited-wikipedia-search-api)
- [Wikipedia Preprocessing: Why is it Difficult?](#wikipedia-preprocessing-why-is-it-difficult)
- [Other Commands](#other-commands)
- [Run a Distilled Model for Lower Latency and Cost](#run-a-distilled-model-for-lower-latency-and-cost)
- [Simulate Conversations](#simulate-conversations)
- [License](#license)
- [Citation](#citation)



<!-- <hr /> -->
@@ -49,7 +49,7 @@ WikiChat uses Wikipedia and the following 7-stage pipeline to makes sure its res
Check out our paper for more details:
Sina J. Semnani, Violet Z. Yao*, Heidi C. Zhang*, and Monica S. Lam. 2023. [WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia](https://arxiv.org/abs/2305.14292). In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics.

## 🚨 **Announcements**
## 🚨 Announcements
- (August 22, 2024) WikiChat 2.0 is now available! Key updates include:
- **Multilingual Support**: By default, retrieves information from 10 different Wikipedias: 🇺🇸 English, 🇨🇳 Chinese, 🇪🇸 Spanish, 🇵🇹 Portuguese, 🇷🇺 Russian, 🇩🇪 German, 🇮🇷 Farsi, 🇯🇵 Japanese, 🇫🇷 French, and 🇮🇹 Italian.
- **Improved Information Retrieval**
@@ -126,7 +151,7 @@ Keep this environment activated for all subsequent commands.

Install Docker for your operating system by following the instructions at https://docs.docker.com/engine/install/. WikiChat uses Docker primarily for creating and serving vector databases for retrieval, specifically [🤗 Text Embedding Inference](https://github.com/huggingface/text-embeddings-inference) and [Qdrant](https://github.com/qdrant/qdrant). On recent Ubuntu versions, you can try running `inv install-docker`. For other operating systems, follow the instructions on the docker website.

WikiChat uses `invoke` (https://www.pyinvoke.org/) to add custom commands for various purposes. To see all available commands and their descriptions, run:
WikiChat uses [`invoke`](https://www.pyinvoke.org/) to add custom commands for various purposes. To see all available commands and their descriptions, run:
```
invoke --list
```
@@ -167,16 +192,16 @@ Note that locally hosted models do NOT need an API key, but you need to provide

## Configure Information Retrieval

### Option 1 (default): Use our free rate-limited Wikipedia search API
### Option 1 (Default): Use our free rate-limited Wikipedia search API
By default, WikiChat retrieves information from 10 Wikipedias via the endpoint at https://wikichat.genie.stanford.edu/search/. If you want to just try WikiChat, you do not need to modify anything.

### Option 2: Download and host our Wikipedia index
1. Download the [index](stanford-oval/wikipedia_10-languages_bge-m3_qdrant_index) from 🤗 Hub and extract it:
1. Download the [August 1, 2024 index of 10 Wikipedia languages](https://huggingface.co/datasets/stanford-oval/wikipedia_20240801_10-languages_bge-m3_qdrant_index) from 🤗 Hub and extract it:
```bash
inv download-wikipedia-index --workdir ./workdir
```

Note that this index contains ~180M vector embeddings and therefore requires a at least 800 GB of empty disk space. It uses [Qdrant's binary quantization](https://qdrant.tech/articles/binary-quantization/) to reduce RAM requirements to 55 GB without sacrificing accuracy or latency.
Note that this index contains ~180M vector embeddings and therefore requires at least 500 GB of empty disk space. It uses [Qdrant's binary quantization](https://qdrant.tech/articles/binary-quantization/) to reduce RAM requirements to 55 GB without sacrificing accuracy or latency.

2. Start a FastAPI server similar to option 1 that responds to HTTP POST requests:
```bash
inv start-retriever --embedding-model BAAI/bge-m3 --retriever-port <port number>
```

@@ -197,22 +222,34 @@ inv index-wikipedia-dump --embedding-model BAAI/bge-m3 --workdir ./workdir --la

1. Preprocess your data into a [JSON Lines](https://jsonlines.org/) file (with .jsonl or .jsonl.gz file extension) where each line has the following fields:
```json
{"content_string": "string", "article_title": "string", "full_section_title": "string", "block_type": "string", "language": "string", "last_edit_date": "string (optional)", "num_tokens": "integer (optional)"}
{"id": "integer", "content_string": "string", "article_title": "string", "full_section_title": "string", "block_type": "string", "language": "string", "last_edit_date": "string (optional)", "num_tokens": "integer (optional)"}
```
`content_string` should be the chunked text of your documents. We recommend chunking to fewer than 500 tokens, as measured by the embedding model's tokenizer. See [this](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/) for an overview of chunking methods. A minimal sketch of writing such a file is shown after this list.
`block_type` and `language` are only used to provide filtering on search results. If you do not need them, you can simply set them to `block_type=text` and `language=en`.
The script will feed `full_section_title` and `content_string` to the embedding model to create embedding vectors.

See `wikipedia_preprocessing/preprocess_html_dump.py` for details on how this is implemented for Wikipedia HTML dumps.

2. Then run the indexing command:
1. Run the indexing command:

```bash
inv index-collection --collection-path <path to preprocessed JSONL> --collection-name <name>
```

This command starts Docker containers for [🤗 Text Embedding Inference](https://github.com/huggingface/text-embeddings-inference), one per available GPU. By default, it uses the Docker image compatible with NVIDIA GPUs with the Ampere 80 architecture (e.g., A100). Support for some other GPUs is also available, but you need to choose the right image from the [available Docker images](https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#docker-images).

3. (Optional) Add a [payload index](https://qdrant.tech/documentation/concepts/payload/#payload-indexing)
```bash
python retrieval/add_payload_index.py
```
This will enable queries that filter on `language` or `block_type`. Note that for large indices, it might take several minutes for the index to become available again. A sketch of the kind of Qdrant call this script makes is shown after this list.

4. After indexing, load and use the index as in option 2. For example:
```bash
inv start-retriever --embedding-model BAAI/bge-m3 --retriever-port <port number>
curl -X POST 0.0.0.0:5100/search -H "Content-Type: application/json" -d '{"query": ["What is GPT-4?", "What is LLaMA-3?"], "num_blocks": 3}'
```
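
The snippet below is a minimal sketch of producing a JSON Lines file with the fields described in step 1. The field values, the `my_documents.jsonl` filename, and the pre-chunked `chunks` list are illustrative assumptions, not part of the repository.

```python
import json

# Hypothetical, already-chunked blocks; in practice, chunk each document to
# fewer than 500 tokens of the embedding model's tokenizer first.
chunks = [
    {
        "id": 1,                                   # unique integer per block
        "content_string": "GPT-4 is a large multimodal model ...",
        "article_title": "GPT-4",
        "full_section_title": "GPT-4 > Overview",
        "block_type": "text",                      # only used for filtering
        "language": "en",                          # only used for filtering
        "last_edit_date": "2024-08-01",            # optional
    },
]

# Write one JSON object per line (.jsonl), as expected by `inv index-collection`.
with open("my_documents.jsonl", "w", encoding="utf-8") as f:
    for block in chunks:
        f.write(json.dumps(block, ensure_ascii=False) + "\n")
```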
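
The payload-index step is implemented in `retrieval/add_payload_index.py`; the block below is only a rough sketch of the kind of Qdrant client call involved, with the collection name, host, and port as placeholder assumptions.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PayloadSchemaType

# Assumes a locally hosted Qdrant instance and a collection named "wikipedia";
# adjust these to your deployment.
client = QdrantClient(host="localhost", port=6333)

# Index the two payload fields used for filtering search results.
for field in ("language", "block_type"):
    client.create_payload_index(
        collection_name="wikipedia",
        field_name=field,
        field_schema=PayloadSchemaType.KEYWORD,
    )
```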


#### To upload a Qdrant index to 🤗 Hub:
1. Split the index into smaller parts:
@@ -256,19 +293,21 @@ Running this will start the backend and front-end servers. You can then access t



# The free Rate-limited Wikipedia search API
# The Free Rate-limited Wikipedia Search API
You can use this API endpoint for prototyping high-quality RAG systems.
See https://wikichat.genie.stanford.edu/search/redoc for the full specification.

Note that we do not provide any guarantees about this endpoint, and it is not suitable for production.
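
As a starting point, the sketch below sends the same request body as the self-hosted retriever example above. The exact schema and any additional query parameters are defined by the ReDoc specification linked above, so treat this as an untested illustration rather than a definitive client.

```python
import requests

# Same body format as the `retrieval/retriever_server.py` example in this README;
# the hosted endpoint is rate-limited and provided without guarantees.
response = requests.post(
    "https://wikichat.genie.stanford.edu/search/",
    json={"query": ["What is GPT-4?", "What is LLaMA-3?"], "num_blocks": 3},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```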


# Wikipedia Preprocessing: Why is it difficult?
Coming soon.
# Wikipedia Preprocessing: Why is it Difficult?
(Coming soon...)

We publicly release [preprocessed Wikipedia in 10 languages](https://huggingface.co/datasets/stanford-oval/wikipedia).

# Other Commands

## Run a distilled model for lower latency and cost
## Run a Distilled Model for Lower Latency and Cost
WikiChat 2.0 is not compatible with the [fine-tuned LLaMA-2 checkpoints](https://huggingface.co/collections/stanford-oval/wikichat-v10-66c580bf15e26b87d622498c) released for v1.0. Please refer to v1.0 for now.

## Simulate Conversations
@@ -282,13 +321,9 @@ Depending on the engine you are using, this might take some time. The simulated
You can also provide any of the pipeline parameters from above.
You can experiment with different user characteristics by modifying `user_characteristics` in `benchmark/user_simulator.py`.



# License
WikiChat code, models, and data are released under the Apache-2.0 license.



# Citation

If you have used code or data from this repository, please cite the following papers:
2 changes: 1 addition & 1 deletion benchmark/user_simulator.py
@@ -170,7 +170,7 @@ async def main(args):
make_parent_directories(args.output_file)
with open(args.output_file, "w") as output_file:
for idx, dlg in enumerate(all_dialogues):
if not dlg["dialogue_history"]:
if not dlg or not dlg["dialogue_history"]:
logger.error('dialog with topic "%s" failed', topics[idx])
# skip dialogs that failed
continue
2 changes: 1 addition & 1 deletion docs/search_api.md
@@ -6,7 +6,7 @@
## Description
This endpoint allows you to search the text, tables, and infoboxes of 10 Wikipedias (🇺🇸 English, 🇨🇳 Chinese, 🇪🇸 Spanish, 🇵🇹 Portuguese, 🇷🇺 Russian, 🇩🇪 German, 🇮🇷 Farsi, 🇯🇵 Japanese, 🇫🇷 French, 🇮🇹 Italian) using various query parameters.

It is currently retrieving from the Wikipedia dump of Feb 20, 2024.
It is currently retrieving from the Wikipedia dump of August 1, 2024.

The search endpoint is a hosted version of `retrieval/retriever_server.py`.
Specifically, it uses state-of-the-art multilingual vector embedding models for high-quality search results.
11 changes: 6 additions & 5 deletions pipelines/chatbot.py
@@ -242,16 +242,17 @@ async def process_refine_prompt_output(
return refined_agent_utterance.strip(), feedback

logger.error(
"Skipping refinement due to malformatted Refined response: %s",
"Skipping refinement due to malformed Refined response: %s",
refine_prompt_output,
)
return utterance_to_refine, None
else:
# There is no feedback part to the output
if refine_prompt_output.startswith("Chatbot:"):
refine_prompt_output = refine_prompt_output[
len(refine_prompt_output) :
].strip()
refine_identifiers = ["Chatbot:", "Chatbot's revised response:"]
for identifier in refine_identifiers:
if refine_prompt_output.startswith(identifier):
refine_prompt_output = refine_prompt_output[len(identifier) :].strip()
break
return refine_prompt_output, None


9 changes: 5 additions & 4 deletions pipelines/prompts/query.prompt
@@ -55,13 +55,14 @@ Yes. You search "who is Murakami the baseball player?". The year of the results


# input
Person: Did you watch the 1998 movie Shakespeare in Love?
Is it helpful to search Wikipedia? Yes. You search "the 1998 movie 'Shakespeare in Love'". The year of the results is "1998".
Person: Did you like it?
Person: آیا فیلم شکسپیر عاشق را دیده ای؟
Is it helpful to search Wikipedia? Yes. You search "شکسپیر عاشق فیلم سال ۱٩٩٨". The year of the results is "1998".
You: بله، می دانستی که جایزهٔ اسکار بهترین فیلم را گرفته؟
Person: بله. آیا فیلم را دوست داشتی؟
Is it helpful to search Wikipedia?

# output
Yes. You search "reviews for the 1998 movie 'Shakespeare in Love'". The year of the results is "none".
Yes. You search "نظرات درباره شکسپیر عاشق فیلم سال ۱٩٩٨". The year of the results is "none".


# input
4 changes: 2 additions & 2 deletions retrieval/add_payload_index.py
@@ -1,8 +1,8 @@
import argparse

import sys
from qdrant_client import QdrantClient
from qdrant_client.models import PayloadSchemaType

sys.path.insert(0, "./")
from tasks.defaults import DEFAULT_QDRANT_COLLECTION_NAME

if __name__ == "__main__":
12 changes: 6 additions & 6 deletions retrieval/qdrant_index.py
@@ -193,7 +193,7 @@ async def search(
collection_name=self.collection_name,
requests=[
SearchRequest(
vector=v,
vector=vector,
with_vector=False,
with_payload=True,
limit=k,
Expand All @@ -202,17 +202,17 @@ async def search(
Filter(
must=[ # 'must' acts like AND, 'should' acts like OR
FieldCondition(
key=k,
match=MatchAny(any=list(v)),
key=key,
match=MatchAny(any=list(value)),
)
for k, v in filters.items()
for key, value in filters.items()
]
)
if filters
else None
),
)
for v in query_embeddings
for vector in query_embeddings
],
)
logger.info("Nearest neighbor search took %.2f seconds", (time() - start_time))
@@ -296,7 +296,7 @@ def embed_queries(self, queries: list[str]):
)

logger.info(
"Embedding the query vector took %.2f seconds", (time() - start_time)
"Embedding the query into a vector took %.2f seconds", (time() - start_time)
)

return normalized_embeddings.tolist()
2 changes: 1 addition & 1 deletion retrieval/retriever_server.py
@@ -21,7 +21,7 @@

app = FastAPI(
title="Wikipedia Search API",
description="An API for retrieving information from 10 Wikipedia languages from the Wikipedia dump of Feb 20, 2024.",
description="An API for retrieving information from 10 Wikipedia languages from the Wikipedia dump of August 1, 2024.",
version="1.0.0",
docs_url="/search/docs",
redoc_url="/search/redoc",
39 changes: 22 additions & 17 deletions tasks/benchmark.py
@@ -9,14 +9,14 @@
from tasks.retrieval import get_wikipedia_collection_path


@task(pre=[load_api_keys])
@task(pre=[load_api_keys], iterable=["subset", "language"])
def simulate_users(
c,
num_dialogues, # -1 to simulate all available topics
num_turns: int,
simulation_mode: str, # passage
subset: str, # head, recent, tail
language: str, # for the topics
subset: list[str], # head, recent, tail
language: list[str],
input_file=None,
user_simulator_engine="gpt-4o",
user_temperature=1.0,
@@ -49,6 +49,9 @@ def simulate_users(
Accepts all parameters that `inv demo` accepts, plus a few additional parameters for the user simulator.
"""

if not language or not subset:
raise ValueError("Specify at least one --language and one --subset")

pipeline_flags = (
f"--pipeline {pipeline} "
f"--engine {engine} "
@@ -84,21 +87,23 @@ def simulate_users(
if enabled:
pipeline_flags += f"--{arg} "

if not input_file:
input_file = f"{subset}_articles_{language}.json"
for l in language:
for s in subset:
if not input_file:
input_file = f"{s}_articles_{l}.json"

c.run(
f"python benchmark/user_simulator.py {pipeline_flags} "
f"--num_dialogues {num_dialogues} "
f"--user_engine {user_simulator_engine} "
f"--user_temperature {user_temperature} "
f"--mode {simulation_mode} "
f"--input_file benchmark/topics/{input_file} "
f"--num_turns {num_turns} "
f"--output_file benchmark/simulated_dialogues/{pipeline}_{subset}_{language}_{engine}.txt "
f"--language {language} "
f"--no_logging"
)
c.run(
f"python benchmark/user_simulator.py {pipeline_flags} "
f"--num_dialogues {num_dialogues} "
f"--user_engine {user_simulator_engine} "
f"--user_temperature {user_temperature} "
f"--mode {simulation_mode} "
f"--input_file benchmark/topics/{input_file} "
f"--num_turns {num_turns} "
f"--output_file benchmark/simulated_dialogues/{pipeline}_{s}_{l}_{engine}.txt "
f"--language {l} "
f"--no_logging"
)


@task(iterable=["language"])
1 change: 0 additions & 1 deletion tasks/docker_utils.py
@@ -92,7 +92,6 @@ def wait_for_docker_container_to_be_ready(
Raises:
RuntimeError: If the container is not ready within the timeout period.
"""
timeout = 60
step_time = timeout // 10
elapsed_time = 0
logger.info("Waiting for container '%s' to be ready...", container.name)
12 changes: 7 additions & 5 deletions tasks/retrieval.py
@@ -186,7 +186,7 @@ def multithreaded_download(url: str, output_path: str, num_parts: int = 3) -> No
@task
def download_wikipedia_index(
c,
repo_id: str = "stanford-oval/wikipedia_10-languages_bge-m3_qdrant_index",
repo_id: str = "stanford-oval/wikipedia_20240401_10-languages_bge-m3_qdrant_index",
workdir: str = DEFAULT_WORKDIR,
num_threads: int = 8,
):
Expand All @@ -195,7 +195,7 @@ def download_wikipedia_index(
Args:
- c: Context, automatically passed by invoke.
- repo_id (str): The 🤗 hub repository ID from which to download the index files. Defaults to "stanford-oval/wikipedia_10-languages_bge-m3_qdrant_index".
- repo_id (str): The 🤗 Hub repository ID from which to download the index files.
- workdir (str): The working directory where the files will be downloaded and extracted. Defaults to DEFAULT_WORKDIR.
- num_threads (int): The number of threads to use for downloading and decompressing the files. Defaults to 8.
@@ -220,7 +220,7 @@

# Decompress and extract the files
c.run(
f"cat {part_files} | pigz -d -p {num_threads} | tar --strip-components=2 -xv -C {workdir}"
f"cat {part_files} | pigz -d -p {num_threads} | tar --strip-components=2 -xv -C {os.path.join(workdir, 'qdrant_index')}"
) # strip-components gets rid of the extra workdir/


@@ -410,14 +410,16 @@ def preprocess_wikipedia_dump(

index_dir = get_wikipedia_collection_dir(workdir, language, wikipedia_date)
input_path = os.path.join(index_dir, "articles-html.json.tar.gz")
translation_cache = os.path.join(workdir, "translation_cache.jsonl.gz")
wikidata_translation_map = os.path.join(
workdir, "wikidata_translation_map.jsonl.gz"
)

# Constructing the command with parameters
command = (
f"python wikipedia_preprocessing/preprocess_html_dump.py "
f"--input_path {input_path} "
f"--output_path {output_path} "
f"--translation_cache {translation_cache} "
f"--wikidata_translation_map {wikidata_translation_map} "
f"--language {language} "
f"--should_translate "
f"--pack_to_tokens {pack_to_tokens} "