diff --git a/README.md b/README.md
index 1436c02..39ce8c9 100644
--- a/README.md
+++ b/README.md
@@ -28,9 +28,34 @@
-
https://github.com/user-attachments/assets/982e8733-f7a7-468d-940c-5c96f411f527
+# Table of Contents
+- [Introduction](#introduction)
+ - [🚨 Announcements](#-announcements)
+- [Installation](#installation)
+ - [System Requirements](#system-requirements)
+ - [Install Dependencies](#install-dependencies)
+ - [Configure the LLM of Your Choice](#configure-the-llm-of-your-choice)
+ - [Configure Information Retrieval](#configure-information-retrieval)
+ - [Option 1 (Default): Use our free rate-limited Wikipedia search API](#option-1-default-use-our-free-rate-limited-wikipedia-search-api)
+ - [Option 2: Download and host our Wikipedia index](#option-2-download-and-host-our-wikipedia-index)
+ - [Option 3: Build your own index](#option-3-build-your-own-index)
+ - [To build a Wikipedia index](#to-build-a-wikipedia-index)
+ - [To index custom documents](#to-index-custom-documents)
+ - [To upload a Qdrant index to 🤗 Hub:](#to-upload-a-qdrant-index-to--hub)
+ - [Run WikiChat in Terminal](#run-wikichat-in-terminal)
+ - [\[Optional\] Deploy WikiChat for Multi-user Access](#optional-deploy-wikichat-for-multi-user-access)
+ - [Set up Cosmos DB](#set-up-cosmos-db)
+ - [Run Chainlit](#run-chainlit)
+- [The Free Rate-limited Wikipedia Search API](#the-free-rate-limited-wikipedia-search-api)
+- [Wikipedia Preprocessing: Why is it Difficult?](#wikipedia-preprocessing-why-is-it-difficult)
+- [Other Commands](#other-commands)
+ - [Run a Distilled Model for Lower Latency and Cost](#run-a-distilled-model-for-lower-latency-and-cost)
+ - [Simulate Conversations](#simulate-conversations)
+- [License](#license)
+- [Citation](#citation)
+
@@ -49,7 +74,7 @@ WikiChat uses Wikipedia and the following 7-stage pipeline to makes sure its res
Check out our paper for more details:
Sina J. Semnani, Violet Z. Yao*, Heidi C. Zhang*, and Monica S. Lam. 2023. [WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia](https://arxiv.org/abs/2305.14292). In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics.
-## 🚨 **Announcements**
+## 🚨 Announcements
- (August 22, 2024) WikiChat 2.0 is now available! Key updates include:
- **Multilingual Support**: By default, retrieves information from 10 different Wikipedias: 🇺🇸 English, 🇨🇳 Chinese, 🇪🇸 Spanish, 🇵🇹 Portuguese, 🇷🇺 Russian, 🇩🇪 German, 🇮🇷 Farsi, 🇯🇵 Japanese, 🇫🇷 French, and 🇮🇹 Italian.
- **Improved Information Retrieval**
@@ -126,7 +151,7 @@ Keep this environment activated for all subsequent commands.
Install Docker for your operating system by following the instructions at https://docs.docker.com/engine/install/. WikiChat uses Docker primarily for creating and serving vector databases for retrieval, specifically [🤗 Text Embedding Inference](https://github.com/huggingface/text-embeddings-inference) and [Qdrant](https://github.com/qdrant/qdrant). On recent Ubuntu versions, you can try running `inv install-docker`. For other operating systems, follow the instructions on the docker website.
-WikiChat uses `invoke` (https://www.pyinvoke.org/) to add custom commands for various purposes. To see all available commands and their descriptions, run:
+WikiChat uses [`invoke`](https://www.pyinvoke.org/) to add custom commands for various purposes. To see all available commands and their descriptions, run:
```
invoke --list
```
@@ -167,16 +192,16 @@ Note that locally hosted models do NOT need an API key, but you need to provide
## Configure Information Retrieval
-### Option 1 (default): Use our free rate-limited Wikipedia search API
+### Option 1 (Default): Use our free rate-limited Wikipedia search API
By default, WikiChat retrieves information from 10 Wikipedias via the endpoint at https://wikichat.genie.stanford.edu/search/. If you want to just try WikiChat, you do not need to modify anything.
### Option 2: Download and host our Wikipedia index
-1. Download the [index](stanford-oval/wikipedia_10-languages_bge-m3_qdrant_index) from 🤗 Hub and extract it:
+1. Download the [August 1, 2024 index of 10 Wikipedia languages](https://huggingface.co/datasets/stanford-oval/wikipedia_20240801_10-languages_bge-m3_qdrant_index) from 🤗 Hub and extract it:
```bash
inv download-wikipedia-index --workdir ./workdir
```
-Note that this index contains ~180M vector embeddings and therefore requires a at least 800 GB of empty disk space. It uses [Qdrant's binary quantization](https://qdrant.tech/articles/binary-quantization/) to reduce RAM requirements to 55 GB without sacrificing accuracy or latency.
+Note that this index contains ~180M vector embeddings and therefore requires at least 500 GB of empty disk space. It uses [Qdrant's binary quantization](https://qdrant.tech/articles/binary-quantization/) to reduce RAM requirements to 55 GB without sacrificing accuracy or latency.
2. Start a FastAPI server similar to option 1 that responds to HTTP POST requests:
```bash
@@ -197,7 +222,7 @@ inv index-wikipedia-dump --embedding-model BAAI/bge-m3 --workdir ./workdir --la
1. Preprocess your data into a [JSON Lines](https://jsonlines.org/) file (with .jsonl or .jsonl.gz file extension) where each line has the following fields:
```json
-{"content_string": "string", "article_title": "string", "full_section_title": "string", "block_type": "string", "language": "string", "last_edit_date": "string (optional)", "num_tokens": "integer (optional)"}
+{"id": "integer", "content_string": "string", "article_title": "string", "full_section_title": "string", "block_type": "string", "language": "string", "last_edit_date": "string (optional)", "num_tokens": "integer (optional)"}
```
`content_string` should be the chunked text of your documents. We recommend chunking to less than 500 tokens of the embedding model's tokenizer. See [this](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/) for an overview on chunking methods.
`block_type` and `language` are only used to provide filtering on search results. If you do not need them, you can simply set them to `block_type=text` and `language=en`.
@@ -205,7 +230,7 @@ The script will feed `full_section_title` and `content_string` to the embedding
See `wikipedia_preprocessing/preprocess_html_dump.py` for details on how this is implemented for Wikipedia HTML dumps.
-2. Then run the indexing command:
+2. Run the indexing command:
```bash
inv index-collection --collection-path <path to the preprocessed JSONL file> --collection-name <collection name>
@@ -213,6 +238,18 @@ inv index-collection --collection-path --collection
This command starts docker containers for [🤗 Text Embedding Inference](https://github.com/huggingface/text-embeddings-inference) (one per available GPU). By default, it uses the docker image compatible with NVIDIA GPUs with Ampere 80 architecture, e.g. A100. Support for some other GPUs is also available, but you would need to choose the right docker image from [available docker images](https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#docker-images).
+3. (Optional) Add a [payload index](https://qdrant.tech/documentation/concepts/payload/#payload-indexing)
+```bash
+python retrieval/add_payload_index.py
+```
+This will enable queries that filter on `language` or `block_type`. Note that for large indices, it might take several minutes for the index to become available again.
+
+4. After indexing, load and use the index as in option 2. For example:
+```bash
+inv start-retriever --embedding-model BAAI/bge-m3 --retriever-port <port number>
+curl -X POST 0.0.0.0:5100/search -H "Content-Type: application/json" -d '{"query": ["What is GPT-4?", "What is LLaMA-3?"], "num_blocks": 3}'
+```
+
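+For reference, here is a minimal sketch of writing a collection file in the format described in step 1 above, using only the Python standard library (the file name and document contents are purely illustrative):
+
+```python
+import gzip
+import json
+
+# Replace `chunks` with your own chunked documents
+chunks = [
+    {
+        "id": 0,
+        "content_string": "GPT-4 is a large language model developed by OpenAI.",
+        "article_title": "GPT-4",
+        "full_section_title": "GPT-4 > Overview",
+        "block_type": "text",
+        "language": "en",
+    },
+]
+
+# Write one JSON object per line, gzip-compressed, as expected by `inv index-collection`
+with gzip.open("my_collection.jsonl.gz", "wt", encoding="utf-8") as f:
+    for chunk in chunks:
+        f.write(json.dumps(chunk, ensure_ascii=False) + "\n")
+```
+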
#### To upload a Qdrant index to 🤗 Hub:
1. Split the index into smaller parts:
@@ -256,19 +293,21 @@ Running this will start the backend and front-end servers. You can then access t
-# The free Rate-limited Wikipedia search API
+# The Free Rate-limited Wikipedia Search API
You can use this API endpoint for prototyping high-quality RAG systems.
See https://wikichat.genie.stanford.edu/search/redoc for the full specification.
Note that we do not provide any guarantees about this endpoint, and it is not suitable for production.
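+
+For example, a quick way to query the endpoint from Python (a sketch that assumes the hosted endpoint accepts the same JSON body as the self-hosted retriever shown above):
+
+```python
+import requests
+
+# Illustrative query; "num_blocks" controls how many retrieved blocks are returned per query
+response = requests.post(
+    "https://wikichat.genie.stanford.edu/search/",
+    json={"query": ["What is GPT-4?"], "num_blocks": 3},
+    timeout=30,
+)
+response.raise_for_status()
+print(response.json())
+```
+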
-# Wikipedia Preprocessing: Why is it difficult?
-Coming soon.
+# Wikipedia Preprocessing: Why is it Difficult?
+(Coming soon...)
+
+We publicly release [preprocessed Wikipedia in 10 languages](https://huggingface.co/datasets/stanford-oval/wikipedia).
# Other Commands
-## Run a distilled model for lower latency and cost
+## Run a Distilled Model for Lower Latency and Cost
WikiChat 2.0 is not compatible with the [fine-tuned LLaMA-2 checkpoints released for WikiChat 1.0](https://huggingface.co/collections/stanford-oval/wikichat-v10-66c580bf15e26b87d622498c). Please refer to v1.0 for now.
## Simulate Conversations
@@ -282,13 +321,9 @@ Depending on the engine you are using, this might take some time. The simulated
You can also provide any of the pipeline parameters from above.
You can experiment with different user characteristics by modifying `user_characteristics` in `benchmark/user_simulator.py`.
-
-
# License
WikiChat code, models, and data are released under the Apache-2.0 license.
-
-
# Citation
If you have used code or data from this repository, please cite the following papers:
diff --git a/benchmark/user_simulator.py b/benchmark/user_simulator.py
index 9ecce17..d49bdcc 100644
--- a/benchmark/user_simulator.py
+++ b/benchmark/user_simulator.py
@@ -170,7 +170,7 @@ async def main(args):
make_parent_directories(args.output_file)
with open(args.output_file, "w") as output_file:
for idx, dlg in enumerate(all_dialogues):
- if not dlg["dialogue_history"]:
+ if not dlg or not dlg["dialogue_history"]:
logger.error('dialog with topic "%s" failed', topics[idx])
# skip dialogs that failed
continue
diff --git a/docs/search_api.md b/docs/search_api.md
index 1eb7886..3cd1e0f 100644
--- a/docs/search_api.md
+++ b/docs/search_api.md
@@ -6,7 +6,7 @@
## Description
This endpoint allows you to search in text, table and infoboxes of 10 Wikipedias (🇺🇸 English, 🇨🇳 Chinese, 🇪🇸 Spanish, 🇵🇹 Portuguese, 🇷🇺 Russian, 🇩🇪 German, 🇮🇷 Farsi, 🇯🇵 Japanese, 🇫🇷 French, 🇮🇹 Italian) with various query parameters.
-It is currently retrieving from the Wikipedia dump of Feb 20, 2024.
+It is currently retrieving from the Wikipedia dump of August 1, 2024.
The search endpoint is a hosted version of `retrieval/retriever_server.py`.
Specifically, it uses the state-of-the-art multilingual vector embedding models for high quality search results.
diff --git a/pipelines/chatbot.py b/pipelines/chatbot.py
index e2085b3..bfdfec1 100644
--- a/pipelines/chatbot.py
+++ b/pipelines/chatbot.py
@@ -242,16 +242,17 @@ async def process_refine_prompt_output(
return refined_agent_utterance.strip(), feedback
logger.error(
- "Skipping refinement due to malformatted Refined response: %s",
+ "Skipping refinement due to malformed Refined response: %s",
refine_prompt_output,
)
return utterance_to_refine, None
else:
# There is no feedback part to the output
- if refine_prompt_output.startswith("Chatbot:"):
- refine_prompt_output = refine_prompt_output[
- len(refine_prompt_output) :
- ].strip()
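+            # Strip any known prefix that the model may prepend to its revised response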
+ refine_identifiers = ["Chatbot:", "Chatbot's revised response:"]
+ for identifier in refine_identifiers:
+ if refine_prompt_output.startswith(identifier):
+ refine_prompt_output = refine_prompt_output[len(identifier) :].strip()
+ break
return refine_prompt_output, None
diff --git a/pipelines/prompts/query.prompt b/pipelines/prompts/query.prompt
index b8f58d1..d8dd9c7 100644
--- a/pipelines/prompts/query.prompt
+++ b/pipelines/prompts/query.prompt
@@ -55,13 +55,14 @@ Yes. You search "who is Murakami the baseball player?". The year of the results
# input
-Person: Did you watch the 1998 movie Shakespeare in Love?
-Is it helpful to search Wikipedia? Yes. You search "the 1998 movie 'Shakespeare in Love'". The year of the results is "1998".
-Person: Did you like it?
+Person: آیا فیلم شکسپیر عاشق را دیده ای؟
+Is it helpful to search Wikipedia? Yes. You search "شکسپیر عاشق فیلم سال ۱۹۹۸". The year of the results is "1998".
+You: بله، می دانستی که جایزهٔ اسکار بهترین فیلم را گرفته؟
+Person: بله. آیا فیلم را دوست داشتی؟
Is it helpful to search Wikipedia?
# output
-Yes. You search "reviews for the 1998 movie 'Shakespeare in Love'". The year of the results is "none".
+Yes. You search "نظرات درباره شکسپیر عاشق فیلم سال ۱۹۹۸". The year of the results is "none".
# input
diff --git a/retrieval/add_payload_index.py b/retrieval/add_payload_index.py
index 1bee3e2..7fb9996 100644
--- a/retrieval/add_payload_index.py
+++ b/retrieval/add_payload_index.py
@@ -1,8 +1,8 @@
import argparse
-
+import sys
from qdrant_client import QdrantClient
from qdrant_client.models import PayloadSchemaType
-
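+# Make the repository root importable so that `tasks.defaults` resolves when this script is run directly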
+sys.path.insert(0, "./")
from tasks.defaults import DEFAULT_QDRANT_COLLECTION_NAME
if __name__ == "__main__":
diff --git a/retrieval/qdrant_index.py b/retrieval/qdrant_index.py
index 924884b..a39f113 100644
--- a/retrieval/qdrant_index.py
+++ b/retrieval/qdrant_index.py
@@ -193,7 +193,7 @@ async def search(
collection_name=self.collection_name,
requests=[
SearchRequest(
- vector=v,
+ vector=vector,
with_vector=False,
with_payload=True,
limit=k,
@@ -202,17 +202,17 @@ async def search(
Filter(
must=[ # 'must' acts like AND, 'should' acts like OR
FieldCondition(
- key=k,
- match=MatchAny(any=list(v)),
+ key=key,
+ match=MatchAny(any=list(value)),
)
- for k, v in filters.items()
+ for key, value in filters.items()
]
)
if filters
else None
),
)
- for v in query_embeddings
+ for vector in query_embeddings
],
)
logger.info("Nearest neighbor search took %.2f seconds", (time() - start_time))
@@ -296,7 +296,7 @@ def embed_queries(self, queries: list[str]):
)
logger.info(
- "Embedding the query vector took %.2f seconds", (time() - start_time)
+            "Embedding the queries into vectors took %.2f seconds", (time() - start_time)
)
return normalized_embeddings.tolist()
diff --git a/retrieval/retriever_server.py b/retrieval/retriever_server.py
index 7c3e6b1..615b64e 100644
--- a/retrieval/retriever_server.py
+++ b/retrieval/retriever_server.py
@@ -21,7 +21,7 @@
app = FastAPI(
title="Wikipedia Search API",
- description="An API for retrieving information from 10 Wikipedia languages from the Wikipedia dump of Feb 20, 2024.",
+ description="An API for retrieving information from 10 Wikipedia languages from the Wikipedia dump of August 1, 2024.",
version="1.0.0",
docs_url="/search/docs",
redoc_url="/search/redoc",
diff --git a/tasks/benchmark.py b/tasks/benchmark.py
index 63f7403..9933f9c 100644
--- a/tasks/benchmark.py
+++ b/tasks/benchmark.py
@@ -9,14 +9,14 @@
from tasks.retrieval import get_wikipedia_collection_path
-@task(pre=[load_api_keys])
+@task(pre=[load_api_keys], iterable=["subset", "language"])
def simulate_users(
c,
num_dialogues, # -1 to simulate all available topics
num_turns: int,
simulation_mode: str, # passage
- subset: str, # head, recent, tail
- language: str, # for the topics
+ subset: list[str], # head, recent, tail
+ language: list[str],
input_file=None,
user_simulator_engine="gpt-4o",
user_temperature=1.0,
@@ -49,6 +49,9 @@ def simulate_users(
Accepts all parameters that `inv demo` accepts, plus a few additional parameters for the user simulator.
"""
+ if not language or not subset:
+ raise ValueError("Specify at least one --language and one --subset")
+
pipeline_flags = (
f"--pipeline {pipeline} "
f"--engine {engine} "
@@ -84,21 +87,23 @@ def simulate_users(
if enabled:
pipeline_flags += f"--{arg} "
- if not input_file:
- input_file = f"{subset}_articles_{language}.json"
+    for l in language:
+        for s in subset:
+            # Derive the topics file from the subset and language unless an input file was explicitly provided
+            current_input_file = input_file or f"{s}_articles_{l}.json"
- c.run(
- f"python benchmark/user_simulator.py {pipeline_flags} "
- f"--num_dialogues {num_dialogues} "
- f"--user_engine {user_simulator_engine} "
- f"--user_temperature {user_temperature} "
- f"--mode {simulation_mode} "
- f"--input_file benchmark/topics/{input_file} "
- f"--num_turns {num_turns} "
- f"--output_file benchmark/simulated_dialogues/{pipeline}_{subset}_{language}_{engine}.txt "
- f"--language {language} "
- f"--no_logging"
- )
+ c.run(
+ f"python benchmark/user_simulator.py {pipeline_flags} "
+ f"--num_dialogues {num_dialogues} "
+ f"--user_engine {user_simulator_engine} "
+ f"--user_temperature {user_temperature} "
+ f"--mode {simulation_mode} "
+                f"--input_file benchmark/topics/{current_input_file} "
+ f"--num_turns {num_turns} "
+ f"--output_file benchmark/simulated_dialogues/{pipeline}_{s}_{l}_{engine}.txt "
+ f"--language {l} "
+ f"--no_logging"
+ )
@task(iterable=["language"])
diff --git a/tasks/docker_utils.py b/tasks/docker_utils.py
index 1579f45..63c8a6d 100644
--- a/tasks/docker_utils.py
+++ b/tasks/docker_utils.py
@@ -92,7 +92,6 @@ def wait_for_docker_container_to_be_ready(
Raises:
RuntimeError: If the container is not ready within the timeout period.
"""
- timeout = 60
step_time = timeout // 10
elapsed_time = 0
logger.info("Waiting for container '%s' to be ready...", container.name)
diff --git a/tasks/retrieval.py b/tasks/retrieval.py
index 7a9d6be..88e34fc 100644
--- a/tasks/retrieval.py
+++ b/tasks/retrieval.py
@@ -186,7 +186,7 @@ def multithreaded_download(url: str, output_path: str, num_parts: int = 3) -> No
@task
def download_wikipedia_index(
c,
- repo_id: str = "stanford-oval/wikipedia_10-languages_bge-m3_qdrant_index",
+    repo_id: str = "stanford-oval/wikipedia_20240801_10-languages_bge-m3_qdrant_index",
workdir: str = DEFAULT_WORKDIR,
num_threads: int = 8,
):
@@ -195,7 +195,7 @@ def download_wikipedia_index(
Args:
- c: Context, automatically passed by invoke.
- - repo_id (str): The 🤗 hub repository ID from which to download the index files. Defaults to "stanford-oval/wikipedia_10-languages_bge-m3_qdrant_index".
+ - repo_id (str): The 🤗 Hub repository ID from which to download the index files.
- workdir (str): The working directory where the files will be downloaded and extracted. Defaults to DEFAULT_WORKDIR.
- num_threads (int): The number of threads to use for downloading and decompressing the files. Defaults to 8.
@@ -220,7 +220,7 @@ def download_wikipedia_index(
# Decompress and extract the files
c.run(
- f"cat {part_files} | pigz -d -p {num_threads} | tar --strip-components=2 -xv -C {workdir}"
+ f"cat {part_files} | pigz -d -p {num_threads} | tar --strip-components=2 -xv -C {os.path.join(workdir, 'qdrant_index')}"
) # strip-components gets rid of the extra workdir/
@@ -410,14 +410,16 @@ def preprocess_wikipedia_dump(
index_dir = get_wikipedia_collection_dir(workdir, language, wikipedia_date)
input_path = os.path.join(index_dir, "articles-html.json.tar.gz")
- translation_cache = os.path.join(workdir, "translation_cache.jsonl.gz")
+ wikidata_translation_map = os.path.join(
+ workdir, "wikidata_translation_map.jsonl.gz"
+ )
# Constructing the command with parameters
command = (
f"python wikipedia_preprocessing/preprocess_html_dump.py "
f"--input_path {input_path} "
f"--output_path {output_path} "
- f"--translation_cache {translation_cache} "
+ f"--wikidata_translation_map {wikidata_translation_map} "
f"--language {language} "
f"--should_translate "
f"--pack_to_tokens {pack_to_tokens} "
diff --git a/wikipedia_preprocessing/preprocess_html_dump.py b/wikipedia_preprocessing/preprocess_html_dump.py
index b06a6e9..fd1688d 100644
--- a/wikipedia_preprocessing/preprocess_html_dump.py
+++ b/wikipedia_preprocessing/preprocess_html_dump.py
@@ -480,7 +480,7 @@ def get_entity_translation_to_english(
separated by a specific prefix, if the translation is found and deemed
non-redundant. Returns just the entity name if the translation is redundant or not found.
"""
- cached_english = get_from_translation_cache(
+ cached_english = get_from_translation_map(
source_language, entity_name, inverse_redirection_map
)
if cached_english is not None:
@@ -499,7 +499,7 @@ def get_entity_translation_to_english(
else:
return entity_name
else:
- logger.debug("Excluded %s because it is too frequnt", cached_english)
+ logger.debug("Excluded '%s' because it is too frequent", cached_english)
return entity_name
else:
logger.debug(
@@ -782,9 +782,9 @@ def articles_without_disambiguation_or_redirections(
help="If we should translate named entities to English using Wikidata. Has no effect if `--language` is English",
)
arg_parser.add_argument(
- "--translation_cache",
+ "--wikidata_translation_map",
type=str,
- help="Where to read/write the translation cache.",
+ help="Where to read/write the translation mapping we obtain from Wikidata.",
)
arg_parser.add_argument("--num_workers", type=int, default=max(1, cpu_count() - 4))
arg_parser.add_argument(
@@ -832,24 +832,24 @@ def articles_without_disambiguation_or_redirections(
):
break
- load_translation_cache(args.translation_cache)
+ load_translation_map(args.wikidata_translation_map)
non_cached_titles = []
for url in redirection_map:
if (
- get_from_translation_cache(args.language, url, inverse_redirection_map)
+ get_from_translation_map(args.language, url, inverse_redirection_map)
is None
):
non_cached_titles.append(url)
if len(non_cached_titles) > 0:
logger.info(
- "Did not find %d articles in the cache, will call the Wikidata API for them",
+ "Did not find %d articles in the translation map, will call the Wikidata API for them",
len(non_cached_titles),
)
asyncio.run(
batch_get_wikidata_english_name(non_cached_titles, args.language)
)
- save_translation_cache(args.translation_cache)
+ save_translation_map(args.wikidata_translation_map)
input_queue = SimpleQueue()
output_queue = SimpleQueue()
diff --git a/wikipedia_preprocessing/utils.py b/wikipedia_preprocessing/utils.py
index a9e74b0..7389c95 100644
--- a/wikipedia_preprocessing/utils.py
+++ b/wikipedia_preprocessing/utils.py
@@ -15,10 +15,10 @@
logger = get_logger(__name__)
-# cache for all-languages-to-english. e.g. global_translation_dict["fa"] is a dictionary of Farsi -> English translations
-# values can be the emtpy string "", which means we have already looked up the translations in Wikidata, but did not find the English translation
+# Mapping for all languages to English. E.g. global_translation_map["fa"] is a dictionary of Farsi -> English translations
+# values can be the empty string "", which means we have already looked up the translations in Wikidata, but did not find the English translation
# this is different from the case that the key is absent, which means we have never looked up that translation in Wikidata
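+# e.g. (illustrative) global_translation_map["fa"]["تهران"] == "Tehran"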
-global_translation_cache = {}
+global_translation_map = {}
translation_prefix = "(in English: "
@@ -84,57 +84,57 @@ def replace_except_first(s, old, new):
return first_part + rest_replaced
-def get_from_translation_cache(
+def get_from_translation_map(
source_language: str, entity: str, inverse_redirection_map: dict = {}
):
- global global_translation_cache
- if source_language not in global_translation_cache:
+ global global_translation_map
+ if source_language not in global_translation_map:
return None
- if entity not in global_translation_cache[source_language] and (
+ if entity not in global_translation_map[source_language] and (
entity not in inverse_redirection_map
or inverse_redirection_map[entity]
- not in global_translation_cache[source_language]
+ not in global_translation_map[source_language]
):
return None
if (
- entity in global_translation_cache[source_language]
- and global_translation_cache[source_language][entity] is not None
+ entity in global_translation_map[source_language]
+ and global_translation_map[source_language][entity] is not None
):
- return global_translation_cache[source_language][entity]
+ return global_translation_map[source_language][entity]
else:
- return global_translation_cache[source_language][
+ return global_translation_map[source_language][
inverse_redirection_map[entity]
]
-def load_translation_cache(file_name: str):
- global global_translation_cache
+def load_translation_map(file_name: str):
+ global global_translation_map
try:
for language in tqdm(
- orjsonl.stream(file_name), desc="Loading translation cache", smoothing=0
+ orjsonl.stream(file_name), desc="Loading translation map", smoothing=0
):
- global_translation_cache[language["language"]] = language["translations"]
+ global_translation_map[language["language"]] = language["translations"]
except FileNotFoundError as e:
logger.warning(
- "Could not find the translation cache file at %s. Initializing the cache as an empty dictionary.",
+ "Could not find the Wikidata translation map file at %s. Initializing the translation map as an empty dictionary.",
file_name,
)
- global_translation_cache = {}
+ global_translation_map = {}
-def save_translation_cache(file_name: str):
- global global_translation_cache
+def save_translation_map(file_name: str):
+ global global_translation_map
orjsonl.save(
file_name,
tqdm(
[
{
"language": language,
- "translations": global_translation_cache[language],
+ "translations": global_translation_map[language],
}
- for language in global_translation_cache
+ for language in global_translation_map
],
- desc="Saving translation cache",
+ desc="Saving translation map",
smoothing=0,
),
compression_format="gz",
@@ -146,9 +146,9 @@ async def get_wikidata_english_name(article_title: str, session, language: str):
Returns
(english_name: str, new_translation_dict: dict)
"""
- global global_translation_cache
- if get_from_translation_cache(language, article_title) is not None:
- return get_from_translation_cache(language, article_title), {}
+ global global_translation_map
+ if get_from_translation_map(language, article_title) is not None:
+ return get_from_translation_map(language, article_title), {}
try:
# the API expects a user agent
# labels cover more entity-languages, but are sometimes ambiguous. Therefore, we give priority to sitelinks and fallback to labels if needed.
@@ -221,7 +221,7 @@ async def get_wikidata_english_name(article_title: str, session, language: str):
list(sitelink_dict.keys()) + list(wikidata_entity["labels"].keys())
)
- # No need to include these in the translation cache
+ # No need to include these in the translation map
for l in ["en", "en-gb", "en-ca", "commons", "simple"]:
set_of_available_languages.discard(l)
@@ -250,7 +250,7 @@ async def get_wikidata_english_name(article_title: str, session, language: str):
async def batch_get_wikidata_english_name(article_titles: list[str], language: str):
- global global_translation_cache
+ global global_translation_map
async with aiohttp.ClientSession() as session:
with logging_redirect_tqdm():
minibatch_size = 100 # The wikipedia API only allows 100 requests per second, so we batch the requests.
@@ -285,10 +285,10 @@ async def batch_get_wikidata_english_name(article_titles: list[str], language: s
# Add to the global translation dictionary
for translation_dict in batch_new_translation_dicts:
for lang in translation_dict.keys():
- if lang not in global_translation_cache:
- global_translation_cache[lang] = {}
+ if lang not in global_translation_map:
+ global_translation_map[lang] = {}
for k, v in translation_dict[lang].items():
- global_translation_cache[lang][k] = v
+ global_translation_map[lang][k] = v
time_passed = time() - start_time
time_to_wait = 1.1 - time_passed
if time_to_wait > 0: