Pokemon-Game-RAG-Chatbot

Group Project for CS605 Natural Language Processing for Smart Assistant

Project Title: Leveraging Advanced NLP and RAG for Dynamic Gameplay Chatbot
Group Members: Mary Vanessa Heng Hui Khim, Yee Jun Yit, Yeo York Yong, Zhuang Yiwen

Overview

Problem & Objectives

The evolution of open-world games has significantly increased the complexity and richness of gameplay. These games often feature intricate mechanics, diverse ecosystems, and dynamic systems that respond to player actions. However, this complexity presents challenges for many players, especially casual gamers, novices, and those new to the genre. Players struggle to understand intricate game mechanics, locate specific items, and navigate various systems (e.g., crafting, character progression, combat mechanics, quests). This can lead to frustration and a sense of being lost in the game world (Wang et al. 2023).

Issues with Existing Solutions

Current Large Language Models (LLMs) lack domain-specific knowledge related to gaming mechanics, strategies, and objectives. They provide generic or irrelevant responses when players seek assistance with in-game challenges or objectives. Additionally, LLMs struggle to handle dynamic game situations or understand individual player objectives without significant customization (Kim et al. 2023).

Our Solution

To address these challenges, we propose developing a sophisticated game chatbot. Our contributions include:

Implementing NLP and RAG Techniques: We leverage Natural Language Processing (NLP) and Retrieval-Augmented Generation (RAG) techniques to assist players with their game-related queries.
Creating a Dynamic and Personalized Game Chatbot: Our chatbot provides gameplay tips, strategies, and a companion experience tailored to individual player needs. It adapts to player progression and style, refining its assistance based on feedback.
Experimenting with Parameters: We explore different embeddings, LLM types, and prompt testing to optimize the chatbot’s accuracy.

How RAG Works

Retrieval Component:

User Query: The process begins with a user query (e.g., “What are the three starter Pokemon in the game?”).
Embedding Model: The user query is converted into an embedding vector using an embedding model (e.g., Ollama, OpenAI, GoogleGenerativeAI).
Vector Database: The pre-processed documents are embedded with the same model, and their embeddings are stored in a vector database (e.g., Facebook AI Similarity Search (FAISS)) that is made available for search.
Vector Comparison: The embedded user query is compared against the document embeddings in the database.
Top-k Retrieval: The system retrieves the top-k most similar documents based on vector similarity scores.
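
The retrieval steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the app's code: it assumes cosine similarity as the comparison and uses made-up 2-dimensional vectors, whereas a vector store such as FAISS performs the same kind of search with optimized indexes and its own distance metric. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return (index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings, made up for illustration only.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.1]
results = top_k(query_vec, doc_vecs, k=2)
```

Here `results` holds the indices and scores of the two document vectors that point in the direction closest to the query vector.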

Generation Component:

Context Creation: The retrieved documents are combined with the original user query to create context (e.g., “The starter Pokémon options in Paldea are Sprigatito (Grass), Fuecoco (Fire), and Quaxly (Water).”).
Language Model (LLM): A pre-trained language model (e.g., Llama3) takes this context as input.
Prompt Generation: A prompt is assembled that incorporates both the user query and the retrieved document information.
Answer Generation: Using this prompt, the LLM generates a specific answer (e.g., “The new Pokemon Scarlet and Pokemon Violet starters are Sprigatito, Fuecoco, and Quaxly.”).
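
A minimal sketch of how the retrieved chunks and the user query might be combined into a single prompt string. The template wording and the function name `build_prompt` are hypothetical; the app's actual prompt can be typed in the sidebar and may look different.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and the user question into one prompt string."""
    context = "\n\n".join(chunk.strip() for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "What are the three starter Pokemon in the game?",
    [
        "The starter Pokémon options in Paldea are Sprigatito (Grass), "
        "Fuecoco (Fire), and Quaxly (Water)."
    ],
)
```

The resulting string is what would be sent to the LLM in the answer generation step.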

Evaluation Metrics (RAGAS)

The RAGAS evaluation framework is designed to assess and quantify the performance of Retrieval-Augmented Generation (RAG) pipelines. The RAGAS score consists of two components:

Generation Assessment: Evaluates the quality of LLM-generated text. Metrics include faithfulness (alignment with retrieved context) and answer correctness.
Retrieval Assessment: Focuses on the effectiveness of the retrieval component. Measures how well the system retrieves relevant documents and ensures that retrieved context enhances LLM responses.

Faithfulness: (Generation)

Faithfulness measures how well the generated answer aligns with the information provided in the retrieved context. It ensures that the response remains consistent with the facts presented in the context.
The score is scaled to the range (0, 1); higher is better.

$\text{Faithfulness score} = {|\text{Number of claims in the generated answer that can be inferred from given context}| \over |\text{Total number of claims in the generated answer}|}$

Question: Where and when was Einstein born?

Context: Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time

High faithfulness answer: Einstein was born in Germany on 14th March 1879.

Low faithfulness answer: Einstein was born in Germany on 20th March 1879.
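
To make the formula concrete, the sketch below scores the two Einstein answers. It assumes each answer has already been split into two claims (place of birth and date of birth) and that each claim has been judged against the context; in RAGAS both steps are done by an LLM. The function name is hypothetical.

```python
def faithfulness_score(claim_supported):
    """Fraction of claims in the answer that can be inferred from the context."""
    if not claim_supported:
        raise ValueError("the answer must contain at least one claim")
    return sum(claim_supported) / len(claim_supported)

# Claims per answer: [born in Germany, born on the stated date]
high = faithfulness_score([True, True])   # both claims match the context
low = faithfulness_score([True, False])   # "20th March" contradicts the context
```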

Answer Relevancy: (Generation)

Answer Relevancy assesses how pertinent the generated answer is to the given prompt. It measures the alignment between the answer and the original question. Lower scores are assigned to incomplete or irrelevant answers, while higher scores indicate better relevancy.

$\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o)$

$\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\| \|E_o\|}$

Where:

$E_{g_i}$ is the embedding of the $i$-th question generated from the answer.
$E_{o}$ is the embedding of the original question.
$N$ is the number of generated questions, which is 3 by default.
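
The formula can be sketched as a mean of cosine similarities. The vectors below are made up for illustration; in practice $E_{g_i}$ and $E_o$ come from an embedding model, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def answer_relevancy(generated_question_vecs, original_question_vec):
    """Mean cosine similarity between each E_{g_i} and E_o."""
    n = len(generated_question_vecs)
    if n == 0:
        raise ValueError("at least one generated question is required")
    total = sum(cosine(g, original_question_vec) for g in generated_question_vecs)
    return total / n

e_o = [1.0, 0.0]                             # original question (toy vector)
e_g = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # N = 3 generated questions
score = answer_relevancy(e_g, e_o)
```

With these toy vectors, two generated questions align perfectly with the original and one is orthogonal to it, so the score is two thirds.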

Context Precision: (Retrieval)

Context Precision evaluates whether all relevant items (chunks) from the ground truth appear at higher ranks in the retrieved contexts. Ideally, relevant chunks should be ranked at the top. It measures how well the system prioritizes relevant context.
The resulting value ranges between 0 and 1, where higher scores indicate better precision.

$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left( \text{Precision@k} \times v_k \right)}{\text{Total number of relevant items in the top } K \text{ results}}$

$\text{Precision@k} = {\text{true positives@k} \over (\text{true positives@k} + \text{false positives@k})}$

Where $K$ is the total number of chunks in contexts and $v_k \in \{0, 1\}$ is the relevance indicator at rank $k$.
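
A short sketch of the two formulas, assuming the relevance indicators $v_k$ are already known for each rank (RAGAS obtains them with an LLM judge). The function name is hypothetical, and returning 0 when no chunk is relevant is an assumption made here to avoid dividing by zero.

```python
def context_precision_at_k(relevance):
    """relevance[k-1] is v_k: 1 if the chunk at rank k is relevant, else 0."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # assumption: no relevant chunk gives a score of 0
    weighted_sum = 0.0
    hits = 0  # true positives seen so far
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k
        precision_at_k = hits / k  # true positives@k / (TP@k + FP@k)
        weighted_sum += precision_at_k * v_k
    return weighted_sum / total_relevant

good_ranking = context_precision_at_k([1, 1, 0])  # relevant chunks ranked first
poor_ranking = context_precision_at_k([0, 1, 1])  # relevant chunks ranked last
```

Both rankings contain the same two relevant chunks, but the first scores 1.0 and the second scores 7/12, which shows how the metric rewards placing relevant chunks at the top.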

Context Relevancy: (Retrieval)

This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy.
Ideally, the retrieved context should exclusively contain essential information to address the provided query. To compute this, we first estimate the value of $|S|$ by identifying the sentences within the retrieved context that are relevant for answering the given question. The final score is determined by the following formula:

$\text{context relevancy} = {|S| \over |\text{Total number of sentences in retrieved context}|}$

Question: What is the capital of France?

High context relevancy: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.

Low context relevancy: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.

Context Recall: (Retrieval)

Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance. The formula for calculating context recall is as follows:

$\text{context recall} = {|\text{GT sentences that can be attributed to context}| \over |\text{Number of sentences in GT}|}$

Question: Where is France and what is its capital?

Ground truth: France is in Western Europe and its capital is Paris.

High context recall: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.

Low context recall: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.
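
As a worked instance of the formula, assume the ground truth is split into two statements, "France is in Western Europe" and "its capital is Paris" (this split is an illustrative assumption). The high-recall context supports both statements, while the low-recall context supports only the first:

$\text{context recall (high)} = \frac{2}{2} = 1, \qquad \text{context recall (low)} = \frac{1}{2} = 0.5$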

Context Entities Recall: (Retrieval)

Context Entities Recall measures the proportion of entities (e.g., facts, names, locations) in the ground truth that also appear in the retrieved context. It quantifies how well the retrieved context covers the entities of the ground truth.
To compute this metric, we use two sets: $GE$, the set of entities present in ground_truths, and $CE$, the set of entities present in contexts.
We then take the number of elements in the intersection of these sets and divide it by the number of elements in $GE$, given by the formula:

$\text{context entity recall} = \frac{| CE \cap GE |}{| GE |}$

Ground truth: The Taj Mahal is an ivory-white marble mausoleum on the right bank of the river Yamuna in the Indian city of Agra. It was commissioned in 1631 by the Mughal emperor Shah Jahan to house the tomb of his favorite wife, Mumtaz Mahal.

High entity recall context: The Taj Mahal is a symbol of love and architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it.

Low entity recall context: The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination.
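
The set arithmetic can be sketched directly with Python sets. The entity lists below were picked by hand from the Taj Mahal example and are an assumption; RAGAS extracts entities with an LLM, so its sets may differ. The function name is hypothetical.

```python
def context_entity_recall(ground_truth_entities, context_entities):
    """|CE ∩ GE| / |GE| for two collections of entity strings."""
    ge = set(ground_truth_entities)
    ce = set(context_entities)
    if not ge:
        raise ValueError("ground truth must contain at least one entity")
    return len(ce & ge) / len(ge)

# Hand-picked entities (illustrative assumption).
ge = {"Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"}
ce_high = {"Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"}
ce_low = {"Taj Mahal", "UNESCO", "India"}

high = context_entity_recall(ge, ce_high)  # 4 of 6 entities covered
low = context_entity_recall(ge, ce_low)    # 1 of 6 entities covered
```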

Answer Semantic Similarity: (End to End)

The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth.
This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.
Measuring the semantic similarity between answers can offer valuable insights into the quality of the generated response. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score.

Ground truth: Albert Einstein’s theory of relativity revolutionized our understanding of the universe.

High similarity answer: Einstein’s groundbreaking theory of relativity transformed our comprehension of the cosmos.

Low similarity answer: Isaac Newton’s laws of motion greatly influenced classical physics.

Answer Correctness: (End to End)

Answer Correctness measures how accurate the generated answer is when compared with the ground truth.
This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.
Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score. Users also have the option to employ a ‘threshold’ value to round the resulting score to binary, if desired.

Ground truth: Einstein was born in 1879 in Germany.

High answer correctness: In 1879, Einstein was born in Germany.

Low answer correctness: Einstein was born in Spain in 1879.
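
The weighted scheme and the optional threshold can be sketched as follows. This is an illustration under stated assumptions: factual similarity is taken to be an F1 score over claim counts, the weights `(0.75, 0.25)` are illustrative values rather than settings taken from this project, and the semantic similarity of 0.9 is a made-up number. All function names are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positive, false positive and false negative counts."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0

def answer_correctness(factual, semantic, weights=(0.75, 0.25), threshold=None):
    """Weighted average of factual and semantic similarity, optionally binarized."""
    w_factual, w_semantic = weights
    score = (w_factual * factual + w_semantic * semantic) / (w_factual + w_semantic)
    if threshold is not None:
        return 1.0 if score >= threshold else 0.0
    return score

# Low-correctness answer: "born in 1879" is right (TP), "born in Spain" is
# wrong (FP), and "born in Germany" is missing (FN).
factual = f1_from_counts(tp=1, fp=1, fn=1)
score = answer_correctness(factual, semantic=0.9)  # 0.9 is a made-up value
binary = answer_correctness(factual, semantic=0.9, threshold=0.5)
```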

Reproducing Results

Environment and Code:

Clone the GitHub repo: https://github.com/yYorky/Pokemon-Game-RAG-Chatbot.git
Create a virtual environment using: conda create -p venv python==3.10 -y
Install the necessary dependencies using: pip install -r requirements.txt
Set up the environment variables in a .env file for the API keys (e.g., OpenAI, GoogleGenerativeAI, GROQ).
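
Before launching the app, it may help to check that the API keys are visible to the process. The sketch below assumes the values from .env have already been loaded into the environment, and the variable names are hypothetical placeholders; use whichever names the app actually reads.

```python
import os

# Hypothetical variable names; replace with the names the app expects.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY", "GROQ_API_KEY")

def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with an explicit mapping instead of the real environment.
example = missing_keys(env={"OPENAI_API_KEY": "placeholder"})
```

Calling `missing_keys()` with no arguments checks the real process environment.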

Data Preparation:

Prepare an evaluation dataset in CSV format that is based on the reference document.
The dataset should contain at least two columns: ‘question’ and ‘ground_truth’.
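
A minimal sketch of building such a file with Python's standard `csv` module. The single row reuses the starter Pokemon example from this README; the function name `write_eval_csv` is hypothetical.

```python
import csv
import io

def write_eval_csv(rows, out):
    """Write (question, ground_truth) pairs with the two required columns."""
    writer = csv.DictWriter(out, fieldnames=["question", "ground_truth"])
    writer.writeheader()
    for question, ground_truth in rows:
        writer.writerow({"question": question, "ground_truth": ground_truth})

rows = [
    (
        "What are the three starter Pokemon in the game?",
        "The starter Pokémon options in Paldea are Sprigatito (Grass), "
        "Fuecoco (Fire), and Quaxly (Water).",
    ),
]

buffer = io.StringIO()  # in-memory stand-in for a file
write_eval_csv(rows, buffer)
csv_text = buffer.getvalue()
```

To write a real file, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")` instead of the in-memory buffer.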

Running the System:

Run the Streamlit app.
- Use streamlit run groq/model_eval.py to run evaluation version
- Use streamlit run groq/model_base.py to run basic version
In the Customization sidebar, select the appropriate settings for testing:
Choose a model (e.g., llama3).
Choose an embedding type (e.g., OpenAI, Ollama, or GoogleGenerativeAI).
Select a conversational memory length (how much of the past conversation the chatbot uses as input).
Choose a Chunk size and Chunk Overlap for document embedding.
Type a prompt for the LLM if necessary.
Click on Documents Embedding to embed the document.
The RAG chatbot is now ready to use.
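
To illustrate what the Chunk size and Chunk Overlap settings control, here is a simplified character-based splitter. It is a sketch only: the app's actual text splitter is not shown in this README and may split on separators rather than fixed character offsets. The function name is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into chunks of chunk_size characters that share
    chunk_overlap characters with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
```

A larger overlap repeats more text between neighbouring chunks, which lowers the chance of cutting a fact in half at the cost of storing more embeddings.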

Using model_base.py

Ask questions related to Pokemon Scarlet & Violet and view the response
Click on document similarity search to view the retrieved chunks

Using model_eval.py

Upload the evaluation dataset in the sidebar.
Ask questions related to Pokemon Scarlet & Violet.
The system will retrieve context and generate responses using RAG, and compute evaluation metrics using the RAGAS framework.
Continue asking questions as necessary. If a question is not in the evaluation dataset, the RAGAS evaluation is skipped for it.
Click on Save Evaluation Results and Download results to review.
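
The skip behaviour can be pictured as a lookup of the asked question in the uploaded dataset. The sketch below assumes a case-insensitive and whitespace-insensitive exact match; how model_eval.py actually matches questions is not described here, so treat the matching rule and the function names as hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())

def find_ground_truth(question, dataset):
    """Return the ground truth for question, or None if it is not in the dataset."""
    wanted = normalize(question)
    for row in dataset:
        if normalize(row["question"]) == wanted:
            return row["ground_truth"]
    return None

dataset = [
    {
        "question": "What are the three starter Pokemon in the game?",
        "ground_truth": "Sprigatito, Fuecoco, and Quaxly.",
    },
]

found = find_ground_truth("what are the three starter pokemon in the game? ", dataset)
skipped = find_ground_truth("Who is the final gym leader?", dataset)  # not in dataset
```

When the lookup returns `None`, there is no ground truth to score against, so the evaluation step would be skipped for that question.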

References and Acknowledgment

Wang, B., Gao, Z., & Shidujaman, M. (2023). Meaningful place: A phenomenological approach to the design of spatial experience in open-world games. Journal of Game Design, 10(2), 45-621. https://doi.org/10.1177/15554120231171290
Hugging Face. (n.d.). Advanced RAG (Retrieval-Augmented Generation) Cookbook. Retrieved from https://huggingface.co/learn/cookbook/advanced_rag
Ragas Documentation (n.d.). Retrieved from https://docs.ragas.io/en/latest/concepts/metrics/index.html
GitHub by Krishnaik06. (n.d.). Updated Langchain. Retrieved from https://github.com/krishnaik06/Updated-Langchain
GitHub by Alejandro-ao. (n.d.). Ask Multiple PDFs. Retrieved from https://github.com/alejandro-ao/ask-multiple-pdfs
(reddit) Brittlebear (2023). Pokemon Scarlet And Violet Walkthrough. Retrieved from https://docs.google.com/document/d/1xL1NNZnKRabyl93BewLzcZkcvmDJj2612K0Ih10hqXQ/edit#heading=h.82hsajnz9z9b
