HintEval💡 is a powerful framework designed for both generating and evaluating hints for input questions. These hints serve as subtle clues, guiding users toward the correct answer without directly revealing it. As the first tool of its kind, HintEval allows users to create and assess hints from various perspectives.
- Unified Framework: HintEval combines datasets, models, and evaluation metrics into a single Python-based library. This integration allows researchers to seamlessly conduct hint generation and evaluation tasks.
- Comprehensive Metrics: Implements five core metrics (covering fifteen evaluation methods): Relevance, Readability, Convergence, Familiarity, and Answer Leakage, with approaches ranging from lightweight to resource-intensive to suit diverse research needs.
- Dataset Support: Provides access to multiple preprocessed and evaluated datasets, including TriviaHG, WikiHint, HintQA, and KG-Hint, supporting both answer-aware and answer-agnostic hint generation approaches.
- Customizability: Allows users to define their own datasets, models, and evaluation methods with minimal effort using a structured design based on Python classes.
- Extensive Documentation: Accompanied by detailed 📖online documentation and tutorials for easy adoption.
- Enhanced Datasets: Expand the repository with additional datasets to support diverse hint-related tasks.
- Advanced Evaluation Metrics: Introduce new evaluation techniques such as UniEval, as well as cross-lingual compatibility.
- Broader Compatibility: Ensure support for emerging language models and APIs.
- Community Involvement: Encourage contributions of new datasets, metrics, and use cases from the research community.
It's recommended to install HintEval in a virtual environment using Python 3.11.9. If you're not familiar with Python virtual environments, check out this user guide. Alternatively, you can create a new environment using Conda.
First, create and activate a virtual environment with Python 3.11.9:
conda create -n hinteval_env python=3.11.9 --no-default-packages
conda activate hinteval_env
You'll need PyTorch 2.4.0 for HintEval. Refer to the PyTorch installation page for platform-specific installation commands. If you have access to GPUs, it's recommended to install the CUDA version of PyTorch, as many of the evaluation metrics are optimized for GPU use.
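For example, on Linux with CUDA 12.1 the install command would look roughly like the following (the cu121 wheel index is an assumption for illustration; use the command from the PyTorch installation page that matches your platform and CUDA version):

pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121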
Once PyTorch 2.4.0 is installed, you can install HintEval via pip:
pip install hinteval
For the latest features, you can install the most recent version from the main branch:
pip install git+https://github.com/DataScienceUIBK/HintEval
This tutorial provides step-by-step guidance on how to generate a synthetic hint dataset using large language models via the TogetherAI platform. To proceed, ensure you have an active API key for TogetherAI.
api_key = "your-api-key"
base_url = "https://api.together.xyz/v1"
First, gather a collection of question/answer pairs as the foundation for generating Question/Answer/Hint triples. For example, load 10 questions from the WebQuestions dataset using the 🤗datasets library:
from datasets import load_dataset
webq = load_dataset("Stanford/web_questions", split='test')
question_answers = webq.select_columns(['question', 'answers'])[10:20]
qa_pairs = zip(question_answers['question'], question_answers['answers'])
At this point, you have a set of question/answer pairs ready for creating synthetic Question/Answer/Hint instances.
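Optionally, you can sanity-check the pairs before building the dataset. Since `zip` returns a one-shot iterator, materialize it first so the later loop still sees all pairs:

qa_pairs = list(qa_pairs)
print(qa_pairs[0])  # (question string, list of answer strings)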
Use HintEval's `Dataset` class to create a new dataset called `synthetic_hint_dataset`, which includes the 10 question/answer pairs within a subset named `entire`.
from hinteval import Dataset
from hinteval.cores import Subset, Instance
dataset = Dataset('synthetic_hint_dataset')
subset = Subset('entire')
for q_id, (question, answers) in enumerate(qa_pairs, 1):
    instance = Instance.from_strings(question, answers, [])
    subset.add_instance(instance, f'id_{q_id}')
dataset.add_subset(subset)
dataset.prepare_dataset(fill_question_types=True)
Generate 5 hints for each question using HintEval's `AnswerAware` model. For this example, we will use the Meta LLaMA-3.1-70b-Instruct-Turbo model from TogetherAI.
from hinteval.model import AnswerAware
generator = AnswerAware(
    'meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo',
    api_key, base_url, num_of_hints=5, enable_tqdm=True
)
generator.generate(dataset['entire'].get_instances())
Note: Depending on the LLM provider, you may need to configure the model name and other parameters of the `AnswerAware` class. See the 📖documentation for more information.
Once the hints are generated, export the synthetic hint dataset to a pickle file:
dataset.store('./synthetic_hint_dataset.pickle')
Finally, view the hints generated for the third question in the dataset:
dataset = Dataset.load('./synthetic_hint_dataset.pickle')
third_question = dataset['entire'].get_instance('id_3')
print(f'Question: {third_question.question.question}')
print(f'Answer: {third_question.answers[0].answer}')
print()
for idx, hint in enumerate(third_question.hints, 1):
    print(f'Hint {idx}: {hint.hint}')
Example output:
Question: who is governor of ohio 2011?
Answer: John Kasich
Hint 1: The answer is a Republican politician who served as the 69th governor of the state.
Hint 2: This person was a member of the U.S. House of Representatives for 18 years before becoming governor.
Hint 3: The governor was known for his conservative views and efforts to reduce government spending.
Hint 4: During their term, they implemented several reforms related to education, healthcare, and the economy.
Hint 5: This governor served two consecutive terms, from 2011 to 2019, and ran for the U.S. presidency in 2016.
Once your hint dataset is ready, it’s time to evaluate the hints. This section guides you through the evaluation process.
api_key = "your-api-key"
base_url = "https://api.together.xyz/v1"
For this tutorial, use the synthetic dataset generated earlier. Alternatively, you can load a preprocessed dataset using the `Dataset.download_and_load_dataset()` function.
from hinteval import Dataset
dataset = Dataset.load('./synthetic_hint_dataset.pickle')
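If you prefer to start from one of the preprocessed datasets instead, the call would look roughly like this; the dataset identifier 'WikiHint' is an assumption for illustration, so check `Dataset.available_datasets()` for the exact names:

dataset = Dataset.download_and_load_dataset('WikiHint')  # assumed dataset name, for illustration only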
HintEval provides several metrics to evaluate different aspects of the hints:
- Relevance: Measures how relevant the hints are to the question.
- Readability: Assesses the readability of the hints.
- Convergence: Evaluates how effectively hints narrow down potential answers.
- Familiarity: Rates how common or well-known the hints' information is.
- Answer Leakage: Detects how much the hints reveal the correct answers.
Here’s how to import the metrics:
from hinteval.evaluation.relevance import Rouge
from hinteval.evaluation.readability import MachineLearningBased
from hinteval.evaluation.convergence import LlmBased
from hinteval.evaluation.familiarity import Wikipedia
from hinteval.evaluation.answer_leakage import ContextualEmbeddings
Extract the question, hints, and answers from the dataset and evaluate using different metrics:
instances = dataset['entire'].get_instances()
questions = [instance.question for instance in instances]
answers = [answer for instance in instances for answer in instance.answers]
hints = [hint for instance in instances for hint in instance.hints]
# Example evaluations
Rouge('rougeL', enable_tqdm=True).evaluate(instances)  # Relevance
MachineLearningBased('random_forest', enable_tqdm=True).evaluate(questions + hints)  # Readability
LlmBased('llama-3-70b', together_ai_api_key=api_key, enable_tqdm=True).evaluate(instances)  # Convergence
Wikipedia(enable_tqdm=True).evaluate(questions + hints + answers)  # Familiarity
ContextualEmbeddings(enable_tqdm=True).evaluate(instances)  # Answer Leakage
Export the evaluated dataset to a JSON file for further analysis:
dataset.store_json('./evaluated_synthetic_hint_dataset.json')
Note: Evaluated scores and metrics are automatically attached to the dataset, so storing the dataset also saves the scores.
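If you want to inspect the exported file outside of HintEval, it is a plain JSON file and can be read with Python's standard library; this sketch makes no assumptions about HintEval's API and only prints the file's top-level structure:

import json

with open('./evaluated_synthetic_hint_dataset.json') as f:
    evaluated = json.load(f)

# Show the top-level keys (or element count) to get an overview of the export.
print(list(evaluated.keys()) if isinstance(evaluated, dict) else len(evaluated))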
Refer to our 📖documentation to learn more.
HintEval is modular and customizable, with core components designed to handle every stage of the hint generation and evaluation pipeline:
- Preprocessed Datasets: Includes widely used datasets like TriviaHG, WikiHint, HintQA, and KG-Hint.
- Dynamic Dataset Loading: Use `Dataset.available_datasets()` to list, download, and load datasets effortlessly.
- Custom Dataset Creation: Define datasets using the `Dataset` and `Instance` classes for tailored hint generation.
- Answer-Aware Models: Generate hints tailored to specific answers using LLMs.
- Answer-Agnostic Models: Generate hints without requiring specific answers for open-ended tasks.
- Relevance: Measures how relevant the hints are to the question.
- Readability: Assesses the readability of the hints.
- Convergence: Evaluates how effectively hints narrow down potential answers.
- Familiarity: Rates how common or well-known the hints' information is.
- Answer Leakage: Detects how much the hints reveal the correct answers.
- Integrates seamlessly with API-based platforms (e.g., TogetherAI).
- Supports custom models and local inference setups (see the sketch below).
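Since `AnswerAware` takes a `base_url` (as shown in the generation tutorial above), one way to run against a local setup is to point it at an OpenAI-compatible server such as vLLM. This is only a hedged sketch: the endpoint URL, model name, and placeholder key below are assumptions for illustration, not documented defaults.

from hinteval.model import AnswerAware

# Hypothetical local setup: an OpenAI-compatible endpoint served on localhost (assumed URL).
local_generator = AnswerAware(
    'meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo',  # assumed: the model name your local server exposes
    'not-needed',                                    # assumed: placeholder key for a local server
    'http://localhost:8000/v1',                      # assumed: local OpenAI-compatible endpoint
    num_of_hints=5, enable_tqdm=True
)
local_generator.generate(dataset['entire'].get_instances())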
Community contributions are essential to our project, and we value every effort to improve it. From bug fixes to feature enhancements and documentation updates, your involvement makes a big difference, and we’re thrilled to have you join us! For more details, please refer to development.
If you have a dataset on hints that you'd like to share with the community, we'd love to help make it available within HintEval! Adding new, high-quality datasets enriches the framework and supports other users' research and study efforts.
To contribute your dataset, please reach out to us. We’ll review its quality and suitability for the framework, and if it meets the criteria, we’ll include it in our preprocessed datasets, making it readily accessible to all users.
To view the available preprocessed datasets, use the following code:
from hinteval import Dataset
available_datasets = Dataset.available_datasets(show_info=True, update=True)
Thank you for considering this valuable contribution! Expanding HintEval's resources with your work benefits the entire community.
Follow these steps to get involved:
- Fork this repository to your GitHub account.
- Create a new branch for your feature or fix:
  git checkout -b feature/YourFeatureName
- Make your changes and commit them:
  git commit -m "Add YourFeatureName"
- Push the changes to your branch:
  git push origin feature/YourFeatureName
- Submit a Pull Request to propose your changes.
Thank you for helping make this project better!
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.
If you find this work useful, please cite 📜our paper:
Mozafari, J., Piryani, B., Abdallah, A., & Jatowt, A. (2025). HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions. ArXiv. https://arxiv.org/abs/2502.00857
@article{mozafari2025hintevalcomprehensiveframeworkhint,
title = {HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions},
author = {Jamshid Mozafari and Bhawna Piryani and Abdelrahman Abdallah and Adam Jatowt},
year = 2025,
doi = {10.48550/arXiv.2502.00857},
url = {https://arxiv.org/abs/2502.00857},
eprint = {2502.00857},
archiveprefix = {arXiv},
primaryclass = {cs.CL}
}
Thanks to our contributors and the University of Innsbruck for supporting this project.