This repository contains a system for generating question-answer pairs for FINE-TUNING LLMs from the data you have.

shrijayan/dataset_generator


Dataset Generator for Fine-Tuning

This repository contains a system for generating question-answer pairs for FINE-TUNING LLMs from the data you have. The system leverages various modules to extract text, generate questions using a language model, and save the generated questions.
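The three stages named above (extract text, generate questions, save them) can be sketched roughly as follows. This is an illustrative outline only, not the repository's actual code: `chunk_text`, `to_jsonl`, `build_dataset`, and the `question`/`answer` field names are hypothetical, and the `generate` callable stands in for whatever language-model call the project makes.

```python
import json
from typing import Callable


def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Group paragraphs into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def to_jsonl(pairs: list[dict]) -> str:
    """Serialize question-answer pairs as JSON Lines (one object per line)."""
    return "".join(json.dumps(pair, ensure_ascii=False) + "\n" for pair in pairs)


def build_dataset(
    text: str,
    generate: Callable[[str], list[dict]],
    max_chars: int = 2000,
) -> str:
    """Run the hypothetical pipeline: chunk the text, generate pairs, serialize."""
    pairs: list[dict] = []
    for chunk in chunk_text(text, max_chars):
        pairs.extend(generate(chunk))  # generate() is a stand-in for the LLM call
    return to_jsonl(pairs)
```

Passing the model call in as a function keeps the sketch independent of any particular inference engine and makes it easy to try with a stub before spending tokens.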

Latest Update

  • Added a feature to process HTML files as input.
  • Added a feature to remove duplicate and similar questions.
  • Simplified the cleaning process for the JSONL output format.
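To illustrate what removing duplicate and similar questions can look like, here is a minimal sketch. It is not the project's implementation: the configuration keys suggest the project uses a ChromaDB collection, whereas this example swaps in the standard library's `difflib.SequenceMatcher` as a stand-in for a distance measure. It also assumes the threshold is a maximum distance below which two questions count as duplicates; the function names are hypothetical.

```python
from difflib import SequenceMatcher


def distance(a: str, b: str) -> float:
    """Return 1 minus the similarity ratio, so 0.0 means identical text."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return 1.0 - ratio


def drop_near_duplicates(questions: list[str], threshold: float = 0.1) -> list[str]:
    """Keep a question only if it is farther than threshold from every kept one."""
    kept: list[str] = []
    for question in questions:
        if all(distance(question, other) > threshold for other in kept):
            kept.append(question)
    return kept
```

Each question is compared against every question already kept, which is quadratic in the number of questions. That is fine for small batches; a vector index is the usual choice for larger ones.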

Architecture Diagram

Architecture diagram

Supported Inference Engines

  1. vLLM
  2. OpenAI API
  3. Azure OpenAI API
  4. Ollama
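One way to think about supporting several engines is a small lookup that maps an engine name to connection settings. The sketch below is an assumption-laden illustration, not the repository's code: only the name "azure" appears in the sample configuration, so the other keys ("vllm", "openai", "ollama") and the `EngineSettings` and `resolve_engine` names are hypothetical. The URLs are the usual defaults for each engine's OpenAI-compatible endpoint and may differ in your setup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EngineSettings:
    name: str
    base_url: str
    needs_api_key: bool


# Assumed defaults; local servers are often started on other hosts or ports.
_DEFAULTS = {
    "vllm": EngineSettings("vllm", "http://localhost:8000/v1", False),
    "openai": EngineSettings("openai", "https://api.openai.com/v1", True),
    "ollama": EngineSettings("ollama", "http://localhost:11434/v1", False),
}


def resolve_engine(name: str, base_url: Optional[str] = None) -> EngineSettings:
    """Return connection settings for an engine, allowing a URL override."""
    key = name.strip().lower()
    if key == "azure":
        # Azure OpenAI endpoints are specific to each resource, so there is no default.
        if not base_url:
            raise ValueError("Azure OpenAI needs a resource-specific endpoint URL")
        return EngineSettings("azure", base_url, True)
    if key not in _DEFAULTS:
        raise ValueError(f"Unsupported inference engine: {name!r}")
    default = _DEFAULTS[key]
    return EngineSettings(default.name, base_url or default.base_url, default.needs_api_key)
```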

Installation

  1. Clone the repository:
     git clone https://github.com/shrijayan/dataset_generator.git
     cd dataset_generator
  2. Create a virtual environment and activate it:
     python3.11 -m venv .venv
     source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
  3. Install the required dependencies:
     pip install -r requirements.txt
  4. Copy the example environment file and configure it:
     cp .env.example .env
  5. Update the .env file with your API URL and API key.
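Before running anything, it can help to confirm that the .env file actually contains the values the last step asks for. The following is a minimal sketch using only the standard library. The variable names `API_URL` and `API_KEY` are assumptions (check .env.example for the real names), and `parse_env` and `missing_settings` are hypothetical helpers, not part of the project.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional single or double quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_settings(
    values: dict[str, str],
    required: tuple[str, ...] = ("API_URL", "API_KEY"),  # assumed names
) -> list[str]:
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]
```

A typical check would read the file with `Path(".env").read_text()`, pass the result to `parse_env`, and stop with a clear message if `missing_settings` returns anything.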

Configuration

The configuration is specified in the config.json file. You can update the model name or other parameters as needed. Standard JSON does not allow comments, so each key is described after the example:

{
    "inference_engine": "azure",
    "model_name": "llama3.1",
    "model_max_tokens": 10000,
    "input_folder": "input_data",
    "output_folder": "generated_questions",
    "chroma_db_path": "chromadb",
    "chroma_collection_name": "questions",
    "duplicate_threshold": 0.1
}

  • inference_engine: name of the inference engine to use
  • model_name: name of the model
  • model_max_tokens: the model's maximum number of tokens
  • input_folder: location of the input data
  • output_folder: location of the generated output
  • chroma_db_path: location of the vector database
  • chroma_collection_name: name of the vector database collection
  • duplicate_threshold: threshold used for duplicate checking
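A quick check of config.json before a long run can catch typos early. The sketch below assumes the file is plain JSON containing the keys shown above; `load_config` and `EXPECTED_KEYS` are hypothetical names, and whether the project performs similar validation is not stated. Note that Python's `json` module rejects `#` comments, so any inline annotations must be removed from the real file.

```python
import json

# Key names come from the sample configuration; the expected types are inferred from it.
EXPECTED_KEYS = {
    "inference_engine": str,
    "model_name": str,
    "model_max_tokens": int,
    "input_folder": str,
    "output_folder": str,
    "chroma_db_path": str,
    "chroma_collection_name": str,
    "duplicate_threshold": (int, float),
}


def load_config(text: str) -> dict:
    """Parse the config text and check that each expected key has a sensible value."""
    config = json.loads(text)  # raises json.JSONDecodeError on comments or bad syntax
    for key, expected_type in EXPECTED_KEYS.items():
        if key not in config:
            raise KeyError(f"config.json is missing {key!r}")
        if not isinstance(config[key], expected_type):
            raise TypeError(f"{key!r} has an unexpected type: {type(config[key]).__name__}")
    if config["model_max_tokens"] <= 0:
        raise ValueError("model_max_tokens must be positive")
    if config["duplicate_threshold"] < 0:
        raise ValueError("duplicate_threshold must not be negative")
    return config
```

In practice the text would come from `Path("config.json").read_text(encoding="utf-8")`.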

Usage

  1. Place your input files in the input_data folder.

  2. To run the question generation process, execute the main.py script:

python main.py

Prompts

  • The system prompt for generating question-answer pairs is stored in the prompts folder as generateQA-sys_prompt.txt.

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any changes.

License

This project is licensed under the Apache-2.0 license. See the LICENSE file for details.
