
Extracts Google Sheets to JSONL for fine-tuning, estimates task costs with tiktoken.


farithadnan/DatasetForge


DatasetForge ⚒️

DatasetForge is a Python project that extracts data from Google Sheets and converts it into a JSONL dataset suitable for OpenAI fine-tuning tasks (davinci-002 model). It also uses the tiktoken library to estimate the cost of those fine-tuning tasks.

Requirements ⭐

How to Run the Project 🏃🏽‍♂️

Step 1: Clone the repo

Open Git Bash and run:

  git clone https://github.com/farithadnan/DatasetForge.git

Step 2: Installation

Install the required Python packages by running the command below in your terminal:

  pip install -r requirements.txt

Step 3: Set Up Google Sheets Config

Ensure that the configuration file (e.g., config.yaml) contains the essential settings:

  • Path to Google Sheets credentials file (private keys).
  • URL of the Google Sheet to extract data from.
  • Index of the specific sheet within the Google Sheet.
  • Name for the output JSONL file.

Refer to config.yaml.sample for more info.
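
As a rough illustration of the four settings listed above, the sketch below checks a loaded configuration before the extraction runs. The key names (credentials_path, sheet_url, sheet_index, output_file) are hypothetical and chosen only for this example; the project's actual keys are the ones in config.yaml.sample. It assumes the YAML file has already been parsed into a plain dict (for instance with PyYAML's yaml.safe_load).

```python
# Hypothetical key names for illustration only; the project's real keys
# are defined in config.yaml.sample.
REQUIRED_KEYS = ("credentials_path", "sheet_url", "sheet_index", "output_file")


def validate_config(config: dict) -> dict:
    """Return the config unchanged, or raise ValueError describing the problem."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing config keys: {', '.join(missing)}")
    # The sheet index selects one worksheet inside the spreadsheet.
    if not isinstance(config["sheet_index"], int) or config["sheet_index"] < 0:
        raise ValueError("sheet_index must be a non-negative integer")
    # The output is a JSONL dataset, so a .jsonl extension is expected here.
    if not str(config["output_file"]).endswith(".jsonl"):
        raise ValueError("output_file should end with .jsonl")
    return config
```

Failing early on a missing key gives a clearer message than an error raised midway through authentication or extraction.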

Step 4: Set up model for Encoding

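To estimate what your dataset will cost to fine-tune, configure the token encoding in config.yaml. The default is r50k_base, the encoding used by GPT-3 models such as davinci. Encodings differ between models (tiktoken's own model table lists cl100k_base for davinci-002), so confirm the encoding for your target model before relying on the estimate.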

For more details, refer to the OpenAI Cookbook guide "How to count tokens with tiktoken".
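
A minimal sketch of how such an estimate can be computed, not necessarily how DatasetForge does it. The function names, the price argument, and the epochs multiplier are assumptions made for this example; only tiktoken.get_encoding and Encoding.encode are real tiktoken calls. Take the per-token price from OpenAI's current pricing page rather than from this sketch.

```python
def count_tokens(text: str, encoding_name: str = "r50k_base") -> int:
    """Count the tokens in text using a named tiktoken encoding."""
    # Imported here so the cost arithmetic below can be used without tiktoken.
    import tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))


def estimate_cost(total_tokens: int, price_per_1k: float, epochs: int = 1) -> float:
    """Estimate cost as tokens * epochs * (price per 1,000 tokens) / 1000.

    price_per_1k is a placeholder the caller must supply; multiplying by
    epochs assumes billing is per token trained.
    """
    return total_tokens * epochs * price_per_1k / 1000
```

For a dataset, the total would be the sum of count_tokens over every prompt and completion, passed to estimate_cost together with the price that applies to the chosen model.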

Step 5: Run the Project

Activate your virtual environment, then run the main Python script:

  python app.py

This will authenticate with Google Sheets, extract the specified data, and convert it into JSONL format, creating a dataset ready for fine-tuning tasks.
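
The conversion step can be pictured with the sketch below. It is an illustration under stated assumptions, not the project's actual code. It assumes each sheet row has already been read into a dict and that the sheet has columns named prompt and completion (hypothetical names), matching the prompt/completion pair format used when fine-tuning base models.

```python
import json


def rows_to_jsonl(rows, prompt_key="prompt", completion_key="completion"):
    """Serialize row dicts as JSONL: one JSON object per line."""
    lines = []
    for row in rows:
        prompt = str(row.get(prompt_key, "")).strip()
        completion = str(row.get(completion_key, "")).strip()
        if not prompt or not completion:
            continue  # skip rows that lack either half of the pair
        record = {"prompt": prompt, "completion": completion}
        # ensure_ascii=False keeps non-ASCII text readable in the output file.
        lines.append(json.dumps(record, ensure_ascii=False))
    # End with a newline only when there is at least one record.
    return "\n".join(lines) + ("\n" if lines else "")
```

Writing the returned string to the configured output file with UTF-8 encoding would produce the dataset; each line can then be parsed independently with json.loads.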
