
Illustrating GPT Tokenizers

Large Language Models introduced by OpenAI (called GPT models) use a process called Tokenization to convert words to numbers since neural networks only understand numbers. This repo is a fun project which shows how the text is actually converted to tokens and the number of tokens for various Encodings.
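The word-to-number idea can be sketched with a toy vocabulary. This is a deliberate simplification for illustration only: real GPT tokenizers use byte-pair encoding (BPE) over subword units, so unknown words are split into smaller pieces rather than mapped to a single unknown id.

```python
# Toy illustration of tokenization: a tiny fixed vocabulary mapping
# whole words to integer ids. Real GPT tokenizers use byte-pair
# encoding (BPE) over subwords, not a whole-word lookup like this.
vocab = {"hello": 0, "world": 1, "tokenizers": 2, "<unk>": 3}

def encode(text: str) -> list[int]:
    """Map each whitespace-separated word to its id (or <unk>)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(ids: list[int]) -> str:
    """Invert the mapping back to words."""
    inv = {i: w for w, i in vocab.items()}
    return " ".join(inv[i] for i in ids)

print(encode("Hello world"))          # [0, 1]
print(decode(encode("Hello world")))  # hello world
```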

  • gpt2 - used in the GPT-2 model.
  • gpt-3.5 - used in the GPT-3.5 and GPT-4 models.
  • gpt-4o - used in the latest GPT-4o model.

The dashboard provides an interactive way to show how words are tokenized with each tokenizer in real-time as a side-by-side comparison.

Live demo -> hosted on Render

Steps to run this locally (via command line)

  1. Clone this repository
git clone https://github.com/gdevakumar/Illustrative-Tokenizers.git
cd Illustrative-Tokenizers
  2. Install Python and the project dependencies
pip install -r requirements.txt
  3. Launch the web UI with the Flask application
python3 app.py

Steps to run this locally (via Docker)

Use this method if you have Docker/Docker Desktop installed.

  1. Clone this repository
git clone https://github.com/gdevakumar/Illustrative-Tokenizers.git 
cd Illustrative-Tokenizers
  2. Build the Docker image (note the dot (.) at the end of the command). This may take some time on the first build
docker build -t tokenizers .
  3. Run the Docker image
docker run -p 5000:5000 tokenizers
The app is then reachable at http://localhost:5000.

Demo

Screenshot

Video
