neural-tokenizer

High compression text tokenizers via VQAEs for efficient and democratic language modeling.

Language models struggle with semantic modeling because tokens from typical tokenizers carry high-frequency surface details. Employing stronger textual compression via neural tokenizers may alleviate this problem.
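As a rough illustration of the idea only (not this repository's architecture), the sketch below shows a minimal vector-quantization bottleneck: continuous character embeddings are downsampled and snapped to the nearest entry of a learned codebook, so a span of characters ends up represented by a single discrete code. All names and sizes here are assumptions for the sketch.

# Minimal, hypothetical VQ bottleneck sketch -- illustrates the idea,
# not the actual neural-tokenizer implementation.
import torch
import torch.nn as nn

class TinyVQBottleneck(nn.Module):
    def __init__(self, dim=64, codebook_size=1024, downsample=4):
        super().__init__()
        # Strided conv merges `downsample` character embeddings into one vector.
        self.down = nn.Conv1d(dim, dim, kernel_size=downsample, stride=downsample)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):
        # x: (batch, seq_len, dim) continuous character embeddings
        z = self.down(x.transpose(1, 2)).transpose(1, 2)
        # Snap each compressed vector to its nearest codebook entry.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        return dists.argmin(dim=-1)  # discrete code ids, seq_len / downsample of them

emb = torch.randn(1, 32, 64)           # 32 "characters"
print(TinyVQBottleneck()(emb).shape)   # torch.Size([1, 8]) -> 4x fewer tokens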

Usage

from neural_tokenizer import NeuralTokenizer

# Load pretrained model
model = NeuralTokenizer.from_pretrained("elyxlz/neural-tokenizer-v1")

text = ["Hello", "World :)"]
tokens = model.encode(text)
print(tokens.data)
# [[0, 1235, 1236, 1], [0, 1237, 1238, 1239, 1240, 1]]

recon = model.decode(tokens)
print(recon)
# ["Hello", "World :)"]

# Forward pass returns the training loss
loss = model.forward(text, max_len=2048)
# 5.56...
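
As a quick sanity check of the compression achieved, the nested lists in tokens.data (as printed above) can be compared against raw character counts; this snippet assumes only the output format shown in the example.

# Rough compression check: discrete codes vs. raw character count.
texts = ["Hello", "World :)"]
tokens = model.encode(texts)
for t, ids in zip(texts, tokens.data):
    print(f"{len(t)} chars -> {len(ids)} codes")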

Training

Install the training dependencies

pip install -e '.[train]'

Set up the accelerate config

accelerate config

Create a config file like the one in configs/demo_run.py
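
The real schema is whatever configs/demo_run.py defines; purely as a hypothetical illustration (every field name below is an assumption, not the repository's actual config), a run config might collect settings such as:

# Hypothetical config sketch -- field names are assumptions; see
# configs/demo_run.py for the real schema.
demo_run = dict(
    dataset="some/hf-dataset",  # assumed: an HF datasets identifier
    max_len=2048,
    batch_size=8,
    lr=3e-4,
    num_steps=100_000,
)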

Then run the training

accelerate launch train.py demo_run

TODO

  • Dataloader with HF datasets
  • Add training
  • Implement varlen windowed flash attn
  • Validate idea with a simple experiment
  • GAN training
  • Variational + continuous bottleneck
