Karpathy's minBPE in Rust

minbpe

A byte-pair encoder written to practice Rust. It probably doesn't follow best practices yet, since it's mostly identical to minbpe's Python code; I'll review, reformat, and make it more Rust-like later.
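For orientation, here is a rough sketch of what a byte-level BPE train/encode/decode loop can look like in Rust. It is not the code in this repo: the function names, the `(pair, new_id)` merge list, and the tie-breaking rule (smallest pair wins) are all assumptions made for illustration, and the encode step simply replays merges in training order rather than mirroring minbpe's exact procedure.

```rust
use std::collections::HashMap;

/// Count how often each adjacent pair of token ids occurs.
fn pair_counts(ids: &[u32]) -> HashMap<(u32, u32), usize> {
    let mut counts = HashMap::new();
    for w in ids.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
}

/// Replace each occurrence of `pair` (scanning left to right) with `new_id`.
fn merge(ids: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len());
    let mut i = 0;
    while i < ids.len() {
        if i + 1 < ids.len() && (ids[i], ids[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(ids[i]);
            i += 1;
        }
    }
    out
}

/// Learn up to `num_merges` merges. Ids 0..=255 are raw bytes; new ids start at 256.
fn train(text: &str, num_merges: usize) -> Vec<((u32, u32), u32)> {
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    let mut merges = Vec::new();
    for n in 0..num_merges {
        let counts = pair_counts(&ids);
        // Most frequent pair. Assumption: ties go to the smallest pair so that
        // the result does not depend on HashMap iteration order.
        let best = counts
            .iter()
            .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
            .map(|(pair, _)| *pair);
        let pair = match best {
            Some(pair) => pair,
            None => break, // fewer than two tokens left, nothing to merge
        };
        let new_id = 256 + n as u32;
        ids = merge(&ids, pair, new_id);
        merges.push((pair, new_id));
    }
    merges
}

/// Encode by replaying the learned merges in training order (a simplification).
fn encode(text: &str, merges: &[((u32, u32), u32)]) -> Vec<u32> {
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    for &(pair, new_id) in merges {
        ids = merge(&ids, pair, new_id);
    }
    ids
}

/// Decode by rebuilding the byte string for every id, then concatenating.
/// Relies on `train` assigning ids 256, 257, ... in merge order.
fn decode(ids: &[u32], merges: &[((u32, u32), u32)]) -> String {
    let mut vocab: Vec<Vec<u8>> = (0..=255u8).map(|b| vec![b]).collect();
    for &((a, b), _) in merges {
        let mut bytes = vocab[a as usize].clone();
        bytes.extend_from_slice(&vocab[b as usize]);
        vocab.push(bytes);
    }
    let bytes: Vec<u8> = ids
        .iter()
        .flat_map(|&id| vocab[id as usize].iter().copied())
        .collect();
    String::from_utf8_lossy(&bytes).into_owned()
}
```

With this sketch, `train("aaabdaaabac", 3)` learns three merges, `encode` then turns the same string into five tokens, and `decode` gives the original text back.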

todo:

  • BasicTokenizer
    • train
    • encode
    • decode
    • save
    • load
    • vocab type should be Vec<u32>?
    • encode different from minbpe?
  • REPL <- (next)
    • correct prints/whitespaces
    • take train model params
  • CLI
    • take train model params
  • Validate results <- (next)
  • Set up tests <- (next)
    • self
    • vs minbpe
    • vs tiktoken
  • RegexTokenizer
  • GPT4Tokenizer
  • Tests + Compare
  • Structs / traits?
  • Review, Reorg, rustify
  • pyo3 python lib?