Byte-pair encoder written to practice Rust. It probably doesn't follow best practices at the moment, since it is mostly identical to minGPT's Python code; I will review, reformat, and try to make it more Rust-like later.
- BasicTokenizer
- train
- encode
- decode
- save
- load
- vocab type should be `Vec<u32>`?
- encode different from minbpe?
- REPL <- (next)
- correct prints/whitespace
- take train model params
- CLI
- take train model params
- Validate results <- (next)
- Set-up Tests <- (next)
- self
- vs minbpe
- vs tiktoken
- RegexTokenizer
- GPT4Tokenizer
- Tests + Compare
- Structs / Traits?
- Review, reorganize, rustify
- pyo3 python lib?
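
For reference, the `train` / `encode` / `decode` items under BasicTokenizer could look roughly like the sketch below. This is a minimal illustration of byte-level BPE, not the code in this repo: the struct layout, the field names (`merges`, `vocab`), the function signatures, and the tie-breaking rule are all assumptions. In particular, `encode` here simply replays the merges in learned order, which is a simplification and may not match how minbpe's encode loop is written.

```rust
use std::collections::HashMap;

/// Count how often each adjacent pair of token ids occurs.
fn pair_counts(ids: &[u32]) -> HashMap<(u32, u32), usize> {
    let mut counts = HashMap::new();
    for w in ids.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
}

/// Replace every non-overlapping occurrence of `pair` (left to right) with `new_id`.
fn merge(ids: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len());
    let mut i = 0;
    while i < ids.len() {
        if i + 1 < ids.len() && (ids[i], ids[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(ids[i]);
            i += 1;
        }
    }
    out
}

/// Hypothetical layout; the real struct in this repo may differ.
struct BasicTokenizer {
    /// Learned merges in training order: (pair, id assigned to the pair).
    merges: Vec<((u32, u32), u32)>,
    /// Byte sequence for each token id (index = token id).
    vocab: Vec<Vec<u8>>,
}

impl BasicTokenizer {
    /// Learn merges until the vocabulary reaches `vocab_size` (256 base bytes + merges).
    fn train(text: &str, vocab_size: usize) -> Self {
        let mut vocab: Vec<Vec<u8>> = (0..=255u8).map(|b| vec![b]).collect();
        let mut merges = Vec::new();
        let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
        while vocab.len() < vocab_size {
            // Most frequent pair; ties go to the smallest pair so runs are
            // deterministic (assumed rule, HashMap order is otherwise random).
            let best = pair_counts(&ids)
                .into_iter()
                .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)))
                .map(|(pair, _)| pair);
            let Some(pair) = best else { break }; // nothing left to merge
            let new_id = vocab.len() as u32;
            ids = merge(&ids, pair, new_id);
            let mut bytes = vocab[pair.0 as usize].clone();
            bytes.extend_from_slice(&vocab[pair.1 as usize]);
            vocab.push(bytes);
            merges.push((pair, new_id));
        }
        Self { merges, vocab }
    }

    /// Turn text into token ids by replaying the merges in learned order.
    fn encode(&self, text: &str) -> Vec<u32> {
        let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
        for &(pair, new_id) in &self.merges {
            ids = merge(&ids, pair, new_id);
        }
        ids
    }

    /// Turn token ids back into text. Panics on ids outside the vocabulary;
    /// invalid UTF-8 is replaced with U+FFFD.
    fn decode(&self, ids: &[u32]) -> String {
        let bytes: Vec<u8> = ids
            .iter()
            .flat_map(|&id| self.vocab[id as usize].iter().copied())
            .collect();
        String::from_utf8_lossy(&bytes).into_owned()
    }
}
```

With this sketch, `BasicTokenizer::train("aaabdaaabac", 259)` learns three merges, and encoding the same string followed by `decode` gives back the original text. A round trip like that is also a natural first case for the "Set-up Tests" item.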