avada-z/LLamaTokenizerOnBash

Very, VERY badly coded llama tokenizer using only Bash and some tools (bc, awk, sed, tr, od, etc..)
Since I have nothing to do in my free time, I coded this, using https://github.com/belladoreai/llama-tokenizer-js as inspiration.

To use this, first install the tools the script relies on (bc, awk, sed, tr, od, and so on; anything missing will show up as an error when you run it).
Then pass your input text as an argument to the script, for example:
./tokenizer.sh " grabbed"

Output:

▁▁grabbed
1
29871
2646
1327
287
<s>▁▁grabbed
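Judging only from the first output line, the input seems to get an extra leading space, and every space is then shown as ▁ (U+2581). Here is a minimal sketch of that step under that assumption. `normalize_input` is a made-up name for illustration, not a function taken from tokenizer.sh:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from tokenizer.sh): mimics the first output line.
normalize_input() {
  # Assumption: a single space is prepended to the raw input.
  local text=" $1"
  # Replace every space with the U+2581 marker.
  printf '%s\n' "${text// /▁}"
}

normalize_input " grabbed"   # prints ▁▁grabbed
```

The pattern substitution works on bytes, so it should not depend on the locale, as long as the script file itself is saved as UTF-8.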

The first line is a representation of your input before it is processed into tokens, followed by the token IDs, then a reconstruction of the input from those tokens (with colors, yay). Since I couldn't work out the sorting algorithm of the original JS implementation (I have no clue about nodes stuff),
I did something that seems to work in a very similar way, even though it doesn't match in 100% of cases.
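For readers who want a feel for what a merging stage does, here is a rough sketch of a greedy pair-merge loop. It is not the code from tokenizer.sh and not the ordering used by llama-tokenizer-js. `TOY_VOCAB`, its IDs, and `bpe_merge` are all invented for illustration, and the rule "merge the pair with the lowest ID, leftmost on ties" is an assumption chosen to keep the example short:

```shell
#!/usr/bin/env bash
# Toy vocabulary: merged piece -> priority ID. All entries are invented.
declare -A TOY_VOCAB=( [ed]=1 [gr]=2 [gra]=3 [bb]=4 [▁gra]=5 )

# Hypothetical helper: takes the starting pieces as separate arguments
# and merges adjacent pairs until no pair is found in TOY_VOCAB.
bpe_merge() {
  local -a pieces=("$@")
  local i id pair best_i best_id
  while :; do
    best_i=-1
    best_id=-1
    # Scan every adjacent pair and remember the one with the lowest ID.
    for (( i = 0; i + 1 < ${#pieces[@]}; i++ )); do
      pair="${pieces[i]}${pieces[i+1]}"
      id="${TOY_VOCAB[$pair]:-}"
      # Strict "<" keeps the leftmost pair when two IDs are equal.
      if [[ -n "$id" ]] && (( best_i < 0 || id < best_id )); then
        best_i=$i
        best_id=$id
      fi
    done
    # Stop when no adjacent pair is mergeable.
    (( best_i < 0 )) && break
    # Replace the two pieces at best_i and best_i+1 with their concatenation.
    pieces=("${pieces[@]:0:best_i}" \
            "${pieces[best_i]}${pieces[best_i+1]}" \
            "${pieces[@]:best_i+2}")
  done
  printf '%s\n' "${pieces[@]}"
}

bpe_merge ▁ ▁ g r a b b e d
```

With this toy table the merges happen in the order ed, gr, gra, bb, ▁gra, leaving the pieces ▁, ▁gra, bb and ed. Passing the pieces as separate arguments avoids having to split a UTF-8 string into characters, which depends on the locale in Bash. Rescanning the whole list on every pass is slow, so a real implementation would likely use a priority queue instead.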
It also doesn't handle hex tokens in the merging stage yet, but maybe I'll fix it (or someone could open a pull request..?)
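Assuming "hex tokens" means the byte-fallback entries written as `<0xNN>`, which in the LLaMA 1/2 vocabulary sit at IDs 3 through 258 (byte value plus 3), the mapping from raw bytes to those tokens could look roughly like this. `byte_fallback_tokens` is a hypothetical name, and the sketch only shows the byte-to-ID arithmetic, not when a tokenizer decides to fall back:

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints one "<0xNN> id" line per byte of the input.
byte_fallback_tokens() {
  local hex
  # od dumps the bytes as hex; tr puts each byte on its own line.
  printf '%s' "$1" | od -An -v -tx1 | tr -s ' ' '\n' | while read -r hex; do
    [ -n "$hex" ] || continue
    # Assumption: token ID = byte value + 3 (IDs 0..2 are the special tokens).
    printf '<0x%s> %d\n' "${hex^^}" "$(( 16#$hex + 3 ))"
  done
}

byte_fallback_tokens "é"
```

For "é" (bytes c3 a9 in UTF-8) this prints `<0xC3> 198` and `<0xA9> 172`. The `${hex^^}` uppercase expansion needs Bash 4 or newer.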

Also I'm sure I'll forget how this works in a week lol.

