This repository can be used to prepare datasets for the RWKV model.
To speed up tokenization, install the fast RWKV tokenizer written in Rust:
pip install pyrwkv-tokenizer
python tools/preprocess_data.py --input ./sample.jsonl --output-prefix ./data/sample --vocab ./rwkv_vocab_v20230424.txt --dataset-impl mmap --tokenizer-type RWKVTokenizer --append-eod
Sample JSONL input (one line per document):
{"text": "This is the first document."}
{"text": "Hello\nWorld"}
{"text": "1+1=2\n1+2=3\n2+2=4"}
Each line can be generated by code like this:
import json
# Serialize one document per line; ensure_ascii=False keeps non-ASCII text readable
ss = json.dumps({"meta": meta, "text": text}, ensure_ascii=False)
out.write(ss + "\n")
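Putting the pieces together, here is a minimal self-contained sketch that writes a JSONL file in the format shown above. The document texts and the contents of "meta" are illustrative placeholders, not part of this repository:

```python
import json

# Illustrative documents; a real corpus would come from your own data source.
documents = [
    {"meta": {"id": 0}, "text": "This is the first document."},
    {"meta": {"id": 1}, "text": "Hello\nWorld"},
    {"meta": {"id": 2}, "text": "1+1=2\n1+2=3\n2+2=4"},
]

with open("sample.jsonl", "w", encoding="utf-8") as out:
    for doc in documents:
        # One JSON object per line; ensure_ascii=False preserves non-ASCII text
        out.write(json.dumps(doc, ensure_ascii=False) + "\n")
```

The resulting sample.jsonl can then be passed to tools/preprocess_data.py via --input.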