The train_hf_tokenizer.py script is designed for setting up and training a HuggingFace tokenizer. Because the HuggingFace Transformers and Tokenizers libraries are vast, and because tokenization is highly task-specific, users are encouraged to adapt train_hf_tokenizer.py to fit the needs of their task.
In general, train_hf_tokenizer.py prepares and saves a HuggingFace tokenizer, for later use in tokenization and training.
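As a rough illustration of what such a script does, here is a minimal sketch using the HuggingFace `tokenizers` library. The BPE model, the special tokens, the tiny in-memory corpus, and the output path are all hypothetical placeholders; a real train_hf_tokenizer.py would read these from the configuration file and train on the actual dataset.

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical in-memory corpus; a real script would stream the training data.
corpus = ["hello world", "hello tokenizer training"]

# BPE is one common choice; the model type is task-specific.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Stands in for the tokenizer_path folder from the configuration file.
out_dir = tempfile.mkdtemp()
tokenizer.save(os.path.join(out_dir, "tokenizer.json"))
```

The saved tokenizer.json can later be reloaded with `Tokenizer.from_file(...)` for tokenization and training.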
Helpful HuggingFace Tokenizer docs:
- HuggingFace Tokenizers docs
- Transformer's PreTrainedTokenizer class docs
- Tutorial on building HuggingFace tokenizer from scratch
There are a few different approaches to tokenizing data: it can be done in the preprocessing stage, or on the fly during training. This should be taken into account when creating the tokenizer. See Tokenizing Data for more information.
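The distinction can be sketched as follows. A throwaway tokenizer is built inline here only so the example is self-contained; in practice you would load the one saved by train_hf_tokenizer.py with `Tokenizer.from_file(...)`, and the corpus string is a placeholder.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Self-contained stand-in; in practice:
#   tok = Tokenizer.from_file("path/to/tokenizer.json")
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])
tok.train_from_iterator(["a tiny corpus for the example"], trainer=trainer)

# Preprocessing-stage tokenization: encode once up front and store the ids
# alongside the dataset, so training reads pre-tokenized examples.
ids = tok.encode("a tiny corpus").ids

# Training-stage tokenization would instead call tok.encode(...) on raw text
# inside the data-loading path, trading preprocessing time for runtime cost.
```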
Ensure that the proper parameters are set in the configuration file.
- tokenizer_path — should be a path to a folder for HuggingFace, or a single file for SentencePiece.
- pad_id — is more SentencePiece-specific.
- vocab_size — needs to be defined for HuggingFace; not necessary for SentencePiece.
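For reference, the relevant fields might look like the fragment below, assuming a YAML-style configuration file. Only the key names come from this doc; the values are hypothetical placeholders.

```yaml
# Hypothetical values; adjust for your task.
tokenizer_path: tokenizers/my_hf_tokenizer/   # a folder for HuggingFace; a file for SentencePiece
vocab_size: 32000                             # required for HuggingFace
# pad_id: 0                                   # SentencePiece-specific; usually omitted for HuggingFace
```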
Ensure that your train_hf_tokenizer.sh points to the configuration file. Then it can be run with: sbatch train_hf_tokenizer.sh
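A train_hf_tokenizer.sh submission script might look like the following sketch. The SBATCH resource values, the environment setup, and the config filename are all hypothetical; use whatever your cluster and project actually require.

```bash
#!/bin/bash
#SBATCH --job-name=train_hf_tokenizer   # hypothetical resource settings
#SBATCH --time=01:00:00
#SBATCH --mem=16G

# Activate your environment here if needed, then point the script
# at your configuration file (filename is a placeholder).
python train_hf_tokenizer.py --config config.yaml
```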