# Vocabulary Adaptation for MPT and BLOOM Models

## Tokenizer-Embedding Pipeline

1. To train the Indic tokenizer and obtain the final tokenizer, follow the `tokenizer_setup` directory.
2. To evaluate the resulting tokenizer, follow the `tokenizer_evaluation` directory.
3. To compute embeddings using WECHSEL, follow the `Wechsel_Setup` directory (see the sketch after this list).
4. To initialize the model's word-embedding layer, follow the `InitializationWordEmbed` directory.
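
A minimal sketch of the WECHSEL transfer plus embedding initialization (steps 3 and 4), based on the `wechsel` library's documented API. The model name, tokenizer path, target language, and dictionary name below are illustrative assumptions, not the exact configuration in `Wechsel_Setup`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from wechsel import WECHSEL, load_embeddings

# Source model and tokenizer (MPT shown; BLOOM works the same way).
source_tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

# The Indic tokenizer produced by the tokenizer_setup step (hypothetical path).
target_tokenizer = AutoTokenizer.from_pretrained("path/to/indic_tokenizer")

wechsel = WECHSEL(
    load_embeddings("en"),         # fastText embeddings, source language
    load_embeddings("hi"),         # fastText embeddings, target language (Hindi assumed)
    bilingual_dictionary="hindi",  # assumed name of a dictionary shipped with wechsel
)

# Project the source embedding matrix onto the new vocabulary.
target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

# Initialize the model's word-embedding layer with the transferred matrix.
model.resize_token_embeddings(len(target_tokenizer))
model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)
```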

## Results

1. Results are available at https://docs.google.com/spreadsheets/d/1npkCffkNyztbPZokK9vis19zvzzT07l-uWnN06aiOeQ/edit#gid=868636088
2. Meeting notes, to-do lists, observations, etc. are at https://docs.google.com/document/d/1dOegfXg8v5NBYXlCZgLDnkLBjP1YD_6K47kHh_5ojd0/edit

## File Specification

1. `seed_data_test_split.py`: splits the seed dataset into train (90%) and test (10%) sets.
2. `merge_training_seed.py`: merges the training data.
3. `tokenizer_specification.py`: reports how two tokenizers are related, e.g. intersecting tokens and average tokenization length per sentence.
4. `combine_tokenizer.py`: combines two tokenizers (the one used for the extended version); see the merge sketch below.
5. `train_tokenizer.py`: trains a tokenizer from scratch.
6. `MPT_inference.py` and `IndicMPT_inference.py`: calculate the perplexity score from inference alone (no training); see the perplexity sketch below.
7. `MPT_train.py` and `IndicMPT_train.py`: train the LoRA adapter and the model's word-embedding layer; see the PEFT sketch below.
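
One common way to realize the tokenizer merge in `combine_tokenizer.py` is to append the target-language tokens that are missing from the base vocabulary. A sketch under that assumption (the script's actual merge logic may differ; paths are placeholders):

```python
from transformers import AutoTokenizer

# Base tokenizer and a separately trained Indic tokenizer (hypothetical path).
base = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
indic = AutoTokenizer.from_pretrained("path/to/indic_tokenizer")

# Append only the Indic tokens that the base vocabulary does not already contain.
new_tokens = sorted(set(indic.get_vocab()) - set(base.get_vocab()))
num_added = base.add_tokens(new_tokens)
print(f"added {num_added} tokens; new vocabulary size = {len(base)}")

base.save_pretrained("extended_tokenizer")
```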
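The inference-only perplexity in `MPT_inference.py`/`IndicMPT_inference.py` amounts to exponentiating the mean token-level cross-entropy. A minimal sketch of that idea; the model name and evaluation text are placeholders:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mosaicml/mpt-7b"  # placeholder; the Indic variant would load the adapted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels set, the model returns the mean token-level cross-entropy loss.
        out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])
    return math.exp(out.loss.item())

print(perplexity("Sample evaluation sentence."))
```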
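Training only the LoRA adapters plus the word-embedding layer, as `MPT_train.py`/`IndicMPT_train.py` do, can be expressed with the PEFT library. The module names and hyperparameters below are assumptions based on MPT's architecture, not the scripts' actual configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

lora_config = LoraConfig(
    r=8,                       # assumed rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["Wqkv"],   # MPT's fused Q/K/V projection; an assumption about the scripts
    modules_to_save=["wte"],   # additionally make the word-embedding layer fully trainable
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only LoRA weights and the embedding layer train
# Training itself would use a standard transformers Trainer or a custom loop.
```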