Language Models for Dataset Generation

@mdsrqbl mdsrqbl released this 19 Jul 08:26
· 18 commits to main since this release

This release includes language models to write text that can be translated by a rule-based text-to-sign translator.

The tlm_14.0.pt model (sign_language_translator.models.TransformerLanguageModel) is a custom transformer trained on ~800 MB of text composed only of words for which PakistanSignLanguage signs are available (see sign_recordings/collection_to_label_to_language_to_words.json). The tokenizer is sign_language_translator.languages.text.urdu.Urdu().tokenizer, with the digits in numbers and the letters in acronyms additionally split apart as individual tokens to limit the vocabulary size. A later update will generate disambiguated words. The start and end tokens are "<" and ">".
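The digit/acronym splitting described above can be sketched as a post-processing step over an existing tokenizer's output. This is an illustrative sketch only, not the library's actual implementation; the helper name is made up here.

```python
def split_numbers_and_acronyms(tokens):
    """Split digit runs and multi-letter uppercase acronyms into
    single-character tokens so the vocabulary stays small.
    (Hypothetical sketch, not the library's actual tokenizer.)"""
    out = []
    for tok in tokens:
        if tok.isdigit() or (tok.isalpha() and tok.isupper() and len(tok) > 1):
            out.extend(tok)  # "2023" -> "2","0","2","3"; "NASA" -> "N","A","S","A"
        else:
            out.append(tok)
    return out

print(split_numbers_and_acronyms(["call", "NASA", "in", "2023"]))
# ['call', 'N', 'A', 'S', 'A', 'in', '2', '0', '2', '3']
```

Splitting this way caps the number-related vocabulary at ten digit tokens instead of one token per distinct number seen in the corpus.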

The -mixed-.pkl model is trained on unambiguous, supported Urdu words from a corpus of around 10 MB (2.4 million tokens). It is a mixture of 6 n-gram models with context window sizes from 1 to 6, so it cannot handle longer-range dependencies and concept drift can be observed in longer generations. The tokenizer is slt.languages.text.urdu.Urdu().tokenizer. The start and end tokens are "<" and ">".
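The idea of mixing several fixed-window n-gram models can be sketched as follows. This is a minimal illustration of the concept, assuming a uniform mixture over orders; the class name and weighting are made up and need not match the released model's internals.

```python
import random
from collections import Counter, defaultdict

class MixedNgramModel:
    """Uniform mixture of n-gram models with context sizes 1..max_n.
    (Illustrative sketch of the -mixed- model's idea, not its actual code.)"""

    def __init__(self, max_n=3, start="<", end=">"):
        self.max_n, self.start, self.end = max_n, start, end
        # counts[n][context_tuple] -> Counter of next-token frequencies
        self.counts = {n: defaultdict(Counter) for n in range(1, max_n + 1)}

    def train(self, sentences):
        for tokens in sentences:
            seq = [self.start] + list(tokens) + [self.end]
            for n in range(1, self.max_n + 1):
                for i in range(len(seq) - n):
                    self.counts[n][tuple(seq[i:i + n])][seq[i + n]] += 1

    def next_token(self, context):
        # Average the conditional distributions of all orders that match.
        mixed = Counter()
        for n in range(1, self.max_n + 1):
            ctx = tuple(context[-n:])
            if ctx in self.counts[n]:
                total = sum(self.counts[n][ctx].values())
                for tok, c in self.counts[n][ctx].items():
                    mixed[tok] += c / total
        if not mixed:
            return self.end
        toks, weights = zip(*mixed.items())
        return random.choices(toks, weights=weights)[0]

    def generate(self, max_len=20):
        seq = [self.start]
        while seq[-1] != self.end and len(seq) < max_len:
            seq.append(self.next_token(seq))
        return seq[1:-1] if seq[-1] == self.end else seq[1:]
```

Because every context is truncated to at most max_n tokens, anything further back is invisible to the model, which is exactly why drift appears in long generations.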

The *.json models are made to demonstrate the functionality of n-gram models. The training data is text_preprocessing.json:person_names.

  • They contain n-gram-based statistical language models trained on 366 Urdu and 366 English names commonly used in Pakistan.
  • The models predict the next character from the previous 1 to 3 characters.
  • The start and end of sequence tokens are "[" and "]".
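A character-level n-gram name model of this kind can be sketched in a few lines. This is a toy illustration of the technique, assuming fixed-length contexts padded with the start token; the function names and the training names are made up, and the real models may differ.

```python
import random
from collections import Counter, defaultdict

def train_char_ngrams(names, n=3, start="[", end="]"):
    """Count next-character frequencies given the previous n characters.
    (Sketch of the *.json demo models; the training names here are invented.)"""
    counts = defaultdict(Counter)
    for name in names:
        seq = start * n + name + end  # pad so the first char has full context
        for i in range(len(seq) - n):
            counts[seq[i:i + n]][seq[i + n]] += 1
    return counts

def generate_name(counts, n=3, start="[", end="]", max_len=12):
    """Sample characters one at a time until the end token appears."""
    seq = start * n
    while not seq.endswith(end) and len(seq) < max_len + n:
        ctx = seq[-n:]
        if ctx not in counts:  # unseen context: stop early
            break
        chars, weights = zip(*counts[ctx].items())
        seq += random.choices(chars, weights=weights)[0]
    return seq.strip(start + end)
```

With only a few hundred training names, such a model mostly recombines familiar character patterns into plausible-looking new names.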