Language modelling requires sequential input data: given a sequence of words/tokens, the aim is to predict the next word/token.
The next step is tokenization. Tokenization is the process of extracting tokens (terms/words) from a corpus. Keras provides a built-in `Tokenizer` class that can be used to obtain the tokens and their indices in the corpus. After this step, every text document in the dataset is converted into a sequence of tokens.
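As a rough sketch of this step (with `corpus` as a hypothetical stand-in for the lines of the actual dataset), the Keras `Tokenizer` can be fit and applied like this:

```python
# A minimal sketch of Keras tokenization; `corpus` is a hypothetical
# placeholder for the lines of the real dataset.
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = [
    "नछाडी जानोस् हे मेरा प्राण",
    # ... remaining lines of the dataset
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)                # build the word -> index vocabulary
total_words = len(tokenizer.word_index) + 1   # +1 because index 0 is reserved for padding

# Convert each document into its sequence of token indices.
sequences = tokenizer.texts_to_sequences(corpus)
```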
Let's take a look at one of the input sequences:
```
[[1337, 2623],
 [1337, 2623, 12],
 [1337, 2623, 12, 39],
 [1337, 2623, 12, 39, 103]]
```
In the above output, [1337, 2623], [1337, 2623, 12], [1337, 2623, 12, 39] and so on represent the n-gram phrases generated from the input data, where every integer is the index of a particular word in the vocabulary built from the text. For example, for the line
नछाडी जानोस् हे मेरा प्राण
| N-gram | Sequence of tokens |
|---|---|
| नछाडी जानोस् | [1337, 2623] |
| नछाडी जानोस् हे | [1337, 2623, 12] |
| नछाडी जानोस् हे मेरा | [1337, 2623, 12, 39] |
| नछाडी जानोस् हे मेरा प्राण | [1337, 2623, 12, 39, 103] |
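These prefix sequences can be generated with a simple loop. The sketch below assumes the `tokenizer` and `corpus` from the earlier snippet; the exact integer indices depend on the fitted vocabulary.

```python
# A minimal sketch of n-gram sequence generation: every prefix of a
# tokenized line (of length >= 2) becomes one training example.
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(2, len(token_list) + 1):
        input_sequences.append(token_list[:i])
```

Before training, these variable-length sequences are typically padded to a common length (for instance with Keras's `pad_sequences`), with the last token of each sequence serving as the prediction target.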
You can find the original dataset
📰📰HERE📰📰
Download the pretrained weights from
📦📦HERE📦📦