MuRIL (Multilingual Representations for Indian Languages) is a BERT-based model pre-trained on text in Indian languages such as Hindi, Tamil and Kannada.
This repository walks you through fine-tuning an NLP model for a task-specific application using the Transformers (:hugs:) implementation. We will deal with a hate-speech classification task in this one, but remember: you can always generalise it to any number of classes (as long as you can procure the right dataset :grinning:) just by adjusting the number of outputs in the final layer.
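To make "adjusting the number of outputs in the final layer" concrete, here is a minimal sketch using Transformers and PyTorch. It assumes a deliberately tiny, randomly initialised BERT config (the sizes below are made up so the example runs without downloading anything); it is not the notebook's actual setup. The only thing that matters is `num_labels`, which sets the width of the classification layer.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Number of target classes. Six matches the task in this README;
# change this value to generalise to another dataset.
NUM_LABELS = 6

# ASSUMPTION: a toy configuration chosen only to keep the sketch small and offline.
# A real run would load a pre-trained checkpoint instead, for example
# AutoModelForSequenceClassification.from_pretrained("google/muril-base-cased",
#                                                    num_labels=NUM_LABELS)
# (checkpoint name is an assumption, it is not stated in this README).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=NUM_LABELS,
)

# BertForSequenceClassification puts a linear layer with `num_labels`
# outputs on top of the pooled BERT representation.
model = BertForSequenceClassification(config)
model.eval()

# A fake batch: 2 sequences of 8 random token ids each.
input_ids = torch.randint(0, config.vocab_size, (2, 8))

with torch.no_grad():
    logits = model(input_ids=input_ids).logits

# One score per class for every sequence in the batch.
print(logits.shape)
```

Changing `NUM_LABELS` is enough to resize the final layer; everything below it stays the same.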
TASK: A six-class classification problem on Tamil, Kannada and Malayalam tweets (credits: ACL).
Class Labels:
- 'Not_offensive'
- 'Offensive_Targeted_Insult_Group'
- 'Offensive_Targeted_Insult_Individual'
- 'Offensive_Targeted_Insult_Other'
- 'Offensive_Untargeted'
- 'Not {language_name}'
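A classifier works with integer ids rather than label strings, so the labels above are usually turned into a pair of lookup tables. The sketch below is one way to do that; the helper name `build_label_maps` is hypothetical, and the exact spelling of the labels in the dataset files may differ from what is listed here.

```python
# Label names as listed in this README. The last one depends on the language,
# so it is kept as a template and filled in per dataset.
LABEL_TEMPLATES = [
    "Not_offensive",
    "Offensive_Targeted_Insult_Group",
    "Offensive_Targeted_Insult_Individual",
    "Offensive_Targeted_Insult_Other",
    "Offensive_Untargeted",
    "Not {language_name}",
]


def build_label_maps(language_name):
    """Hypothetical helper: return (label2id, id2label) for one language."""
    labels = [t.format(language_name=language_name) for t in LABEL_TEMPLATES]
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps("Tamil")

# The number of entries is the number of outputs the model needs.
num_labels = len(label2id)
print(num_labels)
```

The same two dictionaries can be passed to a Transformers config (as `label2id` and `id2label`) so that predictions are reported with readable names.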
NOTE: You might want to make a copy of the notebook first.
What do we need?
- A basic understanding of how BERT works. (Insightful: Article)
- Understanding of tokenizers and word embeddings.
- PyTorch framework.
Click the icon to know more about PyTorch and how it works