Tashkeel

Overview

This project addresses the problem of Arabic diacritization (tashkeel) using a Bi-LSTM.

For example:

ذَهَبَ اَلْوَلَدُ إِلَى اَلْمَدْرَسَةِ <-- ذهب الولد إلى المدرسة

Our Achievement

Our team ranked 5th on the leaderboard, with an accuracy of 97% on the hidden test set.

Get Started

For Cleaning

python cleaning.py --mode (choices: train, test, validate)
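As an illustration of how a `--mode` flag restricted to these three choices can be parsed, here is a minimal sketch using the standard `argparse` module. The function name `build_parser` and the help text are hypothetical and are not taken from `cleaning.py`.

```python
import argparse

def build_parser():
    """Build a parser that accepts only the three documented modes (sketch)."""
    parser = argparse.ArgumentParser(description="Clean one dataset split.")
    parser.add_argument(
        "--mode",
        choices=["train", "test", "validate"],
        required=True,
        help="which split to clean",
    )
    return parser

# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--mode", "validate"])
print(args.mode)  # validate
```

With `choices` set, argparse rejects any other value and prints a usage message.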

For Training

val_train_sentences, val_train_labels, val_train_size = get_params(vocab_map, classes, 'val_train_X.pickle', 'val_train_Y.pickle')
val_train_dataset = TashkeelDataset(val_train_sentences, val_train_labels, vocab_map['<PAD>'], max_length)
model = Tashkeel()
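The dataset constructor above receives the `<PAD>` index and `max_length`, which suggests that sequences are brought to a fixed length before batching. The sketch below shows one common way to do that. The helper `pad_or_truncate` and the toy `vocab_map` are assumptions for illustration, not the actual `TashkeelDataset` logic.

```python
def pad_or_truncate(ids, pad_id, max_length):
    """Return a copy of ids with exactly max_length entries."""
    clipped = list(ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))

# Hypothetical toy vocabulary; the real vocab_map is built by the project.
vocab_map = {"<PAD>": 0, "\u0630": 1, "\u0647": 2, "\u0628": 3}
sentence = [vocab_map[c] for c in "\u0630\u0647\u0628"]

padded = pad_or_truncate(sentence, vocab_map["<PAD>"], max_length=5)
print(padded)  # [1, 2, 3, 0, 0]
```

The label sequence would need the same treatment so that every character position still lines up with its class.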

For Inference

inferencing = model.inference(test_input)

Modules

Preprocessing

Cleaning Process [Train & Validation Only]

  1. Remove HTML tags
  2. Remove URLs
  3. Remove the Arabic elongation character (Kashida, U+0640)
  4. Separate Numbers
  5. Remove Multiple Whitespaces
  6. Remove Punctuation
  7. Remove English letters and both English and Arabic numerals
  8. Remove shifts
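A few of the steps above can be sketched with regular expressions. This is a minimal illustration covering only steps 1, 2, 3, 5 and 7. The patterns and the name `clean_text` are assumptions, not the contents of `cleaning.py`, and the remaining steps are left out because their exact behaviour is not described here.

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")                    # step 1: HTML tags
URL = re.compile(r"https?://\S+|www\.\S+")           # step 2: URLs
KASHIDA = "\u0640"                                   # step 3: tatweel character
LATIN_AND_DIGITS = re.compile(r"[A-Za-z0-9\u0660-\u0669]")  # step 7
SPACES = re.compile(r"[ \t]+")                       # step 5: runs of spaces/tabs

def clean_text(text):
    """Apply a subset of the cleaning steps, in the listed order."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = text.replace(KASHIDA, "")
    text = LATIN_AND_DIGITS.sub(" ", text)
    # Newlines are kept so that a later sentence split on "\n" still works.
    return SPACES.sub(" ", text).strip()

raw = "<p>\u0630\u0647\u0628 \u0627\u0644\u0648\u0644\u0640\u0640\u062f</p> http://example.com abc 123"
print(clean_text(raw))
```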

Tokenization

  • Split Using: [\n.,،؛:«»?؟]+
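The split pattern above can be applied directly with Python's `re` module. The sketch below uses the same character class; the function name `split_sentences` and the choice to drop empty pieces are assumptions.

```python
import re

# Same character class as listed above: newline, Latin and Arabic punctuation.
SPLIT_PATTERN = re.compile(r"[\n.,،؛:«»?؟]+")

def split_sentences(text):
    """Split on runs of delimiters and discard empty or whitespace-only pieces."""
    return [piece.strip() for piece in SPLIT_PATTERN.split(text) if piece.strip()]

sample = "one. two, three\nfour"
print(split_sentences(sample))  # ['one', 'two', 'three', 'four']
```

Inside a character class, `.` and `?` are literal, so no escaping is needed.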

Fix Diacritization Issue [Train & Validation Only]

  1. Replace consecutive diacritics with a single diacritic
  2. Ending Diacritics: Remove diacritics at the end of a word
  3. Misplaced Diacritics: Remove spaces between characters and diacritics
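Steps 1 and 3 above can be sketched as two substitutions. This version rests on assumptions: it collapses only repeats of the same diacritic, so that a Shadda followed by a vowel mark is preserved, and it leaves step 2 out because its exact rule is not spelled out here. The name `fix_diacritics` is hypothetical.

```python
import re

# U+064B..U+0652 covers the eight basic marks (tanween, short vowels, shadda, sukun).
SPACE_BEFORE_MARK = re.compile(r"\s+([\u064b-\u0652])")
REPEATED_MARK = re.compile(r"([\u064b-\u0652])\1+")

def fix_diacritics(text):
    """Reattach marks separated by spaces, then collapse repeated identical marks."""
    text = SPACE_BEFORE_MARK.sub(r"\1", text)   # step 3: misplaced diacritics
    return REPEATED_MARK.sub(r"\1", text)       # step 1: consecutive diacritics

broken = "\u0630\u064e \u064e\u0647\u0628"  # fatha, space, stray second fatha
print(fix_diacritics(broken) == "\u0630\u064e\u0647\u0628")  # True
```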

Tashkeel Removal [Train & Validation Only]

  • Remove gold class for every character
  • Harakat:
    1. "Fatha":"\u064e"
    2. "Fathatan":"\u064b"
    3. "Damma":"\u064f"
    4. "Dammatan":"\u064c"
    5. "Kasra":"\u0650"
    6. "Kasratan":"\u064d"
    7. "Sukun":"\u0652"
    8. "Shadda":"\u0651"
    9. "Shadda Fatha":"\u0651\u064e"
    10. "Shadda Fathatan":"\u0651\u064b"
    11. "Shadda Damma":"\u0651\u064f"
    12. "Shadda Dammatan":"\u0651\u064c"
    13. "Shadda Kasra":"\u0651\u0650"
    14. "Shadda Kasratan":"\u0651\u064d"
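To show how the harakat table can turn diacritized text into bare characters plus one class per character, here is a minimal sketch. The class numbering (1 to 14 in the order listed, 0 for "no diacritic") and the name `split_labels` are assumptions; the project may encode classes differently.

```python
HARAKAT = [
    "\u064e", "\u064b", "\u064f", "\u064c", "\u0650", "\u064d", "\u0652", "\u0651",
    "\u0651\u064e", "\u0651\u064b", "\u0651\u064f", "\u0651\u064c",
    "\u0651\u0650", "\u0651\u064d",
]
CLASS_IDS = {mark: i + 1 for i, mark in enumerate(HARAKAT)}  # 0 = no diacritic
DIACRITIC_CHARS = set("".join(HARAKAT))

def split_labels(text):
    """Return (text without diacritics, list of class ids, one per character)."""
    chars, labels = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch in DIACRITIC_CHARS:
            continue  # stray mark with no base character: skipped
        j = i
        while j < len(text) and text[j] in DIACRITIC_CHARS:
            j += 1
        chars.append(ch)
        # Combinations not in the table (e.g. vowel before shadda) fall back to 0.
        labels.append(CLASS_IDS.get(text[i:j], 0))
        i = j
    return "".join(chars), labels

bare, labels = split_labels("\u0645\u064f\u062f\u0651\u064e")
print(labels)  # [3, 9]
```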

Reference: Arabic Text Diacritization Using Deep Neural Networks

Network

import torch.nn as nn

# vocab_size and n_classes default to module-level variables that must be defined beforehand
class Tashkeel(nn.Module):
  def __init__(self, vocab_size=vocab_size, embedding_dim=100, hidden_size=256, n_classes=n_classes):
    """
    The constructor of our Tashkeel model
    Inputs:
    - vocab_size: the number of unique words
    - embedding_dim: the embedding dimension
    - hidden_size: the LSTM hidden size per direction
    - n_classes: the number of final classes (tags)
    """
    super(Tashkeel, self).__init__()
    # (1) Create the embedding layer
    self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

    # (2) Create a 2-layer bidirectional LSTM with hidden size = hidden_size and batch_first = True
    self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True, num_layers=2, bidirectional=True)

    # (3) Create a linear layer with number of neurons = n_classes
    # Its input is 2*hidden_size because the forward and backward states are concatenated
    self.linear = nn.Linear(2 * hidden_size, n_classes)
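To get a feel for the size of this architecture, the parameter count can be worked out by hand. The sketch below follows PyTorch's LSTM layout (four gates, two bias vectors per layer and direction). The values `vocab_size=40` and `n_classes=15` are placeholders chosen for illustration, since the real values are set elsewhere in the project.

```python
def lstm_param_count(input_size, hidden_size, num_layers, bidirectional):
    """Count weights and biases of a stacked LSTM laid out as in torch.nn.LSTM."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # 4 gates, each with input weights, recurrent weights and two bias vectors.
        per_direction = 4 * hidden_size * (layer_input + hidden_size) + 2 * 4 * hidden_size
        total += directions * per_direction
        # Deeper layers read the concatenated outputs of both directions.
        layer_input = directions * hidden_size
    return total

def tashkeel_param_count(vocab_size, n_classes, embedding_dim=100, hidden_size=256):
    embedding = vocab_size * embedding_dim
    lstm = lstm_param_count(embedding_dim, hidden_size, num_layers=2, bidirectional=True)
    linear = (2 * hidden_size) * n_classes + n_classes  # weights + bias
    return embedding + lstm + linear

print(lstm_param_count(100, 256, 2, True))            # 2310144
print(tashkeel_param_count(vocab_size=40, n_classes=15))  # 2321839
```

Almost all parameters sit in the two LSTM layers; the embedding and output layers are small by comparison.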

Reference: Effective Deep Learning Models for Automatic Diacritization of Arabic Text

Contributors

Ahmed Hany

Mohab Zaghloul

Shaza Mohamed

Basma Elhoseny
