A system that takes a sentence and produces the same sentence after restoring the missing diacritics.

BasmaElhoseny01/Tashkeel


Tashkeel

Overview

This project addresses the problem of Arabic diacritization using a Bi-LSTM.

For example:

ذَهَبَ اَلْوَلَدُ إِلَى اَلْمَدْرَسَةِ <-- ذهب الولد إلى المدرسة

Our Achievement

We ranked 5th on the leaderboard, with an accuracy of 97% on the hidden test set.

Get Started

For Cleaning

python cleaning.py --mode (choices: train, test, validate)
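
The source of cleaning.py is not shown here. As a minimal sketch, assuming the script parses its options with argparse, a --mode flag restricted to the three listed choices could be declared as follows. The function name build_parser and the help strings are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: the real cleaning.py may define its options differently.
    parser = argparse.ArgumentParser(description="Clean one split of the dataset.")
    parser.add_argument(
        "--mode",
        choices=["train", "test", "validate"],  # the three choices named above
        required=True,
        help="which data split to clean",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--mode", "train"])
print(args.mode)
```

With choices set, argparse rejects any other value and prints the allowed ones, so the script does not need its own validation of the mode.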

For Training

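# Load the encoded sentences and labels from the pickled split
val_train_sentences, val_train_labels, val_train_size = get_params(vocab_map, classes, 'val_train_X.pickle', 'val_train_Y.pickle')
# Wrap them in a dataset that pads every sequence with the <PAD> id up to max_length
val_train_dataset = TashkeelDataset(val_train_sentences, val_train_labels, vocab_map['<PAD>'], max_length)
# Build the model with its default hyperparameters
model = TashkeelDataset and Tashkeel are defined in the repository; instantiate the model
model = Tashkeel()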
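
TashkeelDataset receives a padding id and a max_length, which suggests that every sequence is brought to a fixed length. The sketch below shows one way that could work, assuming sequences are truncated when too long and right-padded when too short. The helper pad_or_truncate and the toy vocabulary are hypothetical and not taken from the repository.

```python
from typing import List


def pad_or_truncate(ids: List[int], pad_id: int, max_length: int) -> List[int]:
    """Return a copy of `ids` that is exactly `max_length` items long."""
    clipped = ids[:max_length]  # drop anything beyond max_length
    return clipped + [pad_id] * (max_length - len(clipped))  # right-pad the rest


# Toy vocabulary for illustration only; the real vocab_map comes from the data.
vocab_map = {
    "<PAD>": 0,
    "\u0630": 1,  # thal
    "\u0647": 2,  # heh
    "\u0628": 3,  # beh
}
encoded = [vocab_map[ch] for ch in "\u0630\u0647\u0628"]
print(pad_or_truncate(encoded, vocab_map["<PAD>"], 5))
```

The labels would need the same treatment so that each character position still lines up with its class.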

For Inference

inferencing = model.inference(test_input)
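
The return format of model.inference is not documented here. Assuming it yields one class index per input character, the predicted marks can be interleaved with the undiacritized text as sketched below. The CLASSES order (index 0 for "no diacritic", then the Harakat in the order listed under Modules) and the function apply_diacritics are assumptions made for illustration.

```python
from typing import List

# Hypothetical class order: index 0 means "no diacritic"; the repository's
# own class map may be ordered differently.
CLASSES = [
    "",
    "\u064e", "\u064b", "\u064f", "\u064c", "\u0650", "\u064d",
    "\u0652", "\u0651",
    "\u0651\u064e", "\u0651\u064b", "\u0651\u064f",
    "\u0651\u064c", "\u0651\u0650", "\u0651\u064d",
]


def apply_diacritics(text: str, class_ids: List[int]) -> str:
    """Append the predicted diacritic (possibly empty) after each character."""
    if len(text) != len(class_ids):
        raise ValueError("need exactly one class id per character")
    return "".join(ch + CLASSES[i] for ch, i in zip(text, class_ids))


# Three letters, each predicted as class 1 (Fatha in this assumed ordering).
print(apply_diacritics("\u0630\u0647\u0628", [1, 1, 1]))
```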

Modules

Preprocessing

Cleaning Process [Train & Validation Only]
  1. Remove HTML tags
  2. Remove URLs
  3. Remove the Arabic elongation character (Kashida, U+0640)
  4. Separate Numbers
  5. Remove Multiple Whitespaces
  6. Remove punctuation
  7. Remove English letters and both English and Arabic digits
  8. Remove shifts
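
Several of the cleaning steps above can be expressed as regular-expression substitutions. The sketch below covers only a subset (HTML tags, URLs, Kashida, English letters and digits, repeated whitespace). The patterns and the function name clean are assumptions, not the repository's code, and newlines are deliberately left in place because the tokenizer splits on them.

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
KASHIDA = "\u0640"  # Arabic tatweel (elongation) character
LATIN_AND_DIGITS = re.compile(r"[A-Za-z0-9\u0660-\u0669]")  # incl. Arabic-Indic digits
SPACES = re.compile(r"[ \t]+")  # spaces and tabs only, newlines are kept


def clean(text: str) -> str:
    """Apply a subset of the cleaning steps, in the order listed above."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = text.replace(KASHIDA, "")
    text = LATIN_AND_DIGITS.sub(" ", text)
    return SPACES.sub(" ", text).strip()


sample = "<p>\u0627\u0644\u0648\u0644\u0640\u0640\u062f</p> http://example.com abc 123"
print(clean(sample))
```

URLs are removed before the letter and digit pass so that a link disappears as a whole and does not leave stray symbols such as "://" behind.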
Tokenization
  • Split Using: [\n.,،؛:«»?؟]+
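
The split pattern above can be applied directly with Python's re module. In this minimal sketch, dropping empty pieces and trimming surrounding whitespace are assumptions about the post-processing, and split_sentences is a hypothetical name.

```python
import re
from typing import List

# The character class quoted above: newline, Latin and Arabic punctuation marks.
SPLIT_PATTERN = re.compile(r"[\n.,،؛:«»?؟]+")


def split_sentences(text: str) -> List[str]:
    """Split on runs of separator characters and discard empty pieces."""
    return [part.strip() for part in SPLIT_PATTERN.split(text) if part.strip()]


print(split_sentences("a, b.\nc\u061f d"))
```

The trailing + in the pattern treats a run such as ".\n" as a single separator, so consecutive punctuation does not produce empty sentences.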
Fix Diacritization Issue [Train & Validation Only]
  1. Replace consecutive diacritics with a single diacritic
  2. Ending Diacritics: Remove diacritics at the end of a word
  3. Misplaced Diacritics: Remove spaces between characters and diacritics
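
Fixes 1 and 3 above lend themselves to short regular expressions. The sketch below assumes that "consecutive diacritics" means the same mark repeated, so that a legitimate Shadda plus vowel pair is left alone. Fix 2 is not sketched because its exact rule is not spelled out here. The names are hypothetical.

```python
import re

# U+064B..U+0652 covers the eight base marks (Fathatan through Sukun).
DIACRITICS = "\u064b-\u0652"
REPEATED = re.compile(rf"([{DIACRITICS}])\1+")    # the same mark two or more times
DETACHED = re.compile(rf"[ \t]+([{DIACRITICS}])")  # whitespace before a mark


def fix_diacritics(text: str) -> str:
    """Reattach marks separated from their letter, then collapse repeats."""
    text = DETACHED.sub(r"\1", text)
    return REPEATED.sub(r"\1", text)


# A letter, a stray space, then Fatha typed twice.
broken = "\u0628 \u064e\u064e"
print(fix_diacritics(broken) == "\u0628\u064e")
```

Reattaching first matters: a repeated mark that follows a stray space only becomes adjacent to its letter, and therefore cleanly collapsible, after the space is removed.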
Tashkeel Removal [Train & Validation Only]
  • Remove gold class for every character
  • Harakat:
    1. "Fatha":"\u064e"
    2. "Fathatan":  "\u064b"
    3. "Damma":"\u064f"
    4. "Dammatan":"\u064c"
    5. "Kasra":"\u0650"
    6. "Kasratan":"\u064d"
    7. "Sukun":"\u0652"
    8. "Shadda":"\u0651"
    9. "Shadda Fatha":"\u0651\u064e"
    10. "Shadda Fathatan":"\u0651\u064b"
    11. "Shadda Damma":"\u0651\u064f"
    12. "Shadda Dammatan":"\u0651\u064c"
    13. "Shadda Kasra":"\u0651\u0650"
    14. "Shadda Kasratan":"\u0651\u064d"      
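
Using the Harakat code points listed above, a diacritized string can be separated into bare characters and one label per character. This is a sketch of the idea only: the function split_text_and_labels is hypothetical, and labels are kept as raw mark strings (empty when a character has none) instead of class indices.

```python
from typing import List, Tuple

# The eight base marks; Shadda combinations arise as two marks on one letter.
HARAKAT = {
    "\u064e", "\u064b", "\u064f", "\u064c",
    "\u0650", "\u064d", "\u0652", "\u0651",
}


def split_text_and_labels(text: str) -> Tuple[str, List[str]]:
    """Return the undiacritized text and the diacritic string of each character."""
    letters: List[str] = []
    labels: List[str] = []
    for ch in text:
        if ch in HARAKAT:
            if labels:  # attach the mark to the preceding character
                labels[-1] += ch
            # a mark with no preceding character is silently dropped
        else:
            letters.append(ch)
            labels.append("")
    return "".join(letters), labels


plain, marks = split_text_and_labels("\u0630\u064e\u0647\u064e\u0628\u064e")
print(plain, len(marks))
```

Because marks are appended to the previous label, a Shadda followed by a vowel ends up as one two-character label, matching entries 9 to 14 of the list.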

Network

import torch.nn as nn

# vocab_size and n_classes must already be defined at module level
class Tashkeel(nn.Module):
  def __init__(self, vocab_size=vocab_size, embedding_dim=100, hidden_size=256, n_classes=n_classes):
    """
    The constructor of our Tashkeel model
    Inputs:
    - vocab_size: the number of unique tokens in the vocabulary
    - embedding_dim: the embedding dimension
    - hidden_size: the hidden size of each LSTM direction
    - n_classes: the number of final classes (tags)
    """
    super(Tashkeel, self).__init__()
    # (1) Create the embedding layer
    self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

    # (2) Create a 2-layer bidirectional LSTM with hidden size = hidden_size and batch_first = True
    self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True, num_layers=2, bidirectional=True)

    # (3) Create a linear layer with number of neurons = n_classes
    # Its input is 2*hidden_size because the forward and backward states are concatenated
    self.linear = nn.Linear(2*hidden_size, n_classes)
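
The constructor above does not show the forward pass. The following sketch wires the same three layers together under the assumption that the model emits one logit vector per character. The class name TashkeelSketch and the placeholder sizes vocab_size=40 and n_classes=15 are illustrative, not the repository's values.

```python
import torch
import torch.nn as nn


class TashkeelSketch(nn.Module):
    # Placeholder sizes: the real vocab_size and n_classes come from the data.
    def __init__(self, vocab_size=40, embedding_dim=100, hidden_size=256, n_classes=15):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True,
                            num_layers=2, bidirectional=True)
        self.linear = nn.Linear(2 * hidden_size, n_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)  # (batch, seq, embedding_dim)
        outputs, _ = self.lstm(embedded)      # (batch, seq, 2 * hidden_size)
        return self.linear(outputs)           # (batch, seq, n_classes)


model = TashkeelSketch()
batch = torch.randint(0, 40, (2, 7))  # 2 sequences of 7 character ids
logits = model(batch)
print(logits.shape)
```

Taking logits.argmax(dim=-1) would then give one predicted class index per character position.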

Contributors

Ahmed Hany

Mohab Zaghloul

Shaza Mohamed

Basma Elhoseny

License

This software is licensed under the MIT License. See License for more information. © Basma Elhoseny.
