This project aims to develop a system for Arabic text diacritization using natural language processing (NLP) techniques. Diacritization adds diacritical marks (e.g., short-vowel marks) to Arabic text, which aids pronunciation and comprehension, particularly for learners or in automated text processing tasks.
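To make the task concrete, the following minimal sketch shows how diacritical marks appear in Unicode text and how removing them yields the bare input a diacritizer would have to restore. It is an illustration only, not code from this repository: the helper name `strip_diacritics` is hypothetical, and the mark set is assumed to be the eight common marks in the range U+064B to U+0652.

```python
import unicodedata

# Assumption: the marks of interest are the eight common Arabic diacritics,
# U+064B (fathatan) through U+0652 (sukun).
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Return text with the Arabic diacritical marks removed (hypothetical helper)."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


# The word "kataba" (he wrote): kaf, teh, and beh, each followed by a fatha.
diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
plain = strip_diacritics(diacritized)
mark_names = [unicodedata.name(ch) for ch in diacritized if ch in ARABIC_DIACRITICS]

print(plain)       # the three bare letters, without marks
print(mark_names)  # Unicode names of the marks that were removed
```

Diacritization is the inverse of `strip_diacritics`: given the bare letters, predict which mark (if any) follows each one.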
- Diacritize Arabic text input.
- Support the diacritical marks commonly used in Arabic.
- Evaluate diacritization accuracy through metrics such as accuracy, precision, and recall.
- Train a model to improve diacritization performance.
- Clone the repository:
git clone https://github.com/khaHesham/arabic-diacritization.git
- Install dependencies:
cd arabic-diacritization
pip install -r requirements.txt
- Prepare your Arabic text data.
- Run the diacritization script:
python diacritize.py --input input.txt --output output.txt
Replace input.txt with the path to your input file and output.txt with the desired output file path.
- Evaluate diacritization accuracy:
python evaluate.py --predicted predicted.txt --gold gold.txt
Replace predicted.txt with the path to the predicted diacritized text file and gold.txt with the path to the gold-standard diacritized text file.
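As a rough illustration of what comparing predicted and gold text can involve, the sketch below computes character-level accuracy plus per-mark precision and recall. It is not the logic of `evaluate.py`, which may differ. The helpers `split_marks` and `score` are hypothetical, and the sketch assumes both texts contain the same base characters and counts every character, including spaces and punctuation.

```python
from collections import Counter

# Assumption: the eight common Arabic diacritics, U+064B through U+0652.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_marks(text):
    """Split text into (base character, attached marks) pairs."""
    pairs = []
    for ch in text:
        if ch in ARABIC_DIACRITICS and pairs:
            base, marks = pairs[-1]
            # Sort so that, e.g., shadda+fatha and fatha+shadda compare equal.
            pairs[-1] = (base, "".join(sorted(marks + ch)))
        else:
            pairs.append((ch, ""))
    return pairs


def score(predicted, gold):
    """Return (accuracy, per-label precision, per-label recall)."""
    pred_pairs, gold_pairs = split_marks(predicted), split_marks(gold)
    if [b for b, _ in pred_pairs] != [b for b, _ in gold_pairs]:
        raise ValueError("predicted and gold texts differ in their base characters")
    true_pos, pred_count, gold_count = Counter(), Counter(), Counter()
    for (_, p), (_, g) in zip(pred_pairs, gold_pairs):
        pred_count[p] += 1
        gold_count[g] += 1
        if p == g:
            true_pos[p] += 1
    total = len(gold_pairs)
    accuracy = sum(true_pos.values()) / total if total else 0.0
    precision = {label: true_pos[label] / n for label, n in pred_count.items()}
    recall = {label: true_pos[label] / n for label, n in gold_count.items()}
    return accuracy, precision, recall


gold = "\u0643\u064E\u062A\u064E\u0628\u064E"       # fatha on all three letters
predicted = "\u0643\u064E\u062A\u064F\u0628\u064E"  # damma wrongly placed on the second
accuracy, precision, recall = score(predicted, gold)
print(f"accuracy={accuracy:.2f}")  # accuracy=0.67
```

Here the label for each letter is the full set of marks attached to it, so a letter carrying both a shadda and a vowel counts as correct only if both are predicted.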
If you wish to train your own diacritization model:
- Prepare a training dataset with diacritized Arabic text.
- Train the model:
python train.py --train_data train.txt --dev_data dev.txt --model_dir model/
Replace train.txt with the path to your training data, dev.txt with the path to your development data, and model/ with the desired directory for saving the trained model.
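If your diacritized corpus is a single file, one way to obtain separate training and development files is a seeded random split. The sketch below is only a suggestion: the helper `split_corpus`, the 10% development fraction, and the one-sentence-per-line layout are all assumptions, and the data format `train.py` actually expects should be checked against the repository.

```python
import random
import tempfile
from pathlib import Path


def split_corpus(corpus_path, train_path, dev_path, dev_fraction=0.1, seed=0):
    """Shuffle the non-empty lines of a corpus and write train and dev files.

    Hypothetical helper; assumes one diacritized sentence per line.
    """
    text = Path(corpus_path).read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]
    random.Random(seed).shuffle(lines)  # seeded, so the split is reproducible
    # Keep at least one dev line whenever there is more than one line in total.
    n_dev = max(1, int(len(lines) * dev_fraction)) if len(lines) > 1 else 0
    dev, train = lines[:n_dev], lines[n_dev:]
    Path(train_path).write_text("".join(l + "\n" for l in train), encoding="utf-8")
    Path(dev_path).write_text("".join(l + "\n" for l in dev), encoding="utf-8")
    return len(train), len(dev)


# Demo on a throwaway corpus; in practice corpus_path points at your own data.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    corpus = tmp / "corpus.txt"
    corpus.write_text("".join(f"sentence {i}\n" for i in range(10)), encoding="utf-8")
    n_train, n_dev = split_corpus(corpus, tmp / "train.txt", tmp / "dev.txt")
    print(n_train, n_dev)
```

The two output files can then be passed to `train.py` through `--train_data` and `--dev_data`.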
- Abdelaziz Salah
- Abdelrahman Noaman
- Khaled Hesham
- Kirollos Samy
This project is licensed under the MIT License; see the LICENSE file for details.
For any inquiries or feedback, please contact AEyeTeam.