The lists of dialectal words for 15 countries were collected from Twitter. Each word in each Arabic dialect list is given together with its PMI score, which represents the word's degree of relatedness to that dialect.
The unsupervised approach used to build the lists is an iterative procedure consisting of three main steps: automatic creation of dialectal word lists, selection of seed words, and collection of dialectal sentences. The Pointwise Mutual Information (PMI) association measure, together with the geographical frequency of word occurrence online, was used to classify dialectal words. The poor performance of a Modern Standard Arabic (MSA) part-of-speech (POS) tagger on dialectal Arabic content was exploited to extract the dialectal words.
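As a rough illustration of how a PMI score relates a word to a dialect, the sketch below computes the textbook form PMI(w, d) = log2( p(w, d) / (p(w) p(d)) ) from raw token counts. It is only a minimal sketch under stated assumptions: the function name pmi_scores, the input layout (a mapping from dialect label to a list of tokens), and the toy tokens are hypothetical, and the exact formulation, smoothing, and thresholds used to build the released lists are described in the paper and may differ.

```python
import math
from collections import Counter


def pmi_scores(tokens_by_dialect):
    """Return {dialect: {word: PMI(word, dialect)}}.

    tokens_by_dialect is a hypothetical input layout: a dict mapping a
    dialect label to the list of tokens observed for that dialect.
    """
    # Per-dialect word counts.
    counts = {d: Counter(toks) for d, toks in tokens_by_dialect.items()}

    # Word counts over all dialects, and the overall token total.
    word_totals = Counter()
    for c in counts.values():
        word_totals.update(c)
    total = sum(word_totals.values())

    scores = {}
    for d, c in counts.items():
        dialect_total = sum(c.values())
        # p(w, d) / (p(w) * p(d)) simplifies to
        # n * total / (word_total * dialect_total).
        scores[d] = {
            w: math.log2((n * total) / (word_totals[w] * dialect_total))
            for w, n in c.items()
        }
    return scores


# Toy usage with placeholder tokens (not taken from the resource).
toy = {"A": ["x", "x", "y"], "B": ["y", "z", "z"]}
scores = pmi_scores(toy)
print(scores["A"]["x"])  # 1.0: "x" occurs only in dialect A
print(scores["A"]["y"])  # 0.0: "y" is spread evenly across A and B
```

A positive score means the word occurs in that dialect more often than its overall frequency would predict, a score near zero means no particular association, and a negative score means the word is under-represented there.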
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).
https://creativecommons.org/licenses/by-nc-nd/4.0/
You are free to:
Share — copy and redistribute the material in any medium or format
Under the following terms:
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
NonCommercial — You may not use the material for commercial purposes.
NoDerivatives — If you remix, transform, or build upon the material, you may not distribute the modified material.
Please cite our paper in any published work using this resource:
@article{althobaiti2021creation,
title={Creation of annotated country-level dialectal Arabic resources: An unsupervised approach},
author={Althobaiti, Maha J},
journal={Natural Language Engineering},
pages={1--42},
year={2021},
publisher={Cambridge University Press}
}