Bagvalal is an endangered language from the Nakh-Daghestanian family. This repository contains a prototype for a Bagvalal morphological analyzer. It is a part of a larger project by the students of the School of Linguistics at the NRU HSE that aims to provide digital tools for endangered languages. See the paper draft for a detailed description.
You can cite the paper draft using one of the forms listed below:
- Developing morphological analyzers for low-resource languages of the Caucasus. D. Arakelova, D. Ignatiev. Term paper at the School of Linguistics, NRU HSE, 2021.
- Daria Arakelova, Daniil Ignatiev. “Developing morphological analyzers for low-resource languages of the Caucasus”. NRU HSE (2021): 22. pag.
- Аракелова Д., Игнатьев Д. Разработка морфологических анализаторов для малоресурсных языков Кавказа. НИУ ВШЭ. Москва, 2021. 22 с.
A working project demo can be found here. It includes a full-fledged GUI and an API for parsing user-provided texts.
The API specs are available here.
The project is distributed under the GNU General Public License v3.0.
The current work is based on the linguistic description of Bagvalal by A. E. Kibrik et al. (2001) and the Bagvalal dictionary by P. T. Magomedova (2005).
The texts used were collected and annotated by a group of MSU students under A. E. Kibrik during field trips to Daghestan in 1997 and 1998.
The texts were transcribed from oral narration and include folk tales and anecdotes. See the paper draft for more information.
The texts can be found in the corpora directory. They have been lemmatized, with each lemma placed on a separate line to simplify coverage measurement.
To use or extend the analyzer, clone the repository and use the makefile commands described below.
lexd and hfst are required to build the project. You can get them by adding Apertium to your apt repositories.
curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | sudo bash
sudo apt install lexd
sudo apt install hfst
You can also build a Docker image with all the dependencies using the provided Dockerfile.
Cyrillic version:
make merged.ana.hfst
Caucasiologist transcription version:
make merged.tr.hfstol
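Once a transducer is built, it can be queried directly. The following is a minimal sketch, assuming that `hfst-lookup` from the HFST toolkit is on your PATH and that the transducer file was produced by one of the make targets above. The `analyze_words` function name is hypothetical and is not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository).
# Reads one word per line from stdin and prints the analyses
# returned by hfst-lookup for the given transducer file.
analyze_words() {
    transducer="${1:-merged.ana.hfst}"   # default: the Cyrillic analyzer
    if ! command -v hfst-lookup >/dev/null 2>&1; then
        echo "hfst-lookup not found; install hfst first" >&2
        return 1
    fi
    if [ ! -f "$transducer" ]; then
        echo "transducer '$transducer' not found; build it with make first" >&2
        return 1
    fi
    # hfst-lookup reads input words from stdin when no input file is given.
    hfst-lookup "$transducer"
}
```

For example, `analyze_words merged.ana.hfst < words.txt` would analyze a one-word-per-line file, where `words.txt` stands for any such file of your own.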
View the statistics:
make check-coverage-stats
- cd to corpora and run make *corpus name*.analyzed to analyze with the Cyrillic transducer
- cd to corpora and run make *corpus name*.tr.analyzed to analyze with the IPA transducer
Check coverage and analyze a corpus:
make check-coverage-stats
cd corpora
make k_newline.tr.analyzed
Current performance: naive coverage of ~82%.
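Naive coverage is commonly understood as the share of tokens that receive at least one analysis. The sketch below shows one way to compute it, under the assumption that the analyzed file follows the usual `hfst-lookup` output layout: tab-separated `token`, `analysis`, `weight` lines, a blank line between tokens, and `+?` at the end of the analysis for unrecognized tokens. The `naive_coverage` function is hypothetical and may differ from what `make check-coverage-stats` actually does.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository).
# Prints the percentage of tokens with at least one real analysis.
# Assumed input: hfst-lookup style output, one blank-line-separated
# block per token, unknown tokens marked by an analysis ending in "+?".
naive_coverage() {
    awk '
        BEGIN { RS = ""; FS = "\n" }      # paragraph mode: one record per token
        {
            total++
            for (i = 1; i <= NF; i++) {
                n = split($i, cols, "\t")
                # a token counts as covered if any analysis does not end in "+?"
                if (n >= 2 && cols[2] !~ /[+][?]$/) { covered++; break }
            }
        }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * covered / total
        }
    ' "$@"
}
```

If the `.analyzed` files produced in the corpora directory follow this layout, running `naive_coverage corpora/k_newline.tr.analyzed` would print a single percentage.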
The project has been tested and is guaranteed to run on Debian and Ubuntu. We make no promises regarding the performance on other platforms.