Chukchi morphological parser

Description

This is a Google Summer of Code project for a morphological analyzer (autoglosser) for Chukchi based on Helsinki Finite State Transducers (HFST).

Chukchi

Chukchi is a language with rich morphology, both inflexional and derivational, and a language with vowel harmony. Vowel harmony is a linguistic phenomenon when vowels are grouped and only vowels from the same group can occur in one word (e.g. Turkic languages). That means, (almost) all vowels in a word can change if a certain affix is attached.
Chukchi also has quite a lot of morphophonology, which is expressed via mutations of letters in some contexts. This is further complicated by Chukchi orthography, which in some cases doesn't reflect the order of the sounds adequately.
Morphological parsers that have been used for Chukchi by now (e.g. regexp-based) cannot natively and easily account for all these complications. That is why HFST technology was chosen in this project.

This transducer

The full transducer is composed of three transducers:

lexc deals with morphology and lexicon;
twoc helps to implement morphology that is not possible (or too difficult) to implement in lexc;
twol deals with phonology and morphophonology.

Compilation produces the before mentioned transducers plus two composed ones:

ckt.mor.hfst is the transducer for morphological analysis;
ckt.gen.hfst is the one for forms generation.

The files in .hfstol format are faster versions of .hfst.

The last GSoC commit is here.

Compilation

command	result
make	to compile twolc & lexc files in a single transducer
make twolc	to compile twolc only
make lexc	to compile lexc only
make final	to compose existing transducers into a single one

Compilation + coverage count is done by ./make_and_count.sh script.

Testing

test.sh is a hfst-pair-test for twolc. The input file for the positive test is test.test.
alltest.sh is a hfst-lookup test for the whole transducer.

Usage

by token

If you need to analyze a token, use this command:

$ echo "token" | hfst-lookup ckt.mor.hfst

To generate a form from a succession of tags, use this command:

$ echo "root\<tag1\>\<tag2\>" | hfst-lookup ckt.gen.hfst

whole text

This requires a bit more preparation. Go to scripts/corpus-stat.sh and change this line:

CMD="cat ../corpora/ckt.crp.txt"

according to the location of your text. The file with the text should be in .txt format. Then run corpus-stat.sh.

The output of the script (e.g. analyzed tokens) will be in the file corpora/corpus-stat-res.txt. If you want to change location of the output, change this line in scripts/corpus-stat.sh:

F=../corpora/corpus-stat-res.txt

Results

Coverage

Coverage (percentage of tokens analyzed) is calculated over the corpora/ckt.crp.txt file, which is a compilation of Chukchi fairytales.

Current state: 76.2%

Paradigms

Current state of this parser accounts for:

all of morphophonology and orthography issues;
nominal, pronominal, adjectival inflection and derivation;
uniflexionable parts of speech;
verbal inflection except certain future paradigm cells;
verbal derivation.

It also contains structures to parse compounds and incorporation, but they burst the complexity of the transducer and it was impossible to test them on the capacities availiable.

Further work

Refactoring for better and quicker performance;
Fix future;
Incorporation on refactored transducer;
Dictionary enlargement.

Name		Name	Last commit message	Last commit date
Latest commit History 499 Commits
corpora		corpora
eval		eval
lexicons		lexicons
meetings		meetings
scripts		scripts
translit		translit
AUTHORS		AUTHORS
COPYING		COPYING
ChangeLog		ChangeLog
INSTALL		INSTALL
Makefile.am		Makefile.am
Makefile~		Makefile~
NEWS		NEWS
README		README
README.md		README.md
alltest.sh		alltest.sh
autogen.sh		autogen.sh
ckt.rules.lexc		ckt.rules.lexc
ckt.twoc		ckt.twoc
ckt.twol		ckt.twol
configure.ac		configure.ac
intr.txt		intr.txt
lat_to_cyr.twol		lat_to_cyr.twol
make_and_count.sh		make_and_count.sh
neg.test.test		neg.test.test
new_words.lexc		new_words.lexc
new_words.txt		new_words.txt
tables		tables
test.lookup.new		test.lookup.new
test.lookup.new2		test.lookup.new2
test.lookup.txt		test.lookup.txt
test.sh		test.sh
test.test		test.test
tr.txt		tr.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Chukchi morphological parser

Description

Chukchi

This transducer

Compilation

Testing

Usage

by token

whole text

Results

Coverage

Paradigms

Further work

About

Releases

Packages

Contributors 2

Languages

License

BasilisAndr/chkchn

Folders and files

Latest commit

History

Repository files navigation

Chukchi morphological parser

Description

Chukchi

This transducer

Compilation

Testing

Usage

by token

whole text

Results

Coverage

Paradigms

Further work

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages