
Mismatch between special verb composition and its incorporation with the standard general make #2

Open
aarppe opened this issue Jun 2, 2021 · 8 comments


@aarppe (Contributor) commented Jun 2, 2021

When compiling verb morphology using verb_lexicon.xfscript (which accesses the various stem, affix, and other LEXC files), we get a well-behaved FST with some 486 analysis-form pairings, all of which appear to be correct (1.pairs.txt).

But when we run make in the standard GT compilation, we end up with substantially more, 16,321 analysis-form pairings, most of which are gibberish (2.pairs.txt).

@snomos Where might this go wrong? It is as if the TAMA flag diacritics are no longer applied once verb_lexicon.hfst is combined with the rest.
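For reference (this is not from the thread itself): @U....@ flags are unification flag diacritics, where the first U-flag on a path sets a feature and any later U-flag with a conflicting value blocks the path. A minimal Python simulation of that semantics, using made-up TAMA values purely for illustration, shows how ignoring the flags inflates the path count:

```python
from itertools import product

# Toy paths: each symbol is a unification flag written as ("U", feature, value).
# The values here (IND, CNJ, IMP) are hypothetical placeholders, not the
# actual TAMA values in lang-srs.
prefixes = [("U", "TAMA", v) for v in ("IND", "CNJ", "IMP")]
stems    = [("U", "TAMA", v) for v in ("IND", "CNJ", "IMP")]

def obeys_flags(path):
    """Keep a path only if all U-flags for a given feature unify (agree)."""
    env = {}
    for sym in path:
        if isinstance(sym, tuple) and sym[0] == "U":
            _, feat, val = sym
            if feat in env and env[feat] != val:
                return False  # value clash: the path is blocked
            env[feat] = val
    return True

all_paths = list(product(prefixes, stems))
valid = [p for p in all_paths if obeys_flags(p)]

print(len(all_paths))  # 9 paths if the flags are ignored
print(len(valid))      # 3 paths survive unification
```

If a compilation step strips or stops enforcing the flags, every prefix combines with every stem, which is exactly the kind of blow-up reported above.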

@snomos (Member) commented Jun 2, 2021

How did you generate those lists (exact command)? Which FST tool did you use, Foma or HFST?

@aarppe (Contributor, Author) commented Jun 2, 2021

I ran pairs > blaa.txt in FOMA (HFST doesn't support that command).

As input for one I used a FOMA-transformed version of verb_lexicon.hfst; for the other, the normative FOMA analyzer/generator that the make compilation creates in lang-srs/src/.

@snomos (Member) commented Jun 2, 2021

> I ran pairs > blaa.txt in FOMA (HFST doesn't support that command).

The corresponding HFST command is:

hfst-fst2strings -X obey-flags src/fst/verb_lexicon.hfst

For whatever reason it is unbelievably slow. @flammie do you know why?

> As input for one I used a FOMA-transformed version of verb_lexicon.hfst, for the other the normative FOMA analyzer/generator that the make compilation creates in lang-srs/src/

Ok.

@flammie (Contributor) commented Jun 2, 2021

> I ran pairs > blaa.txt in FOMA (HFST doesn't support that command).
>
> The corresponding HFST command is:
>
> hfst-fst2strings -X obey-flags src/fst/verb_lexicon.hfst
>
> For whatever reason it is unbelievably slow. @flammie do you know why?

I cannot see anything obvious. Without flags it seems quite fast, though the output is very long as well, and there are a lot of flags in the automaton; maybe the flag-obeying print first collects everything and only filters and prints afterwards? I'll see if I can add some debug printing to find something...

@aarppe (Contributor, Author) commented Jun 2, 2021

Given that the correct set of pairings is around 500 cases, and that the incorrect set seems to allow all possible inner inflectional prefix chunks (of which there are some 20-30 types), the size of the longer list might simply be 500 × 20-30 ≈ 10k-15k.

If this diagnosis is correct, the crucial question is where and how in the general GT compilation the @U.TAMA....@ (or other) flags are rendered ineffective. Is there some explicit option or implicit behavior in the compilation that does this? And/or are some flag diacritics left undefined somewhere?
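A quick back-of-the-envelope check of that estimate, using the two counts reported in this thread (the 20-30 range for prefix chunks is the guess above, not a verified figure):

```python
correct = 486     # pairs from verb_lexicon.xfscript alone (1.pairs.txt)
observed = 16321  # pairs from the full make build (2.pairs.txt)

# If every correct pair were crossed with every inner prefix chunk,
# the inflation factor would equal the number of chunk types.
factor = observed / correct
print(round(factor, 1))  # 33.6, i.e. somewhat above the 20-30 guess
                         # but the same order of magnitude
```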

@snomos (Member) commented Jun 3, 2021

I have now run three different tests, to see whether there are meaningful differences between them. The tests are as follows:

hfst-fst2strings -X obey-flags src/fst/verb_lexicon.hfst > verbtest.txt
hfst-fst2strings -X obey-flags src/fst/lexicon.hfst > lextest.txt
hfst-fst2strings -X obey-flags src/analyser-gt-norm.hfst > ananormtest.txt

That is, extract all pair strings from the verb_lexicon.hfst file (the one built using the SRS-specific xfscript), from the unified lexicon (in practice: verbs + symbols and punctuation marks), and finally from one of the derived FSTs.

The output is in practice identical in all cases:

wc -l verbtest.txt 
     486 verbtest.txt
grep -v 'PUNCT' lextest.txt | grep -v 'CLB' | wc -l
     486
grep -v 'PUNCT' ananormtest.txt | grep -v 'CLB' | wc -l
     962

The larger number in the last case is caused by automatic initial uppercasing, which essentially doubles the count. Divided by 2, the result is 481, and if we assume a handful of entries whose initial letter has no case distinction, the numbers add up perfectly. In any case, this is clearly far from the 16k+ reported in the opening comment.

That is, using pure HFST, I see no issue at all. I thus suspect that the error is related to the conversion from HFST to FOMA.
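The casing arithmetic above can be spelled out as follows (the counts are from the output above; the exact identity of the non-doubled entries is inferred, not verified):

```python
base = 486          # pairs in verbtest.txt and in filtered lextest.txt
with_casing = 962   # pairs in filtered ananormtest.txt

# Automatic initial uppercasing doubles each entry, except entries whose
# first letter has no case distinction, which appear only once.
not_doubled = 2 * base - with_casing

print(with_casing // 2)  # 481, i.e. 5 short of 486
print(not_doubled)       # 10 entries appear only once -- the "handful"
                         # of non-casing initial letters
```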

@aarppe (Contributor, Author) commented Jun 3, 2021

Hmmm... When I use hfst-fst2fst to convert the aforementioned file src/analyser-gt-norm.hfst to foma, I also get reasonable numbers, cf.

hfst-fst2fst -F -b src/analyser-gt-norm.hfst -o tmp.foma
...
foma[0]: load tmp.foma 
889.1 kB. 21905 states, 56346 arcs, more than 9223372036854775807 paths.

foma[1]: pairs > pairs.txt
Writing to pairs.txt.
...
wc -l pairs.txt 
    1021 pairs.txt

So I'm wondering: which file is used as the source for the HFST-to-FOMA conversion?

@aarppe (Contributor, Author) commented Jun 10, 2021

A further note is that the normative FOMA generator seems to work appropriately as well, cf.

foma[0]: load src/generator-gt-norm.foma 
900.2 kB. 22213 states, 56618 arcs, more than 9223372036854775807 paths.
foma[1]: pairs > pairs.txt
Writing to pairs.txt.
foma[1]: 
zsh: suspended  foma

wc -l pairs.txt 
     694 pairs.txt

So the glitch seems to be in the conversion of the normative FOMA analyzer. Maybe that is the specific step in the GT compilation we should look into?
