
Mismatch between special verb composition and its incorporation with the standard general make #2

Open
aarppe opened this issue Jun 2, 2021 · 8 comments


@aarppe (Contributor) commented Jun 2, 2021

When compiling verb morphology using verb_lexicon.xfscript (which accesses the various stem, affix, and other LEXC files), we get a well-behaved FST with some 486 analysis-form pairings, all of which appear to be correct (1.pairs.txt).

But when we run make in the standard GT compilation, we end up with substantially more, 16,321 analysis-form pairings, most of which are gibberish (2.pairs.txt).

@snomos Where might this go wrong? It is as if the TAMA flag diacritics are no longer applied once verb_lexicon.hfst is combined with the rest.
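For reference (this is not from the thread itself): @U....@ flags are unification flag diacritics, where the first U-flag on a path sets a feature and any later U-flag with a conflicting value blocks the path. A minimal Python simulation of that semantics, using made-up TAMA values purely for illustration, shows how ignoring the flags inflates the path count:

```python
from itertools import product

# Toy paths: each symbol is a unification flag written as ("U", feature, value).
# The values here (IND, CNJ, IMP) are hypothetical placeholders, not the
# actual TAMA values in lang-srs.
prefixes = [("U", "TAMA", v) for v in ("IND", "CNJ", "IMP")]
stems    = [("U", "TAMA", v) for v in ("IND", "CNJ", "IMP")]

def obeys_flags(path):
    """Keep a path only if all U-flags for a given feature unify (agree)."""
    env = {}
    for sym in path:
        if isinstance(sym, tuple) and sym[0] == "U":
            _, feat, val = sym
            if feat in env and env[feat] != val:
                return False  # value clash: the path is blocked
            env[feat] = val
    return True

all_paths = list(product(prefixes, stems))
valid = [p for p in all_paths if obeys_flags(p)]

print(len(all_paths))  # 9 paths if the flags are ignored
print(len(valid))      # 3 paths survive unification
```

If a compilation step strips or stops enforcing the flags, every prefix combines with every stem, which is exactly the kind of blow-up reported above.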

@snomos (Member) commented Jun 2, 2021

How did you generate those lists (exact command)? Which FST tool did you use, Foma or HFST?

@aarppe (Contributor, Author) commented Jun 2, 2021

I ran pairs > blaa.txt in FOMA (HFST doesn't support that command).

As input for one I used a FOMA-transformed version of verb_lexicon.hfst; for the other, the normative FOMA analyzer/generator that the make compilation creates in lang-srs/src/.

@snomos (Member) commented Jun 2, 2021

> I ran pairs > blaa.txt in FOMA (HFST doesn't support that command).

The corresponding HFST command is:

hfst-fst2strings -X obey-flags src/fst/verb_lexicon.hfst

For whatever reason it is unbelievably slow. @flammie do you know why?

> As input for one I used a FOMA-transformed version of verb_lexicon.hfst, for the other the normative FOMA analyzer/generator that the make compilation creates in lang-srs/src/

Ok.

@flammie (Contributor) commented Jun 2, 2021

> I ran pairs > blaa.txt in FOMA (HFST doesn't support that command).
>
> The corresponding HFST command is:
>
> hfst-fst2strings -X obey-flags src/fst/verb_lexicon.hfst
>
> For whatever reason it is unbelievably slow. @flammie do you know why?

I cannot see anything obvious. Without flags it seems quite fast, though the output is very long as well, and there are a lot of flags in the automaton; maybe the flag-obeying print first collects everything and only filters and prints afterwards? I'll see if I can add some debug printing to find something...

@aarppe (Contributor, Author) commented Jun 2, 2021

Given that the correct set of pairings is around 500 cases, and that the incorrect set seems to allow all possible inner inflectional prefix chunks (of which there are some 20-30 types), the size of the longer list might simply be 500 × 20-30 ≈ 10k-15k.

If this diagnosis is correct, the crucial question is where and how in the general GT compilation the @U.TAMA....@ (or other) flags are rendered ineffective. Is there some explicit option or implicit behavior in the compilation that does this? And/or are some flag diacritics left undefined somewhere?
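A quick back-of-the-envelope check of that estimate, using the two counts reported in this thread (the 20-30 range for prefix chunks is the guess above, not a verified figure):

```python
correct = 486     # pairs from verb_lexicon.xfscript alone (1.pairs.txt)
observed = 16321  # pairs from the full make build (2.pairs.txt)

# If every correct pair were crossed with every inner prefix chunk,
# the inflation factor would equal the number of chunk types.
factor = observed / correct
print(round(factor, 1))  # 33.6, i.e. somewhat above the 20-30 guess
                         # but the same order of magnitude
```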

@snomos (Member) commented Jun 3, 2021

I have now run three different tests, to see whether there are meaningful differences between them. The tests are as follows:

hfst-fst2strings -X obey-flags src/fst/verb_lexicon.hfst > verbtest.txt
hfst-fst2strings -X obey-flags src/fst/lexicon.hfst > lextest.txt
hfst-fst2strings -X obey-flags src/analyser-gt-norm.hfst > ananormtest.txt

That is, extract all pair strings from the verb_lexicon.hfst file (the one built using the SRS-specific xfscript), from the unified lexicon (in practice: verbs + symbols and punctuation marks), and finally from one of the derived FSTs.

The output is in practice identical in all cases:

wc -l verbtest.txt 
     486 verbtest.txt
grep -v 'PUNCT' lextest.txt | grep -v 'CLB' | wc -l
     486
grep -v 'PUNCT' ananormtest.txt | grep -v 'CLB' | wc -l
     962

The larger number in the last case is caused by automatic initial uppercasing, which essentially doubles the count. Divided by 2, the result is 481, and if we assume a handful of entries whose initial letter has no case distinction, the numbers add up perfectly. In any case, this is clearly far from the 16k+ reported in the opening comment.

That is, using pure HFST, I see no issue at all. I thus suspect that the error is related to the conversion from HFST to FOMA.
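The casing arithmetic above can be spelled out as follows (the counts are from the output above; the exact identity of the non-doubled entries is inferred, not verified):

```python
base = 486          # pairs in verbtest.txt and in filtered lextest.txt
with_casing = 962   # pairs in filtered ananormtest.txt

# Automatic initial uppercasing doubles each entry, except entries whose
# first letter has no case distinction, which appear only once.
not_doubled = 2 * base - with_casing

print(with_casing // 2)  # 481, i.e. 5 short of 486
print(not_doubled)       # 10 entries appear only once -- the "handful"
                         # of non-casing initial letters
```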

@aarppe (Contributor, Author) commented Jun 3, 2021

Hmmm... When I use hfst-fst2fst to convert the aforementioned file src/analyser-gt-norm.hfst to foma, I also get reasonable numbers, cf.

hfst-fst2fst -F -b src/analyser-gt-norm.hfst -o tmp.foma
...
foma[0]: load tmp.foma 
889.1 kB. 21905 states, 56346 arcs, more than 9223372036854775807 paths.

foma[1]: pairs > pairs.txt
Writing to pairs.txt.
...
wc -l pairs.txt 
    1021 pairs.txt

So I'm wondering: which file is used as the source for the HFST-to-FOMA conversion?

@aarppe (Contributor, Author) commented Jun 10, 2021

A further note is that the normative FOMA generator seems to work appropriately as well, cf.

foma[0]: load src/generator-gt-norm.foma 
900.2 kB. 22213 states, 56618 arcs, more than 9223372036854775807 paths.
foma[1]: pairs > pairs.txt
Writing to pairs.txt.
foma[1]: 
zsh: suspended  foma

wc -l pairs.txt 
     694 pairs.txt

So the glitch seems to be in the conversion of the normative FOMA analyzer. Maybe that is the specific step in the GT compilation we should look into?
