Strange output when transducer is compiled from a .att file #45

AMR-KELEG · 2019-03-21T21:42:37Z

I am trying to add weights to the morphological analyser.
While checking last year's project (http://wiki.apertium.org/wiki/User:Techievena/GSoC_2018_Work_Product_Submission), I noticed that the output of the analyser isn't correct (according to my understanding).
The wiki suggests that to do so I will need to:

$ cat test.att
0	1	c	c	4.567895
1	2	a	a	0.989532
2	3	t	t	2.796193
3	4	@0@	+	-3.824564
4	5	@0@	n	1.824564
5	0.525487
4	5	@0@	v	2.845989
 
$ lt-comp lr test.att test.bin 
main@standard 6 6
 
$ lt-print test.bin
0	1	c	c	4.567895	
1	2	a	a	0.989532	
2	3	t	t	2.796193	
3	4	ε	+	-3.824564	
4	5	ε	n	1.824564	
4	5	ε	v	2.845989	
5	0.525487

However, the output of the transducer is a bit strange:

$ echo "cats" | lt-proc test.bin
^cat/cat+n/cat+v$s

Shouldn't the $ sign mark the end of the analysis? Why is there an s following the $ sign?

unhammer · 2019-03-22T07:55:56Z

Your analyser doesn't include an analysis for cats, just for cat. Unanalyzed and non-alphabetic symbols are just output as blanks (whitespace etc.) without surrounding ^$. In regular dix files, you can define an <alphabet> containing the alphabetic symbols. If a symbol is alphabetic, it can't be output as a blank, so if 's' were in <alphabet> in a .dix file, you'd actually get ^cats/*cats$.

I thought the .att compiler actually had a heuristic for alphabetics using something like isalpha() or similar. So this *might* be a locale issue, if isalpha uses locale information to say that a certain character is alphabetic. What do you get from

$ echo $LC_ALL
$ echo $LANG
$ locale -a

?
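To make the blank-versus-alphabetic distinction concrete, here is a toy sketch in Python. It is not lttoolbox's algorithm: `analyse`, `lexicon` and `alphabet` are hypothetical names, and matching is simplified to "the longest known form that ends where the next character is non-alphabetic".

```python
def analyse(text, lexicon, alphabet):
    """Toy analyser (assumption: a simplification, not lt-proc itself).

    lexicon maps a surface form to its list of analyses;
    alphabet is the set of characters treated as alphabetic.
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        if text[i] not in alphabet:
            # Non-alphabetic characters are copied through as blanks.
            out.append(text[i])
            i += 1
            continue
        # Longest known form that ends at a non-alphabetic character
        # (or at the end of the input).
        best = None
        for j in range(i + 1, n + 1):
            at_boundary = j == n or text[j] not in alphabet
            if at_boundary and text[i:j] in lexicon:
                best = j
        if best is None:
            # Unknown word: consume the whole alphabetic run.
            best = i
            while best < n and text[best] in alphabet:
                best += 1
            form = text[i:best]
            out.append("^" + form + "/*" + form + "$")
        else:
            form = text[i:best]
            out.append("^" + "/".join([form] + lexicon[form]) + "$")
        i = best
    return "".join(out)


lexicon = {"cat": ["cat+n", "cat+v"]}
print(analyse("cats", lexicon, set("cat")))   # 's' is not alphabetic
print(analyse("cats", lexicon, set("cats")))  # 's' is alphabetic
```

With 's' outside the alphabet the sketch reproduces the output from the report; with 's' inside it, the whole word becomes a single unknown token.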

AMR-KELEG · 2019-03-22T08:56:47Z

I have checked the values of the environment variables and here are the results:
The $LC_ALL environment variable isn't set.

$ echo $LANG
en_US.UTF-8

$ locale -a
ar_AE.utf8
ar_BH.utf8
ar_DZ.utf8
ar_EG.utf8
ar_IN
ar_IN.utf8
ar_IQ.utf8
ar_JO.utf8
ar_KW.utf8
ar_LB.utf8
ar_LY.utf8
ar_MA.utf8
ar_OM.utf8
ar_QA.utf8
ar_SA.utf8
ar_SD.utf8
ar_SS
ar_SS.utf8
ar_SY.utf8
ar_TN.utf8
ar_YE.utf8
C
C.UTF-8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IL
en_IL.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
es_SV.utf8
POSIX

unhammer · 2019-03-22T09:31:04Z

Amr Mohamed wrote:

> I have checked the values of the environment variables and here are the results:
> The $LC_ALL environment variable isn't set.
>
> $ echo $LANG
> en_US.UTF-8

This doesn't appear in your list from `locale -a`. Could you try again with LANG set to one from your list, e.g.

$ export LANG=C.UTF-8

and recompile and test? If that doesn't help, also try

$ export LC_ALL=C.UTF-8

AMR-KELEG · 2019-05-10T19:53:06Z

@unhammer I have tried setting both the LANG and the LC_ALL environment variables yet the problem still exists.

AMR-KELEG · 2019-05-10T22:58:01Z

> Your analyser doesn't include an analysis for cats, just for cat. Unanalyzed and non-alphabetic symbols are just output as blanks (whitespace etc.) without surrounding ^$. In regular dix files, you can define an <alphabet> containing the alphabetic symbols. If a symbol is alphabetic, it can't be output as a blank, so if 's' were in <alphabet> in a .dix file, you'd actually get ^cats/*cats$. I thought the .att compiler actually had a heuristic for alphabetics using something like isalpha() or similar. So this might be a locale issue, if isalpha uses locale information to say that a certain character is alphabetic.

Actually the alphabet is represented by a set of the unique characters in the .att file:
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/fst_processor.cc#L844

And a character is considered to be part of the alphabet if it exists in the set of unique characters:
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/fst_processor.cc#L838
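As a rough illustration of why 's' ends up outside the alphabet, the following Python sketch collects the symbols that occur on the transitions of the test.att file from the first comment. It is only a sketch: `att_symbols` is a hypothetical helper, and whether lttoolbox looks at the input column, the output column, or both is an assumption not checked here (both are collected below).

```python
# Transitions from test.att in the report, tab-separated:
# source state, target state, input symbol, output symbol, weight.
ATT_LINES = [
    "0\t1\tc\tc\t4.567895",
    "1\t2\ta\ta\t0.989532",
    "2\t3\tt\tt\t2.796193",
    "3\t4\t@0@\t+\t-3.824564",
    "4\t5\t@0@\tn\t1.824564",
    "5\t0.525487",
    "4\t5\t@0@\tv\t2.845989",
]


def att_symbols(lines):
    """Return the set of non-epsilon symbols seen on transitions."""
    symbols = set()
    for line in lines:
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # final-state line: state and weight only
        for symbol in fields[2:4]:
            if symbol != "@0@":  # @0@ is the epsilon marker
                symbols.add(symbol)
    return symbols


alphabet = att_symbols(ATT_LINES)
print(sorted(alphabet))
print("s" in alphabet)  # membership test, as described for isAlphabetic
```

Since no transition carries an 's', a pure membership test treats the trailing 's' of "cats" as non-alphabetic.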

Should we update the isAlphabetic method?

AMR-KELEG · 2019-05-11T15:15:11Z

We may also need to use the version of the function that accepts a locale as a parameter:
http://www.cplusplus.com/reference/locale/isalpha/

flammie · 2019-05-13T10:55:16Z

I think it's not clear what the best way of handling unknown alphabets is, though a locale-aware isalpha is probably better than the current approach. One step further would be to follow some of the Unicode technical reports about word-breaking / line-breaking / text-segmentation etc.; perhaps ICU has an implementation of those. For Latin- and Cyrillic-based languages this is mostly relevant in the early stages of development of an analyser, or with toy examples, as any non-trivial dictionary will end up covering all non-marginal characters.
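For comparison, a locale-independent test can be sketched with Unicode general categories. This is a swapped-in, much simpler technique than the word-breaking rules from the Unicode technical reports or anything ICU provides, and `is_letter` is a hypothetical helper, not part of lttoolbox.

```python
import unicodedata


def is_letter(ch):
    """True if the character's Unicode general category is a letter (L*)."""
    return unicodedata.category(ch).startswith("L")


# U+0642 is ARABIC LETTER QAF; the escape keeps the example ASCII-only.
for ch in ["s", "\u0642", "-", "2"]:
    print(repr(ch), unicodedata.category(ch), is_letter(ch))
```

A check like this does not depend on LANG or LC_ALL, but it also knows nothing about language-specific word characters such as the hyphen in compounds.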

unhammer · 2019-05-14T07:33:29Z

It's also relevant for developed analysers using lt-proc -e since only unknown tokens will get compound analyses (we used to analyse VM-kampen as three tokens in nno-nob, but now we have - in <alphabet> so that it first becomes one big unknown token which is then compound-analysed.)

flammie · 2019-05-14T19:24:57Z

> It's also relevant for developed analysers using lt-proc -e since only unknown tokens will get compound analyses (we used to analyse VM-kampen as three tokens in nno-nob, but now we have - in <alphabet> so that it first becomes one big unknown token which is then compound-analysed.)

Ah, I haven't set compounds up properly yet for my languages, but I guess even for that case isalnum is a gooder solution than not?

unhammer · 2019-05-15T07:49:26Z

isalpha is a gooder solution at least, but for isalnum you have to make sure compounding works correctly on numbers in all directions, or things that used to get translated will become unknowns
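The difference between the two fallbacks can be sketched in Python by splitting input into maximal alphabetic runs. The names `alphabetic_runs` and `fallback` are hypothetical, and Python's `str.isalpha` / `str.isalnum` are Unicode-based rather than locale-based, so this only approximates what the C++ functions would do.

```python
from itertools import groupby


def alphabetic_runs(text, alphabet, fallback):
    """Split text into maximal runs of alphabetic characters.

    A character counts as alphabetic if it is in the explicit alphabet
    or if the fallback predicate accepts it.
    """
    def is_alphabetic(ch):
        return ch in alphabet or fallback(ch)

    return ["".join(group)
            for key, group in groupby(text, key=is_alphabetic)
            if key]


# With isalpha the digit stays outside the token; with isalnum it joins it.
print(alphabetic_runs("cat2", set(), str.isalpha))
print(alphabetic_runs("cat2", set(), str.isalnum))

# A hyphen only joins the token if it is listed explicitly.
print(alphabetic_runs("VM-kampen", set(), str.isalpha))
print(alphabetic_runs("VM-kampen", {"-"}, str.isalpha))
```

Under the isalnum fallback, "cat2" becomes one token, which is the case where a previously analysed word could turn into an unknown unless compounding handles digits.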

AMR-KELEG changed the title from "Strange output when transducer is compiled from an .att file" to "Strange output when transducer is compiled from a .att file" on Mar 21, 2019

AMR-KELEG added a commit to AMR-KELEG/lttoolbox that referenced this issue May 10, 2019

Fix the out of alphabet token handling in analyses generation

5c8ec17

Solves apertium#45 Consider alphanumeric characters to be part of the vocabulary.

AMR-KELEG mentioned this issue May 10, 2019

Fix the out of alphabet token handling in analyses generation #52

Merged

TinoDidriksen pushed a commit that referenced this issue May 11, 2019

Fix the out of alphabet token handling in analyses generation

944ed25

Solves #45 Consider alphanumeric characters to be part of the vocabulary.

AMR-KELEG added a commit to AMR-KELEG/lttoolbox that referenced this issue Jun 20, 2019

Fix the out of alphabet token handling in analyses generation

3129368

Solves apertium#45 Consider alphanumeric characters to be part of the vocabulary.
