Strange output when transducer is compiled from a .att file #45

Open
AMR-KELEG opened this issue Mar 21, 2019 · 10 comments

Comments

@AMR-KELEG
Contributor

I am trying to add weights to the morphological analyser.
While checking last year's project (http://wiki.apertium.org/wiki/User:Techievena/GSoC_2018_Work_Product_Submission), I noticed that the analyser's output isn't correct (as far as I understand it).
The wiki suggests the following steps:

$ cat test.att
0	1	c	c	4.567895
1	2	a	a	0.989532
2	3	t	t	2.796193
3	4	@0@	+	-3.824564
4	5	@0@	n	1.824564
5	0.525487
4	5	@0@	v	2.845989
 
$ lt-comp lr test.att test.bin 
main@standard 6 6
 
$ lt-print test.bin
0	1	c	c	4.567895	
1	2	a	a	0.989532	
2	3	t	t	2.796193	
3	4	ε	+	-3.824564	
4	5	ε	n	1.824564	
4	5	ε	v	2.845989	
5	0.525487
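Assuming the usual tropical-semiring reading of ATT weights (weights summed along a path, as in HFST/OpenFst; this reading is an assumption, not something the wiki states), the two analyses above would get these total path weights:

cat+n: 4.567895 + 0.989532 + 2.796193 + (-3.824564) + 1.824564 + 0.525487 = 6.879107
cat+v: 4.567895 + 0.989532 + 2.796193 + (-3.824564) + 2.845989 + 0.525487 = 7.900532

where the trailing 0.525487 is the final weight of state 5.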

However, the output of the transducer is a bit strange:

$ echo "cats" | lt-proc test.bin
^cat/cat+n/cat+v$s

Shouldn't the $ sign mark the end of the analysis? Why is there an s after it?

@AMR-KELEG changed the title from "Strange output when transducer is compiled from an .att file" to "Strange output when transducer is compiled from a .att file" on Mar 21, 2019
@unhammer
Member

unhammer commented Mar 22, 2019 via email

@AMR-KELEG
Contributor Author

I have checked the environment variables and here are the results.
The $LC_ALL environment variable isn't set.

$ echo $LANG
en_US.UTF-8

$ locale -a
ar_AE.utf8
ar_BH.utf8
ar_DZ.utf8
ar_EG.utf8
ar_IN
ar_IN.utf8
ar_IQ.utf8
ar_JO.utf8
ar_KW.utf8
ar_LB.utf8
ar_LY.utf8
ar_MA.utf8
ar_OM.utf8
ar_QA.utf8
ar_SA.utf8
ar_SD.utf8
ar_SS
ar_SS.utf8
ar_SY.utf8
ar_TN.utf8
ar_YE.utf8
C
C.UTF-8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IL
en_IL.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
es_SV.utf8
POSIX

@unhammer
Member

unhammer commented Mar 22, 2019 via email

@AMR-KELEG
Contributor Author

@unhammer I have tried setting both the LANG and LC_ALL environment variables, yet the problem persists.

@AMR-KELEG
Contributor Author

> Your analyser doesn't include an analysis for cats, just for cat. Unanalyzed and non-alphabetic symbols are just output as blanks (whitespace etc.) without surrounding ^$. In regular .dix files, you can define an <alphabet> containing the alphabetic symbols. If a symbol is alphabetic, it can't be output as a blank, so if 's' were in <alphabet> in a .dix file, you'd actually get ^cats/*cats$.
>
> I thought the .att compiler actually had a heuristic for alphabetics using something like isalpha() or similar. So this might be a locale issue, if isalpha uses locale information to decide whether a certain character is alphabetic. What do you get from echo $LC_ALL, echo $LANG, and locale -a?
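A toy C++ sketch of the behaviour described in the quote (hypothetical code for illustration, not lttoolbox's actual implementation; the process function and the hard-coded '*' lookup are made up):

#include <iostream>
#include <set>
#include <string>

// Toy model: alphabetic characters are always part of a token; anything
// else is passed through as a blank. Real dictionary lookup is omitted,
// so unknown tokens are just echoed with a '*'.
std::string process(const std::string& input, const std::set<char>& alphabet) {
    std::string out, token;
    auto flush = [&]() {
        if (!token.empty()) {
            out += "^" + token + "/*" + token + "$";
            token.clear();
        }
    };
    for (char c : input) {
        if (alphabet.count(c)) {
            token += c;   // alphabetic: extend the current token
        } else {
            flush();      // non-alphabetic: close the token...
            out += c;     // ...and emit the character as a blank
        }
    }
    flush();
    return out;
}

int main() {
    // With 's' in the alphabet, "cats" is one (unknown) token:
    std::cout << process("cats", {'c', 'a', 't', 's'}) << "\n";  // ^cats/*cats$
    // Without 's', the token ends after "cat" and 's' falls through,
    // which is exactly the ^cat/...$s effect reported above:
    std::cout << process("cats", {'c', 'a', 't'}) << "\n";       // ^cat/*cat$s
}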

Actually, the alphabet is represented by a set of the unique characters in the .att file:
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/fst_processor.cc#L844

And a character is considered to be part of the alphabet only if it exists in that set:
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/fst_processor.cc#L838
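In other words, something like this simplified sketch of the logic at those two lines (a paraphrase, not the actual lttoolbox code):

#include <set>

std::set<wchar_t> alphabetic_chars;  // filled with every character seen in the .att file

bool isAlphabetic(wchar_t c) {
    // A character counts as alphabetic only if it occurred somewhere in
    // the compiled transducer; 's' never does in the cat/cats example.
    return alphabetic_chars.find(c) != alphabetic_chars.end();
}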

Should we update the isAlphabetic method?

AMR-KELEG added a commit to AMR-KELEG/lttoolbox that referenced this issue May 10, 2019
Solves apertium#45
Consider alphanumeric characters to be part of the vocabulary.
TinoDidriksen pushed a commit that referenced this issue May 11, 2019
Solves #45
Consider alphanumeric characters to be part of the vocabulary.
@AMR-KELEG
Contributor Author

We may also need to use the version of the function that accepts a locale as a parameter:
http://www.cplusplus.com/reference/locale/isalpha/
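For example (a sketch only; it assumes the en_US.UTF-8 locale is installed on the system):

#include <iostream>
#include <locale>

int main() {
    // Constructing a named locale throws std::runtime_error if it is
    // not installed, so this assumes en_US.UTF-8 is available.
    std::locale loc("en_US.UTF-8");
    // The two-argument std::isalpha from <locale> classifies characters
    // according to the given locale instead of the global "C" locale.
    std::wcout << std::boolalpha
               << std::isalpha(L'a', loc) << "\n"   // true
               << std::isalpha(L'я', loc) << "\n"   // true in a typical UTF-8 locale
               << std::isalpha(L'5', loc) << "\n";  // false
}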

@flammie
Member

flammie commented May 13, 2019

I think it's not clear what the best way of handling unknown alphabets is; a locale-aware isalpha is probably better than the current approach, though. One step further would be something like the Unicode technical reports on word-breaking / line-breaking / text segmentation etc.; perhaps ICU has an implementation of those. For Latin- and Cyrillic-based languages this is mostly relevant in the early stages of development of an analyser, or with toy examples, as any non-trivial dictionary will end up covering all non-marginal characters.
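For instance, ICU's character classification works on full Unicode code points and is locale-independent; a minimal sketch (assumes the ICU headers are installed and the program is linked with -licuuc):

#include <iostream>
#include <unicode/uchar.h>

int main() {
    // u_isalpha classifies a code point by its Unicode properties,
    // with no dependence on the process's locale settings.
    UChar32 chars[] = {U'a', U'я', U'ع', U'5', U'-'};
    for (UChar32 c : chars) {
        std::cout << "U+" << std::hex << c << std::dec
                  << " alphabetic: " << std::boolalpha
                  << static_cast<bool>(u_isalpha(c)) << "\n";
    }
}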

@unhammer
Member

It's also relevant for developed analysers using lt-proc -e, since only unknown tokens get compound analyses (we used to analyse VM-kampen as three tokens in nno-nob, but now we have - in <alphabet>, so that it first becomes one big unknown token which is then compound-analysed).

@flammie
Member

flammie commented May 14, 2019

> It's also relevant for developed analysers using lt-proc -e, since only unknown tokens get compound analyses (we used to analyse VM-kampen as three tokens in nno-nob, but now we have - in <alphabet>, so that it first becomes one big unknown token which is then compound-analysed).

Ah, I haven't set compounds up properly yet for my languages, but I guess even for that case isalnum is a gooder solution than not?

@unhammer
Member

isalpha is a gooder solution at least, but for isalnum you have to make sure compounding works correctly on numbers in all directions, or things that used to get translated will become unknowns.

AMR-KELEG added a commit to AMR-KELEG/lttoolbox that referenced this issue Jun 20, 2019
Solves apertium#45
Consider alphanumeric characters to be part of the vocabulary.