Strange output when transducer is compiled from a .att file #45
Your analyser doesn't include an analysis for cats, just for
cat. Unanalyzed and non-alphabetic symbols are just output as blanks
(whitespace etc.) without surrounding ^$.
In regular dix files, you can define an <alphabet> containing the
alphabetic symbols. If a symbol is alphabetic, it can't be output as a
blank, so if 's' were in <alphabet> in a .dix file, you'd actually get
^cats/*cats$.
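To make the difference concrete, here is a toy sketch in Python. It is not lttoolbox's actual algorithm; the function name `analyse`, the dictionary-based lexicon, and the `<n>` tag are all assumptions made up for illustration. It only mimics the effect described above: a token stops at the first non-alphabetic symbol, and whatever is left over is passed through as a blank.

```python
def analyse(text, lexicon, alphabet):
    """Toy analyser: NOT lttoolbox's real algorithm, just the visible effect."""
    out = []
    i = 0
    while i < len(text):
        if text[i] not in alphabet:
            # Non-alphabetic symbols are passed through as blanks, without ^...$
            out.append(text[i])
            i += 1
            continue
        # A token is a maximal run of alphabetic symbols
        j = i
        while j < len(text) and text[j] in alphabet:
            j += 1
        word = text[i:j]
        if word in lexicon:
            out.append(f"^{word}/{lexicon[word]}$")
        else:
            # Unknown words are marked with a leading '*'
            out.append(f"^{word}/*{word}$")
        i = j
    return "".join(out)


# Hypothetical one-entry lexicon; the tag <n> is an assumption.
lexicon = {"cat": "cat<n>"}

# 's' is not in the alphabet, so it ends up after the closing $
print(analyse("cats", lexicon, set("cat")))
# 's' is in the alphabet, so the whole word is one unknown token
print(analyse("cats", lexicon, set("cats")))
```

With 's' missing from the alphabet the stray `s` lands after the `$`; with 's' included the whole word comes out as a single unknown.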
I thought the .att compiler actually had a heuristic for alphabetics
using something like isalpha() or similar. So this *might* be a locale
issue, if isalpha uses locale information to say that a certain
character is alphabetic. What do you get from
$ echo $LC_ALL
$ echo $LANG
$ locale -a
?
I have checked the values of the environment variables and here are the results:
Amr Mohamed <notifications@github.com> wrote:
I have checked the values of the environment variables and here are the results:
The `$LC_ALL` environment variable isn't set.
```
$ echo $LANG
en_US.UTF-8
```
This doesn't appear in your list of `locale -a`. Could you try again
with LANG set to one from your list, e.g.
export LANG=C.UTF-8
and recompile+test. And if that doesn't help, try also
export LC_ALL=C.UTF-8
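
If it helps, a quick way to check whether a locale name can actually be activated on the machine is to try setting it. This is a minimal Python sketch, not part of lttoolbox; the helper name `locale_is_usable` is made up for illustration, and it relies only on the standard `locale` module, which raises `locale.Error` for names the system does not provide.

```python
import locale


def locale_is_usable(name):
    """Return True if the C library accepts `name` for LC_CTYPE."""
    # Remember the current setting so the check has no lasting side effect
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        # Raised when the locale is not installed or the name is malformed
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


# The names below are the ones mentioned in this thread
for candidate in ("C", "C.UTF-8", "en_US.UTF-8"):
    print(candidate, locale_is_usable(candidate))
```

A name that prints `False` here is one that the character classification functions cannot use either.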
@unhammer I have tried setting both the LANG and the LC_ALL environment variables, yet the problem still exists.
Actually, the alphabet is represented by the set of unique characters in the .att file: And a character is considered to be part of the alphabet if it exists in that set: Should we update the isAlphabetic method?
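To illustrate the question, here is a rough Python sketch of the two behaviours. The function names and the `att_symbols` set are hypothetical and are not the lttoolbox source; Python's `str.isalnum` is used as a stand-in for a C-style alphanumeric check and follows Unicode properties rather than the C locale.

```python
def is_alphabetic_strict(ch, att_symbols):
    """Current behaviour as described: only symbols seen in the .att file count."""
    return ch in att_symbols


def is_alphabetic_relaxed(ch, att_symbols):
    """Possible update: also accept any alphanumeric character."""
    return ch in att_symbols or ch.isalnum()


# Symbols collected from a hypothetical .att file that only contains "cat"
att_symbols = set("cat")

print(is_alphabetic_strict("s", att_symbols))   # 's' was never seen in the file
print(is_alphabetic_relaxed("s", att_symbols))  # 's' is alphanumeric
print(is_alphabetic_relaxed(" ", att_symbols))  # whitespace stays a blank
```

Under the relaxed check, the trailing `s` of an unseen word would be treated as part of the token instead of being output as a blank.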
Solves apertium#45 Consider alphanumeric characters to be part of the vocabulary.
We may also need to use the version of the function that accepts a locale as a parameter:
I think it's not clear what the best way of handling unknown alphabets is; a locale-aware isalpha is probably better than the current approach, though. One step further would be to follow some of the Unicode technical reports on word breaking, line breaking, text segmentation, etc.; perhaps ICU implements some of them. For Latin- and Cyrillic-based languages this is mostly relevant in the early stages of developing an analyser, or with toy examples, as any non-trivial dictionary will end up covering all non-marginal characters.
It's also relevant for developed analysers using
Ah, I haven't set compounds up properly yet for my languages, but I guess even for that case isalnum is a better solution than not?
isalpha is a better solution at least, but for isalnum you have to make sure compounding works correctly on numbers in all directions, or things that used to get translated will become unknowns.
I am trying to add weights to the morphological analyser.
So while I was checking last year's project (http://wiki.apertium.org/wiki/User:Techievena/GSoC_2018_Work_Product_Submission) I noticed that the output of the analyser isn't correct (according to my understanding).
The wiki suggests that to do so I will need to:
However, the output of the transducer is a bit strange:
Shouldn't the $ sign mark the end of the analysis? Why is there an s following the $ sign?