Implement support for getting plural categories #167

filips123 · 2020-12-16T09:35:43Z

Description

This implementes support for getting plural categories and prepared for future full pliralization functions. It also changes how _translate_word works to make it possible to get correct plural forms in nice_duration.

I added those functions to new utils module, because I think they don't fit completely into existing parse and format. If you think adding them to existing modules would be better, I can move them. Because of #160 I added empty utils modules for for all other languages.

Format of translations in res/text is now changed to use JSON files that contain all translations for all plural forms. However, I kept existing format in most languages (except English and Slovenian) and added "legacy" translation function. That languages need to be converted in the future and legacy function can be deleted.

Currently the only languages that implement plural categories are English and Slovenian, but there is a fallback to English-style pluralization for all other languages for backwards compatibility.

Type of PR

Testing

There are tests for English and Slovenian language.

Documentation

I added/fixed docstrings where needed and created documentation in readme and project structure files.

JarbasAl · 2020-12-16T09:40:13Z

i think this belongs in format module, dont think we should create a new utils submodule for this

essentially if it outputs text -> formatting
if i receives text and extracts info from it -> parsing

this clearly belongs in the format module

i would also prefer if this came from a single dict, if we create a .json file for each category it quickly gets out of hand, connectors.json only has 2 entries for example.... I would suggest this becomes a simple pluralization.json file, if you feel the need add categories as top level keys inside that single json file

a .json file also does not scale well, so logic should be available, but thats for another future PR #37 #36

filips123 · 2020-12-17T18:13:04Z

I moved functions into format module and all translations into translations.json.

However, there is some problem. Some tests in test/test_format_sl.py fail because it looks like lingua-franca does not detect get_plural_category_sl in lingua_franca/lang/format_sl.py. This also happens locally, but if I just run that specific file with pytest test/test_format_sl.py, everything works.

I also added encoding="utf8" to all file open calls, because Python by default uses system encoding which is not always UTF-8 (for example on Windows) so tests can fail.

ChanceNCounter · 2020-12-17T18:43:29Z

There's some boilerplate at the top of the other tests, which is required. Your tests pass once you paste it in there. It's this part:

class TestPronounceNumber(unittest.TestCase):
    def setUp(self):
        self.old_lang = get_default_lang()
        set_default_lang("sl-si")


    def tearDown(self):
        if self.old_lang:
            set_default_lang(self.old_lang)

Basically, there's no way to guarantee the order in which pytest will run tests. In order to be absolutely, 100% certain that the language you're testing is the default language at the moment the test runs, we set and then reset it in each test. That is, when a given test class is finished, the default language goes back to what it had been before the test ran. It's not a perfect solution, but it keeps the paradigm from breaking Pytest.

Hence, at the moment, with that boilerplate missing, your new tests are trying to run with some other default language, one in which the function really hasn't been localized.

In a perfect world, we'd just unload all languages after every test, so that this problem would throw a slightly more straightforward "no language module loaded." However, that didn't resolve the problem when I was writing the localizer. I should revisit that...

ChanceNCounter · 2020-12-17T21:55:19Z

lingua_franca/lang/format_en.py

+            return "other"
+
+    elif type == "ordinal":
+        if amount % 10 == 1 and amount % 100 != 11:


It's not the end of the world, but if you store amount % 10 and amount % 100 and use variables for these comparisons, it'll shave several operations off of the later cases.

ChanceNCounter · 2020-12-17T21:57:45Z

Looks like some of the English and Slovenian res files got nuked when you renamed that file (hello, fellow regex person! 😁)

JarbasAl · 2020-12-18T12:36:11Z

nitpick, can translations.json be renamed to something more explicit? maybe pluralizations.json?

ChanceNCounter · 2020-12-20T23:41:07Z

Now having more time to review, I'm not sure pluralizations is a better name, because it now also houses the singlar forms of critical vocabulary.

Perhaps something like common_words.json or similar (I'm not exactly married to that, either =P) and the localization docs can just make it explicit that this contains important vocabulary singular and plural forms.

However, (@JarbasAl @jmontane) I wonder what implications this approach might have down the road for languages that need the "variant" framework.

ChanceNCounter · 2020-12-30T22:40:13Z

@JarbasAl bump re: compatibility with variants, possible need for integration (plural variants?)

emphasize · 2022-07-13T21:35:20Z

Let's chime in.

As is you'd need to pass an english singular to get the localized plural. I guess there are some good cross-language arguments.
Lets say you want to pluralize some skill vocabulary (native language) - you're out of luck

The local function isn't used widely, so not a deep code. Why not step away from _translate_word? i get it, as a mycroft-core component, but LF?

Please tell me if i'm missing something.

filips123 · 2022-07-14T11:10:43Z

This PR has been inactive for quite some time, so I kinda forgot about it, but I think it would still be useful. I think one of the things that weren't determined yet was how to name pluralizations.json/translations.json. If that can be resolved and maintainers think the PR would be useful, I can try to rebase it and resolve conflicts.

The local function isn't used widely, so not a deep code. Why not step away from _translate_word?

_translate_word is only used internally from nice_duration and join_list to translate (and correctly pluralize) a few words that are used in these functions. It is not meant to be used by external skills.

A similar function to translate words already exists (and has been renamed to _translate_word_legacy in this PR), but only supports simple singular/plural form, so it does not work for languages with more complex plural forms. With this PR, _translate_word should now work with most languages, using Unicode CLDR Plural Rules.

Lets say you want to pluralize some skill vocabulary (native language)

In this case, you should use get_plural_category. You will get Unicode CLDR plural category and will then need to handle it according to your language's rules. You would probably have a dictionary of plural forms for all words and languages you need, and then just lookup it using category you get from get_plural_category.

There is also get_plural_form function, which is meant to get the correct plural form of any (native) word. However, because proper pluralization for arbitrary words is hard to do, it probably won't actually be implemented for any language for quite some time. So, you should probably use get_plural_category and do pluralization yourself, only for words you actually need.

i get it, as a mycroft-core component, but LF?

If you meant why _translate_word is in LF, it's because it's only an internal function used by other LF functions, so it wouldn't make sense it's in mycroft-core.

If you meant why get_plural_category and get_plural_form are in LF, it's because I think pluralization counts under "multilingual text parsing and formatting".

krisgesling · 2022-07-15T00:40:11Z

Hey there, I haven't looked at this deeply but definitely agree with the premise.

We're very close to launching the Mark II so keeping our focus quite specifically on that, but I'd like to come back to this once we hit code freeze on the Mark II.

I won't try to present an opinion on the questions raised until I've properly digested this. I would however say that I'm not opposed to more modules in LF beyond parse and format. Where the functions from this PR live is another question but very open to a util module if that makes more sense for this.

JarbasAl · 2022-07-15T02:20:44Z

I approve these changes and cherry picked these commits for ovos, so far i only did minor changes like adding an enum and porting the old singularize/pluralize prs for english/portuguese, but i still want to change a few more things like renaming "type" to something else to avoid the namespace collision with the builtin type etc.

might be interesting to keep an eye on OpenVoiceOS#28 , for what its worth i think this PR is looking great already!

devs-mycroft added the CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors) label Dec 16, 2020

This was referenced Dec 16, 2020

singularize/pluralize #36

Closed

Feat/plural singular pt #37

Closed

filips123 force-pushed the pluralization-support branch from 8e1fc52 to 9174cce Compare December 17, 2020 18:05

filips123 added 3 commits December 17, 2020 20:38

Implement support for getting plural categories

367d851

Add new format for duration translations

101389b

Add documentation about pluralization

c28ffe1

filips123 force-pushed the pluralization-support branch from 9174cce to c28ffe1 Compare December 17, 2020 19:38

ChanceNCounter reviewed Dec 17, 2020

View reviewed changes

Rename translations to pluralizations

5ef8970

NeonJarbas mentioned this pull request Jul 13, 2022

Feat/pluralization OpenVoiceOS/ovos-lingua-franca#28

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement support for getting plural categories #167

Implement support for getting plural categories #167

filips123 commented Dec 16, 2020

JarbasAl commented Dec 16, 2020 •

edited

Loading

filips123 commented Dec 17, 2020

ChanceNCounter commented Dec 17, 2020

ChanceNCounter Dec 17, 2020

ChanceNCounter commented Dec 17, 2020

JarbasAl commented Dec 18, 2020

ChanceNCounter commented Dec 20, 2020

ChanceNCounter commented Dec 30, 2020

emphasize commented Jul 13, 2022 •

edited

Loading

filips123 commented Jul 14, 2022

krisgesling commented Jul 15, 2022

JarbasAl commented Jul 15, 2022

Implement support for getting plural categories #167

Are you sure you want to change the base?

Implement support for getting plural categories #167

Conversation

filips123 commented Dec 16, 2020

Description

Type of PR

Testing

Documentation

JarbasAl commented Dec 16, 2020 • edited Loading

filips123 commented Dec 17, 2020

ChanceNCounter commented Dec 17, 2020

ChanceNCounter Dec 17, 2020

Choose a reason for hiding this comment

ChanceNCounter commented Dec 17, 2020

JarbasAl commented Dec 18, 2020

ChanceNCounter commented Dec 20, 2020

ChanceNCounter commented Dec 30, 2020

emphasize commented Jul 13, 2022 • edited Loading

filips123 commented Jul 14, 2022

krisgesling commented Jul 15, 2022

JarbasAl commented Jul 15, 2022

JarbasAl commented Dec 16, 2020 •

edited

Loading

emphasize commented Jul 13, 2022 •

edited

Loading