Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement support for getting plural categories #167

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

filips123
Copy link
Contributor

Description

This implementes support for getting plural categories and prepared for future full pliralization functions. It also changes how _translate_word works to make it possible to get correct plural forms in nice_duration.

I added those functions to new utils module, because I think they don't fit completely into existing parse and format. If you think adding them to existing modules would be better, I can move them. Because of #160 I added empty utils modules for for all other languages.

Format of translations in res/text is now changed to use JSON files that contain all translations for all plural forms. However, I kept existing format in most languages (except English and Slovenian) and added "legacy" translation function. That languages need to be converted in the future and legacy function can be deleted.

Currently the only languages that implement plural categories are English and Slovenian, but there is a fallback to English-style pluralization for all other languages for backwards compatibility.

Type of PR

  • Bugfix
  • Feature implementation
  • Refactor of code (without functional changes)
  • Documentation improvements
  • Test improvements

Testing

There are tests for English and Slovenian language.

Documentation

I added/fixed docstrings where needed and created documentation in readme and project structure files.

@devs-mycroft devs-mycroft added the CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors) label Dec 16, 2020
@JarbasAl
Copy link
Collaborator

JarbasAl commented Dec 16, 2020

i think this belongs in format module, dont think we should create a new utils submodule for this

essentially if it outputs text -> formatting
if i receives text and extracts info from it -> parsing

this clearly belongs in the format module

i would also prefer if this came from a single dict, if we create a .json file for each category it quickly gets out of hand, connectors.json only has 2 entries for example.... I would suggest this becomes a simple pluralization.json file, if you feel the need add categories as top level keys inside that single json file

a .json file also does not scale well, so logic should be available, but thats for another future PR #37 #36

@filips123
Copy link
Contributor Author

I moved functions into format module and all translations into translations.json.

However, there is some problem. Some tests in test/test_format_sl.py fail because it looks like lingua-franca does not detect get_plural_category_sl in lingua_franca/lang/format_sl.py. This also happens locally, but if I just run that specific file with pytest test/test_format_sl.py, everything works.

I also added encoding="utf8" to all file open calls, because Python by default uses system encoding which is not always UTF-8 (for example on Windows) so tests can fail.

@ChanceNCounter
Copy link
Contributor

There's some boilerplate at the top of the other tests, which is required. Your tests pass once you paste it in there. It's this part:

class TestPronounceNumber(unittest.TestCase):
    def setUp(self):
        self.old_lang = get_default_lang()
        set_default_lang("sl-si")


    def tearDown(self):
        if self.old_lang:
            set_default_lang(self.old_lang)

Basically, there's no way to guarantee the order in which pytest will run tests. In order to be absolutely, 100% certain that the language you're testing is the default language at the moment the test runs, we set and then reset it in each test. That is, when a given test class is finished, the default language goes back to what it had been before the test ran. It's not a perfect solution, but it keeps the paradigm from breaking Pytest.

Hence, at the moment, with that boilerplate missing, your new tests are trying to run with some other default language, one in which the function really hasn't been localized.

In a perfect world, we'd just unload all languages after every test, so that this problem would throw a slightly more straightforward "no language module loaded." However, that didn't resolve the problem when I was writing the localizer. I should revisit that...

return "other"

elif type == "ordinal":
if amount % 10 == 1 and amount % 100 != 11:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the end of the world, but if you store amount % 10 and amount % 100 and use variables for these comparisons, it'll shave several operations off of the later cases.

@ChanceNCounter
Copy link
Contributor

Looks like some of the English and Slovenian res files got nuked when you renamed that file (hello, fellow regex person! 😁)

@JarbasAl
Copy link
Collaborator

nitpick, can translations.json be renamed to something more explicit? maybe pluralizations.json?

@ChanceNCounter
Copy link
Contributor

Now having more time to review, I'm not sure pluralizations is a better name, because it now also houses the singlar forms of critical vocabulary.

Perhaps something like common_words.json or similar (I'm not exactly married to that, either =P) and the localization docs can just make it explicit that this contains important vocabulary singular and plural forms.

However, (@JarbasAl @jmontane) I wonder what implications this approach might have down the road for languages that need the "variant" framework.

@ChanceNCounter
Copy link
Contributor

@JarbasAl bump re: compatibility with variants, possible need for integration (plural variants?)

@emphasize
Copy link
Contributor

emphasize commented Jul 13, 2022

Let's chime in.

As is you'd need to pass an english singular to get the localized plural. I guess there are some good cross-language arguments.
Lets say you want to pluralize some skill vocabulary (native language) - you're out of luck

The local function isn't used widely, so not a deep code. Why not step away from _translate_word? i get it, as a mycroft-core component, but LF?

Please tell me if i'm missing something.

@filips123
Copy link
Contributor Author

This PR has been inactive for quite some time, so I kinda forgot about it, but I think it would still be useful. I think one of the things that weren't determined yet was how to name pluralizations.json/translations.json. If that can be resolved and maintainers think the PR would be useful, I can try to rebase it and resolve conflicts.

The local function isn't used widely, so not a deep code. Why not step away from _translate_word?

_translate_word is only used internally from nice_duration and join_list to translate (and correctly pluralize) a few words that are used in these functions. It is not meant to be used by external skills.

A similar function to translate words already exists (and has been renamed to _translate_word_legacy in this PR), but only supports simple singular/plural form, so it does not work for languages with more complex plural forms. With this PR, _translate_word should now work with most languages, using Unicode CLDR Plural Rules.

Lets say you want to pluralize some skill vocabulary (native language)

In this case, you should use get_plural_category. You will get Unicode CLDR plural category and will then need to handle it according to your language's rules. You would probably have a dictionary of plural forms for all words and languages you need, and then just lookup it using category you get from get_plural_category.

There is also get_plural_form function, which is meant to get the correct plural form of any (native) word. However, because proper pluralization for arbitrary words is hard to do, it probably won't actually be implemented for any language for quite some time. So, you should probably use get_plural_category and do pluralization yourself, only for words you actually need.

i get it, as a mycroft-core component, but LF?

If you meant why _translate_word is in LF, it's because it's only an internal function used by other LF functions, so it wouldn't make sense it's in mycroft-core.

If you meant why get_plural_category and get_plural_form are in LF, it's because I think pluralization counts under "multilingual text parsing and formatting".

@krisgesling
Copy link
Contributor

Hey there, I haven't looked at this deeply but definitely agree with the premise.

We're very close to launching the Mark II so keeping our focus quite specifically on that, but I'd like to come back to this once we hit code freeze on the Mark II.

I won't try to present an opinion on the questions raised until I've properly digested this. I would however say that I'm not opposed to more modules in LF beyond parse and format. Where the functions from this PR live is another question but very open to a util module if that makes more sense for this.

@JarbasAl
Copy link
Collaborator

I approve these changes and cherry picked these commits for ovos, so far i only did minor changes like adding an enum and porting the old singularize/pluralize prs for english/portuguese, but i still want to change a few more things like renaming "type" to something else to avoid the namespace collision with the builtin type etc.

might be interesting to keep an eye on OpenVoiceOS#28 , for what its worth i think this PR is looking great already!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants