Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement gettext plurals for PO files #677

Merged
merged 16 commits into from
Dec 7, 2020
Merged

Conversation

iStefo
Copy link
Contributor

@iStefo iStefo commented Apr 23, 2020

Hi & thanks for your work on this project!
While looking for a new localization solutions for our React frontend, we found this project and liked the approach. Our goal is to use po files to leverage comments for providing context for translators (and our future selves…).
We started with the @next branch since that's the natual way to go forward. Despite some rough edges regarding the documentation and typescript typings, we were very happy with the functionality of this solution.

Problem

Our only major pain point (besides the removal of the /* i18n: Comment */t("id")`Translated string`; syntax from version 2.x ;)) is that the po file format does not support the "native" plurals with msgstr[0], msgstr[1] etc.
Unfortunately, we were not able to find an online translation service that was able to handle ICU plural strings when embedded in po files. The situation may be different with JSON documents, but then we'd loose comment functionality.

We noticed that other user have already encountered the same problem, unfortunately without an obvious solution: #595 & #82 (where it is mentioned that there are services that understand ICU, but as I said, we weren't able to find any that support ICU in PO)

Approach

After some experimenting, we decided to give a try at implementing rudimentary ICU -> PO -> ICU transformation for the po format. This PR shows the modifications that were required.
In short, the new code transforms the following ICU into the corresponding PO plurals (and back):

{
    "message_id": {
        "message": "{count, plural, one {Singular Case} two {Case for two} other {Case for {count} other}}"
    }
}
msgctxt "pluralize_on=count"
msgid "message_id"
msgid_plural "message_id"
msgstr[0] "Singular Case"
msgstr[1] "Case for two"
msgstr[2] "Case for {count} other"

To be discussed

1. What to use as msgid_plural?

Currently, the code uses msgid_plural to tell the world that a message is pluralized.
When a custom ID is used, it uses the same ID as msgid_plural.

As of right now, for automatic IDs (where the message is also its ID), the last plural case taken from the ICU message format is used as msgid_plural, but I don't know if that's the right decision as it reads somewhat strange when the regular msgid is the full ICU message and the msgid_plural is only an excerpt from that.

2. Is msgctxt the right place to store pluralization key?

Since the ICU format is more powerful than PO plurals, we need to somehow remember the placeholder used to pluralize the message to transform the message back to the ICU format.
It might be possible to parse it from the translations, but not in all cases. (Might not be used in all/any translation, multiple placeholders might be used but we can't know which is the one for localization...)

As a workaround, I've chosen to store the pluralization key in the PO item's msgctxt. What do you think about that? Should we rather use a special comment line for that purpose?

3. How to opt in or out of this process?

While we think that this new way of using PO plurals is "more correct" than embedding ICU messages in PO, it might not be suitable for everyone and brake the workflow for people with functioning translation tools that understand ICU within PO (whatever tools that are).

What do you think would be good way to integrate this? As a new format (po_with_plurals?)
As the default and the old po parsers is then referred to as po_icu?

Summary

If you decide that we can go forward with this I'll gladly add tests for the new warning outputs that have not been automatically tested yet. The transforms themselves are covered in two added test cases as you can see in the diff.

We'd like to hear what you think about these changes and whether you could see them land in lingui.

@vercel
Copy link

vercel bot commented Apr 23, 2020

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/lingui-js/js-lingui/aunxbx0i5
✅ Preview: https://js-lingui-git-next.lingui-js.vercel.app

@codecov
Copy link

codecov bot commented Apr 23, 2020

Codecov Report

Merging #677 (d775589) into main (fc8c628) will increase coverage by 0.22%.
The diff coverage is 85.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #677      +/-   ##
==========================================
+ Coverage   82.89%   83.12%   +0.22%     
==========================================
  Files          51       52       +1     
  Lines        1450     1570     +120     
  Branches      400      425      +25     
==========================================
+ Hits         1202     1305     +103     
- Misses        146      157      +11     
- Partials      102      108       +6     
Impacted Files Coverage Δ
packages/cli/src/api/formats/index.ts 60.00% <ø> (ø)
packages/cli/src/api/formats/po-gettext.ts 85.83% <85.83%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc8c628...d775589. Read the comment docs.

@tricoder42
Copy link
Contributor

This is good 👍 I understand you problem, although I don't have capacity to answer any questions right now. I'll get back to you in few days, hopefully after v3 is released.

@iStefo
Copy link
Contributor Author

iStefo commented Apr 26, 2020

Thanks. Only issue I see is that (depending on how developers would opt-in or -out of this change) it could be a breaking change that should be considered before releasing v3, right?
Otherwise we're bound to enabling this in a backwards compatible way, which would also work, of course...

Another thing: PO files don't support special casing for certain values (_0=…) which should at least raise a warning (not implemented yet). Once you get to it, we can discuss whether there is a way around that limitation or not.

@tricoder42
Copy link
Contributor

I would definitely make this feature opt-in. I always considered gettext as a message format. PO file format comes definitely hand in hand with gettext, but when you omit plural rules, you can use it as a file format with any other message format.

The problem is that gettext is just a subset of ICU message format. Plurals are simplified - there can be only one plural per message and source locale must use two plural forms (e.g. in Czech we have 3 plural forms and therefore I can't use it as a source locale). Other formatters are missing - select, selectOrdinal. So, the conversion from ICU to gettext isn't "lossless" in general case. On the other hand I completely understand that for lot users the gettext features are good enough. After all, gettext was and is still used a lot.

Now question is how far we want to get. I believe the simplest solution would be a different catalog format, e.g. po-gettext. That format would convert ICU to gettext on lingui extract and compile gettext on lingui compile. That way the core stays the same and the hard work happens in CLI. We just need to take care of edge cases and show an error, e.g. if you use any feature which isn't supported by gettext.

What do you think?

@iStefo
Copy link
Contributor Author

iStefo commented Apr 27, 2020

Thanks for your input. I really haven't considered the difference between po and gettext, but I think you're right with what you say.

Regarding Plurals: I haven't used a language as source language with more than two plurals. As long as you use messages-as-keys, you're probably right (since there only ismsgid and msgid_plural). With custom IDs one should be fine I guess...
I'm not sure what you mean regarding "there can only be one plural per message"? Do you mean the dependent variable that is used for pluralization? How would an ICU mesasge look that depends on two variables?

Select, selectOrdinal: I haven't had a case where I needed one of those (and gettext as well as our editor of choice, poeditor.com, don't support them) and I guess it would be problematic to implement anyways. Maybe one could generate additional messages with the values appended ("mykey_valueX"...), but that would need deep integration into catalog handling – much more than simply converting between message formats.

I'm fine with using the suggested po-gettext format and completely decouple the formatter from the current po formatter. I will adjust my implementation in this regard.
Raising errors for unsupported features is fine for now until we happen to find a way to work around the limitations.
Thanks again for your time!

@tricoder42
Copy link
Contributor

I'm not sure what you mean regarding "there can only be one plural per message"? Do you mean the dependent variable that is used for pluralization? How would an ICU mesasge look that depends on two variables?

I have {countBooks, plural, one {# book} other {# books}} and {countBeer, plural, one {# beer} other {# beers}}

I'm not saying this is very common usecase, but something we need to consider when adding support for gettext.

The same applies for select and selectOrdinal - simply throw an error explaining that you can't use them with gettext. We can't cover all usecases anyway.

As for message ids: With custom ids I would simply use _plural suffix. With generated ids it would be great if we could "explode" icu message into singular/plural forms:

- msgid "{count, plural, one {I have # book} other {I have # books}}"
+ msgid "I have # book"
+ msgid_plural "I have # books"

Not sure if msgctxt "pluralize_on=count" is the best place to keep information about placeholders, but we definitely need to put it somewhere. I'm open to any suggestion. Feel free to use msgctx if we don't figure out anything better. msgctx is usually used to distinguish messages with same msgid but different meaning.

@tricoder42
Copy link
Contributor

And thank you for tacking this issue 🙏 I understand that most tools work with gettext and that ICU might be overpowered for most cases.

@iStefo iStefo changed the title Implement "native" po plurals Implement gettext plurals for PO files Apr 27, 2020
@ruscoder
Copy link

Hello! Any updates on this PR?

@tricoder42
Copy link
Contributor

Hey @iStefo, sorry for long delay. Are you still available to work on this issue? Meanwhile, the v3.0 was release so you would need to resolve conflicts. Let me know if I can help somehow

@iStefo
Copy link
Contributor Author

iStefo commented Nov 1, 2020

Hi, I'm sorry, too, as I've not been responding or making progress on this.
Internally, we do use our fork of over at https://github.com/box-id/js-lingui/tree/next to export gettext plurals (and upload/download them to/from poeditor.com for localization) in production and it's working very well within its limits (which need to be clarified, documented and, ideally, safeguarded in code).

I'll try hard to allocate some time this week to get this thing rolling again as I'm sure my manager would like to see us using the upstream version of lingui again at some time...

@vercel
Copy link

vercel bot commented Nov 6, 2020

@iStefo is attempting to deploy a commit to the LinguiJS Team on Vercel.

A member of the Team first needs to authorize it.

@semoal semoal requested a review from tricoder42 November 17, 2020 12:42
@iStefo
Copy link
Contributor Author

iStefo commented Nov 30, 2020

Hi @semoal @tricoder42, please see if you can free up some time to take another look at this so we can get it merged :)

@semoal
Copy link
Contributor

semoal commented Nov 30, 2020

Except this minor changes, for me looks pretty good, probably Tomas will review it today or tomorrow, so we can release this week.

msgstr[0] ""
msgstr[1] ""

msgctxt "icu=%7Bcount%2C+plural%2C+one+%7BSingular%7D+other+%7BPlural%7D%7D&pluralize_on=count"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary for this feature to use msgctx? There's actually another PR #856 implementing the original msgctx behavior from Gettext.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing the original ICU message in msgctx is clearly a workaround and not using the field as intended. In earlier versions, only the pluralize_on value was stored there, which I need to reconstruct the original ICU message.
The full ICU message is stored s.t. msgid can be restored for messages where the developer does not use custom IDs, as the ICU cases in development language are used for msgid and msgid_plural, so items look like this:

msgctxt "icu=%7Bcount%2C+plural%2C+one+%7BSingular%7D+other+%7BPlural%7D%7D&pluralize_on=count"
msgid "Singular"
msgid_plural "Plural"
msgstr[0] ""
msgstr[1] ""

I could also store the querystring encoded data in a new type of comment, say ' #?foo=bar' and, when converting from po to ICU, iterate comments until I find one that matches the format. What do you think about that?

Copy link
Contributor

@tricoder42 tricoder42 Dec 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we need to use comments to store any required metadata. You could even have one comment per line:

#. icu: { count, plural, one {Singular} other {Plural}}
#. pluralize_on: count
msgid "Singular"
msgid_plural "Plural"
msgstr[0] ""
msgstr[1] ""

but whatever works for you the best 👍

There're several types of comments available in gettext:

white-space
#  translator-comments
#. extracted-comments
#: reference…
#, flag…
#| msgid previous-untranslated-string
msgid untranslated-string
msgstr translated-string

Not sure which are supported by the PO library we use, but I guess this would be a good fit for extracted-comments.

I'm open to any suggestions :)

@iStefo
Copy link
Contributor Author

iStefo commented Dec 2, 2020

I've updated the implementation to:

  • Use extracted comments to store the required ICU context. The comment is prefixed with js-lingui: to allow reidentification when parsing
  • Use the extractedComments property when mapping from lingui messages to PO items, even though TypeScript is currently not happy with that...
  • Only include references if options.origins is set (although it seems like leaving this option unset, for existing catalogs, it will not remove references if already present, is that correct?)

@vonovak
Copy link
Collaborator

vonovak commented Dec 2, 2020

hey there, just a word of warning, I was dealing with plurals on my previous project and I believe there might a misalignment in this PR:

you're using the CLDR data to get the plurals for a given language and CLDR data supports plurals for decimals (eg 1.5) as well as integers (2).

BUT

po gettext only works for integers - compare the number of plurals for gettext: cs - Czech has 3: find it in http://docs.translatehouse.org/projects/localization-guide/en/latest/l10n/pluralforms.html or https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html#Plural-forms but in CLDR it has 4 - see https://github.com/unicode-org/cldr/blob/master/common/supplemental/plurals.xml#L154

to give an example, in Czech we have

0.5 dne / 0.5 days
1 den / 1 day
2 dny / 2 days
5 dní / 5 days

and CLDR supports all forms but gettext does not support 0.5

Why? look for "You might now ask, ngettext handles only numbers n of type ‘unsigned long’. What about larger integer types? What about negative numbers? What about floating-point numbers?" and "Negative and floating-point values usually represent physical entities for which singular and plural don’t clearly apply." in https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html

What this means is that the plurals, essentially, won't work. I believe you'll need to find some other source than CLDR to find the number of plurals for a given language, I remember finding some repo on github that did that. I might be wrong here but please double-check this. Nice work btw!

Copy link
Contributor

@tricoder42 tricoder42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Use extracted comments to store the required ICU context. The comment is prefixed with js-lingui: to allow reidentification when parsing

👍

Use the extractedComments property when mapping from lingui messages to PO items, even though TypeScript is currently not happy with that...

👍

Only include references if options.origins is set (although it seems like leaving this option unset, for existing catalogs, it will not remove references if already present, is that correct?)

Yeah, I think so. I've never tried it on my own to be honest.

Also it would be great to document or mention somewhere what @vonovak said about plurals for decimal numbers. I guess it's a limitation of gettext format, but as long as project uses only integers, it should be fine.

@iStefo
Copy link
Contributor Author

iStefo commented Dec 7, 2020

If I understand @vonovak correctly, this could be an issue even if no decimal numbers are used, because of the different number of plural cases reported by CLDR vs. gettext.

For example, the gettext-enabled localization tool we currently use offers the plural cases "one, few, other" for the Czech language (which seems correct when I try out cldr-plurals for integer numbers), so the PO files would contain the cases in this order with indexes from 0 to 2.
However, plurals-cldr (https://github.com/nodeca/plurals-cldr/blob/master/dist/plurals-cldr.js#L235) outputs the plural cases as "one, few, many, other" which means that gettext messages 0-2 get converted back to the cases "one, few, many" instead of "one, few, other" as probably expected by ICU, right?

I have reproduced the issue in a test case and will commit the solution later today.

@vonovak
Copy link
Collaborator

vonovak commented Dec 7, 2020

@iStefo

If I understand @vonovak correctly, this could be an issue even if no decimal numbers are used, because of the different number of plural cases reported by CLDR vs. gettext.

which means that gettext messages 0-2 get converted back to the cases "one, few, many" instead of "one, few, other" as probably expected by ICU, right?

exactly!

@iStefo
Copy link
Contributor Author

iStefo commented Dec 7, 2020

Thanks @vonovak for broadening my knowledge of foreign languages 😀

I have implemented a solution that was heavily influenced by https://github.com/LLK/po2icu/blob/9eb97f81f72b2fee02b77f1424702e019647e9b9/lib/po2icu.js#L148 which achieves a very similar goal to my code.

@tricoder42 I still updated the docs to include gettext's limitation regarding fractional numbers. I've also updated the docs to reflect the new way of storing the context data in a comment instead of msgctx. From my point of view, this is ready to be merged.

@semoal
Copy link
Contributor

semoal commented Dec 7, 2020

Great, I'll merge this once test suite pass :shipit:
Tomorrow version will be great 🥇 thanks to you all

Fails on Windows, let me search because this already happened to me on another issue. And I'll post the changes you need to fix the ci 👍🏻

@iStefo
Copy link
Contributor Author

iStefo commented Dec 7, 2020

The windows tests seem to fail because of line ending differences (https://github.com/lingui/js-lingui/runs/1510224213?check_suite_focus=true#step:7:27), but I can't see any difference to the regular po fixtures in terms of line endings...

@semoal
Copy link
Contributor

semoal commented Dec 7, 2020

The windows tests seem to fail because of line ending differences (lingui/js-lingui/runs/1510224213?check_suite_focus=true#step:7:27), but I can't see any difference to the regular po fixtures in terms of line endings...

#840 (comment)

Yes, this already happened it's a problem of jest that adds different end of line strings.

I'll investigate on jest codebase if i can fix this issue..
You can use this workaround:

    // on windows mockFs adds ··· to multiline string, so this strictly equal comparison can't be done
    // we test that the content if the same inlined...
    expect(actual.replace(/(\r\n|\n|\r)/gm,"")).toEqual(pofile.replace(/(\r\n|\n|\r)/gm,""))

@semoal
Copy link
Contributor

semoal commented Dec 8, 2020

😱 Released 3.3.0 with this fix/feature introduced!

npx update-by-scope @lingui

@truong-hua
Copy link
Contributor

truong-hua commented Jun 19, 2021

@iStefo thank for this great feature and it's very valuable to us too, however the lingui extract-template command still export messages.pot file in the ICU format instead of gettext even that lingui extract does the job well.

This is the reason why we need the .pot file: #793

I have just found out that this issue is more related to extract-template cli than this feature, so I made a simple patch here #1089. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants