feat: Implement gettext plurals for PO files #677

iStefo · 2020-04-23T13:42:15Z

Hi & thanks for your work on this project!
While looking for a new localization solutions for our React frontend, we found this project and liked the approach. Our goal is to use po files to leverage comments for providing context for translators (and our future selves…).
We started with the @next branch since that's the natual way to go forward. Despite some rough edges regarding the documentation and typescript typings, we were very happy with the functionality of this solution.

Problem

Our only major pain point (besides the removal of the /* i18n: Comment */t("id")`Translated string`; syntax from version 2.x ;)) is that the po file format does not support the "native" plurals with msgstr[0], msgstr[1] etc.
Unfortunately, we were not able to find an online translation service that was able to handle ICU plural strings when embedded in po files. The situation may be different with JSON documents, but then we'd loose comment functionality.

We noticed that other user have already encountered the same problem, unfortunately without an obvious solution: #595 & #82 (where it is mentioned that there are services that understand ICU, but as I said, we weren't able to find any that support ICU in PO)

Approach

After some experimenting, we decided to give a try at implementing rudimentary ICU -> PO -> ICU transformation for the po format. This PR shows the modifications that were required.
In short, the new code transforms the following ICU into the corresponding PO plurals (and back):

{
    "message_id": {
        "message": "{count, plural, one {Singular Case} two {Case for two} other {Case for {count} other}}"
    }
}

msgctxt "pluralize_on=count"
msgid "message_id"
msgid_plural "message_id"
msgstr[0] "Singular Case"
msgstr[1] "Case for two"
msgstr[2] "Case for {count} other"

To be discussed

1. What to use as `msgid_plural`?

Currently, the code uses msgid_plural to tell the world that a message is pluralized.
When a custom ID is used, it uses the same ID as msgid_plural.

As of right now, for automatic IDs (where the message is also its ID), the last plural case taken from the ICU message format is used as msgid_plural, but I don't know if that's the right decision as it reads somewhat strange when the regular msgid is the full ICU message and the msgid_plural is only an excerpt from that.

2. Is `msgctxt` the right place to store pluralization key?

Since the ICU format is more powerful than PO plurals, we need to somehow remember the placeholder used to pluralize the message to transform the message back to the ICU format.
It might be possible to parse it from the translations, but not in all cases. (Might not be used in all/any translation, multiple placeholders might be used but we can't know which is the one for localization...)

As a workaround, I've chosen to store the pluralization key in the PO item's msgctxt. What do you think about that? Should we rather use a special comment line for that purpose?

3. How to opt in or out of this process?

While we think that this new way of using PO plurals is "more correct" than embedding ICU messages in PO, it might not be suitable for everyone and brake the workflow for people with functioning translation tools that understand ICU within PO (whatever tools that are).

What do you think would be good way to integrate this? As a new format (po_with_plurals?)
As the default and the old po parsers is then referred to as po_icu?

Summary

If you decide that we can go forward with this I'll gladly add tests for the new warning outputs that have not been automatically tested yet. The transforms themselves are covered in two added test cases as you can see in the diff.

We'd like to hear what you think about these changes and whether you could see them land in lingui.

vercel · 2020-04-23T13:55:49Z

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/lingui-js/js-lingui/aunxbx0i5
✅ Preview: https://js-lingui-git-next.lingui-js.vercel.app

codecov · 2020-04-23T14:30:20Z

Codecov Report

Merging #677 (d775589) into main (fc8c628) will increase coverage by 0.22%.
The diff coverage is 85.83%.

@@            Coverage Diff             @@
##             main     #677      +/-   ##
==========================================
+ Coverage   82.89%   83.12%   +0.22%     
==========================================
  Files          51       52       +1     
  Lines        1450     1570     +120     
  Branches      400      425      +25     
==========================================
+ Hits         1202     1305     +103     
- Misses        146      157      +11     
- Partials      102      108       +6

Impacted Files	Coverage Δ
packages/cli/src/api/formats/index.ts	`60.00% <ø> (ø)`
packages/cli/src/api/formats/po-gettext.ts	`85.83% <85.83%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fc8c628...d775589. Read the comment docs.

tricoder42 · 2020-04-26T07:46:53Z

This is good 👍 I understand you problem, although I don't have capacity to answer any questions right now. I'll get back to you in few days, hopefully after v3 is released.

iStefo · 2020-04-26T08:25:12Z

Thanks. Only issue I see is that (depending on how developers would opt-in or -out of this change) it could be a breaking change that should be considered before releasing v3, right?
Otherwise we're bound to enabling this in a backwards compatible way, which would also work, of course...

Another thing: PO files don't support special casing for certain values (_0=…) which should at least raise a warning (not implemented yet). Once you get to it, we can discuss whether there is a way around that limitation or not.

tricoder42 · 2020-04-27T11:04:17Z

I would definitely make this feature opt-in. I always considered gettext as a message format. PO file format comes definitely hand in hand with gettext, but when you omit plural rules, you can use it as a file format with any other message format.

The problem is that gettext is just a subset of ICU message format. Plurals are simplified - there can be only one plural per message and source locale must use two plural forms (e.g. in Czech we have 3 plural forms and therefore I can't use it as a source locale). Other formatters are missing - select, selectOrdinal. So, the conversion from ICU to gettext isn't "lossless" in general case. On the other hand I completely understand that for lot users the gettext features are good enough. After all, gettext was and is still used a lot.

Now question is how far we want to get. I believe the simplest solution would be a different catalog format, e.g. po-gettext. That format would convert ICU to gettext on lingui extract and compile gettext on lingui compile. That way the core stays the same and the hard work happens in CLI. We just need to take care of edge cases and show an error, e.g. if you use any feature which isn't supported by gettext.

What do you think?

iStefo · 2020-04-27T11:16:13Z

Thanks for your input. I really haven't considered the difference between po and gettext, but I think you're right with what you say.

Regarding Plurals: I haven't used a language as source language with more than two plurals. As long as you use messages-as-keys, you're probably right (since there only ismsgid and msgid_plural). With custom IDs one should be fine I guess...
I'm not sure what you mean regarding "there can only be one plural per message"? Do you mean the dependent variable that is used for pluralization? How would an ICU mesasge look that depends on two variables?

Select, selectOrdinal: I haven't had a case where I needed one of those (and gettext as well as our editor of choice, poeditor.com, don't support them) and I guess it would be problematic to implement anyways. Maybe one could generate additional messages with the values appended ("mykey_valueX"...), but that would need deep integration into catalog handling – much more than simply converting between message formats.

I'm fine with using the suggested po-gettext format and completely decouple the formatter from the current po formatter. I will adjust my implementation in this regard.
Raising errors for unsupported features is fine for now until we happen to find a way to work around the limitations.
Thanks again for your time!

tricoder42 · 2020-04-27T11:34:11Z

I'm not sure what you mean regarding "there can only be one plural per message"? Do you mean the dependent variable that is used for pluralization? How would an ICU mesasge look that depends on two variables?

I have {countBooks, plural, one {# book} other {# books}} and {countBeer, plural, one {# beer} other {# beers}}

I'm not saying this is very common usecase, but something we need to consider when adding support for gettext.

The same applies for select and selectOrdinal - simply throw an error explaining that you can't use them with gettext. We can't cover all usecases anyway.

As for message ids: With custom ids I would simply use _plural suffix. With generated ids it would be great if we could "explode" icu message into singular/plural forms:

- msgid "{count, plural, one {I have # book} other {I have # books}}"
+ msgid "I have # book"
+ msgid_plural "I have # books"

Not sure if msgctxt "pluralize_on=count" is the best place to keep information about placeholders, but we definitely need to put it somewhere. I'm open to any suggestion. Feel free to use msgctx if we don't figure out anything better. msgctx is usually used to distinguish messages with same msgid but different meaning.

tricoder42 · 2020-04-27T11:35:12Z

And thank you for tacking this issue 🙏 I understand that most tools work with gettext and that ICU might be overpowered for most cases.

ruscoder · 2020-07-28T17:13:21Z

Hello! Any updates on this PR?

tricoder42 · 2020-11-01T19:02:51Z

Hey @iStefo, sorry for long delay. Are you still available to work on this issue? Meanwhile, the v3.0 was release so you would need to resolve conflicts. Let me know if I can help somehow

iStefo · 2020-11-01T21:27:24Z

Hi, I'm sorry, too, as I've not been responding or making progress on this.
Internally, we do use our fork of over at https://github.com/box-id/js-lingui/tree/next to export gettext plurals (and upload/download them to/from poeditor.com for localization) in production and it's working very well within its limits (which need to be clarified, documented and, ideally, safeguarded in code).

I'll try hard to allocate some time this week to get this thing rolling again as I'm sure my manager would like to see us using the upstream version of lingui again at some time...

vercel · 2020-11-06T14:52:42Z

@iStefo is attempting to deploy a commit to the LinguiJS Team on Vercel.

A member of the Team first needs to authorize it.

iStefo · 2020-11-30T12:27:53Z

Hi @semoal @tricoder42, please see if you can free up some time to take another look at this so we can get it merged :)

packages/cli/src/api/formats/po-gettext.ts

semoal · 2020-11-30T12:59:06Z

Except this minor changes, for me looks pretty good, probably Tomas will review it today or tomorrow, so we can release this week.

tricoder42 · 2020-11-30T15:34:32Z

packages/cli/src/api/formats/fixtures/messages_plural.po

+msgstr[0] ""
+msgstr[1] ""
+
+msgctxt "icu=%7Bcount%2C+plural%2C+one+%7BSingular%7D+other+%7BPlural%7D%7D&pluralize_on=count"


Is it necessary for this feature to use msgctx? There's actually another PR #856 implementing the original msgctx behavior from Gettext.

Storing the original ICU message in msgctx is clearly a workaround and not using the field as intended. In earlier versions, only the pluralize_on value was stored there, which I need to reconstruct the original ICU message.
The full ICU message is stored s.t. msgid can be restored for messages where the developer does not use custom IDs, as the ICU cases in development language are used for msgid and msgid_plural, so items look like this:

msgctxt "icu=%7Bcount%2C+plural%2C+one+%7BSingular%7D+other+%7BPlural%7D%7D&pluralize_on=count" msgid "Singular" msgid_plural "Plural" msgstr[0] "" msgstr[1] ""

I could also store the querystring encoded data in a new type of comment, say ' #?foo=bar' and, when converting from po to ICU, iterate comments until I find one that matches the format. What do you think about that?

Yeah, I think we need to use comments to store any required metadata. You could even have one comment per line:

#. icu: { count, plural, one {Singular} other {Plural}} #. pluralize_on: count msgid "Singular" msgid_plural "Plural" msgstr[0] "" msgstr[1] ""

but whatever works for you the best 👍

There're several types of comments available in gettext:

white-space # translator-comments #. extracted-comments #: reference… #, flag… #| msgid previous-untranslated-string msgid untranslated-string msgstr translated-string

Not sure which are supported by the PO library we use, but I guess this would be a good fit for extracted-comments.

I'm open to any suggestions :)

iStefo · 2020-12-02T08:24:30Z

I've updated the implementation to:

Use extracted comments to store the required ICU context. The comment is prefixed with js-lingui: to allow reidentification when parsing
Use the extractedComments property when mapping from lingui messages to PO items, even though TypeScript is currently not happy with that...
Only include references if options.origins is set (although it seems like leaving this option unset, for existing catalogs, it will not remove references if already present, is that correct?)

vonovak · 2020-12-02T10:50:34Z

hey there, just a word of warning, I was dealing with plurals on my previous project and I believe there might a misalignment in this PR:

you're using the CLDR data to get the plurals for a given language and CLDR data supports plurals for decimals (eg 1.5) as well as integers (2).

BUT

po gettext only works for integers - compare the number of plurals for gettext: cs - Czech has 3: find it in http://docs.translatehouse.org/projects/localization-guide/en/latest/l10n/pluralforms.html or https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html#Plural-forms but in CLDR it has 4 - see https://github.com/unicode-org/cldr/blob/master/common/supplemental/plurals.xml#L154

to give an example, in Czech we have

0.5 dne / 0.5 days
1 den / 1 day
2 dny / 2 days
5 dní / 5 days

and CLDR supports all forms but gettext does not support 0.5

Why? look for "You might now ask, ngettext handles only numbers n of type ‘unsigned long’. What about larger integer types? What about negative numbers? What about floating-point numbers?" and "Negative and floating-point values usually represent physical entities for which singular and plural don’t clearly apply." in https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html

What this means is that the plurals, essentially, won't work. I believe you'll need to find some other source than CLDR to find the number of plurals for a given language, I remember finding some repo on github that did that. I might be wrong here but please double-check this. Nice work btw!

tricoder42

Looks good!

Use extracted comments to store the required ICU context. The comment is prefixed with js-lingui: to allow reidentification when parsing

👍

Use the extractedComments property when mapping from lingui messages to PO items, even though TypeScript is currently not happy with that...

👍

Only include references if options.origins is set (although it seems like leaving this option unset, for existing catalogs, it will not remove references if already present, is that correct?)

Yeah, I think so. I've never tried it on my own to be honest.

Also it would be great to document or mention somewhere what @vonovak said about plurals for decimal numbers. I guess it's a limitation of gettext format, but as long as project uses only integers, it should be fine.

iStefo · 2020-12-07T09:59:17Z

If I understand @vonovak correctly, this could be an issue even if no decimal numbers are used, because of the different number of plural cases reported by CLDR vs. gettext.

For example, the gettext-enabled localization tool we currently use offers the plural cases "one, few, other" for the Czech language (which seems correct when I try out cldr-plurals for integer numbers), so the PO files would contain the cases in this order with indexes from 0 to 2.
However, plurals-cldr (https://github.com/nodeca/plurals-cldr/blob/master/dist/plurals-cldr.js#L235) outputs the plural cases as "one, few, many, other" which means that gettext messages 0-2 get converted back to the cases "one, few, many" instead of "one, few, other" as probably expected by ICU, right?

I have reproduced the issue in a test case and will commit the solution later today.

vonovak · 2020-12-07T10:04:46Z

@iStefo

If I understand @vonovak correctly, this could be an issue even if no decimal numbers are used, because of the different number of plural cases reported by CLDR vs. gettext.

which means that gettext messages 0-2 get converted back to the cases "one, few, many" instead of "one, few, other" as probably expected by ICU, right?

exactly!

…ctional numbers

iStefo · 2020-12-07T11:02:46Z

Thanks @vonovak for broadening my knowledge of foreign languages 😀

I have implemented a solution that was heavily influenced by https://github.com/LLK/po2icu/blob/9eb97f81f72b2fee02b77f1424702e019647e9b9/lib/po2icu.js#L148 which achieves a very similar goal to my code.

@tricoder42 I still updated the docs to include gettext's limitation regarding fractional numbers. I've also updated the docs to reflect the new way of storing the context data in a comment instead of msgctx. From my point of view, this is ready to be merged.

semoal · 2020-12-07T11:07:15Z

Great, I'll merge this once test suite pass
Tomorrow version will be great 🥇 thanks to you all

Fails on Windows, let me search because this already happened to me on another issue. And I'll post the changes you need to fix the ci 👍🏻

iStefo · 2020-12-07T11:16:58Z

The windows tests seem to fail because of line ending differences (https://github.com/lingui/js-lingui/runs/1510224213?check_suite_focus=true#step:7:27), but I can't see any difference to the regular po fixtures in terms of line endings...

semoal · 2020-12-07T11:19:25Z

The windows tests seem to fail because of line ending differences (lingui/js-lingui/runs/1510224213?check_suite_focus=true#step:7:27), but I can't see any difference to the regular po fixtures in terms of line endings...

#840 (comment)

Yes, this already happened it's a problem of jest that adds different end of line strings.

I'll investigate on jest codebase if i can fix this issue..
You can use this workaround:

    // on windows mockFs adds ··· to multiline string, so this strictly equal comparison can't be done
    // we test that the content if the same inlined...
    expect(actual.replace(/(\r\n|\n|\r)/gm,"")).toEqual(pofile.replace(/(\r\n|\n|\r)/gm,""))

semoal · 2020-12-08T14:27:07Z

😱 Released 3.3.0 with this fix/feature introduced!

npx update-by-scope @lingui

truong-hua · 2021-06-19T13:47:02Z

@iStefo thank for this great feature and it's very valuable to us too, however the lingui extract-template command still export messages.pot file in the ICU format instead of gettext even that lingui extract does the job well.

This is the reason why we need the .pot file: #793

I have just found out that this issue is more related to extract-template cli than this feature, so I made a simple patch here #1089. Thanks

tricoder42 added 💡 new feature 📦 cli labels Apr 25, 2020

iStefo changed the title ~~Implement "native" po plurals~~ Implement gettext plurals for PO files Apr 27, 2020

vercel bot deployed to Preview May 24, 2020 19:24 View deployment

tricoder42 mentioned this pull request Jun 1, 2020

Add support for gettext message format #703

Closed

tricoder42 added this to the 3.0 Optimizing production bundle size milestone Oct 11, 2020

tricoder42 added the triage: waiting for review label Oct 11, 2020

tricoder42 changed the base branch from next to main October 29, 2020 06:47

tricoder42 added triage: accepted and removed triage: waiting for review labels Nov 1, 2020

iStefo added 4 commits November 5, 2020 15:58

Manual rebase of po-gettext onto main branch

f3c002b

Merge branch 'main' of github.com:lingui/js-lingui into main

866f220

po-gettext: Improve code and tests

095d9f6

po-gettext: Add docs

42a8a98

Merge branch 'main' of github.com:lingui/js-lingui into main

066ed79

iStefo force-pushed the next branch from 893d99d to 066ed79 Compare November 6, 2020 14:55

vercel bot deployed to Preview November 6, 2020 14:55 View deployment

po-gettext: Fix TS errors in test

58cb1c1

semoal requested a review from tricoder42 November 17, 2020 12:42

semoal reviewed Nov 30, 2020

View reviewed changes

packages/cli/src/api/formats/po-gettext.ts Show resolved Hide resolved

semoal reviewed Nov 30, 2020

View reviewed changes

packages/cli/src/api/formats/po-gettext.ts Show resolved Hide resolved

tricoder42 reviewed Nov 30, 2020

View reviewed changes

iStefo added 3 commits December 1, 2020 15:10

po-gettext: Store context information in extracted comments section

d4d24da

Merge branch 'main' of github.com:lingui/js-lingui into main

4c15013

po-gettext: Align API with regular po module

ea2b8c6

vercel bot deployed to Preview December 2, 2020 08:21 View deployment

tricoder42 approved these changes Dec 3, 2020

View reviewed changes

iStefo added 2 commits December 7, 2020 11:41

po-gettext: Ignore CLDR plural cases that are not supported by gettext

ec91a88

po-gettext: Update docs to reflect gettext's limitation regarding fra…

a8842a6

…ctional numbers

vercel bot deployed to Preview December 7, 2020 10:54 View deployment

po-gettext: Fix incompatibility issue with node 10

4833f7c

vercel bot deployed to Preview December 7, 2020 11:04 View deployment

po-gettext: Attempt fix text on windows

d775589

vercel bot deployed to Preview December 7, 2020 11:23 View deployment

semoal merged commit 415b90e into lingui:main Dec 7, 2020

andrii-bodnar mentioned this pull request Mar 20, 2023

refactor(formats): extract formats into separate packages #1536

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Implement gettext plurals for PO files #677

feat: Implement gettext plurals for PO files #677

iStefo commented Apr 23, 2020 •

edited

Loading

vercel bot commented Apr 23, 2020 •

edited

Loading

codecov bot commented Apr 23, 2020 •

edited

Loading

tricoder42 commented Apr 26, 2020

iStefo commented Apr 26, 2020 •

edited

Loading

tricoder42 commented Apr 27, 2020

iStefo commented Apr 27, 2020

tricoder42 commented Apr 27, 2020

tricoder42 commented Apr 27, 2020

ruscoder commented Jul 28, 2020

tricoder42 commented Nov 1, 2020

iStefo commented Nov 1, 2020

vercel bot commented Nov 6, 2020

iStefo commented Nov 30, 2020

semoal commented Nov 30, 2020

tricoder42 Nov 30, 2020

iStefo Dec 1, 2020

tricoder42 Dec 1, 2020 •

edited

Loading

iStefo commented Dec 2, 2020

vonovak commented Dec 2, 2020 •

edited

Loading

tricoder42 left a comment

iStefo commented Dec 7, 2020

vonovak commented Dec 7, 2020

iStefo commented Dec 7, 2020

semoal commented Dec 7, 2020 •

edited

Loading

iStefo commented Dec 7, 2020

semoal commented Dec 7, 2020

semoal commented Dec 8, 2020

truong-hua commented Jun 19, 2021 •

edited

Loading

feat: Implement gettext plurals for PO files #677

feat: Implement gettext plurals for PO files #677

Conversation

iStefo commented Apr 23, 2020 • edited Loading

Problem

Approach

To be discussed

1. What to use as msgid_plural?

2. Is msgctxt the right place to store pluralization key?

3. How to opt in or out of this process?

Summary

vercel bot commented Apr 23, 2020 • edited Loading

codecov bot commented Apr 23, 2020 • edited Loading

Codecov Report

tricoder42 commented Apr 26, 2020

iStefo commented Apr 26, 2020 • edited Loading

tricoder42 commented Apr 27, 2020

iStefo commented Apr 27, 2020

tricoder42 commented Apr 27, 2020

tricoder42 commented Apr 27, 2020

ruscoder commented Jul 28, 2020

tricoder42 commented Nov 1, 2020

iStefo commented Nov 1, 2020

vercel bot commented Nov 6, 2020

iStefo commented Nov 30, 2020

semoal commented Nov 30, 2020

tricoder42 Nov 30, 2020

Choose a reason for hiding this comment

iStefo Dec 1, 2020

Choose a reason for hiding this comment

tricoder42 Dec 1, 2020 • edited Loading

Choose a reason for hiding this comment

iStefo commented Dec 2, 2020

vonovak commented Dec 2, 2020 • edited Loading

tricoder42 left a comment

Choose a reason for hiding this comment

iStefo commented Dec 7, 2020

vonovak commented Dec 7, 2020

iStefo commented Dec 7, 2020

semoal commented Dec 7, 2020 • edited Loading

iStefo commented Dec 7, 2020

semoal commented Dec 7, 2020

semoal commented Dec 8, 2020

truong-hua commented Jun 19, 2021 • edited Loading

iStefo commented Apr 23, 2020 •

edited

Loading

1. What to use as `msgid_plural`?

2. Is `msgctxt` the right place to store pluralization key?

vercel bot commented Apr 23, 2020 •

edited

Loading

codecov bot commented Apr 23, 2020 •

edited

Loading

iStefo commented Apr 26, 2020 •

edited

Loading

tricoder42 Dec 1, 2020 •

edited

Loading

vonovak commented Dec 2, 2020 •

edited

Loading

semoal commented Dec 7, 2020 •

edited

Loading

truong-hua commented Jun 19, 2021 •

edited

Loading