feat: Implement gettext plurals for PO files #677
This pull request is being automatically deployed with Vercel (learn more). 🔍 Inspect: https://vercel.com/lingui-js/js-lingui/aunxbx0i5 |
Codecov Report
@@ Coverage Diff @@
## main #677 +/- ##
==========================================
+ Coverage 82.89% 83.12% +0.22%
==========================================
Files 51 52 +1
Lines 1450 1570 +120
Branches 400 425 +25
==========================================
+ Hits 1202 1305 +103
- Misses 146 157 +11
- Partials 102 108 +6
Continue to review full report at Codecov.
|
This is good 👍 I understand your problem, although I don't have capacity to answer any questions right now. I'll get back to you in a few days, hopefully after v3 is released. |
Thanks. Only issue I see is that (depending on how developers would opt-in or -out of this change) it could be a breaking change that should be considered before releasing v3, right? Another thing: PO files don't support special casing for certain values ( |
I would definitely make this feature opt-in. I always considered gettext as a message format. PO file format comes definitely hand in hand with gettext, but when you omit plural rules, you can use it as a file format with any other message format. The problem is that gettext is just a subset of ICU message format. Plurals are simplified - there can be only one plural per message and source locale must use two plural forms (e.g. in Czech we have 3 plural forms and therefore I can't use it as a source locale). Other formatters are missing - Now question is how far we want to get. I believe the simplest solution would be a different catalog format, e.g. What do you think? |
Thanks for your input. I really haven't considered the difference between po and gettext, but I think you're right with what you say. Regarding Plurals: I haven't used a language as source language with more than two plurals. As long as you use messages-as-keys, you're probably right (since there only is Select, selectOrdinal: I haven't had a case where I needed one of those (and gettext as well as our editor of choice, poeditor.com, don't support them) and I guess it would be problematic to implement anyways. Maybe one could generate additional messages with the values appended ("mykey_valueX"...), but that would need deep integration into catalog handling – much more than simply converting between message formats. I'm fine with using the suggested |
I'm not saying this is very common usecase, but something we need to consider when adding support for gettext. The same applies for As for message ids: With custom ids I would simply use
- msgid "{count, plural, one {I have # book} other {I have # books}}"
+ msgid "I have # book"
+ msgid_plural "I have # books"
Not sure if |
And thank you for tackling this issue 🙏 I understand that most tools work with gettext and that ICU might be overpowered for most cases. |
Hello! Any updates on this PR? |
Hey @iStefo, sorry for the long delay. Are you still available to work on this issue? Meanwhile, v3.0 was released, so you would need to resolve conflicts. Let me know if I can help somehow. |
Hi, I'm sorry, too, as I've not been responding or making progress on this. I'll try hard to allocate some time this week to get this thing rolling again as I'm sure my manager would like to see us using the upstream version of lingui again at some time... |
@iStefo is attempting to deploy a commit to the LinguiJS Team on Vercel. A member of the Team first needs to authorize it. |
Hi @semoal @tricoder42, please see if you can free up some time to take another look at this so we can get it merged :) |
Except for these minor changes, it looks pretty good to me. Tomas will probably review it today or tomorrow, so we can release this week. |
msgstr[0] ""
msgstr[1] ""

msgctxt "icu=%7Bcount%2C+plural%2C+one+%7BSingular%7D+other+%7BPlural%7D%7D&pluralize_on=count"
Is it necessary for this feature to use msgctxt? There's actually another PR #856 implementing the original msgctxt behavior from Gettext.
Storing the original ICU message in msgctxt is clearly a workaround and not using the field as intended. In earlier versions, only the pluralize_on value was stored there, which I need to reconstruct the original ICU message.
The full ICU message is stored so that msgid can be restored for messages where the developer does not use custom IDs, as the ICU cases in the development language are used for msgid and msgid_plural, so items look like this:
msgctxt "icu=%7Bcount%2C+plural%2C+one+%7BSingular%7D+other+%7BPlural%7D%7D&pluralize_on=count"
msgid "Singular"
msgid_plural "Plural"
msgstr[0] ""
msgstr[1] ""
I could also store the querystring encoded data in a new type of comment, say ' #?foo=bar' and, when converting from po to ICU, iterate comments until I find one that matches the format. What do you think about that?
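For reference, the querystring form shown above can be produced and read back with the standard URLSearchParams class. This is only a sketch of the idea: the helper names encodeContext and decodeContext and the PluralContext shape are made up for illustration and are not taken from this PR's code.

```typescript
// Hypothetical helpers (illustrative names, not lingui's API).
interface PluralContext {
  icu: string; // the original ICU message
  pluralizeOn: string; // the placeholder that drives pluralization
}

function encodeContext(ctx: PluralContext): string {
  // Form encoding: spaces become "+", braces and commas are percent-encoded.
  return new URLSearchParams({
    icu: ctx.icu,
    pluralize_on: ctx.pluralizeOn,
  }).toString();
}

function decodeContext(encoded: string): PluralContext | null {
  const params = new URLSearchParams(encoded);
  const icu = params.get("icu");
  const pluralizeOn = params.get("pluralize_on");
  // Anything that lacks either key is not one of our context strings.
  if (icu === null || pluralizeOn === null) return null;
  return { icu, pluralizeOn };
}

const encodedContext = encodeContext({
  icu: "{count, plural, one {Singular} other {Plural}}",
  pluralizeOn: "count",
});
console.log(encodedContext);
console.log(decodeContext(encodedContext));
```

The same string could live in msgctxt or in a comment line; only the place where it is stored changes, not the encoding.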
Yeah, I think we need to use comments to store any required metadata. You could even have one comment per line:
#. icu: { count, plural, one {Singular} other {Plural}}
#. pluralize_on: count
msgid "Singular"
msgid_plural "Plural"
msgstr[0] ""
msgstr[1] ""
but whatever works for you the best 👍
There're several types of comments available in gettext:
white-space
# translator-comments
#. extracted-comments
#: reference…
#, flag…
#| msgid previous-untranslated-string
msgid untranslated-string
msgstr translated-string
Not sure which are supported by the PO library we use, but I guess this would be a good fit for extracted-comments.
I'm open to any suggestions :)
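To make the one-comment-per-line idea concrete, here is a rough sketch of reading such metadata back from a raw PO entry. It assumes the `#. key: value` layout from the example above and works on plain text; parseExtractedMetadata is a hypothetical helper, and a real implementation would more likely go through whatever the PO library exposes for extracted comments.

```typescript
// Hypothetical parser: collects "#. key: value" extracted comments from one PO entry.
function parseExtractedMetadata(entry: string): Record<string, string> {
  const metadata: Record<string, string> = {};
  // Accept both LF and CRLF line endings.
  for (const line of entry.split(/\r?\n/)) {
    // "#." marks an extracted comment; the first ":" separates key from value.
    const match = /^#\.\s*([\w-]+):\s*(.*)$/.exec(line.trim());
    if (match === null) continue;
    const key = match[1];
    const value = match[2];
    if (key !== undefined && value !== undefined) metadata[key] = value;
  }
  return metadata;
}

const sampleEntry = [
  "#. icu: { count, plural, one {Singular} other {Plural}}",
  "#. pluralize_on: count",
  "#: src/App.js:12",
  'msgid "Singular"',
  'msgid_plural "Plural"',
  'msgstr[0] ""',
  'msgstr[1] ""',
].join("\n");

console.log(parseExtractedMetadata(sampleEntry));
```

Reference comments (`#:`) and the message lines are skipped because only lines starting with `#.` match.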
I've updated the implementation to:
|
hey there, just a word of warning: I was dealing with plurals on my previous project and I believe there might be a misalignment in this PR. You're using the CLDR data to get the plurals for a given language, and CLDR data supports plurals for decimals (e.g. 1.5) as well as integers (2). BUT po gettext only works for integers. Compare the number of plurals:
gettext: cs (Czech) has 3, find it in http://docs.translatehouse.org/projects/localization-guide/en/latest/l10n/pluralforms.html or https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html#Plural-forms
CLDR: it has 4, see https://github.com/unicode-org/cldr/blob/master/common/supplemental/plurals.xml#L154
To give an example, in Czech we have 0.5 dne / 0.5 days, and CLDR supports all forms but gettext does not support 0.5.
Why? Look for "You might now ask, ngettext handles only numbers n of type ‘unsigned long’. What about larger integer types? What about negative numbers? What about floating-point numbers?" and "Negative and floating-point values usually represent physical entities for which singular and plural don’t clearly apply." in https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html
What this means is that the plurals, essentially, won't work. I believe you'll need to find some other source than CLDR to find the number of plurals for a given language; I remember finding some repo on GitHub that did that. I might be wrong here but please double-check this. Nice work btw! |
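A small sketch of the mismatch for Czech. The gettext formula is the one the gettext manual lists for Czech; cldrCategoryCs is a hand-written approximation of the CLDR cardinal rules for Czech on plain JS numbers (written out here only for illustration, not read from CLDR data).

```typescript
type Category = "one" | "few" | "many" | "other";

// gettext plural formula for Czech:
// nplurals=3; plural=(n==1) ? 0 : (n>=2 && n<=4) ? 1 : 2;
function gettextPluralCs(n: number): number {
  return n === 1 ? 0 : n >= 2 && n <= 4 ? 1 : 2;
}

// Approximation of the CLDR rules for Czech (assumption: JS numbers only,
// so "1.0" with visible fraction digits cannot be told apart from 1).
function cldrCategoryCs(n: number): Category {
  if (Math.floor(n) !== n) return "many"; // fractional values
  if (n === 1) return "one";
  if (n >= 2 && n <= 4) return "few";
  return "other";
}

// Collect the distinct CLDR categories hit by a list of sample numbers.
function categoriesFor(samples: number[]): Category[] {
  const seen: Category[] = [];
  for (const n of samples) {
    const category = cldrCategoryCs(n);
    if (seen.indexOf(category) === -1) seen.push(category);
  }
  return seen;
}

const integerSamples: number[] = [];
for (let n = 0; n <= 100; n++) integerSamples.push(n);

// Integers reach three categories, matching the three gettext forms.
console.log(categoriesFor(integerSamples).length);
// Adding fractions brings in the fourth category, which gettext has no slot for.
console.log(categoriesFor(integerSamples.concat([0.5, 1.5])).length);
```

So taking the number of msgstr[] slots from the CLDR category count would produce four slots for Czech where gettext tools expect three.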
Looks good!
Use extracted comments to store the required ICU context. The comment is prefixed with js-lingui: to allow reidentification when parsing
👍
Use the extractedComments property when mapping from lingui messages to PO items, even though TypeScript is currently not happy with that...
👍
Only include references if options.origins is set (although it seems like leaving this option unset, for existing catalogs, it will not remove references if already present, is that correct?)
Yeah, I think so. I've never tried it on my own to be honest.
Also it would be great to document or mention somewhere what @vonovak said about plurals for decimal numbers. I guess it's a limitation of gettext format, but as long as the project uses only integers, it should be fine.
If I understand @vonovak correctly, this could be an issue even if no decimal numbers are used, because of the different number of plural cases reported by CLDR vs. gettext. For example, the gettext-enabled localization tool we currently use offers the plural cases "one, few, other" for the Czech language (which seems correct when I try out cldr-plurals for integer numbers), so the PO files would contain the cases in this order with indexes from 0 to 2. I have reproduced the issue in a test case and will commit the solution later today. |
exactly! |
Thanks @vonovak for broadening my knowledge of foreign languages 😀 I have implemented a solution that was heavily influenced by https://github.com/LLK/po2icu/blob/9eb97f81f72b2fee02b77f1424702e019647e9b9/lib/po2icu.js#L148 which achieves a very similar goal to my code. @tricoder42 I still updated the docs to include gettext's limitation regarding fractional numbers. I've also updated the docs to reflect the new way of storing the context data in a comment instead of |
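For readers following along, the general idea of lining up gettext plural indexes with CLDR category names can be sketched like this: run sample integers through the gettext plural formula and record which CLDR category each index first lands on. This is an illustration under that assumption, not the code from this PR or from po2icu; all function names are hypothetical, and the Czech rules are written out by hand.

```typescript
type PluralCategory = "zero" | "one" | "two" | "few" | "many" | "other";

// For each gettext index 0..nplurals-1, find the CLDR category of the first
// sample integer that the gettext formula maps to that index.
function mapGettextIndexToCategory(
  nplurals: number,
  gettextPlural: (n: number) => number,
  cldrCategory: (n: number) => PluralCategory,
  maxSample: number = 1000
): (PluralCategory | undefined)[] {
  const result: (PluralCategory | undefined)[] = [];
  for (let i = 0; i < nplurals; i++) result.push(undefined);
  for (let n = 0; n <= maxSample; n++) {
    const index = gettextPlural(n);
    if (index >= 0 && index < nplurals && result[index] === undefined) {
      result[index] = cldrCategory(n);
    }
  }
  return result;
}

// Rebuild an ICU plural message from msgstr[] using the derived order.
function buildIcuPlural(
  variable: string,
  msgstrs: string[],
  categories: (PluralCategory | undefined)[]
): string {
  const parts: string[] = [];
  msgstrs.forEach((text, i) => {
    const category = categories[i];
    if (category !== undefined) parts.push(category + " {" + text + "}");
  });
  return "{" + variable + ", plural, " + parts.join(" ") + "}";
}

// Czech, integers only (hand-written rules, see the gettext manual for the formula).
const czechOrder = mapGettextIndexToCategory(
  3,
  (n) => (n === 1 ? 0 : n >= 2 && n <= 4 ? 1 : 2),
  (n) => (n === 1 ? "one" : n >= 2 && n <= 4 ? "few" : "other")
);
console.log(czechOrder); // ["one", "few", "other"]
console.log(buildIcuPlural("count", ["# den", "# dny", "# dní"], czechOrder));
// {count, plural, one {# den} few {# dny} other {# dní}}
```

Because only integers are sampled, the "many" category for fractional numbers never appears in the result, which is the gettext limitation noted in the docs.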
Great, I'll merge this once the test suite passes. It fails on Windows; let me search, because this already happened to me on another issue. I'll post the changes you need to fix the CI 👍🏻 |
The windows tests seem to fail because of line ending differences (https://github.com/lingui/js-lingui/runs/1510224213?check_suite_focus=true#step:7:27), but I can't see any difference to the regular |
Yes, this has happened before; it's a Jest problem that adds different end-of-line strings. I'll investigate the Jest codebase to see if I can fix this issue.
|
😱 Released 3.3.0 with this fix/feature introduced!
|
@iStefo thanks for this great feature, it's very valuable to us too, however the This is the reason why we need the .pot file: #793 I have just found out that this issue is more related to |
Hi & thanks for your work on this project!
While looking for a new localization solution for our React frontend, we found this project and liked the approach. Our goal is to use po files to leverage comments for providing context for translators (and our future selves…).
We started with the @next branch since that's the natural way to go forward. Despite some rough edges regarding the documentation and TypeScript typings, we were very happy with the functionality of this solution.
Problem
Our only major pain point (besides the removal of the
/* i18n: Comment */t("id")`Translated string`;
syntax from version 2.x ;)) is that the po file format does not support the "native" plurals with msgstr[0], msgstr[1] etc.
Unfortunately, we were not able to find an online translation service that was able to handle ICU plural strings when embedded in po files. The situation may be different with JSON documents, but then we'd lose comment functionality.
We noticed that other users have already encountered the same problem, unfortunately without an obvious solution: #595 & #82 (where it is mentioned that there are services that understand ICU, but as I said, we weren't able to find any that support ICU in PO)
Approach
After some experimenting, we decided to give a try at implementing rudimentary ICU -> PO -> ICU transformation for the po format. This PR shows the modifications that were required.
In short, the new code transforms the following ICU into the corresponding PO plurals (and back):
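As a rough illustration of such a round trip (not the code in this PR), the sketch below handles only a flat two-case plural with `one` and `other` and no nested placeholders. The PoPluralItem shape and both function names are hypothetical, and the sample message is the one quoted earlier in this conversation.

```typescript
// Hypothetical, simplified view of a pluralized PO item.
interface PoPluralItem {
  msgid: string;
  msgidPlural: string;
  msgstr: string[];
  pluralizeOn: string; // placeholder name, needed to rebuild the ICU message
}

function icuToPoItem(icu: string): PoPluralItem | null {
  const outer = /^\{\s*(\w+)\s*,\s*plural\s*,\s*([\s\S]*)\}$/.exec(icu.trim());
  if (outer === null) return null;
  const pluralizeOn = outer[1];
  const body = outer[2];
  if (pluralizeOn === undefined || body === undefined) return null;

  // Collect "key {text}" pairs; nested braces are not supported in this sketch.
  const cases: Record<string, string> = {};
  const casePattern = /([=\w]+)\s*\{([^{}]*)\}/g;
  let match: RegExpExecArray | null;
  while ((match = casePattern.exec(body)) !== null) {
    const key = match[1];
    const text = match[2];
    if (key !== undefined && text !== undefined) cases[key] = text;
  }

  // Bail out on anything PO plurals cannot express (exact cases like =0, extra forms).
  const one = cases["one"];
  const other = cases["other"];
  if (one === undefined || other === undefined) return null;
  if (Object.keys(cases).length !== 2) return null;
  return { msgid: one, msgidPlural: other, msgstr: ["", ""], pluralizeOn };
}

function poItemToIcu(item: PoPluralItem): string {
  // Fall back to the source strings while the item is untranslated.
  const one = item.msgstr[0] || item.msgid;
  const other = item.msgstr[1] || item.msgidPlural;
  return "{" + item.pluralizeOn + ", plural, one {" + one + "} other {" + other + "}}";
}

const sourceIcu = "{count, plural, one {I have # book} other {I have # books}}";
const poItem = icuToPoItem(sourceIcu);
console.log(poItem);
if (poItem !== null) console.log(poItemToIcu(poItem));
```

The placeholder name has to travel with the item because nothing in msgid, msgid_plural or msgstr[] records it.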
To be discussed
1. What to use as msgid_plural?
Currently, the code uses msgid_plural to tell the world that a message is pluralized.
When a custom ID is used, it uses the same ID as msgid_plural.
As of right now, for automatic IDs (where the message is also its ID), the last plural case taken from the ICU message format is used as msgid_plural, but I don't know if that's the right decision, as it reads somewhat strange when the regular msgid is the full ICU message and the msgid_plural is only an excerpt from that.
2. Is msgctxt the right place to store the pluralization key?
Since the ICU format is more powerful than PO plurals, we need to somehow remember the placeholder used to pluralize the message in order to transform the message back to the ICU format.
It might be possible to parse it from the translations, but not in all cases. (It might not be used in all or any of the translations, or multiple placeholders might be used and we can't know which one drives pluralization...)
As a workaround, I've chosen to store the pluralization key in the PO item's msgctxt. What do you think about that? Should we rather use a special comment line for that purpose?
3. How to opt in or out of this process?
While we think that this new way of using PO plurals is "more correct" than embedding ICU messages in PO, it might not be suitable for everyone and could break the workflow for people with functioning translation tools that understand ICU within PO (whatever tools those are).
What do you think would be a good way to integrate this? As a new format (po_with_plurals?)
As the default, with the old po parser then referred to as po_icu?
Summary
If you decide that we can go forward with this I'll gladly add tests for the new warning outputs that have not been automatically tested yet. The transforms themselves are covered in two added test cases as you can see in the diff.
We'd like to hear what you think about these changes and whether you could see them land in lingui.