Add commands for Gettext-based translations i18n #1864

mgeisler · 2022-07-24T21:54:31Z

This implements the Gettext-based translation support mentioned in #5 (comment). Gettext is a widely used standard for translating software, with many tools available for translators to maintain and update the translations.

I added two new top-level commands:

mdbook xgettext: will extract all strings into a messages.pot file, similar to how xgettext works for source code,
mdbook gettext: will use a xx.po file to generate a translated source tree, similar to how gettext works.

The names don't feel great to me, since they assume that one is already familiar with the Gettext system. Perhaps it would be better to have mdbook i18n extract and mdbook i18n translate or similar?

The translated source tree can be used together with the language support from #1306 to get a multi-lingual book.

mgeisler · 2022-07-24T21:55:32Z

While this seems to work, I marked this as a draft since I'm sure we need some discussion here.

mgeisler · 2022-07-25T12:13:25Z

Hey @sebras, this is the PR I was working on with the extract and reconstruct scripts — they're no longer scripts but now top-level mdbook commands just so that I could hook into the MDBook struct and easily iterate over the book content.

Please let me know how this works for you — I'll also be testing it out here over the next few weeks.

mgeisler · 2022-08-09T19:38:20Z

I'm marking this as non-draft since I would love to get some feedback from people on this.

aellwein · 2022-09-08T10:17:13Z

Hi @mgeisler,
i wanted to test this pull request, however i get an error message upon cargo build:

error[E0433]: failed to resolve: could not find `gettext` in `cmd`
  --> src/main.rs:38:48
   |
38 |         Some(("gettext", sub_matches)) => cmd::gettext::execute(sub_matches),
   |                                                ^^^^^^^ could not find `gettext` in `cmd`

error[E0433]: failed to resolve: could not find `gettext` in `cmd`
  --> src/main.rs:82:26
   |
82 |         .subcommand(cmd::gettext::make_subcommand())
   |                          ^^^^^^^ could not find `gettext` in `cmd`

Is there something missing?

mgeisler · 2022-09-08T15:32:56Z

Is there something missing?

Oops! Yes, there is... I had not added a pub mod gettext; line to src/cmd/mod.rs. Thanks for catching that!
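For readers who have not hit E0433 before, here is a minimal sketch of why the missing declaration matters. The module is inlined here instead of living in src/cmd/mod.rs, and the signature of `execute` and the `run` helper are simplified assumptions for illustration, not mdbook's actual code.

```rust
// Hypothetical, inlined stand-in for src/cmd/mod.rs.
mod cmd {
    // Without this `pub mod gettext` declaration, any path starting with
    // `cmd::gettext::` fails to resolve (error E0433).
    pub mod gettext {
        // Simplified signature; the real command takes parsed CLI arguments.
        pub fn execute() -> Result<(), String> {
            Ok(())
        }
    }
}

// Hypothetical dispatcher in the spirit of the `match` in src/main.rs.
fn run(subcommand: &str) -> Result<(), String> {
    match subcommand {
        "gettext" => cmd::gettext::execute(),
        other => Err(format!("unknown subcommand: {}", other)),
    }
}
```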

I've updated the branch, please give it a try again.

aellwein · 2022-09-13T18:22:57Z

@mgeisler, i'm sorry for the delay, it took me some time to test the PR, but first of all, thank you for your work.

I've tried to create some example content. Everything works well, but it was not quite what I expected.

xgettext command simply converted every line of my chapter into a separate message, but this approach
appears very tiresome to me, just because the whole text is split in lines and it's hard to read and follow
the context and translate afterwards.

In my opinion gettext makes sense, when you are expecting single messages to be translated out of the context (like program info boxes, error messages, buttons etc.), but in creation of a book it's usually the whole text of a chapter which is to be translated (with maybe some small exceptions).

So at least in my expectation, a chapter-by-chapter approach fits better here: I could imagine writing something like chapter1.<lang>.md and chapter1.<other_lang>.md and just having a simple language switch in my generated markdown book to switch between different languages.

So i would like to know what others think about it, if gettext approach is feasible for book writers.

sebras · 2022-09-13T18:49:53Z

So at least in my expectation, a chapter-by-chapter approach fits better here: I could imagine writing something like chapter1.<lang>.md and chapter1.<other_lang>.md and just having a simple language switch in my generated markdown book to switch between different languages.

In other projects where I have been translating online documentation and websites, they tend to separate out each paragraph into a gettext translatable message. That gives the translator enough context while also not being overly long, as entire chapters may be. Moreover, a paragraph per message makes it easier to identify any changes per revision; if the message is too long, it may be difficult to identify all differences. Finally, paragraphs may move around unchanged between different revisions, and then having each paragraph as a gettext message would not require retranslation (whereas an entire chapter would).

PS. These are just general observations from the position of a translator, I have not tested this proposed PR yet.

mgeisler · 2022-09-13T20:20:37Z

@mgeisler, i'm sorry for the delay, it took me some time to test the PR, but first of all, thank you for your work.

No worries at all, thanks a lot for giving it a go!

I've tried to create some example content. Everything works well, but it was not quite what I expected.

xgettext command simply converted every line of my chapter into a separate message, but this approach appears very tiresome to me, just because the whole text is split in lines and it's hard to read and follow the context and translate afterwards.

Right, I fully intended to extract paragraphs (lines of text between \n\n+) and not individual lines.

I just tried with cargo run -- xgettext inside the test_book directory of this repository. The resulting messages.pot file looks like this:

#: individual/list.md:1
msgid "# Lists"
msgstr ""

#: individual/list.md:3
msgid ""
"1. A\n"
"2. Normal\n"
"3. Ordered\n"
"4. List"
msgstr ""

#: individual/list.md:8
msgid "---"
msgstr ""

This corresponds to

# Lists

1. A
2. Normal
3. Ordered
4. List

---

I think that's what we both wanted: lines of text are kept together unless they are separated by \n\n+. Do you see something else? Could it perhaps be that you're on Windows? I wrote the code to split on \n only, but I don't see why it could not split on \r\n as well.
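The splitting rule described above can be sketched as follows. This is an illustration written for this thread, not the code from the PR: the function name `split_paragraphs` is made up, and it relies on `str::lines`, which accepts both \n and \r\n line endings.

```rust
/// Split `text` into paragraphs separated by one or more empty lines.
/// Returns pairs of (1-based line number of the first line, paragraph text).
fn split_paragraphs(text: &str) -> Vec<(usize, String)> {
    let mut paragraphs = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    let mut start_line = 0;

    // `lines()` strips both "\n" and "\r\n", so Windows files work too.
    for (idx, line) in text.lines().enumerate() {
        if line.is_empty() {
            // An empty line closes the paragraph collected so far.
            if !current.is_empty() {
                paragraphs.push((start_line, current.join("\n")));
                current.clear();
            }
        } else {
            if current.is_empty() {
                start_line = idx + 1;
            }
            current.push(line);
        }
    }
    // Flush the last paragraph if the text does not end with an empty line.
    if !current.is_empty() {
        paragraphs.push((start_line, current.join("\n")));
    }
    paragraphs
}
```

Run on the list.md excerpt above, this yields three paragraphs starting at lines 1, 3 and 8, which matches the `#:` source references in the messages.pot output.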

Now, this list example is perhaps a poor example: I've been wondering if it makes sense to parse the Markdown more carefully and emit individual msgids for each list item. Similarly, a heading like ## My heading could be put into the messages.pot file as simply My heading. That way the translators will have less markup to deal with (but also slightly less context).
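A rough sketch of that idea, assuming a hand-rolled check for ATX headings and ordered list items only. The helper names are hypothetical, and a real implementation would more likely use a proper Markdown parser than string matching like this.

```rust
/// Return the heading text if `line` is an ATX heading such as "## My heading".
fn strip_heading_marker(line: &str) -> Option<&str> {
    let rest = line.trim_start_matches('#');
    let hashes = line.len() - rest.len();
    if (1..=6).contains(&hashes) && rest.starts_with(' ') {
        Some(rest.trim())
    } else {
        None
    }
}

/// Return the item text if `line` looks like "<digits>. text".
fn strip_ordered_marker(line: &str) -> Option<&str> {
    let rest = line.trim_start_matches(|c: char| c.is_ascii_digit());
    if rest.len() == line.len() {
        return None; // no leading digits
    }
    rest.strip_prefix(". ").map(str::trim)
}

/// Turn one extracted paragraph into the messages a translator would see.
fn to_messages(paragraph: &str) -> Vec<String> {
    let lines: Vec<&str> = paragraph.lines().collect();
    // A single heading line becomes one message without the "#" markers.
    if let [single] = lines.as_slice() {
        if let Some(title) = strip_heading_marker(single) {
            return vec![title.to_string()];
        }
    }
    // If every line is an ordered list item, emit one message per item.
    let items: Option<Vec<&str>> = lines.iter().map(|l| strip_ordered_marker(l)).collect();
    match items {
        Some(items) if !items.is_empty() => items.into_iter().map(String::from).collect(),
        // Anything else is passed through unchanged.
        _ => vec![paragraph.to_string()],
    }
}
```

With this, "## My heading" becomes the single message "My heading", and the four-item list from the example above becomes four separate messages.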

mgeisler · 2022-09-13T20:33:36Z

So at least in my expectation, a chapter-by-chapter approach fits better here: I could imagine writing something like chapter1.<lang>.md and chapter1.<other_lang>.md and just having a simple language switch in my generated markdown book to switch between different languages.

My experience with this is that it becomes impossible to track changes after a little while. This is in some sense an important role of the structured files created by Gettext: they give you a way to unambiguously say these 17 paragraphs are out of date.

If you just have a stream of changes to chapter1.<lang>.md, then it suddenly becomes a management task of the translator to track where the chapter1.<other_lang>.md file is in relationship to the source. Yes, it's doable, but it would require that the translator would write something like  at the top of the file.

When text is added and removed from the source file, the translator will now have to apply these changes — perhaps a paragraph is added on Monday and revised on Tuesday and Wednesday. If the translator sees this Friday, then they have to manually notice that they can avoid translating the text from Monday and Tuesday and only translate the version from Wednesday.

The "buffer" in the messages.pot file helps here: the translator starts the workflow Friday morning by extracting all strings to messages.pot. This file is then merged into other_lang.po. The translator now sees exactly what needs to be translated, and they see what is "fuzzy" because only minor changes have been made to the source paragraph.
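To make the bookkeeping concrete, here is a deliberately simplified sketch of the merge step. It only does exact matching on the source paragraph; the fuzzy matching described above is left out. The `Status` type and `merge_catalog` function are invented for this illustration and are not part of Gettext or of this PR.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Status {
    /// The paragraph is unchanged and already has a translation.
    Translated(String),
    /// The paragraph is new or changed and needs the translator's attention.
    Untranslated,
}

/// Merge freshly extracted paragraphs (`template`) with the existing
/// translations (`existing`, keyed by source paragraph).
/// Entries in `existing` that no longer appear in `template` are dropped.
fn merge_catalog(
    template: &[&str],
    existing: &HashMap<String, String>,
) -> Vec<(String, Status)> {
    template
        .iter()
        .map(|msgid| {
            let status = match existing.get(*msgid) {
                // An empty translation counts as missing.
                Some(msgstr) if !msgstr.is_empty() => Status::Translated(msgstr.clone()),
                _ => Status::Untranslated,
            };
            ((*msgid).to_string(), status)
        })
        .collect()
}
```

Only the text that exists at extraction time ends up in the result, so the intermediate Monday and Tuesday revisions from the example never reach the translator.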

aellwein · 2022-09-14T05:58:20Z

I think that's what we both wanted: lines of text are kept together unless they are separated by \n\n+. Do you see something else? Could it perhaps be that you're on Windows? I wrote the code to split on \n only, but I don't see why it could not split on \r\n as well.

No, i am not on Windows, but i added additional line breaks after the sentences for better styling (my test text was a poem), this could be the reason.

Now, this list example is perhaps a poor example: I've been wondering if it makes sense to parse the Markdown more carefully and emit individual msgids for each list item. Similarly, a heading like ## My heading could be put into the messages.pot file as simply My heading. That way the translators will have less markup to deal with (but also slightly less context).

Yes, maybe it's a good idea to have more "semantic" parsing of Markdown.

mgeisler · 2022-09-14T07:07:43Z

No, i am not on Windows, but i added additional line breaks after the sentences for better styling (my test text was a poem), this could be the reason.

I see, was the poem perhaps indented or in a block quote? That is,

> foo
> bar

will be put into a single msgid. The same happens with the two quoted paragraphs in this example:

> foo
>
> bar

I think there could be a lot of benefit from parsing away such block-level markup and putting foo and bar into their own msgid. Similarly for code blocks, headings, and list items.
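As an illustration of parsing away the block quote markup, the following sketch strips one level of `>` markers and then splits on the empty lines that remain. The function name `unquote` is made up, and nested quotes or lazy continuation lines are not handled.

```rust
/// Strip one level of block-quote markers and return the quoted paragraphs.
fn unquote(block: &str) -> Vec<String> {
    let mut paragraphs = Vec::new();
    let mut current: Vec<&str> = Vec::new();

    for line in block.lines() {
        // Remove a leading ">" and, if present, the single space after it.
        let inner = match line.strip_prefix('>') {
            Some(rest) => rest.strip_prefix(' ').unwrap_or(rest),
            None => line,
        };
        if inner.trim().is_empty() {
            // A bare ">" line separates two quoted paragraphs.
            if !current.is_empty() {
                paragraphs.push(current.join("\n"));
                current.clear();
            }
        } else {
            current.push(inner);
        }
    }
    if !current.is_empty() {
        paragraphs.push(current.join("\n"));
    }
    paragraphs
}
```

For the first example above this gives one message containing both lines; for the second it gives two messages, foo and bar.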

If we parse a list with 3 items into 3 msgids, then there's no way for a translator to add/remove list items. Right now, it seems like that's okay since it can help prevent translation mistakes.

trdthg · 2022-09-17T12:03:23Z

Hi, I'm trying to do some translation with your code and #1306. Here are the steps I took:

mdbook xgettext
msguniq messages.pot -o messages.pot
msginit -i messages.pot --local zh.po
mdbook gettext zh.po

Then I use mdbook from #1306 to build, but get this error:

[ERROR] (mdbook::utils): Error: Couldn't open SUMMARY.md in "/home/trdthg/myproject/flutter_rust_bridge/book/src/zh" directory

#1306 needs the translated book to have its own SUMMARY.md. So do I have to translate and copy it manually?

Btw, cloning two extra copies of mdbook is a bad experience.

This command is one half of a Gettext-based translation (i18n) workflow. It iterates over each chapter and extracts all translatable text into a `messages.pot` file. The text is split on paragraph boundaries, which helps ensure less churn in the output when the text is edited. The other half of the workflow is a `gettext` command which will take a source Markdown file and a `xx.po` file and output a translated Markdown file. Part of the solution for rust-lang#5.

This command is the second part of a Gettext-based translation (i18n) workflow. It takes an `xx.po` file with translations and uses this to translate the chapters of the book. Paragraphs without a translation are kept in the original language. Part of the solution for rust-lang#5.
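The fallback behaviour described in this commit message (untranslated paragraphs stay in the original language) can be sketched like this. It is an illustration under simplifying assumptions, not the code from the commit: the catalog is a plain map from source paragraph to translation, and paragraphs are separated by exactly one empty line.

```rust
use std::collections::HashMap;

/// Translate `source` paragraph by paragraph using `catalog`.
/// Paragraphs with no entry, or with an empty translation, are kept as-is.
fn translate(source: &str, catalog: &HashMap<String, String>) -> String {
    source
        .split("\n\n")
        .map(|paragraph| match catalog.get(paragraph) {
            Some(translation) if !translation.is_empty() => translation.as_str(),
            _ => paragraph,
        })
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Treating an empty translation as missing mirrors how an empty msgstr marks an untranslated entry in a .po file.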

mgeisler · 2022-09-18T20:59:39Z

Hi @trdthg Thanks so much for testing this out!

#1306 needs the translated book to have its own SUMMARY.md. So do I have to translate and copy it manually?

You're completely right that I missed the generation of the SUMMARY.md file. I've pushed a new version of the branch which will also translate this file.

Btw, cloning two extra copies of mdbook is a bad experience.

Yeah, I agree... Perhaps @Ruin0x11 could rebase the branch on top of the latest master so that I in turn can rebase my branch on top. I just looked at the history and I see that the commits are 1-2 years old... so this might be much more work than I had hoped.

Ruin0x11 · 2022-09-19T00:34:58Z

If I understand correctly this adds better support for translator focused tooling to my original code, is that accurate? I don't mind rebasing again, but I want to make sure there are no blockers for integrating the original code like last time.

mgeisler · 2022-09-19T12:45:37Z

If I understand correctly this adds better support for translator focused tooling to my original code, is that accurate?

Yes, that is precisely the idea. The new commands in this PR allow for a Gettext-based workflow for translations. The result is a tree of files which mirror the original files — a tree which should be ready to be put under src/xx/ for the xx language.

I want to make sure there are no blockers for integrating the original code like last time.

Just to be clear, I'm not a developer on the project — I'm just using mdbook myself for training materials and I would like to be able to translate this material to other languages.

Ruin0x11 · 2022-09-19T19:59:36Z

Okay, thanks for clarifying, I'm also not a major contributor to mdBook, but shared the same need for multilingual support at one point. I'm happy to collaborate if there's some way of getting traction on these code changes.



This implements a translation pipeline using the industry-standard Gettext[1] system. I picked Gettext for the reasons described in [2] and [3]:

* It’s widely used in open source software. This means that there are graphical editors which will help you in editing the `.po` files. An example is Poedit[4], which is available for all major platforms. There are also many online systems for doing translations. An example is Pontoon[5], which is used for the Rust website itself. We can consider setting up such an instance ourselves.
* It is a light-weight yet structured format. This means that nothing changes with regards to how you update the original English text. We can still accept fixes and PRs like normal. The structure means that translators can see exactly which part of the course they need to update after a change. This is completely lost if you simply copy over the original text and translate it in-place in the Markdown files.

The code here only adds support for translations. They are not yet tested, published or used for anything. Next steps will be:

* Add support for switching languages via a bit of JavaScript on each page.
* Update the speaker notes feature to support translations (right now “Speaker Notes” is hard-coded into the generated HTML). I think we should turn it into a mdbook preprocessor instead.
* Add testing: We should test that the `.po` files are well-formed. We should also run `mdbook test` on each language since the translations can alter the embedded code.

Fixes #115.

[1]: https://www.gnu.org/software/gettext/manual/html_node/index.html
[2]: rust-lang/mdBook#1864
[3]: rust-lang/mdBook#5 (comment)
[4]: https://poedit.net/
[5]: https://pontoon.rust-lang.org/

mgeisler · 2023-01-09T13:09:00Z

Hi all, I'll close this PR in favor of google/comprehensive-rust#130. It's the same code there, but it's refactored to not require any changes of mdbook. Instead, I use a renderer (output format) to extract the strings and a preprocessor to do the translations.

You can reuse these tools in your own projects! Please let me know if you do so that we can figure out if we should publish them on crates.io.


mgeisler · 2023-05-19T16:19:30Z

Just in case someone finds this much later: the tooling has been released as a set of mdbook plugins: https://github.com/google/mdbook-i18n-helpers.


mgeisler · 2023-08-23T18:44:27Z

Hi all, the latest version of mdbook-i18n-helpers significantly improves on how the text is extracted by removing unnecessary Markdown syntax. Please try it out if you're still interested in translating your mdbook documentation!

mgeisler marked this pull request as draft July 24, 2022 21:54

mgeisler mentioned this pull request Jul 24, 2022

Add multilingual support #5

Open

mgeisler force-pushed the gettext-i18n branch from 3941814 to 9234e00 Compare July 28, 2022 16:48

This was referenced Jul 29, 2022

Bump MSRV to 1.56 to get access to Rust 2021 crates #1866

Closed

migrate to 2021 edition #1831

Closed

mgeisler marked this pull request as ready for review August 9, 2022 19:38

mgeisler force-pushed the gettext-i18n branch 2 times, most recently from 65fb571 to 11d39d8 Compare September 8, 2022 15:31

mgeisler force-pushed the gettext-i18n branch 2 times, most recently from 027d32f to 14d6f9a Compare September 10, 2022 22:17

fzyzcjy mentioned this pull request Sep 13, 2022

Add Chinese translation to the book fzyzcjy/flutter_rust_bridge#700

Closed

mgeisler added 2 commits September 18, 2022 22:37

mgeisler force-pushed the gettext-i18n branch from 14d6f9a to 8a64a4f Compare September 18, 2022 20:49

trdthg mentioned this pull request Sep 19, 2022

Add i18n support fzyzcjy/flutter_rust_bridge#722

Closed

This was referenced Jan 5, 2023

Support translations google/comprehensive-rust#115

Closed

Let mdbook build select renderer and preprocessors #1978

Open

mgeisler mentioned this pull request Jan 8, 2023

Add support for translations google/comprehensive-rust#130

Merged

mgeisler closed this Jan 9, 2023

mgeisler mentioned this pull request Sep 15, 2023

Write mdbook renderer with strong templating system google/mdbook-i18n-helpers#70

Closed
