Multi-language support and management #3678

Open
cychitivav opened this issue May 23, 2023 · 5 comments

Comments

@cychitivav
Contributor

cychitivav commented May 23, 2023

Hello,

I am a member of the Spanish ROS community, and I have been reading issue #3249. I believe that translating into multiple languages is crucial because, at one point, I was one of those individuals who struggled to learn ROS due to the language barrier in the documentation.

Translating this is a best-effort task to help people who prefer to read in their language or are unable (or find it difficult) to read in English. A translated page is beneficial, but if it's not available, the page will be in English.

Apart from the steep learning curve of ROS, the fact that the documentation is only available in English makes the learning curve even steeper. Therefore, I think it's essential to have the documentation available in multiple languages to enable more people to learn ROS.

Creating separate repositories for each language seems impractical since every update in the original repository creates conflicts when merging into individual language repositories. While exploring Sphinx's documentation on internationalization (https://www.sphinx-doc.org/en/master/usage/advanced/intl.html), I found the following approach:


Process

This process can be somewhat simplified with the help of the sphinx-intl library:

  1. Upload .rst files with the documentation in English.
  2. Generate a .pot file from the .rst files using make gettext (these files would be stored in the build folder).
  3. Update the .po files with the .pot files using sphinx-intl update -p build/locale -l <language>.
  4. Translate the .po files manually or with the assistance of a translator.
  5. Build the documentation in the desired language using make -e SPHINXOPTS="-D language='<language>'" html (the output would be stored in the build/<language> folder).

Note: When compiling HTML with make html, a .mo file is generated, which contains the translation from the .po file, and the HTML page is generated with the translation.
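As a rough sketch, steps 2, 3 and 5 could be collected in a small Python helper so they can be run per language. The function name `translation_commands` and the `pot_dir` parameter are hypothetical; the commands are the ones listed above, with the same paths.

```python
import shlex


def translation_commands(language, pot_dir="build/locale"):
    """Return the commands for steps 2, 3 and 5 as argument lists.

    `translation_commands` and `pot_dir` are illustrative names only;
    the commands mirror the ones quoted in the steps above.
    """
    return [
        # Step 2: extract translatable strings into .pot files.
        ["make", "gettext"],
        # Step 3: create or update the .po files for one language.
        ["sphinx-intl", "update", "-p", pot_dir, "-l", language],
        # Step 5: build the HTML output in that language.
        ["make", "-e", f"SPHINXOPTS=-D language='{language}'", "html"],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no
    # side effects; pass each list to subprocess.run(..., check=True)
    # to actually execute it.
    for cmd in translation_commands("es"):
        print(shlex.join(cmd))
```

Step 4 (the translation itself) happens between the second and third command, so in practice the list would be run in two parts.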

Structure of .po files

These files allow the translation of phrases or short paragraphs, where people with basic knowledge of English can perform the translation. Additionally, the authorship of each translation can be maintained, and changes made in each language can be tracked without affecting the English documentation (.rst files).

As described in the GNU documentation:

A PO file is made up of many entries, each entry holding the relation between an original untranslated string and its corresponding translation. All entries in a given PO file usually pertain to a single project, and all translations are expressed in a single target language. One PO file entry has the following schematic structure:

white-space
#  translator-comments
#. extracted-comments
#: reference…
#, flag…
msgid untranslated-string
msgstr translated-string

A simple entry can look like this:

#: lib/error.c:116
msgid "Unknown system error"
msgstr "Error desconocido del sistema"

In essence, the .po files are a list of entries, each having a msgid and a msgstr. The msgid represents the original English text, and the msgstr contains the translation.
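To make the msgid/msgstr pairing concrete, here is a minimal sketch of reading such entries in Python. The function `parse_po` is hypothetical and deliberately incomplete (no plural forms, no msgctxt, no obsolete entries); it is not how Sphinx or gettext read these files.

```python
import ast


def _unquote(token):
    # PO strings use C-style escapes; Python's string-literal syntax
    # covers the common ones (\n, \", \\), so literal_eval is enough here.
    return ast.literal_eval(token)


def parse_po(text):
    """Parse simple PO text into a list of {"msgid", "msgstr", "flags"} dicts."""
    entries = []
    current = None       # entry currently being filled
    field = None         # "msgid" or "msgstr", used for continuation lines
    pending_flags = []   # flags from "#," lines seen before the next msgid
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            field = None
        elif line.startswith("#,"):
            pending_flags += [f.strip() for f in line[2:].split(",") if f.strip()]
        elif line.startswith("#"):
            continue  # translator comments, references, obsolete entries
        elif line.startswith("msgid "):
            current = {"msgid": _unquote(line[6:]), "msgstr": "", "flags": pending_flags}
            pending_flags = []
            entries.append(current)
            field = "msgid"
        elif line.startswith("msgstr ") and current is not None:
            current["msgstr"] = _unquote(line[7:])
            field = "msgstr"
        elif line.startswith('"') and current is not None and field:
            current[field] += _unquote(line)  # continuation of a long string
    return entries


if __name__ == "__main__":
    sample = (
        "#: lib/error.c:116\n"
        'msgid "Unknown system error"\n'
        'msgstr "Error desconocido del sistema"\n'
    )
    for entry in parse_po(sample):
        print(entry["msgid"], "->", entry["msgstr"])
```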

Management of .po files

I have been working on a fork of this repository (https://github.com/cychitivav/ros2_documentation/tree/multilingual) to automate the generation of .po files using GitHub Actions. With some changes in the source folder, it is possible to generate .po files for multiple languages simultaneously.

Furthermore, the action includes code to extract the current status of the .po files, providing information on the number of translated and untranslated msgid entries. With this information, an issue could be automatically generated, or translations could be performed using googletrans (https://pypi.org/project/googletrans/) or similar tools.
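A status report of this kind could look roughly like the sketch below. It is not the code from the fork's action: `po_stats` is a hypothetical helper that assumes entries are separated by blank lines and contain no plural forms.

```python
import re

# Matches one double-quoted PO string, allowing backslash escapes inside.
_STRING = re.compile(r'"((?:[^"\\]|\\.)*)"')


def po_stats(text):
    """Count translated, fuzzy and untranslated entries in PO text."""
    stats = {"translated": 0, "fuzzy": 0, "untranslated": 0}
    for block in re.split(r"\n\s*\n", text.strip()):
        msgid, msgstr, target, fuzzy = "", "", None, False
        for line in (ln.strip() for ln in block.splitlines()):
            if line.startswith("#,"):
                fuzzy = fuzzy or "fuzzy" in line
                continue
            if line.startswith("#"):
                continue  # other comments and obsolete entries
            if line.startswith("msgid "):
                target = "msgid"
            elif line.startswith("msgstr "):
                target = "msgstr"
            content = "".join(_STRING.findall(line))
            if target == "msgid":
                msgid += content
            elif target == "msgstr":
                msgstr += content
        if not msgid:
            continue  # header entry (empty msgid) or comment-only block
        if not msgstr:
            stats["untranslated"] += 1
        elif fuzzy:
            stats["fuzzy"] += 1
        else:
            stats["translated"] += 1
    return stats


if __name__ == "__main__":
    sample = 'msgid "Hello"\nmsgstr "Hola"\n\nmsgid "New text"\nmsgstr ""\n'
    print(po_stats(sample))
```

The resulting counts could feed an automatically generated issue or decide which entries to hand to an automatic translator.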

Modifications to msgid

An important aspect is to identify which files require translation and ensure that existing translations are not lost. To address this, I have explored the sphinx-intl module, which allows updating the .po files based on the .pot files generated by make gettext. During the update, the following scenarios can occur:

  • New msgid entries are added: In this case, the .po files are updated with the new msgid, and the msgstr remains blank.

  • A msgid entry is removed: In this case, the corresponding msgid is removed from the .po file.

  • A msgid entry is not modified: In this case, the corresponding msgstr is preserved.

  • A msgid entry is modified: In this last scenario, the update uses a fuzzy comparison (Levenshtein distance) between the old and new msgid to determine the extent of the change to the documentation.

    If the change is minimal, the fuzzy flag is added, allowing the previous translation to be temporarily preserved until it is updated (or published if desired). For example, if a comma is changed to a period:

    #, fuzzy
    msgid "modified text in the .rst file"
    msgstr "previous translation"

    On the other hand, if the change is significant, the translation in the .po file will be removed, and the new msgid will be added:

    msgid "modified text in the .rst file"
    msgstr ""
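The four scenarios can be simulated with a short sketch. Note the assumptions: it uses the similarity ratio from Python's `difflib` with a cutoff of 0.75 rather than a Levenshtein distance, and `update_catalog` is a hypothetical function, not the sphinx-intl implementation.

```python
import difflib


def update_catalog(old, new_msgids, cutoff=0.75):
    """Merge existing translations into a fresh list of msgids.

    old:        dict mapping msgid -> msgstr (the current .po content)
    new_msgids: msgids extracted from the updated .rst files
    cutoff:     assumed similarity threshold for a "minimal" change
    Returns a dict mapping msgid -> (msgstr, is_fuzzy).
    """
    # Only translations whose msgid disappeared can be reused for a
    # modified msgid.
    orphaned = [m for m in old if m not in new_msgids]
    merged = {}
    for msgid in new_msgids:
        if msgid in old:
            merged[msgid] = (old[msgid], False)      # not modified: preserved
            continue
        close = difflib.get_close_matches(msgid, orphaned, n=1, cutoff=cutoff)
        if close:
            merged[msgid] = (old[close[0]], True)    # small change: kept, fuzzy
        else:
            merged[msgid] = ("", False)              # new or heavily changed
    return merged                                    # removed msgids are dropped


if __name__ == "__main__":
    old = {
        "Install ROS, then build.": "Instale ROS, luego compile.",
        "Unchanged": "Sin cambios",
    }
    new = ["Install ROS. Then build.", "Unchanged", "A brand new paragraph"]
    for msgid, (msgstr, fuzzy) in update_catalog(old, new).items():
        print(repr(msgid), repr(msgstr), "fuzzy" if fuzzy else "")
```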

Handling for each language

I have made some changes to the makefile so that files for multiple languages can be generated simultaneously (according to the interested translation communities). Additionally, I have placed the locale folder in the root of the repository to avoid conflicts in the action workflow.

The folder structure is as follows:

.
├── build
│   ├── gettext
│   └── html
│       ├── en
│       ├── es
│       └── fr
├── locale
│   ├── es
│   └── fr
└── source

As you can see, the locale folder contains the .po files, and the source folder remains unchanged to prevent any loops in the action or damage to the English documentation. As suggested by @fujitatomoya:

but if we take multiple language support in this repo, i would request the following architecture dependency.

  • mainline doc WILL NOT depend on any multiple language contents.
  • Only multiple language contents can refer to mainline doc.

Feasibility

Given the change tracking performed by make gettext and sphinx-intl in the .rst files, I believe it is possible to maintain the documentation in multiple languages within a single repository once a significant portion of the documentation has been translated. This would even allow for automatic translation and community contributions to improve the translations through PRs.

This is because if a large portion of the files is translated (either manually or automatically), minor changes in the .rst files can be handled by temporarily preserving the previous translation until it is updated (or published, if desired).

Final Comments

First and foremost, I would like to hear the maintainers' opinion on adding the .po files and the locale folder to the main repository. This would involve having a moderator for each language review each pull request, or implementing a similar process. By doing so, the authorship of each translation can be maintained, and changes made in each language can be controlled without affecting the English documentation (.rst files).

If you believe this is possible, I would like to submit a PR to the repository and await a review. I can also provide further clarification on the entire process.

It would be interesting to have a section like this on the ROS page:

https://docs.readthedocs.io/en/stable/localization.html

Considerations

  • How can the contributions already made in the repositories for each language be added to the main repository?
  • Multiversion internal configurations: Some internal paths may need to be changed if paths per language are included.
  • Possibly update sitemap.xml.
  • Check the translations with doc8.
@clalancette
Contributor

First, I'm sorry for the very long delay in responding.

Second, I agree with you that for the most part, it would be much better if the translations lived in the same repository as the original English documentation. Otherwise, it is going to get out of sync quickly and be hard to keep up-to-date.

However, I do have concerns that it will be hard for the current maintainers of ros2_documentation to be able to review the translations for many different languages. If we do go this route, then effectively we will merge in changes to the .po files without understanding the language they are being translated into. There is some possibility for abuse there (like putting spam in the translation), but I hope that wouldn't be a problem. Overall, I think we should do it regardless of these problems, but in the long-term I think we would want to have trusted reviewers for each language.

With all of that said, I would love to see a PR that has the changes to make this happen. We can discuss more about what this would look like there. Note that the way we build official documentation does not use a GitHub action, but instead invokes the Dockerfile at https://github.com/ros2/ros2_documentation/blob/rolling/docker/image/Dockerfile, so any solution will need to be integrated there.

@cychitivav
Contributor Author

Hi,

Thank you very much for your feedback.

Of course, I have created the Pull Request (#3829). For now, I'll keep it as a draft while I make some changes, as the version I had is outdated. Regarding GitHub Actions, I only use it to update the PO files, and the only change in Docker is adding the sphinx-intl package to the requirements.txt file.

@fujitatomoya
Collaborator

thank you very much for the detailed explanation!

i would like to ask a couple of questions,

  • Notification and Status check against the main doc.

    I do speak Japanese, so if i were the maintainer for the jp locale doc, i would want to know when the jp doc or .po file is out of date and needs to catch up with main after the main English doc is modified or updated. without this notification or capability, the jp documentation would go out of date, and we would need to post a notice saying that it is out of date and pointing readers to the English doc for the latest information, which is probably the situation we should avoid if we have multiple language docs...
    Can we detect during the build that the .po files no longer match the current .pot files, so that maintainers know what needs to be addressed?
    btw, i think it is likely that locale maintainers would do the sync at certain intervals.
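One possible shape for such a check, as a sketch only: compare the msgid set of a .pot template with the msgid set of the matching .po file. The names `msgids` and `sync_report` are hypothetical, and the regular expressions assume plain msgid/msgstr entries without plural forms.

```python
import re

# A msgid keyword followed by one or more quoted strings (continuation lines).
_ENTRY = re.compile(r'^msgid\s+((?:"(?:[^"\\]|\\.)*"\s*)+)', re.MULTILINE)
_PART = re.compile(r'"((?:[^"\\]|\\.)*)"')


def msgids(text):
    """Return the set of non-empty msgids found in PO or POT text."""
    found = set()
    for match in _ENTRY.finditer(text):
        msgid = "".join(_PART.findall(match.group(1)))
        if msgid:  # the empty msgid is the file header
            found.add(msgid)
    return found


def sync_report(pot_text, po_text):
    """List msgids missing from the .po file and msgids no longer in the .pot."""
    template, catalog = msgids(pot_text), msgids(po_text)
    return {
        "missing": sorted(template - catalog),
        "obsolete": sorted(catalog - template),
    }


if __name__ == "__main__":
    pot = 'msgid "A"\nmsgstr ""\n\nmsgid "B"\nmsgstr ""\n'
    po = 'msgid "A"\nmsgstr "a"\n\nmsgid "C"\nmsgstr "c"\n'
    print(sync_report(pot, po))
```

A build step could fail, or only warn, when either list is non-empty; which behavior is preferable is a policy decision for the locale maintainers.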

On the other hand, if the change is significant, the translation in the .po file will be removed,

this sounds like the mainline doc can easily break the multiple language docs? this works for mainline doc maintainers, but can it be a problem for multiple language doc maintainers?

i am not so familiar with sphinx multiple language support, so maybe i am mistaken for some parts...

@cychitivav
Contributor Author

cychitivav commented Aug 15, 2023

Hi @fujitatomoya,

I had forgotten to mention that these .po files are not specific to Sphinx; they are used in many internationalization workflows because they separate the code or formatting from the translation. I bring this up because these files can be managed with several tools, one of which is GNU gettext. With it, you can obtain statistics on how many msgid entries still need translation. While I'm not an expert with these files, I believe that identifying outdated entries isn't a problem. That said, this needs to be checked properly, which is why I've left the PR as a draft for now.

Regarding synchronization, I've been working on a GitHub Actions workflow to perform updates and generate a report every time a commit is made in the source folder. However, I think it's not ideal and warrants discussion.

Lastly, when a msgstr is removed, it's usually because the original text has changed significantly, and it's highly likely that the previous translation wouldn't be accurate. If the concern is about minor changes with typographical errors or punctuation, a 'fuzzy' flag is activated for this, allowing a decision between keeping the original text or the previous translation. If a text lacks translation, it would be displayed in English, ensuring the documentation is always up to date. Nonetheless, the issue lies in potentially having pages with mixed languages.
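The fallback behavior described here can be sketched as a lookup that returns the English msgid whenever no usable translation exists. The function `lookup`, its `use_fuzzy` switch, and the catalog layout are hypothetical illustrations, not Sphinx's API.

```python
def lookup(msgid, catalog, use_fuzzy=False):
    """Return the translation for msgid, or the English text as a fallback.

    catalog maps msgid -> (msgstr, is_fuzzy). `use_fuzzy` models the
    decision between showing the previous translation of a slightly
    changed text and showing the current English original.
    """
    entry = catalog.get(msgid)
    if entry is None:
        return msgid                 # no entry at all: show English
    msgstr, is_fuzzy = entry
    if not msgstr:
        return msgid                 # untranslated: show English
    if is_fuzzy and not use_fuzzy:
        return msgid                 # outdated translation: show English
    return msgstr


if __name__ == "__main__":
    catalog = {
        "Hello": ("Hola", False),
        "World.": ("Mundo", True),   # flagged fuzzy after a small edit
        "New text": ("", False),     # not translated yet
    }
    # A page built from these three strings ends up with mixed languages.
    print([lookup(m, catalog) for m in catalog])
```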

I hope this addresses your concerns; please let me know if I have missed anything.

@pxalcantara

hi @cychitivav, I strongly agree with your points about the importance of having multi-language support for the tutorials. Is there any update on this subject?
