Extract only strings and comments from codeblock #109

dalance · 2023-10-28T02:54:10Z

Fixes #95

This PR adds codeblock parsing support to extract only strings and comments.

google-cla · 2023-10-28T02:54:14Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

i18n-helpers/src/lib.rs

i18n-helpers/src/bin/mdbook-gettext.rs

i18n-helpers/src/lib.rs

mgeisler · 2023-10-29T21:17:33Z

Hi @dalance,

This PR adds codeblock parsing support to extract only strings and comments.

Wow, thank you so much! This looks really good from a quick look!

Could you test how https://google.github.io/comprehensive-rust/zh-CN/hello-world/small-example.html looks after this? In particular, does mdbook-i18n-normalize do the right thing here? It should turn this message into 5 individual messages, one for each line comment.

@dyoo has written a lot of this code and I hope he can review this. Also, I think this will collide with #107 and the design he started there to support things like translation comments. So we should make sure we can get everything working nicely together here 🙂

I think we would want an option in book.toml for this? That could certainly be added later down the road if someone asks for it.

dalance · 2023-10-30T01:17:51Z

Hi @mgeisler,
Thank you for your review.
I fixed some codes which are suggested.

The small-example.md becomes like below.
I think it is an expected result.

#: src/hello-world/small-example.md:6
msgid "// Program entry point\n"
msgstr ""

#: src/hello-world/small-example.md:7
msgid "// Mutable variable binding\n"
msgstr ""

#: src/hello-world/small-example.md:8 src/traits/impl-trait.md:15
msgid "\"{x}\""
msgstr ""

#: src/hello-world/small-example.md:8
msgid "// Macro for printing, like printf\n"
msgstr ""

#: src/hello-world/small-example.md:9
msgid "// No parenthesis around expression\n"
msgstr ""

#: src/hello-world/small-example.md:10
msgid "// Math like in other languages\n"
msgstr ""

#: src/hello-world/small-example.md:15
msgid "\" -> {x}\""
msgstr ""

The conflicting point with #107 seems to be the modifitation to Group enum.
It may be better that splitting this PR to changing to Group which owns Event and adding parse_codeblock function?

I think we would want an option in book.toml for this? That could certainly be added later down the road if someone asks for it.

This PR changes messages.pot significantly.
So an option to disable this feature may be useful for existing users.

dalance · 2023-10-30T02:42:21Z

I changed some logics to treat continuous line comments.
This is because the previous code can't handle indented continuous line comments.

mgeisler · 2023-10-30T09:05:32Z

Thanks for all the updates!

I think we would want an option in book.toml for this? That could certainly be added later down the road if someone asks for it.

This PR changes messages.pot significantly. So an option to disable this feature may be useful for existing users.

Yeah, I can imagine it will change the files a lot — but for the better!

Please consider adding a line about this new behavior to the documentation. I think it could just be a little paragraph in Marking Sections to be Skipped for Translation which says something like

Note that we don't extract the full text of code blocks. Only text that is recognized as comments and literal strings is extracted.

Something like that, feel free to rephrase it how you like.

The small-example.md becomes like below. I think it is an expected result.

Yes, this looks great! Please also try running

mdbook-i18n-normalize po/zh-CN.po po/zh-CN-normalized.po

and verify that the existing translations are correctly paired up with the new messages. The normalization tool is what is supposed to let us change how we extract messages while also keeping existing translations intact. If I did things correctly, it should "just work", but it'll be nice to have confirmation from you 🙂

dalance · 2023-10-30T09:26:12Z

The result of mdbook-i18n-normalize is below.
It works fine too.

#: src/hello-world/small-example.md:6
msgid "// Program entry point\n"
msgstr "// 程序入口\n"

#: src/hello-world/small-example.md:7
msgid "// Mutable variable binding\n"
msgstr "// 可变变量绑定\n"

#: src/hello-world/small-example.md:8
#: src/traits/impl-trait.md:15
msgid "\"{x}\""
msgstr "\"{x}\""

#: src/hello-world/small-example.md:8
msgid "// Macro for printing, like printf\n"
msgstr "// 与 printf 类似的输出宏\n"

#: src/hello-world/small-example.md:9
msgid "// No parenthesis around expression\n"
msgstr "// 表达式周围没有括号\n"

#: src/hello-world/small-example.md:10
msgid "// Math like in other languages\n"
msgstr "// 与其他语言类似的数值计算\n"

#: src/hello-world/small-example.md:15
msgid "\" -> {x}\""
msgstr "\" -> {x}\""

i18n-helpers/src/bin/mdbook-gettext.rs

i18n-helpers/src/lib.rs

mgeisler · 2023-10-31T06:57:06Z

@henrif75, when this is merged, we will do a 0.3 release. This is a breaking change: we need to run mdbook-i18n-normalize on our translations to keep any translated comments intact (as tested by @dalance in #109 (comment)). I'll make sure to explain this again in the release notes.

dalance · 2023-10-31T09:08:22Z

I found a normalization issue.
The result of re-executed nomalization to po/zh-CN-normalized.po is here.

#: src/hello-world/small-example.md:6
msgid "// Program entry point"
msgstr "// 程序入口"

The trailing linebreak is missing.
This is because msgid/msgstr don't have a context which is in codeblock or not.
I can add a workaround like below, but hand-written detection is required because language is not clear.

pub fn extract_messages(document: &str) -> Vec<(usize, String)> {
    if document.starts_with("//") && document.ends_with("\n") {
        return vec![(1, document.to_string())];
    }

How do you think the workaround?
Or any information which shows msgid/msgstr is originated from codeblock should be added to po file?

dalance · 2023-10-31T09:14:11Z

I fixed the reviewed issues.
Additionally, I found my mistake about the usage of syntect stack through fuzz test, and fixed it too.

henrif75 · 2023-11-01T17:35:42Z

@henrif75, when this is merged, we will do a 0.3 release. This is a breaking change: we need to run mdbook-i18n-normalize on our translations to keep any translated comments intact (as tested by @dalance in #109 (comment)). I'll make sure to explain this again in the release notes.

Sounds good. I assume it's going to break ongoing PRs?

i18n-helpers/src/lib.rs

dyoo · 2023-11-01T17:49:52Z

I am so sorry for late review comments; I thought I pressed Submit, and it turns out that I did not. :(

dalance · 2023-11-02T00:56:13Z

I refrected the review, introduced Flags struct to resolve the normarilzation issue, and rebased to the latest main.

dalance · 2023-11-02T05:33:21Z

My solution seems to be broken...
msginit can't handle flags in messages.pot.
Another solution may be required.

dalance · 2023-11-02T06:26:23Z

Built-in flags seems to be passed through msginit.
So using c-format flag instead of codeblock flag may resolve the issue.
But it may be confusable.

mgeisler · 2023-11-02T07:41:39Z

I refrected the review, introduced Flags struct to resolve the normarilzation issue, and rebased to the latest main.

If this is only for normalization, then please don't add extra code to support it — normalization is something translators (or @henrif75) will run only once after a new breaking release. I don't think the main code should be expanded to support it.

In this case, I believe the problem are some newlines which are incorrect after running normalize? As long as the normal flow works, then these messages will be corrected by msgmerge and that's good enough.

@dyoo, are you happy with the logic here? Please finalize the review so we can merge it.

dalance · 2023-11-02T07:59:50Z

If there is no need to resolve the normalization issue, I'll remove the logic.
I think migration from v0.2 to v0.3 will works fine without it.

dalance · 2023-11-02T09:18:54Z

I removed the logic related to Flags.

mgeisler · 2023-11-02T21:48:03Z

If there is no need to resolve the normalization issue, I'll remove the logic.

Thanks @dalance!

@henrif75, when this is merged, we will do a 0.3 release. This is a breaking change: we need to run mdbook-i18n-normalize on our translations to keep any translated comments intact (as tested by @dalance in #109 (comment)). I'll make sure to explain this again in the release notes.

Sounds good. I assume it's going to break ongoing PRs?

Hey @henrif75, yes, it will affect PRs that touches code blocks. Most PRs don't, I believe so the impact should be less than when the tooling learnt to ignore most Markdown formatting in version 0.2.

Note that there will only be conflicts if we run the normalization tool on a translation — until we do, the translation will work just fine in all places except for the code blocks.

So I suggest that we land the current in-flight PRs, run the normalization tool, and then people can continue translating like normal (but with much smaller PO files: a quick test shows that we go from 19k to 16k lines!).

i18n-helpers/src/lib.rs

dalance · 2023-11-06T01:10:34Z

I restored heuristic logic for codeblock which doesn't have lang-specifier, and refactored stack operation logic.

dyoo

Thanks for the changes; looks good to me.

mgeisler · 2023-11-08T08:42:59Z

Amazing, thanks @dyoo for the review and thanks @dalance for all the work here! I'm very excited to merge and release this to the world 😄

mgeisler · 2023-11-08T17:49:52Z

Sorry for the tiny merge conflict in lib.rs, @dalance 🙂 We should be able to merge this right after you resolve it.

dalance · 2023-11-09T00:19:21Z

I resolved the conflict.

i18n-helpers/fuzz/fuzz_targets/group_events.rs

dalance mentioned this pull request Oct 28, 2023

Extract only strings and comments from code blocks #95

Closed

mgeisler reviewed Oct 29, 2023

View reviewed changes

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved

mgeisler reviewed Oct 29, 2023

View reviewed changes

i18n-helpers/src/bin/mdbook-gettext.rs Outdated Show resolved Hide resolved

mgeisler reviewed Oct 29, 2023

View reviewed changes

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved

mgeisler reviewed Oct 29, 2023

View reviewed changes

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved

mgeisler reviewed Oct 29, 2023

View reviewed changes

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved

mgeisler requested a review from dyoo October 29, 2023 21:08

dalance force-pushed the parse_codeblock branch from d78316f to 2401cec Compare October 30, 2023 00:25

dalance force-pushed the parse_codeblock branch from 2401cec to 420f768 Compare October 30, 2023 02:39

dalance force-pushed the parse_codeblock branch from 420f768 to e1de210 Compare October 30, 2023 02:58

dalance force-pushed the parse_codeblock branch 2 times, most recently from c7b861f to e19b096 Compare October 30, 2023 10:22

mgeisler reviewed Oct 30, 2023

View reviewed changes

i18n-helpers/src/bin/mdbook-gettext.rs Outdated Show resolved Hide resolved

dyoo reviewed Oct 30, 2023

View reviewed changes

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved

dalance force-pushed the parse_codeblock branch from e19b096 to 70ad914 Compare October 31, 2023 00:08

mgeisler mentioned this pull request Oct 31, 2023

Fixture code and footer content tags in rtl content google/comprehensive-rust#1433

Merged

mgeisler reviewed Oct 31, 2023

View reviewed changes

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved

mgeisler reviewed Oct 31, 2023

View reviewed changes

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved

mgeisler reviewed Oct 31, 2023

View reviewed changes

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved

mgeisler mentioned this pull request Oct 31, 2023

Add CXX tutorial google/comprehensive-rust#1392

Merged

dalance force-pushed the parse_codeblock branch from 70ad914 to d7fda79 Compare October 31, 2023 09:11

dyoo reviewed Nov 1, 2023

View reviewed changes

dalance force-pushed the parse_codeblock branch 2 times, most recently from fb769d7 to cd1faa1 Compare November 2, 2023 00:53

dalance force-pushed the parse_codeblock branch 2 times, most recently from c821d01 to a48720a Compare November 2, 2023 08:51

mgeisler reviewed Nov 2, 2023

View reviewed changes

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved

dyoo reviewed Nov 3, 2023

View reviewed changes

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved

i18n-helpers/src/lib.rs Outdated Show resolved Hide resolved

i18n-helpers/src/lib.rs Show resolved Hide resolved

dalance force-pushed the parse_codeblock branch from a48720a to 2cf67ed Compare November 6, 2023 01:08

dyoo approved these changes Nov 7, 2023

View reviewed changes

mgeisler mentioned this pull request Nov 8, 2023

Persian translation google/comprehensive-rust#671

Open

Extract only strings and comments from codeblock

eb5cde7

dalance force-pushed the parse_codeblock branch from 2cf67ed to eb5cde7 Compare November 9, 2023 00:18

mgeisler reviewed Nov 9, 2023

View reviewed changes

i18n-helpers/fuzz/fuzz_targets/group_events.rs Show resolved Hide resolved

mgeisler enabled auto-merge November 9, 2023 07:34

mgeisler merged commit 9e894bc into google:main Nov 9, 2023
6 checks passed

mgeisler mentioned this pull request Nov 9, 2023

Release 0.3.0 #116

Merged

mgeisler mentioned this pull request Nov 19, 2023

Skip code blocks without translatable strings google/comprehensive-rust#1257

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extract only strings and comments from codeblock #109

Extract only strings and comments from codeblock #109

dalance commented Oct 28, 2023

google-cla bot commented Oct 28, 2023

mgeisler commented Oct 29, 2023

dalance commented Oct 30, 2023

dalance commented Oct 30, 2023

mgeisler commented Oct 30, 2023

dalance commented Oct 30, 2023

mgeisler commented Oct 31, 2023

dalance commented Oct 31, 2023

dalance commented Oct 31, 2023

henrif75 commented Nov 1, 2023

dyoo commented Nov 1, 2023

dalance commented Nov 2, 2023

dalance commented Nov 2, 2023

dalance commented Nov 2, 2023

mgeisler commented Nov 2, 2023

dalance commented Nov 2, 2023

dalance commented Nov 2, 2023

mgeisler commented Nov 2, 2023

dalance commented Nov 6, 2023

dyoo left a comment

mgeisler commented Nov 8, 2023

mgeisler commented Nov 8, 2023

dalance commented Nov 9, 2023

Extract only strings and comments from codeblock #109

Extract only strings and comments from codeblock #109

Conversation

dalance commented Oct 28, 2023

google-cla bot commented Oct 28, 2023

mgeisler commented Oct 29, 2023

dalance commented Oct 30, 2023

dalance commented Oct 30, 2023

mgeisler commented Oct 30, 2023

dalance commented Oct 30, 2023

mgeisler commented Oct 31, 2023

dalance commented Oct 31, 2023

dalance commented Oct 31, 2023

henrif75 commented Nov 1, 2023

dyoo commented Nov 1, 2023

dalance commented Nov 2, 2023

dalance commented Nov 2, 2023

dalance commented Nov 2, 2023

mgeisler commented Nov 2, 2023

dalance commented Nov 2, 2023

dalance commented Nov 2, 2023

mgeisler commented Nov 2, 2023

dalance commented Nov 6, 2023

dyoo left a comment

Choose a reason for hiding this comment

mgeisler commented Nov 8, 2023

mgeisler commented Nov 8, 2023

dalance commented Nov 9, 2023