Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML tag abbr seems to confuse the parser #97

Closed
HidenoriKobayashi opened this issue Oct 6, 2023 · 7 comments
Closed

HTML tag abbr seems to confuse the parser #97

HidenoriKobayashi opened this issue Oct 6, 2023 · 7 comments
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed

Comments

@HidenoriKobayashi
Copy link

I'm redirected to here from google/comprehensive-rust#1284.

The source:
https://github.com/google/comprehensive-rust/blob/a38a33c8fba58678d0a9127d9644242bce41ed94/src/bare-metal/microcontrollers/probe-rs.md

The po file (bit outdated but updating it does not resolve the issue):
https://github.com/google/comprehensive-rust/blob/a38a33c8fba58678d0a9127d9644242bce41ed94/po/ja.po#L13420

Now, if I make this change (just for testing purpose)

diff --git a/po/ja.po b/po/ja.po
index b2baaf215743..2a9eb0c872d4 100644
--- a/po/ja.po
+++ b/po/ja.po
@@ -13418,7 +13418,7 @@ msgstr ""
 
 #: src/bare-metal/microcontrollers/probe-rs.md:10
 msgid "`cargo-embed` is a cargo subcommand to build and flash binaries, log "
-msgstr ""
+msgstr "`cargo-embed`"
 
 #: src/bare-metal/microcontrollers/probe-rs.md:11
 msgid "RTT"

I get a result like this:
localhost_3000_bare-metal_microcontrollers_probe-rs html

The strange thing about this is that string cargo-embed got merged into the last item in the list.

@mgeisler
Copy link
Collaborator

mgeisler commented Oct 6, 2023

Thanks @HidenoriKobayashi!

I think there might be two problems here. First, we have this test, which demonstrates that we chop up the text when we encounter a HTML tag (including a HTML comment):

fn extract_messages_inline_html() {
// HTML tags are skipped, but text inside is extracted:
assert_extract_messages(
"Hi <script>alert('there');</script>",
vec![
(1, "Hi "), //
(1, "alert('there');"),
],
);
}

There is also this test, which demonstrates that we lose text immediately following a (block-level?) HTML tag:

fn extract_messages_details() {
// This isn't great: we lose text following a HTML tag:
assert_extract_messages(
"Preamble\n\
<details>\n\
Some Details\n\
</details>\n\
\n\
Postamble",
vec![
(1, "Preamble"), //
// Missing "Some Details"
(6, "Postamble"),
],
);

I think it's an open question if we could or should extract the HTML into the POT file. One concern would be if we should just extract foo for something like <abbr title="foo">bar</abbr>? That would probably require us to know that the title attribute has translatable text (which is doable). We would still be chopping up the text — which I guess is nicer than asking everybody to deal with HTML in the translation files?

This is only about what gets extracted and put into the POT file. The other half of this is how we weave things together when we translate the Markdown later.

I just tried adding a little unit test for mdbook-gettext:

    #[test]
    fn test_translate_html() {
        let catalog = create_catalog(&[("foo ", "FOO "), ("bar", "BAR"), (" baz", " BAZ")]);
        assert_eq!(
            translate("foo <b>bar</b> baz", &catalog),
            "FOO <b>BAR</b> BAZ"
        );
    }

it fails with

---- tests::test_translate_html stdout ----
thread 'tests::test_translate_html' panicked at i18n-helpers/src/bin/mdbook-gettext.rs:239:9:
assertion failed: `(left == right)`

Diff < left / right > :
<FOO <b>BAR</b>BAZ
>FOO <b>BAR</b> BAZ

which shows that we lose a ' ' immediately following the </b>!

@mgeisler mgeisler added bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed labels Oct 6, 2023
@mgeisler
Copy link
Collaborator

mgeisler commented Oct 6, 2023

Further investigation shows that extract_events(" BAZ", state) give a Text(Borrowed("BAZ")) back! It should really give " BAZ" back so this seems to be where the space is lost.

@mgeisler
Copy link
Collaborator

mgeisler commented Oct 6, 2023

This is actually fair: Markdown does not distinguish between a line of text starting with " foo" vs "foo". So when we extract the events, we lose the leading whitespace.

I'm not entirely sure where where we should account for this, but perhaps extract_events could add it back in? Or perhaps the caller should account for being in an inline context and then add it? Note that we have a pulldown_cmark_to_cmark::State object available and it knows that last_was_html... so that might be helpful here?

@mgeisler
Copy link
Collaborator

mgeisler commented Oct 6, 2023

So I think these two tests should work:

diff --git a/i18n-helpers/src/bin/mdbook-gettext.rs b/i18n-helpers/src/bin/mdbook-gettext.rs
index b686be4..291bb2c 100644
--- a/i18n-helpers/src/bin/mdbook-gettext.rs
+++ b/i18n-helpers/src/bin/mdbook-gettext.rs
@@ -233,6 +233,15 @@ mod tests {
         );
     }
 
+    #[test]
+    fn test_translate_html() {
+        let catalog = create_catalog(&[("foo ", "FOO "), ("bar", "BAR"), (" baz", " BAZ")]);
+        assert_eq!(
+            translate("foo <b>bar</b> baz", &catalog),
+            "FOO <b>BAR</b> BAZ"
+        );
+    }
+
     #[test]
     fn test_translate_table() {
         let catalog = create_catalog(&[
diff --git a/i18n-helpers/src/lib.rs b/i18n-helpers/src/lib.rs
index 3bf2a5f..484d348 100644
--- a/i18n-helpers/src/lib.rs
+++ b/i18n-helpers/src/lib.rs
@@ -590,6 +590,30 @@ mod tests {
         assert_eq!(extract_events("", None), vec![]);
     }
 
+    #[test]
+    fn extract_events_leading_whitespace() {
+        assert_eq!(
+            extract_events("  foo", None),
+            vec![
+                (1, Start(Paragraph)),
+                (1, Text("  foo".into())),
+                (1, End(Paragraph)),
+            ]
+        );
+    }
+
+    #[test]
+    fn extract_events_trailing_whitespace() {
+        assert_eq!(
+            extract_events("foo  ", None),
+            vec![
+                (1, Start(Paragraph)),
+                (1, Text("foo  ".into())),
+                (1, End(Paragraph)),
+            ]
+        );
+    }
+
     #[test]
     fn extract_events_paragraph() {
         assert_eq!(

mgeisler added a commit to google/comprehensive-rust that referenced this issue Nov 30, 2023
The mixing of HTML and Markdown breaks our translation pipeline: we see the HTML and fail to parse things correctly.

This might be google/mdbook-i18n-helpers#97, but I'm not 100% sure.

The fix is to make put the HTML on its own line: then the Markdown is parsed again inside.
mgeisler added a commit to google/comprehensive-rust that referenced this issue Nov 30, 2023
The mixing of HTML and Markdown breaks our translation pipeline: we see
the HTML and fail to parse things correctly.

This might be google/mdbook-i18n-helpers#97,
but I'm not 100% sure.

The fix is to make put the HTML on its own line: then the Markdown is
parsed again inside.

Fixes #1527.
@mgeisler
Copy link
Collaborator

mgeisler commented Jan 9, 2024

@kdarkhan, the problem here is about how we lose a ' ' when the Markdown AST is a mix of Text and Html nodes. I believe #125 is related: it's about how breaking apart the text on HTML elements causes problems. I think we could solve both problems if we would let Text and Html nodes be grouped together.

@kdarkhan kdarkhan self-assigned this Jan 29, 2024
@kdarkhan
Copy link
Collaborator

kdarkhan commented May 22, 2024

I believe this should be resolved now with #195 merged.

@kdarkhan kdarkhan removed their assignment May 22, 2024
@mgeisler
Copy link
Collaborator

I think so too, let's close it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants