ICU-22890 MF2: Add lone surrogate test to parser #3167

catamorphism · 2024-09-14T00:59:50Z

Also see #3166

Checklist

Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22890
Required: The PR title must be prefixed with a JIRA Issue number.
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number.
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

catamorphism · 2024-09-14T01:00:53Z

Note that #3063 (not landed yet) and #3092 (draft) have more general fixes for keyword lookahead and wide character parsing, but I wanted to include a minimal solution here to address the infinite loop bug.

FrankYFTang · 2024-09-16T23:00:34Z

@echeran you have reviewed other MessageFormatter changes before. Could you also review this PR? Thanks

jira-pull-request-webhook · 2024-09-18T17:32:33Z

Notice: the branch changed across the force-push!

icu4c/source/i18n/messageformat2_parser.cpp is no longer changed in the branch
icu4c/source/i18n/messageformat2_parser.h is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

catamorphism · 2024-09-18T17:33:19Z

I rebased, and it turns out that #3063 fully fixed the infinite loop bug, so I removed all the commits except the one with the new tests.

FrankYFTang · 2024-09-18T17:55:56Z

Could you change the title of this PR from "Fix infinite loop in parser" to "Add lone surrogate test to MessageFormatter2 Parser"? Thanks

FrankYFTang · 2024-09-18T18:13:48Z

Could you also add a similar test for Java?

catamorphism · 2024-09-18T20:12:03Z

@FrankYFTang First, I added checks to your ICU4C test so as to require a syntax error.

I also added a test for Java. I was hoping to add this to the shared data-driven tests, but I couldn't figure out how to escape the unpaired surrogate strings for JSON. So, the tests are separate for now.

ICU4J wasn't erroring out on this case (though there was no infinite loop either), so I fixed that -- cc @mihnita. But maybe that should be a separate PR, since the ICU4J bug isn't critical. What do you think?

FrankYFTang · 2024-09-18T20:59:54Z

icu4c/source/test/intltest/messageformat2test.cpp

+      .build(errorCode);
+    UnicodeString result = msgfmt1.formatToString({}, errorCode);
+    assertEquals("testHighLoneSurrogate", U_MF_SYNTAX_ERROR, errorCode);
+    errorCode.reset();


why do we reset the errorCode here?

I thought that in intltest, each test has to reset the error code so it doesn't carry over to the next test -- is that not the case?

no, errorCode is a local variable in this test.

please remove the reset

If I remove the errorCode.reset() call, then the test fails. I assume this has to do with the reference to this:

IcuTestErrorCode errorCode(*this, "testHighLoneSurrogate");

and that there's some shared state that results in an error if the error code is non-success at the end of a test.

A lot of other tests in intltest have this pattern (errorCode declared as a local IcuTestErrorCode variable, and then errorCode.reset() at the end of the method.)

so sorry, I got confused, it should be

errorCode.expectErrorAndReset(U_MF_SYNTAX_ERROR, "testHighLoneSurrogate");

instead of

assertEquals("testHighLoneSurrogate", U_MF_SYNTAX_ERROR, errorCode); errorCode.reset();

same below

OK, fixed in a9a6de2
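
The exchange above hinges on the test error-code wrapper complaining if it still holds a failure when it goes out of scope. Below is a minimal, self-contained sketch of that behavior, under that assumption. `Status`, `CheckedErrorCode`, and `runFakeTest` are hypothetical stand-ins written for illustration; they are not ICU's IcuTestErrorCode, though the method names `reset` and `expectErrorAndReset` mirror the calls quoted in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical status values standing in for UErrorCode.
enum class Status { Success, SyntaxError };

// Hypothetical wrapper: logs a failure if destroyed while holding an error.
class CheckedErrorCode {
public:
    CheckedErrorCode(std::vector<std::string>& log, std::string name)
        : log_(log), name_(std::move(name)) {}

    ~CheckedErrorCode() {
        // An error that was never acknowledged counts as a test failure.
        if (code_ != Status::Success) {
            log_.push_back(name_ + ": destructor: expected success");
        }
    }

    void set(Status s) { code_ = s; }
    void reset() { code_ = Status::Success; }

    // Checks that the stored code is the expected one, then clears it.
    // Returns true if an error was stored before the reset.
    bool expectErrorAndReset(Status expected) {
        if (code_ != expected) {
            log_.push_back(name_ + ": unexpected error code");
        }
        bool wasFailure = code_ != Status::Success;
        reset();
        return wasFailure;
    }

private:
    std::vector<std::string>& log_;
    std::string name_;
    Status code_ = Status::Success;
};

// Simulates a test whose operation always yields SyntaxError.
// Returns the number of failures logged by the time the test exits.
std::size_t runFakeTest(bool acknowledgeError) {
    std::vector<std::string> log;
    {
        CheckedErrorCode errorCode(log, "testHighLoneSurrogate");
        errorCode.set(Status::SyntaxError);  // stand-in for the parse error
        if (acknowledgeError) {
            errorCode.expectErrorAndReset(Status::SyntaxError);
        }
    }  // destructor runs here
    return log.size();
}
```

In this sketch, `runFakeTest(true)` logs nothing, while `runFakeTest(false)` logs one failure from the destructor, which matches the observation that the test fails once the reset is removed.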

FrankYFTang · 2024-09-18T20:59:58Z

icu4c/source/test/intltest/messageformat2test.cpp

+      .build(errorCode);
+    UnicodeString result = msgfmt2.formatToString({}, errorCode);
+    assertEquals("testLowLoneSurrogate", U_MF_SYNTAX_ERROR, errorCode);
+    errorCode.reset();


why do we reset the errorCode here?

FrankYFTang

LGTM

catamorphism · 2024-09-18T22:42:57Z

I haven't been able to reproduce the fuzzer failure in the latest CI run yet, but I'll keep trying.

mihnita · 2024-09-18T23:42:39Z

Unpaired surrogates are not an error, according to the spec.
We had a long argument about that a while ago, I can try to dig it out.

jira-pull-request-webhook · 2024-09-19T01:29:32Z

Notice: the branch changed across the force-push!

icu4c/source/i18n/unicode/messageformat2.h is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

catamorphism · 2024-09-19T01:30:13Z

135a911 should fix the fuzzer error.

catamorphism · 2024-09-19T01:32:27Z

@mihnita:

> Unpaired surrogates are not an error, according to the spec. We had a long argument about that a while ago, I can try to dig it out.

I think they're a syntax error?

simple-message    -> o [simple-start pattern]
simple-start -> simple-start-char / escaped-char / placeholder
simple-start-char -> content-char / "@" / "|"

And the definition of content-char excludes surrogates. Am I missing something?
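
To make the reading above concrete: if content-char excludes the surrogate range U+D800 to U+DFFF, then a message containing a surrogate code unit that is not part of a valid lead/trail pair cannot match the grammar. The following is a minimal sketch of detecting such a code unit in a UTF-16 string. The function name `findLoneSurrogate` is hypothetical, and this is not how the ICU4C parser is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the index of the first unpaired surrogate code unit in `s`,
// or std::u16string::npos if every surrogate belongs to a valid pair.
std::size_t findLoneSurrogate(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        const bool isLead = c >= 0xD800 && c <= 0xDBFF;
        const bool isTrail = c >= 0xDC00 && c <= 0xDFFF;
        if (isLead) {
            // A lead surrogate must be followed immediately by a trail surrogate.
            const bool hasTrail = i + 1 < s.size() &&
                                  s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
            if (!hasTrail) {
                return i;
            }
            ++i;  // skip the trail half of the valid pair
        } else if (isTrail) {
            // A trail surrogate with no preceding lead is unpaired.
            return i;
        }
    }
    return std::u16string::npos;
}
```

A parser that follows the strict reading would report a syntax error at the returned index; one that leaves the question to the implementation would simply pass the code unit through.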

jira-pull-request-webhook · 2024-09-19T02:05:30Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

catamorphism · 2024-09-19T02:33:58Z

I won't merge this right away so that @mihnita has a chance to reply further re: whether unpaired surrogates are a syntax error.

mihnita · 2024-09-19T03:33:34Z

https://github.com/unicode-org/message-format-wg/blob/main/meetings/2022/notes-2022-06-13.md

RGN: Surrogate code points. Those are code points reserved for representing code points in UTF-16 that are beyond the first plane (BMP) of 2^16 code points.

MIH: I understand what you mean. But we also implement this in C and Java, and so on. So what should we do if we receive a message with invalid UTF-8 code points? Do we expect to replace them with the replacement character, or do we just pass them through?

RGN: I think what you're asking about, using JavaScript as a concrete example, is that a JS string is allowed to have unpaired surrogates. So the question is a question for the JS adapter / implementation, but that's not a question for the standard itself.

MIH: So we leave it to the implementation?

RGN: Yes.

MIH: Okay, that is fine with me.

If the spec ended up saying something else, I am really not happy about it.
I will try to track down when this got in and how.

I don't want to reopen that discussion.
But it is not the job of the MF2 to validate correct / incorrect UTF sequences.
Not in the message proper.
I can live with it in the code part. But once we are in the pattern, we should not.

And I would not make such a change so very late.

One might make an argument about the C++ implementation.

But Java is UTF-16 everywhere, and it does not "explode" on improperly paired surrogates.
And that was explicitly mentioned and accepted in the quoted thread.

(OK, deep in the belly of String there is an optimization making some strings Latin1, but that is not visible in the public APIs, it is only an implementation detail, for size)

catamorphism · 2024-09-19T04:12:47Z

@mihnita Looks like unicode-org/message-format-wg#290 is the PR that introduced this (from Aug. 2022).

mihnita · 2024-09-19T15:59:34Z

I will open an issue with MessageFormat WG.

This is really unfriendly for Java, JavaScript, and Windows C "wide APIs" (which take a wchar_t, which is 16-bit).

It is not the job of a formatting function to validate and reject UTF correctness, at least in the above-mentioned environments.

For example, in Java we have String.format, java.text.MessageFormat, and com.ibm.icu.text.MessageFormat; all of them work fine with unpaired surrogates.

I will make my case in the WG issue.

But I am against making this change in Java with this PR, sorry.

mihnita

Please remove the Java changes from the PR.

markusicu · 2024-09-19T16:44:27Z

We normally treat unpaired surrogates like unassigned code points.

jira-pull-request-webhook · 2024-09-19T19:59:11Z

Notice: the branch changed across the force-push!

icu4j/main/core/src/main/java/com/ibm/icu/message2/InputSource.java is no longer changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java is no longer changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/message2/MessageFormat2Test.java is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

catamorphism · 2024-09-19T20:00:06Z

OK -- I've removed the Java test and changes. Needs a re-approval.

mihnita

Thank you very much!
Mihai

jira-pull-request-webhook · 2024-09-19T20:32:49Z

Notice: the branch changed across the force-push!

icu4c/source/i18n/unicode/messageformat2.h is no longer changed in the branch
icu4c/source/test/intltest/messageformat2test.cpp is different
icu4c/source/test/intltest/messageformat2test.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

mihnita · 2024-09-19T20:33:50Z

I've created unicode-org/message-format-wg#895

catamorphism · 2024-09-19T20:34:15Z

Sorry @mihnita -- one more approval? I had to rebase because I landed #3083 first.

catamorphism mentioned this pull request Sep 14, 2024

ICU-22890 Add test to show lone surrogate cause infinity loop #3166

Closed

7 tasks

catamorphism force-pushed the icu-22890 branch from 8699988 to 68b5ef5 Compare September 18, 2024 17:32

FrankYFTang self-requested a review September 18, 2024 17:54

catamorphism changed the title ~~ICU-22890: Fix infinite loop in parser~~ ICU-22890: MF2: Add lone surrogate test to parser Sep 18, 2024

FrankYFTang reviewed Sep 18, 2024

FrankYFTang previously approved these changes Sep 18, 2024

catamorphism force-pushed the icu-22890 branch from fb0c873 to 5d95452 Compare September 18, 2024 21:04

catamorphism dismissed FrankYFTang’s stale review via 135a911 September 19, 2024 01:29

catamorphism force-pushed the icu-22890 branch from 5d95452 to 135a911 Compare September 19, 2024 01:29

FrankYFTang previously approved these changes Sep 19, 2024

catamorphism force-pushed the icu-22890 branch from a9a6de2 to af45a5b Compare September 19, 2024 02:05

catamorphism changed the title ~~ICU-22890: MF2: Add lone surrogate test to parser~~ ICU-22890 MF2: Add lone surrogate test to parser Sep 19, 2024

mihnita requested changes Sep 19, 2024

markusicu assigned mihnita Sep 19, 2024

markusicu self-requested a review September 19, 2024 16:44

catamorphism dismissed FrankYFTang’s stale review via b3efbe2 September 19, 2024 19:59

catamorphism force-pushed the icu-22890 branch from af45a5b to b3efbe2 Compare September 19, 2024 19:59

catamorphism requested review from FrankYFTang and mihnita September 19, 2024 20:00

mihnita previously approved these changes Sep 19, 2024

ICU-22890 MF2: Add ICU4C test for lone surrogates

99fbca1

Add a test to ICU4C for handling of lone surrogates. Incidentally fix uninitialized-memory bug in MessageFormatter (initialize `errors` to nullptr) Co-authored-by: Frank Tang <ftang@chromium.org>

catamorphism dismissed mihnita’s stale review via 99fbca1 September 19, 2024 20:32

catamorphism force-pushed the icu-22890 branch from b3efbe2 to 99fbca1 Compare September 19, 2024 20:32

catamorphism requested a review from mihnita September 19, 2024 20:34

mihnita approved these changes Sep 19, 2024

catamorphism merged commit 5991c93 into unicode-org:main Sep 19, 2024
94 checks passed
