Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22794 MF2: De-duplicate C++ and Java test data #3050

Merged

Conversation

catamorphism
Copy link
Contributor

@catamorphism catamorphism commented Jun 26, 2024

This PR moves all the .json files for data-driven ICU4C and ICU4J tests into a single top-level directory, icu/testdata/.

I did my best to remove duplicate tests.

This involved fixing several ICU4J parser bugs, almost all involving whitespace handling.

Some tests still had to be ignored in either Java or C++, in which cases I filed ICU tickets.

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22794
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@catamorphism catamorphism marked this pull request as ready for review June 27, 2024 21:25
@catamorphism
Copy link
Contributor Author

/cc @echeran @mihnita

@srl295 srl295 changed the title [ICU-22794] MF2: De-duplicate C++ and Java test data ICU-22794 MF2: De-duplicate C++ and Java test data Jun 28, 2024
@srl295
Copy link
Member

srl295 commented Jun 28, 2024

Maybe this was discussed before, but for example ICU tarballs in https://github.com/unicode-org/icu/releases/tag/release-75-1 do not have this format.

Perhaps we should have a situation where the testdata gets copied in the ICU4C and ICU4J tarballs at a high level? I.e. so source and testdata are siblings.

Then the code could check ../testdata (source tarball) and if that doesn't exist then check ../../testdata (in-repo) and then fail.

#else
srcDataDir = ".." U_FILE_SEP_STRING ".." U_FILE_SEP_STRING ".." U_FILE_SEP_STRING ".." U_FILE_SEP_STRING "testdata" U_FILE_SEP_STRING;
FILE *f = fopen(".." U_FILE_SEP_STRING ".." U_FILE_SEP_STRING "test" U_FILE_SEP_STRING
"testdata" U_FILE_SEP_STRING "rbbitst.txt",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you move rbbitst.txt in this PR? I think this should perhaps be a different file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's correct as-is, but it's a weird way to check if a directory exists (I copied that from existing code). In 0ef54a5 I changed it to call stat() on the testdata directory itself instead.

icu4c/source/test/intltest/intltest.h Outdated Show resolved Hide resolved
checkCondition(cp == '|', "Invalid escape sequence, only \"\\|\" is valid here");
result.appendCodePoint('|');
boolean isValidEscape = cp == '|' || cp == '\\' || cp == '{' || cp == '}';
checkCondition(isValidEscape, "Invalid escape sequence inside quoted literal");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to validate the approach of shared test data!

Comment on lines 182 to 196
Path icuTestdataDir = filePath.resolve("../../../../../../../../../../../testdata/message2/").normalize();
return icuTestdataDir.resolve(json);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally should test and give the user some feedback if it doesn't work. Also, see comment about packaging when using a non-repo work area.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c53f203. Re: using a non-repo work area, see #3050 (comment)

testdata/message2/README.txt Outdated Show resolved Hide resolved
Comment on lines 34 to 35
Tests in the `spec/` subdirectory are taken from https://github.com/unicode-org/message-format-wg/blob/main/test
and need to be manually updated if the contents change upstream.
Copy link
Member

@srl295 srl295 Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Tests in the `spec/` subdirectory are taken from https://github.com/unicode-org/message-format-wg/blob/main/test
and need to be manually updated if the contents change upstream.
Tests in the `spec/` subdirectory are taken from https://github.com/unicode-org/message-format-wg/blob/main/test
and need to be manually updated if the contents change upstream, but see https://unicode-org.atlassian.net/browse/ICU-22812 for making that automatic.

wouldn't it be from CLDR not mfwg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of -- if MFWG changes the spec tests, then CLDR would need to change to reflect those changes (for non-spec tests, CLDR is the source of truth). Clarified this in 1993b61.

@catamorphism
Copy link
Contributor Author

Maybe this was discussed before, but for example ICU tarballs in https://github.com/unicode-org/icu/releases/tag/release-75-1 do not have this format.

Perhaps we should have a situation where the testdata gets copied in the ICU4C and ICU4J tarballs at a high level? I.e. so source and testdata are siblings.

Then the code could check ../testdata (source tarball) and if that doesn't exist then check ../../testdata (in-repo) and then fail.

In ICU4C, 0ef54a5 should do this.

I'm not sure how to do the same for ICU4J, though. I got as far as finding the icu4j/releases_tools/github_release.sh, but I can't get it to run without errors, even in a branch with no local changes. I pasted the relevant part of the output into https://gist.github.com/catamorphism/1c9ea0e26580dd18ef69852de931ce8d . It's giving me an error about the javadocs. So I can't get far enough to make the change to copy the testdata/ directory into the jar file and test the change. Do you have any suggestions?

@catamorphism
Copy link
Contributor Author

catamorphism commented Jul 17, 2024

I'm not sure how to do the same for ICU4J, though. I got as far as finding the icu4j/releases_tools/github_release.sh, but I can't get it to run without errors, even in a branch with no local changes. I pasted the relevant part of the output into https://gist.github.com/catamorphism/1c9ea0e26580dd18ef69852de931ce8d . It's giving me an error about the javadocs. So I can't get far enough to make the change to copy the testdata/ directory into the jar file and test the change. Do you have any suggestions?

I was able to get it to work by editing the github_release.sh script as follows:

# mvn clean install -DskipITs -DskipTests -P with_sources,with_javadoc
mvn clean install -DskipITs -DskipTests -P with_sources
mvn install -DskipITs -DskipTests -P with_javadoc

though I don't know if that breaks anything. However, I'm not sure how to run the tests in the source .jar file once it's been created.

@catamorphism
Copy link
Contributor Author

Looks like the build failures are timeouts.

@srl295
Copy link
Member

srl295 commented Jul 31, 2024

I'm not sure how to do the same for ICU4J, though. I got as far as finding the icu4j/releases_tools/github_release.sh, but I can't get it to run without errors, even in a branch with no local changes. I pasted the relevant part of the output into https://gist.github.com/catamorphism/1c9ea0e26580dd18ef69852de931ce8d . It's giving me an error about the javadocs.

that has to do with needing to use an older java to build.

So I can't get far enough to make the change to copy the testdata/ directory into the jar file and test the change. Do you have any suggestions?

I'm not sure, but I can also try to test/fix this in the PR #3053

I was able to get it to work by editing the github_release.sh script as follows:

# mvn clean install -DskipITs -DskipTests -P with_sources,with_javadoc
mvn clean install -DskipITs -DskipTests -P with_sources
mvn install -DskipITs -DskipTests -P with_javadoc

though I don't know if that breaks anything. However, I'm not sure how to run the tests in the source .jar file once it's been created.

You have to unpack the jar and then compile everything.

srl295
srl295 previously approved these changes Jul 31, 2024
Copy link
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems to be in the right direction, need to verify with ICU-TC whether sys/stat.h is OK for inclusion.

@srl295
Copy link
Member

srl295 commented Jul 31, 2024

@catamorphism OK, well all tests pass on ICU4C (verified they are loading the files from the common area) and on ICU4J they fail with this (which at least shows that the test data is being read from its location!)

[ERROR] Failures: 
[ERROR]   FromJsonTest.test:417 TestCase {
  message: {hello {(world)}}',
  locale: 'en-US',
  arguments: {},
  expected: 'hello world',
  ignore: false,
  ignoreReason: '',
  errors: []
}
No error was expected here, but it happened:
Parse error:
Message: <<{hello {(world)}}>>
Error: Parse error [8]: Unclosed placeholder
{hello {^^^(world)}}

[ERROR]   FunctionsTest.test:32 UnitTest {src="Default int64: {$val}!", params={val=1.2345678901234568E18}, exp="Default int64: 1.234.567.890.123.456.770!"} expected:<...234.567.890.123.456.[77]0!> but was:<...234.567.890.123.456.[80]0!>
[INFO] 
[ERROR] Tests run: 2833, Failures: 2, Errors: 0, Skipped: 26

@catamorphism
Copy link
Contributor Author

though I don't know if that breaks anything. However, I'm not sure how to run the tests in the source .jar file once it's been created.

You have to unpack the jar and then compile everything.

Do any of the .jar files actually contain tests? I noticed that the github_release.sh script passes the flag -DskipTests when creating files, and after running the script, there are no test files in icu4j/target/github_release/icu4j-76_0_1-sources.jar.

@catamorphism
Copy link
Contributor Author

catamorphism commented Aug 1, 2024

seems to be in the right direction, need to verify with ICU-TC whether sys/stat.h is OK for inclusion.

No need, I changed it to check for a text file.

@catamorphism
Copy link
Contributor Author

catamorphism commented Aug 1, 2024

@catamorphism OK, well all tests pass on ICU4C (verified they are loading the files from the common area) and on ICU4J they fail with this (which at least shows that the test data is being read from its location!)

What command are you running to get that output? It's weird because the string {hello {(world)}} doesn't appear anywhere in either top-level icu4j or testdata in this branch, nor does it appear in icu4j in the main branch.

Copy link
Contributor

@echeran echeran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for refactoring test data & getting ICU4J tests to use the entire shared test data set including fixing up ICU4J unit test code accordingly.

…rectory

Modify ICU4C and ICU4J test readers to handle all tests

Add `ignoreJava` and `ignoreCpp` properties to tests where needed

Includes parser bug fixes:

ICU4J: require a complex-body after declarations

ICU4J: Correctly parse the complex body after an unsupported statement

ICU4J: Handle date params in tests and remove default params for tests

ICU4J: Handle decimal params in tests

ICU4J: Require whitespace before variable/literal in reserved annotation

ICU4J: Require whitespace between options

ICU4J: Require a variable-expression in an .input declaration

ICU4J: don't require space between last key and pattern in variant

ICU4J: don't require space between selectors

ICU4J: allow whitespace after '=' in option

ICU4J: parse escape sequences in quoted literals according to grammar

ICU4J: allow whitespace within markup after attributes list
@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Copy link
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ready to go

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the good work!
some questions...

Comment on lines +71 to +72
# Copy top-level testdata directory so it's a sibling of the source/ directory
cp -R $(ICU4CTOP)/../testdata $(DISTY_TMP)/icu
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we copying testdata files?
Is it because we distribute ICU4C .zip/.jar files rooted in icu4c/ rather than the one-higher repo root?
If so, shouldn't we change that instead? E.g., start with 76 to package up from the repo root, but for ICU4C exclude the icu4j/ and tools/ folders (and probably some more stuff)?

@@ -1679,6 +1679,61 @@ const char *IntlTest::getSourceTestData(UErrorCode& /*err*/) {
return srcDataDir;
}

static bool fileExists(const char* fileName) {
// Test for `srcDataDir` existing by checking for `srcDataDir`/message2/valid-tests.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems strangely, overly specific for this rather generic function.
Did this want to be a comment on or in getSharedTestData()?

static bool fileExists(const char* fileName) {
// Test for `srcDataDir` existing by checking for `srcDataDir`/message2/valid-tests.json
U_ASSERT(fileName != nullptr);
FILE *f = fopen(fileName, "r");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can/should we use some modern-ish C++ library function for "file exists"?

Comment on lines +181 to +182
// First, check the top level of the source directory,
// in case we're in a source tarball
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as for ICU4C: Should we start to package up ICU4J zip/jar/tar files from repo root, except omit unnecessary folders?

@echeran echeran merged commit 57ed0a2 into unicode-org:main Aug 8, 2024
105 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants