Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A path to killing testdata #3529

Closed
Manishearth opened this issue Jun 14, 2023 · 8 comments · Fixed by #3635
Closed

A path to killing testdata #3529

Manishearth opened this issue Jun 14, 2023 · 8 comments · Fixed by #3635
Assignees
Labels
A-data Area: Data coverage or quality C-meta Component: Relating to ICU4X as a whole S-small Size: One afternoon (small bug fix or enhancement) T-docs-tests Type: Code change outside core library

Comments

@Manishearth
Copy link
Member

Manishearth commented Jun 14, 2023

Related: #3529

@robertbastian has been diligently working on autodata/defaultdata (#2945), and we're nearing a point where each crate will have its own data already bundled, and used in the examples.

I've never been a fan of having checked-in postcard blobs in our tree, and I'm also not a fan of duplicating testdata and defaultdata.

We have wanted to have a separation between "data we use for testing" and "data we give people by default", so there still is some value to retaining defaultdata. Specifically; if we want to test some locale with strange features but not include it in defaultdata; we probably still need testdata.

However I'm wondering if we can greatly reduce the size of testdata, or potentially put such locale-tests in subdirectories with their own component/test_data/config.json.

It's worth starting conversations now.

I'd also like to get rid of checked in postcard:, see #3530

Opening this now to get temperature checks and an idea of where people are leaning.

Discuss with:

@Manishearth Manishearth added discuss Discuss at a future ICU4X-SC meeting A-data Area: Data coverage or quality labels Jun 14, 2023
@Manishearth
Copy link
Member Author

cc @sffc @robertbastian

@robertbastian
Copy link
Member

robertbastian commented Jun 14, 2023

Yes please, kill the crate, use baked data where possible, and put extra data in $crate/tests/data.

@sffc
Copy link
Member

sffc commented Jun 14, 2023

An outcome I would support would be:

  1. Subset of CLDR-JSON remains in repodata, including any extra locales that are interesting for datagen or component tests
  2. Make repodata ICU4X-JSON cover the same set of locales as CLDR-JSON, which may not be the same across all components
  3. The globaldata covers a larger set of locales that we consider a reasonable default (@robertbastian suggested modern but I'd like to discuss that); this is checked-in as baked data
  4. Postcard is generated during CI testing based on the CLDR-JSON repodata but not checked-in to tree (except for the small number of blob-specific unit tests)

@Manishearth
Copy link
Member Author

Manishearth commented Jun 15, 2023

This works for me. I'm iffy about keeping icu4x-json but overall I care less about that. Happy to have that discussion after we do everything else sketched out here.

(I think having that be generated on demand as needed is also fine)

@robertbastian
Copy link
Member

I'd like to get rid of repodata if possible. I've voiced concerns about it being a manually maintained subset of CLDR that might not behave like CLDR before, but keeping datagen unit tests network-free convinced me to keep it around. In fact, full datagen with repodata currently fails.

If we get rid of the testdata crate, then repodata is only used in one place, datagen unit tests. We should move what's needed into provider/datagen/tests/data, but move any unit test that tests through a component interface into the component. In fact, provider/datagen/tests/data could be synthetic data.

Postcard generation during CI should use the same set that checked in baked data uses, i.e. a tag. We already generate data from tags in CI, and this has never been an issue. Using a different set for postcard data can produce confusing test failures and also requires us to manually maintain the checked-in subset.

@Manishearth
Copy link
Member Author

I'm in favor. If we do this we should probably provide utility cargo-make tasks that will output the currently dominant tag of cldr-json, icuexportdata, or icu4x-json to a requested directory for debugging purposes.

I'm fine with making this the last thing we do, i.e. we take Shane's 4-entry list and execute it, and then also later remove repodata.

@sffc sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label Jun 15, 2023
@sffc
Copy link
Member

sffc commented Jul 5, 2023

Conclusions:

LGTM: @sffc @robertbastian @Manishearth

@sffc sffc removed discuss Discuss at a future ICU4X-SC meeting discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band labels Jul 5, 2023
@sffc sffc self-assigned this Jul 5, 2023
@sffc
Copy link
Member

sffc commented Jul 5, 2023

@sffc will update tutorials

@sffc sffc added T-docs-tests Type: Code change outside core library C-meta Component: Relating to ICU4X as a whole S-small Size: One afternoon (small bug fix or enhancement) labels Jul 5, 2023
@sffc sffc added this to the 1.3 Blocking ⟨P1⟩ milestone Jul 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-data Area: Data coverage or quality C-meta Component: Relating to ICU4X as a whole S-small Size: One afternoon (small bug fix or enhancement) T-docs-tests Type: Code change outside core library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants