Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Databake improvements #2906

Merged
merged 8 commits into from
Dec 21, 2022
Merged

Databake improvements #2906

merged 8 commits into from
Dec 21, 2022

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Dec 20, 2022

  • Removed litemap in favour of separate key and value arrays (Explore alternate top level data structures #2885):
    • Fewer binary search monomorphizations: before, we had one per data struct (on &[(&str, &DataStruct)]), now there's only one, &[&str].
    • Key arrays can be deduplicated (there are only ~20 for ~200 keys)
    • Key lookup algorithm can be automatically optimized by datagen:
      • We still use binary search on a sorted array, but this can now easily be swapped for perfect hashing or AsciiTrie
      • If there is only one locale, we don't use strict_cmp, but eq with a macro-constructed locale
      • If the only locale is und we just check is_empty()
  • Smaller files for less rust-analyzer load
    • One file per (deduplicated) data struct
  • DataProvider and AnyProvider implementations are now behind a macro, which means they can be added to foreign types
  • icu_testdata can now be built in bin-mode even if databake doesn't compile

sffc
sffc previously approved these changes Dec 21, 2022
docs/tutorials/data_management.md Outdated Show resolved Hide resolved
@@ -84,6 +84,7 @@ zip = "0.6"
cached-path = "0.5"
reqwest = { version = "0.11", features = ["blocking"] }
lazy_static = "1"
rust-format = { version = "0.3.4", features = ["token_stream"] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: It's much better to do this at codegen time; I thought there was a reason we hadn't been doing that before? Is this a new crate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been calling rustfmt directly. This crate is just a nicer interface and also supports prettyplease, but that doesn't look good.

@@ -0,0 +1,14 @@
::icu_calendar::provider::JapaneseErasV1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: You can use the proper path now (the one used for JSON): baked/calendar/japanese@v1/und.rs.data

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't because the mod file with the lookup function is also in this module

},
192u32,
),
ces: unsafe { ::zerovec::ZeroVec::from_bytes_unchecked(&[]) },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: if the ZeroVec is empty, you don't need an unsafe constructor

impl DataProvider<::icu_calendar::provider::JapaneseErasV1Marker> for $provider {
fn load(&self, req: DataRequest) -> Result<DataResponse<::icu_calendar::provider::JapaneseErasV1Marker>, DataError> {
calendar::japanese_v1::lookup(&req.locale)
.map(zerofrom::ZeroFrom::zero_from)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: Although ZeroFrom doesn't allocate (except for the skeleton hack), it's not completely free, and it shows up in code size benches. We may want to think about making DataPayload copy-on-write where it can retain a reference to the static data that it returns from .get() and only call ZeroFrom when mutation needs to occur (rarer case).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to but Yoke doesn't currently support this afaik. @Manishearth

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't need to be in Yoke; I think it can be in DataPayload. (Not now; maybe for a future PR)

const NARROWDAYRELATIVETIMEFORMATDATAV1MARKER: ::icu_provider::DataKeyHash =
::icu_relativetime::provider::NarrowDayRelativeTimeFormatDataV1Marker::KEY.hashed();
#[cfg(feature = "icu_relativetime")]
const NARROWHO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: We don't really need to call this by default, either. It's kind-of elegant if the file defines two macros for you and nothing else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm maybe

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a breaking change to not expose BakedDataProvider anymore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also a breaking change that any.rs doesn't work anymore

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Are we okay with this breakage? Othewise any.rs could contain the macro invocation.

Copy link
Member

@sffc sffc Dec 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about if we add back any.rs with a single line to invoke the macro and then make a 2.0-breaking issue to clean this up and also change mod.rs to not invoke the normal macro

@robertbastian robertbastian requested review from sffc and Manishearth and removed request for filmil, zbraniecki, gregtatum and nordzilla December 21, 2022 12:59
sffc
sffc previously approved these changes Dec 21, 2022
sffc
sffc previously approved these changes Dec 21, 2022
true,
)?;
let file_name = format!("{file_name}.rs.data");
let statik = quote! { static #ident: DataStruct = include!(#file_name); };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: worth adding a comment to each line saying // used for locale "foo" and "bar" or something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the combined names are a better solution but Shane didn't like that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't mind each file having comments stating which locales it is used for, or comments in the mod.rs file. I think it's not really necessary though.

@robertbastian robertbastian merged commit b74aafd into unicode-org:main Dec 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants