[wasm][globalization][icu] Tracking issue for HybridGlobalization: Web API + ICU #79989

Closed
16 of 23 tasks
ilonatommy opened this issue Dec 27, 2022 · 3 comments
Labels
arch-wasm WebAssembly architecture area-System.Globalization runtime-mono specific to the Mono runtime

@ilonatommy
Member

ilonatommy commented Dec 27, 2022

The task is to remove as much data from the ICU files as possible and to replace the ICU4C functions that use this data with platform-native functions, which for WASM means the Web API. Because we cannot get rid of the ICU data file completely (some functionality is not easily replaceable), we will keep loading icudt.dat in a reduced form. This mode will be called HybridGlobalization and will be switched off by default. Users can switch it on by setting the MSBuild property <HybridGlobalization> to true.
PoC branch is here: main...ilonatommy:runtime:icu-platform-native.
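
To illustrate what replacing an ICU4C function with a Web API call might look like, here is a minimal sketch for case changing. It is not the code from the PoC branch: the function name `changeCaseInto`, its parameters, and the `-1` return convention are hypothetical. It also mimics the idea of writing the result into a caller-provided buffer instead of returning a new string.

```typescript
// Hypothetical sketch, not the runtime's implementation.
// Changes the case of UTF-16 code units in `src` using the Web API and writes
// the result into the caller-provided `dst` buffer.
// Returns the number of code units written, or -1 if `dst` is too small.
function changeCaseInto(
    culture: string | undefined,
    src: Uint16Array,
    dst: Uint16Array,
    toUpper: boolean
): number {
    // Decode the source buffer into a JS string.
    let input = "";
    for (let i = 0; i < src.length; i++) {
        input += String.fromCharCode(src[i]);
    }

    // Locale-aware case mapping provided by the platform, no ICU data file needed.
    const result = toUpper
        ? input.toLocaleUpperCase(culture)
        : input.toLocaleLowerCase(culture);

    // Case mapping can change the length (e.g. "ß" uppercases to "SS"),
    // so the destination size has to be checked.
    if (result.length > dst.length) {
        return -1;
    }
    for (let i = 0; i < result.length; i++) {
        dst[i] = result.charCodeAt(i);
    }
    return result.length;
}
```

In this sketch the caller owns both buffers, so no intermediate string has to cross the interop boundary; how the actual implementation shares memory with C# is not shown here.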

  1. Removing collations for WASM
  • Prepare icudt_wasm.dat and the corresponding sharded data files without collations/standard, enable the HybridGlobalization setting, and write a WBT that checks whether the new file is loaded instead of the old one. PR: "Reduced ICU files for HybridGlobalization" (icu#300) [ILONA]
  • Load reduced ICU files in Blazor. This unified the startup for Blazor, so HybridGlobalization (HG) could be implemented for it as well.
  • Finish implementing GlobalizationNative_ChangeCase and optimize memory usage: do not create a new string for returning the value, but pass the address of a buffer reserved on the C# side that will hold the result. [ILONA]
    public API: TextInfo.ToLower, TextInfo.ToUpper, TextInfo.ToTitleCase
  • Implement GlobalizationNative_IndexOf and GlobalizationNative_LastIndexOf. These will not work for letters that consist of more than one grapheme; see the issue "Add locale sensitive substring matching functions to Intl.Collator" (tc39/ecma402#506). [ILONA]
    public API: CompareInfo.IndexOf, String.IndexOf, MemoryExtensions.IndexOf, CompareInfo.LastIndexOf, String.LastIndexOf, MemoryExtensions.LastIndexOf.
  • Implement GlobalizationNative_StartsWith and GlobalizationNative_EndsWith [ILONA]
    public API: CompareInfo.IsSuffix, String.EndsWith, MemoryExtensions.EndsWith, CompareInfo.IsPrefix, String.StartsWith, MemoryExtensions.StartsWith.
  • Implement GlobalizationNative_CompareString (without Ordinal and OrdinalIgnoreCase, IgnoreKanaType, IgnoreWidth) [ILONA]
    public API: CompareInfo.Compare, String.Compare
  • Implement IgnoreKanaType and IgnoreWidth based on the pal_collation.c code [ILONA]
  • Investigate Ordinal and OrdinalIgnoreCase. [ILONA]
  • Throw PNSE on GlobalizationNative_GetSortKey
  • How much do we gain by removing SortVersion? 32kB uncompressed, which is not worth it. (Had the gain been significant, the plan was to throw PNSE on GlobalizationNative_GetSortVersion.)
  • Document how to use the flag and what to expect when switching it on.
  • Coordinate the flow of HybridGlobalization from Blazor. Changes in dotnet/sdk might be needed.
  1. Removing normalization for WASM:
    Removed from planned Hybrid features. Savings from normalization removal on WASM are ~60kB. The removal breaks public APIs: string.Normalize, string.IsNormalized, IdnMapping.GetAscii, IdnMapping.GetUnicode. Normalize/IsNormalized were successfully replaced in "[browser][non-icu] HybridGlobalization normalization" (#85510).
    For the GetAscii/GetUnicode replacement, the Invariant implementation enhanced by a normalization step was used; see branch https://github.com/ilonatommy/runtime/tree/idn-mapping. The mapping still lacked detection of disallowed/ignored/mapped characters and would need access to the MappingTables of the current Unicode version, e.g. to detect incorrect inputs and throw. The mapping table for one Unicode version weighs ~900kB in plain text. Even if we compressed it, we would still need to maintain it with every Unicode version. The development time needed for a correct implementation is too high, and the chance of a real size reduction once the mapping tables are kept is too small, to justify removing normalization data from ICU.
  • Update icudt_wasm.dat and corresponding sharded datafiles.
  • Implement Punycode; this might use the InvariantGlobalization algorithm plus a normalization function.
  • Use normalization from the PoC branch.
  • Update documentation.
  1. Investigate implications of removing further data batches, e.g. check the effect of removing all collations, coll_ucadata, locales_tree etc.
  • Fix no exception thrown for CultureInfoAll.LcidTest, CultureInfoAll.GetCultureTest, CultureInfoConstructor.Ctor_String (we now support a wider range of locales, so we should no longer expect some of them to throw as they did with standard ICU)
  1. (optional) Enhancement of collations by manual workarounds:
  1. (optional) Consider failing the build when a HybridGlobalization function is not supported
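
The collation items above replace ICU collation with `Intl.Collator`. The following is a minimal sketch of that idea, not the code from the PoC branch. The function names, the two boolean flags, and the mapping from those flags to the collator's `sensitivity` option are assumptions made for illustration. The substring search is a deliberately naive fixed-width window, which also shows why matches that differ in UTF-16 length from the searched value (for example precomposed vs. decomposed characters) are missed.

```typescript
type Sensitivity = "base" | "accent" | "case" | "variant";

// Hypothetical mapping from two .NET-style compare flags to Intl.Collator sensitivity.
function toSensitivity(ignoreCase: boolean, ignoreNonSpace: boolean): Sensitivity {
    if (ignoreCase && ignoreNonSpace) return "base"; // a = A, a = á
    if (ignoreCase) return "accent";                 // a = A, a ≠ á
    if (ignoreNonSpace) return "case";               // a ≠ A, a = á
    return "variant";                                // a ≠ A, a ≠ á
}

// Culture-aware comparison; the result is normalized to -1, 0 or 1.
function compareStrings(
    culture: string | undefined,
    a: string,
    b: string,
    ignoreCase = false,
    ignoreNonSpace = false
): number {
    const collator = new Intl.Collator(culture, {
        sensitivity: toSensitivity(ignoreCase, ignoreNonSpace),
    });
    const result = collator.compare(a, b);
    return result < 0 ? -1 : result > 0 ? 1 : 0;
}

// Naive culture-aware IndexOf: slides a window of value.length UTF-16 code units
// over the source and compares each window with the collator.
// Returns the index of the first match or -1.
function indexOfCultureAware(
    culture: string | undefined,
    source: string,
    value: string,
    ignoreCase = false,
    ignoreNonSpace = false
): number {
    if (value.length === 0) return 0;
    const collator = new Intl.Collator(culture, {
        sensitivity: toSensitivity(ignoreCase, ignoreNonSpace),
    });
    for (let i = 0; i + value.length <= source.length; i++) {
        const window = source.substring(i, i + value.length);
        if (collator.compare(window, value) === 0) return i;
    }
    return -1;
}
```

A prefix or suffix check could be sketched the same way by comparing only the first or last window. Options without an obvious `Intl.Collator` counterpart, such as IgnoreKanaType and IgnoreWidth, would need separate handling and are not covered here.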

Tracking issues:

#101912
#102305
#102373
#95921
#95795
#95623

@ghost

ghost commented Dec 27, 2022

Tagging subscribers to this area: @dotnet/area-system-globalization
See info in area-owners.md if you want to be subscribed.

Issue Details

The task is to remove as much data from the ICU files as possible and to replace the ICU4C functions that use this data with platform-native functions, which for WASM means the Web API. Because we cannot get rid of the ICU data file completely (some functionality is not easily replaceable), we will keep loading icudt.dat in a reduced form. This mode will be called HybridGlobalization and will be switched off by default. Users can switch it on by setting the MSBuild property <HybridGlobalization> to true.
PoC branch is here: https://github.com/ilonatommy/runtime/tree/icu-platform-native.

  1. Removing collations/standard for WASM
  • Prepare icudt_wasm.dat and corresponding sharded datafiles without collations/standard, enable setting HybridGlobalization and write WBT checking if the new file got loaded instead of the old one
  • [ ]
  1. Removing normalization for WASM
  • Update icudt_wasm.dat and corresponding sharded datafiles
  • Implement Punycode, might be using this algorithm.
  • Use normalization from the PoC branch.

....

Author: ilonatommy
Assignees: ilonatommy, mkhamoyan
Labels:

area-System.Globalization

Milestone: 8.0.0

@ilonatommy ilonatommy added the arch-wasm WebAssembly architecture label Dec 27, 2022
@ghost

ghost commented Dec 27, 2022

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.


@SamMonoRT SamMonoRT added the runtime-mono specific to the Mono runtime label Jan 13, 2023
@ilonatommy ilonatommy modified the milestones: 8.0.0, 9.0.0 Aug 8, 2023
@ilonatommy
Member Author

Closing, the planned work for HybridGlobalization was completed.

@github-actions github-actions bot locked and limited conversation to collaborators Aug 17, 2024