Should the Segmenter types accept a locale? #3284

sffc · 2023-04-11T00:11:24Z

In the API review, @markusicu pointed out that ICU takes a locale in the segmenter, and the locale affects the behavior in certain cases, such as those in the data files below:

Data bundles that contain some language-specific data for sentence segmentation: https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr
fi_sv override for word break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/word_fi_sv.txt
el override for sentence break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/sent_el.txt

Why don't we support these in ICU4X Segmenter, and should we add them?

For 1.2 purposes, we have a few choices:

1. Add the locale parameter now and don't use it for anything yet
2. Don't add the locale parameter but add something like _invariant to the constructor names, so that in the future try_new_auto_invariant() creates the locale-invariant segmenter and try_new_auto(locale!("el")) creates the locale-specific segmenter
3. Keep things the way they are and add locale constructors later, possibly adopting the style above in 2.0
4. Add the parameter to Word and Sentence, but not Line or Grapheme
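
To make the naming idea in the second choice concrete, here is a minimal sketch. The `Locale` and `SentenceSegmenter` types below are hypothetical stand-ins written for this illustration, not the real ICU4X types, and the `String` error type is an assumption chosen for brevity.

```rust
// Hypothetical stand-ins; these are NOT the real ICU4X types or signatures.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Locale(pub String);

#[derive(Debug, PartialEq, Eq)]
pub struct SentenceSegmenter {
    /// `None` means the locale-invariant rules are in use.
    tailoring: Option<Locale>,
}

impl SentenceSegmenter {
    /// Available today: an explicitly named locale-invariant constructor.
    pub fn try_new_auto_invariant() -> Result<Self, String> {
        Ok(Self { tailoring: None })
    }

    /// Reserved for later: the unsuffixed name takes a locale, like other components.
    pub fn try_new_auto(locale: &Locale) -> Result<Self, String> {
        Ok(Self { tailoring: Some(locale.clone()) })
    }

    /// Reports which tailoring, if any, this segmenter was built with.
    pub fn tailoring(&self) -> Option<&Locale> {
        self.tailoring.as_ref()
    }
}
```

The point of the suffix is that the unsuffixed name stays free, so adding the locale-aware constructor later would not change the meaning of any existing call site.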

Thoughts?

@aethanyc @makotokato @Manishearth

Manishearth · 2023-04-11T00:31:19Z

General preference for #3

Do we plan to provide these locale-ish APIs in the near term? I actually think a future try_new_auto_with_locale() would be fine.

zbraniecki · 2023-04-11T04:20:41Z

General preference for #2

I'd prefer the default constructor names to be consistent in behavior as much as makes sense. If we believe Segmenter constructors will want to take a locale just like all others, let's keep the names for those constructors.
If, in the future, we decide to not add those names, we can always alias the default constructor names to the _invariant ones.

makotokato · 2023-04-11T11:42:19Z

Why hasn't this ICU4C rule been merged into, or proposed for, UAX #29? Does ICU4C plan to file an issue to get it merged into UAX #29? Once this change is merged into UAX #29, then #3.

Manishearth · 2023-04-11T15:09:14Z

UAX #29 in general doesn't really want to include locale-specific stuff because it wants to leave that up to CLDR.

sffc · 2023-04-11T15:44:12Z

Suffix suggestions:

_invariant
_root (wrong: we don't have CLDR root tailorings)
_uax (too restrictive: we want to add CLDR tailorings)
_untailored (wrong: LineBreakOptions has tailorings)
_default (could be restrictive with regard to default data)

macchiati · 2023-04-11T17:40:50Z

Since we will want to take locales as parameters (even for segmenters where that isn't implemented yet), IMO we should make that the "normal" case.

sffc · 2023-04-11T23:39:04Z

From discussion with @aethanyc @makotokato @Manishearth @nordzilla: It is an enhancement to consume CLDR root.xml tailorings, but not necessarily a bug. We would like to see it done in a timely fashion.

Conclusion: use _invariant

robertbastian · 2023-04-13T14:17:26Z

For the record we didn't use _invariant in #3294

hsivonen · 2023-08-31T16:13:57Z

> Data bundles that contain some language-specific data for sentence segmentation: https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr

These seem to be lists of abbreviations that contain a period that doesn't end a sentence. How bad would it be to merge the lists and use the merged lists across languages?

> fi_sv override for word break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/word_fi_sv.txt

It's a bit sad that treating letter, colon, letter as having a word break opportunity after the colon is a case of giving computer syntax needs precedence over natural-language needs. If accommodating computer syntaxes wasn't given priority, the Finnish/Swedish requirement of not treating letter, colon, letter as containing a word break opportunity could be hoisted to root.

> el override for sentence break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/sent_el.txt

This seems to be about ASCII semicolon having sentence-ending question mark semantics. Could this be accommodated in the root by triggering the rule on the most recent letter being from the Greek script?
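
The question about Greek can be made concrete with a toy sketch. This is not UAX #29 and not how ICU or ICU4X implement sentence breaking: it only marks terminator characters, and the Greek test is a rough code point range check (an assumption made to keep the example dependency-free) rather than the Unicode Script property.

```rust
/// Rough Greek test: Greek and Coptic block plus Greek Extended.
/// A real implementation would consult the Unicode Script property instead.
fn is_greek_letter(c: char) -> bool {
    c.is_alphabetic() && matches!(c as u32, 0x0370..=0x03FF | 0x1F00..=0x1FFF)
}

/// Returns the byte offset just after each character treated as a sentence terminator.
/// ';' counts only when the most recent letter seen was Greek.
pub fn sentence_ends(text: &str) -> Vec<usize> {
    let mut last_letter_is_greek = false;
    let mut ends = Vec::new();
    for (i, c) in text.char_indices() {
        if c.is_alphabetic() {
            last_letter_is_greek = is_greek_letter(c);
        }
        let terminator =
            matches!(c, '.' | '!' | '?') || (c == ';' && last_letter_is_greek);
        if terminator {
            ends.push(i + c.len_utf8());
        }
    }
    ends
}
```

With this rule, the semicolon in "Τι; Ναι." is treated as a terminator while the one in "a; b." is not, without any locale being supplied.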

robertbastian · 2024-01-24T10:20:41Z

> It's a bit sad that treating letter, colon, letter as having a word break opportunity after the colon is a case of giving computer syntax needs precedence over natural-language needs. If accommodating computer syntaxes wasn't given priority, the Finnish/Swedish requirement of not treating letter, colon, letter as containing a word break opportunity could be hoisted to root.

I think German needs this tailoring as well. I don't know why Finnish and Swedish do, but in German a colon is commonly used to form gender-neutral nouns, like Lehrer:in, which should not contain any word breaks.

What's the current process for updating the tailorings? ICU or CLDR?

Manishearth · 2024-01-24T18:40:40Z

@robertbastian this was recently discussed in the CLDR design meeting, CLDR issue: https://unicode-org.atlassian.net/browse/CLDR-15910 / PAG issue: https://github.com/unicode-org/properties/issues/187 (internal)

There's thought that this should actually be made to apply to all languages, since colons without spaces on either side are not really a thing in regular text anyway, and if the space has been removed there's a good chance it's on purpose.

srl295 · 2024-01-24T21:13:19Z

> Data bundles that contain some language-specific data for sentence segmentation: https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr

> These seem to be lists of abbreviations that contain a period that doesn't end a sentence. How bad would it be to merge the lists and use the merged lists across languages?

That's exactly what they are. https://www.unicode.org/reports/tr35/tr35-general.html#Segmentation_Exceptions for more details.

The lists for one language may not be applicable for others. You could probably calculate a list that's likely to be generally useful, though it might be less useful for any particular language.
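
As a rough illustration of merged exception lists, the sketch below suppresses a sentence break after any whitespace-delimited token that appears in an abbreviation set. The abbreviation entries are illustrative examples chosen for this sketch, not CLDR data, and the function names are hypothetical.

```rust
use std::collections::HashSet;

/// Builds one exception set from several per-language lists.
pub fn merge_lists<'a>(lists: &[&[&'a str]]) -> HashSet<&'a str> {
    lists.iter().flat_map(|list| list.iter().copied()).collect()
}

/// Returns byte offsets of whitespace that follows a token ending in '.',
/// unless that token is a listed exception. End of text is not reported.
pub fn sentence_breaks(text: &str, exceptions: &HashSet<&str>) -> Vec<usize> {
    let mut breaks = Vec::new();
    let mut word_start = 0;
    for (i, c) in text.char_indices() {
        if c.is_whitespace() {
            let word = &text[word_start..i];
            if word.ends_with('.') && !exceptions.contains(word) {
                breaks.push(i);
            }
            word_start = i + c.len_utf8();
        }
    }
    breaks
}
```

The trade-off described above shows up directly: every entry in the merged set suppresses a break for text in every language, including languages where that token can legitimately end a sentence.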

hsivonen · 2024-01-25T08:07:51Z

> I think German needs this tailoring as well.

I think https://unicode-org.atlassian.net/browse/CLDR-15910 should be reverted on the root level so that we don't need tailorings to accommodate natural languages.

> I don't know why Finnish and Swedish do

For Finnish, the use case is marking where a sufficiently unusual word body (e.g. acronym) ends and the case suffix starts. For example, English Henri’s would be Henrin in Finnish but English ICU4X’s would be ICU4X:n in Finnish. The use case for Swedish also seems to be about applying suffixes (though not case suffixes) to sufficiently unusual word bodies. (Consider an analogue of English Londoner, but with the suffix applied to e.g. a sports team acronym.)
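
The effect of the colon rule on the examples in this thread can be shown with a toy tokenizer. This is only a sketch under simplifying assumptions: it treats words as runs of alphanumeric characters and is not the UAX #29 word break algorithm. The `join_colon` flag is a hypothetical switch standing in for the fi/sv-style behavior.

```rust
/// Splits `text` into runs of alphanumeric characters. When `join_colon` is true,
/// a ':' with a letter on both sides stays inside the word.
pub fn toy_words(text: &str, join_colon: bool) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut words = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        // letter, colon, letter: the only context in which the colon is kept
        let mid_colon = join_colon
            && c == ':'
            && i > 0
            && i + 1 < chars.len()
            && chars[i - 1].is_alphabetic()
            && chars[i + 1].is_alphabetic();
        if c.is_alphanumeric() || mid_colon {
            current.push(c);
        } else if !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

With `join_colon` set, ICU4X:n and Lehrer:in each stay a single word, while a colon followed by a space (as in "a: b") still separates words.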

sffc · 2024-01-26T00:37:44Z

Okay, for 2.0 purposes, which of the four segmenters requires a locale parameter?

Grapheme: Are there locale-specific CLDR tailorings for graphemes?
Word: It sounds like people want to move the fi_sv tailorings to the root, which would obviate the need for RBBI tailoring. However, locale info could still help with complex language segmentation, although we need to know the language of the text, not of the user.
Sentence: Seems like this is the biggest use case, although it is still about the language of the text and not of the user.
Line: I think this one is invariant.

The "language of the text" would be more appropriate to provide in the terminal segment function since it is an attribute of the text, but since that requires data loading, it might be more appropriate to specify it in the constructor. Alternatively, we could stuff all sentence language tailorings into a single data key which is always loaded when making a sentence segmenter, as we do for the word break segmenter.

srl295 · 2024-01-26T04:15:25Z

> Grapheme: Are there locale-specific CLDR tailorings for graphemes?

Not yet. Please put it into the API. I have been planning a work item to move this forward. This is for example languages that want to keep "ch" together etc.

hsivonen · 2024-01-26T13:06:04Z

> Grapheme

> Please put it into the API.

On the flip side, putting this in the API really requires making ECMA-402 have a way to explicitly ask for root and to default to root.

Some users getting a different definition of extended grapheme clusters based on the browser UI locale would likely be bad, after developers having assumed for years that extended grapheme clusters are a Unicode-level concept and not a locale-level concept. Also, it would be bad to have to assume that English is always going to be the untailored language and to teach every developer to ask for a grapheme segmenter for English in order to get behavior on a similar level of stability that one would expect of e.g. Swift strings.

> This is for example languages that want to keep "ch" together etc.

What languages do you mean and why do they want to keep "ch" together for the kind of purposes that extended grapheme clusters are used for, such as denying the selection of only "c" or only "h"? Czech treats "ch" as a collation unit, but do users of the language expect not to be able to select "c" and "h" individually?

markusicu · 2024-03-12T17:29:38Z

> CLDR has decided that they're putting the : thing back into root.

Yes, but the Apple rep in the meeting, who is originally from Sweden, insisted that CLDR keep the fi/sv word break tailorings, because he thinks that even the future keyword selecting technical usage should keep fi/sv words together across colon.

> No language parameter for grapheme cluster segmenter

+1

> Language parameter for the other three segmenters

+1

> Plan to put the parameter on the .segment() function

That seems weird both from looking at usage and thinking about data loading.

I strongly expect that someone should be able to get a Segmenter object and just use it to find/iterate over segments without knowing about additional options.
I expect tailorings to require some different data that should be loaded in the "constructor". Depending on the implementation, there may be a totally different blob or a small delta, but probably generally some non-zero tailoring-specific data.
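
The two expectations above can be sketched as follows. Everything here is hypothetical: `Provider`, `Rules`, and the single boolean rule are stand-ins invented for the illustration, and a real implementation would fall back to root data rather than fail for an unknown language.

```rust
use std::collections::HashMap;

/// Hypothetical rule payload; real segmenter data is far richer.
#[derive(Debug, Clone, PartialEq)]
pub struct Rules {
    pub semicolon_ends_sentence: bool,
}

/// Hypothetical data source keyed by language tag ("und" stands for root).
pub struct Provider {
    data: HashMap<&'static str, Rules>,
}

impl Provider {
    pub fn sample() -> Self {
        let mut data = HashMap::new();
        data.insert("und", Rules { semicolon_ends_sentence: false });
        data.insert("el", Rules { semicolon_ends_sentence: true });
        Self { data }
    }
}

pub struct SentenceSegmenter {
    rules: Rules,
}

impl SentenceSegmenter {
    /// All fallible, tailoring-specific data loading happens in the constructor.
    pub fn try_new(provider: &Provider, content_language: Option<&str>) -> Result<Self, String> {
        let key = content_language.unwrap_or("und");
        provider
            .data
            .get(key)
            .cloned()
            .map(|rules| Self { rules })
            .ok_or_else(|| format!("no segmentation data for {}", key))
    }

    /// Segmentation itself takes only the text: no options, no data loading.
    pub fn segment(&self, text: &str) -> Vec<usize> {
        text.char_indices()
            .filter(|&(_, c)| {
                matches!(c, '.' | '!' | '?') || (c == ';' && self.rules.semicolon_ends_sentence)
            })
            .map(|(i, c)| i + c.len_utf8())
            .collect()
    }
}
```

A caller who never thinks about tailorings passes `None` once and then uses `segment` like any other iterator source, which is the usage pattern argued for above.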

sffc · 2024-04-01T22:58:31Z

The conclusions from the discussion of this issue with the CLDR design group:

Grapheme clusters should not be language-specific; baked into much low-level processing (e.g., Swift, font mappings) which we don’t want to be language-specific
Content locale/text language parameter (not UI locale): Potential for accuracy; make it optional, name it well
Ok to leave the locale on the constructor; benefit: more specific data loading even for existing dictionaries & models

My suggested path forward for this issue, then, is to add an options bag to the WordSegmenter, LineSegmenter, and SentenceSegmenter constructors with an optional content_locale field of type &LanguageIdentifier.

sffc · 2024-04-01T23:00:46Z

I'm moving this back into 1.5 because the constructor can be drafted and bikeshed ahead of time, and then in 2.0 we can do the minimal change of making the new constructor the default one.

srl295 · 2024-04-02T02:41:10Z

> Grapheme clusters should not be language-specific; baked into much low-level processing (e.g., Swift, font mappings) which we don’t want to be language-specific

This makes no sense and contradicts the long-standing requests (https://unicode-org.atlassian.net/browse/CLDR-2992, which I am working on scheduling). I would have joined, but did not realize this was coming up today.

Perusing the notes, it's not clear that the previous requirements and recent discussion from the segmentation summary last year were included here.

sffc · 2024-05-16T23:41:54Z

Based on additional discussion in the email thread, I would like to move forward with the recommendation in #3284 (comment), with the additional understanding that we may add support for locale-based grapheme segmentation in the future if CLDR adds data for this, but it might take the form of another (fifth) segmenter type.

Concretely:

All segmenters retain a new or try_new function without an options bag
Word, Sentence, and Line segmenters get a try_new_with_options function that includes a content_locale option
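
A minimal sketch of that constructor shape, assuming hypothetical stand-in types: `LanguageIdentifier` here is a plain wrapper rather than the real type, `SegmenterOptions` is an invented name for the options bag, and the `String` error type is a simplification.

```rust
/// Hypothetical stand-in for a language identifier type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LanguageIdentifier(pub String);

/// Options bag; the only field is optional, so `Default` means "no content locale".
#[derive(Debug, Clone, Copy, Default)]
pub struct SegmenterOptions<'a> {
    pub content_locale: Option<&'a LanguageIdentifier>,
}

/// Grapheme segmentation stays locale-independent: no options bag at all.
pub struct GraphemeSegmenter;

impl GraphemeSegmenter {
    pub fn new() -> Self {
        Self
    }
}

pub struct WordSegmenter {
    content_locale: Option<LanguageIdentifier>,
}

impl WordSegmenter {
    /// The existing option-free constructor is kept and simply delegates.
    pub fn try_new() -> Result<Self, String> {
        Self::try_new_with_options(SegmenterOptions::default())
    }

    /// The new constructor is where a content locale can be supplied.
    pub fn try_new_with_options(options: SegmenterOptions<'_>) -> Result<Self, String> {
        Ok(Self { content_locale: options.content_locale.cloned() })
    }

    pub fn content_locale(&self) -> Option<&LanguageIdentifier> {
        self.content_locale.as_ref()
    }
}
```

Because the field is an `Option` and the struct implements `Default`, more options could be added to the bag later without breaking callers that construct it with `..Default::default()`.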

makotokato · 2024-06-27T01:47:05Z

Looking at the ICU4C brkitr rule files for word and sentence, the UAX #29 property assignments differ per locale, but the rules themselves seem to be the same. So if we modify datagen (with a few changes to the TOML data file), we can generate rule data per locale.

Add LocaleData parameter for word/sentence segmenter

This is a part of #3284. ICU4C has some language-specific break rules for the word and sentence segmenters, so this fix adds some per-locale rules to ICU4X. This adds a LocaleData argument to all constructors. Also, because the locale differences are small and limited to two pieces of data, I add an override table data marker for the machine state property.
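
The "same rules, different properties" observation suggests a data layout like the one sketched below: one shared property lookup plus a small per-locale override table. This is an illustration only; the enum, the table type, and the specific property values assigned to ':' are assumptions made for the sketch, not the data ICU4X actually ships.

```rust
use std::collections::HashMap;

/// Tiny subset of word break property values, for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WordBreak {
    ALetter,
    MidLetter,
    Other,
}

/// Shared (root) property lookup; illustrative values only.
fn root_property(c: char) -> WordBreak {
    if c.is_alphabetic() {
        WordBreak::ALetter
    } else {
        WordBreak::Other
    }
}

/// Per-locale delta applied on top of the shared lookup.
pub struct PropertyTable {
    overrides: HashMap<char, WordBreak>,
}

impl PropertyTable {
    pub fn for_language(lang: Option<&str>) -> Self {
        let mut overrides = HashMap::new();
        if matches!(lang, Some("fi") | Some("sv")) {
            // Assumed override: treat ':' as a mid-word character for fi/sv.
            overrides.insert(':', WordBreak::MidLetter);
        }
        Self { overrides }
    }

    /// The override wins when present; otherwise the shared lookup applies.
    pub fn get(&self, c: char) -> WordBreak {
        self.overrides
            .get(&c)
            .copied()
            .unwrap_or_else(|| root_property(c))
    }
}
```

Since the rule state machine is shared, only the small override map varies per locale, which matches the description of the locale difference as small.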

makotokato · 2024-09-11T05:35:18Z

If we support the auto-phrase line-break property, we should add a locale parameter to the line segmenter instead of the ja_zh flag.

sffc · 2024-09-18T00:08:38Z

Currently, we have the optional Content Locale on WordSegmenter and SentenceSegmenter.

@makotokato will create a pull request to replace ja_zh with content_locale in LineSegmenterOptions.
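
One way to picture that replacement: the old boolean becomes something derived from the content language. The struct and field below are a hypothetical simplification (a bare language subtag string instead of a full locale type), not the actual LineSegmenterOptions definition.

```rust
/// Hypothetical, simplified options type for a line segmenter.
#[derive(Debug, Clone, Default)]
pub struct LineOptions {
    /// Language subtag of the content, e.g. "ja"; `None` means unspecified.
    pub content_language: Option<String>,
}

impl LineOptions {
    /// Derives what a `ja_zh`-style boolean expressed from the content language.
    pub fn is_ja_zh(&self) -> bool {
        matches!(self.content_language.as_deref(), Some("ja") | Some("zh"))
    }
}
```

Callers then state a fact about the text (its language) and leave it to the segmenter to decide which behaviors that implies.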

Once that PR lands, we can close this issue.

sffc added discuss Discuss at a future ICU4X-SC meeting C-segmentation Component: Segmentation labels Apr 11, 2023

peng1999 mentioned this issue May 4, 2023

Language-aware line breakpoints typst/typst#1009

Closed

1 task

sffc removed the discuss Discuss at a future ICU4X-SC meeting label May 11, 2023

sffc added this to the 1.x Priority ⟨P2⟩ milestone May 11, 2023

sffc assigned aethanyc May 11, 2023

sffc added T-core Type: Required functionality S-large Size: A few weeks (larger feature, major refactoring) labels May 11, 2023

peng1999 mentioned this issue May 30, 2023

Use icu4x for linebreaking algorithm typst/typst#1355

Merged

robertbastian modified the milestones: 1.x Priority ⟨P2⟩, ICU4X 2.0 Aug 30, 2023

hsivonen mentioned this issue Sep 18, 2023

Consider a pure ECMA262 approach tc39/proposal-stable-formatting#12

Closed

YDX-2147483647 mentioned this issue Jan 16, 2024

Chinese punctuation is placed at the beginning of the line in some cases typst/typst#3082

Closed

1 task

sffc added the needs-approval One or more stakeholders need to approve proposal label Jan 26, 2024

sffc unassigned hsivonen, Manishearth and eggrobin Mar 14, 2024

sffc modified the milestones: ICU4X 2.0, 1.5 Blocking ⟨P1⟩ Apr 1, 2024

sffc modified the milestones: 1.5 Blocking ⟨P1⟩, ICU4X 2.0 May 23, 2024

sffc mentioned this issue Jun 1, 2024

Consider relaxing locale resolution for Intl.Segmenter tc39/ecma402#895

Open

peng1999 mentioned this issue Jul 6, 2024

Allow displaying datetime in any locale typst/typst#4485

Closed

makotokato mentioned this issue Jul 10, 2024

Add LocaleData parameter for word/sentence segmenter #5214

Closed

makotokato mentioned this issue Jul 30, 2024

Add LocaleData parameter for word/sentence segmenter #5318

Merged

sffc assigned makotokato Sep 17, 2024

makotokato mentioned this issue Sep 20, 2024

Add content_locale member to LineBreakOptions #5565

Merged

sffc closed this as completed in #5565 Sep 20, 2024

sffc closed this as completed in d704ef7 Sep 20, 2024
