Intl.breakIterator #60

caridy · 2015-12-15T22:43:56Z

Standardize Intl.v8BreakIterator.

Backpointers:

Update 1 (Sept 26th, 2016):

Proposal from @littledan: https://github.com/littledan/BreakIterator/blob/master/README.md


jungshik · 2016-01-25T22:16:46Z

/cc @jungshik, @littledan

srl295 · 2016-01-26T23:40:40Z

use cases:
* rendering (Canvas etc… Thai…)
* console rendering (word wrap!)
* counting words/lines/sentences
* Translation tooling…

caridy · 2016-02-29T21:11:32Z

@littledan will champion this one.

littledan · 2016-03-09T00:48:59Z

I think we'd probably want a somewhat different API for this compared to what V8 currently ships, if it's not too late backwards compatibility-wise. The current API looks like this (more docs here: https://code.google.com/p/v8-i18n/wiki/BreakIterator):

* Create a new break iterator with var instance = new Intl.v8BreakIterator(options)
* Set the text to iterate over with instance.adoptText()
* Get information about the current iteration with instance.current() (for the index) and instance.breakType() (for a string representing the type, probably a CLDR thing)
* Go to the next place with instance.next(), which returns the new current index.
* Start at the beginning of the string again with instance.first().
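The call sequence above can be sketched as a small helper. This is an illustration rather than V8 documentation: collectBreaks is a hypothetical name, and the loop assumes next() returns -1 once the text is exhausted (the ICU convention), which the thread itself does not state.

```javascript
// Hypothetical helper: walks any object exposing the five-method shape
// described above (adoptText, first, next, current, breakType).
// Assumption: next() returns -1 when there are no more breaks.
function collectBreaks(iterator, text) {
  const breaks = [];
  iterator.adoptText(text);
  iterator.first(); // rewind to the start of the adopted text
  while (iterator.next() !== -1) {
    // current() and breakType() describe the break we just moved to.
    breaks.push({ index: iterator.current(), breakType: iterator.breakType() });
  }
  return breaks;
}
```

In a V8-based runtime that still ships the nonstandard class, a call along the lines of collectBreaks(new Intl.v8BreakIterator(["en"], { type: "word" }), "Hello world") would be the expected way to use it. Note how the iterator holds mutable state, so one instance cannot walk two strings at once.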

I think a more ES2015-y way to do it would be to have a method instance.breakText("my string") on the instance which returns an iterator over the breaks in the string. Each item would be an object like {index: 1, breakType: "letter"}. To put the cherry on top, we should probably make the sole method breakText not be a bound function, unlike the current five-method API of bound functions, if this is the general strategy for new APIs.
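A minimal sketch of that shape, assuming a toy break rule: ToyBreaker is a hypothetical class, and the whitespace-transition rule below only stands in for real locale-aware segmentation. The "letter" label comes from the example above, while "none" is an assumed label for whitespace runs.

```javascript
// Toy stand-in for the proposed shape: NOT a real segmentation algorithm.
// It reports a break wherever a run of whitespace or non-whitespace ends.
class ToyBreaker {
  // Returns a fresh iterator per call, so one instance can walk
  // several strings at the same time without shared state.
  *breakText(text) {
    for (let i = 1; i <= text.length; i++) {
      const prevIsSpace = /\s/.test(text[i - 1]);
      const atEnd = i === text.length;
      if (atEnd || prevIsSpace !== /\s/.test(text[i])) {
        // breakType describes the run that just ended.
        yield { index: i, breakType: prevIsSpace ? "none" : "letter" };
      }
    }
  }
}

const breaker = new ToyBreaker();
for (const { index, breakType } of breaker.breakText("my string")) {
  console.log(index, breakType);
}
```

Each yielded item is a newly allocated object, which is the allocation cost weighed in the discussion that follows.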

A possible downside is that this could have worse performance (for the object allocation, and also for accounting for the case where multiple strings are being iterated over by the same instance at the same time), but I don't think this proposal would introduce further implications for a high-performance implementation compared to lots of other ES2015 features. It would also mean making a brand new iterator in place of first--would this be very bad performance-wise?

What do you all think of this general API shape?

The first step towards this will be unshipping Intl.v8BreakIterator in V8, as the standardized version will likely be incompatible. Current usage is low, but nonzero, so we'll see how this goes. If there are a lot of complaints, then maybe I'll want to argue for sticking to the current API; or maybe the complaining users would be happy to hear that if they are OK with the new API, then they'll get the support in more browsers.

I don't think I'll be able to write up a proposal for the March TC39 meeting unfortunately.

littledan · 2016-05-22T14:15:38Z

I ended up deciding against unshipping v8BreakIterator in V8 when I unshipped several other nonstandard features (which all had much lower usage counts).

littledan · 2016-08-26T06:26:43Z

I wrote up a quick explainer doc explaining the motivation and a strawman API shape. It seems reasonable to me for this to include both line breaking and grapheme/word/sentence segmentation. Maybe hyphenation could go into the same API, just with a different type "hyphen" rather than an entirely different class (as I imagine the API would be similar).

Does anyone have any thoughts? I'm interested in both web developers and implementers.

mathiasbynens · 2016-08-26T06:32:55Z

The proposed API in https://github.com/littledan/BreakIterator#example looks great! I’m in favor of overloading the type to include 'hyphen' provided the API can remain similar.

sebmarkbage · 2016-08-26T06:45:34Z

I'm very worried about the performance of this API because the use of this API over native methods is going to be performance critical enough anyway. Additionally, anyone compiling native layout code to asm.js or wasm is going to want the lowest level possible access to that. I've seen nothing to indicate that iterables and the allocations they require can be optimized away in existing engines. Can you even iterate over a significant document without causing multiple young generation GCs? I'd like to see something to suggest that perf concerns are unfounded before moving forward with the alternative design. Otherwise I fear we'll have to use a polyfill anyway.

EDIT: I suppose supporting both would be an ok tradeoff if iterables aren't fast enough yet. Similar to how other iterable APIs have alternative iteration APIs.

The hyphenation API should be different. Unlike line breaks it is often possible to find a hyphenation point in the middle of a string without iterating through all of the possible ones. Using the iterator API would be very inefficient.

The way you do text-layout hyphenation is by first measuring the unhyphenated word, and only then finding the closest point to hyphenate if it is too long, which will give you a single direct value.
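The measure-first flow described here might look roughly like the sketch below. Every name in it is hypothetical: hyphenateToFit, the measure callback, and especially findPointAtOrBefore, which stands in for whatever direct lookup an engine would offer. The toy table and fixed-width measure exist only to make the sketch runnable.

```javascript
// Hypothetical layout step: only consult the hyphenator when the word
// does not fit, and then ask for a single point near the overflow.
function hyphenateToFit(word, measure, maxWidth, findPointAtOrBefore) {
  if (measure(word) <= maxWidth) return null; // fits, no hyphen needed
  // Longest prefix that still fits once a trailing hyphen is added.
  let limit = word.length - 1;
  while (limit > 0 && measure(word.slice(0, limit) + "-") > maxWidth) limit--;
  if (limit === 0) return null; // not even one character fits
  return findPointAtOrBefore(word, limit); // one direct query, no iteration
}

// Toy stand-ins (assumptions, not real data sources):
const toyPoints = { hyphenation: [2, 6, 7] }; // hy-phen-a-tion
function toyFindPointAtOrBefore(word, limit) {
  const fitting = (toyPoints[word] || []).filter((p) => p <= limit);
  return fitting.length ? fitting[fitting.length - 1] : null;
}
const toyMeasure = (s) => s.length * 10; // pretend every glyph is 10 units

console.log(hyphenateToFit("hyphenation", toyMeasure, 75, toyFindPointAtOrBefore));
```

The point of the shape is that the caller never enumerates every opportunity in the word, which is what an iterator-only API would force.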

IMO we can just look at what browsers already do rather than trying to be clever. They're designed that way for a reason.

jungshik · 2016-08-26T06:54:33Z

I'd rather not include 'hyphen' in the proposed API.

In addition to what @sebmarkbage wrote, hyphenation can change the input in some languages (e.g. German).

littledan · 2016-08-26T06:59:07Z

@sebmarkbage To the performance concern: What if %SegmentIterator% had an additional "low-level API" with three methods, advance (to imperatively move to the next match, returning undefined; the user could tell if they are at the end by observing the index getting too high, or maybe this could return true at the end), index and breakType (to get properties of the current breakpoint)?
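A rough sketch of what such a cursor could look like, under stated assumptions: ToySegmentCursor is a hypothetical class, the whitespace rule is a stand-in for real segmentation, and advance() is assumed to return true when there is no further break to move to (one reading of "return true at the end").

```javascript
// Toy cursor with the three proposed members: advance, index, breakType.
function isSpace(ch) {
  return ch === " " || ch === "\t" || ch === "\n";
}

class ToySegmentCursor {
  constructor(text) {
    this.text = text;
    this.index = 0;          // position of the current break
    this.breakType = "none"; // type of the current break (assumed labels)
  }

  // Moves to the next break in place; no object is allocated per step.
  // Returns true if there was no further break to move to.
  advance() {
    const text = this.text;
    if (this.index >= text.length) return true;
    let i = this.index + 1;
    const runIsSpace = isSpace(text[i - 1]);
    while (i < text.length && isSpace(text[i]) === runIsSpace) i++;
    this.index = i;
    this.breakType = runIsSpace ? "none" : "letter";
    return false;
  }
}

const cursor = new ToySegmentCursor("my string");
while (!cursor.advance()) {
  console.log(cursor.index, cursor.breakType);
}
```

The only allocation is the cursor itself, made once per piece of text, which matches the trade-off discussed in the replies below.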

littledan · 2016-08-26T07:07:22Z

I updated the explainer with the low-level segmentation interface, though I won't be surprised if we get pushback for this. I assume it's OK to do an allocation when adopting a different piece of text to perform segmentation over, right?

sebmarkbage · 2016-08-26T07:15:46Z

Short pieces of text are likely to be combined into a single string rope often.

I'm not as concerned about those allocations since new pieces of text are often associated with allocations anyway. The allocations are proportional. E.g. you might have a lot of small segments and iterate through them independently but the number of allocations is proportional to the allocations you do for the data structures holding them anyway.

SebastianZ · 2016-08-26T07:54:45Z

> In addition to what @sebmarkbage wrote, hyphenation can change the input in some languages (e.g. German).

As far as I know, the only case in German to which this applied was hyphenation between c and k, which turned the c into another k. E.g. 'Zucker' became 'Zuk-ker'. This rule no longer applies since the orthography reform of 1996.

So, at least in German there is no such issue anymore, though I have no idea if other languages still have similar rules.

Sebastian

sebmarkbage · 2016-08-26T08:31:54Z

@SebastianZ there are a few other cases mentioned here http://www.unicode.org/L2/L2002/02279-muller.htm#4 for example in Swedish "tuggummi" becomes "tugg-gummi".

I think it is fairly rare to handle these special cases correctly but it'd be good for the API to handle it.
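One way an API could carry these spelling changes is to return replacement text with each hyphenation opportunity instead of a bare index. The record shape below (start, end, preBreak, postBreak) and the applyHyphenation helper are assumptions made for illustration, not something proposed in this thread.

```javascript
// Hypothetical record for one hyphenation opportunity:
//   start, end : range of the original word that is replaced at the break
//   preBreak   : text that ends the first line (includes the hyphen)
//   postBreak  : text that starts the second line
function applyHyphenation(word, point) {
  return [
    word.slice(0, point.start) + point.preBreak,
    point.postBreak + word.slice(point.end),
  ];
}

// Swedish example from the thread: "tuggummi" breaks as "tugg-" + "gummi".
console.log(applyHyphenation("tuggummi", { start: 4, end: 4, preBreak: "-", postBreak: "g" }));

// Pre-1996 German example from the thread: "Zucker" breaks as "Zuk-" + "ker".
console.log(applyHyphenation("Zucker", { start: 2, end: 3, preBreak: "k-", postBreak: "" }));
```

An ordinary hyphenation point is the special case where start equals end, preBreak is "-", and postBreak is empty.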

jungshik · 2016-10-14T18:32:47Z

We also need to support 'strictness' (for lack of a better term) either as a separate option or as values of 'type'.

CSS3 has 'strict', 'normal', 'loose' (and 'auto') for line-break, and ICU/CLDR support them. (When v8BreakIterator was written, there was no such distinction.)

littledan · 2016-10-18T15:03:32Z

I filed a new issue for strictness at tc39/proposal-intl-segmenter#5 . Let's migrate all additional discussion of feature requests related to segmentation to that repository.

ryzokuken · 2021-05-06T15:45:05Z

Since Intl.Segmenter is almost done, can we close this?

sffc · 2021-05-08T20:49:39Z

> Since Intl.Segmenter is almost done, can we close this?

I think it should be closed when #553 is merged. I'll add it as a linked issue.

caridy added the enhancement label Dec 15, 2015

srl295 mentioned this issue Jan 25, 2016

Segmentation / Break Iteration #66

Closed

caridy added this to the 4th Edition milestone Feb 29, 2016

sebmarkbage mentioned this issue Jul 11, 2016

Hyphenation API #93

Open

littledan mentioned this issue Oct 18, 2016

Support strictness tc39/proposal-intl-segmenter#5

Closed

sffc added the "s: in progress" (Status: the issue has an active proposal) and "c: text" (Component: case mapping, collation, properties) labels and removed the enhancement label Mar 19, 2019

sffc assigned gibson042 Jun 5, 2020

sffc added the Proposal (Larger change requiring a proposal) label Jun 5, 2020

sffc removed this from the 4th Edition milestone Jun 5, 2020

sffc linked a pull request May 8, 2021 that will close this issue

Normative: Add Intl.Segmenter #553

Merged


ryzokuken closed this as completed in #553 Jan 24, 2022
