Intl.breakIterator #60
Comments
/cc @jungshik, @littledan
use cases:
@littledan will champion this one.
I think we'd probably want a somewhat different API for this compared to what V8 currently ships, if it's not too late backwards compatibility-wise. The current API looks like this (more docs here: https://code.google.com/p/v8-i18n/wiki/BreakIterator):
I think a more ES2015-y way to do it would be to have a method

A possible downside is that this could have worse performance (for the object allocation, and also for accounting for the case where multiple strings are being iterated over by the same instance at the same time), but I don't think this proposal would introduce further implications for a high-performance implementation compared to lots of other ES2015 features. It would also mean making a brand new iterator in place of

What do you all think of this general API shape?

The first step towards this will be unshipping Intl.v8BreakIterator in V8, as the standardized version will likely be incompatible. Current usage is low, but nonzero, so we'll see how this goes. If there are a lot of complaints, then maybe I'll want to argue for sticking to the current API; or maybe the complaining users would be happy to hear that if they are OK with the new API, then they'll get the support in more browsers.

I don't think I'll be able to write up a proposal for the March TC39 meeting, unfortunately.
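To make the two API shapes concrete, here is a minimal sketch; it is not the snippet from the original comment. It assumes the cursor-style surface that Intl.v8BreakIterator exposes (`adoptText`, `first`, `next`, `breakType`) and wraps it in a generator. `iterateBreaks` and `SpaceBreaker` are hypothetical names, and `SpaceBreaker` is only a stand-in that breaks after every space so the example runs in engines without the V8 extension.

```js
// Hypothetical adapter: exposes a cursor-style break iterator as an ES2015 iterable.
// `breaker` is assumed to provide adoptText(), first(), next() and breakType(),
// with next() returning -1 once the end of the text has been passed.
function* iterateBreaks(breaker, text) {
  breaker.adoptText(text);
  let start = breaker.first();
  for (let end = breaker.next(); end !== -1; end = breaker.next()) {
    // One object is allocated per segment; this is the allocation cost under discussion.
    yield { segment: text.slice(start, end), index: start, breakType: breaker.breakType() };
    start = end;
  }
}

// Illustration-only stand-in with the same cursor shape: breaks after every space.
class SpaceBreaker {
  adoptText(text) { this.text = text; this.pos = 0; }
  first() { this.pos = 0; return 0; }
  next() {
    if (this.pos >= this.text.length) return -1;
    const space = this.text.indexOf(' ', this.pos);
    this.pos = space === -1 ? this.text.length : space + 1;
    return this.pos;
  }
  breakType() { return 'none'; }
}

const parts = [...iterateBreaks(new SpaceBreaker(), 'a bc d')].map((p) => p.segment);
console.log(parts);
```

Because the adapter calls `adoptText` on a shared instance, starting a second iteration on the same breaker before the first one finishes would overwrite its state. That is the "multiple strings iterated by the same instance" case a real design would have to account for.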
I ended up deciding against unshipping v8BreakIterator in V8 when I unshipped several other nonstandard features (which all had much lower usage counts).
I wrote up a quick explainer doc explaining the motivation and a strawman API shape. It seems reasonable to me for this to include both line breaking and grapheme/word/sentence segmentation. Maybe hyphenation could go into the same API, just with a different

Does anyone have any thoughts? I'm interested in hearing from both web developers and implementers.
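For reference, the API that later shipped as `Intl.Segmenter` covers grapheme, word, and sentence granularity, but not line breaking or hyphenation. The sketch below uses that shipped API rather than the strawman from the explainer. `segmentText` is a hypothetical helper, and it returns `null` where the engine lacks `Intl.Segmenter`.

```js
// Hypothetical helper around the shipped Intl.Segmenter API.
// granularity is one of 'grapheme', 'word' or 'sentence'.
function segmentText(text, granularity, locale = 'en') {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return null; // engine without Intl.Segmenter
  }
  const segmenter = new Intl.Segmenter(locale, { granularity });
  // segment() returns an iterable of { segment, index, input, isWordLike }.
  // isWordLike is only populated for 'word' granularity.
  return Array.from(segmenter.segment(text), ({ segment, index, isWordLike }) => ({
    segment,
    index,
    isWordLike,
  }));
}

const words = segmentText('Hello, world!', 'word');
if (words !== null) {
  console.log(words.filter((w) => w.isWordLike).map((w) => w.segment));
}
```

The segments always concatenate back to the input, so callers can recover break offsets from `index` without a separate cursor.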
The proposed API in https://github.com/littledan/BreakIterator#example looks great! I’m in favor of overloading the
I'm very worried about the performance of this API, because the use of this API over native methods is going to be performance critical enough anyway. Additionally, anyone compiling native layout code to asm.js or wasm is going to want the lowest-level access possible.

I've seen nothing to indicate that iterables and the allocations they require can be optimized away in existing engines. Can you even iterate over a significant document without causing multiple young generation GCs? I'd like to see something to suggest that perf concerns are unfounded before moving forward with the alternative design. Otherwise I fear we'll have to use a polyfill anyway.

EDIT: I suppose supporting both would be an OK tradeoff if iterables aren't fast enough yet, similar to how other iterable APIs have alternative iteration APIs.

The hyphenation API should be different. Unlike line breaks, it is often possible to find a hyphenation point in the middle of a string without iterating through all of the possible ones. Using the iterator API would be very inefficient. The way you do text-layout hyphenation is by first measuring the unhyphenated word, and only then finding the closest point to hyphenate if it is too long, which will give you a single direct value.

IMO we can just look at what browsers already do rather than trying to be clever. They're designed that way for a reason.
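The measure-first approach to hyphenation described above can be sketched as follows. This is an illustration under assumptions, not what any browser does: `hyphenateToFit` is a hypothetical name, `points` is assumed to be a sorted array of candidate hyphenation offsets, and `measure` is assumed to be a caller-supplied function returning the rendered width of a string.

```js
// Returns word.length if the whole word fits, the largest hyphenation offset
// whose prefix plus a trailing hyphen fits within maxWidth, or -1 if none fits.
function hyphenateToFit(word, points, maxWidth, measure) {
  // Step 1: measure the unhyphenated word; most words need no hyphenation.
  if (measure(word) <= maxWidth) return word.length;

  // Step 2: binary search the sorted candidates instead of walking all of them.
  // This relies on prefix width growing with prefix length.
  let lo = 0;
  let hi = points.length - 1;
  let best = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (measure(word.slice(0, points[mid]) + '-') <= maxWidth) {
      best = points[mid];
      lo = mid + 1; // fits, try a later break
    } else {
      hi = mid - 1; // too wide, try an earlier break
    }
  }
  return best;
}

// Toy width function for illustration: one unit per UTF-16 code unit.
const width = (s) => s.length;
console.log(hyphenateToFit('hyphenation', [2, 6, 7], 7, width));
```

The result is a single offset rather than a sequence of breaks, which is why an iterator-shaped API is an awkward fit for this use case.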
I'd rather not include 'hyphen' in the proposed API. In addition to what @sebmarkbage wrote, hyphenation can change the input in some languages (e.g. German).
@sebmarkbage To the performance concern: What if
I updated the explainer with the low-level segmentation interface, though I won't be surprised if we get pushback for this. I assume it's OK to do an allocation when adopting a different piece of text to perform segmentation over, right?
Short pieces of text are likely to be combined into a single string rope often. I'm not as concerned about those allocations since new pieces of text are often associated with allocations anyway. The allocations are proportional. E.g. you might have
As far as I know, the only case in German to which this applied was hyphenation between c and k, which turned the c into another k. E.g. 'Zucker' became 'Zuk-ker'. This rule no longer applies since the orthography reform of 1996. So, at least in German there is no such issue anymore, though I have no idea if other languages still have similar rules.

Sebastian
@SebastianZ there are a few other cases mentioned here http://www.unicode.org/L2/L2002/02279-muller.htm#4, for example in Swedish "tuggummi" becomes "tugg-gummi". I think it is fairly rare to handle these special cases correctly, but it'd be good for the API to handle it.
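One way an API could represent hyphenation that changes the input is to return a small edit rather than a bare offset. The sketch below is purely illustrative: `applyHyphenation` and the `{ start, end, before, after }` shape are hypothetical and not part of any proposal discussed here.

```js
// Hypothetical hyphenation point: replaces word.slice(start, end) at the break.
//   before: text that ends the first line (normally just '-')
//   after:  text that begins the second line (normally '')
function applyHyphenation(word, point) {
  const firstLine = word.slice(0, point.start) + point.before;
  const secondLine = point.after + word.slice(point.end);
  return [firstLine, secondLine];
}

// Swedish: the break restores a consonant, so "tuggummi" becomes "tugg-" + "gummi".
console.log(applyHyphenation('tuggummi', { start: 4, end: 4, before: '-', after: 'g' }));

// Pre-1996 German: the "c" turns into "k", so "Zucker" becomes "Zuk-" + "ker".
console.log(applyHyphenation('Zucker', { start: 2, end: 3, before: 'k-', after: '' }));
```

An ordinary break is the degenerate case where `start === end`, `before` is `'-'`, and `after` is empty, so the same shape would cover both the common and the special cases.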
We also need to support 'strictness' (for lack of a better term), either as a separate option or as values of 'type'. CSS3 has 'strict', 'normal', 'loose' (and 'auto') for line-break, and ICU/CLDR support them. (When v8BreakIterator was written, there was no such distinction.)
I filed a new issue for strictness at tc39/proposal-intl-segmenter#5. Let's migrate all additional discussion of feature requests related to segmentation to that repository.
Since
I think it should be closed when #553 is merged. I'll add it as a linked issue.
Standardize Intl.v8BreakIterator.
Backpointers:
Update 1 (Sept 26th, 2016):