Custom Dictionaries #133
ECMA-402 specifically leaves the exact algorithm up to the implementor. Nothing says that engines must use a dictionary-based segmenter. I've heard concerns over the long life of this proposal that there's fear of ICU becoming the de facto standard if that's what V8 chooses to use, but the proposal has reached Stage 3 despite those concerns. Like all of ECMA-402, Intl.Segmenter is "best effort", and the behavior could change over time or between implementations. If a client wants to provide their own dictionary, that's a lot of data, and then they may as well ship their own segmentation engine code. The purpose of Intl.Segmenter is to offer a lightweight solution.
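For reference, the lightweight usage in question looks roughly like this (a minimal sketch; the word boundaries an engine produces are "best effort" and implementation-defined, so the output may differ between engines):

```js
// Minimal sketch: word segmentation with Intl.Segmenter. The boundaries an
// engine produces are "best effort", so the logged segments may differ
// between implementations.
const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
for (const { segment, isWordLike } of segmenter.segment("我想食飯")) {
  console.log(segment, isWordLike);
}
```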
Worth mentioning, I am starting from the position of (personally) being entirely unopposed to standardizing "just proxy to ICU" in 402. I feel that is an honest description of the state of the world and would bring significant clarity to the effort. My opposition on this particular issue is not at all colored by ICU. My concern here is that I feel the consequences of this API structurally disadvantage particular languages, including one that I speak (Cantonese). There are a few implicit statements in your response that I want to make explicit; please correct me if you feel I have mischaracterized your statements (so that I may respond to your argument, not a straw man):
These are all reasonable. My responses:
Restating my concern, I worry that stopping the standardization process for this proposal at its current level of abstraction ("here is something that has a clear mapping to a downstream API we know most—or even all—of you will use to implement this") codifies into 402 (and subsequently 262) structural advantages for more-dominant languages without the escape valves that exist in those downstream APIs to support other languages. In summary, my objection here is not a technical issue but an equity issue. I do not want to codify inequity into our spec. (I do not believe that any of the other 402 APIs suffer this same problem.)
On behalf of Mozilla, I would be opposed to such an approach. I believe it would be harmful to ECMA-402's health and purpose.
@zbraniecki I am comfortable with opinions opposed to my position regarding approximate standardization of ICU behavior. I don't intend to litigate that here, and absolutely recognize the merits of designing an approach that is independent from ICU. My opinion on 402's relationship to ICU is a pragmatic one, not an idealist one. From an idealism perspective, the opinion that you represent for Mozilla matches a shared ideal I also hold. Achieving that goal is harder work, and I am supportive of that effort.

My opinions regarding ICU are merely incidental to my underlying concern: the proposed standardization of an API that leaves structural disadvantages in place for some languages. We can't ignore the practical implications of an API that will clearly delegate to ICU's BreakIterator. Given the back and forth about increasing package size on this topic already, shipping additional bundled dictionaries also seems like a non-starter, which is why I am approaching this as adding custom dictionaries.
I share your concern, but I don't think that linking ECMA-402 to ICU is a solution.
My opinion also carries a pragmatic quality.
I believe there is a promising avenue toward ML-driven segmentation models, which may provide better quality at a lower bundle-size cost.
I agree that a machine learning approach is likely a good fit. That's approximately what's filed in #134. I should have a working approach for Cantonese segmentation using tensorflow.js soon. (https://github.com/cantonese/segmenter)
The Intl.Segmenter constructor takes a locale argument, which implementations should use to tailor segmentation. I therefore don't see how inequity is being codified into this API. For example, the first argument to Intl.Segmenter should differentiate between "yue" and "zh".
I am concerned that you're not taking into account how implementations will actually implement this and the consequences of that. I do not believe that we get to wash our hands of that. ICU's cjdict, for example, simply does not cover Cantonese.
That is but one example and I can identify many more (but I hope I won't have to in order to convince you). If we standardize this API as it stands, many implementations will:
This isn't a problem with the Intl.Segmenter spec, and in ECMA-402 it's not unique to Intl.Segmenter, either. Browsers choose which set of locales to ship, which means that some languages have better support than others. i18n advocates, myself included, are in the midst of a multi-year effort to get locale coverage in browsers to scale, which is one of the motivations behind projects such as ICU4X. To get better Cantonese segmentation support, the solution is to advocate directly to browser vendors.
I disagree with the characterization of ICU's option to use a custom dictionary as an "escape valve". It's a low-level constructor for power users to leverage the BreakIterator code without the ICU built-in data. Even if we were to add a way to override the ICU dictionary in Intl.Segmenter (an idea that I oppose for reasons Zibi stated earlier), most web sites still won't use it. Power users who really want to override Cantonese segmentation can just ship their own Cantonese segmentation code. However, given enough time (and tireless effort from i18n advocates), this won't be necessary, because browsers will be shipping high-quality segmentation engines for all languages by default.
@nathanhammond there's nothing specific to segmentation in your critique. It applies to all i18n APIs, and I believe that a solution in the form of "allow overrides" is a massively complicated and misguided one for the problem at hand.
The
Why should 402 not expose mechanisms for tailoring segmentation for particular locales or environments? Reduced to a challenge, as a developer, how do you propose that I handle segmentation for Cantonese or other languages? Four paths I can identify:
Reviewing:
As a JavaScript developer, my best option becomes to manually implement support, since the API is not reliable for enough languages. In that scenario, shipping this specification provides extremely limited value to me as a developer: it specifies an API shape to target, but that could be aped from ICU anyway.
As to social considerations, there are also a number. Shipping this API as specified will lead many developers who are unfamiliar with languages to believe that they have a valid segmenter. The API over-promises and (for any foreseeable future) under-delivers. A developer will have no means to know that they're completely failing to segment Cantonese.

I believe that there is a serious equity concern that should be addressed during the consideration of i18n APIs. I do not believe that this one meets the bar. That this may have been unconsidered in previous standardization efforts does not absolve this API from needing to consider it. It can be argued that this API doesn't write inequity into it, but the actual outcomes from standardizing this API will instead simply demonstrate the existing structural disadvantages for minority languages. Standardization without having a clear path to support any arbitrary language communicates to the community of speakers around the world that their language is less deserving of support.

And finally, concretely, the
I proposed a concrete solution (custom dictionaries) to a known limitation that exists in ICU's dictionary-based BreakIterator. Even anyone that uses some future ML-model approach will also need to be able to pass information in (see #134). This goes back to my pragmatism vs. idealism opinions. I believe that we should specify an API for now, not for the future. Or, we can delay the specification until the future arrives.
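To make the straw man concrete, a purely hypothetical sketch: no dictionary or word-list option exists in the Intl.Segmenter proposal or in any implementation today, and the option name below is invented. It only illustrates the kind of tailoring being requested.

```js
// HYPOTHETICAL: the "words" option does not exist in the Intl.Segmenter
// proposal or in any engine; unknown options are simply ignored today.
// This sketch only shows the shape a custom-dictionary tailoring could take.
const cantoneseWords = ["我哋", "唔好", "食嘢"]; // invented, tiny word list
const segmenter = new Intl.Segmenter("yue", {
  granularity: "word",
  words: cantoneseWords, // hypothetical tailoring option
});
```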
First of all, thank you for your advocacy for i18n equity. I think it's a problem that's too often overlooked and doesn't receive the attention it deserves on an organizational level. I want to make clear that I'm on your side, and Zibi is as well: we both want to see the Web platform support more locales and close the gap between majority and minority languages. I see now that you're saying Intl.Segmenter is different from other Intl APIs in that it structurally disadvantages Cantonese and other Chinese-script languages that aren't Mandarin, which is an i18n equity problem of unique importance because of the historical, political, and cultural context. This is a good point that I will raise at the next TC39-TG2 meeting. To answer your specific queries and assertions:
All Intl APIs, including Intl.Segmenter, have the following escape hatch: calling Intl.Segmenter.supportedLocalesOf(["yue", "zh"]) returns ["zh"]. The expected behavior is that if the developer wants to support "yue", they can detect from that result that the implementation does not, and load a polyfill or a user-land segmenter instead.
I do think there is value in "trying to find a spec method for supporting Cantonese". I'm saying that we already have it: the locale argument (to hint the implementation) and supportedLocalesOf (to enable polyfillability).
Intl doesn't throw exceptions for unsupported locales in large part because we see ourselves as "best-effort i18n" and want to allow implementations to differ in their breadth of support. It's already the case that different browsers support different sets of locales. The developer should pass ["yue", "zh"] and let the implementation choose the best available match.
The most likely future is one with "language packs", where we can scale the browser to support hundreds of locales. (The exact delivery mechanism for those language packs is a continuing discussion that hasn't been resolved yet.) I've also been an advocate for introducing async APIs into Intl (see tc39/ecma402#210) to allow implementations and polyfills to download new locale data on demand.
As a JavaScript developer, you should use Intl.Segmenter when it supports the current user's locale, and download an Intl.Segmenter polyfill when it doesn't.
You're correct on this point. It's a problem of Intl.Segmenter, and it's also a problem shared by all of Intl. It's the cost-benefit tradeoff of the principle of "best-effort i18n" discussed above.
I hope that the "foreseeable future" turns out to be a fairly short period of time, like several quarters as opposed to several years.
The astute developer can use supportedLocalesOf to detect this case.
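A minimal sketch of that detect-and-polyfill pattern; the polyfill module path and the PolyfillSegmenter name are hypothetical placeholders for whatever user-land segmenter a developer actually ships.

```js
// Sketch: use Intl.Segmenter when the implementation claims support for the
// locale, otherwise fall back to a user-land segmenter. The polyfill module
// and class name below are hypothetical.
async function getWordSegmenter(locale) {
  const supported = Intl.Segmenter?.supportedLocalesOf([locale]) ?? [];
  if (supported.length > 0) {
    return new Intl.Segmenter(locale, { granularity: "word" });
  }
  const { PolyfillSegmenter } = await import("./segmenter-polyfill.js"); // hypothetical
  return new PolyfillSegmenter(locale, { granularity: "word" });
}
```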
I strongly agree that i18n equity needs more focus and attention. Intl makes it easy for developers to add i18n support to their web page instead of having the web page be English-only. So, in effect, without Intl, we have "English dominance", and with Intl, we have "Tier 1 language dominance". I argue that's a step in the right direction. I want to see minority languages be just as well supported as Tier 1 languages, and there are many people in this space who share this desire (attend the Internationalization and Unicode Conference to meet some of them). I just firmly see this as a problem on the implementation side, not on the spec side. As a side note, I'm personally inspired that you applied the word "equity" to this situation. "i18n equity" is a much more pointed and timely term than other terms used to describe this problem space, such as "long-tail language support" or "next billion users". I intend to start using "i18n equity" when advocating for solutions to this problem with others in my organization.
Point taken.
Does
Can you elaborate? Note that UTS 35 already defines some Unicode extension subtags that allow for tailoring the segmentation engine, such as "dx", "lb", "lw", and "ss", and more such subtags can be added in the future.
I will raise this point with TC39-TG2.
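As a concrete illustration of the UTS 35 subtags mentioned above, a sketch only: whether an engine honors these keywords through Intl.Segmenter is implementation-dependent, and today they may simply be ignored.

```js
// Passing UTS 35 segmentation-related extension keywords ("lb", "lw", "ss")
// through the requested locale. Engines are free to ignore them for
// Intl.Segmenter; this only shows how such a request is expressed.
const tailored = "zh-Hant-HK-u-lb-strict-lw-phrase-ss-standard";
const segmenter = new Intl.Segmenter(tailored, { granularity: "sentence" });
console.log(segmenter.resolvedOptions().locale); // extensions may be dropped
```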
Informational response, with a further response later, to demonstrate why I consider this impossible to get right with just language tags.
Concretely, these are both valid written Cantonese, meaning "We don't want to eat.":
ICU And then it gets harder:

Character Set
Language Family

"Yue", or "粵語", isn't really a language so much as it is a language family of which Cantonese is the most-well-known member. "广东话" (simplified; "廣東話" traditional) is translated as "Cantonese", though a direct literal translation means "spoken language of the Canton region." An approximate equivalent as an explanation for "Yue" might be to call English a "Germanic" language. "Yue" as a classifier would also include many historical languages and other still-in-use languages under its umbrella, such as Taishanese (台山話). Taishanese is not only in wide use in Taishan; it is heard on the streets of Chinatown in San Francisco and throughout the world because of emigration patterns.

Divergence

Cantonese as spoken in Guangdong is easily differentiable from Cantonese as spoken in Hong Kong due to English influence and both past and present colonial history. The lines drawn on a map have divided the language's development into distinct paths on opposite sides of the border. Macau adds Portuguese into the mix, again for colonial-history reasons.

Historic Encodings

Some digital Chinese uses known-incorrect but homophonic characters because of previous inequity in available character encodings. Addition of characters to Unicode necessarily occurs after a new character has come into use. The latency between the creation of a character and its insertion into Unicode character tables (outside of private use areas of individual fonts) requires this workaround. The larger that latency, the more content is produced with these workarounds—sometimes entirely subsuming the "true" character. This can easily mask the true intended word for any segmenter, even though the word itself may appear to be nonsense.

Censorship Circumvention

Many Chinese characters are composed of multiple components. Those components can be exploded into a series of individual characters which, when read by a person who understands the "code," will carry a separate meaning. A non-tailored segmenter shouldn't be expected to understand this, but it should be possible to create a segmenter which can identify many of these occasions (as demonstrated by censors' success in detecting this method). Use of homophones is also a pattern used for censorship circumvention.

More

Because that isn't enough:
So, in order to properly tailor segmentation, at least some portion of this needs to be accounted for. Many of these things can be specified in a language tag, but eventually the degree of specificity required in the language tag becomes extreme, verging on impossible.
I fully believe that segmentation for character-based languages requires the equivalent of a focused Hunspell-like project, of the kind I've heard was required in order to support spellchecking for Hungarian. It is currently doctoral-thesis-level work for linguists. A dictionary approach is a decent first approximation, and is why a tailored dictionary is what I proposed here.

Until I'm done helping my Cantonese instructor with his doctoral thesis (perhaps a Hunspell-like project for Cantonese), Jieba (which also uses a dictionary approach for all but unknown words) will likely be the best option.

All that to say: I do not believe that we can expect to be successful in providing an out-of-the-box solution without providing a tailoring API, whether provision of a dictionary or some other methodology. That the state of the art for written character-based languages also uses a dictionary means that we shouldn't assume that we will be able to improve on that without evidence in hand in advance.
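For readers unfamiliar with the dictionary approach being discussed, here is a deliberately naive greedy longest-match sketch. It is purely illustrative: real engines such as ICU's dictionary break engine or Jieba layer frequency models and unknown-word handling on top, and the tiny word list below is invented for the example.

```js
// Toy greedy longest-match segmentation over a word list. Illustrates the
// dictionary approach only; it is not how ICU or Jieba work in detail.
function segmentWithDictionary(text, words, maxWordLength = 4) {
  const dict = new Set(words);
  const result = [];
  let i = 0;
  while (i < text.length) {
    let match = text[i]; // fall back to a single character
    for (let len = Math.min(maxWordLength, text.length - i); len > 1; len--) {
      const candidate = text.slice(i, i + len);
      if (dict.has(candidate)) {
        match = candidate;
        break;
      }
    }
    result.push(match);
    i += match.length;
  }
  return result;
}

// Example with a tiny, invented Cantonese word list:
console.log(segmentWithDictionary("我哋唔想食飯", ["我哋", "唔想", "食飯"]));
// → ["我哋", "唔想", "食飯"]
```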
There is a lot of discussion in this thread and, to be honest, I have not read all of the details. I just want to point out a few things:
You can do BOTH 1) and 2) now.
I agree with Shane's comments. The model is to provide the service for the best match to what is available, AND provide a way for developers to query if what they get is what they expect. So any developer can check whether a service is available for yue, or en-AU, etc. OT: I also agree that 'i18n equity' needs more focus and attention, although I'm not sure if that is the best term. There are a large number of languages (~7,000) with a long tail. And I don't see people ever spending the same amount of work to support, say, https://en.wikipedia.org/wiki/Pima_Bajo_language, as to support a sizable language like 'yue'. An achievable goal for the 'digitally disadvantaged languages' would be to enable at least display, input, and locale selection on major platforms.
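A sketch of that "query what you actually got" pattern: supportedLocalesOf for an up-front check, resolvedOptions().locale for the post-construction answer. The fallback behavior for an unsupported locale varies by implementation.

```js
// Ask up front which of the requested locales the implementation claims to
// support…
console.log(Intl.Segmenter.supportedLocalesOf(["yue", "en-AU"]));
// …and, after construction, which locale was actually resolved.
const seg = new Intl.Segmenter("yue", { granularity: "word" });
console.log(seg.resolvedOptions().locale); // a fallback locale if "yue" is unsupported
```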
My understanding is that |
I think the term equity could be understood in many different ways, and that's why it's a slippery term. Your 'proportional i18n equity' is clearer (though not as pithy).
There is nothing in the ECMA-402 mandate that makes this impossible, and nothing that guarantees it won't happen in the future. Nor is there anything, based on the text of ECMA-402, that prevents ICU, anyone using ICU, or anyone not using ICU from adding such support to the ICU (or another) dictionary.
BTW, the reason ICU's dictionary doesn't handle Cantonese is very simple: nobody has tried to add it, and there has been no strong reason to attempt it yet. Many dialects in China are in a very similar condition to Cantonese, for example Wu (Shanghainese) and Min Nan (Taiwanese); none of them have enough written text online to serve as training material. From my point of view, none of them needs a "tailoring" approach; they could simply be addressed by appending to the cj dictionary in ICU if someone bothered to support them. Tailoring is only needed if there were a conflict between two different dialects, but so far the issue is not a conflict between them, just a lack of support.
Please, y'all, do take the time to review when you have time. I've spent the time to try and explain my concerns for the committee's consideration. I have a full-time job that is not in tech at this point and will respond as I have opportunities. For example, this response will not address all comments since my last, just @FrankYFTang's first one.
I agree that this is fine, but it is tangential to my concerns; my concern is that we don't have a path for convincing implementations to include Cantonese, or a path for convincing ICU to include Cantonese, or a planned strategy for "language packs" or other just-in-time loading, or, stated most generally, "how does this API actually support other languages?" There are already numerous complaints about the size of the existing data increase just for the current dictionaries (approaching 4 megabytes), and any further extension of ICU to support Taiwanese, Shanghainese, Cantonese, or any other language which should be possible to address within the bounds of the existing cjdict would only make that size problem worse.
I would argue that taking this approach makes this not an API, but instead an
I agree that this specification, as proposed, is not bound to a dictionary approach. But I don't believe that we can ignore how implementations will implement it and the limitations that would impose on consumers of the API. Given the constraints, we can say pretty concretely that V8 would (at least in V1) provide no method to support Cantonese (as a concrete example). Further, it sure seems like a lot of people in this thread are looking at alternative implementations that may provide better results across a large number of languages. Given that all of those are open research projects at this point it might be prudent to wait for results from those first so that we're not specifying the next
My approach has been very focused on attempting to solve for problems that I can identify. I started with a very explicit proposal (custom dictionaries) but am excited to explore any proposed solution that would elegantly support minority languages. At this point I'm not looking to make broad sweeping statements on what we should or should not specify as that unnecessarily constrains our available solution space.
Worth noting, the last meaningful update to cjdict happened in 2012: the removal of CC-CEDICT data. That happened very shortly after the original implementation landed in August 2012. Also worth noting, the initial effort to support character-based languages appears to have begun in August 2002 and landed ten years later, in August 2012. This informs my opinions about designing an API for now versus delaying until research projects are complete. It also demonstrates the historic inequity that even Mandarin has faced: China had to mandate GB18030 character set support in order to sell software into China—in 2001.
I tried to explain why I don't believe that scales here.
Cantonese is somewhere around the 15th-20th most-spoken language in the world. I'm fully capable of building out a complete set of Cantonese support and at least attempting to land the code in every single possible place. But even if I were to do that, I have no guarantees that any 402 implementation is going to ship the data I need to make that work, or the code to make it happen. And that is for a top-20 language. Literally the first comment on the Firefox implementation of this proposal: "increases icudt by 3.57MB which makes it kind of unlikely to be approved by release drivers". Not promising for being able to get my implementations released. (At some point the answer is "no," even if the answer is "yes" for Cantonese. What is our design when we have to tell the language ranked 5002 by usage, "no, you're not going to be included by default"?)
Since I may have accidentally helped coin a term, my intent behind using the word "equity": I'm very much not saying that we must have everything addressed for every language, but more that if a language has an individual champion who is willing to take it upon themselves (or as a part of a team, or in a government-sponsored effort), the equitable ideal would be that it is possible to provide support equivalent in quality to that which English (as technology's lingua franca) receives. It should not require market-barrier-enforced mandates to achieve equity.

And sure, we will have blind spots; we will get things wrong. But when somebody points out the places where we have failed to consider something and thus failed to meet that bar, we can work with them to figure out how to address those shortcomings.

That each additional language supported comes with a cost to non-users is something I'm well aware of. We should definitely be paying attention to design in that space. (ICU4X has a solid premise.)
not as in "written form", right? For example, there are only 107,213 articles in the Cantonese Wikipedia. BTW, what do you think the developer will use this for?
The payload problem is a major point of discussion in ECMA-402; it's how we arrived at the revised Stage 2 and Stage 3 entrance requirements (see our presentation on this subject from last month here). Intl.Segmenter has gotten this far because browser vendors agreed that the improvement to the web ecosystem was worth the payload. In other words, the cost of adding this feature has already been taken into account. Browser implementors, like Frank and Zibi, are fully aware of the tradeoffs. I'm also confident in them to adopt the community's recommendations on how to improve Cantonese segmentation and pull them into their respective browser engines when they become available.

As far as BCP-47 expressiveness, Unicode locale extensions are constantly evolving. New Unicode extension keywords are frequently added to UTS 35. Segmentation and collation are two key use cases for UTS 35, so if there is something that UTS 35 doesn't support, I'm confident that a proposal for that addition would be taken seriously.

To revisit the original suggestion, which was to add an option to tailor the dictionary: I understand the mental model behind this suggestion, but I haven't heard an answer to the question of why adding such a tailoring option is better than the status quo of checking supportedLocalesOf and loading a user-land segmenter when the needed locale isn't supported.
The public websites where I suspect you'll find the most written Cantonese are these two; they're both effectively Reddit clones:

However, the primary use for this is very clearly for private audiences that won't show up in web metrics like page counts:
Those three environments combine for a tremendous portion of web use by time—maybe even a majority. Every single one of those would be improved by having the ability to accurately segment more languages, Cantonese included. (Facebook literally implemented custom cursor behavior for theirs—it sure would be nice to have reasonable word jumps when authoring.)

Some use cases, enumerated: cursor navigation, spell check, grammar check, teaching tools (one of the big reasons I care), better line breaking (it's currently awful), improved voice-to-text (a naive version could get much farther), and more. Let me know if I haven't listed enough; I can come back with a longer list to make sure everybody is satisfied—I mean that seriously, not sarcastically or passive-aggressively; I'll build the inventory of use cases.

Further, simply making some of the pain go away also goes a long way toward increasing the use of a language. But I also wonder why I have to define these needs for Cantonese when the existence of a use case for any other majority language is assumed. This again demonstrates the bias in favor of more-dominant languages that I've been pointing out in this thread.

Beyond that, 402 explicitly didn't constrain itself to Web or even JS environments, so by focusing exclusively on those we are already putting on blinders.

(Aside: @FrankYFTang I am guessing that you can probably read Chinese characters? 簡體字定係繁體字啊? ("Simplified or traditional characters?") And yes, I chose to ask in Cantonese intentionally. 😜)
Let me also put some additional color into this thread, since today especially I'm emotionally exhausted. I live in Hong Kong. Today, 47 people who participated in an election where I worked at a polling station were jailed for having a dissenting opinion. On election day (July 11, 2020) I shook the hands of or had conversations with many of these people. Now all of them are in jail for the indefinite future. The headline, from the Washington Post: "With new mass detentions, every prominent Hong Kong activist is either in jail or exile."

Or, more personally, after a little personal protest that I alone staged, these are three of the security guards who paid me a visit in the classroom where I study Cantonese: (Check out the background of the second picture to see teaching materials pinned to the wall.)

And yet, I'm just an immigrant here, privileged by my skin color, passport country, and status in Hong Kong such that I'm somewhat insulated from the worst consequences that many of my friends and family will suffer. When you're small and feel helpless against a giant machine, all you want to do is find some way you might be able to improve the situation, some place where you think you can maybe effect change. So I'm here, writing in this thread. I'm writing code that can be used to support usage of Cantonese: https://github.com/cantonese. And I'm trying to inventory the technical things that I could do, or participate in, here: https://cantonese.github.io.

Now it's 2am and I need to sleep. But even tonight I had to do something to be able to feel not-so-helpless. Tomorrow morning I return to my Cantonese coursework at 9:30am. In a month I'll have a degree, but that's not why I'm studying. At about that same time I'll have a daughter, and will begin teaching her Cantonese so that she isn't cut off from the world she comes from. And you can be damn sure that any barrier I can remove to her accessing the culture of her family, friends, and history is something that I will fight to remove. That too is another reason I am here.

So, when I say that I feel like this API is wanting, these are the things that are on my mind. When I say that this API is inequitable, this is why I care. Y'all are approaching this from a technical perspective, which I can respect, but we write code for humans, in service of humans. From my perspective, this API can only serve some humans, and then everybody else is at the mercy of how much the end-developer cares (or knows). The counterproposal repeatedly offered in this thread is to "load a polyfill"—which doesn't use this API so much as replace it, and offloads the problem to the end-developer. This makes it self-evident to me that the API doesn't solve the problem, and that this API needs additional consideration. We need to find a solution to this problem as deep into the priority of constituencies as possible, in order to make it available to as many people as possible. And that's what I want us to address.
My proposal in this particular thread isn't one I'm particularly in love with, but it was concrete enough to serve as a straw man. Don't ask me to defend it too strongly; I'm not willing to do so. But it serves to demonstrate the inequity. @sffc I will reply to your note later.
Discussion from the 2021-03-11 TC39-TG2 meeting: https://github.com/tc39/ecma402/blob/master/meetings/notes-2021-03-11.md#custom-dictionaries-133

Procedurally: seeing that Intl.Segmenter is approaching Stage 4, and is already shipped in Chrome and Safari, we should move discussions like this one to the main ECMA-402 project to serve as a basis for a future proposal. I would be happy to entertain concrete proposals from @nathanhammond on this subject.
HK urged to consider simplified Chinese and Mandarin

Beijing's Ministry of Education on Wednesday suggested Hong Kong clarify the status of simplified Chinese and Mandarin in law, and for students here to learn Mandarin under a system in which the language is incorporated into the local exam system.
Nathan, feel free to comment in https://unicode-org.atlassian.net/browse/ICU-21571. If you have the legal right to contribute a list of Cantonese words, I will work with you to make a prototype for that.
ICU's BreakIterator has clear limitations in its approach for character-based languages without textual word boundaries. When used directly, it allows you to specify a dictionary to work around limitations in its approach, but the Intl.Segmenter API does not expose that functionality. I worry that standardizing on ICU's BreakIterator approach—without providing the escape valve ICU provides of specifying a custom dictionary—encodes bias into ECMA-402.

For example, ICU's supplied segmentation dictionaries conflate a significant number of languages with distinct usage. This can be particularly fragile across locales. Taiwan, Hong Kong, Singapore, and China all have distinct usage.
(Not to mention languages less-used than the dominant language in those locales.)
I believe that we need to expose custom dictionaries in order to ship this.