Provide a way to do caseless comparison #99
Comments
Specifically, what if we had |
I wrote up a quick explainer for toCaseFold. Any thoughts? Would this group be interested in seeing case folding pushed forward, or is it an edge case that's not so relevant? cc @sebmarkbage @jungshik @ericf @jswalden @caridy @zbraniecki |
Apologies, this is unnecessary; you can already lower the sensitivity used for comparison. For example, `"foo".localeCompare("FOO", "en", {sensitivity: "accent"})` returns 0. This bug can be closed. |
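For readers who want to try this, here is a minimal sketch of the same idea using `Intl.Collator`, which takes the same `sensitivity` option as `localeCompare`. The sample strings are only illustrative.

```javascript
// Reusable collator that ignores case differences but still honors accents.
const collator = new Intl.Collator("en", { sensitivity: "accent" });

// Case-only difference: treated as equal at this sensitivity.
const sameIgnoringCase = collator.compare("foo", "FOO") === 0;

// Accent difference: still treated as different at this sensitivity.
const sameIgnoringAccent = collator.compare("resume", "résumé") === 0;

console.log(sameIgnoringCase);   // true
console.log(sameIgnoringAccent); // false
```

Building the collator once and reusing `collator.compare` tends to be cheaper than passing options to `localeCompare` on every call, for example inside a sort.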
Fascinating. Do you know how those sensitivity options match to Unicode's comparison algorithms? The spec doesn't seem super clear, but maybe I am not reading it fully... There's still value in toCaseFold for map keys and such though. But I guess it probably loses priority. |
It's based on an earlier part of the collation key, described in UTS 10. I think the next thing to do from here if we want to improve performance for a case like that would be to expose an API to get the collation key of a string, probably as a Uint8Array, though that would be rather inconvenient as a Map key. |
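To make the Map-key use case concrete, here is a rough sketch of a map that normalizes its keys before storing them. Both `CaselessMap` and `foldKey` are hypothetical names, and `foldKey` uses `toLowerCase()` purely as a placeholder: it is not Unicode case folding, and neither a case-folding function nor a collation-key API exists in the language today.

```javascript
// Hypothetical stand-in for a real case-folding function.
// toLowerCase() is NOT Unicode case folding; it only shows the shape of the idea.
function foldKey(s) {
  return s.toLowerCase();
}

// Minimal Map wrapper that stores entries under the normalized key.
class CaselessMap {
  #map = new Map();

  set(key, value) {
    this.#map.set(foldKey(key), value);
    return this;
  }

  get(key) {
    return this.#map.get(foldKey(key));
  }

  has(key) {
    return this.#map.has(foldKey(key));
  }
}

const headers = new CaselessMap();
headers.set("Content-Type", "text/html");
console.log(headers.get("content-type")); // "text/html"
```

A string key works directly with `Map` because strings are compared by value; a `Uint8Array` collation key would be compared by identity, which is the inconvenience mentioned above.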
@domenic See http://unicode.org/reports/tr10/#Multi_Level_Comparison. 'base', 'accent' and 'variant' correspond to level 1 (primary strength), level 2 (secondary strength) and level 3 (tertiary strength) in UTS 10. My memory is fuzzy as to what exactly 'case' is for (there was a lot of back-and-forth on this issue when the version 1.0 spec was being worked on). http://www.ecma-international.org/ecma-402/3.0/#collator-objects has the following:
I believe that setting the sensitivity to 'case' will turn on 'case level' in UTS 10. OK, I read the v8 implementation (which was done while the 1.0 spec was being written). Setting the sensitivity to 'case' will use level 1 in UTS 10 (primary strength) AND turn on 'case level'. That is level 1.5 in a sense (primary differences and case differences are taken into account, but accents are ignored). As a result, 'case level' cannot be set independently of 'collation level'. That's mostly OK, except that 'level 2.5' (a level between level 2 and level 3) cannot be created. In level 2.5, level 3 differences other than case (regular kana vs. small kana, "A" vs. "Ⓐ") would be ignored. Maybe the spec needs a (non-normative) note explaining what setting 'sensitivity' means. |
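The four `sensitivity` values can be compared side by side with a small sketch like the one below. The helper name `equalAt` and the sample characters are illustrative only; the expected results follow the examples given in the ECMA-402 description of `sensitivity`.

```javascript
// Returns true when x and y compare as equal at the given sensitivity.
function equalAt(sensitivity, x, y) {
  return new Intl.Collator("en", { sensitivity }).compare(x, y) === 0;
}

const results = {};
for (const sensitivity of ["base", "accent", "case", "variant"]) {
  results[sensitivity] = {
    caseOnly: equalAt(sensitivity, "a", "A"),        // differs only by case
    accentOnly: equalAt(sensitivity, "a", "\u00E1"), // "a" vs "á", differs only by accent
  };
}

console.log(results);
// base:    caseOnly true,  accentOnly true
// accent:  caseOnly true,  accentOnly false
// case:    caseOnly false, accentOnly true
// variant: caseOnly false, accentOnly false
```

The "case" row is the "level 1.5" behavior described above: case is distinguished while accents are not.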
It may be useful to note that this is a different meaning of case insensitive comparison than case folding. For example, case folding would leave punctuation included, and this notion of strength would not IIRC. I think the collation definition is probably more useful semantically, but some standards/algorithms (e.g. HTML) make reference to case folding. |
Wow, thanks @jungshik for the detailed answer! This stuff is complicated...
IMO the spec already has that in "The sensitivity of collator is interpreted as follows:". Although it looks normative, I guess it is non-normative, since the actual behavior delegates to UTS 10. (I assume?) |
@eaenet You and I discussed this in person, and I believe our tentative conclusion was that it could still be valuable to provide case folding, due to the differences from collation that I mentioned in this comment. If we don't provide case folding, end users may implement it themselves incorrectly. Any thoughts from anyone on that proposition? Note that case folding is used internally by ECMAScript in case-insensitive RegExps. |
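As a small illustration of how a do-it-yourself approach can go wrong, the sketch below compares lowercasing with the case folding that a `u`-flagged case-insensitive RegExp applies. The choice of U+017F (LATIN SMALL LETTER LONG S) is mine; Unicode case folding maps it to "s", while `toLowerCase()` leaves it unchanged because it is already lowercase.

```javascript
const longS = "\u017F"; // "ſ", LATIN SMALL LETTER LONG S

// Naive caseless comparison by lowercasing both sides.
const equalByLowercase = longS.toLowerCase() === "s".toLowerCase();

// Case-insensitive match with the u flag, which uses Unicode simple case folding.
const equalByRegExpFolding = /^\u017F$/iu.test("s");

console.log(equalByLowercase);     // false
console.log(equalByRegExpFolding); // true
```

The RegExp route is not a general substitute for a folding API, since it only answers match questions and does not return a folded string that could be stored or used as a key.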
@littledan agreed. At some point we should have a discussion about what APIs make sense as part of ECMAScript and which make more sense as Web APIs. |
Sorry, I overlooked that. Yes, indeed the spec explains that. I think it's still normative. A lot of things in ECMA-402 refer to the Unicode Standard, and UTS 10 is a part of the TUS. |
@littledan, @eaenet: Are you aware of ignorePunctuation (boolean) in Intl.Collator? |
@jungshik Doesn't ignorePunctuation do something different than case folding? Not sure if it's needed, but some things definitely use case folding (e.g., some case-insensitive file systems, databases). |
Unicode defines several caseless comparisons: caseless, compatibility caseless, canonical caseless, and identifier caseless. It would be nice to be able to do these from JavaScript. @littledan says the best way is to expose the Unicode case-folding operations.