Provide a way to do caseless comparison #99

Open
domenic opened this issue Aug 14, 2016 · 13 comments
Labels
c: text (Component: case mapping, collation, properties); Proposal (Larger change requiring a proposal); s: comment (Status: more info is needed to move forward)

Comments

@domenic
Member

domenic commented Aug 14, 2016

Unicode defines several caseless comparisons: caseless, compatibility caseless, canonical caseless, and identifier caseless. It would be nice to be able to do these from JavaScript. @littledan says the best way is to expose the Unicode case-folding operations.

@littledan
Member

Specifically, what if we had toCaseFold/toLocaleCaseFold methods on String.prototype? This would allow users to build all three of the caseless comparisons that @domenic mentions. These are really basic functions for many Unicode algorithms, so they seem like a good building block to have. cc @jungshik
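To illustrate how a case-folding primitive could serve as a building block, here is a rough sketch of the default, canonical, and compatibility caseless matches as the Unicode Standard defines them (chapter 3, Default Caseless Matching). `toCaseFold` does not exist in JavaScript, so `approxCaseFold` is a hypothetical stand-in that only approximates full case folding; for instance, it maps dotless ı to i, which default case folding does not. Identifier caseless matching is left out because it needs NFKC_Casefold rather than plain case folding.

```javascript
// Hypothetical stand-in for the proposed String.prototype.toCaseFold.
// ASSUMPTION: upper-then-lower mapping only approximates Unicode full case folding.
function approxCaseFold(s) {
  return s.toUpperCase().toLowerCase();
}

// Default caseless match: toCasefold(X) == toCasefold(Y)
function caselessMatch(x, y, fold = approxCaseFold) {
  return fold(x) === fold(y);
}

// Canonical caseless match: NFD(toCasefold(NFD(X))) == NFD(toCasefold(NFD(Y)))
function canonicalCaselessMatch(x, y, fold = approxCaseFold) {
  const key = (s) => fold(s.normalize("NFD")).normalize("NFD");
  return key(x) === key(y);
}

// Compatibility caseless match:
// NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) compared for X and Y
function compatibilityCaselessMatch(x, y, fold = approxCaseFold) {
  const key = (s) =>
    fold(fold(s.normalize("NFD")).normalize("NFKD")).normalize("NFKD");
  return key(x) === key(y);
}

console.log(caselessMatch("Straße", "STRASSE"));          // true
console.log(caselessMatch("\u00C5", "a\u030A"));           // false: precomposed vs. decomposed
console.log(canonicalCaselessMatch("\u00C5", "a\u030A"));  // true
console.log(compatibilityCaselessMatch("\uFF2B", "k"));    // true: fullwidth K vs. k
```

Passing the fold function as a parameter means a real case-folding implementation could be swapped in without touching the three matchers.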

@littledan
Member

littledan commented Aug 26, 2016

I wrote up a quick explainer for toCaseFold. Any thoughts? Would this group be interested in seeing case folding pushed forward, or is it an edge case that's not so relevant?

cc @sebmarkbage @jungshik @ericf @jswalden @caridy @zbraniecki

@littledan
Member

littledan commented Aug 26, 2016

Apologies, this is unnecessary; you can already set the sensitivity lower for comparison, e.g.,

"foo".localeCompare("FOO", "en", {sensitivity: "accent"})  // 0

This bug can be closed.
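When the same comparison is needed repeatedly, the equivalent `Intl.Collator` can be created once and reused. A minimal sketch; the `tags` array and the `hasTag` helper are made up for illustration:

```javascript
// Collator whose comparisons ignore case but still distinguish accents.
const caseInsensitive = new Intl.Collator("en", { sensitivity: "accent" });

console.log(caseInsensitive.compare("foo", "FOO"));              // 0: case difference ignored
console.log(caseInsensitive.compare("resume", "résumé") === 0);  // false: accents still count

// Caseless membership test over a list (a linear scan; no hashing involved).
const tags = ["Alpha", "BETA", "gamma"];
const hasTag = (name) => tags.some((t) => caseInsensitive.compare(t, name) === 0);

console.log(hasTag("beta"));   // true
console.log(hasTag("delta"));  // false
```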

@domenic
Member Author

domenic commented Aug 26, 2016

Fascinating. Do you know how those sensitivity options map to Unicode's comparison algorithms? The spec doesn't seem super clear, but maybe I am not reading it fully...

There's still value in toCaseFold for map keys and such though. But I guess it probably loses priority.
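The map-key use case can be sketched as follows. A collator only answers pairwise comparisons, whereas a `Map` needs one canonical string per key, which is what a folding function would supply. `FoldedKeyMap` and `placeholderFold` are hypothetical names, and `toLowerCase()` is only a placeholder for real Unicode case folding, which JavaScript does not expose.

```javascript
// ASSUMPTION: NFC + toLowerCase() is a placeholder, not real Unicode case folding.
const placeholderFold = (s) => s.normalize("NFC").toLowerCase();

class FoldedKeyMap {
  constructor(fold = placeholderFold) {
    this.fold = fold;
    this.store = new Map(); // folded key -> [original key, value]
  }
  set(key, value) {
    this.store.set(this.fold(key), [key, value]);
    return this;
  }
  get(key) {
    const hit = this.store.get(this.fold(key));
    return hit ? hit[1] : undefined;
  }
  has(key) {
    return this.store.has(this.fold(key));
  }
}

const headers = new FoldedKeyMap().set("Content-Type", "text/html");
console.log(headers.get("content-type")); // "text/html"
console.log(headers.has("CONTENT-TYPE")); // true
console.log(headers.has("Accept"));       // false
```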

@littledan
Member

It's based on an earlier part of the collation key, described in UTS 10. I think the next thing to do from here if we want to improve performance for a case like that would be to expose an API to get the collation key of a string, probably as a Uint8Array, though that would be rather inconvenient as a Map key.

@jungshik

@domenic See http://unicode.org/reports/tr10/#Multi_Level_Comparison
and http://userguide.icu-project.org/collation/concepts

'base', 'accent' and 'variant' correspond to level 1 (primary strength), level 2 (secondary strength) and level 3 (tertiary strength) in UTS 10.

My memory is fuzzy as to what exactly 'case' is for (there was a lot of back-and-forth on this issue when the version 1.0 spec was being worked on).

http://www.ecma-international.org/ecma-402/3.0/#collator-objects has the following:

> Collator, however, requires that the usage is specified through the usage property of the options object, alternate handling through the ignorePunctuation property of the options object, and case level and the strength through the sensitivity property of the options object.

I believe that setting the sensitivity to 'case' will turn on 'case level' in UTS 10. OK, I read the V8 implementation (which was done while the 1.0 spec was being written). Setting the sensitivity to 'case' will use level 1 in UTS 10 (primary strength) AND turn on 'case level'. That is level 1.5 in a sense (primary differences and case differences are taken into account, but accents are ignored).

As a result, 'case level' cannot be set independently of 'collation level'. That's mostly OK, except that 'level 2.5' (a level between level 2 and level 3) cannot be created. In level 2.5, level 3 differences other than case (regular kana vs. small kana, "A" vs. "Ⓐ") would be ignored.
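The four `sensitivity` values can be compared side by side with a small probe. This is a sketch; `equalUnder` is a hypothetical helper, and the expected results follow the ECMA-402 description of `sensitivity`.

```javascript
// Returns true when x and y compare as equal under the given sensitivity.
function equalUnder(sensitivity, x, y, locale = "en") {
  return new Intl.Collator(locale, { sensitivity }).compare(x, y) === 0;
}

for (const sensitivity of ["base", "accent", "case", "variant"]) {
  console.log(sensitivity, {
    "a = á": equalUnder(sensitivity, "a", "á"),
    "a = A": equalUnder(sensitivity, "a", "A"),
  });
}
// base:    a = á is true,  a = A is true   (level 1)
// accent:  a = á is false, a = A is true   (level 2)
// case:    a = á is true,  a = A is false  (level 1 plus case level)
// variant: a = á is false, a = A is false  (level 3)
```

No combination of these values expresses the "level 2.5" described above, since case and the remaining tertiary differences are both governed by the single `sensitivity` option.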

Maybe the spec needs to have a (non-normative) note explaining what setting 'sensitivity' means.

@littledan
Member

It may be useful to note that this is a different meaning of case-insensitive comparison than case folding. For example, case folding would leave punctuation included, and this notion of strength would not, IIRC. I think the collation definition is probably more useful semantically, but some standards/algorithms (e.g. HTML) make reference to case folding.

@domenic
Member Author

domenic commented Aug 29, 2016

Wow, thanks @jungshik for the detailed answer! This stuff is complicated...

> Maybe the spec needs to have a (non-normative) note explaining what setting 'sensitivity' means.

IMO the spec already has that in "The sensitivity of collator is interpreted as follows:". Although it looks normative, I guess it is non-normative, since the actual behavior delegates to UTS 10. (I assume?)

@littledan
Member

@eaenet You and I discussed this in person, and I believe our tentative conclusion was that it could still have value to provide case folding, due to the differences from collation that I mentioned in this comment. If we don't provide case folding, end users may implement it themselves incorrectly. Any thoughts from anyone on that proposition? Note that case folding is used internally by ECMAScript in case-insensitive RegExps.
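The RegExp behavior mentioned above can be observed directly. A small sketch: with both the `i` and `u` flags, matching canonicalizes characters through Unicode simple case folding, while without `u` an older `toUpperCase`-based rule applies.

```javascript
const longS = "\u017F";   // LATIN SMALL LETTER LONG S
const kelvin = "\u212A";  // KELVIN SIGN

// With the u flag, /i canonicalizes via Unicode simple case folding.
console.log(/s/iu.test(longS));   // true
console.log(/k/iu.test(kelvin));  // true

// Without the u flag, the legacy rule keeps these characters apart.
console.log(/s/i.test(longS));    // false
console.log(/k/i.test(kelvin));   // false

// Only simple (one-to-one) folding is used, so "ß" does not match "ss".
console.log(/^ss$/iu.test("ß"));  // false
```

Because only simple folding is applied, this internal machinery does not by itself give the full case folding that a `toCaseFold` method would expose.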

@eaenet

eaenet commented Sep 8, 2016

@littledan agreed. At some point we should have a discussion about what APIs make sense as part of ECMAScript and which make more sense as Web APIs.

@jungshik

jungshik commented Sep 8, 2016

>> Maybe the spec needs to have a (non-normative) note explaining what setting 'sensitivity' means.
>
> IMO the spec already has that in "The sensitivity of collator is interpreted as follows:". Although it looks normative, I guess it is non-normative, since the actual behavior delegates to UTS 10. (I assume?)

Sorry, I overlooked that. Yes, indeed the spec explains that. I think it's still normative. A lot of things in ECMA-402 refer to the Unicode Standard, and UTS 10 is a part of the TUS.

@jungshik

jungshik commented Sep 8, 2016

@littledan, @eaenet: Are you aware of ignorePunctuation (boolean) in Intl.Collator?

@littledan
Member

@jungshik Doesn't ignorePunctuation do something different than case folding? Not sure if it's needed, but some things definitely use case folding (e.g., some case-insensitive file systems, databases).
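A short sketch of how `ignorePunctuation` and `sensitivity` act as independent options: the first controls whether punctuation takes part in the comparison, the second whether case does. The sample strings are made up, and the results noted are those expected from an ICU-backed implementation.

```javascript
const ignoringPunct = new Intl.Collator("en", { ignorePunctuation: true });
const defaultCollator = new Intl.Collator("en");

// Punctuation is skipped only when ignorePunctuation is set.
console.log(ignoringPunct.compare("co-op", "coop") === 0);    // true
console.log(defaultCollator.compare("co-op", "coop") === 0);  // false

// ignorePunctuation alone does not make the comparison caseless.
console.log(ignoringPunct.compare("co-op", "CO-OP") === 0);   // false

// Combining both options ignores punctuation and case together.
const both = new Intl.Collator("en", {
  ignorePunctuation: true,
  sensitivity: "accent",
});
console.log(both.compare("co-op", "COOP") === 0);             // true
```

Case folding, by contrast, produces a transformed string that keeps its punctuation, which is why it suits file-system or database keys where a collator does not.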

@sffc added the "s: help wanted" (Status: help wanted; needs proposal champion) and "c: text" (Component: case mapping, collation, properties) labels and removed the "enhancement" label on Mar 19, 2019
@sffc added the "Proposal" (Larger change requiring a proposal) and "s: comment" (Status: more info is needed to move forward) labels and removed the "s: help wanted" (Status: help wanted; needs proposal champion) label on Jun 5, 2020