# Add locale sensitive substring matching functions to Intl.Collator #506
What you can do today: compare two whole strings with `collator.compare(a, b)`.

What you can't currently do: ask whether one string contains, starts with, or ends with another according to that same collator.

We need some reasonable way to do simple locale sensitive string search. `startsWith`, `endsWith`, `includes`, and `indexOf` are all well known names for functions that have lots of utility but cannot be reasonably implemented using just `compare`.

You can kinda do it with `Intl.Collator` already with some very careful slicing, but you have to make sure to normalize both strings the same way first, and if you want to skip punctuation it's more complicated. Here's a library that attempts to implement that: https://github.com/arty-name/locale-index-of. So while it's possible, I think it would be really nice to include it in the language. This would also make it more likely that developers implement these functions correctly as opposed to naively.
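
For a sense of what that careful slicing looks like, here is a minimal sketch (my own, not the library's code) of an `indexOf` built on `compare`. It assumes a match occupies exactly as many code units as the needle, which is precisely the assumption that breaks in the general case:

```js
// Naive locale-aware indexOf on top of Intl.Collator.compare.
// Assumption: a match in the haystack has the same code unit length as the
// needle — false in general (e.g. "ß" vs "ss"), hence the hackiness.
function naiveLocaleIndexOf(haystack, needle, locale) {
  const collator = new Intl.Collator(locale, { usage: 'search', sensitivity: 'base' });
  // Normalize both strings the same way first, as noted above.
  haystack = haystack.normalize('NFC');
  needle = needle.normalize('NFC');
  for (let i = 0; i + needle.length <= haystack.length; i++) {
    if (collator.compare(haystack.slice(i, i + needle.length), needle) === 0) {
      return i;
    }
  }
  return -1;
}

naiveLocaleIndexOf('Österreich', 'oster', 'de'); // => 0
```
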
Sure; this sounds like good material for a proposal. Would you mind sharing some code snippets to help people get a better idea? A list of function names is good, but actual code snippets are better.

Sure. My main use cases are around filtering a list of items displayed in a user interface as the user types, for example an autocomplete input or a filterable table. For these use cases, a fuzzier search is required rather than an exact match: we still want to show results even if the user does not type exactly the same case or diacritics as the items being filtered. Here's an example of what I'd like to be able to do, using a proposed `collator.includes`:

```js
let items = ['Offenbach', 'Österreich', 'Aardvark', 'Kangaroo'];
let collator = new Intl.Collator('fr', {usage: 'search', sensitivity: 'base'});
let search = 'o';
let matches = items.filter(v => collator.includes(v, search));
// => ['Offenbach', 'Österreich', 'Kangaroo']
```

In some cases it may also be useful to know the range of the match, for example if the application wishes to highlight the matched string in the UI. In this case, a `collator.search` method could return an iterator of matched ranges:

```js
let items = ['Offenbach', 'Österreich', 'Aardvark', 'Kangaroo'];
let collator = new Intl.Collator('fr', {usage: 'search', sensitivity: 'base'});
let search = 'a';
let matches = items
  .map(v => ({item: v, matches: [...collator.search(v, search)]}))
  .filter(m => m.matches.length > 0);
// =>
// [
//   {item: 'Offenbach', matches: [{index: 6, length: 1}]},
//   {item: 'Aardvark', matches: [{index: 0, length: 1}, {index: 1, length: 1}, {index: 5, length: 1}]},
//   {item: 'Kangaroo', matches: [{index: 1, length: 1}, {index: 4, length: 1}]}
// ]
```

I've been here before, but this was today's reason (there's some TypeScript in there but mostly it's just JavaScript):

```tsx
// this is a React hook
function useFilter(): [string, (filter: string) => void, RegExp] {
  const [filter, setFilter] = React.useState("")
  const debounced = useDebounce(filter)
  const pattern = React.useMemo(() => {
    return new RegExp(debounced.replace(/[^\w]/g, "\\$&"), "i") // not too happy about this
  }, [debounced])
  return [filter, setFilter, pattern]
}
```

```tsx
const COLLATOR = new Intl.Collator(undefined, { sensitivity: "base" }) // ignore case, ignore accent

// this is some render code; what I want to do here is just filter the list of stuff based on the filter
const [filter, setFilter, filterPattern] = useFilter()

data
  ?.filter((d) => filterPattern.test(d.v[0]))
  .sort((a, b) => COLLATOR.compare(a.v[0], b.v[0]))
  .slice(0, 50)
  .map((d) => {
    const key = encodeEntityId(d.e);
    return (
      <li key={key}>
        <NavLink to={`/property/${key}`}>{d.v[0]}</NavLink>
      </li>
    );
  }) ?? null;
```

What I would like to be able to do here is to just use the `COLLATOR` for the filtering as well, instead of building a `RegExp`. For instance, I have this wildcard matching utility:

```ts
export type Pattern = string[]

const ANY: Pattern = []

export function createPattern(pattern: string, wildcard = "*"): Pattern {
  if (pattern === wildcard) {
    return ANY
  }
  return pattern.split(wildcard)
}

export function isMatch(s: string, pattern: string | Pattern): boolean {
  const p: Pattern = typeof pattern === "string" ? createPattern(pattern) : pattern
  if (p === ANY) {
    return true
  }
  if (p.length === 1) {
    return p[0] === s
  }
  if (!s.startsWith(p[0])) {
    return false
  }
  if (!s.endsWith(p[p.length - 1])) {
    return false
  }
  return _isMatch(s, p[0].length, p, 1, p.length - 1)
}

function _isMatch(s: string, i: number, pattern: Pattern, j: number, k: number): boolean {
  const p = pattern[j]
  if (!(i + p.length <= s.length)) {
    return false
  }
  if (j === k) {
    return true
  }
  const x = s.indexOf(p, i)
  if (x === -1) {
    return false
  }
  if (_isMatch(s, x + p.length, pattern, j + 1, k)) {
    return true
  }
  return _isMatch(s, i + 1, pattern, j, k)
}
```

The above subsequence matching algorithm leverages `String.prototype.startsWith`, `String.prototype.endsWith`, and `String.prototype.indexOf`. I would just assume that these would be available, as defined today, on the collator as well.
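
Purely as a sketch of what that could mean (my code, under the same-length-match simplification questioned elsewhere in this thread), the `String.prototype` calls in `isMatch` could be swapped for collator-backed ones:

```ts
// Hypothetical collator-backed replacements for startsWith/endsWith.
// Assumption: a match spans exactly as many code units as the pattern
// segment, which expansions/contractions ("ß"/"ss") can violate.
const c = new Intl.Collator(undefined, { usage: "search", sensitivity: "base" })
const eq = (a: string, b: string) => c.compare(a, b) === 0

const collatorStartsWith = (s: string, prefix: string) =>
  eq(s.slice(0, prefix.length), prefix)

const collatorEndsWith = (s: string, suffix: string) =>
  suffix.length <= s.length && eq(s.slice(s.length - suffix.length), suffix)
```

An `indexOf` replacement additionally needs a scan over candidate positions, like the loop sketched earlier in the thread.
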
I don't believe fuzzy matching of this sort is a good fit for `Intl.Collator`.

There's nothing fuzzy about what's going on here. This is simply about being able to search for substrings using the case, diacritic, and punctuation sensitivity options that `Intl.Collator` already provides.

ICU has StringSearch, which looks like it uses a collator internally to implement this. See also: http://userguide.icu-project.org/collation/icu-string-search-service and http://www.unicode.org/reports/tr10/#Searching

I would like to add that I'm trying to build a database in JavaScript, and having the fine grained control over string operations needed to build case insensitive (or collation-specific) text indexes is currently impossible (or extremely inefficient). I don't believe that any of the ongoing standards work here is going to help with that, nor do I believe it will be able to address the problem in the context of my domain. I don't know how I would move this forward, but having, at the very least, something like a locale-aware `startsWith` would help.

Note that this sort of feature needs a fair bit more code than just a Collator, and you can't just build it on top of `compare()`. You basically need to incrementally map appropriate sequences of characters to sequences of collation elements and match the desired parts of those (depending on the collation strength) against the collation elements for the pattern. There are options for matching behavior (asymmetric search is somewhat popular for handling diacritics) and for which match boundaries to return when there are multiple possibilities, related to Unicode normalization and/or word boundaries.
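
A concrete illustration of why a fixed-length sliding window over `compare()` falls short (my example, not from the comment): matches and patterns can differ in length.

```js
// At base sensitivity in German, "ß" compares equal to "ss"...
const c = new Intl.Collator('de', { usage: 'search', sensitivity: 'base' });
console.log(c.compare('ß', 'ss')); // => 0

// ...so the needle 'ss' (2 code units) should match the single character
// 'ß' inside 'Straße'. But no equal-length slice of 'Straße' compares
// equal to 'ss', so slide-and-compare silently misses the match:
console.log(c.compare('Straße'.slice(4, 6), 'ss')); // 'ße' vs 'ss' => non-zero
```
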
When proposing the scope for the ICU4X collator MVP, I looked at what Firefox does for ctrl-f (not collator-based) and what CLDR search collations accomplish compared to that. I think exposing a search-specific API based on case folding, ignoring diacritics, and a handful of special cases (these appear surprisingly few) to cover the other things that CLDR search collations accomplish (but Firefox doesn't) is seriously worth considering, compared to adding a collation-based search API.

The main concern of collation is to establish an ordering that is appropriate according to human expectations. For search, however, only the matches need to be appropriate according to human expectations; to the extent there is internally a concept of "less" or "greater", the order is an implementation detail. As a result, the key design driver for what collation data needs to be like is not relevant to search. At least when the search collations are stored as separate instances of data, the cost/benefit of collation-based search (when considering the data size a cost) looks like something that could be improved upon considerably by a search-specific design.

Also (anecdotally, by visual observation of ctrl-f on a large HTML file in Firefox vs. Chrome), not having to map the haystack to collation elements (case folding is a simpler transformation) seems beneficial for performance.
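
As a rough sketch of the fold-based approach described here (my own approximation; real case folding and the CLDR special cases are more involved):

```js
// Strip diacritics and fold case, then do a plain substring check.
// Good enough for a yes/no match; mapping match positions back to the
// original string (for highlighting) needs extra bookkeeping, and
// length-changing folds (e.g. 'ß' -> 'ss') are not handled here.
const fold = (s) => s.normalize('NFD').replace(/\p{Mn}/gu, '').toLowerCase();
const foldedIncludes = (haystack, needle) => fold(haystack).includes(fold(needle));

foldedIncludes('BEYONCÉ was there', 'beyonce'); // => true
```
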
Can someone explain to me what the corner cases are here... I want to do a simple prefix search. I can do this by simply chopping strings with `slice` and comparing the prefix with the collator. I get that a code point is not a "letter", but the user submitted string will not be broken up at arbitrary code points. Even if a malicious user crafted some nasty string of code points, I am perfectly okay with them ending up with a nonsense search result.
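
Presumably the approach in question looks something like this (my reconstruction), and the corner case is again length-changing equivalences:

```js
// Prefix check by chopping the haystack to the prefix's code unit length.
const c = new Intl.Collator(undefined, { usage: 'search', sensitivity: 'base' });
const naiveStartsWith = (s, prefix) =>
  c.compare(s.slice(0, prefix.length), prefix) === 0;

naiveStartsWith('Österreich', 'oster'); // => true
// Corner case: 'Straße' does start with 'strass' at this strength
// (ß ~ ss), but the equal-length slice compares unequal to it:
naiveStartsWith('Straße', 'strass'); // => false, arguably wrongly
```
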
The issue title and description talk about `Intl.Collator`. I understand that searching and collating are different tasks. Would this topic become less controversial if, instead of asking for new methods on `Intl.Collator`, we asked for new locale-aware methods on `String.prototype`? We already have locale-aware methods such as `String.prototype.localeCompare` and `String.prototype.toLocaleLowerCase`.

This way we could easily lose all objections about collation being different and irrelevant to the topic. Then we could focus on the goals of developers, like "I have to find the position of a substring in a text, while smartly ignoring case, diacritics, and punctuation".
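
Hypothetically (these method names are illustrative only, not proposed text), such methods could mirror `localeCompare`'s signature:

```js
// Hypothetical API shapes, by analogy with String.prototype.localeCompare:
// 'Österreich'.localeIncludes('oster', 'de', { sensitivity: 'base' }); // => true
// 'Österreich'.localeIndexOf('reich', 'de', { sensitivity: 'base' });  // => 5
```
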
While convenient, I think this is the wrong approach, because each locale specific method now has to go through a lookup phase to find a "collation". The search module from the ICU library appears to be doing exactly this: it can do locale-sensitive substring search driven by a collator that is constructed once and reused. Should we introduce a dedicated searcher object in `Intl` instead?
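
The lookup cost alluded to here already shows up with today's API; this is the standard advice when sorting:

```js
const items = ['Öl', 'apfel', 'Zebra'];

// localeCompare resolves locale data on every single call...
items.sort((a, b) => a.localeCompare(b, 'de'));

// ...whereas a Collator resolves it once; its bound compare can be reused.
const collator = new Intl.Collator('de');
items.sort(collator.compare);
```
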
Conceptually, these operations are just `compare` applied to pieces of the haystack. For example, suppose you want to tell whether one string starts with another: you can slice a prefix of the same length and `compare` it with the needle. However, in many cases, even comparing UTF-16 code points directly would seem to be good enough.

No, it's not just UTF-16 code points. Internally, a collator implementation maps the sequence of characters to a sequence of collation elements and operates on those. Currently, ECMA-402 exposes only the comparison operation, and ICU4X's collator is at present designed to cover only the operations that are in ECMA-402's present scope. Extending that operation with code to support `startsWith` seems relatively tractable. However, extending it in an efficient way to support `includes`/`indexOf`-style searching would be a considerably larger undertaking.

(To be clear, my concern is not how the entry point is named, but whether collation-based search is taken into Web Platform scope and the implications on an ICU4X back end. As noted, I'm somewhat skeptical of collator-based search conceptually, since much of the design of collations goes into defining culturally-relevant "less" and "greater" results, while search only cares about the "equal" case being culturally-relevant, and "less" and "greater" are implementation details.)

Can the collator expose the units it actually compares? For example, while you can order strings with `compare`, you cannot tell how the collator breaks a string down into those units, or where within a haystack a match begins and ends. To do this, you need to know, for example, that a digraph like "ch" can be treated as a single unit in some locales, while "ß" corresponds to two units.

It seems like a very bad idea to expose this kind of implementation detail in a way that would make the design of the collation elements Web-exposed and, therefore, part of the compatibility surface of the Web and, therefore, presumptively frozen. You wouldn't know that the first collation element of a given string looks the way it does for implementation reasons rather than linguistic ones, and implementations would lose the freedom to change it.

Well, can't you do it without exposing the raw values? Something like an opaque `CollationElement` object that can only be compared through the collator?

I think that while how collation elements are compared is an implementation detail, the collation elements themselves are not. Collation elements are the technical equivalent of the everyday concept of letters of an alphabet. And I think that everyone can agree that the ability to divide text into letters is a useful thing, much like dividing text into sentences and words. Even just knowing the number of collation elements in a string would be immensely helpful. There might not be a perfect answer to the question "how long is this string?", but the ability to say how many letters a given word has would still be valuable.
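
A sketch of the shape this comment seems to imagine (entirely hypothetical; no such API exists or has been proposed):

```ts
// Hypothetical: opaque collation elements, comparable only via the collator.
declare class CollationElement { private __opaque: never }

interface SegmentingCollator extends Intl.Collator {
  // Divide a string into the units the collator actually compares.
  elements(input: string): Iterable<CollationElement>;
  // Compare two opaque elements without revealing their internal weights.
  compareElements(a: CollationElement, b: CollationElement): number;
}
```
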
But they aren't: there might be more or fewer collation elements than letters (for any definition of letter). What matters is that the collation elements happen to sort in an appropriate way. Collation elements don't give you a division into letters; they give you a sequence of thingies that are useful for culturally-appropriate sorting.

I have doubts about it being useful to be able to say that. In any case, counting the collation elements does not let you say it: there are multiple levels of comparison, and e.g. the primary and secondary weights might sit on one collation element or on two separate elements, each with the other level's weight marked ignorable. Thus, the number of collation elements is an implementation detail.

Culturally, in alphabetical scripts, we sort alphabetically; sorting is alphabetizing. It stands to reason that collation elements must necessarily be very closely tied in some way to letters of an alphabet. Indeed, even in non-alphabetical writing, sorting still depends on breaking strings down into smaller units. These units, alphabetical or not, are not implementation details of sorting; they have linguistic and cultural significance. Or, to put it another way, the information the collator has very often pertains to letters of alphabets more closely than what could otherwise be obtained with the methods currently available.

With a collator, two visually different strings can compare equal; without one, they are merely different sequences of code points. What's the difference between the two cases? Technically, they might be different, but culturally, I do not think there is a difference. And with access to the units the collator compares, programs could finally act on that cultural knowledge.

One simple use case in the US: I searched for "beyonce" and found all 17 occurrences in the following article, including "Beyoncé", "Beyonce", "BEYONCE", and "beyonce". It should also be able to match "BEYONCÉ", of course. Don't miss "Beyoncé" please! Notice this is a feature even websites in the US need! :)

Timothée Chalamet, Penélope Cruz, Renée Zellweger, Clémence Poésy, Zoë Kravitz, Gisele Bündchen, Roberto Gómez Bolaños

TG2 discussion: https://github.com/tc39/ecma402/blob/main/meetings/notes-2024-05-23.md#add-locale-sensitive-substring-matching-functions-to-intlcollator

It seems to be a real use case with real applications. Although ICU4C supports locale-aware string search via the Collator API, some delegates feel that this is an inferior solution to the problem that we don't want to propagate. We would like to see this problem addressed upstream in Unicode/CLDR, and then we can revisit the ECMA-402 side once there is more prior art.

It would be useful for `Intl.Collator` to support some additional methods for locale sensitive substring matching, for example `startsWith`, `endsWith`, `includes`, `indexOf`, etc. These methods exist on `String.prototype`, but do not support the options that `Intl.Collator` provides for case, diacritic, and punctuation insensitive searching.

It is possible to implement these on top of the existing collator `compare` function, but it is challenging and potentially hacky to do so correctly. I think a built in solution in the language would make sense for a lot of use cases, e.g. client side filtering of lists.

Has there been a proposal for anything like this already?
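
To make the gap concrete (my example, using only APIs that exist today):

```js
// String.prototype methods match code units exactly — no collator options:
'Beyoncé'.includes('beyonce'); // => false

// Intl.Collator has the right options, but only whole-string comparison:
const c = new Intl.Collator(undefined, { usage: 'search', sensitivity: 'base' });
c.compare('Beyoncé', 'beyonce') === 0; // => true

// What's missing: the collator's options applied to substring operations
// like startsWith / endsWith / includes / indexOf.
```
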