Unicode Properties #90
As @littledan mentioned, this is an experimental feature in V8. The comment in the regexp parser describes the current syntax we are using:
For example \P is the inverse of \p, so binary properties with "False" as property value can be expressed via \P. |
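A minimal sketch of the `\p` / `\P` inversion, assuming the syntax that later shipped behind the `u` flag (the feature was experimental when this comment was written, so the exact spelling may have differed):

```javascript
// \p{...} matches code points that have the property;
// \P{...} matches code points that do not have it.
const isUpper = /^\p{Uppercase}$/u;
const isNotUpper = /^\P{Uppercase}$/u;

console.log(isUpper.test("A"));    // true
console.log(isNotUpper.test("A")); // false
console.log(isNotUpper.test("a")); // true
```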
For the record, the V8 flag @littledan mentioned is Is full compatibility with existing Some related info: mathiasbynens/es-regexp-unicode-character-class-escapes#2 @goyakin Can we track your spec work somewhere (GitHub)? |
We considered doing loose matching and having a "In"-prefix for blocks. But having thought about it, we decided against either. Looking at Perl, it seems to be a good idea to be strict rather than overly ambiguous. |
@hashseed Agreed; that is what the discussion I referenced concluded as well. I’ve now updated my example (originally intended to explain how aliases should throw only) to avoid confusion. Note that in your example you’re still doing a form of loose matching, i.e. ignoring |
I thought the underscore is actually part of the name. That's what PropertyAliases.txt and PropertyValueAliases.txt as well as ICU suggest. |
As far as I can see, only |
I hope to present this as a stage 0 strawman at a future TC39 meeting. After implementing support for |
Thanks for following up on this, Mathias! Having followed the Unicode mail thread, I think I can get behind the idea of considering whitespace, hyphens and underscores as equivalent when looking up property names and property value names, including their aliases. E.g. \p{Lowercase Letter} would be allowed just as well as \p{Lowercase-Letter} and \p{Ll}, but not \p{Lower case Letter}. This would solve the conflict between Blocks.txt and PropertyValueAliases.txt. |
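A rough sketch of that lookup rule, where separators are interchangeable but not ignorable. The helper names and the tiny name table are hypothetical and only serve the illustration:

```javascript
// Hypothetical helper: map whitespace, hyphen and underscore to one
// canonical separator so that they compare equal during lookup.
function canonicalizeSeparators(name) {
  return name.replace(/[\s_-]/g, "_");
}

// Tiny stand-in for the real property value name table.
const knownNames = new Set(["Lowercase_Letter", "Ll"]);

function isKnownName(name) {
  return knownNames.has(canonicalizeSeparators(name));
}

console.log(isKnownName("Lowercase Letter"));  // true
console.log(isKnownName("Lowercase-Letter"));  // true
console.log(isKnownName("Ll"));                // true
console.log(isKnownName("Lower case Letter")); // false: extra separator
```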
@hashseed There is another issue though: e.g. Would you be open to that, or would you rather stick to strict matching in that case? |
@mathiasbynens — thanks for your work on this. What's puzzling to me is why I'm a definite -1 on leniency to match |
@srl295 It may not be what it’s for, but it would be a direct consequence of following http://unicode.org/reports/tr18/#RL1.2 which specifies that “matching of […] values must follow the Matching Rules from UAX44”, specifically http://unicode.org/reports/tr44/#Matching_Symbolic. (As stated, I’d be fine with not following that, and implementing strict matching instead — just explaining the reasoning here.)
Yeah, that’s what I didn’t know when I started the thread. I’d be willing to bet that there are other developers wishing to use
This is a problem that can be solved through proper developer documentation, of course. But taking all of it into consideration, I’m leaning towards supporting @hashseed’s suggestion + case-insensitivity. |
@mathiasbynens UAX44-LM2 is of course a great reason to, quote, _Ignore case, whitespace, underscore ('_'), and all medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E_, unquote. So I'm +1 on that.
But, it should work (and does in ICU )— because of UAX44-LM2. Are there any names in It does seem that both the |
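For comparison, a simplified sketch of a UAX44-LM2-style loose key. It is deliberately incomplete: it drops every hyphen (so the U+1180 exception is not handled) and it does not strip an initial "is" prefix. The function name is hypothetical.

```javascript
// Simplified loose key: lowercase, then drop whitespace, underscores and hyphens.
// Assumption: the U+1180 medial-hyphen exception and the "is" prefix are ignored.
function looseKey(name) {
  return name.toLowerCase().replace(/[\s_-]/g, "");
}

console.log(looseKey("Lowercase_Letter"));  // "lowercaseletter"
console.log(looseKey("lower case-letter")); // "lowercaseletter"
```

Unlike a rule that only treats separators as equivalent, this key also accepts spellings such as "Lower case Letter", which is part of why it is the looser of the two options discussed here.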
@srl295 Have you seen mathiasbynens/ecma262#1 (comment)? It was the context for the above discussion.
Sure — if we decide to follow that. My initial spec draft included a variant of loose matching per UAX44-LM2 (minus non-ASCII hyphens and the
No. But note that this is also true for @hashseed’s suggestion combined with case-insensitivity (which is what I was proposing here), which would be a more strict solution than UAX44-LM2. I’d strongly prefer that over UAX44-LM2, at least for the initial spec text + implementations. We can always loosen up the matching algorithm later, but if we do it right from the start, there’s no going back. |
If matching If we consider following UAX44-LM2 a bad idea, and there is no reason to care about matching With that in place, we can still gather feedback from developers. If not having loose matching is an actual developer pain point, we can still address that in a future PR. |
Updated https://github.com/mathiasbynens/ecma262/pull/1/files to explicitly mention |
There is now a standalone repo for this proposal: https://github.com/mathiasbynens/es-regex-unicode-property-escapes Let’s move the discussion over there. |
@mathiasbynens OK. I think there's a lot more needed than just support within regex (as important as that is), especially getting the general property and other properties given a codepoint. in ICU there's uchar_getIntProperty so
etc. Not proposing this specific API, just trying to get the concept rolling. |
@srl295 Are there use cases that you have in mind where it is important to use the property value, rather than test whether the character has a particular property value? That would help motivate adding such an API. |
@littledan sure, anything that's not just a single boolean:
sure, you could do
… but why? Actually I would prefer making the property available over extending regex. Because if you have the properties, you can implement regex in JS. But without the properties enumerated, it's a lot harder to do the reverse. |
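To make the trade-off concrete, here is a sketch of recovering a non-boolean property (General_Category) by probing one regex per value. It assumes a runtime with `\p{…}` support, lists only a few category values for brevity, and the function name is hypothetical:

```javascript
// Subset of General_Category values; a full table would list all of them.
const CATEGORIES = ["Lu", "Ll", "Lt", "Lm", "Lo", "Nd", "Zs", "Po"];
const MATCHERS = CATEGORIES.map((gc) => [
  gc,
  new RegExp(`^\\p{General_Category=${gc}}$`, "u"),
]);

// Returns the first category whose regex matches, or undefined if the
// code point falls outside the subset above.
function generalCategory(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  for (const [gc, re] of MATCHERS) {
    if (re.test(ch)) return gc;
  }
  return undefined;
}

console.log(generalCategory(0x41));   // "Lu" (LATIN CAPITAL LETTER A)
console.log(generalCategory(0x0661)); // "Nd" (ARABIC-INDIC DIGIT ONE)
```

It works, but it is a linear scan over regexes per code point, which is the awkwardness being pointed out.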
@srl295 I don't think exposing a way to test for property value for a particular character should affect this proposal. |
Let's suppose that there's such a use case. Even then, wouldn't it be better to make that API a part of Ecma 262 instead of Ecma 402 (given what has been added to RegExp)? |
It's a somewhat esoteric question which place this lands in; the 262/402 split doesn't correspond to the split in some implementations. For example, V8 does not support normalization or Unicode RegExp properties when "i18n" is compiled out. I suspect it's not the only one. A rough argument for putting it in 402 is, this is where the library functions for things that aren't methods on existing objects go. And it seems reasonable to make this a property of the Intl object. |
Well, the current 'V8_INTL_SUPPORT' needs to be split into two eventually, or its 'boundary' has to be changed, IMHO, once https://bugs.chromium.org/p/v8/issues/detail?id=5500#c9 (replace unibrow with ICU) is resolved. One should be about Ecma 402 support (Intl.* API support) and the other should be about whether ICU is used or not (ICU vs unibrow). Depending on how the above V8 bug is resolved, the latter might not be necessary at all (i.e. ICU is always used), in which case Unicode RegExp properties (a part of Ecma 262) would always be supported regardless of Intl.* API (Ecma 402) support. |
For anyone who's looking to contribute to ECMA 402, this is a "shovel ready" project, just in need of a writeup for a concrete API, and presentation to the committee. |
in practice, does |
this may end up being 262… |
I believe this is much more convenient in 262 even w/ the fact it involves a good amount of work from the delegates working w/ ECMA-402. As long as we have someone to champion it in a TC39 meeting, this should be well clarified. |
https://github.com/srl295/es-unicode-properties/issues EDIT (May 2021): This proposal is currently stalled, pending more concrete use cases. |
Can you make the Unicode properties needed for internationalization and low-level text rendering available? It’s becoming increasingly common to do low-level text rendering in JavaScript because certain APIs like WebGL require it or because people are making more complex web apps like word processors, paint programs, or graphic design tools. Implementing internationalization support for this low-level rendering like the bidi algorithm, vertical orientation, and text shaping requires a lot of these Unicode properties, so it would be great if there were an API making them available. Right now, libraries like Harfbuzzjs simply include a compressed version of the Unicode database in their code, and it’s not too big, but since web browsers already know this information, it would be great if web browsers made it available to JavaScript. Preferably these properties would be fast to access too. |
@my2iu thanks for your comment. Looks like the proposal has not seen a lot of activity lately, but hopefully that would change soon... |
@my2iu @ryzokuken that's why I proposed this, but there was a lot of pushback that there weren't real use cases for anything that couldn't be covered by regex. See https://github.com/srl295/es-unicode-properties |
Your proposed specification uses a lot of strings in its API, so I’m concerned that it would be slow. Regexes might be fine if they were extended to handle all the weird Unicode properties needed for low level text rendering and if the regexes could be optimized to be fast. |
Not sure why that would be slow, it's mostly the same strings. Can you make a concrete list of regex properties that are currently missing? I.e. specific properties from the Unicode spec you need? Also see https://github.com/srl295/es-unicode-properties#why-not-just-use-regex |
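One way to build such a list is to probe the engine directly. The sketch below assumes a runtime with `\p{…}` support; the helper name and the choice of candidate properties are mine, not from the thread:

```javascript
// Returns true if the engine accepts \p{<expr>} under the `u` flag.
function supportsPropertyEscape(expr) {
  try {
    new RegExp(`\\p{${expr}}`, "u");
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Candidates loosely related to bidi, vertical layout and shaping.
const candidates = [
  "Script=Arabic",
  "General_Category=Mn",
  "Bidi_Mirrored",
  "Bidi_Class=R",
  "Vertical_Orientation=U",
  "Joining_Type=D",
];

for (const expr of candidates) {
  console.log(expr, supportsPropertyEscape(expr));
}
```

As far as I know, ECMAScript only admits General_Category, Script, Script_Extensions and a fixed set of binary properties, so enumerated properties such as Bidi_Class should be reported as unsupported.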
Sure. Now that I think about it, something that operates on code points might be the fastest, plus with an API that either has a lot of methods or with a “matcher” object that can be optimized like with regexes. Actually, does JS even have proper support for code points and surrogates yet? The last time I checked, people were still arguing whether JS strings were UCS2, UTF16, or UTF32. |
It could be an overload. The getter could take either a string or an integer codepoint. This is discussed in srl295/es-unicode-properties#5
yes. |
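A short sketch of both points: the code point support that JS strings already have, and what a string-or-integer overload could look like. `toCodePoint` is a hypothetical helper, not part of any proposal text:

```javascript
const s = "a\u{1F600}b"; // "a", U+1F600 (a surrogate pair in UTF-16), "b"

console.log(s.length);                      // 4 (UTF-16 code units)
console.log([...s].length);                 // 3 (iteration is by code point)
console.log(s.codePointAt(1).toString(16)); // "1f600"

// Hypothetical overload: accept either an integer code point or a string
// (in which case the first code point of the string is used).
function toCodePoint(input) {
  if (typeof input === "number") return input;
  return input.codePointAt(0);
}

console.log(toCodePoint("\u{1F600}") === toCodePoint(0x1f600)); // true
```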
Strawman from @reed-at-google about what would be necessary for Skia's needs: https://github.com/google/skia/blob/main/site/docs/dev/design/uni_characterize.md |
That strawman API seems a little wonky. I’m not a huge WASM expert, but I don’t think strings are directly transferable from WASM to JS. You call from WASM to JS, and then from JS, you can reach into the WASM memory space to copy the raw bytes into JS and convert things into JS strings. Since WASM code is normally C++, things would normally be UTF-8, but UTF-16 might be possible as well, though WASM may prefer UTF-8. As such, I’m not sure whether minimizing the number of JS to WASM transitions needs to be necessarily reflected in the API design, and having an API that operates on JS strings (as opposed to typed arrays) isn’t necessarily the fastest thing for WASM either. It does bring up some good points about how batching might improve performance, depending on the overhead of JS to browser calls on VMs. |
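The copy-and-decode step described above can be sketched as follows. There is no real WASM module here: the bytes are written from JS to stand in for what a module would leave in its linear memory, and the offset and helper name are arbitrary choices for the example.

```javascript
// One 64 KiB page of linear memory, as a WASM module would export.
const memory = new WebAssembly.Memory({ initial: 1 });

// Stand-in for the module writing a UTF-8 string at byte offset 16.
const bytes = new TextEncoder().encode("héllo");
new Uint8Array(memory.buffer).set(bytes, 16);

// JS side: view the raw bytes at (ptr, len) and decode them into a JS string.
function readUtf8(mem, ptr, len) {
  return new TextDecoder("utf-8").decode(new Uint8Array(mem.buffer, ptr, len));
}

console.log(readUtf8(memory, 16, bytes.length)); // "héllo"
```

Every string crossing the boundary pays for this decode, which is why an API built around JS strings is not automatically the cheapest option for WASM callers.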
Request for the "decimal" property: #579 |
https://github.com/srl295/es-unicode-properties