Unicode Properties #90

srl295 · 2016-05-19T16:25:13Z

https://github.com/srl295/es-unicode-properties

srl295 · 2016-05-19T16:26:06Z

littledan · 2016-05-20T07:32:01Z

Something that's been discussed is exposing these to RegExps. V8 does this currently behind a special flag, thanks to @hashseed's work. I don't know if a spec is written but I heard @goyakin may work on exposing properties through RegExps.

hashseed · 2016-05-20T08:08:59Z

As @littledan mentioned, this is an experimental feature in V8. The comment in the regexp parser describes the current syntax we are using:

  // Parse the property class as follows:
  // - \pN with a single-character N is equivalent to \p{N}
  // - In \p{name}, 'name' is interpreted
  //   - either as a general category property value name.
  //   - or as a binary property name.
  // - In \p{name=value}, 'name' is interpreted as an enumerated property name,
  //   and 'value' is interpreted as one of the available property value names.
  // - Aliases in PropertyAlias.txt and PropertyValueAlias.txt can be used.
  // - Loose matching is not applied.

For example
/\p{East_Asian_Width=H}/u.test("\u20a9") // true

\P is the inverse of \p, so binary properties with "False" as property value can be expressed via \P.
For example
/\p{ASCII_Hex_Digit}/u.test("A") // true
/\P{ASCII_Hex_Digit}/u.test("A") // false
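
The rules in the parser comment can be illustrated with a short sketch. It assumes an engine that supports property escapes under the `u` flag (the syntax later standardized in ES2018); the variable names are only illustrative, and the experimental V8 flag discussed here may have accepted a different set of properties.

```javascript
// Long form, alias form, and bare General_Category value names
// should all describe the same set of characters.
const forms = [
  /^\p{General_Category=Uppercase_Letter}$/u,
  /^\p{gc=Lu}$/u,
  /^\p{Uppercase_Letter}$/u,
  /^\p{Lu}$/u,
];
const sameAnswer = forms.every((re) => re.test("A") && !re.test("a"));

// An enumerated property with an explicit name, and its inverse via \P.
const isGreek = /^\p{Script=Greek}$/u.test("α");     // true
const isNotGreek = /^\P{Script=Greek}$/u.test("α");  // false

console.log(sameAnswer, isGreek, isNotGreek);
```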

mathiasbynens · 2016-05-20T08:10:31Z

For the record, the V8 flag @littledan mentioned is --harmony_regexp_property. Tests that show how the current implementation works: https://chromium.googlesource.com/v8/v8/+/master/test/mjsunit/harmony/regexp-property-exact-match.js

Is full compatibility with existing \p implementations a hard requirement? If I were implementing \p{…} in ES I explicitly wouldn’t support Is/In prefixes, shorthands, loose matching, property aliases, property value aliases, or whitespace around = / :. E.g. throw on /\p{In_Cyrillic_Sup}/u, /\p{Block=Cyrillic_Sup}/u and /\p{Block=Cyrillic_Supplementary}/u and only accept /\p{Block=Cyrillic_Supplement}/u which is the canonical block name. We have the opportunity to be strict here and encourage readable code; let’s do it.

Some related info: mathiasbynens/es-regexp-unicode-character-class-escapes#2

@goyakin Can we track your spec work somewhere (GitHub)?

hashseed · 2016-05-20T08:14:19Z

We considered doing loose matching and having a "In"-prefix for blocks. But having thought about it, we decided against either. Looking at Perl, it seems to be a good idea to be strict rather than overly ambiguous.
Your example would be /\p{Block=Cyrillic_Supplement}/u or /\p{blk=Cyrillic_Sup}/u. Reason to have the property name be explicit is because there is ambiguity between Script and Block property value names. And honestly stating it explicitly really should not hurt anyone.

mathiasbynens · 2016-05-20T12:20:24Z

@hashseed Agreed; that is what the discussion I referenced concluded as well. I’ve now updated my example (originally intended to explain how aliases should throw only) to avoid confusion.

Note that in your example you’re still doing a form of loose matching, i.e. ignoring _. (The canonical block name is Cyrillic Supplement and not Cyrillic_Supplement.)

hashseed · 2016-05-21T19:04:51Z

I thought the underscore is actually part of the name. That's what PropertyAliases.txt and PropertyValueAliases.txt as well as ICU suggest.

mathiasbynens · 2016-05-21T23:55:12Z

As far as I can see, only PropertyValueAliases.txt suggests it. Blocks.txt has the block name with spaces instead of underscores. I’ve asked for clarification here: http://www.unicode.org/mail-arch/unicode-ml/y2016-m05/thread.html#79

mathiasbynens · 2016-06-06T11:42:58Z

I hope to present this as a stage 0 strawman at a future TC39 meeting.

After implementing support for \p{…} and \P{…} in my regular expression transpiler https://github.com/mathiasbynens/regexpu-core (online demo), I’ve started to work on a concrete spec proposal. Here’s an early draft: mathiasbynens/ecma262#1 Feedback welcome.

hashseed · 2016-06-07T05:39:11Z

Thanks for following up on this, Mathias!

Having followed the unicode mail thread, I think I can get behind the idea of considering whitespace, hyphens and underscores as equivalent, when looking up property names and property value names including their aliases.

E.g. \p{Lowercase Letter} would be allowed just as well as \p{Lowercase-Letter} and \p{Ll}, but not \p{Lower case Letter}.

This would solve the conflict between Blocks.txt and PropertyValueAliases.txt.
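
A minimal sketch of that idea, assuming a hypothetical `CANONICAL` set holding a few names from PropertyValueAliases.txt and a hypothetical helper `lookupSeparatorInsensitive`: a single space or hyphen is treated as equivalent to an underscore, while case stays significant.

```javascript
// Hypothetical excerpt of canonical names and aliases (not a full table).
const CANONICAL = new Set([
  "Lowercase_Letter",
  "Ll",
  "Superscripts_And_Subscripts",
]);

// Map each space or hyphen to an underscore, then do an exact lookup.
// Separators are made equivalent, not dropped, so extra ones do not match.
function lookupSeparatorInsensitive(name) {
  const key = name.replace(/[ -]/g, "_");
  return CANONICAL.has(key) ? key : null;
}

console.log(lookupSeparatorInsensitive("Lowercase Letter"));   // "Lowercase_Letter"
console.log(lookupSeparatorInsensitive("Lowercase-Letter"));   // "Lowercase_Letter"
console.log(lookupSeparatorInsensitive("Lower case Letter"));  // null
// Case is still significant, so the Blocks.txt spelling does not match:
console.log(lookupSeparatorInsensitive("Superscripts and Subscripts")); // null
```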

mathiasbynens · 2016-06-07T06:21:26Z

@hashseed There is another issue though: e.g. Blocks.txt has Superscripts and Subscripts, whereas PropertyValueAliases.txt has Superscripts_And_Subscripts, which is the canonical property value. Note the difference in casing of the letter a. To support \p{Block=Superscripts and Subscripts} in addition to \p{Block=Superscripts_And_Subscripts} we need case-insensitivity as well.

Would you be open to that, or would you rather stick to strict matching in that case?

srl295 · 2016-06-07T19:19:29Z

@mathiasbynens — thanks for your work on this. What's puzzling to me is why Blocks.txt is even being looked at here. It's for display names, not programmatic use. PropertyValueAliases.txt is the right place to find property value aliases — just as the response on the mailing list said.

I'm a definite -1 on leniency to match Blocks.txt — that's not what it's for. We should just match PropertyValueAliases.txt

mathiasbynens · 2016-06-07T19:37:13Z

@srl295 It may not be what it’s for, but it would be a direct consequence of following http://unicode.org/reports/tr18/#RL1.2 which specifies that “matching of […] values must follow the Matching Rules from UAX44”, specifically http://unicode.org/reports/tr44/#Matching_Symbolic. (As stated, I’d be fine with not following that, and implementing strict matching instead — just explaining the reasoning here.)

> What's puzzling to me is why Blocks.txt is even being looked at here. It's for display names, not programmatic use.

Yeah, that’s what I didn’t know when I started the thread. I’d be willing to bet that there are other developers wishing to use \p{…} in regexps that don’t know about this. Blocks.txt doesn’t seem like an illogical place to go looking for the proper block names, IMHO. Those devs would be surprised to find that \p{Block=Superscripts and Subscripts} doesn’t work. It doesn’t help that Blocks.txt also includes this:

# Note:   When comparing block names, casing, whitespace, hyphens,
#         and underbars are ignored.
#         For example, "Latin Extended-A" and "latin extended a" are equivalent.
#         For more information on the comparison of property values, 
#            see UAX #44: http://www.unicode.org/reports/tr44/

This is a problem that can be solved through proper developer documentation, of course. But taking all of it into consideration, I’m leaning towards supporting @hashseed’s suggestion + case-insensitivity.

srl295 · 2016-06-07T20:00:38Z

@mathiasbynens UAX44-LM2 is of course a great reason to, quote, "Ignore case, whitespace, underscore ('_'), and all medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E", unquote. So I'm +1 on that.

> \p{Block=Superscripts and Subscripts} doesn’t work

But, it should work (and does in ICU), because of UAX44-LM2. Are there any names in Blocks.txt that wouldn't match PropertyValueAliases.txt given the leniency?

It does seem that both the Blocks.txt comment and UAX44 could be improved for some more clarity — discussing PropertyValueAliases.txt
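
For illustration, a simplified sketch of the loose comparison quoted from UAX44-LM2. The helper name `looseKey` is hypothetical, and the sketch deliberately simplifies the rule: it drops every ASCII hyphen (not only medial ones), does not special-case U+1180 HANGUL JUNGSEONG O-E, and does not handle an "is" prefix.

```javascript
// Reduce a property (value) name to a comparison key:
// lowercase it, then drop whitespace, underscores, and ASCII hyphens.
function looseKey(name) {
  return name.toLowerCase().replace(/[\s_-]/g, "");
}

// The example from the Blocks.txt header comment:
console.log(looseKey("Latin Extended-A") === looseKey("latin extended a")); // true

// The Blocks.txt spelling and the PropertyValueAliases.txt spelling:
console.log(
  looseKey("Superscripts and Subscripts") ===
    looseKey("Superscripts_And_Subscripts")
); // true
```

Under this kind of matching, very irregular spellings also collapse to the same key as a regular one, which is the readability concern raised elsewhere in this thread.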

mathiasbynens · 2016-06-07T20:07:51Z

@srl295 Have you seen mathiasbynens/ecma262#1 (comment)? It was the context for the above discussion.

> But, it should work — because of UAX44-LM2.

Sure — if we decide to follow that. My initial spec draft included a variant of loose matching per UAX44-LM2 (minus non-ASCII hyphens and the is prefix) but we later decided to use strict matching instead.

> Are there any names in Blocks.txt that wouldn't match PropertyValueAliases.txt given the leniency?

No. But note that this is also true for @hashseed’s suggestion combined with case-insensitivity (which is what I was proposing here), which would be a more strict solution than UAX44-LM2. I’d strongly prefer that over UAX44-LM2, at least for the initial spec text + implementations. We can always loosen up the matching algorithm later, but if we do it right from the start, there’s no going back.

hashseed · 2016-06-08T05:17:54Z

If matching Blocks.txt is not really that important, I'm actually hesitant to follow UAX44-LM2 at all, including whitespace and underscore. If we simply follow UAX44-LM2, we end up with loose matching, which I thought we agreed on being a bad idea. The reason I mentioned for this is that we do not want to end up with regexps that read /\p{___lower C-A-S-E___}/ui. I don't see why we should carve out a subset of UAX44-LM2 instead of ignoring it altogether.

If we consider following UAX44-LM2 a bad idea, and there is no reason to care about matching Blocks.txt, then I'm in favor of being super strict and only matching what's listed in PropertyValueAliases.txt. We can explicitly state that in the spec text, and add a note about Blocks.txt. I think standardizing on underscore as separator is nicer than having this exception for Blocks.txt. Scripts.txt, for example, uses names with underscores. You could argue that either way could surprise users.

With that in place, we can still gather feedback from developers. If not having loose matching is an actual developer pain point, we can still address that in a future PR.
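
A small sketch of what strict matching looks like from the developer's side. It assumes an engine implementing property escapes as later standardized in ES2018, where, as far as I know, only exact names and aliases are accepted; `compiles` is a hypothetical helper, not part of any API.

```javascript
// Report whether a pattern source compiles under the `u` flag.
function compiles(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected
  }
}

const results = {
  canonical: compiles("\\p{Lowercase_Letter}"), // exact long name
  alias: compiles("\\p{Ll}"),                   // exact short alias
  wrongCase: compiles("\\p{lowercase_letter}"), // case differs
  spaces: compiles("\\p{Lowercase Letter}"),    // space instead of underscore
};
console.log(results);
```

In such an engine the first two entries compile and the last two are rejected with a SyntaxError at pattern construction time rather than silently matching nothing.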

mathiasbynens · 2016-06-08T12:33:40Z

Updated https://github.com/mathiasbynens/ecma262/pull/1/files to explicitly mention PropertyAliases.txt & PropertyValueAliases.txt.

mathiasbynens · 2016-06-10T09:58:50Z

There is now a standalone repo for this proposal: https://github.com/mathiasbynens/es-regex-unicode-property-escapes Let’s move the discussion over there.

srl295 · 2016-06-10T15:17:33Z

@mathiasbynens OK. I think there's a lot more needed than just support within regex (as important as that is), especially getting the general property and other properties given a codepoint.

In ICU there's u_getIntPropertyValue, so

 … = u_getIntPropertyValue('A', UCHAR_GENERAL_CATEGORY); // == U_UPPERCASE_LETTER (Lu)
 … = u_getIntPropertyValue('A', UCHAR_SCRIPT); // == USCRIPT_LATIN (Latn)

etc. Not proposing this specific API, just trying to get the concept rolling.

littledan · 2016-06-10T17:21:55Z

@srl295 Are there use cases that you have in mind where it is important to use the property value, rather than test whether the character has a particular property value? That would help motivate adding such an API.

srl295 · 2016-06-10T17:54:36Z

@littledan sure, anything that's not just a single boolean:

- Getting the decimal value of a character, to be able to implement parsing/analysis
- Getting the script, combining class, and bidi properties of a character, to be able to do advanced layout
- Getting the general category of a character to determine how it should be processed (the equivalent of isprint etc.)

sure, you could do

if ( /\p{Gc=Lo}/u.test('A') ) {
  …
} else if ( /\p{Gc=Lm}/u.test('A') ) {
  …
} else if ( /\p{Gc=Mc}/u.test('A') ) {
  …
}

… but why?

Actually I would prefer making the property available over extending regex. Because if you have the properties, you can implement regex in JS. But without the properties enumerated, it's a lot harder to do the reverse.
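
To make the cost of the regex-only route concrete, here is a sketch of a getter built purely on property escapes. `generalCategory` and the two tables are hypothetical names, and the sketch assumes an engine with `\p{gc=…}` support under the `u` flag. It has to probe one regex per General_Category value until one matches.

```javascript
// The 30 General_Category short value names.
const GC_VALUES = [
  "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
  "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
  "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn",
];

// Pre-compile one anchored tester per value.
const GC_TESTERS = GC_VALUES.map((gc) => [
  gc,
  new RegExp(`^\\p{gc=${gc}}$`, "u"),
]);

// Return the General_Category of a single code point given as a string.
function generalCategory(ch) {
  for (const [gc, re] of GC_TESTERS) {
    if (re.test(ch)) return gc;
  }
  return undefined; // input was not exactly one code point
}

console.log(generalCategory("A")); // "Lu"
console.log(generalCategory("1")); // "Nd"
```

This works for an enumerated property with a small value set, but it does not extend to numeric properties such as the decimal value mentioned above.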

hashseed · 2016-06-13T05:50:43Z

@srl295 I don't think exposing a way to test for property value for a particular character should affect this proposal.

jungshik · 2017-06-20T01:09:41Z

> @srl295 Are there use cases that you have in mind where it is important to use the property value, rather than test whether the character has a particular property value? That would help motivate adding such an API.

Let's suppose that there's such a use case. Even then, wouldn't it be better to make that API a part of Ecma 262 instead of Ecma 402 (given what has been added to Regex)?

littledan · 2017-06-20T08:53:04Z

It's a somewhat esoteric question which place this lands in; the 262/402 split doesn't correspond to the split in some implementations. For example, V8 does not support normalization or Unicode RegExp properties when "i18n" is compiled out. I suspect it's not the only one.

A rough argument for putting it in 402 is, this is where the library functions for things that aren't methods on existing objects go. And it seems reasonable to make this a property of the Intl object.

jungshik · 2017-06-20T21:00:17Z

> For example, V8 does not support normalization or Unicode RegExp properties when "i18n" is compiled out. I suspect it's not the only one.

Well, the current 'V8_INTL_SUPPORT' needs to be split into two eventually, or its 'boundary' has to be changed, IMHO, once https://bugs.chromium.org/p/v8/issues/detail?id=5500#c9 (replace unibrow with ICU) is resolved. One should be about Ecma402 support (Intl.* API support) and the other should be about whether ICU is used or not (ICU vs unibrow). Depending on how the above V8 bug is resolved, the latter might not be necessary at all (i.e. ICU is always used), in which case Unicode RegExp properties (a part of Ecma262) would always be supported regardless of Intl.* API (Ecma 402) support.

littledan · 2017-08-10T16:53:37Z

For anyone who's looking to contribute to ECMA 402, this is a "shovel ready" project, just in need of a writeup for a concrete API, and presentation to the committee.

srl295 · 2019-10-15T15:07:37Z

in practice, does [..."e𞤫𞤫"].length get optimized somehow, to where it doesn't need actually iterate and such? just curious.

srl295 · 2019-10-15T15:08:24Z

https://github.com/srl295/es-unicode-properties

srl295 · 2019-10-15T15:08:35Z

https://github.com/tc39/template-for-proposals (that’s for 262, not sure if 402 has something different)

this may end up being 262…

mathiasbynens · 2019-10-15T15:28:02Z

> in practice, does [..."e𞤫𞤫"].length get optimized somehow, to where it doesn't need actually iterate and such? just curious.

https://v8.dev/blog/spread-elements
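
For context on what the expression in the question computes, a short sketch: spreading a string iterates it by code point, whereas `length` counts UTF-16 code units. The escape `\u{1E92B}` is the Adlam letter used in the question, and `countCodePoints` is a hypothetical helper that avoids building the intermediate array. Whether an engine optimizes the spread form is a separate matter covered by the linked post.

```javascript
const s = "e\u{1E92B}\u{1E92B}"; // same text as "e𞤫𞤫"

const codeUnits = s.length;        // 5: one unit for "e", two per Adlam letter
const codePoints = [...s].length;  // 3: the string iterator yields code points

// Count code points without allocating an array.
function countCodePoints(str) {
  let n = 0;
  for (const _ of str) n++;
  return n;
}

console.log(codeUnits, codePoints, countCodePoints(s));
```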

leobalter · 2019-10-15T15:43:29Z

I believe this is much more convenient in 262 even w/ the fact it involves a good amount of work from the delegates working w/ ECMA-402.

As long as we have someone to champion it in a TC39 meeting, this should be well clarified.

sffc · 2019-10-15T17:19:13Z

~~Please move further discussion to the proposal repo.~~

https://github.com/srl295/es-unicode-properties/issues

EDIT (May 2021): This proposal is currently stalled, pending more concrete use cases.

my2iu · 2021-05-09T15:48:19Z

Can you make the Unicode properties needed for internationalization and low-level text rendering available? It’s becoming increasingly common to do low-level text rendering in JavaScript because certain APIs like WebGL require it or because people are making more complex web apps like word processors, paint programs, or graphic design tools. Implementing internationalization support for this low-level rendering like the bidi algorithm, vertical orientation, and text shaping requires a lot of these Unicode properties, so it would be great if there were an API making them available. Right now, libraries like Harfbuzzjs simply include a compressed version of the Unicode database in their code, and it’s not too big, but since web browsers already know this information, it would be great if web browsers made it available to JavaScript. Preferably these properties would be fast to access too.
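
As a rough sketch of what "shipping the database" means in practice, a library can store sorted code point ranges and binary-search them. The table below is a tiny hand-written excerpt of Script values for illustration only, and `RANGES` and `lookupScript` are hypothetical names; this is not how Harfbuzzjs actually stores its data.

```javascript
// Sorted, non-overlapping [start, end, value] ranges (illustrative excerpt).
const RANGES = [
  [0x0041, 0x005a, "Latn"], // A-Z
  [0x0061, 0x007a, "Latn"], // a-z
  [0x0410, 0x044f, "Cyrl"], // А-я
  [0x05d0, 0x05ea, "Hebr"], // א-ת
];

// Binary search for the range containing the code point.
function lookupScript(cp) {
  let lo = 0;
  let hi = RANGES.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end, value] = RANGES[mid];
    if (cp < start) hi = mid - 1;
    else if (cp > end) lo = mid + 1;
    else return value;
  }
  return undefined; // not covered by this excerpt
}

console.log(lookupScript("я".codePointAt(0))); // "Cyrl"
```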

ryzokuken · 2021-05-10T04:45:44Z

@my2iu thanks for your comment. Looks like the proposal has not seen a lot of activity lately, but hopefully that would change soon...

srl295 · 2021-05-10T15:29:55Z

@my2iu @ryzokuken that's why I proposed this, but there was a lot of pushback that there weren't real use cases for anything that couldn't be covered by regex. See https://github.com/srl295/es-unicode-properties

my2iu · 2021-05-10T17:03:52Z

Your proposed specification uses a lot of strings in its API, so I’m concerned that it would be slow. Regexes might be fine if they were extended to handle all the weird Unicode properties needed for low level text rendering and if the regexes could be optimized to be fast.

srl295 · 2021-05-10T17:27:22Z

> Your proposed specification uses a lot of strings in its API, so I’m concerned that it would be slow. Regexes might be fine if they were extended to handle all the weird Unicode properties needed for low level text rendering and if the regexes could be optimized to be fast.

Not sure why that would be slow, it's mostly the same strings.

Can you make a concrete list of regex properties that are currently missing? I.e. specific properties from the Unicode spec you need?

Also see https://github.com/srl295/es-unicode-properties#why-not-just-use-regex

my2iu · 2021-05-10T18:08:04Z

Sure. Now that I think about it, something that operates on code points might be the fastest, plus with an API that either has a lot of methods or with a “matcher” object that can be optimized like with regexes. Actually, does JS even have proper support for code points and surrogates yet? The last time I checked, people were still arguing whether JS strings were UCS2, UTF16, or UTF32.

srl295 · 2021-05-10T18:20:04Z

> operates on code points might be the fastest

It could be an overload. The getter could take either a string or an integer codepoint. This is discussed in srl295/es-unicode-properties#5

> does JS even have proper support for code points and surrogates yet?

yes.
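
A sketch of the overload idea using the code point support that already exists in the language (`String.prototype.codePointAt`, `String.fromCodePoint`). The helper `toCodePoint` is hypothetical and only shows how a getter could accept either a string or an integer; it is not taken from the proposal.

```javascript
// Normalize "string or integer" input to a single code point number.
function toCodePoint(input) {
  if (typeof input === "number") {
    if (!Number.isInteger(input) || input < 0 || input > 0x10ffff) {
      throw new RangeError("not a valid code point: " + input);
    }
    return input;
  }
  if (typeof input === "string" && input.length > 0) {
    // codePointAt combines a surrogate pair at index 0 into one code point.
    return input.codePointAt(0);
  }
  throw new TypeError("expected a non-empty string or an integer");
}

console.log(toCodePoint("A").toString(16));          // "41"
console.log(toCodePoint("\u{1E92B}").toString(16));  // "1e92b"
console.log(String.fromCodePoint(0x1e92b).length);   // 2 (a surrogate pair)
```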

sffc · 2021-06-11T18:58:32Z

Strawman from @reed-at-google about what would be necessary for Skia's needs:

https://github.com/google/skia/blob/main/site/docs/dev/design/uni_characterize.md

my2iu · 2021-06-11T22:58:42Z

That strawman API seems a little wonky. I’m not a huge WASM expert, but I don’t think strings are directly transferable from WASM to JS. You call from WASM to JS, and then from JS, you can reach into the WASM memory space to copy the raw bytes into JS and convert things into JS strings. Since WASM code is normally C++, things would normally be UTF-8, but UTF-16 might be possible as well, though WASM may prefer UTF-8. As such, I’m not sure whether minimizing the number of JS to WASM transitions needs to be necessarily reflected in the API design, and having an API that operates on JS strings (as opposed to typed arrays) isn’t necessarily the fastest thing for WASM either. It does bring up some good points about how batching might improve performance, depending on the overhead of JS to browser calls on VMs.
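
The copy step described above can be sketched as follows, assuming the WASM side stores text as UTF-8 and passes a pointer and byte length to JS. Here the memory is filled from JS to keep the example self-contained; `readUtf8`, `ptr`, and the sample text are hypothetical.

```javascript
// Stand-in for a module's linear memory: one 64 KiB page.
const memory = new WebAssembly.Memory({ initial: 1 });

// Pretend the WASM code wrote these UTF-8 bytes at offset `ptr`.
const bytes = new TextEncoder().encode("héllo");
const ptr = 16;
new Uint8Array(memory.buffer).set(bytes, ptr);

// JS side: view the raw bytes in WASM memory and decode them to a JS string.
function readUtf8(mem, offset, length) {
  const view = new Uint8Array(mem.buffer, offset, length);
  return new TextDecoder("utf-8").decode(view);
}

const text = readUtf8(memory, ptr, bytes.length);
console.log(text, bytes.length); // "héllo" 6
```

Every call that takes a JS string therefore implies a decode like this one, which is part of why an API over typed arrays or code point integers may suit WASM callers better.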

sffc · 2021-06-12T01:51:59Z

Request for the "decimal" property: #579

littledan mentioned this issue Nov 16, 2016

Which properties should we include in Unicode escapes? tc39/proposal-regexp-unicode-property-escapes#18

Closed

caridy added the enhancement label Aug 10, 2017

caridy added the help wanted label Aug 10, 2017

brettz9 mentioned this issue Dec 14, 2017

Detecting directionality from script type (or from a language associated with a script) #205

Open

sffc mentioned this issue Oct 15, 2019

Semantics of getUnicodeProperty and multi-character strings srl295/es-unicode-properties#1

Open

sffc added s: in progress Status: the issue has an active proposal and removed s: help wanted Status: help wanted; needs proposal champion labels Oct 15, 2019

tc39 locked and limited conversation to collaborators Oct 15, 2019

sffc added s: comment Status: more info is needed to move forward Proposal Larger change requiring a proposal and removed s: in progress Status: the issue has an active proposal labels Jun 5, 2020

sffc mentioned this issue May 8, 2021

Unicode Database and Related APIs tc39/proposal-intl-segmenter#140

Open

tc39 unlocked this conversation May 8, 2021

sffc mentioned this issue May 8, 2021

Is this proposal still active? srl295/es-unicode-properties#6

Open

sffc mentioned this issue Jun 12, 2021

Add a number parser for parsing numbers using non-latin numerals #579

Closed
