Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Unicode property escape support to RegExp #1295

Closed

Conversation

jonathanj
Copy link

@jonathanj jonathanj commented Feb 3, 2024

Summary

Extends RegExp to handle \p and \P Unicode property escapes, standalone and within character classes, for Unicode mode. The implementation extends the existing idea of tables of codepoint ranges to cover all of the properties explicitly required by ES262.

The genUnicodeTable.py script was updated to generate code related to the binary and non-binary Unicode properties explicitly mentioned in ES262 (ref); based on the Unicode 15.1.0 data files.

Closes #1027

Test Plan

  • Added regexp_unicode_properties.js to the Hermes test suite
    • Binary properties and the General_Category properties
    • Non-binary properties (Script=Latin, Script_Extensions=Thai)
    • Inverted character class escapes \p{…} and \P{…}
    • All of the above atom escapes in and out of character classes
    • Exercise the parser for the above forms, and incomplete forms
  • test262 suite via hermes/utils/testsuite/run_testsuite.py
    • I had to use --test-skiplist otherwise around 417 tests (in test262/test/built-ins/RegExp/property-escapes) are skipped with the reason "Skipping test with 'const'". This was confusing to me since most of these do pass, but I couldn't find an explainer for this skip.
    • There are some failures here, at least one of which is caused by test262 not being updated for Unicode 15.1.0, which I see as test suite issue rather than an implementation issue.

@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label Feb 3, 2024
@jonathanj
Copy link
Author

jonathanj commented Feb 3, 2024

(I'm including this outside of the PR description, in case that's used for automation purposes.)

Looking for Feedback

  • I have modified genUnicodeTable.py to fetch from HTTP URLs instead of FTP, because it is significantly faster and that starts to matter now that there are about a dozen fetches, just checking if there's a reason these were specifically FTP before?
  • I'm sure there is low-hanging performance fruit (I'm not a C++ dev), I mostly extended what already existed to achieve a working result, but am very keen to hear any ideas.
  • UnicodeData.inc is now several thousand lines longer than it was previously, because of all the category tables, but still far smaller than including ICU. I'm wondering if this comes as a shock, or mostly in line with what was expected to support Unicode properties?
  • The BuildingAndRunning.md docs suggest to use hermes/utils/format.sh to format all code, but that fails for me expecting clang-format version 12, which isn't in macOS Homebrew any more. I configured VS Code to use clang-format 17 (from Homebrew), but I'm not sure if the version difference will pose an issue?
  • The contribution guidelines suggest "squashing your commits", which I took to mean "squash your PR to a single commit", but I realise now that maybe it means "squash commits for related concepts". I'm happy to rebase and break this into a few commits for each piece of work, if it'll make things easier.

@tmikov
Copy link
Contributor

tmikov commented Feb 5, 2024

Hi! Thank you for this contribution, it must have been a lot of work.

We would like to land this in Hermes, but since this is a lot of code, I hope you can bear through our review process, which can be very detail oriented.

At first glance I see some things that likely need to be tweaked, like the static constructors of std::unordered_map and std::string. Another concern is that at first glance it looks like this change adds 300KB of binary size, which is quite significant - a "lean" version of Hermes compiled for Arm32 is about 800KB, so this would be close to a 40% increase. (BTW, I suspect part of the increase is due to the static constructors, so it will likely go down.)

In any case, this is required ECMAScript functionality, so we are very willing to work with you to land this, and I hope you are willing to work with us.

The contribution guidelines suggest "squashing your commits", which I took to mean "squash your PR to a single commit", but I realise now that maybe it means "squash commits for related concepts". I'm happy to rebase and break this into a few commits for each piece of work, if it'll make things easier.

As a first step, I think it would be definitely very helpful if you break this into commits, to make it easier to review and iterate on each.

The contribution guideline to squash everything is generally well intentioned, because it assumes most contributions will be smaller and because Github does not provide the best UI for reviewing a stack of commits. But a sizable contribution like this really needs independent review of its parts.

@jonathanj
Copy link
Author

We would like to land this in Hermes, but since this is a lot of code, I hope you can bear through our review process, which can be very detail oriented.

In any case, this is required ECMAScript functionality, so we are very willing to work with you to land this, and I hope you are willing to work with us.

I'm more than happy to go through your review process, and look forward to it!

At first glance I see some things that likely need to be tweaked, like the static constructors of std::unordered_map and std::string. Another concern is that at first glance it looks like this change adds 300KB of binary size

This is exactly the kind of feedback I was hoping for, thank you. I knew writing this that this wouldn't necessarily be the best approach, but it was fairly easy to understand and iterate on. I'd be glad to see this evolve into a better solution.

As a first step, I think it would be definitely very helpful if you break this into commits, to make it easier to review and iterate on each.

Of course, I'll get to that as soon as I can.

@jonathanj jonathanj force-pushed the 1027-unicode-property-escapes branch 2 times, most recently from 040070d to 187aa18 Compare February 6, 2024 18:27
@jonathanj
Copy link
Author

@tmikov I've restructured the original giant commit into several more digestible ones, I hope it helps make the review easier.

@neildhar
Copy link
Contributor

Thanks for splitting this up, I think the first thing to do here is to try and reduce the size of the additional data so that we get a more realistic picture of the final size. The idea is to eliminate the need to emit static constructors to populate the data structures you're using, particularly unordered_map, vector, and string. Ideally, all of the data should be represented efficiently as constexpr data structures, optionally with some small runtime code to process them if necessary.

To pack all of the data into constexpr data structures, we can make the following transformations:

  1. Replace any std::vector with a separate static constexpr array, since we know their size and contents statically. Then, wherever they are needed, we can use an llvh::ArrayRef to refer to them. (we may need to pull in that llvh header if it isn't already available)
  2. Replace any std::string with a std::string_view. Since these strings are all pointing to constant data, a constexpr string_view referencing the constant string will be much more efficient than dynamically constructing the string.
  3. Convert any unordered_map into a sorted array of key,value pairs. This allows the map to be encoded efficiently, and we can perform binary search to find an element.

I've demonstrated these transformations on a small snippet in https://godbolt.org/z/Mvr7qKxnT, showing how the current data structures can be changed into this format. Notice that the generated output is dramatically smaller, while encoding the same information.

As you make these changes, you can measure the effect they have on the final size. You can configure and build Hermes as follows:

cmake -S hermes -B build_opt -G Ninja -DCMAKE_BUILD_TYPE=MinSizeRel -DHERMES_ENABLE_DEBUGGER=OFF -DHERMES_IS_MOBILE_BUILD=ON
cd build_opt
ninja libhermes

To give you a rough sense of the numbers, the generated binary (on my M1 Mac) goes from 2.4MB before this PR to 2.6MB after it. Hopefully with these changes, the size gap will shrink considerably.

@jonathanj jonathanj force-pushed the 1027-unicode-property-escapes branch 3 times, most recently from a7830ed to 62705ac Compare February 15, 2024 18:35
@jonathanj
Copy link
Author

jonathanj commented Feb 15, 2024

Thanks for the detailed feedback @neildhar, the godbolt output for those two snippets was really interesting! I've made the changes to genUnicodeTable.py and the utility functions to use the proposed data structures.

I ran optimized build you suggested for main, this branch after the changes, and this branch before the changes.

-rwxr-xr-x@ 1 jonathan  staff  1728920 Feb 15 18:30 hermesc    # main
-rwxr-xr-x@ 1 jonathan  staff  1957320 Feb 15 19:48 hermesc    # this branch (after changes)
-rwxr-xr-x@ 1 jonathan  staff  1978584 Feb 15 17:38 hermesc    # this branch (before changes)

The difference between main and this branch after the changes is 249664 bytes, which is about the difference you mentioned in your own builds. The difference between this branch before and after the changes is 21264 bytes, which is smaller than I expected (although I guess ~20KB of assembly just to initialize static data is a lot). Is this about the size difference you expected?

I ran some rough calculations for the size of the generated Unicode data:

# Rough number of 32-bit ints
$ cat hermes/lib/Platform/Unicode/UnicodeData.inc | grep -o '0x' | wc -l
   36742
# Bytes of data
$ echo '36742 * 8' | bc 
293936

Seems like the ~200KB increase in binary size roughly matches up with the amount of raw data that has been added.

As far as I can tell the code now matches your suggested structures, but I'd love to find out that I overlooked something obvious. What do you think the next steps are, trying to improve how the data is stored?

@neildhar
Copy link
Contributor

Thanks for making these changes so quickly, and for being so receptive iterating on this! I agree that the improvement is somewhat smaller than I would have expected. The next step is probably to try and cut down dynamic relocations, it looks like they add tens of kilobytes to the size, and will also slow down startup.

This will be a slightly more involved transformation, since we're trying to have as few pointers as possible in the constant data. Right now, every string and array pointer in the data needs a dynamic relocation to set the actual pointer at runtime, based on where the library is loaded into memory.

The basic idea is to turn the array and string pointers into offsets instead of pointers. We need to merge all of the arrays of a given type into a single giant pool (same for strings). Then in each place where we currently have a pointer (like each string_view and ArrayRef), we would instead keep an offset into that large array, and a length. Depending on how large the arrays and elements are, we can use either 16 bit or 32 bit values.

This adds some marginal cost when accessing the data, but it represents the data much more efficiently (since the offsets will be smaller than pointers), and avoids the startup and size overhead of the dynamic relocations.

I repurposed the same example to demonstrate how this would look: https://godbolt.org/z/novqf4PdK

Note that the size win here will be less obvious in godbolt, but we should see an improvement in the final binary. I expect this to be on the same order as the code size reduction you observed from the previous change.

In the case of strings, it will also be worth doing some basic deduplication, since many of the strings are repeated.

This should get us fairly close to the bare minimum needed to encode this data, which will guide how to include this in the build, and then we can proceed with reviewing the actual logic.

@jonathanj
Copy link
Author

jonathanj commented Feb 24, 2024

Thank you again for the excellent feedback! Near the beginning of this change, having a single large pool did occur to me but I didn't know about dynamic reallocations, so I decided to avoid the complexity and would probably not have had the 3-stage lookup anyway.

Sorry about the longer turnaround on this iteration, between a busy day job and wrapping my head around how to restructure the Python logic, it just took me longer to get to. I ran the optimized build again and this is the new size:

-rwxr-xr-x@ 1 jonathan  staff  1880152 Feb 24 01:26 hermesc

Looks like a 75KB improvement, I ended up going with 16-bit values since none of the offset+size values were larger than 65535:

string_offset_bits: 12
string_size_bits: 5
range_pool_offset_bits: 15
range_pool_size_bits: 10
range_array_pool_offset_bits: 9
range_array_pool_size_bits: 3

Interesting (but maybe not useful) things:

  • clang-format is no longer reformatting the ranges into columns of 3, so the file is now ±20K LOC. If I avoid inserting comments into the range pools, it goes back to creating columns of 3 again and the file is down to ±6K LOC.
  • The original UNICODE_LETTERS, etc. tables (used in the lexer and other places) do duplicate around 500 ranges, they seem to contribute ~200 bytes to the binary size

@jonathanj
Copy link
Author

@neildhar Are there any other changes you'd like me to make, prior to moving on to reviewing the logic?

@neildhar
Copy link
Contributor

neildhar commented Mar 8, 2024

@jonathanj Sorry for the delay, nope, I'll take a look at the actual logic next.

Given the size regression, we'll also want to retain the option to disable this. There are two steps to doing this (which you can do in parallel while I review the rest):

  1. Add a new CMake option HERMES_ENABLE_REGEXP_UNICODE_CHARACTER_CLASS that we can use to turn the feature off.
  2. Dump whether this is available under Features: when you run hermes -version. Check for this in the test262 runner and conditionally exclude this feature when it is not available.

@jonathanj
Copy link
Author

Thanks @neildhar!

  1. Add a new CMake option HERMES_ENABLE_REGEXP_UNICODE_CHARACTER_CLASS that we can use to turn the feature off.

I ended up naming the option HERMES_ENABLE_UNICODE_REGEXP_PROPERTY_ESCAPES, I thought that aligned better with ES262's language and covers both atom and class escapes. Happy to change it if you feel I made a mistake. I think I defined the option with a default of ON, I'm not sure if that was the right default.

Do I need to specify this option in the gradle, podspec, Circle CI config, and other build files, like is done for HERMES_ENABLE_INTL or HERMES_ENABLE_DEBUGGER?

  1. Dump whether this is available under Features: when you run hermes -version. Check for this in the test262 runner and conditionally exclude this feature when it is not available.

Done. There are still quite a few regexp-unicode-property-escapes test262 tests skipped because of the const issue mentioned in the PR description.

@tmikov
Copy link
Contributor

tmikov commented Mar 14, 2024

@jonathanj I think we will keep this option always enabled for OSS releases (we are much more sensitive to binary size internally). If we set it on to default, there is no need to worry about gradle, podspec, etc. About CircleCI: I think we should also keep it on (though I am curious what @neildhar thinks).

@neildhar
Copy link
Contributor

I agree, it seems simplest to turn this on by default, and not override it in any of our build configs. We can specify different defaults when building internally.

Copy link
Contributor

@neildhar neildhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My apologies for the delay in getting to this.

I've reviewed the C++ changes, which mostly look good. It's awesome that you were able to familiarise yourself with our RegExp implementation and the nuances of this feature!

I have one correctness concern, and some organisational suggestions that should hopefully simplify things.

doc/RegExp.md Show resolved Hide resolved
lib/Regex/RegexParser.cpp Show resolved Hide resolved
lib/Regex/RegexParser.cpp Outdated Show resolved Hide resolved
include/hermes/Platform/Unicode/CharacterProperties.h Outdated Show resolved Hide resolved
lib/Regex/RegexParser.cpp Show resolved Hide resolved
lib/Platform/Unicode/CharacterProperties.cpp Outdated Show resolved Hide resolved
}

// Canonicalize the property name.
auto canonicalNameEntry = findNameMapEntry(nameMapStart, nameMapEnd, key);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm finding it a little hard to follow the naming here between name, value, and key. Are there more precise terms we can use? Otherwise, a comment describing what each thing refers to would be helpful.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't say I blame you, I think it's confusing too and would really appreciate finding some better names. Maybe I can explain the terms, and you can decide if there are better names or if perhaps a comment is more useful.

The RegExp Unicode property syntax accepts:

  • Binary properties, where the presence of the property implies a value of true, e.g. ASCII_Hex , Emoji, etc.
    • Additionally, all of General_Category is also allowed to be specified as a binary property, e.g. Lowercase_Letter, N, Symbol, etc.
  • Properties with a value, where the value is not a binary choice, e.g. Scripts=Latin, Script_Extensions=Katakana, General_Category=Symbol, etc.

The "property name" and "property value" terminology comes from ES262.

The "key" terminology is my own, and is imprecise because it is used to resolve a name to its canonical form, by looking it up in canonicalPropertyNameMap_<CATEGORY>, where CATEGORY is determined as follows:

  • If there is no propertyValue (in the PropertyName=PropertyValue sense) given, then CATEGORY assumed to be BinaryProperty
    • If that assumption is incorrect, and it's not in the binary property category, then try GeneralCategory
  • If there is a propertyValue, then CATEGORY is determined by propertyName (e.g. canonicalPropertyNameMap_Script)

In both cases key needs to be resolved according to the canonical name of the value, but in the binary property case the "value" can be thought of something like General_Category=VALUE. I chose key in the sense of value = dict[key] but maybe a term like needle better conveys the "we need to find this" idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Perhaps the best thing to do is to compute canonicalNameEntry in each of the cases, so the logic is easier to follow. It would also avoid duplicating the lookup for binary properties. So for example:

  NameMapEntry *canonicalNameEntry;
  llvh::ArrayRef<RangeMapEntry> rangeMap;

  if (propertyValue.empty()) {
    // There was no property value, this is either a binary property or a value
    // from General_Category, as per `LoneUnicodePropertyNameOrValue`.
    if(canonicalNameEntry = findNameMapEntry(canonicalPropertyNameMap_BinaryProperty, propertyName))
      rangeMap = unicodePropertyRangeMap_BinaryProperty;
    else if(canonicalNameEntry = findNameMapEntry(canonicalPropertyNameMap_GeneralCategory, propertyName))
      rangeMap = unicodePropertyRangeMap_GeneralCategory;
  }
...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great suggestion, thanks! It definitely helps localise the initial setup, before moving onto the range lookups.

I did have to do findMapEntry(llvh::ArrayRef(…), …); to appease compiler, let me know if you have any thoughts on that.

lib/Platform/Unicode/CharacterProperties.cpp Outdated Show resolved Hide resolved
lib/Platform/Unicode/CharacterProperties.cpp Outdated Show resolved Hide resolved
lib/Platform/Unicode/CharacterProperties.cpp Outdated Show resolved Hide resolved
@neildhar
Copy link
Contributor

I'm working through the generation script now, and have some suggestions that would make it much easier to review:

  1. Could we update the unicode version in a separate commit from all of the changes to the script? This will be a useful sanity check that the changes to the script do not change the data emitted for existing fields like UNICODE_LETTERS.
  2. It would be very helpful to have high level comments in the script about the transformations that some of these functions do. Specifically, the format of the incoming data, and the structure that it produces.
  3. The new output should use the same 3-column style for the data as the old output does.

Jonathan Jacobs added 9 commits March 30, 2024 23:18
Download a significant portion of the Unicode database in order to
generate codepoint tables, and property name/value canonicalization,
for the Unicode 15.1.0 binary and non-binary property names/aliases
supported by ES262.

Certain properties that are not explicitly defined in the Unicode
database are derived here, such as "Assigned", and "Any".
Resolves the property name/value to a Unicode range table, then
directly adds all entries from that table to a CodePointSet. It also
handles the inverted case (i.e. `\P`) where all ranges _not_ in a
table are added to the CodePointSet.
This uses the newly named Unicode range tables for the same
functionality, tables such as `UNICODE_LETTER` are left as-is, since
they are more efficient as they are for their purpose.
In Unicode mode, RegExp is now able to parse `\p{...}` and `\P{...}`
atom escapes, both in and out of character classes.

Outside of Unicode mode, `\p` and `\P` are unnecessary escapings of
literal `p` and `P` respectively.

A new error type `InvalidPropertyName` was added, which is returned
for all property-name related parsing/resolving issues. This seems to
be consistent with other JS engines.
Jonathan Jacobs added 4 commits April 3, 2024 17:59
This means a script extension range can be specified as a single
`ArrayRef` that includes the range for the original script.
Now, the class atom contains the ArrayRef for the codepoint ranges,
and the range is validated, and that range is then used separately to
add the codepoints to the bracket.

This simplifies a lot of the validation logic, and requires fewer
conditional returns as a result.
@jonathanj jonathanj force-pushed the 1027-unicode-property-escapes branch from d2e98a1 to 339f5c9 Compare April 3, 2024 20:25
/// Find a matching entry (such as \p NameMapEntry or \p RangeMapEntry) by
/// matching a string \p name against the entry's \p name field.
template <class T>
const T *findNameMapEntry(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this to match the description, and mark it as static.

Suggested change
const T *findNameMapEntry(
static const T *findMapEntry(

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I've done that with the commit that fixes up the logic flow in unicodePropertyRanges.

}

// Canonicalize the property name.
auto canonicalNameEntry = findNameMapEntry(nameMapStart, nameMapEnd, key);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Perhaps the best thing to do is to compute canonicalNameEntry in each of the cases, so the logic is easier to follow. It would also avoid duplicating the lookup for binary properties. So for example:

  NameMapEntry *canonicalNameEntry;
  llvh::ArrayRef<RangeMapEntry> rangeMap;

  if (propertyValue.empty()) {
    // There was no property value, this is either a binary property or a value
    // from General_Category, as per `LoneUnicodePropertyNameOrValue`.
    if(canonicalNameEntry = findNameMapEntry(canonicalPropertyNameMap_BinaryProperty, propertyName))
      rangeMap = unicodePropertyRangeMap_BinaryProperty;
    else if(canonicalNameEntry = findNameMapEntry(canonicalPropertyNameMap_GeneralCategory, propertyName))
      rangeMap = unicodePropertyRangeMap_GeneralCategory;
  }
...

This helps localise the logic/assignment before proceeding to the
general range lookup part.
@jonathanj
Copy link
Author

@neildhar Just checking in if there's anything I still need to address, or can help with for this review.

@neildhar
Copy link
Contributor

@jonathanj Thanks for the ping, there isn't anything outstanding, let me import it so we can more easily do a quick review of the python changes

@facebook-github-bot
Copy link
Contributor

@neildhar has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@neildhar neildhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good, apart from the minor comments, the only real ask is to add a file-level doc-comment in genUnicodeTable.py explaining what the steps are in producing the final output. Once that's done, this should be good to merge.

Thank you for your patience with this process!

Comment on lines 140 to 141
unsigned offset:16;
unsigned size:16;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't need a bitfield here and below, we can just use uint16_t

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I've updated all of these to uint16_t.

"""
binary_property_aliases = {
canonical_name: []
for canonical_name in "ASCII ASCII_Hex_Digit Alphabetic Bidi_Control Bidi_Mirrored "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be preferable to just make this a list to begin with instead of creating and splitting a string. It can also be assigned to a separate variable beforehand to make the loop easier to read.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, I've gone ahead and made that change.

@facebook-github-bot
Copy link
Contributor

@jonathanj has updated the pull request. You must reimport the pull request before landing.

@jonathanj
Copy link
Author

Everything looks good, apart from the minor comments, the only real ask is to add a file-level doc-comment in genUnicodeTable.py explaining what the steps are in producing the final output. Once that's done, this should be good to merge.

I added instructions to the end of the existing file-level comment, that explain invoking the script from the project working root, use with clang-format, and the redirect path. I think I sort of just figured this out, so I actually have no idea if this was the intended process! Let me know if I need to adjust it.

Thank you for your patience with this process!

Thank you for your patience with my submission, I'm very appreciative of the insightful comments and explanations along the way, I learned some interesting new things! It feels good to contribute to an exciting project like Hermes, with supportive maintainers.

@facebook-github-bot
Copy link
Contributor

@neildhar has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jonathanj
Copy link
Author

Hello @neildhar 👋

Just checking in if there's anything else I should do, perhaps rebase on main? The test-e2e check has been failing, as a result of a build issue, which seems to be unrelated to my changes.

@neildhar
Copy link
Contributor

Hey @jonathanj, nothing at the moment, I've rebased it internally and it should get merged soon.

@facebook-github-bot
Copy link
Contributor

@neildhar merged this pull request in dcf8e7b.

facebook-github-bot pushed a commit that referenced this pull request Jul 30, 2024
Summary:
Original Author: jonathan@yoco.co.za
Original Git: dcf8e7b
Original Reviewed By: avp
Original Revision: D56493540

Extends RegExp to handle `\p` and `\P` Unicode property escapes, standalone and within character classes, for Unicode mode. The implementation extends the existing idea of tables of codepoint ranges to cover all of the properties explicitly required by ES262.

The `genUnicodeTable.py` script was updated to generate code related to the binary and non-binary Unicode properties explicitly mentioned in ES262 ([ref](https://tc39.es/ecma262/multipage/text-processing.html#sec-runtime-semantics-unicodematchproperty-p)); based on the Unicode 15.1.0 data files.

Closes #1027

Pull Request resolved: #1295

Pulled By: neildhar

Reviewed By: neildhar

Differential Revision: D60361619

fbshipit-source-id: 70d7293114576d4b05f1dbc2d9b22e553bec089f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed Do not delete this pull request or issue due to inactivity. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for ES2018 Unicode character class escapes
4 participants