Support for UTF-8 Character Class Processing #2687

Closed

Contributor

@erslavin erslavin commented Sep 13, 2023

Description of Change(s)

  • Added UTF-8 utility functions to tf that read UTF-8 encoded characters and check whether code points belong to the XID_Start / XID_Continue character classes
  • Added pre-processing script to generate character class static data structures for identifying code points in XID_Start / XID_Continue from source DerivedCoreProperties.txt
  • Added tests for XID_Start / XID_Continue code point validity
  • Added Unicode 15.1.0 database files as reference input for the pre-processing script
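
For readers unfamiliar with the data file, here is a minimal sketch of the kind of pre-processing described above. It is not the script from this PR. It assumes the usual `DerivedCoreProperties.txt` line shape (`first..last ; Property # comment`, or a single code point instead of a range), and the names `parse_ranges`, `contains`, and `SAMPLE` are hypothetical. The sample lines are a tiny hand-picked excerpt, not the full data.

```python
import bisect

# Hand-picked excerpt in the DerivedCoreProperties.txt line format (illustrative only).
SAMPLE = [
    "# comment lines and blank lines are skipped",
    "",
    "0041..005A    ; XID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "0061..007A    ; XID_Start # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z",
    "00AA          ; XID_Start # Lo       FEMININE ORDINAL INDICATOR",
    "0030..0039    ; XID_Continue # Nd  [10] DIGIT ZERO..DIGIT NINE",
    "0041..005A    ; XID_Continue # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
]


def parse_ranges(lines, prop):
    """Return sorted, merged (first, last) code point ranges for one property."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point: range of one
        ranges.append((lo, hi))
    ranges.sort()
    merged = []
    for lo, hi in ranges:
        if merged and lo <= merged[-1][1] + 1:
            # Adjacent or overlapping with the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def contains(ranges, code_point):
    """Binary search for the last range starting at or before code_point."""
    i = bisect.bisect_right(ranges, (code_point, 0x10FFFF)) - 1
    return i >= 0 and ranges[i][0] <= code_point <= ranges[i][1]


xid_start = parse_ranges(SAMPLE, "XID_Start")
xid_continue = parse_ranges(SAMPLE, "XID_Continue")
```

A sorted range list with binary search is only one possible layout for the generated static data; a bitset or a two-stage table would trade memory for lookup speed differently.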

This PR is stacked on top of "Support for UTF-8 chars in TfDictionaryLessThan" #2673

Fixes Issue(s)

  • I have verified that all unit tests pass with the proposed changes
  • I have submitted a signed Contributor License Agreement

@jesschimein
Contributor

Filed as internal issue #USD-8702

@erslavin erslavin changed the title from "Support for UTF-8 chars in Paths" to "Support for UTF-8 Character Class Processing" on Nov 1, 2023
@erslavin
Contributor Author

erslavin commented Nov 1, 2023

Backed out the identifier-specific changes so that this commit deals only with the data structures and logic needed to process Unicode character classes, plus the code that iterates over UTF-8 encoded characters and converts them to code points for character class containment checks.
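
As an illustration of the "iterate over UTF-8 and convert to code points" step, here is a rough sketch of a strict decoder. It is not the tf implementation; `iter_code_points` is a hypothetical name, and raising on malformed input is an assumption made for the example (a real utility might substitute U+FFFD instead).

```python
def iter_code_points(data: bytes):
    """Yield code points decoded from UTF-8 bytes; raise ValueError if malformed."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # The lead byte determines the payload bits, the number of continuation
        # bytes, and the smallest code point that legitimately needs this length.
        if b0 < 0x80:
            cp, extra, min_cp = b0, 0, 0
        elif 0xC2 <= b0 <= 0xDF:
            cp, extra, min_cp = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, extra, min_cp = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            cp, extra, min_cp = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError(f"invalid lead byte 0x{b0:02X} at offset {i}")
        if i + extra >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        for j in range(1, extra + 1):
            b = data[i + j]
            if (b & 0xC0) != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + j}")
            cp = (cp << 6) | (b & 0x3F)
        # Reject overlong encodings, UTF-16 surrogates, and out-of-range values.
        if cp < min_cp or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError(f"invalid code point U+{cp:04X} at offset {i}")
        yield cp
        i += extra + 1


code_points = list(iter_code_points("aé€😀".encode("utf-8")))
```

Each yielded code point can then be passed to a character class lookup, which keeps the decoding and the class tables independent of each other.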

- Modified path parser to accept UTF-8 characters for Identifiers
- Modified identifier validity rules to accept valid UTF-8 identifiers
  (XID_Start followed by XID_Continue)
- Added UTF-8 utility functions to tf
- Added tests for UTF-8 based paths
- Added reference UnicodeDatabase.txt for character classes
- Added UTF-8 utility functions to tf to read UTF-8 encoded
  characters and check if code points belong in XID_Start /
  XID_Continue character classes
- Added pre-processing script to generate character class static
  data structures for identifying code points in XID_Start /
  XID_Continue from source DerivedCoreProperties.txt
- Added tests for XID_Start / XID_Continue code point validity
- Added Unicode 15.1.0 database files as reference input
  for the pre-processing script
- Added initializer list to TfStaticData constructors
- Removed formatting changes
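
The identifier rule named in the commit message (XID_Start followed by XID_Continue) can be sketched as below. This is only an illustration, not the validity check from this PR. To stay self-contained it borrows Python's own `str.isidentifier()` as a stand-in for the generated tables, on the assumption that CPython accepts XID_Start or `_` as a first character and XID_Continue afterwards. The Unicode version is therefore whatever the running interpreter ships, not necessarily 15.1.0, and all function names are hypothetical.

```python
def is_xid_start(ch: str) -> bool:
    """Approximate XID_Start using Python's identifier rules (assumption)."""
    # Python additionally allows "_" as a first character, so exclude it here.
    return ch != "_" and ch.isidentifier()


def is_xid_continue(ch: str) -> bool:
    """Approximate XID_Continue: ch is accepted after a known-valid start."""
    return ("a" + ch).isidentifier()


def is_valid_identifier(name: str) -> bool:
    """True if name is one XID_Start code point followed by XID_Continue ones."""
    if not name:
        return False
    # Python strings iterate by code point, so no explicit UTF-8 decoding is needed.
    return is_xid_start(name[0]) and all(is_xid_continue(c) for c in name[1:])
```

Under the rule exactly as stated, a leading underscore is rejected because `_` is in XID_Continue but not in XID_Start; whether identifiers should also allow a leading `_` is a separate decision that this sketch does not make.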
@erslavin
Contributor Author

Replaced by #2830
