-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for Unicode UTF-8 Identifiers #2120
Support for Unicode UTF-8 Identifiers #2120
Conversation
- Added additional lexer / parser for UTF-8 Identifiers - Added TF setting to scope changes behind TF_UTF8_IDENTIFIERS - Made first pass at Tf string methods to switch on setting and use ASCII vs. Unicode validation rules - Enabled different rules for identifiers vs. prim names (validation for prim names falls back to identifiers when under ASCII rules) - Added UCA sorting for UTF-8 strings and tests against Unicode conformance tests - Pulled common parse layer methods out of textFileFormat parser to enable use by both ASCII and Unicode flavors - Added support for Unicode UTF-8 identifiers / prim names in usdGenSchema
- Added missing files generated by new scripts to cache unicode character class data
I would stay away from using any usd or locale specific collating. You do not want the sort to change depending on the locale. Also (at least on Linux) the libraries have serious errors or bugs, such as ignoring punctuation, so that this is considered the correct sorted order:
(in case you can't figure out what happened, 'a'<'j'<'k') It's ok if you use information about USD code points to choose collating order, but it has to be exactly the same order for all users, and with no rude surprises like the above. Also it is far more important to fix numerical ordering (so |
@spitzak The implementation here is a (unoptimized) hand-rolled implementation of the Unicode Collation Algorithm, which you can find here: https://unicode.org/reports/tr10/. This does not depend on any locale specific functions (and in fact, ignores the very few use cases where locale context is required to ascertain correct ordering), nor does it depend on additional libraries to not introduce additional dependencies into USD. The resulting ordering will be exactly the same for all users - this is the beauty of UCA. However, and specifically with regard to numerical sorting as you might be used to in ASCII dictionary orderings, the default collation tables (which are quite extensive) do not order latin numerics the way you might think ( I expect there to be some hesitance in regard to the different sorting, and that's ok (however do note that the sort order is still exactly the same for every user while still taking into account the complexities of characters outside the non-extended Latin range). The tradeoffs to having support for Unicode vs. not until there is a complete agreement in sort order to me weigh in favor of having, especially since support is behind a run-time switch that defaults to existing behavior. Where the sort order is different in the existing tests should be clear from the changes in the PR. Furthermore, USD provides functionality to customize property ordering, so there is a mitigation that can be used to preserve today's functionality, if that's specifically what is needed in your workflows. |
That is good if the sort order is the same for everybody in all cases. This includes when LC_COLLATE is set to "" or "C"? The others are just annoyances. PLEASE fix the ASCII punctuation so it always sorts before any letters. I have to run with LC_COLLATE set to C to fix this as it makes the names average users write on files not sort in any logical way. Make absolutely sure that "foo.jpg" sorts less than "fooa.jpg". This breaks the sorting of variable-sized numbers far more than the need for strverscmp. As for the numbers, well failure to support that is really really common. But it is hugely annoying to users who do not understand why they have to zero-pad all their numbers. My recommendation is that the sort algorithm purposely split the strings at ASCII punctuation and numbers and sort only the character substrings with the unicode collating algorithm. The other substrings should be considered less than the characters and compared with each other using strverscmp. |
Note that LC_COLLATE="C" is not a good solution, it does not sort characters in the way users expect. But the problem with ASCII puncutation (especially '.' and '-' and '/') is so overwhelmingly bad that I have to run it this way. A hybrid solution that does this correctly but also uses Unicode for the hard part of sorting letters would be ideal. |
Filed as internal issue #USD-7818 |
- Implemented chosen sort algorithm Lexicographic ASCII + UTF-8 Byte Ordering - Removed all functionality related to UCA - Removed all functionality for case mapping - Adjusted tests to reflect original ordering that did not change with new algorithm - Added documentation notes where appropriate for Tf string methods that should only be used in ASCII-only circumstances
- Lofted up TfMakeValidPrimName / TfIsValidPrimName to Sdf
- Removed overly restrictive comments on certain TfString methods and documented specific restrictions - Added TF_DEV_AXIOM statements where required to validate those specific restrictions
- Changed default setting for TF_UTF8_IDENTIFIERS to `true`, enabling new Unicode code paths by default
- Changed logic for ASCII checks to check leading bit - Leveraged ASCII check in _LessImpl - Removed TF_UTF8_IDENTIFIERS from header - all access goes through UseUTF8Identifiers which checks value of TF_UTF8_IDENTIFIERS - Fixed typo on TfDictionaryLessThan - Removed specific overload for checking valid prim name regardless of TF_UTF8_IDENTIFIERS value
- Placed appropriate parser objects back in PXR namespace
- Placed appropriate objects back in PXR namespace - Modified yy files to ensure they emit proper parser context object for yyparse
- Removed empty line in comment
- Changed IsValidUTF8PrimName to IsValidUTF8Name - Removed TF_API from several inlined methods
- Refactored naming methods to streamline implementation - Added comment to test to indicate long term treatment if runtime switch is removed
- Refactored XID_Start / Continue methods to take code point - Refactored usage of these methods to pass code point via dereferenced iterator
@erslavin I think we can close this PR now that it's been broken up and integrated in 24.03. |
/AzurePipelines run |
Pull request contains merge conflicts. |
Old PR |
Description of Change(s)
Initial draft for review session with Pixar based on submitted white paper with additional changes to support separating out original ASCII parser and implementation of new Unicode UTF-8 parser to fully retain original logic.
Caveats:
TF_UTF8_IDENTIFIERS
setting is on; this is due to the TF environment banner being written out, which makes the output differ from the files checked in to compare againstFixes Issue(s)