Invalid unicode characters removed from datagrid #578
Thanks for submitting your first pull request! You are awesome! 🤗
Thank you for opening the PR! On a conceptual level, what if someone has a table containing all Unicode code points, or if those code points are mapped to something else in a font? Would it be better to rewrite the eliding to use a Unicode-aware slice, first converting the string to an array as in https://stackoverflow.com/questions/62341685/javascript-unicode-aware-string-slice/62341816#62341816 ?
Hi! Yes, that is a better approach I think. I just committed a new solution.
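For reference, the Unicode-aware slice discussed above could be sketched roughly as follows. This is only an illustration of the linked idea, not the code committed in this PR; `sliceByCodePoint` and the sample label are hypothetical names.

```javascript
// Hypothetical helper, not the implementation from this PR.
// Array.from iterates a string by code point, so a surrogate pair
// stays together as a single array element.
function sliceByCodePoint(text, start, end) {
  return Array.from(text).slice(start, end).join('');
}

const sampleLabel = 'ab😀cd'; // 5 code points, 6 UTF-16 code units

// substring() counts UTF-16 code units, so this cut lands inside the emoji
// and leaves a lone high surrogate at the end.
const naiveCut = sampleLabel.substring(0, 3);

// Slicing by code point keeps the emoji whole.
const safeCut = sliceByCodePoint(sampleLabel, 0, 3);
```

Note that this works on code points, not grapheme clusters, so sequences such as emoji joined by zero-width joiners could still be split.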
Thanks @nicojapas
Leaving this open so that @krassowski can have a look at the latest version.
Thank you! I will open a follow-up PR with unit tests.
Fixes #456
When dealing with astral symbols and eliding, datagrid generates invalid Unicode characters because of the use of `substring()`, which can cut a surrogate pair in half.
With the regular expression `/[\u{D800}-\u{DFFF}]/gu` we match any character falling within the range of surrogate code points. This includes both high surrogates (0xD800 to 0xDBFF) and low surrogates (0xDC00 to 0xDFFF). Because of the `u` flag, an intact surrogate pair is read as a single code point and does not match, so only the invalid characters resulting from splitting a surrogate pair are removed with `replace()`.
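A minimal sketch of the behaviour described above, assuming a plain string input; `stripLoneSurrogates` and the sample strings are hypothetical and only illustrate the regular expression, not the exact datagrid code.

```javascript
// With the `u` flag the pattern is matched per code point, so an intact
// surrogate pair (one astral code point) does not fall in this range.
const LONE_SURROGATES = /[\u{D800}-\u{DFFF}]/gu;

// Hypothetical helper for illustration only.
function stripLoneSurrogates(text) {
  return text.replace(LONE_SURROGATES, '');
}

// substring() cuts the emoji in half, leaving a lone high surrogate.
const brokenText = 'ab😀cd'.substring(0, 3);

// The lone surrogate is removed...
const cleanedText = stripLoneSurrogates(brokenText);

// ...while a string with an intact pair is left untouched.
const intactText = stripLoneSurrogates('ab😀cd');
```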