[WIP] Update to Unicode 16.0.0 #536
case RUN_TYPE_LF_EXT:
    if (is_lower != (type - RUN_TYPE_U_EXT))
        break;
    c = case_conv_ext[data];
    break;
case RUN_TYPE_U_EXT2:
case RUN_TYPE_U_EXT3:
    // TODO
Someone needs to tell me how to implement these, my brain is not capable of writing this kind of obfuscated C code.
Caveat emptor: I'm not 100% sure, but I suspect you can merge the RUN_TYPE_U_EXT2 case with RUN_TYPE_LF_EXT2 and update the guard to:
if (is_lower != (type - RUN_TYPE_U_EXT2))
break;
RUN_TYPE_U_EXT2 should immediately precede RUN_TYPE_LF_EXT2 in the enum declaration for that to work (not the case right now).
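The guard relies on enum arithmetic: when the upper-case variant sits immediately before the lower/fold variant, subtracting the first from `type` gives 0 or 1, which can be compared with `is_lower` directly. A minimal sketch of that idea, assuming `is_lower` is always 0 or 1; the `DEMO_` names and `demo_entry_applies` are hypothetical and not part of libunicode:

```c
/* Hypothetical layout for illustration only: the upper-only type must
 * immediately precede the lower/fold type for the subtraction to work. */
enum demo_run_type {
    DEMO_RUN_TYPE_U_EXT2,  /* entry used when converting to upper case */
    DEMO_RUN_TYPE_LF_EXT2, /* entry used when lowering or case folding */
};

/* Returns 1 when an entry of the given type applies to the requested
 * direction (is_lower == 0: to upper, is_lower == 1: to lower/fold). */
static int demo_entry_applies(int type, int is_lower)
{
    /* type - DEMO_RUN_TYPE_U_EXT2 is 0 for U_EXT2 and 1 for LF_EXT2 */
    return is_lower == (type - DEMO_RUN_TYPE_U_EXT2);
}
```

If the two enumerators are not adjacent, the difference is no longer 0 or 1 and the comparison silently rejects every entry, which is why the enum order matters here.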
For RUN_TYPE_U_EXT3, the code should probably look like this:
case RUN_TYPE_U_EXT3:
res[0] = c - code + case_conv_ext[data >> 8];
res[1] = case_conv_ext[(data >> 4) & 0x0f];
res[2] = case_conv_ext[data & 0x0f];
return 3;
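The shifts and masks in that suggestion read as if `data` packs three 4-bit indices into `case_conv_ext`. The sketch below shows only that packing and unpacking, under the assumption that `data` is a 12-bit field; `demo_pack3` and `demo_unpack3` are hypothetical helpers, not functions from the code base:

```c
#include <stdint.h>

/* Packs three 4-bit indices into one 12-bit value (hypothetical helper). */
static uint32_t demo_pack3(uint32_t i0, uint32_t i1, uint32_t i2)
{
    return ((i0 & 0x0f) << 8) | ((i1 & 0x0f) << 4) | (i2 & 0x0f);
}

/* Splits the value back into its three indices, using the same shifts
 * and masks as the suggested RUN_TYPE_U_EXT3 case. */
static void demo_unpack3(uint32_t data, uint32_t out[3])
{
    out[0] = data >> 8;          /* assumes data fits in 12 bits */
    out[1] = (data >> 4) & 0x0f;
    out[2] = data & 0x0f;
}
```

With 4 bits per index, each of the three characters must live in the first 16 slots of the extension table, so the table generator would have to guarantee that.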
if (ci->u_len == 2 && ci->u_data[1] == 0x399 &&
    ci->l_len == 1) {
    len = 1;
    while (code + len <= CHARCODE_MAX) {
        ci1 = &tab[code + len];
        if (!(ci1->u_len == 2 &&
              ci1->u_data[1] == 0x399 &&
              ci1->u_data[0] == ci->u_data[0] + len &&
              ci1->l_len == 1 &&
              ci1->l_data[0] == ci->l_data[0] + len))
            break;
        len++;
    }
    te->len = len;
    te->type = RUN_TYPE_U2_399_EXT2;
    te->ext_data[0] = ci->u_data[0];
    te->ext_data[1] = ci->l_data[0];
    te->ext_len = 2;
    return;
}
Copied and pasted from the one below, except there is no F component.
That means the if statement below is dead code now, right? Less specific subsumes more specific.
It's probably okay if you swap them. The code itself looks correct at a superficial glance.
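To illustrate the ordering concern: when one condition is a strict subset of another, the narrower test has to run first or it can never match. This is a simplified sketch with a hypothetical `demo_entry` struct and made-up return codes, not the real table-building code:

```c
/* Hypothetical stand-in for a case-info record: lengths of the upper,
 * lower, and fold mappings. */
struct demo_entry { int u_len; int l_len; int f_len; };

/* Wrong order: the general test also matches every entry that has a
 * fold component, so the second branch is unreachable. */
static int demo_classify_general_first(const struct demo_entry *e)
{
    if (e->u_len == 2 && e->l_len == 1)
        return 2; /* general pattern */
    if (e->u_len == 2 && e->l_len == 1 && e->f_len == 1)
        return 1; /* specific pattern, never reached */
    return 0;
}

/* Swapped order: the specific test runs first, the general one catches
 * whatever is left. */
static int demo_classify_specific_first(const struct demo_entry *e)
{
    if (e->u_len == 2 && e->l_len == 1 && e->f_len == 1)
        return 1; /* specific pattern */
    if (e->u_len == 2 && e->l_len == 1)
        return 2; /* general pattern */
    return 0;
}
```

For an entry with all three components, the first function reports the general pattern while the second reports the specific one.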
te->type = RUN_TYPE_L_EXT;
te->ext_len = 1;
te->ext_data[0] = te->data;
break;
With some clever rearranging of the enum values, you can avoid code duplication by folding the three cases into one:
case RUN_TYPE_L:
case RUN_TYPE_LF:
case RUN_TYPE_U:
te->type += RUN_TYPE_L_EXT - RUN_TYPE_L;
te->ext_len = 1;
te->ext_data[0] = te->data;
break;
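The rearrangement this needs is that the three base types and their `_EXT` counterparts appear in the same relative order, so a single constant offset maps each base type to its extended variant. A small sketch under that assumption; the `DEMO_` enum and `demo_to_ext` are hypothetical, and the real enum contains other members:

```c
/* Hypothetical layout: both groups list L, U, LF in the same order. */
enum demo_type {
    DEMO_L, DEMO_U, DEMO_LF,
    DEMO_L_EXT, DEMO_U_EXT, DEMO_LF_EXT,
};

/* Maps a base type to its _EXT variant with one constant offset,
 * mirroring `te->type += RUN_TYPE_L_EXT - RUN_TYPE_L`. */
static int demo_to_ext(int type)
{
    return type + (DEMO_L_EXT - DEMO_L);
}
```

Other enumerators can sit between the two groups as long as each group keeps the same internal order; the offset simply grows.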
@linusg do you plan on revisiting this? FWIW, I don't think this fully closes out #77 until the regex engine is taught about Unicode emoji (tr51). For example, I did a proof of concept where I pregenerate the regex bytecode for Basic_Emoji, Emoji_Keycap_Sequence, etc. and memcpy it when compiling regular expressions, but that results in really bloated bytecode.
I won't get around to it until sometime next week at the earliest; if someone else wants to pick this up before then, that's fine by me :)
No rush. I was just wondering if I should adopt it.
I haven't found the time for this yet and likely won't in the coming weeks either. If you or someone else wants to finish it, please feel free; I'd love to have this merged!
Thank you, much appreciated!
Please review this very carefully, I have no idea what I'm doing :)
Closes #77.
Closes #530.