Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Update to Unicode 16.0.0 #536

Closed
wants to merge 1 commit into from
Closed

Conversation

linusg
Copy link
Contributor

@linusg linusg commented Sep 14, 2024

Please review this very carefully, I have no idea what I'm doing :)

Closes #77.
Closes #530.

case RUN_TYPE_LF_EXT:
if (is_lower != (type - RUN_TYPE_U_EXT))
break;
c = case_conv_ext[data];
break;
case RUN_TYPE_U_EXT2:
case RUN_TYPE_U_EXT3:
// TODO
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone needs to tell me how to implement these, my brain is not capable of writing this kind of obfuscated C code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveat emptor, I'm not 100% sure but I suspect you can merge the RUN_TYPE_U_EXT2 case with RUN_TYPE_LF_EXT2 and update the guard to:

if (is_lower != (type - RUN_TYPE_U_EXT2))
    break;

RUN_TYPE_U_EXT2 should immediately precede RUN_TYPE_LF_EXT2 in the enum declaration for that to work (not the case right now.)

For RUN_TYPE_U_EXT3, the code should probably look like this:

case RUN_TYPE_U_EXT3:
    res[0] = c - code + case_conv_ext[data >> 8];
    res[1] = case_conv_ext[(data >> 4) & 0x0f];
    res[2] = case_conv_ext[data & 0x0f];
    return 3;

Comment on lines +1026 to +1046
if (ci->u_len == 2 && ci->u_data[1] == 0x399 &&
ci->l_len == 1) {
len = 1;
while (code + len <= CHARCODE_MAX) {
ci1 = &tab[code + len];
if (!(ci1->u_len == 2 &&
ci1->u_data[1] == 0x399 &&
ci1->u_data[0] == ci->u_data[0] + len &&
ci1->l_len == 1 &&
ci1->l_data[0] == ci->l_data[0] + len))
break;
len++;
}
te->len = len;
te->type = RUN_TYPE_U2_399_EXT2;
te->ext_data[0] = ci->u_data[0];
te->ext_data[1] = ci->l_data[0];
te->ext_len = 2;
return;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy and pasted from the one below except there is no F component

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means the if statement below is dead code now, right? Less specific subsumes more specific.

It's probably okay if you swap them. The code itself looks correct at a superficial glance.

case RUN_TYPE_LF_EXT:
if (is_lower != (type - RUN_TYPE_U_EXT))
break;
c = case_conv_ext[data];
break;
case RUN_TYPE_U_EXT2:
case RUN_TYPE_U_EXT3:
// TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveat emptor, I'm not 100% sure but I suspect you can merge the RUN_TYPE_U_EXT2 case with RUN_TYPE_LF_EXT2 and update the guard to:

if (is_lower != (type - RUN_TYPE_U_EXT2))
    break;

RUN_TYPE_U_EXT2 should immediately precede RUN_TYPE_LF_EXT2 in the enum declaration for that to work (not the case right now.)

For RUN_TYPE_U_EXT3, the code should probably look like this:

case RUN_TYPE_U_EXT3:
    res[0] = c - code + case_conv_ext[data >> 8];
    res[1] = case_conv_ext[(data >> 4) & 0x0f];
    res[2] = case_conv_ext[data & 0x0f];
    return 3;

Comment on lines +1026 to +1046
if (ci->u_len == 2 && ci->u_data[1] == 0x399 &&
ci->l_len == 1) {
len = 1;
while (code + len <= CHARCODE_MAX) {
ci1 = &tab[code + len];
if (!(ci1->u_len == 2 &&
ci1->u_data[1] == 0x399 &&
ci1->u_data[0] == ci->u_data[0] + len &&
ci1->l_len == 1 &&
ci1->l_data[0] == ci->l_data[0] + len))
break;
len++;
}
te->len = len;
te->type = RUN_TYPE_U2_399_EXT2;
te->ext_data[0] = ci->u_data[0];
te->ext_data[1] = ci->l_data[0];
te->ext_len = 2;
return;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means the if statement below is dead code now, right? Less specific subsumes more specific.

It's probably okay if you swap them. The code itself looks correct at a superficial glance.

te->type = RUN_TYPE_L_EXT;
te->ext_len = 1;
te->ext_data[0] = te->data;
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some clever rearranging of the enum values, you can avoid code duplication by folding the three cases into one:

case RUN_TYPE_L:
case RUN_TYPE_LF:
case RUN_TYPE_U:
    te->type += RUN_TYPE_L_EXT - RUN_TYPE_L;
    te->ext_len = 1;
    te->ext_data[0] = te->data;
    break;

@bnoordhuis
Copy link
Contributor

@linusg do you plan on revisiting this?

FWIW, I don't think this fully closes out #77 until the regex engine is taught about unicode emoji (tr51). Ex. /\p{Basic_Emoji}/ with the challenge that emoji are variable-length encodings.

I did a proof of concept where I pregenerate the regex bytecode for Basic_Emoji, Emoji_Keycap_Sequence, etc. and memcpy it when compiling regular expressions but that results in really bloated bytecode.

@linusg
Copy link
Contributor Author

linusg commented Sep 27, 2024

I won't get around to it until sometime next week at the earliest, if someone else wants to pick this up before then that's fine by me :)

@bnoordhuis
Copy link
Contributor

No rush. I was just wondering if I should adopt it.

@linusg
Copy link
Contributor Author

linusg commented Oct 7, 2024

I didn't find the time for this yet and likely won't in the coming weeks either - if you or someone else wants to finish it please feel free, I'd love to have this merged!

@bnoordhuis
Copy link
Contributor

I took a slightly different tack because I wanted to understand why 15.1 behaved differently from 16.0 but the update to 15.1 was done in 416ab66, followed immediately afterwards by an update to 16.0 in 243b968. Thanks anyway for the pull request!

@bnoordhuis bnoordhuis closed this Nov 9, 2024
@linusg linusg deleted the unicode-16 branch November 9, 2024 22:34
@linusg
Copy link
Contributor Author

linusg commented Nov 9, 2024

Thank you, much appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unicode needs to be updated to 16.0.0 Update unicode
2 participants