[WIP] Update to Unicode 16.0.0 #536

linusg · 2024-09-14T00:14:41Z

Please review this very carefully, I have no idea what I'm doing :)

Closes #77.
Closes #530.

linusg · 2024-09-14T00:15:26Z

libunicode.c

    case RUN_TYPE_LF_EXT:
        if (is_lower != (type - RUN_TYPE_U_EXT))
            break;
        c = case_conv_ext[data];
        break;
+    case RUN_TYPE_U_EXT2:
+    case RUN_TYPE_U_EXT3:
+        // TODO


Someone needs to tell me how to implement these, my brain is not capable of writing this kind of obfuscated C code.

Caveat emptor, I'm not 100% sure but I suspect you can merge the RUN_TYPE_U_EXT2 case with RUN_TYPE_LF_EXT2 and update the guard to:

if (is_lower != (type - RUN_TYPE_U_EXT2)) break;

RUN_TYPE_U_EXT2 should immediately precede RUN_TYPE_LF_EXT2 in the enum declaration for that to work (not the case right now.)

For RUN_TYPE_U_EXT3, the code should probably look like this:

case RUN_TYPE_U_EXT3: res[0] = c - code + case_conv_ext[data >> 8]; res[1] = case_conv_ext[(data >> 4) & 0x0f]; res[2] = case_conv_ext[data & 0x0f]; return 3;

linusg · 2024-09-14T00:18:31Z

unicode_gen.c

+        if (ci->u_len == 2 && ci->u_data[1] == 0x399 &&
+            ci->l_len == 1) {
+            len = 1;
+            while (code + len <= CHARCODE_MAX) {
+                ci1 = &tab[code + len];
+                if (!(ci1->u_len == 2 &&
+                      ci1->u_data[1] == 0x399 &&
+                      ci1->u_data[0] == ci->u_data[0] + len &&
+                      ci1->l_len == 1 &&
+                      ci1->l_data[0] == ci->l_data[0] + len))
+                    break;
+                len++;
+            }
+            te->len = len;
+            te->type = RUN_TYPE_U2_399_EXT2;
+            te->ext_data[0] = ci->u_data[0];
+            te->ext_data[1] = ci->l_data[0];
+            te->ext_len = 2;
+            return;
+        }
+


Copy and pasted from the one below except there is no F component

That means the if statement below is dead code now, right? Less specific subsumes more specific.

It's probably okay if you swap them. The code itself looks correct at a superficial glance.

bnoordhuis · 2024-09-15T10:25:53Z

libunicode.c

    case RUN_TYPE_LF_EXT:
        if (is_lower != (type - RUN_TYPE_U_EXT))
            break;
        c = case_conv_ext[data];
        break;
+    case RUN_TYPE_U_EXT2:
+    case RUN_TYPE_U_EXT3:
+        // TODO


Caveat emptor, I'm not 100% sure but I suspect you can merge the RUN_TYPE_U_EXT2 case with RUN_TYPE_LF_EXT2 and update the guard to:

if (is_lower != (type - RUN_TYPE_U_EXT2)) break;

RUN_TYPE_U_EXT2 should immediately precede RUN_TYPE_LF_EXT2 in the enum declaration for that to work (not the case right now.)

For RUN_TYPE_U_EXT3, the code should probably look like this:

case RUN_TYPE_U_EXT3: res[0] = c - code + case_conv_ext[data >> 8]; res[1] = case_conv_ext[(data >> 4) & 0x0f]; res[2] = case_conv_ext[data & 0x0f]; return 3;

bnoordhuis · 2024-09-15T10:30:22Z

unicode_gen.c

+        if (ci->u_len == 2 && ci->u_data[1] == 0x399 &&
+            ci->l_len == 1) {
+            len = 1;
+            while (code + len <= CHARCODE_MAX) {
+                ci1 = &tab[code + len];
+                if (!(ci1->u_len == 2 &&
+                      ci1->u_data[1] == 0x399 &&
+                      ci1->u_data[0] == ci->u_data[0] + len &&
+                      ci1->l_len == 1 &&
+                      ci1->l_data[0] == ci->l_data[0] + len))
+                    break;
+                len++;
+            }
+            te->len = len;
+            te->type = RUN_TYPE_U2_399_EXT2;
+            te->ext_data[0] = ci->u_data[0];
+            te->ext_data[1] = ci->l_data[0];
+            te->ext_len = 2;
+            return;
+        }
+


That means the if statement below is dead code now, right? Less specific subsumes more specific.

It's probably okay if you swap them. The code itself looks correct at a superficial glance.

bnoordhuis · 2024-09-15T10:36:00Z

unicode_gen.c

+                    te->type = RUN_TYPE_L_EXT;
+                    te->ext_len = 1;
+                    te->ext_data[0] = te->data;
+                    break;


With some clever rearranging of the enum values, you can avoid code duplication by folding the three cases into one:

case RUN_TYPE_L: case RUN_TYPE_LF: case RUN_TYPE_U: te->type += RUN_TYPE_L_EXT - RUN_TYPE_L; te->ext_len = 1; te->ext_data[0] = te->data; break;

bnoordhuis · 2024-09-27T16:19:19Z

@linusg do you plan on revisiting this?

FWIW, I don't think this fully closes out #77 until the regex engine is taught about unicode emoji (tr51). Ex. /\p{Basic_Emoji}/ with the challenge that emoji are variable-length encodings.

I did a proof of concept where I pregenerate the regex bytecode for Basic_Emoji, Emoji_Keycap_Sequence, etc. and memcpy it when compiling regular expressions but that results in really bloated bytecode.

linusg · 2024-09-27T16:22:07Z

I won't get around to it until sometime next week at the earliest, if someone else wants to pick this up before then that's fine by me :)

bnoordhuis · 2024-09-27T16:29:25Z

No rush. I was just wondering if I should adopt it.

linusg · 2024-10-07T10:01:42Z

I didn't find the time for this yet and likely won't in the coming weeks either - if you or someone else wants to finish it please feel free, I'd love to have this merged!

bnoordhuis · 2024-11-09T22:18:12Z

I took a slightly different tack because I wanted to understand why 15.1 behaved differently from 16.0 but the update to 15.1 was done in 416ab66, followed immediately afterwards by an update to 16.0 in 243b968. Thanks anyway for the pull request!

linusg · 2024-11-09T22:35:25Z

Thank you, much appreciated!

linusg commented Sep 14, 2024

View reviewed changes

[WIP] Update to Unicode 16.0.0

4e33f0e

linusg force-pushed the unicode-16 branch from c4ab565 to 4e33f0e Compare September 14, 2024 00:24

bnoordhuis reviewed Sep 15, 2024

View reviewed changes

bnoordhuis mentioned this pull request Nov 9, 2024

Update to Unicode 15.1, 16.0 #666

Merged

bnoordhuis closed this Nov 9, 2024

linusg deleted the unicode-16 branch November 9, 2024 22:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Update to Unicode 16.0.0 #536

[WIP] Update to Unicode 16.0.0 #536

linusg commented Sep 14, 2024

linusg Sep 14, 2024

bnoordhuis Sep 15, 2024

linusg Sep 14, 2024

bnoordhuis Sep 15, 2024

bnoordhuis Sep 15, 2024

bnoordhuis Sep 15, 2024

bnoordhuis Sep 15, 2024

bnoordhuis commented Sep 27, 2024

linusg commented Sep 27, 2024

bnoordhuis commented Sep 27, 2024

linusg commented Oct 7, 2024

bnoordhuis commented Nov 9, 2024

linusg commented Nov 9, 2024

[WIP] Update to Unicode 16.0.0 #536

[WIP] Update to Unicode 16.0.0 #536

Conversation

linusg commented Sep 14, 2024

linusg Sep 14, 2024

Choose a reason for hiding this comment

bnoordhuis Sep 15, 2024

Choose a reason for hiding this comment

linusg Sep 14, 2024

Choose a reason for hiding this comment

bnoordhuis Sep 15, 2024

Choose a reason for hiding this comment

bnoordhuis Sep 15, 2024

Choose a reason for hiding this comment

bnoordhuis Sep 15, 2024

Choose a reason for hiding this comment

bnoordhuis Sep 15, 2024

Choose a reason for hiding this comment

bnoordhuis commented Sep 27, 2024

linusg commented Sep 27, 2024

bnoordhuis commented Sep 27, 2024

linusg commented Oct 7, 2024

bnoordhuis commented Nov 9, 2024

linusg commented Nov 9, 2024