diff --git a/data/TestPlans.txt b/data/TestPlans.txt index 47c71a2..3d74993 100644 --- a/data/TestPlans.txt +++ b/data/TestPlans.txt @@ -1022,3 +1022,230 @@ EncodingName: cl100k_base Sample: 🍏🍎🍐🍊🍋🍌🍉🍇🍓🍈🍒🍑 Encoded: [9468, 235, 237, 9468, 235, 236, 9468, 235, 238, 9468, 235, 232, 9468, 235, 233, 9468, 235, 234, 9468, 235, 231, 9468, 235, 229, 9468, 235, 241, 9468, 235, 230, 9468, 235, 240, 9468, 235, 239] +EncodingName: o200k_base +Sample: a a +Encoded: [64, 261] + +EncodingName: o200k_base +Sample: hello +Encoded: [24912] + +EncodingName: o200k_base +Sample: Hello, World! How are you today? 🌍 +Encoded: [13225, 11, 5922, 0, 3253, 553, 481, 4044, 30, 130321, 235] + +EncodingName: o200k_base +Sample: こんにちは、世界!お元気ですか? +Encoded: [95839, 1395, 28428, 3393, 8930, 6753, 25717, 15121, 7128, 4802] + +EncodingName: o200k_base +Sample: Hola, mundo! ¿Cómo estás hoy? 🇪🇸 +Encoded: [49864, 11, 10225, 0, 12873, 46515, 58166, 20502, 30, 173468, 103, 55506, 116] + +EncodingName: o200k_base +Sample: Привет, мир! Как дела? +Encoded: [23881, 131903, 11, 37934, 0, 26029, 78857, 30] + +EncodingName: o200k_base +Sample: 안녕하세요, 세상! 오늘 기분이 어때요? 🇰🇷 +Encoded: [14307, 171731, 11, 28126, 8612, 0, 106820, 11061, 15567, 2186, 21252, 41856, 7952, 30, 173468, 108, 55506, 115] + +EncodingName: o200k_base +Sample: Bonjour, le monde ! Comment ça va aujourd'hui ? 🇫🇷 +Encoded: [45751, 11, 505, 15807, 1073, 15406, 13590, 3423, 32226, 43820, 1423, 173468, 104, 55506, 115] + +EncodingName: o200k_base +Sample: The quick brown fox jumps over 13 lazy dogs. 😺 +Encoded: [976, 4853, 19705, 68347, 65613, 1072, 220, 1311, 29082, 16798, 13, 22861, 118] + +EncodingName: o200k_base +Sample: Здравствуйте, это мой первый раз здесь. Что мне делать? +Encoded: [182298, 11, 8577, 65733, 62134, 4702, 44039, 13, 53319, 27934, 45321, 30] + +EncodingName: o200k_base +Sample: હેલો, વિશ્વ! તમે આજે કેમ છો? 
🇮🇳 +Encoded: [6094, 11954, 1903, 11, 5059, 71706, 15432, 0, 21720, 1138, 107600, 1138, 3058, 38937, 4289, 1903, 30, 173468, 106, 55506, 111] + +EncodingName: o200k_base +Sample: ความรักและการเป็นกันเองเป็นสิ่งสำคัญที่สุดในโลก 🇹🇭 +Encoded: [26224, 1619, 18971, 45798, 11855, 21876, 19373, 3015, 6560, 121316, 21876, 19373, 4406, 2781, 2055, 2795, 75160, 5131, 61134, 3998, 8070, 4406, 21584, 28208, 93469, 173468, 117, 55506, 255] + +EncodingName: o200k_base +Sample: Python vs Java: Which programming language should you learn first? +Encoded: [60502, 10217, 13114, 25, 21580, 23238, 6439, 1757, 481, 4484, 1577, 30] + +EncodingName: o200k_base +Sample: A journey of a thousand miles begins with a single step. - Lao Tzu +Encoded: [32, 12647, 328, 261, 26791, 10753, 18015, 483, 261, 4590, 5983, 13, 533, 144616, 353, 7846] + +EncodingName: o200k_base +Sample: Die Grenzen meiner Sprache bedeuten die Grenzen meiner Welt. 🇩🇪 +Encoded: [8796, 111745, 39103, 89476, 93295, 9627, 1076, 111745, 39103, 23079, 13, 173468, 102, 55506, 103] + +EncodingName: o200k_base +Sample: יש לי כמה שאלות בנוגע לפרויקט החדש שלך. 🇮🇱 +Encoded: [7899, 42151, 60962, 129852, 2433, 34083, 110495, 108591, 181894, 162562, 69019, 13, 173468, 106, 55506, 109] + +EncodingName: o200k_base +Sample: Det är en vacker dag i Sverige. 🇸🇪 +Encoded: [3639, 7706, 469, 323, 17798, 8724, 575, 64714, 13, 173468, 116, 55506, 103] + +EncodingName: o200k_base +Sample: A ∀ x (P(x) → Q(x)) ∧ (∃x P(x)) → ∃x Q(x) +Encoded: [32, 35353, 222, 1215, 350, 47, 4061, 8, 15155, 1486, 4061, 915, 35353, 100, 350, 18085, 225, 87, 398, 4061, 915, 15155, 35353, 225, 87, 1486, 4061, 8] + +EncodingName: o200k_base +Sample: O Brasil é o maior país da América do Sul. 🇧🇷 +Encoded: [46, 15278, 1212, 293, 15966, 11106, 1033, 45086, 621, 27109, 13, 173468, 100, 55506, 115] + +EncodingName: o200k_base +Sample: L'amore è una forza potente che unisce le persone. 
🇮🇹 +Encoded: [43, 30344, 510, 6272, 1969, 125511, 111848, 1378, 537, 48541, 505, 40144, 13, 173468, 106, 55506, 117] + +EncodingName: o200k_base +Sample: Είναι μια ηλιόλουστη ημέρα στην Ελλάδα. 🇬🇷 +Encoded: [10303, 16239, 33246, 13115, 57330, 2097, 85087, 42851, 122278, 7648, 21399, 112618, 13, 173468, 105, 55506, 115] + +EncodingName: o200k_base +Sample: Teslim tarihi yaklaşıyor, projeyi zamanında bitirmemiz gerekiyor. 🇹🇷 +Encoded: [110176, 5406, 162005, 16000, 148409, 17368, 11, 16022, 33468, 30355, 10884, 3546, 2835, 347, 482, 195151, 13, 173468, 117, 55506, 115] + +EncodingName: o200k_base +Sample: Det finnes ingen bedre tid enn nå for å starte noe nytt. 🇳🇴 +Encoded: [3639, 145817, 30430, 56755, 8692, 23075, 19937, 395, 7086, 167203, 49921, 66369, 13, 173468, 111, 55506, 112] + +EncodingName: o200k_base +Sample: Aanvaard de uitdagingen van het leven met moed en vastberadenheid. 🇳🇱 +Encoded: [68832, 84482, 334, 180964, 1164, 1448, 21987, 1421, 137256, 469, 11332, 718, 9519, 7157, 13, 173468, 111, 55506, 109] + +EncodingName: o200k_base +Sample: Chào mừng bạn đến với thế giới của lập trình. 🇻🇳 +Encoded: [1205, 35134, 284, 75104, 22673, 27528, 18019, 46773, 69217, 12153, 96352, 49051, 13, 173468, 119, 55506, 111] + +EncodingName: o200k_base +Sample: Dlaczego warto uczyć się języków obcych? 🇵🇱 +Encoded: [136923, 182265, 82074, 337, 150478, 9721, 140914, 3705, 87043, 1067, 55175, 30, 173468, 113, 55506, 109] + +EncodingName: o200k_base +Sample: E = mc², uma equação famosa na física. 🇵🇹 +Encoded: [36, 314, 36958, 13848, 11, 3030, 2801, 3890, 96317, 898, 50251, 13, 173468, 113, 55506, 117] + +EncodingName: o200k_base +Sample: 你今天遇到什么有趣的事情了吗?🇨🇳 +Encoded: [12370, 47256, 57127, 6946, 10555, 3666, 57922, 1616, 162913, 112451, 4802, 55506, 101, 55506, 111] + +EncodingName: o200k_base +Sample: Nå er det tid for å feire med familie og venner. 
🇳🇴 +Encoded: [45, 592, 1111, 1476, 8692, 395, 7086, 1193, 594, 1475, 39603, 2085, 131786, 13, 173468, 111, 55506, 112] + +EncodingName: o200k_base +Sample: Þetta er góður dagur til að læra eitthvað nýtt. 🇮🇸 +Encoded: [7860, 20476, 1111, 91455, 17041, 8724, 330, 3453, 5993, 29333, 614, 180350, 49697, 1037, 13, 173468, 106, 55506, 116] + +EncodingName: o200k_base +Sample: გამარჯობა! როგორ ხართ დღეს? 🇬🇪 +Encoded: [165502, 69106, 24045, 0, 57298, 10892, 10875, 55856, 30, 173468, 105, 55506, 103] + +EncodingName: o200k_base +Sample: Mā te whakawhiti kōrero e whai hua ai tātou. 🇳🇿 +Encoded: [44, 2485, 729, 145047, 174352, 92760, 41643, 319, 101354, 76899, 8440, 260, 36813, 283, 13, 173468, 111, 55506, 123] + +EncodingName: o200k_base +Sample: Это был незабываемый опыт, который я буду помнить всегда. +Encoded: [63250, 11066, 37028, 66181, 42684, 6770, 67711, 11, 21903, 3277, 61571, 179329, 34056, 13] + +EncodingName: o200k_base +Sample: Διαβάζοντας βιβλία, εμπλουτίζουμε τον εαυτό μας με γνώσεις. +Encoded: [16611, 5690, 63324, 9153, 92025, 164613, 113428, 11, 109925, 85087, 30711, 9153, 33850, 20894, 4278, 727, 75653, 35170, 9173, 8558, 954, 92830, 13] + +EncodingName: o200k_base +Sample: A számítástechnika világa tele van izgalmas lehetőségekkel. 🇭🇺 +Encoded: [32, 70578, 5348, 449, 168649, 3113, 11748, 449, 2225, 5443, 1164, 4297, 8298, 4227, 51215, 53922, 95521, 108844, 13, 173468, 255, 55506, 118] + +EncodingName: o200k_base +Sample: Vždy je dobré mít plán B, pokud něco nevyjde. 🇨🇿 +Encoded: [53, 99728, 1264, 54560, 377, 98517, 192660, 418, 11, 118907, 134570, 453, 16670, 56244, 13, 173468, 101, 55506, 123] + +EncodingName: o200k_base +Sample: Dragostea e un sentiment minunat care ne unește pe toți. 🇷🇴 +Encoded: [25765, 564, 12932, 319, 537, 39160, 182050, 266, 2631, 453, 2463, 74495, 1045, 316, 20660, 13, 173468, 115, 55506, 112] + +EncodingName: o200k_base +Sample: دیکھو، آسمان میں کتنی تارے ہیں! 
🇵🇰 +Encoded: [547, 55459, 417, 1368, 3382, 11248, 1195, 6431, 144008, 14148, 112711, 1531, 12406, 0, 173468, 113, 55506, 108] + +EncodingName: o200k_base +Sample: Nenda polepole na ujifunze kila siku. 🇹🇿 +Encoded: [45, 5968, 25059, 112657, 898, 62112, 366, 119365, 52237, 54647, 13, 173468, 117, 55506, 123] + +EncodingName: o200k_base +Sample: Каква е твоята любима храна? 🇧🇬 +Encoded: [29831, 2224, 2404, 70888, 8886, 2734, 13230, 27621, 2442, 73698, 30, 173468, 100, 55506, 105] + +EncodingName: o200k_base +Sample: Sträva alltid efter att bli en bättre version av dig själv. +Encoded: [3504, 450, 2873, 63479, 22852, 1927, 27757, 469, 100580, 3926, 1452, 3807, 71554, 13] + +EncodingName: o200k_base +Sample: Філософія - це наука про знання. 🇺🇦 +Encoded: [10334, 17058, 107824, 30929, 533, 54543, 1235, 59929, 4964, 41072, 17561, 13, 173468, 118, 55506, 99] + +EncodingName: o200k_base +Sample: Το πρόγραμμα αυτό είναι πολύ ενδιαφέρον. 🇬🇷 +Encoded: [63423, 198704, 43845, 17278, 60896, 162904, 171319, 13, 173468, 105, 55506, 115] + +EncodingName: o200k_base +Sample: 4gH@!0sT*#(9^%$[x{}j+|Yz6;Q]~8 +Encoded: [19, 70, 39, 31, 0, 15, 82, 51, 9, 2, 7, 24, 61, 4, 3, 58, 87, 12083, 73, 10, 91, 56, 89, 21, 26, 48, 60, 93, 23] + +EncodingName: o200k_base +Sample: wNb)I<>#:i^P]*cR8ytUx1Q`6O@z/ +Encoded: [86, 67111, 8, 40, 28052, 97210, 72, 61, 47, 18579, 66, 49, 23, 5240, 182325, 16, 48, 63, 21, 46, 31, 89, 14] + +EncodingName: o200k_base +Sample: ÄÜö¿¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ +Encoded: [12921, 8858, 573, 11986, 20407, 61242, 18943, 43470, 43625, 41468, 18596, 64259, 19742, 25661, 4244, 74285, 8980, 98049, 6793, 32438, 13848, 45681, 14737, 39621, 69022, 5366, 68284, 84125, 11006, 1924, 43439, 27124, 75174, 11986] + +EncodingName: o200k_base +Sample: ƒšŠŒŽƒšŠŒŽƒšŠŒŽƒšŠŒŽƒšŠŒŽƒšŠŒŽ +Encoded: [99760, 812, 7490, 189136, 12915, 99760, 812, 7490, 189136, 12915, 99760, 812, 7490, 189136, 12915, 99760, 812, 7490, 189136, 12915, 99760, 812, 7490, 189136, 12915, 99760, 812, 7490, 189136, 
12915] + +EncodingName: o200k_base +Sample: 5ħÅŸēýïūē$%#^*()_+{[ö&!@#?>|,.<> +Encoded: [20, 5762, 13631, 198355, 6238, 1840, 9954, 7637, 6238, 3, 4, 2, 61, 9, 416, 62, 10, 90, 58, 573, 5, 0, 31, 2, 10730, 91, 26887, 28052] + +EncodingName: o200k_base +Sample: 1B4t#%&*()_+dF5g^hJk7LmN0pQrS<>? +Encoded: [16, 33, 19, 83, 2, 4, 5, 9, 416, 62, 10, 67, 37, 20, 70, 61, 71, 41, 74, 22, 196093, 45, 15, 79, 135047, 50, 28052, 30] + +EncodingName: o200k_base +Sample: ¬§±²³µ¶·¹ºª«»¦©¯°±!@#$%^&*()_+ +Encoded: [74285, 18596, 32438, 13848, 45681, 39621, 69022, 5366, 84125, 11006, 25661, 4244, 1924, 41468, 19742, 98049, 6793, 32438, 0, 31, 108156, 108254, 5, 9, 416, 62, 10] + +EncodingName: o200k_base +Sample: 8mR5*w7^a$!F(0%#J9@X6vZ1)nU3]_Y/ +Encoded: [23, 76, 49, 20, 147727, 22, 61, 64, 3, 0, 37, 7, 15, 4, 2, 41, 24, 31, 55, 21, 85, 57, 16, 143612, 52, 18, 167793, 56, 14] + +EncodingName: o200k_base +Sample: 😊😀😁😂🤣😃😄😅😆😉😊😋😎😍😘😗😙😚☺️🙂🤗🤔 +Encoded: [102630, 84083, 156437, 41736, 92916, 13865, 225, 13865, 226, 13865, 227, 13865, 228, 72041, 102630, 13865, 233, 13865, 236, 74762, 122588, 13865, 245, 13865, 247, 13865, 248, 155014, 15148, 37459, 50378, 245, 50378, 242] + +EncodingName: o200k_base +Sample: 🤨😐😑😶🙄😏😣😥😮🤐😯😪😫😴😌🤓😛😜😝🤤 +Encoded: [50378, 101, 13865, 238, 13865, 239, 13865, 114, 70125, 226, 13865, 237, 13865, 96, 13865, 98, 13865, 106, 50378, 238, 13865, 107, 13865, 103, 13865, 104, 13865, 112, 13865, 234, 50378, 241, 13865, 249, 13865, 250, 13865, 251, 50378, 97] + +EncodingName: o200k_base +Sample: 😒😓😔😕🙃🤑😲😷🤒🤕🤢🤧😈👿👹👺💀☠️ +Encoded: [13865, 240, 13865, 241, 13865, 242, 13865, 243, 70125, 225, 4103, 11566, 13865, 110, 13865, 115, 50378, 240, 50378, 243, 50378, 95, 50378, 100, 13865, 230, 28823, 123, 28823, 117, 28823, 118, 31446, 222, 8434, 254, 15148] + +EncodingName: o200k_base +Sample: 😾😿🙀😽😼😻🙈🙉🙊👶👦👧👨👩👴👵👨‍⚕️👩‍⚕️ +Encoded: [13865, 122, 13865, 123, 70125, 222, 13865, 121, 13865, 120, 13865, 119, 70125, 230, 70125, 231, 70125, 232, 28823, 114, 28823, 99, 28823, 100, 28823, 101, 28823, 102, 
28823, 112, 28823, 113, 28823, 101, 2524, 84396, 243, 15148, 28823, 102, 2524, 84396, 243, 15148]
+
+EncodingName: o200k_base
+Sample: 🌞🌝🌚🌛🌜🌙⭐️🌟💫✨🔥💥☄️🌈☀️🌤️⛅️🌥️
+Encoded: [64364, 252, 64364, 251, 64364, 248, 64364, 249, 64364, 250, 64364, 247, 62160, 15148, 64364, 253, 31446, 104, 97375, 96606, 31446, 98, 8434, 226, 15148, 64364, 230, 8434, 222, 15148, 64364, 97, 15148, 158, 249, 227, 15148, 64364, 98, 15148]
+
+EncodingName: o200k_base
+Sample: 🍏🍎🍐🍊🍋🍌🍉🍇🍓🍈🍒🍑
+Encoded: [102415, 237, 102415, 236, 102415, 238, 102415, 232, 102415, 233, 102415, 234, 102415, 231, 102415, 229, 102415, 241, 102415, 230, 102415, 240, 102415, 239]
\ No newline at end of file
diff --git a/src/GptEncoding.test.ts b/src/GptEncoding.test.ts
index 5838aa9..847a56d 100644
--- a/src/GptEncoding.test.ts
+++ b/src/GptEncoding.test.ts
@@ -29,12 +29,12 @@ const results = {
   o200k_base: {
     space: [220],
     tab: [197],
-    'This is some text': [2_028, 374, 1_063, 1_495],
-    indivisible: [485, 344, 23_936],
+    'This is some text': [2_500, 382, 1_236, 2_201],
+    indivisible: [521, 349, 181_386],
     'hello 👋 world 🌍': [24_912, 61_138, 233, 2_375, 130_321, 235],
     decodedHelloWorldTokens: ['hello', ' ', '👋', ' world', ' ', '🌍'],
     'toString constructor hasOwnProperty valueOf': [
-      935, 916, 9220, 853, 18555, 3895, 1432, 2566,
+      935, 916, 9_220, 853, 18_555, 3_895, 1_432, 2_566
     ],
     'hello, I am a text, and I have commas. a,b,c': [
       24_912, 11, 357, 939, 261, 2_201, 11, 326, 357, 679, 179_663, 13, 261,