Relax bare key restrictions to allow additional unicode letters and numbers #687

Closed
marzer opened this issue Dec 10, 2019 · 44 comments · Fixed by #891

@marzer
Contributor

marzer commented Dec 10, 2019

Issue

TOML's "bare key" syntax is too restrictive. People who regularly use characters from languages other than English should be able to do so in TOML keys without additional gymnastics.

I know there's already been a lot of discussion about this but much of it was from when TOML was less established and I think it warrants revisiting.

Proposed change

Expand the set of characters allowed in bare keys to include letters and numbers from the entire Unicode space, similar to how identifiers are handled in other Unicode-compliant contexts (e.g. Python, JavaScript, etc.). Specifically:

  • Allow codepoints from categories Ll, Lm, Lo, Lt, Lu, Nd and Nl anywhere in a bare key
  • Allow codepoints from categories Mc and Mn anywhere in a bare key except as the first character
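
As a rough illustration of the two rules above, here is a minimal Python sketch built on the standard `unicodedata` module. The function name `is_valid_bare_key` is hypothetical, the sketch assumes the existing ASCII `_` and `-` stay allowed, and the result for rarely used code points depends on the Unicode version bundled with the Python build.

```python
import unicodedata

# Categories allowed anywhere in a bare key under the proposal.
ANYWHERE = {"Ll", "Lm", "Lo", "Lt", "Lu", "Nd", "Nl"}
# Combining marks: allowed anywhere except as the first code point.
NOT_FIRST = {"Mc", "Mn"}


def is_valid_bare_key(key: str) -> bool:
    """Hypothetical check of a bare key against the proposed rules."""
    if not key:
        return False
    for index, ch in enumerate(key):
        if ch in "_-":
            continue  # assumed to remain valid, as in the current spec
        category = unicodedata.category(ch)
        if category in ANYWHERE:
            continue
        if category in NOT_FIRST and index > 0:
            continue
        return False
    return True
```

Under these assumptions a key such as `größe` passes, while a key containing a space, a dot, or a leading combining mark is rejected and would still need quotes.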

Rationale

After reading much of the existing discussion on the issue, I've identified the points below as being the main objections. I've written a counterpoint for each.

"ASCII-only is easy to understand"

Allowing Unicode letters and numbers wouldn't change the understandability of the written word in "mostly-ASCII" contexts, excepting maybe people from English-centric countries encountering characters they otherwise rarely see and being unsure how to pronounce them. I'm one of those people and my brain seems to consume them just fine. And it's almost certainly going to improve the understandability of bare keys to people for whom an ASCII environment is not their regular one.

It also wouldn't change the semantic/syntactic understandability of the language; I'm only advocating relaxing the spec to allow letter and number characters, not anything that might be confused for a language construct (no math symbols, for instance).

"Guides users to choose simple key names"

See above. I'd argue that the keys would be no less simple with this change. I live and work in a European country and a number of my friends and colleagues have non-ASCII letters in their name (e.g. ä). I doubt they consider their names to be complex; I certainly don't. If anything, by forcing people to jump through hoops just to type in their language, we're actually making the key names more complex w.r.t. cognitive load.

"Eliminate any weirdness that could come from having to deal with undelimited Unicode"

The TOML spec dictates UTF-8, not UTF-8-ish. UTF-8 is a solved problem at this point. If a parser doesn't correctly detect and handle malformed UTF-8, I'd argue that the parser needs fixing, not that we should bend over to accommodate users who are using crap tools and libraries. It's such a solved problem that you can even portably consume it using a state machine and validate it using vector intrinsics.

"Keys should be identifier-like"

Despite the fact that the concept of an "identifier" isn't a thing in TOML, I'll concede that in some situations this might be a concern. A reasonable example is using TOML in code generation contexts: if you used TOML keys to inform variable names, historically you'd have run into issues with non-ASCII characters in many languages, though this is no longer true. Even good old C++ supports unicode characters in identifiers on modern compilers.

...all of which is rendered moot by the fact that TOML supports hyphens in bare keys which are often invalid in identifier contexts, so this objection is a non-starter anyway.

"It complicates implementation"

It really doesn't. Many implementations will be able to leverage built-in helper functions or libraries for working with Unicode. For those that can't, I've put my money where my mouth is and implemented this as a proof-of-concept in my own TOML parser and I'm happy for my code to be used as a starting point:

Of course you might argue that simply accepting UTF-8 bytes from a TOML implementation is not an option for everyone, and you'd be right; there will always be situations where only ASCII makes sense (e.g. legacy codebases). I'd respond by pointing out that detecting non-ASCII characters in a character stream is laughably trivial. Applications requiring ASCII-only can easily enforce this themselves.
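
To make the "trivial" claim concrete, here is a small Python sketch of an application-side ASCII check. The helper name `first_non_ascii` is an assumption for illustration and is not part of any TOML library.

```python
def first_non_ascii(text: str) -> int:
    """Return the index of the first non-ASCII character, or -1 if there is none."""
    for index, ch in enumerate(text):
        if ord(ch) > 0x7F:  # everything above U+007F is outside ASCII
            return index
    return -1
```

An application that must stay ASCII-only could run a check like this over each key after parsing and reject the document if any index is not -1.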

@marzer changed the title from "Relax bare key character restrictions to allow some additional 'letter-like'" to "Relax bare key character restrictions to allow some additional 'letter-likes'" on Dec 10, 2019
@ChristianSi
Contributor

ChristianSi commented Dec 11, 2019

I wholeheartedly support this. Identifiers in non-English languages should not be discriminated against in TOML (and having to put them in quotes is a form of discrimination). [Note: the following refers to the original form of the proposal, which since then has been considerably extended.] Admittedly, with this proposal this would still only be the case for languages using the Latin script, but not for Russian, Arabic, Chinese etc. But it would still be a step in the right direction – and I see that there might be issues with allowing, say, Cyrillic letters, because they might be used to spoof a key that looks like ASCII but actually isn't. With Latin diacritics, this risk is much lower.

For completeness, I'd suggest also supporting Latin Extended-B (Pan-Nigerian alphabet, Pinyin, Romanian), Latin Extended Additional (Vietnamese), and Latin Extended-C (Shona, a Bantu language).

@lmna

lmna commented Dec 11, 2019

So there would be no simple rule for a human writer to decide if quotes are necessary.

@lmna

lmna commented Dec 11, 2019

Even good old C++ supports unicode characters in identifiers on modern compilers.

Do "modern C++ compilers" support most of Unicode, or only a chosen subset of Latin Extended?

@marzer
Contributor Author

marzer commented Dec 11, 2019

@ChristianSi Certainly there's lots of additional characters we could add. My list of suggestions was in no way meant to be exhaustive and I'm hoping that a more useful set of ranges is borne out of discussion. Being an Australian who only speaks English does limit my perspective a bit here!


@lmna

So there would be no simple rule for a human writer to decide if quotes are necessary.

Technically-speaking the rule could be: quotes if you need whitespace, an escape code, or a TOML-reserved character, otherwise anything goes. My feeling is that requiring users to think about this at all is getting away from the design goals of TOML and drifting too far into Think-Like-A-Programmer territory, which risks defeating the purpose of a simple config file format that intends to "just work" the way people would expect in the layman case.

Do "modern C++ compilers" support most of Unicode, or only a chosen subset of Latin Extended?

Absolutely no idea. Here's a live demo of some Unicode on Clang, GCC and MSVC; I encourage you to experiment. I'm certain if we needed a more definitive answer we could read the compiler source code (Clang or GCC) or ask the relevant developers (MSVC).

@pradyunsg
Member

I'm on board.

I like the way Python handles alphabetical characters:

Alphabetic characters are those characters defined in the Unicode character database as “Letter”

That should be doable for us, but I wonder how much this could complicate TOML implementations which are currently just doing a simple ASCII-value-range check. This also means we'd have to add a lot to TOML's ABNF.

@ChristianSi
Contributor

@pradyunsg: I too would be fine with saying "Bare keys may contain arbitrary Unicode letters as well as ASCII digits, underscores, and dashes". But in effect this would likely mean that implementations would have to depend on some kind of Unicode library – easy in Python, where the isalpha() check is built in, probably not so easy in some other languages.
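
For reference, the rule quoted above can be sketched in Python in a few lines, because `str.isalpha()` is documented as accepting exactly the general categories Lm, Lt, Lu, Ll and Lo. The function name `is_letter_key` is hypothetical and only illustrates the idea.

```python
ASCII_EXTRAS = set("0123456789_-")  # ASCII digits, underscore and dash


def is_letter_key(key: str) -> bool:
    """Sketch: arbitrary Unicode letters plus ASCII digits, '_' and '-'."""
    return bool(key) and all(ch.isalpha() or ch in ASCII_EXTRAS for ch in key)
```

Note that this simple rule rejects combining marks and non-ASCII digits, since neither is in a Letter category.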

An advantage of @marzer 's original proposal, or my somewhat modified one, is that it would be easy to enumerate the affected ranges manually, in the ABNF or in code. With arbitrary Unicode letters this likely becomes effectively impossible – especially since the ranges would have to be extended with each new version of the Unicode standard.

But, of course, we might also decide to allow letters in arbitrary languages and scripts, not just Latin-based ones, and accept the Unicode dependency.

@ChristianSi
Contributor

Here's a listing of all the Unicode Character Categories and all the characters that belong to each one. Unsurprisingly, it's quite long!

@marzer
Contributor Author

marzer commented Dec 12, 2019

@ChristianSi Given that TOML is supposed to be UTF-8 I'm inclined to think that requiring implementations to use Unicode machinery, hand-rolled or otherwise, isn't really a big deal, regardless of the direction this proposal takes.

@pradyunsg I too like the Python approach, and I don't think it would be too hard to implement. I'm currently writing a TOML library of my own and I'd be happy to build it into my UTF-8 decoder as a proof-of-concept, if that's useful.

@marzer
Contributor Author

marzer commented Dec 12, 2019

@ChristianSi I wrote a script to scrape letter characters from the website you linked, sort them and list them as ranges. If you omit the letter categories Lm and Lo the set of character ranges seems totally manageable:

removed, see below

@pradyunsg I don't know much about ABNF, but if characters can be expressed as ranges this wouldn't be much work.

@ChristianSi
Contributor

@marzer Interesting. But the problem is that nearly all Unicode letters are in the "Letter, other" (Lo) category – 97 percent according to Wikipedia. Ignoring them, you only get the letters in alphabets that distinguish between upper case and lower case forms – Latin, Cyrillic, Greek, and a few others. But most writing systems don't – e.g. those used to write Chinese, Arabic, Hebrew, Korean, and certain Indian languages such as Tamil and Telugu know no such distinction. Hence their letters go into the "other" category.

What happens when you consider all letters? I suppose ranges become a bit unwieldy?

@marzer
Contributor Author

marzer commented Dec 13, 2019

@ChristianSi Surprisingly it's not that unwieldy:

removed, see below

@ChristianSi
Contributor

@marzer Ooo-kay. But it seems that website is outdated or incomplete. It only lists 16249 Other Letters, while Wikipedia says there should be 121414. This list seems complete.

Moreover, you should also add the Lm (Letter, modifier) category – there are only 259 of them.

@marzer
Contributor Author

marzer commented Dec 13, 2019

@ChristianSi Ah, good find. I'll update the script later tonight and see how it looks.

@pradyunsg
Member

Thanks for exploring this @ChristianSi and @marzer! ^>^

@pradyunsg
Member

Marking this as a post-1.0 change, since I imagine this relaxation would not make any valid documents invalid -- thus, we can augment this in a non-major version bump.

@marzer
Contributor Author

marzer commented Dec 13, 2019

@ChristianSi Ok, I updated the script to scrape directly from the Unicode Consortium's character database and amended it to include all of the letter characters, and it looks like this:

Added 125634 codepoints from 5 categories.
Ranges:
        0x41 => 0x5A
        0x61 => 0x7A
        0xAA
        0xB5
        0xBA
        0xC0 => 0xD6
        0xD8 => 0xF6
        0xF8 => 0x2C1
        0x2C6 => 0x2D1
        0x2E0 => 0x2E4
        0x2EC
        0x2EE
        0x370 => 0x374
        0x376 => 0x377
        0x37A => 0x37D
        0x37F
        0x386
        0x388 => 0x38A
        0x38C
        0x38E => 0x3A1
        0x3A3 => 0x3F5
        0x3F7 => 0x481
        0x48A => 0x52F
        0x531 => 0x556
        0x559
        0x560 => 0x588
        0x5D0 => 0x5EA
        0x5EF => 0x5F2
        0x620 => 0x64A
        0x66E => 0x66F
        0x671 => 0x6D3
        0x6D5
        0x6E5 => 0x6E6
        0x6EE => 0x6EF
        0x6FA => 0x6FC
        0x6FF
        0x710
        0x712 => 0x72F
        0x74D => 0x7A5
        0x7B1
        0x7CA => 0x7EA
        0x7F4 => 0x7F5
        0x7FA
        0x800 => 0x815
        0x81A
        0x824
        0x828
        0x840 => 0x858
        0x860 => 0x86A
        0x8A0 => 0x8B4
        0x8B6 => 0x8BD
        0x904 => 0x939
        0x93D
        0x950
        0x958 => 0x961
        0x971 => 0x980
        0x985 => 0x98C
        0x98F => 0x990
        0x993 => 0x9A8
        0x9AA => 0x9B0
        0x9B2
        0x9B6 => 0x9B9
        0x9BD
        0x9CE
        0x9DC => 0x9DD
        0x9DF => 0x9E1
        0x9F0 => 0x9F1
        0x9FC
        0xA05 => 0xA0A
        0xA0F => 0xA10
        0xA13 => 0xA28
        0xA2A => 0xA30
        0xA32 => 0xA33
        0xA35 => 0xA36
        0xA38 => 0xA39
        0xA59 => 0xA5C
        0xA5E
        0xA72 => 0xA74
        0xA85 => 0xA8D
        0xA8F => 0xA91
        0xA93 => 0xAA8
        0xAAA => 0xAB0
        0xAB2 => 0xAB3
        0xAB5 => 0xAB9
        0xABD
        0xAD0
        0xAE0 => 0xAE1
        0xAF9
        0xB05 => 0xB0C
        0xB0F => 0xB10
        0xB13 => 0xB28
        0xB2A => 0xB30
        0xB32 => 0xB33
        0xB35 => 0xB39
        0xB3D
        0xB5C => 0xB5D
        0xB5F => 0xB61
        0xB71
        0xB83
        0xB85 => 0xB8A
        0xB8E => 0xB90
        0xB92 => 0xB95
        0xB99 => 0xB9A
        0xB9C
        0xB9E => 0xB9F
        0xBA3 => 0xBA4
        0xBA8 => 0xBAA
        0xBAE => 0xBB9
        0xBD0
        0xC05 => 0xC0C
        0xC0E => 0xC10
        0xC12 => 0xC28
        0xC2A => 0xC39
        0xC3D
        0xC58 => 0xC5A
        0xC60 => 0xC61
        0xC80
        0xC85 => 0xC8C
        0xC8E => 0xC90
        0xC92 => 0xCA8
        0xCAA => 0xCB3
        0xCB5 => 0xCB9
        0xCBD
        0xCDE
        0xCE0 => 0xCE1
        0xCF1 => 0xCF2
        0xD05 => 0xD0C
        0xD0E => 0xD10
        0xD12 => 0xD3A
        0xD3D
        0xD4E
        0xD54 => 0xD56
        0xD5F => 0xD61
        0xD7A => 0xD7F
        0xD85 => 0xD96
        0xD9A => 0xDB1
        0xDB3 => 0xDBB
        0xDBD
        0xDC0 => 0xDC6
        0xE01 => 0xE30
        0xE32 => 0xE33
        0xE40 => 0xE46
        0xE81 => 0xE82
        0xE84
        0xE86 => 0xE8A
        0xE8C => 0xEA3
        0xEA5
        0xEA7 => 0xEB0
        0xEB2 => 0xEB3
        0xEBD
        0xEC0 => 0xEC4
        0xEC6
        0xEDC => 0xEDF
        0xF00
        0xF40 => 0xF47
        0xF49 => 0xF6C
        0xF88 => 0xF8C
        0x1000 => 0x102A
        0x103F
        0x1050 => 0x1055
        0x105A => 0x105D
        0x1061
        0x1065 => 0x1066
        0x106E => 0x1070
        0x1075 => 0x1081
        0x108E
        0x10A0 => 0x10C5
        0x10C7
        0x10CD
        0x10D0 => 0x10FA
        0x10FC => 0x1248
        0x124A => 0x124D
        0x1250 => 0x1256
        0x1258
        0x125A => 0x125D
        0x1260 => 0x1288
        0x128A => 0x128D
        0x1290 => 0x12B0
        0x12B2 => 0x12B5
        0x12B8 => 0x12BE
        0x12C0
        0x12C2 => 0x12C5
        0x12C8 => 0x12D6
        0x12D8 => 0x1310
        0x1312 => 0x1315
        0x1318 => 0x135A
        0x1380 => 0x138F
        0x13A0 => 0x13F5
        0x13F8 => 0x13FD
        0x1401 => 0x166C
        0x166F => 0x167F
        0x1681 => 0x169A
        0x16A0 => 0x16EA
        0x16F1 => 0x16F8
        0x1700 => 0x170C
        0x170E => 0x1711
        0x1720 => 0x1731
        0x1740 => 0x1751
        0x1760 => 0x176C
        0x176E => 0x1770
        0x1780 => 0x17B3
        0x17D7
        0x17DC
        0x1820 => 0x1878
        0x1880 => 0x1884
        0x1887 => 0x18A8
        0x18AA
        0x18B0 => 0x18F5
        0x1900 => 0x191E
        0x1950 => 0x196D
        0x1970 => 0x1974
        0x1980 => 0x19AB
        0x19B0 => 0x19C9
        0x1A00 => 0x1A16
        0x1A20 => 0x1A54
        0x1AA7
        0x1B05 => 0x1B33
        0x1B45 => 0x1B4B
        0x1B83 => 0x1BA0
        0x1BAE => 0x1BAF
        0x1BBA => 0x1BE5
        0x1C00 => 0x1C23
        0x1C4D => 0x1C4F
        0x1C5A => 0x1C7D
        0x1C80 => 0x1C88
        0x1C90 => 0x1CBA
        0x1CBD => 0x1CBF
        0x1CE9 => 0x1CEC
        0x1CEE => 0x1CF3
        0x1CF5 => 0x1CF6
        0x1CFA
        0x1D00 => 0x1DBF
        0x1E00 => 0x1F15
        0x1F18 => 0x1F1D
        0x1F20 => 0x1F45
        0x1F48 => 0x1F4D
        0x1F50 => 0x1F57
        0x1F59
        0x1F5B
        0x1F5D
        0x1F5F => 0x1F7D
        0x1F80 => 0x1FB4
        0x1FB6 => 0x1FBC
        0x1FBE
        0x1FC2 => 0x1FC4
        0x1FC6 => 0x1FCC
        0x1FD0 => 0x1FD3
        0x1FD6 => 0x1FDB
        0x1FE0 => 0x1FEC
        0x1FF2 => 0x1FF4
        0x1FF6 => 0x1FFC
        0x2071
        0x207F
        0x2090 => 0x209C
        0x2102
        0x2107
        0x210A => 0x2113
        0x2115
        0x2119 => 0x211D
        0x2124
        0x2126
        0x2128
        0x212A => 0x212D
        0x212F => 0x2139
        0x213C => 0x213F
        0x2145 => 0x2149
        0x214E
        0x2183 => 0x2184
        0x2C00 => 0x2C2E
        0x2C30 => 0x2C5E
        0x2C60 => 0x2CE4
        0x2CEB => 0x2CEE
        0x2CF2 => 0x2CF3
        0x2D00 => 0x2D25
        0x2D27
        0x2D2D
        0x2D30 => 0x2D67
        0x2D6F
        0x2D80 => 0x2D96
        0x2DA0 => 0x2DA6
        0x2DA8 => 0x2DAE
        0x2DB0 => 0x2DB6
        0x2DB8 => 0x2DBE
        0x2DC0 => 0x2DC6
        0x2DC8 => 0x2DCE
        0x2DD0 => 0x2DD6
        0x2DD8 => 0x2DDE
        0x2E2F
        0x3005 => 0x3006
        0x3031 => 0x3035
        0x303B => 0x303C
        0x3041 => 0x3096
        0x309D => 0x309F
        0x30A1 => 0x30FA
        0x30FC => 0x30FF
        0x3105 => 0x312F
        0x3131 => 0x318E
        0x31A0 => 0x31BA
        0x31F0 => 0x31FF
        0x3400 => 0x4DB4
        0x4E00 => 0x9FEE
        0xA000 => 0xA48C
        0xA4D0 => 0xA4FD
        0xA500 => 0xA60C
        0xA610 => 0xA61F
        0xA62A => 0xA62B
        0xA640 => 0xA66E
        0xA67F => 0xA69D
        0xA6A0 => 0xA6E5
        0xA717 => 0xA71F
        0xA722 => 0xA788
        0xA78B => 0xA7BF
        0xA7C2 => 0xA7C6
        0xA7F7 => 0xA801
        0xA803 => 0xA805
        0xA807 => 0xA80A
        0xA80C => 0xA822
        0xA840 => 0xA873
        0xA882 => 0xA8B3
        0xA8F2 => 0xA8F7
        0xA8FB
        0xA8FD => 0xA8FE
        0xA90A => 0xA925
        0xA930 => 0xA946
        0xA960 => 0xA97C
        0xA984 => 0xA9B2
        0xA9CF
        0xA9E0 => 0xA9E4
        0xA9E6 => 0xA9EF
        0xA9FA => 0xA9FE
        0xAA00 => 0xAA28
        0xAA40 => 0xAA42
        0xAA44 => 0xAA4B
        0xAA60 => 0xAA76
        0xAA7A
        0xAA7E => 0xAAAF
        0xAAB1
        0xAAB5 => 0xAAB6
        0xAAB9 => 0xAABD
        0xAAC0
        0xAAC2
        0xAADB => 0xAADD
        0xAAE0 => 0xAAEA
        0xAAF2 => 0xAAF4
        0xAB01 => 0xAB06
        0xAB09 => 0xAB0E
        0xAB11 => 0xAB16
        0xAB20 => 0xAB26
        0xAB28 => 0xAB2E
        0xAB30 => 0xAB5A
        0xAB5C => 0xAB67
        0xAB70 => 0xABE2
        0xAC00 => 0xD7A2
        0xD7B0 => 0xD7C6
        0xD7CB => 0xD7FB
        0xF900 => 0xFA6D
        0xFA70 => 0xFAD9
        0xFB00 => 0xFB06
        0xFB13 => 0xFB17
        0xFB1D
        0xFB1F => 0xFB28
        0xFB2A => 0xFB36
        0xFB38 => 0xFB3C
        0xFB3E
        0xFB40 => 0xFB41
        0xFB43 => 0xFB44
        0xFB46 => 0xFBB1
        0xFBD3 => 0xFD3D
        0xFD50 => 0xFD8F
        0xFD92 => 0xFDC7
        0xFDF0 => 0xFDFB
        0xFE70 => 0xFE74
        0xFE76 => 0xFEFC
        0xFF21 => 0xFF3A
        0xFF41 => 0xFF5A
        0xFF66 => 0xFFBE
        0xFFC2 => 0xFFC7
        0xFFCA => 0xFFCF
        0xFFD2 => 0xFFD7
        0xFFDA => 0xFFDC
        0x10000 => 0x1000B
        0x1000D => 0x10026
        0x10028 => 0x1003A
        0x1003C => 0x1003D
        0x1003F => 0x1004D
        0x10050 => 0x1005D
        0x10080 => 0x100FA
        0x10280 => 0x1029C
        0x102A0 => 0x102D0
        0x10300 => 0x1031F
        0x1032D => 0x10340
        0x10342 => 0x10349
        0x10350 => 0x10375
        0x10380 => 0x1039D
        0x103A0 => 0x103C3
        0x103C8 => 0x103CF
        0x10400 => 0x1049D
        0x104B0 => 0x104D3
        0x104D8 => 0x104FB
        0x10500 => 0x10527
        0x10530 => 0x10563
        0x10600 => 0x10736
        0x10740 => 0x10755
        0x10760 => 0x10767
        0x10800 => 0x10805
        0x10808
        0x1080A => 0x10835
        0x10837 => 0x10838
        0x1083C
        0x1083F => 0x10855
        0x10860 => 0x10876
        0x10880 => 0x1089E
        0x108E0 => 0x108F2
        0x108F4 => 0x108F5
        0x10900 => 0x10915
        0x10920 => 0x10939
        0x10980 => 0x109B7
        0x109BE => 0x109BF
        0x10A00
        0x10A10 => 0x10A13
        0x10A15 => 0x10A17
        0x10A19 => 0x10A35
        0x10A60 => 0x10A7C
        0x10A80 => 0x10A9C
        0x10AC0 => 0x10AC7
        0x10AC9 => 0x10AE4
        0x10B00 => 0x10B35
        0x10B40 => 0x10B55
        0x10B60 => 0x10B72
        0x10B80 => 0x10B91
        0x10C00 => 0x10C48
        0x10C80 => 0x10CB2
        0x10CC0 => 0x10CF2
        0x10D00 => 0x10D23
        0x10F00 => 0x10F1C
        0x10F27
        0x10F30 => 0x10F45
        0x10FE0 => 0x10FF6
        0x11003 => 0x11037
        0x11083 => 0x110AF
        0x110D0 => 0x110E8
        0x11103 => 0x11126
        0x11144
        0x11150 => 0x11172
        0x11176
        0x11183 => 0x111B2
        0x111C1 => 0x111C4
        0x111DA
        0x111DC
        0x11200 => 0x11211
        0x11213 => 0x1122B
        0x11280 => 0x11286
        0x11288
        0x1128A => 0x1128D
        0x1128F => 0x1129D
        0x1129F => 0x112A8
        0x112B0 => 0x112DE
        0x11305 => 0x1130C
        0x1130F => 0x11310
        0x11313 => 0x11328
        0x1132A => 0x11330
        0x11332 => 0x11333
        0x11335 => 0x11339
        0x1133D
        0x11350
        0x1135D => 0x11361
        0x11400 => 0x11434
        0x11447 => 0x1144A
        0x1145F
        0x11480 => 0x114AF
        0x114C4 => 0x114C5
        0x114C7
        0x11580 => 0x115AE
        0x115D8 => 0x115DB
        0x11600 => 0x1162F
        0x11644
        0x11680 => 0x116AA
        0x116B8
        0x11700 => 0x1171A
        0x11800 => 0x1182B
        0x118A0 => 0x118DF
        0x118FF
        0x119A0 => 0x119A7
        0x119AA => 0x119D0
        0x119E1
        0x119E3
        0x11A00
        0x11A0B => 0x11A32
        0x11A3A
        0x11A50
        0x11A5C => 0x11A89
        0x11A9D
        0x11AC0 => 0x11AF8
        0x11C00 => 0x11C08
        0x11C0A => 0x11C2E
        0x11C40
        0x11C72 => 0x11C8F
        0x11D00 => 0x11D06
        0x11D08 => 0x11D09
        0x11D0B => 0x11D30
        0x11D46
        0x11D60 => 0x11D65
        0x11D67 => 0x11D68
        0x11D6A => 0x11D89
        0x11D98
        0x11EE0 => 0x11EF2
        0x12000 => 0x12399
        0x12480 => 0x12543
        0x13000 => 0x1342E
        0x14400 => 0x14646
        0x16800 => 0x16A38
        0x16A40 => 0x16A5E
        0x16AD0 => 0x16AED
        0x16B00 => 0x16B2F
        0x16B40 => 0x16B43
        0x16B63 => 0x16B77
        0x16B7D => 0x16B8F
        0x16E40 => 0x16E7F
        0x16F00 => 0x16F4A
        0x16F50
        0x16F93 => 0x16F9F
        0x16FE0 => 0x16FE1
        0x16FE3
        0x17000 => 0x187F6
        0x18800 => 0x18AF2
        0x1B000 => 0x1B11E
        0x1B150 => 0x1B152
        0x1B164 => 0x1B167
        0x1B170 => 0x1B2FB
        0x1BC00 => 0x1BC6A
        0x1BC70 => 0x1BC7C
        0x1BC80 => 0x1BC88
        0x1BC90 => 0x1BC99
        0x1D400 => 0x1D454
        0x1D456 => 0x1D49C
        0x1D49E => 0x1D49F
        0x1D4A2
        0x1D4A5 => 0x1D4A6
        0x1D4A9 => 0x1D4AC
        0x1D4AE => 0x1D4B9
        0x1D4BB
        0x1D4BD => 0x1D4C3
        0x1D4C5 => 0x1D505
        0x1D507 => 0x1D50A
        0x1D50D => 0x1D514
        0x1D516 => 0x1D51C
        0x1D51E => 0x1D539
        0x1D53B => 0x1D53E
        0x1D540 => 0x1D544
        0x1D546
        0x1D54A => 0x1D550
        0x1D552 => 0x1D6A5
        0x1D6A8 => 0x1D6C0
        0x1D6C2 => 0x1D6DA
        0x1D6DC => 0x1D6FA
        0x1D6FC => 0x1D714
        0x1D716 => 0x1D734
        0x1D736 => 0x1D74E
        0x1D750 => 0x1D76E
        0x1D770 => 0x1D788
        0x1D78A => 0x1D7A8
        0x1D7AA => 0x1D7C2
        0x1D7C4 => 0x1D7CB
        0x1E100 => 0x1E12C
        0x1E137 => 0x1E13D
        0x1E14E
        0x1E2C0 => 0x1E2EB
        0x1E800 => 0x1E8C4
        0x1E900 => 0x1E943
        0x1E94B
        0x1EE00 => 0x1EE03
        0x1EE05 => 0x1EE1F
        0x1EE21 => 0x1EE22
        0x1EE24
        0x1EE27
        0x1EE29 => 0x1EE32
        0x1EE34 => 0x1EE37
        0x1EE39
        0x1EE3B
        0x1EE42
        0x1EE47
        0x1EE49
        0x1EE4B
        0x1EE4D => 0x1EE4F
        0x1EE51 => 0x1EE52
        0x1EE54
        0x1EE57
        0x1EE59
        0x1EE5B
        0x1EE5D
        0x1EE5F
        0x1EE61 => 0x1EE62
        0x1EE64
        0x1EE67 => 0x1EE6A
        0x1EE6C => 0x1EE72
        0x1EE74 => 0x1EE77
        0x1EE79 => 0x1EE7C
        0x1EE7E
        0x1EE80 => 0x1EE89
        0x1EE8B => 0x1EE9B
        0x1EEA1 => 0x1EEA3
        0x1EEA5 => 0x1EEA9
        0x1EEAB => 0x1EEBB
        0x20000 => 0x2A6D5
        0x2A700 => 0x2B733
        0x2B740 => 0x2B81C
        0x2B820 => 0x2CEA0
        0x2CEB0 => 0x2EBDF
        0x2F800 => 0x2FA1D

Not really any worse than before, even considering it's 125634 characters.
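
The script itself isn't shown in this thread. As a rough sketch of the same idea, the ranges can be derived in Python from the standard `unicodedata` module instead of scraping the database. The helper names `codepoint_ranges` and `format_range` are assumptions, and the exact output depends on the Unicode version bundled with the Python build, so it may differ from the listing above.

```python
import sys
import unicodedata

LETTER_CATEGORIES = {"Ll", "Lm", "Lo", "Lt", "Lu"}


def codepoint_ranges(categories):
    """Collapse all code points in the given categories into inclusive ranges."""
    ranges = []
    start = previous = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) not in categories:
            continue
        if previous is not None and cp == previous + 1:
            previous = cp  # extend the current run
            continue
        if start is not None:
            ranges.append((start, previous))  # close the finished run
        start = previous = cp
    if start is not None:
        ranges.append((start, previous))
    return ranges


def format_range(first, last):
    """Format a range in the '0x41 => 0x5A' style used in the listing above."""
    return f"0x{first:X}" if first == last else f"0x{first:X} => 0x{last:X}"
```

Calling `codepoint_ranges(LETTER_CATEGORIES)` and printing each pair with `format_range` gives a listing in the same shape. Large blocks are worth spot-checking against the standard: the Hangul syllables, for example, run through U+D7A3, so a range computed this way ends there.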

@eksortso
Contributor

@marzer With what you provided, a PR could be prepared fairly quickly. Could you write that list similarly to how ucschar is written in RFC 3987? You don't need to wrap it; we can do that. But instead of e.g. 0x2F800 => 0x2FA1D, can you write %x2F800-2FA1D instead?

For reference, the part of RFC 3987 I'm referring to looks like this:

   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD
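
The conversion being asked for is mechanical. A minimal Python sketch, assuming the `0x… => 0x…` line format from the listing earlier in this thread, could look like the following; `parse_listing_line` and `to_abnf` are hypothetical helper names, and the wrapping imitates the RFC 3987 layout quoted above.

```python
def parse_listing_line(line):
    """Turn '0x2F800 => 0x2FA1D' or '0xAA' into an inclusive (first, last) pair."""
    parts = [int(part, 16) for part in line.split("=>")]
    return (parts[0], parts[-1])


def to_abnf(name, ranges, per_line=4):
    """Render ranges as an ABNF rule, with continuation lines starting with '/'."""
    terms = [
        f"%x{first:X}" if first == last else f"%x{first:X}-{last:X}"
        for first, last in ranges
    ]
    lines = [
        " / ".join(terms[i:i + per_line]) for i in range(0, len(terms), per_line)
    ]
    indent = " " * (len(name) + 1)  # aligns '/' under the '=' sign
    return f"{name} = " + ("\n" + indent + "/ ").join(lines)
```

Feeding each listing line through `parse_listing_line` and the resulting pairs into `to_abnf("letters", pairs)` would produce a rule in the requested `%x2F800-2FA1D` notation.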

@marzer
Contributor Author

marzer commented Dec 13, 2019

@eksortso here you go:

removed, see below

@ChristianSi
Contributor

ChristianSi commented Dec 14, 2019

Thanks a lot for your efforts, @marzer ! That looks good so far and indeed manageable, but there are a few complications we have missed so far. I looked at what Python 3 and JavaScript allow in identifiers.

In addition to the five Letter categories we have already, they allow "Letter Number" (Nl) anywhere in an identifier and "Decimal Number" (Nd) anywhere, except at the start. TOML already allows all-numeric keys (they never occur in a position where they can be confused with actual numbers) so I'd consider it reasonable to allow both these categories anywhere in a key – people using Bengali letters in keys might, for example, reasonably expect to be able to use Bengali digits as well. Together they comprise less than 900 characters, so adding them should be quite manageable.

The final Number category (Other Number – No) is not allowed in identifiers in either language.

Moreover, both languages allow anywhere, except at the start, "Nonspacing Mark" (Mn) and "Spacing Mark" (Mc). Now it's important to understand that in Unicode, Marks (Mx categories) are always combining characters – they become logically attached to the preceding character and modify it. For example, Mn contains the "combining grave accent" which goes over the preceding letter and modifies it; Mc contains various Bengali vowel signs which likewise modify the preceding (supposedly Bengali) letter.

Hence it seems indeed important that we support these two categories too, since they are necessary to write certain words in certain languages – without them, support for multilingual bare keys would be incomplete and people might get odd error messages. It's also important that we do NOT allow them at the start of a bare key, since otherwise they would try to modify the preceding non-key character (likely a newline, space, or [ or . in table names or dotted keys), which would be nonsensical and blur the boundary at the start of a key. Together these categories have about 2250 entries, which likewise is manageable.

Finally, both JS and Python allow, except at the start, Connector Punctuation (Pc). That's a very short category with just 10 entries, including the underscore, which we allow already. I don't have strong feelings regarding this category, but would rather tend NOT to allow it in bare keys – we already have underscores and dashes as connectors, and, for example, the Centreline Low Line (﹎) with a tiny dot in the middle could theoretically be confused with the dots that actually separate key elements in hierarchical table names.
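
For anyone who wants to inspect the Connector Punctuation category directly, a short Python sketch using the standard `unicodedata` module can list it. The variable name `connectors` is an assumption, and the exact membership depends on the Unicode version bundled with the Python build.

```python
import sys
import unicodedata

# Collect every code point whose general category is Pc (Connector Punctuation).
connectors = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Pc"
]

for ch in connectors:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The output includes the ASCII underscore (LOW LINE) alongside look-alikes such as CENTRELINE LOW LINE, which is the kind of character the concern above is about.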

So, to summarize, I'd propose to additionally allow Nl and Nd anywhere in a bare key, and Mn and Mc anywhere except as first character (or code point, to be more exact).

In the README, we could then say:

Bare keys may contain arbitrary Unicode letters and digits as well as ASCII underscores (_) and dashes (-). (Technically, code points belonging to the Unicode categories Ll, Lm, Lo, Lt, Lu, Nd and Nl are allowed anywhere in a bare key, and those belonging to the categories Mc and Mn are allowed anywhere except as first code point.)

@marzer
Contributor Author

marzer commented Dec 14, 2019

@ChristianSi LGTM. This proposal has significantly broadened in scope from my original thought bubble, but definitely for the better.

Can the Mx codepoints appear consecutively? If not, we'd also need to clarify that codepoints from Mx categories cannot appear at the beginning of a key or immediately following another Mx codepoint.

@ChristianSi
Contributor

@marzer:

This proposal has significantly broadened in scope from my original thought bubble, but definitely for the better.

Indeed!

Yes, consecutive Mx codepoints are allowed – they all modify the preceding letter, e.g. by placing an acute accent above and an ogonek below it. That's necessary for some languages, such as Navajo.
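
As a small illustration of stacked marks, the Python sketch below builds the letter "a" with an ogonek and an acute accent from three code points and inspects them with the standard `unicodedata` module. The variable names are assumptions for the example.

```python
import unicodedata

# 'a' followed by COMBINING OGONEK (U+0328) and COMBINING ACUTE ACCENT (U+0301).
text = "a\u0328\u0301"

categories = [unicodedata.category(ch) for ch in text]
names = [unicodedata.name(ch) for ch in text]

# NFC composes what it can: 'a' + ogonek has a precomposed form (U+0105),
# while the acute accent remains a separate combining mark.
composed = unicodedata.normalize("NFC", text)

for ch, category, name in zip(text, categories, names):
    print(f"U+{ord(ch):04X} {category} {name}")
```

Both marks are in category Mn and both attach to the same base letter, so a bare key rule has to accept runs of Mx code points after the first character.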

@marzer changed the title from "Relax bare key character restrictions to allow some additional 'letter-likes'" to "Relax bare key restrictions to allow additional unicode characters" on Dec 15, 2019
@marzer changed the title from "Relax bare key restrictions to allow additional unicode characters" to "Relax bare key restrictions to allow additional unicode letters and numbers" on Dec 15, 2019
@marzer
Contributor Author

marzer commented Dec 15, 2019

Alright I've updated the issue text to better reflect the current state of the discussion, as well as including links to my proof-of-concept implementation. I've also updated the script to generate the ABNF notation for the three relevant 'super-categories' of codepoints, which generates this:

; unicode codepoints from categories Ll, Lm, Lo, Lt, Lu
letters = %x41-5A / %x61-7A / %xAA / %xB5 /
        %xBA / %xC0-D6 / %xD8-F6 / %xF8-2C1 /
        %x2C6-2D1 / %x2E0-2E4 / %x2EC / %x2EE /
        %x370-374 / %x376-377 / %x37A-37D / %x37F /
        %x386 / %x388-38A / %x38C / %x38E-3A1 /
        %x3A3-3F5 / %x3F7-481 / %x48A-52F / %x531-556 /
        %x559 / %x560-588 / %x5D0-5EA / %x5EF-5F2 /
        %x620-64A / %x66E-66F / %x671-6D3 / %x6D5 /
        %x6E5-6E6 / %x6EE-6EF / %x6FA-6FC / %x6FF /
        %x710 / %x712-72F / %x74D-7A5 / %x7B1 /
        %x7CA-7EA / %x7F4-7F5 / %x7FA / %x800-815 /
        %x81A / %x824 / %x828 / %x840-858 /
        %x860-86A / %x8A0-8B4 / %x8B6-8C7 / %x904-939 /
        %x93D / %x950 / %x958-961 / %x971-980 /
        %x985-98C / %x98F-990 / %x993-9A8 / %x9AA-9B0 /
        %x9B2 / %x9B6-9B9 / %x9BD / %x9CE /
        %x9DC-9DD / %x9DF-9E1 / %x9F0-9F1 / %x9FC /
        %xA05-A0A / %xA0F-A10 / %xA13-A28 / %xA2A-A30 /
        %xA32-A33 / %xA35-A36 / %xA38-A39 / %xA59-A5C /
        %xA5E / %xA72-A74 / %xA85-A8D / %xA8F-A91 /
        %xA93-AA8 / %xAAA-AB0 / %xAB2-AB3 / %xAB5-AB9 /
        %xABD / %xAD0 / %xAE0-AE1 / %xAF9 /
        %xB05-B0C / %xB0F-B10 / %xB13-B28 / %xB2A-B30 /
        %xB32-B33 / %xB35-B39 / %xB3D / %xB5C-B5D /
        %xB5F-B61 / %xB71 / %xB83 / %xB85-B8A /
        %xB8E-B90 / %xB92-B95 / %xB99-B9A / %xB9C /
        %xB9E-B9F / %xBA3-BA4 / %xBA8-BAA / %xBAE-BB9 /
        %xBD0 / %xC05-C0C / %xC0E-C10 / %xC12-C28 /
        %xC2A-C39 / %xC3D / %xC58-C5A / %xC60-C61 /
        %xC80 / %xC85-C8C / %xC8E-C90 / %xC92-CA8 /
        %xCAA-CB3 / %xCB5-CB9 / %xCBD / %xCDE /
        %xCE0-CE1 / %xCF1-CF2 / %xD04-D0C / %xD0E-D10 /
        %xD12-D3A / %xD3D / %xD4E / %xD54-D56 /
        %xD5F-D61 / %xD7A-D7F / %xD85-D96 / %xD9A-DB1 /
        %xDB3-DBB / %xDBD / %xDC0-DC6 / %xE01-E30 /
        %xE32-E33 / %xE40-E46 / %xE81-E82 / %xE84 /
        %xE86-E8A / %xE8C-EA3 / %xEA5 / %xEA7-EB0 /
        %xEB2-EB3 / %xEBD / %xEC0-EC4 / %xEC6 /
        %xEDC-EDF / %xF00 / %xF40-F47 / %xF49-F6C /
        %xF88-F8C / %x1000-102A / %x103F / %x1050-1055 /
        %x105A-105D / %x1061 / %x1065-1066 / %x106E-1070 /
        %x1075-1081 / %x108E / %x10A0-10C5 / %x10C7 /
        %x10CD / %x10D0-10FA / %x10FC-1248 / %x124A-124D /
        %x1250-1256 / %x1258 / %x125A-125D / %x1260-1288 /
        %x128A-128D / %x1290-12B0 / %x12B2-12B5 / %x12B8-12BE /
        %x12C0 / %x12C2-12C5 / %x12C8-12D6 / %x12D8-1310 /
        %x1312-1315 / %x1318-135A / %x1380-138F / %x13A0-13F5 /
        %x13F8-13FD / %x1401-166C / %x166F-167F / %x1681-169A /
        %x16A0-16EA / %x16F1-16F8 / %x1700-170C / %x170E-1711 /
        %x1720-1731 / %x1740-1751 / %x1760-176C / %x176E-1770 /
        %x1780-17B3 / %x17D7 / %x17DC / %x1820-1878 /
        %x1880-1884 / %x1887-18A8 / %x18AA / %x18B0-18F5 /
        %x1900-191E / %x1950-196D / %x1970-1974 / %x1980-19AB /
        %x19B0-19C9 / %x1A00-1A16 / %x1A20-1A54 / %x1AA7 /
        %x1B05-1B33 / %x1B45-1B4B / %x1B83-1BA0 / %x1BAE-1BAF /
        %x1BBA-1BE5 / %x1C00-1C23 / %x1C4D-1C4F / %x1C5A-1C7D /
        %x1C80-1C88 / %x1C90-1CBA / %x1CBD-1CBF / %x1CE9-1CEC /
        %x1CEE-1CF3 / %x1CF5-1CF6 / %x1CFA / %x1D00-1DBF /
        %x1E00-1F15 / %x1F18-1F1D / %x1F20-1F45 / %x1F48-1F4D /
        %x1F50-1F57 / %x1F59 / %x1F5B / %x1F5D /
        %x1F5F-1F7D / %x1F80-1FB4 / %x1FB6-1FBC / %x1FBE /
        %x1FC2-1FC4 / %x1FC6-1FCC / %x1FD0-1FD3 / %x1FD6-1FDB /
        %x1FE0-1FEC / %x1FF2-1FF4 / %x1FF6-1FFC / %x2071 /
        %x207F / %x2090-209C / %x2102 / %x2107 /
        %x210A-2113 / %x2115 / %x2119-211D / %x2124 /
        %x2126 / %x2128 / %x212A-212D / %x212F-2139 /
        %x213C-213F / %x2145-2149 / %x214E / %x2183-2184 /
        %x2C00-2C2E / %x2C30-2C5E / %x2C60-2CE4 / %x2CEB-2CEE /
        %x2CF2-2CF3 / %x2D00-2D25 / %x2D27 / %x2D2D /
        %x2D30-2D67 / %x2D6F / %x2D80-2D96 / %x2DA0-2DA6 /
        %x2DA8-2DAE / %x2DB0-2DB6 / %x2DB8-2DBE / %x2DC0-2DC6 /
        %x2DC8-2DCE / %x2DD0-2DD6 / %x2DD8-2DDE / %x2E2F /
        %x3005-3006 / %x3031-3035 / %x303B-303C / %x3041-3096 /
        %x309D-309F / %x30A1-30FA / %x30FC-30FF / %x3105-312F /
        %x3131-318E / %x31A0-31BF / %x31F0-31FF / %x3400-4DBF /
        %x4E00-9FFC / %xA000-A48C / %xA4D0-A4FD / %xA500-A60C /
        %xA610-A61F / %xA62A-A62B / %xA640-A66E / %xA67F-A69D /
        %xA6A0-A6E5 / %xA717-A71F / %xA722-A788 / %xA78B-A7BF /
        %xA7C2-A7CA / %xA7F5-A801 / %xA803-A805 / %xA807-A80A /
        %xA80C-A822 / %xA840-A873 / %xA882-A8B3 / %xA8F2-A8F7 /
        %xA8FB / %xA8FD-A8FE / %xA90A-A925 / %xA930-A946 /
        %xA960-A97C / %xA984-A9B2 / %xA9CF / %xA9E0-A9E4 /
        %xA9E6-A9EF / %xA9FA-A9FE / %xAA00-AA28 / %xAA40-AA42 /
        %xAA44-AA4B / %xAA60-AA76 / %xAA7A / %xAA7E-AAAF /
        %xAAB1 / %xAAB5-AAB6 / %xAAB9-AABD / %xAAC0 /
        %xAAC2 / %xAADB-AADD / %xAAE0-AAEA / %xAAF2-AAF4 /
        %xAB01-AB06 / %xAB09-AB0E / %xAB11-AB16 / %xAB20-AB26 /
        %xAB28-AB2E / %xAB30-AB5A / %xAB5C-AB69 / %xAB70-ABE2 /
        %xAC00-D7A3 / %xD7B0-D7C6 / %xD7CB-D7FB / %xF900-FA6D /
        %xFA70-FAD9 / %xFB00-FB06 / %xFB13-FB17 / %xFB1D /
        %xFB1F-FB28 / %xFB2A-FB36 / %xFB38-FB3C / %xFB3E /
        %xFB40-FB41 / %xFB43-FB44 / %xFB46-FBB1 / %xFBD3-FD3D /
        %xFD50-FD8F / %xFD92-FDC7 / %xFDF0-FDFB / %xFE70-FE74 /
        %xFE76-FEFC / %xFF21-FF3A / %xFF41-FF5A / %xFF66-FFBE /
        %xFFC2-FFC7 / %xFFCA-FFCF / %xFFD2-FFD7 / %xFFDA-FFDC /
        %x10000-1000B / %x1000D-10026 / %x10028-1003A / %x1003C-1003D /
        %x1003F-1004D / %x10050-1005D / %x10080-100FA / %x10280-1029C /
        %x102A0-102D0 / %x10300-1031F / %x1032D-10340 / %x10342-10349 /
        %x10350-10375 / %x10380-1039D / %x103A0-103C3 / %x103C8-103CF /
        %x10400-1049D / %x104B0-104D3 / %x104D8-104FB / %x10500-10527 /
        %x10530-10563 / %x10600-10736 / %x10740-10755 / %x10760-10767 /
        %x10800-10805 / %x10808 / %x1080A-10835 / %x10837-10838 /
        %x1083C / %x1083F-10855 / %x10860-10876 / %x10880-1089E /
        %x108E0-108F2 / %x108F4-108F5 / %x10900-10915 / %x10920-10939 /
        %x10980-109B7 / %x109BE-109BF / %x10A00 / %x10A10-10A13 /
        %x10A15-10A17 / %x10A19-10A35 / %x10A60-10A7C / %x10A80-10A9C /
        %x10AC0-10AC7 / %x10AC9-10AE4 / %x10B00-10B35 / %x10B40-10B55 /
        %x10B60-10B72 / %x10B80-10B91 / %x10C00-10C48 / %x10C80-10CB2 /
        %x10CC0-10CF2 / %x10D00-10D23 / %x10E80-10EA9 / %x10EB0-10EB1 /
        %x10F00-10F1C / %x10F27 / %x10F30-10F45 / %x10FB0-10FC4 /
        %x10FE0-10FF6 / %x11003-11037 / %x11083-110AF / %x110D0-110E8 /
        %x11103-11126 / %x11144 / %x11147 / %x11150-11172 /
        %x11176 / %x11183-111B2 / %x111C1-111C4 / %x111DA /
        %x111DC / %x11200-11211 / %x11213-1122B / %x11280-11286 /
        %x11288 / %x1128A-1128D / %x1128F-1129D / %x1129F-112A8 /
        %x112B0-112DE / %x11305-1130C / %x1130F-11310 / %x11313-11328 /
        %x1132A-11330 / %x11332-11333 / %x11335-11339 / %x1133D /
        %x11350 / %x1135D-11361 / %x11400-11434 / %x11447-1144A /
        %x1145F-11461 / %x11480-114AF / %x114C4-114C5 / %x114C7 /
        %x11580-115AE / %x115D8-115DB / %x11600-1162F / %x11644 /
        %x11680-116AA / %x116B8 / %x11700-1171A / %x11800-1182B /
        %x118A0-118DF / %x118FF-11906 / %x11909 / %x1190C-11913 /
        %x11915-11916 / %x11918-1192F / %x1193F / %x11941 /
        %x119A0-119A7 / %x119AA-119D0 / %x119E1 / %x119E3 /
        %x11A00 / %x11A0B-11A32 / %x11A3A / %x11A50 /
        %x11A5C-11A89 / %x11A9D / %x11AC0-11AF8 / %x11C00-11C08 /
        %x11C0A-11C2E / %x11C40 / %x11C72-11C8F / %x11D00-11D06 /
        %x11D08-11D09 / %x11D0B-11D30 / %x11D46 / %x11D60-11D65 /
        %x11D67-11D68 / %x11D6A-11D89 / %x11D98 / %x11EE0-11EF2 /
        %x11FB0 / %x12000-12399 / %x12480-12543 / %x13000-1342E /
        %x14400-14646 / %x16800-16A38 / %x16A40-16A5E / %x16AD0-16AED /
        %x16B00-16B2F / %x16B40-16B43 / %x16B63-16B77 / %x16B7D-16B8F /
        %x16E40-16E7F / %x16F00-16F4A / %x16F50 / %x16F93-16F9F /
        %x16FE0-16FE1 / %x16FE3 / %x17000-187F7 / %x18800-18CD5 /
        %x18D00-18D08 / %x1B000-1B11E / %x1B150-1B152 / %x1B164-1B167 /
        %x1B170-1B2FB / %x1BC00-1BC6A / %x1BC70-1BC7C / %x1BC80-1BC88 /
        %x1BC90-1BC99 / %x1D400-1D454 / %x1D456-1D49C / %x1D49E-1D49F /
        %x1D4A2 / %x1D4A5-1D4A6 / %x1D4A9-1D4AC / %x1D4AE-1D4B9 /
        %x1D4BB / %x1D4BD-1D4C3 / %x1D4C5-1D505 / %x1D507-1D50A /
        %x1D50D-1D514 / %x1D516-1D51C / %x1D51E-1D539 / %x1D53B-1D53E /
        %x1D540-1D544 / %x1D546 / %x1D54A-1D550 / %x1D552-1D6A5 /
        %x1D6A8-1D6C0 / %x1D6C2-1D6DA / %x1D6DC-1D6FA / %x1D6FC-1D714 /
        %x1D716-1D734 / %x1D736-1D74E / %x1D750-1D76E / %x1D770-1D788 /
        %x1D78A-1D7A8 / %x1D7AA-1D7C2 / %x1D7C4-1D7CB / %x1E100-1E12C /
        %x1E137-1E13D / %x1E14E / %x1E2C0-1E2EB / %x1E800-1E8C4 /
        %x1E900-1E943 / %x1E94B / %x1EE00-1EE03 / %x1EE05-1EE1F /
        %x1EE21-1EE22 / %x1EE24 / %x1EE27 / %x1EE29-1EE32 /
        %x1EE34-1EE37 / %x1EE39 / %x1EE3B / %x1EE42 /
        %x1EE47 / %x1EE49 / %x1EE4B / %x1EE4D-1EE4F /
        %x1EE51-1EE52 / %x1EE54 / %x1EE57 / %x1EE59 /
        %x1EE5B / %x1EE5D / %x1EE5F / %x1EE61-1EE62 /
        %x1EE64 / %x1EE67-1EE6A / %x1EE6C-1EE72 / %x1EE74-1EE77 /
        %x1EE79-1EE7C / %x1EE7E / %x1EE80-1EE89 / %x1EE8B-1EE9B /
        %x1EEA1-1EEA3 / %x1EEA5-1EEA9 / %x1EEAB-1EEBB / %x20000-2A6DD /
        %x2A700-2B734 / %x2B740-2B81D / %x2B820-2CEA1 / %x2CEB0-2EBE0 /
        %x2F800-2FA1D / %x30000-3134A
        ; 131241 codepoints in total


; unicode codepoints from categories Nd, Nl
numbers = %x30-39 / %x660-669 / %x6F0-6F9 / %x7C0-7C9 /
        %x966-96F / %x9E6-9EF / %xA66-A6F / %xAE6-AEF /
        %xB66-B6F / %xBE6-BEF / %xC66-C6F / %xCE6-CEF /
        %xD66-D6F / %xDE6-DEF / %xE50-E59 / %xED0-ED9 /
        %xF20-F29 / %x1040-1049 / %x1090-1099 / %x16EE-16F0 /
        %x17E0-17E9 / %x1810-1819 / %x1946-194F / %x19D0-19D9 /
        %x1A80-1A89 / %x1A90-1A99 / %x1B50-1B59 / %x1BB0-1BB9 /
        %x1C40-1C49 / %x1C50-1C59 / %x2160-2182 / %x2185-2188 /
        %x3007 / %x3021-3029 / %x3038-303A / %xA620-A629 /
        %xA6E6-A6EF / %xA8D0-A8D9 / %xA900-A909 / %xA9D0-A9D9 /
        %xA9F0-A9F9 / %xAA50-AA59 / %xABF0-ABF9 / %xFF10-FF19 /
        %x10140-10174 / %x10341 / %x1034A / %x103D1-103D5 /
        %x104A0-104A9 / %x10D30-10D39 / %x11066-1106F / %x110F0-110F9 /
        %x11136-1113F / %x111D0-111D9 / %x112F0-112F9 / %x11450-11459 /
        %x114D0-114D9 / %x11650-11659 / %x116C0-116C9 / %x11730-11739 /
        %x118E0-118E9 / %x11950-11959 / %x11C50-11C59 / %x11D50-11D59 /
        %x11DA0-11DA9 / %x12400-1246E / %x16A60-16A69 / %x16B50-16B59 /
        %x1D7CE-1D7FF / %x1E140-1E149 / %x1E2F0-1E2F9 / %x1E950-1E959 /
        %x1FBF0-1FBF9
        ; 886 codepoints in total


; unicode codepoints from categories Mn, Mc
combining_marks = %x300-36F / %x483-487 / %x591-5BD / %x5BF /
        %x5C1-5C2 / %x5C4-5C5 / %x5C7 / %x610-61A /
        %x64B-65F / %x670 / %x6D6-6DC / %x6DF-6E4 /
        %x6E7-6E8 / %x6EA-6ED / %x711 / %x730-74A /
        %x7A6-7B0 / %x7EB-7F3 / %x7FD / %x816-819 /
        %x81B-823 / %x825-827 / %x829-82D / %x859-85B /
        %x8D3-8E1 / %x8E3-903 / %x93A-93C / %x93E-94F /
        %x951-957 / %x962-963 / %x981-983 / %x9BC /
        %x9BE-9C4 / %x9C7-9C8 / %x9CB-9CD / %x9D7 /
        %x9E2-9E3 / %x9FE / %xA01-A03 / %xA3C /
        %xA3E-A42 / %xA47-A48 / %xA4B-A4D / %xA51 /
        %xA70-A71 / %xA75 / %xA81-A83 / %xABC /
        %xABE-AC5 / %xAC7-AC9 / %xACB-ACD / %xAE2-AE3 /
        %xAFA-AFF / %xB01-B03 / %xB3C / %xB3E-B44 /
        %xB47-B48 / %xB4B-B4D / %xB55-B57 / %xB62-B63 /
        %xB82 / %xBBE-BC2 / %xBC6-BC8 / %xBCA-BCD /
        %xBD7 / %xC00-C04 / %xC3E-C44 / %xC46-C48 /
        %xC4A-C4D / %xC55-C56 / %xC62-C63 / %xC81-C83 /
        %xCBC / %xCBE-CC4 / %xCC6-CC8 / %xCCA-CCD /
        %xCD5-CD6 / %xCE2-CE3 / %xD00-D03 / %xD3B-D3C /
        %xD3E-D44 / %xD46-D48 / %xD4A-D4D / %xD57 /
        %xD62-D63 / %xD81-D83 / %xDCA / %xDCF-DD4 /
        %xDD6 / %xDD8-DDF / %xDF2-DF3 / %xE31 /
        %xE34-E3A / %xE47-E4E / %xEB1 / %xEB4-EBC /
        %xEC8-ECD / %xF18-F19 / %xF35 / %xF37 /
        %xF39 / %xF3E-F3F / %xF71-F84 / %xF86-F87 /
        %xF8D-F97 / %xF99-FBC / %xFC6 / %x102B-103E /
        %x1056-1059 / %x105E-1060 / %x1062-1064 / %x1067-106D /
        %x1071-1074 / %x1082-108D / %x108F / %x109A-109D /
        %x135D-135F / %x1712-1714 / %x1732-1734 / %x1752-1753 /
        %x1772-1773 / %x17B4-17D3 / %x17DD / %x180B-180D /
        %x1885-1886 / %x18A9 / %x1920-192B / %x1930-193B /
        %x1A17-1A1B / %x1A55-1A5E / %x1A60-1A7C / %x1A7F /
        %x1AB0-1ABD / %x1ABF-1AC0 / %x1B00-1B04 / %x1B34-1B44 /
        %x1B6B-1B73 / %x1B80-1B82 / %x1BA1-1BAD / %x1BE6-1BF3 /
        %x1C24-1C37 / %x1CD0-1CD2 / %x1CD4-1CE8 / %x1CED /
        %x1CF4 / %x1CF7-1CF9 / %x1DC0-1DF9 / %x1DFB-1DFF /
        %x20D0-20DC / %x20E1 / %x20E5-20F0 / %x2CEF-2CF1 /
        %x2D7F / %x2DE0-2DFF / %x302A-302F / %x3099-309A /
        %xA66F / %xA674-A67D / %xA69E-A69F / %xA6F0-A6F1 /
        %xA802 / %xA806 / %xA80B / %xA823-A827 /
        %xA82C / %xA880-A881 / %xA8B4-A8C5 / %xA8E0-A8F1 /
        %xA8FF / %xA926-A92D / %xA947-A953 / %xA980-A983 /
        %xA9B3-A9C0 / %xA9E5 / %xAA29-AA36 / %xAA43 /
        %xAA4C-AA4D / %xAA7B-AA7D / %xAAB0 / %xAAB2-AAB4 /
        %xAAB7-AAB8 / %xAABE-AABF / %xAAC1 / %xAAEB-AAEF /
        %xAAF5-AAF6 / %xABE3-ABEA / %xABEC-ABED / %xFB1E /
        %xFE00-FE0F / %xFE20-FE2F / %x101FD / %x102E0 /
        %x10376-1037A / %x10A01-10A03 / %x10A05-10A06 / %x10A0C-10A0F /
        %x10A38-10A3A / %x10A3F / %x10AE5-10AE6 / %x10D24-10D27 /
        %x10EAB-10EAC / %x10F46-10F50 / %x11000-11002 / %x11038-11046 /
        %x1107F-11082 / %x110B0-110BA / %x11100-11102 / %x11127-11134 /
        %x11145-11146 / %x11173 / %x11180-11182 / %x111B3-111C0 /
        %x111C9-111CC / %x111CE-111CF / %x1122C-11237 / %x1123E /
        %x112DF-112EA / %x11300-11303 / %x1133B-1133C / %x1133E-11344 /
        %x11347-11348 / %x1134B-1134D / %x11357 / %x11362-11363 /
        %x11366-1136C / %x11370-11374 / %x11435-11446 / %x1145E /
        %x114B0-114C3 / %x115AF-115B5 / %x115B8-115C0 / %x115DC-115DD /
        %x11630-11640 / %x116AB-116B7 / %x1171D-1172B / %x1182C-1183A /
        %x11930-11935 / %x11937-11938 / %x1193B-1193E / %x11940 /
        %x11942-11943 / %x119D1-119D7 / %x119DA-119E0 / %x119E4 /
        %x11A01-11A0A / %x11A33-11A39 / %x11A3B-11A3E / %x11A47 /
        %x11A51-11A5B / %x11A8A-11A99 / %x11C2F-11C36 / %x11C38-11C3F /
        %x11C92-11CA7 / %x11CA9-11CB6 / %x11D31-11D36 / %x11D3A /
        %x11D3C-11D3D / %x11D3F-11D45 / %x11D47 / %x11D8A-11D8E /
        %x11D90-11D91 / %x11D93-11D97 / %x11EF3-11EF6 / %x16AF0-16AF4 /
        %x16B30-16B36 / %x16F4F / %x16F51-16F87 / %x16F8F-16F92 /
        %x16FE4 / %x16FF0-16FF1 / %x1BC9D-1BC9E / %x1D165-1D169 /
        %x1D16D-1D172 / %x1D17B-1D182 / %x1D185-1D18B / %x1D1AA-1D1AD /
        %x1D242-1D244 / %x1DA00-1DA36 / %x1DA3B-1DA6C / %x1DA75 /
        %x1DA84 / %x1DA9B-1DA9F / %x1DAA1-1DAAF / %x1E000-1E006 /
        %x1E008-1E018 / %x1E01B-1E021 / %x1E023-1E024 / %x1E026-1E02A /
        %x1E130-1E136 / %x1E2EC-1E2EF / %x1E8D0-1E8D6 / %x1E944-1E94A /
        %xE0100-E01EF
        ; 2282 codepoints in total
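
The generating script itself is not reproduced in this thread. As a rough sketch of how such ranges could be derived, the Python below walks every code point, groups consecutive members of the requested categories into ranges, and formats them as an ABNF alternation. The function names are hypothetical, and the totals depend on the Unicode version bundled with the interpreter, so they may not match the counts above.

```python
import sys
import unicodedata

def category_ranges(categories):
    """Collapse all code points in the given categories into (first, last) ranges."""
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in categories:
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            ranges.append((start, prev))
            start = None
    if start is not None:
        ranges.append((start, prev))
    return ranges

def to_abnf(name, ranges, per_line=4):
    """Format ranges as an ABNF alternation such as %x41-5A / %xAA."""
    terms = [f"%x{a:X}" if a == b else f"%x{a:X}-{b:X}" for a, b in ranges]
    lines = [" / ".join(terms[i:i + per_line]) for i in range(0, len(terms), per_line)]
    return f"{name} = " + " /\n        ".join(lines)

letters = category_ranges({"Ll", "Lm", "Lo", "Lt", "Lu"})
# Print only the first few ranges; the full list is long.
print(to_abnf("letters", letters[:8]))
```

Calling `category_ranges({"Nd", "Nl"})` or `category_ranges({"Mn", "Mc"})` would produce the other two groups in the same way.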

@ChristianSi
Contributor

@marzer Great!

@pradyunsg Assuming that one of us prepares a PR, is there any chance that this would be merged relatively quickly? Or does it have to wait until 1.0 is released in any case?

@lmna

lmna commented Dec 16, 2019

This feels like a significant change to TOML's interpretation of being "minimal". Maybe we should ask Tom himself to bless this change?

@marzer
Contributor Author

marzer commented Dec 16, 2019

Is it though? The language itself will be just as minimal as before, since this change will be backwards-compatible. In fact it would actually increase the simplicity of TOML files since keys should work in a WYSIWYG way for more people, and only require quotes in very specific circumstances.

It will complicate it for implementers, sure, but not all that much.

@thoughtafter

thoughtafter commented Feb 17, 2020

I'm coming to this as someone who is incorporating TOML into a project with keys that will often contain symbols/punctuation. I've read through this thread and I have not seen anyone propose that keys allow any valid unicode except the symbols needed by the TOML parser itself. That would:

  1. maximize permissiveness
  2. create an easy and small set of rules for when people need to quote keys (if the key contains space, dot, brackets, quotes, etc, then it must be quoted)
  3. not strictly require a unicode library for a parser
  4. Likely be faster than other options to parse

I'm not currently arguing this is the best approach but it seemed worth adding to the set of options in the discussion space.
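
A minimal sketch of this deny-list idea, in Python. The reserved set below is an assumption chosen for illustration (whitespace, dot, equals, hash, brackets, braces, comma, quotes); the thread does not fix an exact list, and `is_bare_key_denylist` is a hypothetical name.

```python
# Hypothetical deny-list: characters assumed to be meaningful to a TOML parser.
# The exact membership is an assumption, not taken from the specification.
RESERVED = set(" \t\r\n.=#[]{},\"'")

def is_bare_key_denylist(text):
    """Accept any non-empty key that contains no reserved character."""
    return bool(text) and not any(ch in RESERVED for ch in text)

print(is_bare_key_denylist("naïve"))
print(is_bare_key_denylist("a.b"))
```

No Unicode database is needed, which is the point of items 3 and 4. The trade-off is that invisible or look-alike characters also pass, since nothing outside the reserved set is inspected.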

@marzer
Contributor Author

marzer commented Feb 22, 2020

If anyone is interested in playing around with a parser that supports this tentative feature (as specified in the OP, anyways), my C++ TOML library is now in a publishable state: https://marzer.github.io/tomlplusplus/

@thoughtafter It seems as though your suggestion is very much in-line with @abelbraaksma's (which from my reading, advocates including everything except syntactically-relevant/ambiguous characters).

@LongTengDao
Contributor

LongTengDao commented Apr 11, 2020

In my opinion, we don't program in any natural language, including English. What we write are symbols. ASCII in programming is a set of safe symbols, not English.

In high-level languages, identifiers can be defined with any character, because there is an IDE and syntax highlighting.

But TOML is designed for ini-style files, which are usually edited without any extra tooling support.
That's also why (and the main reason why TOML exists) TOML is better than YAML: we can't easily indent or dedent nested values or multiline strings.
In any other language, an indentation-based design is better than a non-indented one; we all know that.

So I think it's really dangerous to allow bare keys to include non-ASCII characters.

  1. If we support a complex character range based on Unicode semantics, like JS (and HTML) does, then users of every language need to bear the mental burden of all the other languages that they don't use and are unfamiliar with. That's not minimal.
  2. If we support all non-ASCII characters in identifiers, like CSS does, then a lot of invisible characters could end up in a key, which is not obvious. As ini files are usually used for low-level system program configuration, that's dangerous.

But I think it would be good for the spec to allow implementations to support bare keys in user-specified languages. You use whichever languages you are familiar with. For example, /[\u4e00-\u9fa5]/ is Chinese, so it can be easily supported, which makes bare keys easy to write and still safe. But who else would know or care about that? As a user of a specific language, I know, and I can pass the range as an option value to the parser while keeping a high degree of control.

I think simplicity and support for national languages need not conflict; it doesn't have to be one or the other. Otherwise, absolute "fairness" will lead to widespread inefficiency.
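
The parser option described here could look roughly like the Python sketch below. It is an illustration only: `make_bare_key_checker` and its `extra_ranges` parameter are hypothetical names, not an option of any existing TOML library. The caller passes regex character-class content, which is inserted unescaped, so it has to come from a trusted source.

```python
import re

def make_bare_key_checker(extra_ranges=""):
    """Build a checker for ASCII bare keys plus caller-supplied character ranges.

    extra_ranges is regex character-class content, e.g. r"\u4e00-\u9fa5".
    """
    pattern = re.compile(rf"[A-Za-z0-9_\-{extra_ranges}]+")
    return lambda key: pattern.fullmatch(key) is not None

ascii_only = make_bare_key_checker()
with_chinese = make_bare_key_checker(r"\u4e00-\u9fa5")

print(ascii_only("server_name"))
print(with_chinese("服务器"))
```

A file written for `with_chinese` would be rejected by `ascii_only`, so the same document would be valid or invalid depending on the parser's configuration.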

@marzer
Contributor Author

marzer commented Apr 11, 2020

@LongTengDao it's a config file format, not a database specification or real-time streaming format; I don't think 'efficiency' is all that relevant (if you mean the computational complexity of parsing, that is).

Unless you mean the efficiency of the actual implementing of the new functionality? As in, it will be a bit complex for implementers and maintainers to get this working in their parsers, thus being inefficient for them? If so, that's not even true. It's pretty easy to implement. I've done it myself, and provide relevant information in the original post.

I'm not sure what other sorts of efficiencies you could mean. It wouldn't make TOML any less efficient to write (if anything it would get simpler and easier to use as a result of this proposal).

@LongTengDao
Contributor

LongTengDao commented Apr 11, 2020

@marzer I've never considered the difficulty of writing a parser to be a hindrance, and it's not worth considering in the face of the task of designing a perfect file format. If anyone objects to this, I will be on your side.

I only mean the efficiency of writing and checking. Introducing special characters too broadly will make the process of reading and writing a file stressful again. Remember, Unicode doesn't just include characters in common languages like the ones you and I use (1en or 1em width). Instead, there are many combinations of display and invisible even right-to-left characters that are in character category, rather than punctuation or whitespace category. It's a nightmare if you've ever developed typesetting software like Office Word. After that, I have been frightened by words like "all valid Unicode".


But anyway, it doesn't affect my use. If spec said ASCII only, I will support user options to support any Unicode character range. If spec said any Unicode character is valid as your suggestion, I will support user options to limit ASCII only. I think this right belongs to user.

@marzer
Contributor Author

marzer commented Apr 12, 2020

If spec said any Unicode character is valid as your suggestion

@LongTengDao to be clear, my proposal isn't to support "any Unicode character", as you seem to think. It's to support a subset (letters, numbers, and some combining marks).

Yeah there might be characters in those categories that are effectively garbage for our purposes but they can probably just be ignored; if it's not a character on a keyboard then someone has gone to effort to put it in their config, and if that breaks stuff then that's the life they chose. Parser library users can trivially add additional sanity-checking if they feel the need.

@abelbraaksma
Contributor

abelbraaksma commented Jul 2, 2020

Instead, there are many combinations of display and invisible even right-to-left characters that are in character category, rather than punctuation or whitespace category.

@LongTengDao I believe the opposite to be true. Limiting users that are accustomed to right-to-left writing means limiting over 1.7 billion people worldwide to a system that is not native to them. What is perhaps perceived by you as "a nightmare" is perceived by others as a nightmare if it isn't allowed. Not everyone speaks English or can write in their native tongue using only ASCII characters (in fact, it is a relatively small share of the world population).

Inclusion of other cultures, languages and writing systems is a good thing, and although TOML is not a programming language, many well-known programming languages embrace inclusion more than exclusion: C#, VB, F# (allows any character), Java (they allow a broader set than defined here), Ruby, Perl, XML/HTML tag names, CSS classes/id and there are many more.

Unicode even has a specific TR that describes the recommended way for allowing Unicode characters in identifiers: https://unicode.org/reports/tr31/.

Differences between languages will always exist, but the closer a language (or a spec like TOML) gets to TR31, the better it is for the worldwide community of thousands of languages that can write in their native tongue.

If any company or individual wishes to limit the allowed set of characters in identifiers, or in coding in general, they are of course free to do so, just like coding styles exist for many programming languages, you could limit your style to "only ASCII" or whatever you prefer.

And as already has been said, the proposal here is a safe subset of the Unicode language.

@abelbraaksma
Contributor

abelbraaksma commented Jul 2, 2020

Regarding ranges full of marks, you mention that "they are not allowed as start character in XML and we should probably disallow it as well." Maybe you could investigate that further? It would be interesting to see which ranges are prohibited in NameStartChar, but allowed in NameChar – and why. If they are full of non-ASCII digits, we might want to allow them even at the start, but if they are full of marks, we certainly don't.

@ChristianSi, apologies for the wait, I forgot about your question here.

The precise definition of NameChar is that of NameStartChar with a few additions. These additions are therefore not allowed as a starting character:

NameChar ::=  NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

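Let's split that up. According to this, the following are not allowed as a starting character (and I believe this mostly follows current practice in TOML as well):

  • "-" (hyphen) and "." (dot)
  • [0-9] (ASCII digits)
  • #xB7 (middle dot)
  • [#x0300-#x036F] (the Combining Diacritical Marks block)
  • [#x203F-#x2040] (undertie and character tie)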

I'm not sure why the "tie" is forbidden as starting character in XML names (it is not a combining tie, it is spaced), but the other ones seem sensible.

I can write up a TOML spec proposal for this set, and/or extend it to TR31 if that somehow makes sense, but I think it is easier for people in general to use the XML specification (without reference to XML, of course, as it is otherwise unrelated), since they already did the necessary research, it's concise, and it's trivial to implement. TR31 is quite hard to read and probably raises new questions again.
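
For readers who want to try the XML rule directly, here is a small Python sketch. The `NAME_START_RANGES` table is the NameStartChar production from the XML 1.0 (Fifth Edition) grammar; `NAME_CHAR_EXTRA` holds the additions from the NameChar production quoted above. The function names are hypothetical, and this checks XML names, not TOML keys: a TOML adaptation would at least have to drop ":" and "." from the tables.

```python
# NameStartChar ranges from the XML 1.0 (Fifth Edition) grammar.
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# What NameChar allows on top of NameStartChar:
# "-" and ".", ASCII digits, middle dot, combining diacritics, the two ties.
NAME_CHAR_EXTRA = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
]

def in_ranges(code_point, ranges):
    """Return True if code_point falls inside any (first, last) range."""
    return any(first <= code_point <= last for first, last in ranges)

def is_xml_name(text):
    """Check text against the XML Name production (NameStartChar NameChar*)."""
    if not text or not in_ranges(ord(text[0]), NAME_START_RANGES):
        return False
    return all(
        in_ranges(ord(ch), NAME_START_RANGES) or in_ranges(ord(ch), NAME_CHAR_EXTRA)
        for ch in text[1:]
    )

print(is_xml_name("clé"))
print(is_xml_name("1abc"))
```

The whole check is two short tables and a loop, with no Unicode category lookup.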

@pradyunsg
Member

This looks like we're missing a PR for doing this. If someone wants to pick this up, and file a PR expanding the allowed bare keys syntax to include letters from the broader unicode spec, that'd be welcome!

@abelbraaksma
Contributor

@pradyunsg, done, I've created a PR in #891. I tried to be both as inclusive as possible, while maintaining simplicity for parsers. Basically the rule is now: "Any Unicode letter, letterlike character or digit, except dot", as discussed above.

@abelbraaksma
Contributor

This has now been merged. Thanks everyone for their support and insights!

arp242 added a commit to arp242/toml that referenced this issue Jun 2, 2023
This backs out the unicode bare keys from toml-lang#891.

This does *not* mean we can't include it in a future 1.2 (or 1.3, or
whatever); just that right now there doesn't seem to be a clear
consensus regarding normalisation and which characters to include.
It's already the most discussed single issue in the history of TOML.

I kind of hate doing this as it seems a step backwards; in principle I
think we *should* have this so I'm not against the idea of the feature
as such, but things seem to be at a bit of a stalemate right now, and
this will allow TOML to move forward on other issues.

It hasn't come up *that* often; the issue (toml-lang#687) wasn't filed until
2019, and has only 11 upvotes. Other than that, the issue was raised
only once before in 2015 as far as I can find (toml-lang#337). I also can't
really find anyone asking for it in any of the HN threads on TOML.

All of this means we can push forward releasing TOML 1.1, giving people
access to the much more frequently requested relaxing of inline tables
(toml-lang#516, with 122 upvotes, and has come up on HN as well) and some other
more minor things (e.g. `\e` has 12 upvotes in toml-lang#715).

Basically, a lot more people are waiting for this, and all things
considered this seems a better path forward for now, unless someone
comes up with a proposal which addresses all issues (I tried and thus
far failed).

I proposed this over here a few months ago, and the response didn't seem
too hostile to the idea:
toml-lang#966 (comment)
arp242 added a commit to arp242/toml that referenced this issue Jun 2, 2023
This backs out the unicode bare keys from toml-lang#891.

This does *not* mean we can't include it in a future 1.2 (or 1.3, or
whatever); just that right now there doesn't seem to be a clear
consensus regarding normalisation and which characters to include.
It's already the most discussed single issue in the history of TOML.

I kind of hate doing this as it seems a step backwards; in principle I
think we *should* have this so I'm not against the idea of the feature
as such, but things seem to be at a bit of a stalemate right now, and
this will allow TOML to move forward on other fronts.

It hasn't come up *that* often; the issue (toml-lang#687) wasn't filed until
2019, and has only 11 upvotes. Other than that, the issue was raised
only once before in 2015 as far as I can find (toml-lang#337). I also can't
really find anyone asking for it in any of the HN threads on TOML.

Reverting this means we can go forward releasing TOML 1.1, giving people
access to the much more frequently requested relaxing of inline tables
(toml-lang#516, with 122 upvotes, and has come up on HN as well) and some other
more minor things (e.g. `\e` has 12 upvotes in toml-lang#715).

Basically, a lot more people are waiting for this, and all things
considered this seems a better path forward for now, unless someone
comes up with a proposal which addresses all issues (I tried and thus
far failed).

I proposed this over here a few months ago, and the responses didn't
seem too hostile to the idea:
toml-lang#966 (comment)
arp242 added a commit to arp242/toml that referenced this issue Sep 22, 2023
I believe this would greatly improve things and solves all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and ANY solution is a
trade-off. That said, I do believe some trade-offs are better than
others, and after looking at a bunch of different options I believe this
is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is the strongest argument in favour of this and the biggest
  improvement: we can't really do anything wrong here in a way that we
  can't correct later. Being conservative is probably the right way
  forward.

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this has gone largely unnoticed and undiscussed
  before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work", but "this
  character works fine, but this very similar doesn't". This shows up in
  a number of things:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and see if it works. Now it seems to work on first
  approximation, and then (possibly months later) it seems to "break".

  From the user's perspective this seems like a bug in the TOML parser.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '＃' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the code, adding multibyte support in the first place will
  probably be harder, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0, etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports this; I hesitated over it for a
  long time, and we can also use a more recent version. I feel this
  gives us a nice balance between reasonable interoperability and
  future-proofing.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it is really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)
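
As a rough illustration of the range-table check described in the first trade-off above, here is a minimal Python sketch. The table is hypothetical and heavily abbreviated (a handful of ASCII, Latin-1 and Greek ranges, nowhere near the 716 ranges mentioned above), and the names `ALLOWED_RANGES`, `is_allowed` and `is_bare_key` are made up for this example; a real table would be generated from the Unicode data.

```python
from bisect import bisect_right

# Hypothetical, abbreviated table of inclusive (start, stop) code point
# ranges, sorted by start. This is NOT the table from the proposal.
ALLOWED_RANGES = [
    (0x2D, 0x2D),      # '-'
    (0x30, 0x39),      # ASCII digits
    (0x41, 0x5A),      # ASCII upper-case letters
    (0x5F, 0x5F),      # '_'
    (0x61, 0x7A),      # ASCII lower-case letters
    (0xC0, 0xD6),      # Latin-1 letters (stops before U+00D7 MULTIPLICATION SIGN)
    (0xD8, 0xF6),      # Latin-1 letters (stops before U+00F7 DIVISION SIGN)
    (0xF8, 0xFF),      # Latin-1 letters
    (0x0391, 0x03A1),  # Greek capital letters Alpha..Rho
    (0x03A3, 0x03A9),  # Greek capital letters Sigma..Omega
    (0x03B1, 0x03C9),  # Greek small letters alpha..omega
]
STARTS = [start for start, _ in ALLOWED_RANGES]


def is_allowed(ch: str) -> bool:
    """Binary-search the table for the range that could contain ch."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1  # last range whose start is <= cp
    return i >= 0 and cp <= ALLOWED_RANGES[i][1]


def is_bare_key(key: str) -> bool:
    """A bare key is non-empty and every character is in the table."""
    return key != "" and all(is_allowed(ch) for ch in key)
```

With a table like this, `is_bare_key("Sinn_F\u00e9in")` is accepted while `is_bare_key("\u037e")` (GREEK QUESTION MARK) is rejected. The lookup itself does not need a Unicode library, only the table.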

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

[1]: Aside: I encountered this just the other day as I created a TOML
     file with all UK election results since 1945, which looks like:

         [1950]
         Labour       = [13_266_176, 315, 617]
         Conservative = [12_492_404, 298, 619]
         Liberal      = [ 2_621_487,   9, 475]
         Sinn_Fein    = [    23_362,   0,   2]

     That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just
     wrote it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this issue Sep 23, 2023
I believe this would greatly improve things and solve all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and after looking at a bunch of different options I believe
this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later. Being conservative for these types of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this has gone largely unnoticed and
  undiscussed before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar one doesn't". This
  shows up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and see if it works. Now it seems to work at first, and
  then (possibly months later) it seems to "break".

  It should either allow everything or nothing. This in-between is just
  horrible. From the user's perspective this seems like a bug in the
  TOML parser, but it's not: it's a bug in the specification.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '＃' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.
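
To make the a.toml to e.toml examples above a bit more concrete, here is a small Python sketch that looks up the general category of each of those characters with the standard `unicodedata` module. The helper names `describe` and `is_letter_or_digit` are made up for illustration, and reading "letters and digits" as "category L* or Nd" is an assumption on my part, not necessarily the exact rule in the proposal.

```python
import unicodedata

# The five look-alike characters from the a.toml .. e.toml examples.
SAMPLES = ["\u037e", "\u0387", "\u2013", "\u207b", "\uff03"]


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME (category)'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})"


def is_letter_or_digit(ch: str) -> bool:
    """Assumed rule: any letter category (L*) or decimal digit (Nd)."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd"


if __name__ == "__main__":
    for ch in SAMPLES:
        print(describe(ch), "allowed" if is_letter_or_digit(ch) else "rejected")
```

Under this assumed rule all five samples are rejected consistently, because none of them is in a letter or digit category, while something like `é` (category Ll) is accepted.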

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the code, adding multibyte support in the first place will
  probably be harder, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0, etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports this; I hesitated over it for a
  long time, and we can also use a more recent version. I feel this
  gives us a nice balance between reasonable interoperability and
  future-proofing.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it is really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this issue Sep 23, 2023
I believe this would greatly improve things and solve all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and after looking at a bunch of different options I believe
this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later. Being conservative for these types of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them and the specification even strongly discourages people from
  using them, which is why this has gone largely unnoticed and
  undiscussed before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar one doesn't". This
  shows up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all.

  People don't read specifications, nor should they. People try
  something and see if it works. Now it seems to work at first, and
  then (possibly months later) it seems to "break".

  It should either allow everything or nothing. This in-between is just
  horrible. From the user's perspective this seems like a bug in the
  TOML parser, but it's not: it's a bug in the specification.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '＃' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

- Maps to identifiers in more (though not all) languages. We discussed
  whether TOML keys are "strings" or "identifiers" last week in toml-lang#966 and
  while views differ (mostly because they're both) it seems to me that
  making it map *closer* is better. This is a minor issue, but it's
  nice.

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table. You
  already need this with TOML 1.0, it's just that the range tables
  become larger.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the tomlc99 code, adding multibyte support at all will be the
  harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0, etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports this; I hesitated over it for a
  long time, and we can also use a more recent version. I feel this
  gives us a nice balance between reasonable interoperability and
  future-proofing.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it is really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)
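
On the "easily auto-generated" point in the first trade-off above: a table of start/stop ranges can be built by walking every code point and merging consecutive allowed ones. The Python sketch below does that with the standard `unicodedata` module. It assumes "letters and digits" means general categories L* and Nd, which may not match the proposal exactly, and it uses whatever Unicode version the running Python ships with rather than Unicode 9, so the number of ranges it produces will not necessarily be 716. The function names are made up for this example.

```python
import sys
import unicodedata


def is_letter_or_digit(cp: int) -> bool:
    """Assumed rule: any letter category (L*) or decimal digit (Nd)."""
    cat = unicodedata.category(chr(cp))
    return cat.startswith("L") or cat == "Nd"


def build_ranges() -> list:
    """Return sorted, inclusive (start, stop) ranges of allowed code points."""
    ranges = []
    start = None  # first code point of the run currently being collected
    for cp in range(sys.maxunicode + 1):
        if is_letter_or_digit(cp):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))  # run ended just before cp
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


def format_table(ranges: list) -> str:
    """Render the ranges one per line as hex literals, for pasting into source."""
    return "\n".join(f"{{0x{a:04X}, 0x{b:04X}}}," for a, b in ranges)
```

Calling `build_ranges()` once at build time and pasting the output of `format_table` into the parser source is one way to avoid a runtime dependency on a Unicode database.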

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this issue Sep 23, 2023
I believe this would greatly improve things and solve all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and I've made it no secret that I feel the current
trade-off is a bad one. After looking at a bunch of different options I
believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later, unlike what we have now, which is "well I think it probably
  won't cause any problems, based on what these 5 European/American guys
  think, but if it does: we won't be able to correct it".

  Being conservative for these types of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them, which is why this has gone largely unnoticed and
  undiscussed before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar one doesn't". This
  shows up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all (especially outside of the
  Latin character range by the way, which shows the Euro/US bias in how
  it's written).

  People don't read specifications in great detail, nor should they.
  People try something and see if it works. Now it seems to work at
  first, and then (possibly months or years later) it seems to
  "suddenly break". From the user's perspective this seems like
  a bug in the TOML parser, but it's not: it's a bug in the
  specification. It should either allow everything or nothing. This
  in-between is confusing and horrible.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have a minimum potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '＃' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

  Confusables are also an issue across different scripts (Latin and
  Cyrillic being the well-known case), but this is less of an issue
  since it's not syntax, and it's also fundamentally unavoidable in any
  multi-script environment.

- Maps closer to identifiers in more (though not all) languages. We
  discussed whether TOML keys are "strings" or "identifiers" last week
  in toml-lang#966 and while views differ (mostly because they're both) it seems
  to me that making it map *closer* is better. This is a minor issue,
  but it's nice.
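
To illustrate the normalisation point in the list above, here is a short Python sketch using the key from the footnote. It shows that the same visible key can be two different code point sequences, and that rejecting combining marks rules out the decomposed spelling. The helper name `has_combining_mark` is made up for this example, and treating every M* category as "combining" is an assumption about what the proposal excludes.

```python
import unicodedata


def has_combining_mark(key: str) -> bool:
    """Return True if any code point in key is a mark (Mn, Mc or Me)."""
    return any(unicodedata.category(ch).startswith("M") for ch in key)


# Two spellings that render identically but are different strings.
composed = "Sinn_F\u00e9in"     # é as the single code point U+00E9
decomposed = "Sinn_Fe\u0301in"  # e followed by U+0301 COMBINING ACUTE ACCENT

if __name__ == "__main__":
    print(composed == decomposed)                                # different strings
    print(unicodedata.normalize("NFC", decomposed) == composed)  # same after NFC
    print(has_combining_mark(composed), has_combining_mark(decomposed))
```

If combining marks are not allowed in bare keys, only the composed spelling can be written as a bare key, so a parser never has to decide whether the two spellings name the same key.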

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table. You
  already need this with TOML 1.0, it's just that the range tables
  become larger.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the tomlc99 code, adding multibyte support at all will be the
  harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2.
  Go is Unicode 8.0, etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports this; I hesitated over it for a
  long time, and we can also use a more recent version. I feel this
  gives us a nice balance between reasonable interoperability and
  future-proofing.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look,
  rather than adjusting TOML to what tooling supports. AFAIK no one uses
  the ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it is really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)

Another solution I tried is restricting the code ranges; I twice tried
to do this (with some months in-between) and spent a long time looking
at Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
arp242 added a commit to arp242/toml that referenced this issue Sep 23, 2023
I believe this would greatly improve things and solve all the issues,
mostly. It's a bit more complex, but not overly so, and can be
implemented without a Unicode library without too much effort. It offers
a good middle ground, IMHO.

I don't think there are ANY perfect solutions here and that *anything*
will be a trade-off. That said, I do believe some trade-offs are better
than others, and I've made it no secret that I feel the current
trade-off is a bad one. After looking at a bunch of different options I
believe this is by far the best path for TOML.

Advantages:

- This is what I would consider the "minimal set" of characters we need
  to add for reasonable international support, meaning we can't really
  make a mistake with this by accidentally allowing too much.

  We can add new ranges in TOML 1.2 (or even change the entire approach,
  although I'd be very surprised if we need to), based on actual
  real-world feedback, but any approach we will take will need to
  include letters and digits from all scripts.

  This is a strong argument in favour of this and a huge improvement: we
  can't really do anything wrong here in a way that we can't correct
  later, unlike what we have now, which is "well I think it probably
  won't cause any problems, based on what these 5 European/American guys
  think, but if it does: we won't be able to correct it".

  Being conservative for these types of things is good!

- This solves the normalisation issues, since combining characters are
  no longer allowed in bare keys, so it becomes a moot point.

  For quoted keys normalisation is mostly a non-issue because few people
  use them, which is why this has gone largely unnoticed and
  undiscussed before the "Unicode in bare keys" PR was merged.[1]

- It's consistent in what we allow: no "this character is allowed, but
  this very similar other thing isn't, what gives?!"

  Note that toml-lang#954 was NOT about "I want all emojis to work" per se, but
  "this character works fine, but this very similar one doesn't". This
  shows up in a number of things aside from emojis:

      a.toml:
              Input:   ; = 42  # U+037E GREEK QUESTION MARK (Other_Punctuation)
              Error:   line 1: expected '.' or '=', but got ';' instead

      b.toml:
              Input:   · = 42  # U+0387 GREEK ANO TELEIA (Other_Punctuation)
              Error:   (none)

      c.toml:
              Input:   – = 42  # U+2013 EN DASH (Dash_Punctuation)
              Error:   line 1: expected '.' or '=', but got '–' instead

      d.toml:
              Input:   ⁻ = 42  # U+207B SUPERSCRIPT MINUS (Math_Symbol)
              Error:   (none)

      e.toml:
              Input:   ＃x = "commented ... or is it?"  # U+FF03 FULLWIDTH NUMBER SIGN (Other_Punctuation)
              Error:   (none)

  "Some punctuation is allowed but some isn't" is hard to explain, and
  also not what the specification says: "Punctuation, spaces, arrows,
  box drawing and private use characters are not allowed." In reality, a
  lot of punctuation IS allowed, but not all (especially outside of the
  Latin character range by the way, which shows the Euro/US bias in how
  it's written).

  People don't read specifications in great detail, nor should they.
  People try something and see if it works. Now it seems to work on
  first approximation, and then (possibly months or years later) it
  seems to "suddenly break". From the user's perspective this seems like
  a bug in the TOML parser, but it's not: it's a bug in the
  specification. It should either allow everything or nothing. This
  in-between is confusing and horrible.

  There is no good way to communicate this other than "these codepoints,
  which cover most of what you'd write in a sentence, except when it
  doesn't".

  In contrast, "we allow letters and digits" is simple to spec, simple
  to communicate, and should have minimal potential for confusion. The
  current spec disallows some things seemingly almost arbitrarily while
  allowing other very similar characters.

- This avoids a long list of confusable special TOML characters; some
  were mentioned above but there are many more:

      '＃' U+FF03     FULLWIDTH NUMBER SIGN (Other_Punctuation)
      '＂' U+FF02     FULLWIDTH QUOTATION MARK (Other_Punctuation)
      '﹟' U+FE5F     SMALL NUMBER SIGN (Other_Punctuation)
      '﹦' U+FE66     SMALL EQUALS SIGN (Math_Symbol)
      '﹐' U+FE50     SMALL COMMA (Other_Punctuation)
      '︲' U+FE32     PRESENTATION FORM FOR VERTICAL EN DASH (Dash_Punctuation)
      '˝'  U+02DD     DOUBLE ACUTE ACCENT (Modifier_Symbol)
      '՚'  U+055A     ARMENIAN APOSTROPHE (Other_Punctuation)
      '܂'  U+0702     SYRIAC SUBLINEAR FULL STOP (Other_Punctuation)
      'ᱹ'  U+1C79     OL CHIKI GAAHLAA TTUDDAAG (Modifier_Letter)
      '₌'  U+208C     SUBSCRIPT EQUALS SIGN (Math_Symbol)
      '⹀'  U+2E40     DOUBLE HYPHEN (Dash_Punctuation)
      '࠰'  U+0830     SAMARITAN PUNCTUATION NEQUDAA (Other_Punctuation)

  Is this a big problem? I guess it depends; I can certainly imagine an
  Armenian speaker accidentally leaving an Armenian apostrophe.

  Confusables are also an issue with different scripts (Latin and
  Cyrillic is well-known), but this is less of an issue since it's not
  syntax, and also something that's fundamentally unavoidable in any
  multi-script environment.

- Maps closer to identifiers in more (though not all) languages. We
  discussed whether TOML keys are "strings" or "identifiers" last week
  in toml-lang#966 and while views differ (mostly because they're both) it seems
  to me that making it map *closer* is better. This is a minor issue,
  but it's nice.
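
To make the "letters and digits" rule from the list above concrete, here
is a rough sketch in Python using the standard `unicodedata` module. The
helper name `is_letter_or_digit` and the exact category set are my own
assumptions for illustration; they are not the wording of the proposal,
and a real parser would use a generated range table rather than a
runtime Unicode database.

```python
import unicodedata

# Assumed category set for illustration only: cased and uncased letters
# plus decimal digits. Modifier letters (Lm), marks (M*), punctuation (P*)
# and symbols (S*) are left out here.
ALLOWED_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Nd"}


def is_letter_or_digit(ch: str) -> bool:
    """Return True if a single character falls in the assumed allowed set."""
    return unicodedata.category(ch) in ALLOWED_CATEGORIES


# A few of the characters discussed above, plus some ordinary letters/digits.
samples = [
    "\u037e",  # GREEK QUESTION MARK
    "\u0387",  # GREEK ANO TELEIA
    "\u2013",  # EN DASH
    "\u207b",  # SUPERSCRIPT MINUS
    "\uff03",  # FULLWIDTH NUMBER SIGN
    "\u055a",  # ARMENIAN APOSTROPHE
    "\u00e9",  # LATIN SMALL LETTER E WITH ACUTE
    "\u044f",  # CYRILLIC SMALL LETTER YA
    "\u0663",  # ARABIC-INDIC DIGIT THREE
]

for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {is_letter_or_digit(ch)}")
```

With this kind of rule all of the punctuation and symbol examples are
rejected in the same way, instead of some being accepted and some not.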

That does not mean it's perfect; as I mentioned all solutions come with
a trade-off. The ones made here are:

- The biggest issue by far is that the check to see if a character is
  valid may become more complex for some languages and environments that
  can't rely on a Unicode database being present.

  However, implementing this check is trivial logic-wise: it just needs
  to loop over every character and check if it's in a range table. You
  already need this with TOML 1.0, it's just that the range tables
  become larger.

  The downside is it needs a somewhat large-ish "allowed characters"
  table with 716 start/stop ranges, which is not ideal, but entirely
  doable and easily auto-generated. It's ~164 lines hard-wrapped at
  column 80 (or ~111 lines hard-wrapped at col 120). tomlc99 is 2,387
  lines, so that seems within the limits of reason (actually, reading
  through the tomlc99 code, adding multibyte support at all will be the
  harder part, with this range table being a minor part).

- There's a new Unicode version roughly every year or so, and the way
  it's written now means it's "locked" to Unicode 9 or, optionally, a
  later version. This is probably fine: Apple's APFS filesystem (which
  does normalisation) is "locked" to Unicode 9.0; HFS+ was Unicode 3.2;
  Go is Unicode 8.0; etc. I don't think this is really much of an issue
  in practice.

  I chose Unicode 9 as everyone supports it; I deliberated over this for
  a long time, and we can also use a more recent version. I feel this gives
  us a nice balance between reasonable interoperability while also
  future-proofing things.

- ABNF doesn't support Unicode. This is a tooling issue, and in my
  opinion the tooling should adjust to how we want TOML to look, rather
  than adjusting TOML to what tooling supports. AFAIK no one uses the
  ABNF directly in code, and it's merely "informational".

  I'm not happy with this, but personally I think this should be a
  non-issue when considering what to do here. We're not the only people
  running into this limitation, and it is really something that the IETF
  should address in a new RFC or something ("Extra Augmented BNF"?)
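
The range-table check described in the first trade-off above could look
roughly like the following Python sketch. The `RANGES` table is a tiny
hand-written placeholder (the real one would be auto-generated and much
larger), and the names `in_table` and `is_bare_key` are hypothetical, not
taken from any existing TOML parser.

```python
import bisect

# Placeholder table of inclusive (start, stop) codepoint ranges, sorted by
# start. These few entries only stand in for the full generated table.
RANGES = [
    (0x30, 0x39),    # ASCII digits 0-9
    (0x41, 0x5A),    # ASCII A-Z
    (0x61, 0x7A),    # ASCII a-z
    (0xC0, 0xD6),    # Latin-1 letters before the multiplication sign
    (0xD8, 0xF6),    # Latin-1 letters between the multiplication and division signs
    (0xF8, 0xFF),    # Latin-1 letters after the division sign
    (0x410, 0x44F),  # basic Cyrillic letters
]
STARTS = [start for start, _ in RANGES]


def in_table(cp: int) -> bool:
    """Binary-search the sorted range table for a codepoint."""
    i = bisect.bisect_right(STARTS, cp) - 1  # last range starting at or before cp
    return i >= 0 and cp <= RANGES[i][1]


def is_bare_key(key: str) -> bool:
    """Loop over every character: '-' and '_' are allowed, the rest must be in the table."""
    return len(key) > 0 and all(ch in "-_" or in_table(ord(ch)) for ch in key)


print(is_bare_key("Sinn_F\u00e9in"))   # precomposed e-acute is inside a listed range
print(is_bare_key("Sinn_Fe\u0301in"))  # combining acute accent is not in the table
print(is_bare_key("\u2013"))           # EN DASH is not in the table
```

The lookup itself stays the same no matter how many ranges are in the
table; only the generated data grows.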

Another solution I tried was restricting the code ranges; I tried this
twice (with some months in-between) and spent a long time looking at
Unicode blocks and ranges, and I found this impractical: we'll end up
with a long list which isn't all that different from what this proposal
adds.

Fixes toml-lang#954
Fixes toml-lang#966
Fixes toml-lang#979
Ref toml-lang#687
Ref toml-lang#891
Ref toml-lang#941

---

[1]:
Aside: I encountered this just the other day as I created a TOML file
with all UK election results since 1945, which looks like:

     [1950]
     Labour       = [13_266_176, 315, 617]
     Conservative = [12_492_404, 298, 619]
     Liberal      = [ 2_621_487,   9, 475]
     Sinn_Fein    = [    23_362,   0,   2]

That should be Sinn_Féin, but "Sinn_Féin" seemed ugly, so I just wrote
it as Sinn_Fein. This is what most people seem to do.
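
To illustrate the normalisation point with this key, here is a small
Python sketch using the standard `unicodedata` module. It only shows
that the same visible text can be two different strings, one of which
contains a combining mark; how a given TOML parser compares keys is not
something this snippet claims to show.

```python
import unicodedata

# The same visible key in composed (NFC) and decomposed (NFD) form.
nfc = unicodedata.normalize("NFC", "Sinn_F\u00e9in")
nfd = unicodedata.normalize("NFD", nfc)

# The two forms render the same but are different strings.
same = nfc == nfd
print(same, len(nfc), len(nfd))

# Only the decomposed form contains a combining mark (category M*),
# which is exactly the kind of character no longer allowed in bare keys.
nfc_has_mark = any(unicodedata.category(c).startswith("M") for c in nfc)
nfd_has_mark = any(unicodedata.category(c).startswith("M") for c in nfd)
print(nfc_has_mark, nfd_has_mark)
```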