Add unique display names as variants #20
How did you decide which token to include a conflicting |
I didn't; if there's a conflict then nobody gets it. |
My intention was for the This change makes a lot of sense, though, and now I think it's worth considering whether the |
Hmmm, I didn't think of that angle (suppose I could have dug back through our discussions on the spec). Deciding this downstream for the varslib tokenizer was my intention, but it felt a bit bodgy.
Good idea. It'd make sense to make an attribute though it's not pressing either way. |
So I proposed the initial form of the current format here:
<lang code="en">
<encodings>
<ti-ascii>Field A</ti-ascii>
<!-- are there any other viable encodings?
cause if not this does seem maybe a tad silly -->
</encodings>
<names>
<display>Field B</display>
<accessible>Field C</accessible>
<variant>Field D</variant>
<!-- more variants as need be -->
</names>
</lang>
Tari then suggested we simplify the structure a bit by not having a separate
<lang code="en" ti-ascii="Field A">
<display>Field B</display>
<accessible>Field C</accessible>
<variant>Field D</variant>
<!-- additional variants if applicable -->
</lang>
Here's his reasoning, which at least partially resonated with everyone since it's now the format:
It's clear that it is worthwhile to distinguish encodings and names, named so literally in my proposal. Since there is only one encoding we care about, and the number of distinct names could be pretty big, it made sense to collapse everything down to a single containing tag, with Thus, there is due cause to keep
This bit received basically no reply, neither enthusiasm nor objection. Since
<lang code="en" ti-ascii="..." display="..." accessible="...">
<name>...</name>
<name>...</name>
...
</lang>
It's certainly "easier" to parse, though human readability starts to diminish as the attribute line can get quite long (e.g. |
I'm OK with having all the unique things as attributes as it makes it clearer that they are indeed unique (per We could still have something "visually easy" to parse by having them on separate lines if needed?
<lang code="en"
ti-ascii="..."
display="..."
accessible="...">
<name>...</name>
<name>...</name>
...
</lang> |
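As a rough illustration of the "easier to parse" point, here is a minimal sketch that reads the attribute-based layout with Python's standard `xml.etree.ElementTree`. The sample document and its field values are placeholders modelled on the snippets above, not real sheet data, and this is not the project's actual parser.

```python
import xml.etree.ElementTree as ET

# Placeholder document in the proposed attribute-based layout.
SAMPLE = """
<lang code="en"
      ti-ascii="Field A"
      display="Field B"
      accessible="Field C">
  <name>Field D</name>
  <name>Field E</name>
</lang>
""".strip()

lang = ET.fromstring(SAMPLE)

# Every per-language unique field arrives in a single dict of attributes.
unique_fields = dict(lang.attrib)

# The repeatable fields are the <name> children.
names = [element.text for element in lang.findall("name")]

print(unique_fields["display"], names)
```

Line breaks between attributes are ordinary whitespace to an XML parser, so the multi-line form reads the same as the single-line one.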
I do rather like collapsing things down; simplifies parsing and validation for sure. However, thinking it over I'd vote for making In Pythonistic terms (if we pretend that versions and langs don't exist for a moment) you can write
print(token.ti_ascii)
print(token.display)
since they are attributes of the token, but you cannot form the dictionaries
{token.ti_ascii: token for token in tokens}
{token.display: token for token in tokens}
without collisions. Indeed, you can only do this with
{name: token for token in tokens for name in token.names}
This is what should be, IMO, the defining difference between attributes and child tags. |
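To make the collision argument concrete, here is a minimal sketch using a hypothetical `Token` dataclass. The class and every value in it are invented for illustration; the only thing carried over from the discussion is the idea that two tokens can share a display name while their names stay distinct.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    # Hypothetical stand-in for a token entry; not the real data model.
    ti_ascii: str
    display: str
    names: list[str] = field(default_factory=list)


# Invented values: two different tokens that are both displayed as "e".
tokens = [
    Token(ti_ascii="A1", display="e", names=["stat_e"]),
    Token(ti_ascii="A2", display="e", names=["lower_e"]),
]

# Keying by display name silently keeps only the last token with that display.
by_display = {token.display: token for token in tokens}

# Keying by name keeps every token, assuming names are unique across tokens.
by_name = {name: token for token in tokens for name in token.names}

print(len(by_display), len(by_name))
```

The first dictionary ends up smaller than the token list, which is the collision described above; the second does not lose anything as long as no two tokens claim the same name.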
Adds all <display> names as <variant> names for the given token so long as
Both of those conditions are just the requirements for any variant name. Not every display name satisfies them (e.g. stats "e" and lowercase "e"), hence why display names are not used by trie.py (and hence tivars_lib_py) for tokenization. Adding the ones that are unique as variants is the most straightforward way to make such names available.
The new sheet passed validation on my end, though it's worth double-checking.
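The promotion step might look roughly like the sketch below. The exact conditions are not recoverable from the description above, so this assumes two: no other token shares the display name, and no token already uses it as a variant. The dict layout and the function name `add_unique_display_names` are hypothetical, not taken from the repository.

```python
from collections import Counter


def add_unique_display_names(tokens: list[dict]) -> None:
    """Add a token's display name to its variants when nothing else claims it.

    Each token is a hypothetical dict: {"display": str, "variants": list[str]}.
    """
    display_counts = Counter(token["display"] for token in tokens)
    # Assumption: a name already used as a variant anywhere counts as a conflict.
    taken = {name for token in tokens for name in token["variants"]}

    for token in tokens:
        display = token["display"]
        # On a conflict nobody gets the name, matching the reply in the thread.
        if display_counts[display] == 1 and display not in taken:
            token["variants"].append(display)


# Invented sample: two tokens displayed as "e", one displayed as "sin(".
sample = [
    {"display": "e", "variants": []},
    {"display": "e", "variants": []},
    {"display": "sin(", "variants": []},
]
add_unique_display_names(sample)
print([token["variants"] for token in sample])
```

Only the third token gains a variant here; the two tokens that share "e" are left untouched.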