It is not clear how to write a multipoint entity in your entity list #737

Open
StoneCypher opened this issue Oct 30, 2021 · 5 comments
@StoneCypher
Contributor

Some HTML entities, such as nsubE, are represented as multiple Unicode characters (in this case, U+2AC5 U+0338). This is particularly common in math symbols that use a slash to strike through another symbol.

It is not immediately clear to me how to represent that in the kramdown entity list.

If you could tell me how to represent that one case please, I would happily extend it to the remainder.

@gettalong
Owner

The conversion from codepoint to character string is done like this:

[code_point].pack('U*')

This will create the correct string representation for any Unicode codepoint. So as long as the entity consists of a single code point, this will work.
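For illustration, here is a minimal sketch of that conversion in plain Ruby. The variable names are made up for this example and are not taken from kramdown's source.

```ruby
# Decimal code point for the HTML entity "Ouml" (U+00D6).
code_point = 214

# Array#pack with the 'U*' directive encodes each integer as a UTF-8 character.
str = [code_point].pack('U*')

puts str                     # prints the character Ö
puts str.codepoints.inspect  # the code points round-trip back to [214]
```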

Does that clear it up?

@StoneCypher
Contributor Author

StoneCypher commented Oct 31, 2021

I apologize. That isn't what I meant.

      ENTITY_TABLE = [
        [913, 'Alpha'],
        [914, 'Beta'],
        [915, 'Gamma'],

...

        [213, 'Otilde'],
        [214, 'Ouml'],
        [215, 'times'],

Please pretend for a moment that there was no dedicated capital-O umlaut Ö character. There is, of course; it's U+00D6, represented here as decimal 214. But let's pretend there wasn't.

In Unicode, there is a dedicated combining diaeresis, and you can attach it to other characters to construct the character you need. As such, you could make the character with capital O (O, U+004F) followed by the combining diaeresis (◌̈, U+0308). We prefer the pre-combined Ö because fonts trying to typeset symbols above letters typically do a bad job, sorting is a nightmare, and so on, but you can actually have an umlaut over whatever, including the poop emoji, if you really want to.
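As a rough sketch of the difference between the two forms, the following plain Ruby builds both and compares them. It relies only on core `Array#pack` and `String#unicode_normalize`; the variable names are illustrative.

```ruby
# The decomposed form: capital O followed by a combining diaeresis.
decomposed  = [0x004F, 0x0308].pack('U*')
# The precomposed form: the dedicated Ö character.
precomposed = [0x00D6].pack('U*')

puts decomposed.codepoints.length   # 2
puts precomposed.codepoints.length  # 1
puts decomposed == precomposed      # false: different code point sequences

# NFC normalization composes the pair into the single precomposed character.
puts decomposed.unicode_normalize(:nfc) == precomposed  # true
```

Both strings typically render as the same glyph, which is why the distinction only shows up when comparing, sorting, or counting code points.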

So for a moment, please pretend that I want to rewrite your Ouml rule to emit two codepoints and construct the Ö instead of using the real one. In this case it's silly, but this is legitimately how quite a few entities (particularly in math) are written. For example, ⫅̸ - not subset-equal - is written as U+2288, the dedicated math symbol, but really should be written as U+2AC5 (decimal 10949) subset-equal followed by U+0338 (decimal 824), the negating slash (the logic symbol), instead.

And that's hard to think about, so we're lying, and talking about O umlaut.

If for some stupid reason I wanted to emit U+004F U+0308 for Ouml in this table, how would I do it?

@gettalong
Owner

gettalong commented Nov 1, 2021

I see. This is not possible with how the entities are implemented in kramdown, though the conversion itself is easily doable with [code_point1, code_point2].pack('U*').

As far as I can see, however, all the HTML5 entities are just single-codepoint entities? So this should not be a problem here.

Edit: Sorry, I just looked at the PR and not at the original issue - there you also listed entities with two codepoints. Supporting those entails revamping the entity implementation.
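To make the idea concrete, here is a hedged sketch of one way a table could carry either one or several code points per entity. The table layout, `sketch_table`, `to_string`, and `sketch_map` are all hypothetical names invented for this example; this is not kramdown's actual entity implementation, and a real change would involve more than this.

```ruby
# Hypothetical layout: each entry is [code point(s), name], where the first
# element is either one integer or an array of integers.
# This is NOT kramdown's actual ENTITY_TABLE format.
sketch_table = [
  [214, 'Ouml'],                # single code point, U+00D6
  [[0x2AC5, 0x0338], 'nsubE']   # two code points, as given in the issue
]

# Array() wraps a lone integer and leaves an array unchanged, so both
# shapes can go through the same pack call.
to_string = ->(code_points) { Array(code_points).pack('U*') }

# Build a name => string lookup from the table.
sketch_map = sketch_table.each_with_object({}) do |(points, name), map|
  map[name] = to_string.call(points)
end

# Show the code points stored for the two-codepoint entity.
puts sketch_map['nsubE'].codepoints.map { |cp| format('U+%04X', cp) }.join(' ')
```

Normalizing the input with `Array()` keeps existing single-integer rows valid, so only the multi-codepoint rows would need the new array form under this assumption.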

@StoneCypher
Contributor Author

There are a few.

| Name | Symbol | Codepoints |
| --- | --- | --- |
| ncongdot | ⩭̸ | U+2A6D (10861), U+0338 (824) |
| nleqslant, nles, NotLessSlantEqual | ⩽̸ | U+2A7D (10877), U+0338 (824) |
| ngeqslant, nges, NotGreaterSlantEqual | ⩾̸ | U+2A7E (10878), U+0338 (824) |

There are 65 other than these three.

@StoneCypher
Contributor Author

> Edit: Sorry, I just looked at the PR and not at the original issue - there you also listed entities with two codepoints. Supporting those entails revamping the entity implementation.

❤️ ❤️ ❤️

Thank you
