Suggestion: Include HTML character entity reference names in output and in search #22

ctsrc · 2019-10-26T22:01:48Z

With your tool it is possible to look up Unicode characters by various criteria, as stated in your README, including "unicode name" and "also known as".

In HTML, named character escape sequences are available for things like the less than and the greater than signs, but also for quite a few other characters.

Back in the day, before UTF-8 encoding support was widespread, we'd use the ISO-8859-1 encoding for our HTML and we'd use named character escape sequences for characters like æ, ø, å for example.

Some of those names stuck with me, and I sometimes search for those characters by those names on Google if I am on a machine where inputting said characters directly is not possible or just too cumbersome.

Even on my MacBook Air, where I can generally long-press certain keys to access other characters, some applications implement text input that does not support long-press. In those cases I switch to another window and either long-press there or search on Google, whichever is more convenient at the time (which depends on which other windows I happen to have on screen at that moment).

I pretty much always have at least one terminal window open at any time, and if I don't then opening the terminal is fast and simple.

Prior to purchasing my MacBook Air, when I was running Linux on a ThinkPad, I made a few simple shell scripts named after the HTML character entity references for the characters I most commonly needed: æ, ø, å, Æ, Ø, Å (aelig, oslash, aring, AElig, Oslash, Aring). When executed, each script would print the UTF-8 encoded byte sequence for its character.

oslash

ø
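For illustration, the behaviour of those scripts can be approximated in a few lines of Python. This is only a sketch, not the original shell scripts: the function name `entity_to_utf8` is made up here, and it assumes Python's standard `html.entities.html5` table (whose keys end in `;`) as the data source.

```python
import html.entities


def entity_to_utf8(name: str) -> bytes:
    """Return the UTF-8 bytes for an HTML entity name such as "oslash".

    Hypothetical helper. Keys in html.entities.html5 include the
    trailing semicolon, so it is appended before the lookup.
    The lookup is case-sensitive and raises KeyError for unknown names.
    """
    char = html.entities.html5[name + ";"]
    return char.encode("utf-8")


if __name__ == "__main__":
    # Mirrors running the "oslash" script in a UTF-8 terminal.
    print(entity_to_utf8("oslash").decode("utf-8"))
```

Because the lookup is case-sensitive, `entity_to_utf8("Oslash")` and `entity_to_utf8("oslash")` give different byte sequences, just as the two separate scripts did.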

A full list of all HTML character entity references can be found at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML

Aside from the six mentioned above, the most notable ones for me personally are laquo, raquo, ndash, mdash, eacute and Eacute. They are all useful IMO, though, and if you agree to include the HTML character entity reference names, it would make the most sense to include them all.

So to get to the point, my suggestion is that based upon the table at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML, an additional field be added for applicable characters in the output for chars.

Some examples of what the output of chars would look like:

Example 1

chars U+002A

ASCII 2/a,  42, 0x2a, 0052, bits 00101010
Width: 1, prints as *
Unicode name: ASTERISK
Also known as: Star, Splat, Aster, Times, Gear, Dingle, Bug, Twinkle, Glob
HTML entity names: ast, midast

Example 2

chars U+00AE

LATIN1 ae, 174, 0xae, 0256, bits 10101110
Width: 1 (2 in CJK context), prints as ®
Quotes as \u{ae}
Unicode name: REGISTERED SIGN
HTML entity names: reg, circledR, REG

Example 3

chars U+00C6

LATIN1 c6, 198, 0xc6, 0306, bits 11000110
Width: 1 (2 in CJK context), prints as Æ
Upper case. Downcases to æ
Quotes as \u{c6}
Unicode name: LATIN CAPITAL LETTER AE
HTML entity name: AElig

In the examples above, a field named "HTML entity names" (where multiple names exist) or "HTML entity name" (where only one name exists) has been added.
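A rough sketch of how such a field could be derived, written in Python purely for illustration (chars itself is not implemented this way). The helper names `entity_names` and `entity_field` are hypothetical, and the data comes from Python's standard `html.entities.html5` table rather than from any data file in this repository. That table lists some names both with and without a trailing semicolon, so the semicolon is stripped and duplicates are collapsed.

```python
import html.entities
from typing import List, Optional


def entity_names(char: str) -> List[str]:
    """Collect all HTML entity names that map to exactly this character."""
    names = {
        key.rstrip(";")
        for key, value in html.entities.html5.items()
        if value == char
    }
    # Sorting is an arbitrary choice made here for stable output.
    return sorted(names)


def entity_field(char: str) -> Optional[str]:
    """Format the proposed output line, or None if no entity name exists."""
    names = entity_names(char)
    if not names:
        return None
    label = "HTML entity name" if len(names) == 1 else "HTML entity names"
    return f"{label}: {', '.join(names)}"


if __name__ == "__main__":
    print(entity_field("\u00c6"))  # HTML entity name: AElig
    print(entity_field("*"))       # HTML entity names: ast, midast
```

The order of names within the field is left open in the examples above; this sketch sorts them, which puts upper-case names first.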

Furthermore, I request that a case-sensitive search be performed on this field where present, so that one can search by entity name and get results like those shown in the following examples:

Example 1

chars Oslash

LATIN1 d8, 216, 0xd8, 0330, bits 11011000
Width: 1 (2 in CJK context), prints as Ø
Upper case. Downcases to ø
Quotes as \u{d8}
Unicode name: LATIN CAPITAL LETTER O WITH STROKE
HTML entity name: Oslash

Example 2

chars oslash

LATIN1 f8, 248, 0xf8, 0370, bits 11111000
Width: 1 (2 in CJK context), prints as ø
Lower case. Upcases to Ø
Quotes as \u{f8}
Unicode name: LATIN SMALL LETTER O WITH STROKE
HTML entity name: oslash
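To make the case-sensitivity requirement concrete, here is a small Python sketch of the search side. It is an assumption-laden illustration, not a proposal for the actual implementation: `search_entity` is a hypothetical name, and the lookup uses Python's standard `html.entities.html5` table together with `unicodedata`.

```python
import html.entities
import unicodedata
from typing import Optional, Tuple


def search_entity(query: str) -> Optional[Tuple[str, str]]:
    """Case-sensitive lookup of an HTML entity name.

    Returns (code point, Unicode name) for single-character entities,
    or None when the name is unknown or maps to several code points.
    """
    char = html.entities.html5.get(query + ";")
    if char is None or len(char) != 1:
        return None
    return f"U+{ord(char):04X}", unicodedata.name(char)


if __name__ == "__main__":
    print(search_entity("Oslash"))  # ('U+00D8', 'LATIN CAPITAL LETTER O WITH STROKE')
    print(search_entity("oslash"))  # ('U+00F8', 'LATIN SMALL LETTER O WITH STROKE')
    print(search_entity("OSLASH"))  # None
```

A case-insensitive match would not be able to tell Oslash from oslash, which is why the exact spelling matters for this field.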


ctsrc · 2019-10-26T22:26:17Z

Better than the Wikipedia list I initially linked to would be to use the official list of character entities at https://html.spec.whatwg.org/multipage/named-characters.html

23: Add data file and retrieval script for character reference names supported by HTML r=antifuchs a=ctsrc This PR relates to issue #22 and is a first step towards the request I made in that issue. Co-authored-by: Erik Nordstrøm <erik@nordstroem.no>

ctsrc mentioned this issue Oct 26, 2019

Add data file and retrieval script for character reference names supported by HTML #23

Merged
