Ranges in DerivedNames for Rugrep #2

noraj · 2024-08-01T23:05:37Z

The issue is that DerivedName.txt contains several code point ranges:

➜ cat data/DerivedName.txt | grep '\.\.'
3400..4DBF    ; CJK UNIFIED IDEOGRAPH-*
4E00..9FFF    ; CJK UNIFIED IDEOGRAPH-*
F900..FA6D    ; CJK COMPATIBILITY IDEOGRAPH-*
FA70..FAD9    ; CJK COMPATIBILITY IDEOGRAPH-*
17000..187F7  ; TANGUT IDEOGRAPH-*
18B00..18CD5  ; KHITAN SMALL SCRIPT CHARACTER-*
18D00..18D08  ; TANGUT IDEOGRAPH-*
1B170..1B2FB  ; NUSHU CHARACTER-*
20000..2A6DF  ; CJK UNIFIED IDEOGRAPH-*
2A700..2B739  ; CJK UNIFIED IDEOGRAPH-*
2B740..2B81D  ; CJK UNIFIED IDEOGRAPH-*
2B820..2CEA1  ; CJK UNIFIED IDEOGRAPH-*
2CEB0..2EBE0  ; CJK UNIFIED IDEOGRAPH-*
2EBF0..2EE5D  ; CJK UNIFIED IDEOGRAPH-*
2F800..2FA1D  ; CJK COMPATIBILITY IDEOGRAPH-*
30000..3134A  ; CJK UNIFIED IDEOGRAPH-*
31350..323AF  ; CJK UNIFIED IDEOGRAPH-*

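Currently, this code converts the hexadecimal code point field to an integer:

https://github.com/Acceis/unisec/blob/6ba37eaa22cefa1995dba8312d6cdbc4f1234904/lib/unisec/rugrep.rb#L41

This ignores ranges, because String#to_i stops parsing at the first character that is not a valid hexadecimal digit, so only the start of the range is kept: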

irb(main):001:0> '2CEB0..2EBE0'.to_i(16)
=> 183984
irb(main):002:0> '2CEB0'.to_i(16)
=> 183984

So each range is displayed as a single code point (the first one of the range):

➜ unisec grep '' | grep 'NUSHU'
U+16FE1 𖿡    NUSHU ITERATION MARK
U+1B170 𛅰    NUSHU CHARACTER-*
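
A minimal sketch of how a line could be parsed without losing the end of the range. The helper name `parse_derived_name_line` is hypothetical and is not part of unisec; it assumes the `CODE ; NAME` and `START..END ; NAME` layouts shown in the excerpt above.

```ruby
# Hypothetical helper (not part of unisec): parse one DerivedName.txt line
# into a Range of code points and a name.
def parse_derived_name_line(line)
  code_field, name = line.split(';', 2).map(&:strip)
  first, last = code_field.split('..', 2)
  # A single code point has no '..', so the range collapses to one element.
  last ||= first
  [(first.to_i(16)..last.to_i(16)), name]
end

range, name = parse_derived_name_line('1B170..1B2FB  ; NUSHU CHARACTER-*')
puts format('U+%04X to U+%04X (%d code points): %s',
            range.first, range.last, range.size, name)
```

Returning a Range for every line, including single code points, would let the caller treat both cases the same way whichever display solution is chosen.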

Solutions:

1. Parse this better to display ranges with a horizontal ellipsis
   - Pros: keeps one command
   - Cons: adds code complexity, output is inconsistent (bad for piping to other commands)
2. Add a sub-command named ranges
   - Pros: keeps consistent output for the grep command
   - Cons: splits the feature into several commands
3. Append the range end to the name, e.g. U+1B170 𛅰 NUSHU CHARACTER-* (up to U+1B2FB)
   - Pros: keeps one command, code point column is consistent
   - Cons: name column becomes unreliable (information appended to the name)
4. Expand the name dynamically
   - Pros: no inconsistency, no unreliable column
   - Cons: for matching results the output will be quite large for not much value and will become unreadable
5. Add a third field for comments
   - New behavior just for a few exceptions

E.g. of name expansion for idea n°4: http://www.unicode.org/charts/beta/nameslist/n_F900.html
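
A rough sketch of what the name expansion of idea n°4 could look like. It assumes the `*` in a range name stands for the code point written in uppercase hexadecimal (e.g. NUSHU CHARACTER-1B170); `expand_name` is a hypothetical helper, not an existing unisec method.

```ruby
# Hypothetical helper: expand a "-*" range name for one code point,
# assuming '*' stands for the code point in uppercase hexadecimal.
def expand_name(pattern, code_point)
  pattern.sub('*', format('%04X', code_point))
end

# Expand only the first three code points of the NUSHU range for illustration.
(0x1B170..0x1B172).each do |cp|
  puts format('U+%04X %s', cp, expand_name('NUSHU CHARACTER-*', cp))
end
```

Expanding a whole range this way produces one line per code point, which is where the "quite large output" concern comes from: the largest CJK range alone would yield tens of thousands of lines.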
