Ranges in DerivedNames for Rugrep #2

Open
noraj opened this issue Aug 1, 2024 · 0 comments

Owner

noraj commented Aug 1, 2024

The issue is that DerivedName.txt contains several ranges:

➜ cat data/DerivedName.txt | grep '\.\.'
3400..4DBF    ; CJK UNIFIED IDEOGRAPH-*
4E00..9FFF    ; CJK UNIFIED IDEOGRAPH-*
F900..FA6D    ; CJK COMPATIBILITY IDEOGRAPH-*
FA70..FAD9    ; CJK COMPATIBILITY IDEOGRAPH-*
17000..187F7  ; TANGUT IDEOGRAPH-*
18B00..18CD5  ; KHITAN SMALL SCRIPT CHARACTER-*
18D00..18D08  ; TANGUT IDEOGRAPH-*
1B170..1B2FB  ; NUSHU CHARACTER-*
20000..2A6DF  ; CJK UNIFIED IDEOGRAPH-*
2A700..2B739  ; CJK UNIFIED IDEOGRAPH-*
2B740..2B81D  ; CJK UNIFIED IDEOGRAPH-*
2B820..2CEA1  ; CJK UNIFIED IDEOGRAPH-*
2CEB0..2EBE0  ; CJK UNIFIED IDEOGRAPH-*
2EBF0..2EE5D  ; CJK UNIFIED IDEOGRAPH-*
2F800..2FA1D  ; CJK COMPATIBILITY IDEOGRAPH-*
30000..3134A  ; CJK UNIFIED IDEOGRAPH-*
31350..323AF  ; CJK UNIFIED IDEOGRAPH-*

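Currently, this code converts the hexadecimal code point string to an integer:

https://github.com/Acceis/unisec/blob/6ba37eaa22cefa1995dba8312d6cdbc4f1234904/lib/unisec/rugrep.rb#L41

This silently ignores ranges, because String#to_i stops parsing at the first character that is not a valid digit for the given base, so everything from `..` onwards is dropped: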

irb(main):001:0> '2CEB0..2EBE0'.to_i(16)
=> 183984
irb(main):002:0> '2CEB0'.to_i(16)
=> 183984

So each range is displayed as a single code point (its first one):

➜ unisec grep '' | grep 'NUSHU'
U+16FE1 𖿡    NUSHU ITERATION MARK
U+1B170 𛅰    NUSHU CHARACTER-*
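
A minimal sketch of range-aware parsing, assuming every data line has the form `CODEPOINT[..CODEPOINT] ; NAME` as in the excerpt above. `parse_derived_name_line` is a hypothetical helper for illustration, not existing unisec code.

```ruby
# Hypothetical helper (not part of unisec): parse one data line of DerivedName.txt.
# Returns [range, name]; a single code point becomes a one-element range.
def parse_derived_name_line(line)
  cp_field, name = line.split(';', 2).map(&:strip)
  first, last = cp_field.split('..', 2)
  last ||= first # no ".." in the field: single code point
  [(first.to_i(16)..last.to_i(16)), name]
end

range, name = parse_derived_name_line('1B170..1B2FB  ; NUSHU CHARACTER-*')
puts format('U+%04X..U+%04X %s', range.first, range.last, name)
# prints "U+1B170..U+1B2FB NUSHU CHARACTER-*"
```

Keeping both ends of the range at parse time leaves all the display options below open, since the end of the range is no longer lost.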

Possible solutions:

  1. Parse the file better and display ranges with a horizontal ellipsis
    • Pros: keeps one command
    • Cons: adds code complexity, output is inconsistent (bad for piping to other commands)
  2. Add a sub-command named ranges
    • Pros: keeps the output of the grep command consistent
    • Cons: splits the feature across several commands
  3. Append the range end to the name, e.g. U+1B170 𛅰 NUSHU CHARACTER-* (up to U+1B2FB)
    • Pros: keeps one command, code point column is consistent
    • Cons: name column becomes unreliable (information appended to the name)
  4. Expand the name dynamically
    • Pros: no inconsistency, no unreliable column
    • Cons: for matching results, the output becomes very large for little added value, and hard to read
  5. Add a third field for comments
    • New behavior just for a few exceptions
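
To compare the output shapes of options 1 and 3, here is a rough sketch. Both formatters are hypothetical (neither exists in unisec), and the layout is an assumption modeled on the `unisec grep` output shown earlier.

```ruby
# Hypothetical formatters for illustration only; neither exists in unisec.

# Option 1: put the whole range in the code point column, with a horizontal ellipsis.
def format_with_ellipsis(range, name)
  format('U+%04X…U+%04X %s', range.first, range.last, name)
end

# Option 3: keep a single code point column and append the range end to the name.
def format_with_suffix(range, name)
  char = [range.first].pack('U') # UTF-8 string for the first code point
  format('U+%04X %s %s (up to U+%04X)', range.first, char, name, range.last)
end

nushu = 0x1B170..0x1B2FB
puts format_with_ellipsis(nushu, 'NUSHU CHARACTER-*')
puts format_with_suffix(nushu, 'NUSHU CHARACTER-*')
```

With option 1 the first column changes shape for range entries, while option 3 keeps the first two columns parseable and moves the irregularity into the name column.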

Example of name expansion for idea 4: http://www.unicode.org/charts/beta/nameslist/n_F900.html
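
Idea 4 could be sketched as follows, assuming the `*` in a range entry stands for the code point in uppercase hexadecimal (as in names like CJK UNIFIED IDEOGRAPH-4E00). `expand_names` is a hypothetical helper, not existing unisec code.

```ruby
# Hypothetical helper (not part of unisec): expand a "-*" range entry
# into one [code_point, name] pair per code point.
# Assumption: "*" is replaced by the code point in uppercase hexadecimal.
def expand_names(range, pattern)
  # Without a block, return an enumerator so pairs are produced on demand.
  return enum_for(:expand_names, range, pattern) unless block_given?

  range.each do |cp|
    yield [cp, pattern.sub('*', format('%04X', cp))]
  end
end

expand_names(0x1B170..0x1B2FB, 'NUSHU CHARACTER-*').first(2).each do |cp, name|
  puts format('U+%04X %s', cp, name)
end
# prints "U+1B170 NUSHU CHARACTER-1B170"
# then   "U+1B171 NUSHU CHARACTER-1B171"
```

Returning an enumerator avoids building the full list up front, which matters for a range such as 4E00..9FFF with 20,992 entries. It does not address the readability concern listed under idea 4 when many expanded names match a search.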
