Unicode dump #6
WRT the right side display: currently more than 2/3 of byte values show up as |
Thank you very much for the feedback.
I also think this would be really cool! I actually started writing this tool after implementing this feature in my
Obviously this would only work if the file actually contains Unicode text in some form of encoding. So I think this would need to be behind a command-line flag, right? Slightly related: do you know a monospace font / terminal that properly keeps Unicode symbols such as |
A text file, or strings inside an executable, would be Unicode. A piece of non-text, on the other hand, has a quite meaningless right side, so it will be a jumble of (accidentally) valid Unicode and bad characters. The latter are usually represented by U+FFFD.
When a given font lacks a glyph you want, fontconfig will borrow it from some other font, but alas, it doesn't know how to choose well. That indeed often results in the character spilling. I once wrote a tool to trigger such spillage bugs (they are usually affected by the order in which you write characters), but I know of no way to actually avoid them. One idea would be to make a font with a cell aspect ratio of 1:2 and give it every non-letter symbol (for example, by copying from other fonts), correctly placed within those bounds. Any modern font has this proportion or something very close to it, since CJK users are really upset if they get anything else (misalignment with their characters), while the rest of us don't care about aspect. Such a font could be set to a high priority so that fontconfig would prefer it for fallback glyphs. But that idea would require 1. actually making such a font and 2. installing it on everyone's machine, so it won't help us in the immediate timeframe.
But what kind of encoding? Would you just try to interpret everything as UTF-8? Wouldn't that fail in the string-within-executable example because of the random bytes before the string starts?
Outside of Windows, any encoding but UTF-8 is long dead, and even Windows is finally undergoing the migration (you can actually set the system locale to UTF-8 now). Here are some statistics from two years ago. In theory, one could check the system locale (LC_CTYPE/LANG/LC_ALL), but by now I wouldn't bother and would just assume UTF-8. If the data doesn't form valid UTF-8, the tool would show such bytes as invalid, just like it does today for everything in 0-31 and 127-255.
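To make the "show such bytes as invalid" idea concrete, here is a minimal Python sketch. The function name `split_utf8` is made up for illustration and is not part of any existing tool; it simply walks the input and returns either a decoded character or, for bytes that don't form valid UTF-8, a single byte marked as invalid.

```python
from typing import Iterator, Optional, Tuple


def split_utf8(data: bytes) -> Iterator[Tuple[bytes, Optional[str]]]:
    """Yield (raw_bytes, char) pairs; char is None for an invalid byte.

    Hypothetical helper: tries the shortest prefix (1 to 4 bytes) that
    decodes strictly as UTF-8, otherwise reports one byte as invalid.
    """
    i = 0
    while i < len(data):
        for n in range(1, 5):
            chunk = data[i:i + n]
            try:
                ch = chunk.decode("utf-8")  # strict: rejects overlongs etc.
            except UnicodeDecodeError:
                continue
            yield chunk, ch
            i += len(chunk)
            break
        else:
            # No prefix decoded: treat this single byte as invalid.
            yield data[i:i + 1], None
            i += 1


if __name__ == "__main__":
    for raw, ch in split_utf8(b"A\xff\xe2\x82\xac\x80"):
        print(raw.hex(), "invalid" if ch is None else ch)
```

A real dump would of course want something faster than repeated trial decoding; the point is only that valid sequences and stray bytes can be told apart byte by byte.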
I somehow thought that the UTF-8 decoding process could be messed up if you were to put random bytes in front of a valid UTF-8 sequence, but apparently UTF-8 has some really nice properties (it is self-synchronizing) that prevent this from happening.
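A small Python sketch of that property, with made-up helper names (`is_continuation`, `resync`): continuation bytes always have the bit pattern 10xxxxxx, so a decoder that lands in the middle of a character can skip forward to the next boundary, and junk in front of a string only costs replacement characters for the junk itself.

```python
def is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes always look like 0b10xxxxxx.
    return (byte & 0xC0) == 0x80


def resync(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a continuation byte."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


payload = "héllo €".encode("utf-8")
# A lone continuation byte plus a truncated 3-byte sequence in front.
noisy = bytes([0x9F, 0xE2, 0x82]) + payload
decoded = noisy.decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(decoded)  # junk becomes U+FFFD, the text itself survives
```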
While I agree that the vast majority of files are likely to be UTF-8, I think that for a tool that shows hex dumps it would be useful to support other encodings, and potentially harmful to show data in an incorrect encoding. For example, at a previous job I routinely wound up looking at hex dumps of minidump files which included UTF-16LE strings. I think there are some default behaviors you could take that'd be very reasonable:
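As a rough illustration of why the choice of encoding matters, here is a Python sketch. `text_panel` and its `encoding` parameter are hypothetical and only stand in for whatever option a tool might offer; they are not an existing interface.

```python
def text_panel(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes for the right-hand column (hypothetical helper).

    Undecodable input is replaced with U+FFFD rather than raising.
    """
    return data.decode(encoding, errors="replace")


# The kind of string one might find in a minidump: UTF-16LE text.
raw = "hexyl".encode("utf-16-le")

if __name__ == "__main__":
    print(repr(text_panel(raw)))               # UTF-8 view: NULs interleaved
    print(repr(text_panel(raw, "utf-16-le")))  # UTF-16LE view: readable
```

Read as UTF-8, the bytes are technically valid but every other character is a NUL; read as UTF-16LE, the string comes out intact.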
|
I am going to close this ticket, as I don't really see this happening inside |
A version of hexdump I wanted to write, but never got around to, would:
- display Unicode characters on the right side, plus a colored filler (…? ␣?) for the space not taken (for wcwidth()==1 it's 1 extra character for <U+0800, 2 extra for <=U+FFFF, 3 extra for non-BMP; likewise for wcwidth()==2). Obviously, a CJK (width 2) character might go over the right edge, but as it always takes at least three bytes, there'll be enough space in the next line.
- display controls using appropriate symbols; Unicode provides a set for this exact task at U+2400. Some specific characters could be better shown using more readable symbols: ♪ for 07, ↵ for 0a, ⎋ for 1b, ⌫ for 08, ⌦ for 7f, and especially one of ⌀ ⌾ ⍉ ␥ for 00. You do get a lot of nulls, newlines and escapes in dumps...
- with an option, display Unicode code points rather than individual bytes on the left side, i.e. "U+FFFD" instead of "ef bf bd".
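The three points above can be sketched in Python roughly as follows. All names here are made up for illustration, the pick of ⌀ for 00 is arbitrary among the candidates, and `display_width` is only an approximation of wcwidth() based on `unicodedata.east_asian_width` (the standard library has no wcwidth).

```python
import unicodedata

# More readable substitutes for a few control bytes, as suggested above.
READABLE = {0x07: "♪", 0x0A: "↵", 0x1B: "⎋", 0x08: "⌫", 0x7F: "⌦", 0x00: "⌀"}


def control_symbol(byte: int) -> str:
    """Map a control byte to a printable symbol."""
    if byte in READABLE:
        return READABLE[byte]
    if byte < 0x20:
        # Control Pictures block: U+2400..U+241F mirror 0x00..0x1F.
        return chr(0x2400 + byte)
    raise ValueError(f"not a control byte: {byte:#04x}")


def display_width(ch: str) -> int:
    # Assumption: wide/fullwidth characters take 2 cells, the rest 1.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def filler_cells(ch: str) -> int:
    """Cells left over when one character stands in for its UTF-8 bytes."""
    return len(ch.encode("utf-8")) - display_width(ch)


def code_point_label(ch: str) -> str:
    """Left-side label: the code point instead of the individual bytes."""
    return f"U+{ord(ch):04X}"


if __name__ == "__main__":
    for ch in "Aé€中":
        print(ch, code_point_label(ch), "filler:", filler_cells(ch))
    print(control_symbol(0x09), control_symbol(0x1B))
```

The filler count is just the UTF-8 byte length minus the display width, which matches the numbers given above for width-1 characters (1, 2 and 3 extra cells).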
Sounds like your hexyl would be a perfect place to implement the above...