
Unicode dump #6

Closed
kilobyte opened this issue Nov 15, 2018 · 8 comments
Labels
enhancement New feature or request

Comments

@kilobyte

A version of hexdump I wanted to write but never got around to, would do:

  • display Unicode characters on the right side, plus a colored filler (…? ␣?) for the space not taken (for wcwidth()==1 it's 1 extra character for <U+0800, 2 extra for <=U+FFFF, 3 extra for non-BMP; likewise for wcwidth()==2). Obviously, a CJK (width 2) character might go over the right edge, but as it always takes at least three bytes, there'll be enough space in the next line.

  • display controls using appropriate symbols — Unicode provides a set for this exact task at U+2400. Some specific characters could be better shown using more readable symbols: for 07, for 0a, for 1b, for 08, for 7f, and especially one of for for 00. You do get a lot of nulls, newlines and escapes in dumps...

  • with an option, display Unicode code points rather than individual bytes on the left side, i.e. "U+FFFD" instead of "ef bf bd".
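The arithmetic behind the first bullet can be sketched as follows. This is a hypothetical illustration, not hexyl code: the helper names are made up, and `display_width` is a crude stand-in for wcwidth() (real code would use something like the unicode-width crate). The control-picture mapping from the second bullet is included too.

```rust
/// Map a C0 control byte (0x00..=0x1f) or DEL (0x7f) to its
/// Unicode Control Pictures glyph in the U+2400 block.
fn control_picture(b: u8) -> Option<char> {
    match b {
        0x00..=0x1f => char::from_u32(0x2400 + b as u32), // ␀ ␁ … ␟
        0x7f => Some('\u{2421}'),                         // ␡ (DELETE)
        _ => None,
    }
}

/// Crude stand-in for wcwidth(): only a few common wide (CJK-ish)
/// ranges count as 2; everything else counts as 1.
fn display_width(c: char) -> usize {
    match c as u32 {
        0x1100..=0x115f | 0x2e80..=0x9fff | 0xac00..=0xd7a3
        | 0xf900..=0xfaff | 0xff00..=0xff60 => 2,
        _ => 1,
    }
}

/// Filler columns needed on the right side so that each input byte
/// still corresponds to one output column: bytes consumed minus
/// columns actually occupied by the glyph.
fn filler_columns(c: char) -> usize {
    c.len_utf8() - display_width(c)
}

fn main() {
    assert_eq!(control_picture(0x0a), Some('\u{240a}')); // ␊
    assert_eq!(filler_columns('é'), 1);  // 2 bytes, width 1
    assert_eq!(filler_columns('€'), 2);  // 3 bytes, width 1
    assert_eq!(filler_columns('漢'), 1); // 3 bytes, width 2
    println!("ok");
}
```

Note that `filler_columns` can never go negative: width-1 characters take at least 1 byte, and width-2 characters (as the bullet points out) take at least 3 bytes in UTF-8.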

Sounds like your hexyl would be a perfect place to implement the above...

@kilobyte
Author

WRT the right side display: currently, more than 2/3 of byte values show up as a period (.). It'd be good to reserve that character for 2e and nothing else.

@sharkdp
Owner

sharkdp commented Nov 19, 2018

Thank you very much for the feedback.

  • display controls using appropriate symbols — Unicode provides a set for this exact task at U+2400. Some specific characters could be better shown using more readable symbols: for 07, for 0a, for 1b, for 08, for 7f, and especially one of for for 00. You do get a lot of nulls, newlines and escapes in dumps...

I also think this would be really cool! I actually started writing this tool after implementing this feature in my bat tool, which does something similar for text-based files.

  • display Unicode characters on the right side, plus a colored filler (…? ␣?) for the space not taken (for wcwidth()==1 it's 1 extra character for <U+0800, 2 extra for <=U+FFFF, 3 extra for non-BMP; likewise for wcwidth()==2). Obviously, a CJK (width 2) character might go over the right edge, but as it always takes at least three bytes, there'll be enough space in the next line.

Obviously this would only work if the file actually contains Unicode text in some form of encoding. So I think this would need to be behind a command-line flag, right?

Slightly related: do you know a monospace font / terminal that properly keeps Unicode symbols such as ␊ or ↵ in a single "cell" in the terminal? On my terminal (terminator with Fira Code), it looks like this:
[screenshot: the glyph spills outside its terminal cell]

@sharkdp added the "enhancement" (New feature or request) label on Nov 19, 2018
@kilobyte
Author

• display Unicode characters on the right side, plus a colored filler (…? ␣?) for the space not taken

Obviously this would only work if the file actually contains Unicode text in some form of encoding. So I think this would need to be behind a command-line flag, right?

A text file, or strings inside an executable, would be Unicode. A piece of non-text, on the other hand, has a quite meaningless right side anyway, so it'll show a jumble of (accidentally) valid Unicode and bad characters. The latter are usually represented by U+FFFD (or, in earlier conventions, by ? or .). No need for a flag, as there's no regression from the current state.
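The behaviour described here (bad bytes rendered as U+FFFD, no flag required) is essentially what a lossy UTF-8 decode already gives you. A minimal sketch in Rust, using the standard library's lossy decoder:

```rust
fn main() {
    // Valid UTF-8 passes through untouched.
    assert_eq!(String::from_utf8_lossy("héllo".as_bytes()), "héllo");

    // A byte that can never appear in UTF-8 (0xff) becomes U+FFFD,
    // while the surrounding valid bytes still decode normally.
    let bad = [0x66, 0x6f, 0xff, 0x6f]; // f, o, <invalid>, o
    assert_eq!(String::from_utf8_lossy(&bad), "fo\u{fffd}o");
    println!("ok");
}
```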

Slightly related: do you know a monospace font / terminal that properly keeps Unicode symbols such as ␊ or ↵ in a single "cell" in the terminal?

When a given font lacks a glyph you want, fontconfig will borrow it from some other font — but alas, it doesn't know how to choose well. That indeed often results in the character spilling over.

I once wrote a tool to trigger such spillage bugs (they are usually affected by the order in which you write the characters), but I know of no way to actually avoid them.

One idea would be to make a font with a cell aspect ratio of 1:2 and give it every non-letter symbol (e.g. by copying from other fonts), correctly placed within those bounds. Any modern font already has this proportion, or something very close: CJK users are really upset by anything else, since it misaligns their characters, while the rest of us don't care about the aspect. Such a font could be set to a high priority, so fontconfig would prefer it for fallback glyphs.

But that idea would require (1) actually making such a font and (2) installing it on everyone's machine — so it won't help us in the immediate timeframe.

@sharkdp
Owner

sharkdp commented Nov 19, 2018

A text file, or strings inside an executable, would be Unicode.

But what kind of encoding? Would you just try to interpret everything as UTF-8? Wouldn't that fail in the string-within-executable-example because of the random bytes before the string starts?

@kilobyte
Copy link
Author

Outside of Windows, any encoding but UTF-8 is long dead, and even Windows is finally undergoing the migration (you can actually set the system locale to UTF-8 now). Here are some statistics from two years ago.

In theory, one could check the system locale (LC_CTYPE/LANG/LC_ALL), but by now, I wouldn't bother and just assume UTF-8.

If the data doesn't form valid UTF-8, the tool would show such bytes as invalid — just like it does today for everything in 0..31 and 127..255.
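For completeness, the locale check being dismissed here would look roughly like this. This is a sketch assuming the usual POSIX precedence (LC_ALL overrides LC_CTYPE, which overrides LANG); `locale_is_utf8` is a made-up name, not an existing API:

```rust
use std::env;

/// Decide whether the environment claims a UTF-8 character type,
/// following POSIX precedence: LC_ALL > LC_CTYPE > LANG.
fn locale_is_utf8() -> bool {
    let val = env::var("LC_ALL")
        .or_else(|_| env::var("LC_CTYPE"))
        .or_else(|_| env::var("LANG"))
        .unwrap_or_default();
    let v = val.to_ascii_lowercase();
    // Locale names spell it either "UTF-8" or "utf8".
    v.contains("utf-8") || v.contains("utf8")
}

fn main() {
    env::set_var("LC_ALL", "en_US.UTF-8");
    assert!(locale_is_utf8());
    println!("ok");
}
```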

@sharkdp
Owner

sharkdp commented Nov 19, 2018

I somehow thought that the UTF-8 decoding process could be thrown off by putting random bytes in front of a valid UTF-8 sequence, but apparently UTF-8 has a really nice property — it is self-synchronizing — that prevents this from happening.
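A quick way to see the self-synchronizing property in action (a sketch leaning on Rust's built-in lossy decoder, nothing hexyl-specific): continuation bytes (10xxxxxx) are disjoint from lead bytes, so a decoder resynchronizes at the next lead byte and only the garbage itself is replaced.

```rust
fn main() {
    // A lone continuation byte is invalid as the start of a sequence…
    let mut bytes = vec![0x9f];
    // …but the valid UTF-8 that follows is decoded untouched.
    bytes.extend_from_slice("é!".as_bytes());

    let decoded = String::from_utf8_lossy(&bytes);
    assert_eq!(decoded, "\u{fffd}é!"); // only the stray byte is lost
    println!("ok");
}
```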

@luser

luser commented Jan 10, 2020

In theory, one could check the system locale (LC_CTYPE/LANG/LC_ALL), but by now, I wouldn't bother and just assume UTF-8.

While I agree that the vast majority of files in practice are likely to be UTF-8, I think that for a tool showing hex dumps it's likely to be useful to view other encodings, and potentially harmful to show data in an incorrect encoding. For example, at a previous job I routinely wound up looking at hex dumps of minidump files, which included UTF-16LE strings.

I think there are some default behaviors you could take that'd be very reasonable:

  • Check for a BOM at the start of the file and honor that encoding
  • Check for filetype-specific encoding markers (for example, HTML has a defined set of steps browsers can use to look for <meta charset=, Python has a defined way to specify the file encoding, etc.)
  • Allow explicitly specifying an encoding as a commandline option
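The first bullet could be sketched like this. The `sniff_bom` helper is hypothetical, but the byte-order-mark patterns themselves are standard; note that the UTF-32 BOMs must be checked before the UTF-16 ones, since UTF-32LE's FF FE 00 00 starts with UTF-16LE's FF FE.

```rust
/// Return a label for a recognized byte-order mark at the start of
/// the data, if any.
fn sniff_bom(data: &[u8]) -> Option<&'static str> {
    if data.starts_with(&[0xef, 0xbb, 0xbf]) {
        Some("UTF-8")
    } else if data.starts_with(&[0xff, 0xfe, 0x00, 0x00]) {
        Some("UTF-32LE")
    } else if data.starts_with(&[0x00, 0x00, 0xfe, 0xff]) {
        Some("UTF-32BE")
    } else if data.starts_with(&[0xff, 0xfe]) {
        Some("UTF-16LE")
    } else if data.starts_with(&[0xfe, 0xff]) {
        Some("UTF-16BE")
    } else {
        None
    }
}

fn main() {
    assert_eq!(sniff_bom(&[0xef, 0xbb, 0xbf, b'h', b'i']), Some("UTF-8"));
    assert_eq!(sniff_bom(&[0xff, 0xfe, b'h', 0x00]), Some("UTF-16LE"));
    assert_eq!(sniff_bom(b"plain"), None);
    println!("ok");
}
```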

@sharkdp
Owner

sharkdp commented May 26, 2020

I am going to close this ticket, as I don't really see this happening inside hexyl. Let's keep hexyl focused on viewing generic binary files.
