
Unicode dump #6

Closed
kilobyte opened this issue Nov 15, 2018 · 8 comments
Labels
enhancement New feature or request

Comments

@kilobyte

A version of hexdump I wanted to write but never got around to, would do:

  • display Unicode characters on the right side, plus a colored filler (…? ␣?) for the space not taken (for wcwidth()==1 it's 1 extra character for <U+0800, 2 extra for <=U+FFFF, 3 extra for non-BMP; likewise for wcwidth()==2). Obviously, a CJK (width 2) character might go over the right edge, but as it always takes at least three bytes, there'll be enough space in the next line.

  • display controls using appropriate symbols — Unicode provides a set for this exact task at U+2400. Some specific characters could be better shown using more readable symbols: for 07, for 0a, for 1b, for 08, for 7f, and especially one of for for 00. You do get a lot of nulls, newlines and escapes in dumps...

  • with an option, display Unicode code points rather than individual bytes on the left side, i.e. "U+FFFD" instead of "ef bf bd".
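The arithmetic behind the first bullet can be sketched as follows. This is a hypothetical illustration, not hexyl code: the helper names are made up, and `display_width` is a crude stand-in for wcwidth() (real code would use something like the unicode-width crate). The control-picture mapping from the second bullet is included too.

```rust
/// Map a C0 control byte (0x00..=0x1f) or DEL (0x7f) to its
/// Unicode Control Pictures glyph in the U+2400 block.
fn control_picture(b: u8) -> Option<char> {
    match b {
        0x00..=0x1f => char::from_u32(0x2400 + b as u32), // ␀ ␁ … ␟
        0x7f => Some('\u{2421}'),                         // ␡ (DELETE)
        _ => None,
    }
}

/// Crude stand-in for wcwidth(): only a few common wide (CJK-ish)
/// ranges count as 2; everything else counts as 1.
fn display_width(c: char) -> usize {
    match c as u32 {
        0x1100..=0x115f | 0x2e80..=0x9fff | 0xac00..=0xd7a3
        | 0xf900..=0xfaff | 0xff00..=0xff60 => 2,
        _ => 1,
    }
}

/// Filler columns needed on the right side so that each input byte
/// still corresponds to one output column: bytes consumed minus
/// columns actually occupied by the glyph.
fn filler_columns(c: char) -> usize {
    c.len_utf8() - display_width(c)
}

fn main() {
    assert_eq!(control_picture(0x0a), Some('\u{240a}')); // ␊
    assert_eq!(filler_columns('é'), 1);  // 2 bytes, width 1
    assert_eq!(filler_columns('€'), 2);  // 3 bytes, width 1
    assert_eq!(filler_columns('漢'), 1); // 3 bytes, width 2
    println!("ok");
}
```

Note that `filler_columns` can never go negative: width-1 characters take at least 1 byte, and width-2 characters (as the bullet points out) take at least 3 bytes in UTF-8.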

Sounds like your hexyl would be a perfect place to implement the above...

@kilobyte
Author

WRT the right side display: currently, more than 2/3 of byte values show up as a period (.). It'd be good to reserve that character for 2e and nothing else.

@sharkdp
Owner

sharkdp commented Nov 19, 2018

Thank you very much for the feedback.

  • display controls using appropriate symbols — Unicode provides a set for this exact task at U+2400. Some specific characters could be better shown using more readable symbols: for 07, for 0a, for 1b, for 08, for 7f, and especially one of for for 00. You do get a lot of nulls, newlines and escapes in dumps...

I also think this would be really cool! I actually started writing this tool after implementing this feature in my bat tool, which does something similar for text-based files.

  • display Unicode characters on the right side, plus a colored filler (…? ␣?) for the space not taken (for wcwidth()==1 it's 1 extra character for <U+0800, 2 extra for <=U+FFFF, 3 extra for non-BMP; likewise for wcwidth()==2). Obviously, a CJK (width 2) character might go over the right edge, but as it always takes at least three bytes, there'll be enough space in the next line.

Obviously this would only work if the file actually contains Unicode text in some form of encoding. So I think this would need to be behind a command-line flag, right?

Slightly related: do you know a monospace font / terminal that properly keeps Unicode symbols such as ␊ or ↵ in a single "cell" in the terminal? On my terminal (terminator with Fira Code), it looks like this:
[screenshot: the glyph spills outside its terminal cell]

@sharkdp added the "enhancement" (New feature or request) label on Nov 19, 2018
@kilobyte
Author

• display Unicode characters on the right side, plus a colored filler (…? ␣?) for the space not taken

Obviously this would only work if the file actually contains Unicode text in some form of encoding. So I think this would need to be behind a command-line flag, right?

A text file, or strings inside an executable, would be Unicode. A piece of non-text, on the other hand, has a quite meaningless right side anyway, so it'll show a jumble of (accidentally) valid Unicode and bad characters. The latter are usually represented by U+FFFD (or, in earlier conventions, by ? or .). No need for a flag, as there's no regression from the current state.
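The behaviour described here (bad bytes rendered as U+FFFD, no flag required) is essentially what a lossy UTF-8 decode already gives you. A minimal sketch in Rust, using the standard library's lossy decoder:

```rust
fn main() {
    // Valid UTF-8 passes through untouched.
    assert_eq!(String::from_utf8_lossy("héllo".as_bytes()), "héllo");

    // A byte that can never appear in UTF-8 (0xff) becomes U+FFFD,
    // while the surrounding valid bytes still decode normally.
    let bad = [0x66, 0x6f, 0xff, 0x6f]; // f, o, <invalid>, o
    assert_eq!(String::from_utf8_lossy(&bad), "fo\u{fffd}o");
    println!("ok");
}
```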

Slightly related: do you know a monospace font / terminal that properly keeps Unicode symbols such as ␊ or ↵ in a single "cell" in the terminal?

When a given font lacks a glyph you want, fontconfig will borrow it from some other font — but alas, it doesn't know how to choose well. That indeed often results in the character spilling over.

I once wrote a tool to trigger such spillage bugs (they are usually affected by the order in which you write the characters), but I know of no way to actually avoid them.

One idea would be to make a font with a cell aspect ratio of 1:2 and give it every non-letter symbol (e.g. by copying from other fonts), correctly placed within those bounds. Any modern font already has this proportion, or something very close: CJK users are really upset by anything else, since it misaligns their characters, while the rest of us don't care about the aspect. Such a font could be set to a high priority, so fontconfig would prefer it for fallback glyphs.

But that idea would require (1) actually making such a font and (2) installing it on everyone's machine — so it won't help us in the immediate timeframe.

@sharkdp
Owner

sharkdp commented Nov 19, 2018

A text file, or strings inside an executable, would be Unicode.

But what kind of encoding? Would you just try to interpret everything as UTF-8? Wouldn't that fail in the string-within-executable-example because of the random bytes before the string starts?

@kilobyte
Copy link
Author

Outside of Windows, any encoding but UTF-8 is long dead, and even Windows is finally undergoing the migration (you can actually set the system locale to UTF-8 now). Here are some statistics from two years ago.

In theory, one could check the system locale (LC_CTYPE/LANG/LC_ALL), but by now, I wouldn't bother and just assume UTF-8.

If the data doesn't form valid UTF-8, the tool would show such bytes as invalid — just like it does today for everything in 0..31 and 127..255.
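For completeness, the locale check being dismissed here would look roughly like this. This is a sketch assuming the usual POSIX precedence (LC_ALL overrides LC_CTYPE, which overrides LANG); `locale_is_utf8` is a made-up name, not an existing API:

```rust
use std::env;

/// Decide whether the environment claims a UTF-8 character type,
/// following POSIX precedence: LC_ALL > LC_CTYPE > LANG.
fn locale_is_utf8() -> bool {
    let val = env::var("LC_ALL")
        .or_else(|_| env::var("LC_CTYPE"))
        .or_else(|_| env::var("LANG"))
        .unwrap_or_default();
    let v = val.to_ascii_lowercase();
    // Locale names spell it either "UTF-8" or "utf8".
    v.contains("utf-8") || v.contains("utf8")
}

fn main() {
    env::set_var("LC_ALL", "en_US.UTF-8");
    assert!(locale_is_utf8());
    println!("ok");
}
```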

@sharkdp
Owner

sharkdp commented Nov 19, 2018

I somehow thought that the UTF-8 decoding process could be thrown off by putting random bytes in front of a valid UTF-8 sequence, but apparently UTF-8 has a really nice property — it is self-synchronizing — that prevents this from happening.
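A quick way to see the self-synchronizing property in action (a sketch leaning on Rust's built-in lossy decoder, nothing hexyl-specific): continuation bytes (10xxxxxx) are disjoint from lead bytes, so a decoder resynchronizes at the next lead byte and only the garbage itself is replaced.

```rust
fn main() {
    // A lone continuation byte is invalid as the start of a sequence…
    let mut bytes = vec![0x9f];
    // …but the valid UTF-8 that follows is decoded untouched.
    bytes.extend_from_slice("é!".as_bytes());

    let decoded = String::from_utf8_lossy(&bytes);
    assert_eq!(decoded, "\u{fffd}é!"); // only the stray byte is lost
    println!("ok");
}
```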

@luser

luser commented Jan 10, 2020

In theory, one could check the system locale (LC_CTYPE/LANG/LC_ALL), but by now, I wouldn't bother and just assume UTF-8.

While I agree that the vast majority of files in practice are likely to be UTF-8, I think that for a tool showing hex dumps it's likely to be useful to view other encodings, and potentially harmful to show data in an incorrect encoding. For example, at a previous job I routinely wound up looking at hex dumps of minidump files, which included UTF-16LE strings.

I think there are some default behaviors you could take that'd be very reasonable:

  • Check for a BOM at the start of the file and honor that encoding
  • Check for filetype-specific encoding markers (for example, HTML has a defined set of steps browsers can use to look for <meta charset=, Python has a defined way to specify the file encoding, etc.)
  • Allow explicitly specifying an encoding as a commandline option
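The first bullet could be sketched like this. The `sniff_bom` helper is hypothetical, but the byte-order-mark patterns themselves are standard; note that the UTF-32 BOMs must be checked before the UTF-16 ones, since UTF-32LE's FF FE 00 00 starts with UTF-16LE's FF FE.

```rust
/// Return a label for a recognized byte-order mark at the start of
/// the data, if any.
fn sniff_bom(data: &[u8]) -> Option<&'static str> {
    if data.starts_with(&[0xef, 0xbb, 0xbf]) {
        Some("UTF-8")
    } else if data.starts_with(&[0xff, 0xfe, 0x00, 0x00]) {
        Some("UTF-32LE")
    } else if data.starts_with(&[0x00, 0x00, 0xfe, 0xff]) {
        Some("UTF-32BE")
    } else if data.starts_with(&[0xff, 0xfe]) {
        Some("UTF-16LE")
    } else if data.starts_with(&[0xfe, 0xff]) {
        Some("UTF-16BE")
    } else {
        None
    }
}

fn main() {
    assert_eq!(sniff_bom(&[0xef, 0xbb, 0xbf, b'h', b'i']), Some("UTF-8"));
    assert_eq!(sniff_bom(&[0xff, 0xfe, b'h', 0x00]), Some("UTF-16LE"));
    assert_eq!(sniff_bom(b"plain"), None);
    println!("ok");
}
```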

@sharkdp
Owner

sharkdp commented May 26, 2020

I am going to close this ticket, as I don't really see this happening inside hexyl. Let's keep hexyl focused on viewing generic binary files.
