Possible to deal with East Asian wide character? #500

Closed

chuanconggao opened this issue Jan 23, 2016 · 9 comments

@chuanconggao

In East Asian languages like Chinese, Japanese, etc., some wide characters are rendered at twice the width of Latin characters. For example, given this CSV:

"C1","C2","C3"
"吃饭","睡觉","打豆豆"
"","睡觉","打人"

csvlook draws the following misaligned table:

|-----+----+------|
|  C1 | C2 | C3   |
|-----+----+------|
|  吃饭 | 睡觉 | 打豆豆  |
|     | 睡觉 | 打人   |
|-----+----+------|
@onyxfish
Collaborator

I'd love to find a way to support these languages better, but I'm guessing this is impossible to fix without going character by character and testing where each one falls in the Unicode tables. That sounds... very slow.

If anyone has a suggestion for a performant (and ideally simple) way to fix this, I'm all ears.

@chuanconggao
Author

According to Google, the standard library already has a solution:

unicodedata.east_asian_width(unichr)

Returns the east asian width assigned to the Unicode character unichr as string.

This StackOverflow post covers a lot more details:
http://stackoverflow.com/questions/23058564/checking-a-character-is-fullwidth-or-halfwidth-in-python
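
For example, a width-counting helper built on that function might look like the sketch below (display_width is a hypothetical name, not an existing csvkit or agate function):

import unicodedata

def display_width(text):
    # 'F' (fullwidth) and 'W' (wide) characters occupy two terminal
    # columns; everything else counts as one.
    return sum(2 if unicodedata.east_asian_width(c) in ('F', 'W') else 1
               for c in text)

display_width(u"吃饭")  # 4, whereas len(u"吃饭") == 2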

@onyxfish
Collaborator

Right, but you'd still have to iterate over, call that method on, and reformat every single character you write out, which could be prohibitively expensive. Then again, if we're talking mainly about csvlook, that might be a reasonable trade-off, since it's only meant for displaying a subset of rows anyway...

@chuanconggao
Author

I do not think utilities other than csvlook need to consider this special case, as it only affects rendering, and it would only add a constant factor to the runtime.

I just ran a simple test:

s = "今天你吃饭饭了么"

def f1():
    for i in xrange(10000000):
        len(s)

def f2():
    for i in xrange(10000000):
        for c in s:
            unicodedata.east_asian_width(c)

f1() takes around 600 ms, while f2() takes around 17 s. In a real situation, however, an LRU cache (with the cell content as key and its on-screen width as value) could help, since duplicate cell contents are common.

I think we could add an argument to handle this special case without losing performance in the general case.
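
As a sketch of that caching idea (the names here are illustrative, and functools.lru_cache requires Python 3.2+; a plain dict would serve the same role on Python 2):

import unicodedata
from functools import lru_cache

@lru_cache(maxsize=4096)
def cell_width(text):
    # Width is computed character by character only once per distinct
    # cell value; repeats are served from the cache.
    return sum(2 if unicodedata.east_asian_width(c) in ('F', 'W') else 1
               for c in text)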

@onyxfish
Collaborator

Yeah, I'd be happy to have a flag for this, at least for csvlook and maybe for other utilities (csvstat, etc.).

If you want to send a pull request that would be great!

@chuanconggao
Author

Sure. I will see what I can do. BTW, it seems csvkit is under heavy refactoring right now. When will the code be relatively stable?

@onyxfish
Collaborator

I'm glad you said that, because actually this change should be made in agate and then #515 should pull it into csvkit. Agate is pretty stable, so if you wanted to look at making the change there, it'll get pulled into csvkit when we refactor csvlook.

@chuanconggao
Author

Sure. I am just learning agate now; it is quite an interesting package. I will contribute the code to agate first. Thanks for the help.

@jpmckinney
Member

Cool - closing as the issue should be re-created on agate.
