Possible to deal with East Asian wide character? #500

Closed

chuanconggao opened this issue Jan 23, 2016 · 9 comments

@chuanconggao

In East Asian languages like Chinese, Japanese, etc., some wide characters are rendered at twice the width of Latin characters. For example, given this CSV:

"C1","C2","C3"
"吃饭","睡觉","打豆豆"
"","睡觉","打人"

csvlook draws the following misaligned table:

|-----+----+------|
|  C1 | C2 | C3   |
|-----+----+------|
|  吃饭 | 睡觉 | 打豆豆  |
|     | 睡觉 | 打人   |
|-----+----+------|
@onyxfish
Collaborator

I'd love to find a way to support these languages better, but I'm guessing this is impossible to fix without going character by character and testing where each one falls in the Unicode tables. That sounds... very slow.

If anyone has a suggestion for a performant (and ideally simple) way to fix this, I'm all ears.

@chuanconggao
Author

According to Google, the standard library already has a solution:

unicodedata.east_asian_width(unichr)

Returns the east asian width assigned to the Unicode character unichr as string.

This StackOverflow post covers a lot more details:
http://stackoverflow.com/questions/23058564/checking-a-character-is-fullwidth-or-halfwidth-in-python
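
For example, a width-counting helper built on that function might look like the sketch below (display_width is a hypothetical name, not an existing csvkit or agate function):

import unicodedata

def display_width(text):
    # 'F' (fullwidth) and 'W' (wide) characters occupy two terminal
    # columns; everything else counts as one.
    return sum(2 if unicodedata.east_asian_width(c) in ('F', 'W') else 1
               for c in text)

display_width(u"吃饭")  # 4, whereas len(u"吃饭") == 2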

@onyxfish
Collaborator

Right, but you'd still have to iterate over, call that method on, and reformat every single character you write out, which could be prohibitively expensive. Then again, if we're talking mainly about csvlook, that might be a reasonable trade-off, since it's only meant for displaying a subset of rows anyway...

@chuanconggao
Author

I do not think utilities other than csvlook need to consider this special case, as it only affects rendering, and it would only add a constant factor to the runtime.

I just ran a simple test:

s = "今天你吃饭饭了么"

def f1():
    for i in xrange(10000000):
        len(s)

def f2():
    for i in xrange(10000000):
        for c in s:
            unicodedata.east_asian_width(c)

f1() takes around 600 ms, while f2() takes around 17 s. In a real situation, however, an LRU cache (with the cell content as key and its on-screen width as value) could help, since duplicate cell contents are common.

I think we could add an argument to handle this special case without losing performance in the general case.
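
As a sketch of that caching idea (the names here are illustrative, and functools.lru_cache requires Python 3.2+; a plain dict would serve the same role on Python 2):

import unicodedata
from functools import lru_cache

@lru_cache(maxsize=4096)
def cell_width(text):
    # Width is computed character by character only once per distinct
    # cell value; repeats are served from the cache.
    return sum(2 if unicodedata.east_asian_width(c) in ('F', 'W') else 1
               for c in text)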

@onyxfish
Collaborator

Yeah, I'd be happy to have a flag for this, at least for csvlook and maybe for other utilities (csvstat, etc.).

If you want to send a pull request that would be great!

@chuanconggao
Author

Sure. I will see what I can do. BTW, it seems csvkit is under heavy refactoring right now. When will the code be relatively stable?

@onyxfish
Collaborator

I'm glad you said that, because actually this change should be made in agate and then #515 should pull it into csvkit. Agate is pretty stable, so if you wanted to look at making the change there, it'll get pulled into csvkit when we refactor csvlook.

@chuanconggao
Author

Sure. I am just learning agate now; it is quite an interesting package. I will contribute the code to agate first. Thanks for the help.

@jpmckinney
Member

Cool - closing as the issue should be re-created on agate.
