Possible to deal with East Asian wide character? #500
Comments
I'd love to find a way to support these languages better, but I'm guessing this is impossible to fix without going character by character and testing where each one falls in the Unicode set. That sounds... very slow. If anyone has a suggestion for a performant (and ideally simple) way to fix this, I'm all ears.
According to Google, the standard library already has a solution: unicodedata.east_asian_width(unichr), which returns the East Asian width assigned to the Unicode character unichr as a string. This StackOverflow post covers a lot more details:
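To make the suggestion concrete, here is a minimal sketch of how unicodedata.east_asian_width could be used to measure rendered width. The helper name display_width is hypothetical, and the sketch assumes that only the "W" (Wide) and "F" (Fullwidth) categories occupy two columns; ambiguous-width and combining characters are counted as one column, which will not be right for every terminal.

```python
import unicodedata


def display_width(text):
    """Estimate how many terminal columns `text` occupies.

    Hypothetical helper: assumes "W" and "F" characters take two
    columns and every other character takes one.
    """
    width = 0
    for ch in text:
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


sample = "今天你吃饭饭了么"
print(len(sample))            # number of characters
print(display_width(sample))  # estimated number of columns
```

A table renderer would pad cells using this estimate rather than len().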
Right, but you still have to iterate over the text, call that method, and reformat every single character you write out, which could be extremely cost-prohibitive. Then again, if we're talking mainly about
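I do not think the utilities other than csvlook would need to consider this special case, as it is only for rendering. This would add a constant factor to the runtime. I just did a simple test.

    import unicodedata

    # Unicode literal, so that iterating yields characters rather than bytes.
    s = u"今天你吃饭饭了么"

    # Baseline: only measure the string length.
    def f1():
        for i in xrange(10000000):  # Python 2; use range on Python 3
            len(s)

    # Look up the East Asian width of every character.
    def f2():
        for i in xrange(10000000):
            for c in s:
                unicodedata.east_asian_width(c)
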
I think maybe we can add an argument to handle this special case, without losing performance in the general case.
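One way such an opt-in argument might look is sketched below. The names cell_width, pad_cell, and the wide parameter are all hypothetical and are not part of csvkit or agate. The sketch assumes Python 3.7+ for str.isascii, which lets pure-ASCII cells skip the per-character lookup even when the flag is on.

```python
import unicodedata


def cell_width(text, wide=False):
    """Return the width of `text` in columns.

    With wide=False (the default) this is just len(text), so the
    general case pays no extra cost. `wide` is a hypothetical flag.
    """
    if not wide or text.isascii():
        return len(text)
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def pad_cell(text, target, wide=False):
    """Right-pad `text` with spaces up to `target` columns."""
    return text + " " * max(0, target - cell_width(text, wide))


print(repr(pad_cell("今天", 6)))             # padded by character count
print(repr(pad_cell("今天", 6, wide=True)))  # padded by estimated columns
```

Keeping the flag off by default means existing output and performance are unchanged unless the user asks for width-aware padding.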
Yeah, I'd be happy to have a flag for this, at least for If you want to send a pull request that would be great!
Sure. I will see what I can do. BTW, it seems csvkit is under heavy refactoring right now. When will the code be relatively stable?
Sure. I am just learning agate now. It is quite an interesting package. I will contribute the code to agate first. Thanks for the help.
Cool - closing as the issue should be re-created on agate.
In East Asian languages like Chinese, Japanese, etc., some characters are wide and take up twice the width of a Latin character. For example, given this CSV:
csvlook draws the following table: