International characters are removed. #99
Comments
I've knocked together a similar case but can't reproduce the bug: saved_file: code: for row in row_set: output: which seems right to me. (This happens regardless of whether the file is
Dave.
|
It's currently an issue with DataStorer in CKAN: http://data.kk.dk/dataset/betalingszoner/resource/cde21ea2-6f87-46e1-be1f-f7a0d2cfc985 Debugging showed that the characters are removed (stripped, not turned into gibberish) just after: table_sets = any_tableset( |
Can you 100% confirm that your file is utf-8? Does chardet think your file is utf-8? |
This is the output: after this: |
I don't think the file is UTF-8: (python 2)
Given that this is a bytestring, it shouldn't contain any single non-ASCII 'file' agrees: Strange that chardet thinks it's UTF-8, given that UTF-8 is one of the
|
commas.py 24: self.reader = codecs.getreader(encoding)(f, 'ignore') Also, only checking the start of the file causes misdetection in chardet,
We could fall back to a possibly-mangling latin-1 instead of an
|
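To illustrate why the `'ignore'` error handler quoted from commas.py strips characters rather than producing gibberish, here is a minimal sketch in Python 3 (the thread itself used Python 2). The sample string and the helper name `read_with_ignore` are made up for illustration; only the `codecs.getreader(encoding)(f, 'ignore')` call mirrors the line quoted above.

```python
import codecs
import io

# Hypothetical sample: Danish place names stored as ISO-8859-1 (latin-1) bytes.
raw = "Østerbro, Nørrebro".encode("latin-1")


def read_with_ignore(data, encoding):
    """Decode bytes the way the quoted commas.py line does: undecodable bytes are dropped."""
    reader = codecs.getreader(encoding)(io.BytesIO(data), "ignore")
    return reader.read()


# Wrong guess (UTF-8): the latin-1 bytes for Ø and ø are invalid UTF-8, so they vanish silently.
stripped = read_with_ignore(raw, "utf-8")

# Right guess (latin-1): every byte maps to a character, nothing is lost.
correct = read_with_ignore(raw, "latin-1")

print(repr(stripped))
print(repr(correct))
```

With a wrongly detected encoding, `'ignore'` hides the problem instead of raising, which matches the "stripped, not gibberish" symptom reported earlier in the thread.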
My result of the following: |
But yes, it does say ascii when using read(2000)... |
Ah; I was in an interactive Python session - so |
When I try to open the file as UTF-8: /usr/lib/ckan/default/local/lib/python2.7/site-packages/chardet/universaldetector.py:90: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal Any idea on how to fix this? UTF-8 is a superset of ASCII. |
? I think I got lost somewhere... I'm kinda lost on whether I should fix the file, messytables or...? |
The error message "'ascii' codec can't encode..." is caused by something trying to convert a character outside of ASCII (e.g. a byte bigger than 127 in ISO-8859, or a Unicode code point beyond 127) to ASCII; ASCII has no way to represent that character. Yes, messytables should handle ISO-8859-1 files; however, it is not correctly detecting this file because detection is truncated at 2k. I believe this would be fixed if:
|
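Both points in the comment above can be reproduced in a few lines. This is a Python 3 sketch, not messytables or chardet code: `looks_ascii` is a hypothetical stand-in for a real detector, and the sample data is invented so that the first non-ASCII byte sits beyond the 2000-byte mark.

```python
# 1. Encoding a non-ASCII character to ASCII fails with the error quoted above.
try:
    "Østerbro".encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = str(exc)


# 2. A detector that only sees the first 2000 bytes can miss later non-ASCII bytes.
def looks_ascii(data):
    """Hypothetical stand-in for a detector: True if every byte is plain ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


# Invented file contents: about 3000 ASCII bytes, then one latin-1 encoded row.
data = b"id,name\n" + b"1,abc\n" * 500 + "2,Østerbro\n".encode("latin-1")

prefix_is_ascii = looks_ascii(data[:2000])  # what a 2k-limited check sees
whole_is_ascii = looks_ascii(data)          # what the whole file actually contains

print(encode_error)
print(prefix_is_ascii, whole_is_ascii)
```

The prefix looks like ASCII while the file as a whole is not, which is consistent with the `read(2000)` result reported earlier.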
Removing the 2K limit fixed the issue... And thank you for your patience with me. I really do appreciate it. |
Glad to hear the specific problem is fixed! Making it permanent is a little more complicated; some users need to avoid loading the whole file multiple times. We'll have to think about it. |
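One possible direction, offered only as a sketch and not as what messytables does or plans to do: scan the stream in fixed-size chunks with an incremental decoder, so the whole file is inspected without being held in memory, and fall back to latin-1 when the bytes are not valid UTF-8. The function name `guess_encoding`, the chunk size, and the three-way result are assumptions made for this example (Python 3).

```python
import codecs
import io


def guess_encoding(stream, chunk_size=2000):
    """Hypothetical helper: return 'ascii', 'utf-8' or 'latin-1' for a binary stream.

    Reads chunk by chunk, so memory use stays bounded by chunk_size.
    """
    decoder = codecs.getincrementaldecoder("utf-8")("strict")
    saw_non_ascii = False
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            if any(byte > 127 for byte in chunk):
                saw_non_ascii = True
            # The incremental decoder copes with multi-byte characters split across chunks.
            decoder.decode(chunk)
        # Raises if the stream ended in the middle of a multi-byte sequence.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to latin-1, which never fails but may mangle text.
        return "latin-1"
    return "utf-8" if saw_non_ascii else "ascii"


# Invented sample data: the non-ASCII row only appears after about 3000 bytes.
padding = b"1,abc\n" * 500
latin1_file = io.BytesIO(padding + "2,Østerbro\n".encode("latin-1"))
utf8_file = io.BytesIO(padding + "2,Østerbro\n".encode("utf-8"))

print(guess_encoding(latin1_file))
print(guess_encoding(utf8_file))
```

This still reads the file once for detection; a caller that cannot rewind the stream would need to buffer or tee the chunks, which the sketch does not address.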
There are plans to solve this problem. I am quite interested in solving this problem. |
The code below removes international characters, or so it seems. When I open the saved_file, which is UTF-8, and print out the content, all my international characters are shown correctly. After any_tableset is run and I print out all the rows, all the international characters have been removed.