International characters are removed. #99

cphsolutionslab · 2013-10-22T07:57:15Z

The code below removes international characters, or so it seems. When I open the saved_file, which is UTF-8, and print out the content all my international characters are correctly shown. After the any_tableset is run and I print out all the rows, all the international characters are removed.

    f = open(result['saved_file'], 'rb')
    try:
        table_sets = any_tableset(
            f,
            mimetype=content_type,
            extension=resource['format'].lower()
        )
        # only first sheet in xls for time being
        row_set = table_sets.tables[0]
        offset, headers = headers_guess(row_set.sample)

scraperdragon · 2013-10-22T11:05:13Z

I've knocked together a similar case but can't reproduce the bug:

saved_file:

    cat,dog,mouseß
    æï,ñø,åÅ

code:

    from messytables import any_tableset, headers_guess
    for i in (1,):
        f = open('saved_file', 'rb')
        table_sets = any_tableset(f)
        row_set = table_sets.tables[0]
        print list(row_set)
        offset, headers = headers_guess(row_set.sample)
        print offset, headers

        for row in row_set:
            for cell in row:
                print cell.value

output:

    [[<Cell(String:u'cat'>, <Cell(String:u'dog'>, <Cell(String:u'mouse\xdf'>],
    [<Cell(String:u'\xe6\xef'>, <Cell(String:u'\xf1\xf8'>,
    <Cell(String:u'\xe5\xc5'>]]
    0 [u'cat', u'dog', u'mouse\xdf']
    cat
    dog
    mouseß
    æï
    ñø
    åÅ

which seems right to me. (This happens regardless of whether the file is CSV or XLS.)

Is it stripping characters or replacing them with gibberish?
Can you provide the values of content_type and resource['format'].lower()?
Could you run this code with your file and let us know whether it produces the expected strings?
If your spreadsheet isn't sensitive, could you let us see it?

Dave.

cphsolutionslab · 2013-10-22T13:18:46Z

It's currently an issue with DataStorer in CKAN: http://data.kk.dk/dataset/betalingszoner/resource/cde21ea2-6f87-46e1-be1f-f7a0d2cfc985

Debugging showed that the characters are removed (stripped, not turned into gibberish) just after:

    table_sets = any_tableset(
        f,
        mimetype=content_type,
        extension=resource['format'].lower()
    )

rossjones · 2013-10-22T13:25:49Z

Can you 100% confirm that your file is utf-8? Does chardet think your file is utf-8?

cphsolutionslab · 2013-10-22T13:29:22Z

This is the output:

    {'confidence': 0.99, 'encoding': 'utf-8'}

after this:

    f = open(result['saved_file'], 'rb')
    print chardet.detect(f.read())
    try:
        table_sets = any_tableset(
            f,
            mimetype=content_type,
            extension=resource['format'].lower()
        )

scraperdragon · 2013-10-22T15:30:16Z

I don't think the file is UTF-8:

(python 2)

    f = open("ows.csv", "r").read()
    f[-100:]
    '03651972, 12.579757761196365 55.66998698041308))",Gr\xf8n,Gr\xf8n betalingszone,2010-02-22,2013-07-17,21\r\n'

Given that this is a bytestring, it wouldn't contain any lone non-ASCII bytes if it were UTF-8 (every UTF-8 character that isn't ASCII is encoded as a multibyte sequence), yet \xf8 appears on its own here.

'file' agrees:
$ file ows.csv
ows.csv: ISO-8859 text, with very long lines, with CRLF line terminators

Strange that chardet thinks it's UTF-8, given that UTF-8 is one of the
easiest things to prove some text isn't.
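
To illustrate why a lone \xf8 byte rules out UTF-8, here is a minimal sketch. It is written for Python 3 rather than the Python 2 used in this thread, and the helper name is_valid_utf8 is hypothetical, not part of messytables or chardet.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # the default error handler is 'strict'
    except UnicodeDecodeError:
        return False
    return True


# In UTF-8, "ø" is the two-byte sequence C3 B8.
utf8_bytes = "Grøn".encode("utf-8")
# In ISO-8859-1, "ø" is the single byte F8, as seen in the tail of ows.csv.
latin1_bytes = "Grøn".encode("iso-8859-1")

print(utf8_bytes)                   # b'Gr\xc3\xb8n'
print(latin1_bytes)                 # b'Gr\xf8n'
print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False
```

A strict decode is a cheap way to rule UTF-8 out; it says nothing about which single-byte encoding the file actually uses.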

scraperdragon · 2013-10-22T15:56:24Z

commas.py 24: self.reader = codecs.getreader(encoding)(f, 'ignore')
... I'm not sure silently ignoring characters that don't decode will ever
be the correct behaviour.
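
A minimal sketch of what the 'ignore' error handler does to ISO-8859-1 bytes read as UTF-8. It uses Python 3 rather than this thread's Python 2, the sample bytes mimic the tail of ows.csv, the in-memory stream stands in for the real file, and UTF-8 as the encoding passed in is an assumption.

```python
import codecs
import io

raw = b"Gr\xf8n betalingszone"  # ISO-8859-1 bytes; \xf8 is "ø"

# Same construction as the quoted commas.py line, assuming encoding == 'utf-8'.
reader = codecs.getreader("utf-8")(io.BytesIO(raw), "ignore")
stripped = reader.read()
print(stripped)  # Grn betalingszone

# With the default 'strict' handler the mismatch is reported instead of hidden.
try:
    codecs.getreader("utf-8")(io.BytesIO(raw)).read()
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

The undecodable byte disappears without any error, which matches the "stripped, not gibberish" symptom reported above.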

Also, checking only the start of the file makes chardet misdetect the encoding: because of the big polygons, the first 2000 bytes are plain ASCII.

    chardet.detect(open("/home/dragon/ows.csv").read())
    {'confidence': 0.766658867395801, 'encoding': 'ISO-8859-2'}
    chardet.detect(open("/home/dragon/ows.csv").read(2000))
    {'confidence': 1.0, 'encoding': 'ascii'}

We could fall back to a possibly-mangling latin-1 instead of an
always-wrong UTF-8 minus the bad bits. It'd be good to warn when this had
occurred.
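
The fallback idea could look roughly like the sketch below. It is Python 3, not messytables code; decode_with_fallback is a hypothetical name, and the sample data is made up to mimic a long ASCII-only run of coordinates followed by one ISO-8859-1 byte.

```python
import warnings


def decode_with_fallback(data: bytes) -> str:
    """Decode as strict UTF-8, falling back to ISO-8859-1 with a warning."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        warnings.warn("not valid UTF-8; falling back to ISO-8859-1")
        # ISO-8859-1 maps every byte to a code point, so this never raises,
        # but it may mangle text that is really in some other encoding.
        return data.decode("iso-8859-1")


# ASCII-only prefix (3800 bytes) standing in for the polygon coordinates,
# then a non-ASCII byte well past the 2000-byte mark.
data = b"12.579757761196365 55.66998698041308, " * 100 + b"Gr\xf8n\r\n"

print(data[:2000].isascii())  # True: a 2000-byte sample looks like ASCII
print(data.isascii())         # False: the file as a whole is not
print(decode_with_fallback(data).rstrip()[-4:])  # Grøn
```

The first two prints show why a 2000-byte sample is misleading for this kind of file; the last shows the character surviving instead of being dropped.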

cphsolutionslab · 2013-10-24T09:04:22Z

My result of the following:

    f = open(result['saved_file'], 'rb')
    f_test = f.read()
    print f_test[-100:]

gives me this:

    651972, 12.579757761196365 55.66998698041308))",Grøn,Grøn betalingszone,2010-02-22,2013-07-17,21

cphsolutionslab · 2013-10-24T09:09:25Z

But yes, it does say ascii when using read(2000)...

scraperdragon · 2013-10-24T13:29:38Z

Ah; I was in an interactive Python session - so

    f[-100:]

is equivalent to

    print repr(f[-100:])
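
The difference between the two outputs can be sketched as follows. This is a Python 3 analogue, so the mechanics differ slightly from Python 2, where print sent the raw bytes to the terminal; the point is only that both views describe the same bytes.

```python
raw = b"Gr\xf8n"  # the bytes as stored in the file

# What an interactive session echoes: the escaped representation.
print(repr(raw))                 # b'Gr\xf8n'

# What a reader sees once the bytes are interpreted as ISO-8859-1.
print(raw.decode("iso-8859-1"))  # Grøn
```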

cphsolutionslab · 2013-10-29T10:11:39Z

When I try to open the file as UTF-8:

    f = codecs.open(result['saved_file'], 'rb', 'utf-8')

I get the error:

    /usr/lib/ckan/default/local/lib/python2.7/site-packages/chardet/universaldetector.py:90: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
    if aBuf[:len(chunk)] == chunk:
    2013-10-29 11:09:22,057 ERROR [root] 'ascii' codec can't encode character u'\xd8' in position 1456: ordinal not in range(128)
    ...

Any idea on how to fix this? UTF-8 is a superset of ASCII.

cphsolutionslab · 2013-10-29T10:21:05Z

I think I got lost somewhere...
Should messytables be able to handle ISO-8859-1 encoded files?
Should messytables be able to handle ascii characters?

I'm kinda lost as to whether I should fix the file, messytables or...?

scraperdragon · 2013-10-29T10:25:55Z

The error message: "'ascii' codec can't encode..." is caused by something trying to convert a character outside of ASCII (e.g. a byte bigger than 127 in ISO-8859, a unicode code point beyond 127) to ASCII; ASCII doesn't have a character to represent that character.

Yes, messytables should handle ISO-8859-1 files, however it is not correctly detecting this file due to truncating detection at 2k.
I'm not sure why chardet is trying to coerce data into ASCII. It shouldn't be trying to do that.
Your data isn't UTF-8. Attempting to decode it as such is bound to fail.

I believe this would be fixed if:

1. we didn't truncate the analysis at 2k
2. we fell back to ISO-8859-1, not UTF-8, if the file can't be decoded correctly
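
The "'ascii' codec can't encode" error described at the top of this comment can be reproduced in isolation. This is a Python 3 sketch; the character is the one named in the log above, but the one-character string is only an illustration, and where the implicit ASCII encode happens in the real stack is not shown here.

```python
text = "\xd8"  # U+00D8, the character named in the error log

try:
    text.encode("ascii")  # ASCII only covers code points 0 to 127
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)

# Encodings that do contain the character work fine.
print(text.encode("iso-8859-1"))  # b'\xd8'
print(text.encode("utf-8"))       # b'\xc3\x98'
```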

cphsolutionslab · 2013-10-29T11:10:09Z

Removing the 2K limit fixed the issue...
Should this be a permanent solution?

And thank you for your patience with me. I really do appreciate it.

scraperdragon · 2013-10-29T11:53:48Z

Glad to hear the specific problem is fixed!

Making it permanent is a little more complicated; some users need to avoid loading the whole file multiple times. We'll have to think about it.
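
One possible direction, offered only as a sketch and not as what messytables does: validate the stream incrementally, so the whole file is examined in a single pass with one chunk in memory at a time. This is Python 3; stream_is_utf8 and the 2000-byte chunk size are hypothetical choices.

```python
import codecs
import io


def stream_is_utf8(stream, chunk_size=2000):
    """Check a binary stream for UTF-8 validity in one pass, chunk by chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            # The decoder buffers a multibyte sequence split across chunks.
            decoder.decode(chunk)
        # final=True rejects a sequence that is cut off at end of file.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


# "ø" (C3 B8) straddles the 2000-byte chunk boundary here.
good = io.BytesIO(("x" * 1999 + "ø").encode("utf-8"))
# A lone F8 byte after 3000 ASCII bytes, as in an ISO-8859-1 file.
bad = io.BytesIO(b"x" * 3000 + b"Gr\xf8n")

print(stream_is_utf8(good))  # True
print(stream_is_utf8(bad))   # False
```

This only answers "UTF-8 or not"; choosing among single-byte encodings would still need a detector such as chardet, and a non-seekable stream would have to be buffered or reopened before parsing.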

guibos · 2020-11-25T14:36:55Z

The error message: "'ascii' codec can't encode..." is caused by something trying to convert a character outside of ASCII (e.g. a byte bigger than 127 in ISO-8859, a unicode code point beyond 127) to ASCII; ASCII doesn't have a character to represent that character.

Yes, messytables should handle ISO-8859-1 files, however it is not correctly detecting this file due to truncating detection at 2k.
I'm not sure why chardet is trying to coerce data into ASCII. It shouldn't be trying to do that.
Your data isn't UTF-8. Attempting to decode it as such is bound to fail.

I believe this would be fixed if:
1. we didn't truncate the analysis at 2k

2. we fell back to ISO-8859-1, not UTF-8 if the file can't be decoded correctly

There are plans to solve this problem. I am quite interested in solving this problem.
