
read_csv fails for UTF-16 with BOM (maybe also other encodings with BOM) and skiprows #2298

Closed
gerigk opened this issue Nov 20, 2012 · 10 comments
Labels: Bug, IO Data (IO issues that don't fit into a more specific label)


gerigk commented Nov 20, 2012

Unfortunately I can't send the file, but here is the output of head filename -n 7:

��Name  Ad performance report                           
Type    Ad                                  
Frequency   One time                            
Date range  Custom date range                       
Dates   Sep 19, 2012-Nov 19, 2012                       
Account Day Campaign    Ad group    Ad ID   Client name Destination URL Impressions Clicks  Cost    Avg. position   Status  Conv. (1-per-click)
Categories 2    15.11.2012  something: ��;�C�7�:�8� [somethinglse]{test}: ��;�C�7�:�8�  16902484818 Categories 2    http://www.someurl?ad=291012    333 2   4.7 5.5 approved    0

I guess that the beginning of the file is the BOM and that this causes problems when skipping the rows. Without skiprows everything gets read into one row with the first column containing the BOM.

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns:
��Name\tAd performance report\t\t\t\t\t\t\t\t\t\t\t
...

The error raised is:

pd.read_csv('/home/arthur/Desktop/client 139 - ads report/test_pandas.csv', sep='\t', skiprows=5)
/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, header, index_col, names, skiprows, skipfooter, skip_footer, na_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    361                     buffer_lines=buffer_lines)
    362 
--> 363         return _read(filepath_or_buffer, kwds)
    364 
    365     parser_f.__name__ = name

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    185 
    186     # Create the parser.
--> 187     parser = TextFileReader(filepath_or_buffer, **kwds)
    188 
    189     if nrows is not None:

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    465         self.options, self.engine = self._clean_options(options, engine)
    466 
--> 467         self._make_engine(self.engine)
    468 
    469     def _get_options_with_defaults(self, engine):

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in _make_engine(self, engine)
    567     def _make_engine(self, engine='c'):
    568         if engine == 'c':
--> 569             self._engine = CParserWrapper(self.f, **self.options)
    570         else:
    571             if engine == 'python':

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/io/parsers.pyc in __init__(self, src, **kwds)
    787         ParserBase.__init__(self, kwds)
    788 
--> 789         self._reader = _parser.TextReader(src, **kwds)
    790 
    791         # XXX

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/_parser.so in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3579)()

/usr/local/lib/python2.7/dist-packages/pandas-0.9.2.dev_b8dae94-py2.7-linux-x86_64.egg/pandas/_parser.so in pandas._parser.TextReader._get_header (pandas/src/parser.c:4590)()

CParserError: Passed header=0 but only 0 lines in file

ghost commented Dec 2, 2012

You should specify the encoding arg explicitly when reading non-ASCII files, but even with that
it's not functioning, so this is a bug.
as a temporary workaround, you can try

pd.read_csv('data.csv', sep='\t', skiprows=5, encoding='utf-16le', engine='python')

It'll use the slower Python parser, but it should work.


ghost commented Dec 2, 2012

df=pd.read_csv("2.csv",sep=u'\t'.encode('utf-16'),encoding='utf-16')

This comes close, but the column names are not properly decoded into unicode.
If you set them manually to ascii/unicode values, the dataframe is fine.

There's obviously work to do here.

Edit: it looks like the index is not decoded properly as well, with the hack in 3e76878.
with the hack in 3e76878

wesm added a commit that referenced this issue Dec 2, 2012

wesm commented Dec 2, 2012

I wrote a unit test to try to replicate this. Encoding unicode text as utf-16 adds the BOM, and it seems the file can be successfully read using read_csv(path, encoding='utf-16', skiprows=n). Is that not the case for you?


ghost commented Dec 2, 2012

The test passes, but the following raises yet another error in IPython:

import pandas as pd
import pandas.util.testing as tm

data = u"""skip this
skip this too
A,B,C
1,2,3
4,5,6"""
path = '/tmp/1.csv'
enc='utf-16'
bytes = data.encode(enc)
with open(path, 'wb') as f:
    f.write(bytes)

result = pd.read_csv(path, encoding=enc, skiprows=2)
#    expected = pd.read_csv(path,encoding=enc, skiprows=2,engine='python')
#   tm.assert_frame_equal(result, expected)

It works with enc='ascii', though. Am I missing something?
Once that's working, I think comparing engine='c' and engine='python'
will surface the issue.


gerigk commented Dec 2, 2012

I still get an error for a file that I am downloading directly from Google AdWords (the format is called "CSV for Excel", in case you have an accessible account).
The BOM is

'\xff\xfe'

and with the code above it fails.
If I use y-p's hint with the encoded separator

pd.read_csv(paths,sep=u'\t'.encode('utf-16le'), skiprows=5, encoding='utf-16le')

the file is read correctly, but the Cyrillic letters aren't printed correctly (in IPython I get some meaningless Latin letters, and in the standard Python shell I get weird boxes �).
LibreOffice Calc opens the file without problems and shows the letters correctly.
Also, pandas 0.9.1 and 0.10dev work fine (adding engine='python' for 0.10dev) with

pd.read_csv(path, sep='\t', skiprows=5, encoding='utf-16le')

but not with sep=u'\t'.encode('utf-16le') (with this addition I get the same weird characters where Cyrillic characters are expected).
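As a quick sanity check, independent of pandas, the '\xff\xfe' bytes quoted above are exactly Python's named constant for the little-endian UTF-16 byte order mark:

```python
import codecs

# '\xff\xfe' at the start of a file marks little-endian UTF-16;
# the codecs module exposes it as a named constant.
bom = b'\xff\xfe'
print(bom == codecs.BOM_UTF16_LE)  # True
```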


wesm commented Dec 2, 2012

Can you produce a sufficiently obfuscated sample (please include the exact unicode or bytes literal, so it can be copy-pasted into Python and passed in via a StringIO) for me to see where the decoding is going wrong?


gerigk commented Dec 2, 2012

I sent you an email.


wesm commented Dec 3, 2012

Thanks, I see the problem. For little-endian UTF-16, the null byte \x00 falls after ASCII characters like the delimiter, so to properly parse this data in C you'd need to write a custom UTF-16 tokenizer. I think the best approach is probably to transcode the data to UTF-8 and feed that to the parser. I'll take a look sometime this week.
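As a sketch of that transcoding idea (not the actual pandas implementation): decoding the UTF-16 bytes consumes the BOM, and re-encoding as UTF-8 gives the tokenizer single-byte delimiters again:

```python
import io

# Illustrative UTF-16 bytes with a BOM, as in the report file.
utf16_bytes = u'A\tB\n1\t2\n'.encode('utf-16')

# Decoding consumes the BOM; re-encoding as UTF-8 restores the
# single-byte '\t' and '\n' that a C tokenizer can match directly.
text = utf16_bytes.decode('utf-16')
utf8_stream = io.BytesIO(text.encode('utf-8'))

print(utf8_stream.getvalue())  # b'A\tB\n1\t2\n'
```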


ghost commented Dec 3, 2012

Detecting the BOM at the start of the file might also be workable; there are only a small number of possible values.
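A minimal sketch of such BOM sniffing (the function name is hypothetical, not a pandas API); the longer UTF-32 marks must be tested before their UTF-16 prefixes:

```python
import codecs

# Candidate BOMs, longest first: BOM_UTF32_LE begins with the same
# two bytes as BOM_UTF16_LE, so order matters.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_encoding(raw):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, enc in BOMS:
        if raw.startswith(bom):
            return enc
    return None
```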


wesm commented Dec 6, 2012

Looking good now. Arthur, your test case from your e-mail works fine now (do NOT pass an encoded delimiter such as u'\t'.encode('utf-16'), though, because it adds a BOM to the delimiter and confuses the CSV reader; even the BOM-less 'utf-16le' form produces a two-byte separator).
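For the record, here is what the two codecs actually produce for the tab delimiter; neither result matches the single-byte tab the C tokenizer expects (a small demonstration, not pandas code):

```python
# On a little-endian machine, the plain utf-16 codec prepends a BOM,
# and even the BOM-less utf-16-le variant is two bytes per character.
sep_utf16 = u'\t'.encode('utf-16')       # BOM + 2-byte tab
sep_utf16le = u'\t'.encode('utf-16-le')  # no BOM, but still 2 bytes

print(len(sep_utf16), len(sep_utf16le))  # 4 2
```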
