read_csv fails for UTF-16 with BOM (maybe also other encodings with BOM) and skiprows #2298
Comments
You should specify engine='python', as in pd.read_csv('data.csv', sep='\t', skiprows=5, encoding='utf-16le', engine='python'). It will use the slower Python parser, but it should work. |
df = pd.read_csv("2.csv", sep=u'\t'.encode('utf-16'), encoding='utf-16') comes close, but the column names are not properly decoded into unicode. There's obviously work to do here. Edit: it looks like the index is not decoded properly either. |
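As a side note on the two workarounds above, here is a minimal sketch of how Python's own codecs treat a BOM. It uses only the standard library and says nothing about pandas internals; the sample text and variable names are made up for illustration.

```python
import codecs

text = u"A\tB\n1\t2\n"
# Build a little-endian UTF-16 payload with an explicit BOM (bytes FF FE).
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The generic codec reads the BOM to pick the byte order, then drops it.
generic = raw.decode("utf-16")
# The explicit little-endian codec treats the BOM as data (U+FEFF).
explicit = raw.decode("utf-16-le")

# Encoding a separator with the generic codec prepends a BOM to it as well,
# so the result is four bytes, not two.
encoded_sep = u"\t".encode("utf-16")

print(repr(generic[:1]), repr(explicit[:1]), len(encoded_sep))
```

If this carries over to the parser, it would be consistent with the first column name picking up a stray U+FEFF when an explicit-endian encoding such as 'utf-16le' is used, but that link is an assumption, not something confirmed in this thread.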
I wrote a unit test to try to replicate. Encoding unicode with |
The test passes, but the following raises yet another error in IPython:
import pandas as pd
import random
import pandas.util.testing as tm
import os
data = u"""skip this
skip this too
A,B,C
1,2,3
4,5,6"""
path = '/tmp/1.csv'
enc = 'utf-16'
bytes = data.encode(enc)
with open(path, 'wb') as f:
    f.write(bytes)
result = pd.read_csv(path, encoding=enc, skiprows=2)
# expected = pd.read_csv(path, encoding=enc, skiprows=2, engine='python')
# tm.assert_frame_equal(result, expected)
It works with enc='ascii' though. Am I missing something? |
I still get an error for a file that I am directly downloading from Google AdWords (the format is called CSV for Excel in case you have an accessible account).
With the code above it fails.
The file is read correctly, but the Cyrillic letters aren't printed correctly (in IPython I get some meaningless Latin letters, and in the standard Python shell I get weird boxes �).
but not with sep = u'\t'.encode('utf16-le') (with this addition I get the same weird characters where Cyrillic characters are expected). |
Can you produce a sufficiently obfuscated sample (please give the exact unicode or bytes literal that can be copy-pasted into Python and passed to a StringIO) so I can see where the decoding is going wrong? |
I sent you an email. |
Thanks, I see the problem. The problem is that for little-endian UTF-16, the null byte |
Detecting the BOM at the start of the file might also be workable. |
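A rough sketch of what BOM detection could look like, using the BOM constants from the standard codecs module. The helper names sniff_bom and decode_without_bom are hypothetical and are not part of pandas; this is not how the parser actually handles it.

```python
import codecs

# Check longer BOMs first: the UTF-32-LE BOM (FF FE 00 00) begins with
# the same two bytes as the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_without_bom(raw, default="utf-8"):
    """Strip a leading BOM, then decode with the encoding it implies."""
    name, bom_length = sniff_bom(raw)
    return raw[bom_length:].decode(name or default)


sample = codecs.BOM_UTF16_LE + u"A,B,C\n1,2,3".encode("utf-16-le")
print(sniff_bom(sample))
print(decode_without_bom(sample))
```

One known ambiguity with this approach: UTF-16-LE data whose first character after the BOM is U+0000 is indistinguishable from a UTF-32-LE BOM, so the sketch would misclassify it.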
Looking good now. Arthur, your test case from your e-mail works fine now (do NOT do |
Unfortunately I can't send the file but from the output of head filename -n 7
I guess that the beginning of the file is the BOM and that this causes problems when skipping the rows. Without skiprows everything gets read into one row with the first column containing the BOM.
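Until the parser handles this case, one possible workaround is to decode the whole file in Python first, so that the BOM is consumed before any rows are skipped. The sketch below uses only the standard library; read_rows_after_skip is a hypothetical helper, and the sample data is taken from the reproduction script in this thread.

```python
import csv


def read_rows_after_skip(raw, encoding="utf-16", skiprows=0, delimiter=","):
    """Decode the bytes first, then drop leading lines and parse as CSV."""
    # The generic 'utf-16' codec consumes a leading BOM and uses it to
    # choose the byte order, so no U+FEFF is left in the decoded text.
    text = raw.decode(encoding)
    # Note: splitlines() also splits newlines inside quoted fields, so this
    # only suits files without embedded line breaks.
    lines = text.splitlines()[skiprows:]
    return list(csv.reader(lines, delimiter=delimiter))


data = u"skip this\nskip this too\nA,B,C\n1,2,3\n4,5,6"
raw = data.encode("utf-16")  # the encoder prepends a BOM
rows = read_rows_after_skip(raw, skiprows=2)
print(rows)
```

The same idea should apply to pandas by wrapping the decoded text in io.StringIO and passing that object to pd.read_csv together with skiprows, since read_csv accepts file-like objects.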
The error raised is: