read_csv does not parse in header with BOM utf-8 #4793

Closed
johnclinaa opened this issue Sep 9, 2013 · 27 comments
Labels
Bug IO CSV read_csv, to_csv
Comments

@johnclinaa

I am using Pandas version 0.12.0 on a Mac.

I noticed that when reading a UTF-8 file with a BOM, if the header row is the first line of the file, the read_csv() method leaves a leading quotation mark in the first column's name. However, if the header row is further down the file and I use the header= option, the whole header row gets parsed correctly.

Here is some example code:

bing_kw = pd.read_csv('../../data/sem/Bing-Keyword_daily.csv', header=9, thousands=',', encoding='utf-8')

Parses the header correctly.

bing_kw = pd.read_csv('../../data/sem/Bing-Keyword_daily.csv', thousands=',', encoding='utf-8')

Parses the first header column name incorrectly, leaving a leading quotation mark in place.
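For anyone trying to reproduce this without the original data, here is a minimal sketch (the file name is made up; it just writes a small UTF-8 CSV with a BOM and reads it back):

import codecs
import pandas as pd

# Write a tiny CSV prefixed with the UTF-8 BOM (hypothetical file name).
with open('bom_example.csv', 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'"name","value"\n"foo","1"\n')

# Reading it back with plain utf-8 leaves the BOM attached to the first header.
df = pd.read_csv('bom_example.csv', encoding='utf-8')
print(repr(df.columns[0]))  # not the clean 'name' you would expect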

@jtratner
Contributor

jtratner commented Sep 9, 2013

which version of Python are you using? (python --version)

@johnclinaa
Author

Python 2.7.5 :: Anaconda 1.6.1 (x86_64)

And the code is written in an IPython notebook using IPython 1.0.0.

@jreback
Contributor

jreback commented Sep 28, 2013

@john-orange-aa can you provide a reproducible example (link to a file if you need to)

@johnclinaa
Author

@jreback, here is the link to the folder:
https://drive.google.com/folderview?id=0BwxOyJG828PySFFQVlBSUEdlcEk&usp=sharing

BOM-temp.csv is the offending file. BOM-temp2.csv is the same file with headers removed. The "pandas BOM utf-8 bug.ipynb" is the ipython notebook that illustrates the bug.


@jseabold
Contributor

This isn't exactly the same issue, but I'm also having trouble with BOMs. File with a utf-8 BOM here.

    In [1]: pd.version.version
    Out[1]: '0.12.0-1149-g141e93a'

    In [2]: dta = pd.read_csv("bom_file.csv", encoding='utf-8')

    In [3]: dta.columns[0]
    Out[3]: u'\ufeffDate (MMM-YY)'

@jseabold
Contributor

It looks like you should use 'utf-8-sig' as the encoding for utf-8 files with a BOM, so my comment is likely invalid.
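For reference, a minimal sketch of that workaround on the file from the example above:

import pandas as pd

# 'utf-8-sig' consumes a leading BOM while decoding, so it never reaches
# the parser or the first column name.
dta = pd.read_csv("bom_file.csv", encoding='utf-8-sig')
print(dta.columns[0])  # 'Date (MMM-YY)' without the '\ufeff' prefix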

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 28, 2014
@filmackay

Is it possible for pandas to infer the encoding from the BOM automatically, or does this information really need to be passed in through encoding=? I thought providing exactly this kind of capability was the point of the BOM.
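pandas does not do this sniffing itself here, but a caller can; a rough sketch of detecting the BOM up front (the wrapper function and file name are hypothetical):

import codecs
import pandas as pd

def read_csv_bom_aware(path, **kwargs):
    # Peek at the first few bytes and choose an encoding based on any BOM found.
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        kwargs.setdefault('encoding', 'utf-8-sig')
    elif head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        kwargs.setdefault('encoding', 'utf-16')  # the utf-16 codec consumes the BOM
    else:
        kwargs.setdefault('encoding', 'utf-8')
    return pd.read_csv(path, **kwargs)

df = read_csv_bom_aware('some_file.csv')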

@ayee

ayee commented Jul 19, 2014

'utf-8-sig' does not resolve the issue for me. I faced the same problem, but using 'utf-8-sig' just gave me a different decoding error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 14: invalid start byte

@misea

misea commented Feb 20, 2015

I ran into this problem as well (version 0.15.2). I tried the 'utf-8-sig' encoding, and although I didn't see an error, the result was not quite right: the first key keeps its quotes while none of the other keys do, even though all column headers and values are quoted throughout the file.

f = open('data/bomex.csv')
header = f.read(10)
f.close()
header

out: '\xef\xbb\xbf"Name",'

f = codecs.open('data/bomex.csv', encoding='utf-8-sig')
header = f.read(10)
f.close()
header

out: u'"Name","Team","G"'

df = pd.read_csv('data/bomex.csv', encoding='utf-8-sig')
df.keys()[0]

out: u'"Name"'

df.keys()[1]

out: u'Team'

Note the extra set of quotes on the first key
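Until this is fixed, a workaround that seems to help in this situation (a sketch, not the eventual fix) is to strip any leftover quotes from the header after reading:

import pandas as pd

df = pd.read_csv('data/bomex.csv', encoding='utf-8-sig')
# Strip the quotes the parser failed to remove from the first header.
df.columns = [c.strip('"') for c in df.columns]
df.keys()[0]  # u'Name'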

@mizzao

mizzao commented Mar 3, 2015

With 0.15.2, I am able to use encoding="utf-8-sig" and the BOM disappears from the first column header.

@misea

misea commented Mar 3, 2015

In 0.15.2, I find that the BOM disappears, but the quotes around the first column header are erroneously preserved, while quotes around all other column headers (and all other values) are stripped.

So the problem with utf-8-sig seems to only affect quoted column headers. Here's an example file to try

https://dl.dropboxusercontent.com/u/27287953/bom.csv

...

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@zelite

zelite commented Mar 26, 2015

Hi,

I'm using pandas 0.16 in python 3.4 (anaconda distro).
I was having a problem with some files, and it seems to be related to the BOM of the file.
I have this file: https://www.dropbox.com/s/nced7whmt2rr0c8/sample_file2.txt

I'm reading the file with:

import pandas as pd
test = pd.read_table("sample_file2.txt", decimal = ",", 
                     parse_dates = True)

When I print the column names:

>>> print(test.columns)
Index(['TimeStamp', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14',
 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29',
 'V30', 'V31'], dtype='object')

So, there seems to be a 'TimeStamp' column. Let me check what is in there:

>>> test["TimeStamp"] 
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1787, in __getitem__
[truncated]
File "pandas\hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item
(pandas\hashtable.c:12300)
KeyError: 'TimeStamp'

And I can't seem to find out what the real name of the TimeStamp column is =/ I had this in a longer script and it took me forever to understand where the problem came from. Adding encoding="utf-8-sig" to the read_table call solves the issue. Can pandas figure out the encoding on its own, to avoid such problems?

@shoyer
Member

shoyer commented Mar 26, 2015

@zelite In this example, your column was actually spelled ' TimeStamp' (yes, with the preceding space). I don't know if it's possible to detect encodings like this automatically... in general, that can be a hard problem.

@mizzao

mizzao commented Mar 26, 2015

@shoyer I don't think that was a space thing; it was definitely the BOM. If you do something like df.columns[0] you will see the utf-8 characters.
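Concretely, a small diagnostic sketch using zelite's DataFrame from above:

print(repr(test.columns[0]))
# likely '\ufeffTimeStamp' -- the BOM is glued to the name, which is why
# test["TimeStamp"] raises a KeyError.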

@shoyer
Member

shoyer commented Mar 26, 2015

OK, I'm clearly lost. Just looked up exactly what "BOM" is :).

@Zeugma4242

I can reproduce the issue with pandas 0.16.0. I get the same error on the first column name with read_csv, or an erroneous first value in the first column when the file has no header row and the names= argument is passed to read_csv.

It sounds like read_csv correctly interprets encoding='utf-8-sig' by skipping the first 3 bytes of the file and then decoding the rest as UTF-8. However, the behaviour I see makes me think pandas "forgets" to skip those first 3 bytes when it starts to parse the file and build the DataFrame: something like the offset of the effective start of the data never getting +len(UTF8_BOM), which leaves the BOM included in the first column name or in the first cell of the DataFrame. The most misleading part is that the characters do not show up when the DataFrame or the column names are displayed in IPython, but the BOM is clearly still attached to that string, as pointed out in a previous comment above.

HTH

@dr-leo
Contributor

dr-leo commented Jun 23, 2016

With encoding 'utf-8-sig', the BOM is correctly skipped rather than being prepended to the first column label. However, as described by others, the quotes around the first label remain.

The zip archive at
https://www.dropbox.com/s/kcbh7fbsj9fwh13/sample.zip?dl=0

contains a script for demonstration as well as 2 CSV files that differ only by having / not having a BOM. The one without the BOM is parsed correctly; with the BOM, the first label remains quoted.

Python 3.5.1 x86, pandas 0.18.1, Win7 x64

@gfyoung
Member

gfyoung commented Aug 2, 2016

A minimal example for future reference:

>>> from pandas.compat import BytesIO
>>> from pandas import read_csv
>>> import codecs
>>>
>>> BOM = codecs.BOM_UTF8
>>> data = '"name"\n"foo"'.encode('utf-8')
>>>
>>> read_csv(BytesIO(data), encoding='utf-8', engine='c')
# same result if engine='python'
  name
0  foo
>>>
>>> read_csv(BytesIO(BOM + data), encoding='utf-8', engine='c')
# same result if engine='python'
  "name"
0     foo

While I agree that there is a bug in the C engine, I don't believe the same can be said of the Python engine, as csv.reader (the foundation of the Python engine) cannot parse the BOM correctly (this snippet cannot be run on Python 2.x):

>>> from io import TextIOWrapper
>>> from csv import reader
>>> for row in reader(TextIOWrapper(BytesIO(BOM + data), encoding='utf-8')): print(row)
['\ufeff"name"']
['foo']
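For comparison, wrapping the stream with 'utf-8-sig' lets csv.reader see clean input (a sketch with the same imports as above):

>>> for row in reader(TextIOWrapper(BytesIO(BOM + data), encoding='utf-8-sig')): print(row)
['name']
['foo']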

Since the Python engine failure is beyond our control, the question then is: can this issue be closed once we patch the C engine?

@jreback
Contributor

jreback commented Aug 2, 2016

This is mainly an issue on Windows, where these BOM markers easily end up in files, so a patch would be good if possible.

gfyoung added a commit to forking-repos/pandas that referenced this issue Aug 5, 2016
@jreback jreback modified the milestones: 0.19.0, Next Major Release Aug 5, 2016
@jreback jreback closed this as completed in e5ee5d2 Aug 5, 2016
@OlafEichstaedt

OlafEichstaedt commented Feb 1, 2017

Sorry for my ignorance, not sure why this issue is closed.

I have a UTF-8 file with a BOM. The (quoted) content is:

'"node_id";"name_en"\n'
'"";"English name"\n'
'"39252";"TEST USE CASE 1"\n'

x = pd.read_csv('UTF8withBOM.csv', encoding='utf-8-sig', delimiter=";", quotechar='"')
print(x)

  "node_id"          name_en
0       NaN     English name
1   39252.0  TEST USE CASE 1

... with the problem that the first header 'node_id' has kept its quotes, the same as in dr-leo's comment from Jun 23, 2016 above. All other quotes were removed correctly.

If I use a UTF-8 file without a BOM:

x = pd.read_csv('UTF8withoutBOM.csv', encoding='utf-8', delimiter=";", quotechar='"')
print(x)

   node_id          name_en
0      NaN     English name
1  39252.0  TEST USE CASE 1

the quotes are correctly removed from the first header.

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

@jreback
Contributor

jreback commented Feb 1, 2017

the issue was closed in 0.19.0

try using a more recent version

@jcrotinger

I'm using pandas.read_fwf in 0.19.0 and seeing a similar issue.

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.25.2
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.3.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.1
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.3.2
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: 0.7.9.None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

@dr-leo
Contributor

dr-leo commented Oct 18, 2017 via email

@Bhaskar-Arya

Use sep='\t':

df = pd.read_csv('titanic.csv', sep='\t')

@ingomueller-net

ingomueller-net commented Feb 1, 2019

I am still experiencing this problem on Windows with Pandas 0.24.0.

(False alarm -- I was opening the file with argparse.)

@jaspergrond

I ran into this problem as well. I exported a table as a .csv file from DBeaver. There is an option to "Insert BOM" that was selected. The header was not read in properly by read_csv() (only the first column name was read). When I exported with "Insert BOM" un-selected read_csv() worked correctly.

I am running Python 3.7.2 with Pandas 0.23.4 on Linux Mint 19.1.

@fentonscode

fentonscode commented May 22, 2020

The bug is still present in 1.0.3 as of today. I replicated it with all combinations of the Python UTF-8 encoding string, with and without hyphens, underscores, and the "sig" suffix. Stepping through some of the code showed that the encoding was stuck at code page 1252 for a long time before the file was read as UTF-8. Is it not being set early enough?

  • OS: Windows 10 x64
  • Python: 3.7.4
  • Version: pandas 1.0.3, installed via pip 20.1.1

ETA: Nothing online appears to cover this, but it seems the problem can be replicated (and solved) as follows:

import pandas as pd
...
with open(filename, encoding="xxx") as f_handle:
    data = pd.read_csv(f_handle, encoding="yyy")

It was not clear from the documentation that encoding xxx overrides encoding yyy when reading the header names, but does not override yyy when reading the rows. This matters because not setting encoding xxx lets the default value override yyy. I might have missed this in the docs; if I haven't, it should be added as a note somewhere.
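For anyone hitting the same interaction, the simplest way to sidestep it (a sketch, assuming the file really is UTF-8 with a BOM) is to pass the path and a single encoding instead of an already-opened handle:

import pandas as pd

# Let pandas open the file itself so one encoding applies to both the
# header names and the rows.
data = pd.read_csv(filename, encoding='utf-8-sig')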
