read_csv does not parse in header with BOM utf-8 #4793

Closed
johnclinaa opened this issue Sep 9, 2013 · 27 comments
Labels
Bug IO CSV read_csv, to_csv
Comments

@johnclinaa

I am using Pandas version 0.12.0 on a Mac.

I noticed that when reading a UTF-8 file with a BOM, if the header row is the first line of the file, the read_csv() method leaves a leading quotation mark in the first column's name. However, if the header row is further down the file and I use the header= option, the whole header row gets parsed correctly.

Here is some example code:

bing_kw = pd.read_csv('../../data/sem/Bing-Keyword_daily.csv', header=9, thousands=',', encoding='utf-8')

Parses the header correctly.

bing_kw = pd.read_csv('../../data/sem/Bing-Keyword_daily.csv', thousands=',', encoding='utf-8')

Parses the first header column name incorrectly, leaving a leading quotation mark in place.
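For anyone trying to reproduce this without the original data, here is a minimal sketch (the file name is made up; it just writes a small UTF-8 CSV with a BOM and reads it back):

import codecs
import pandas as pd

# Write a tiny CSV prefixed with the UTF-8 BOM (hypothetical file name).
with open('bom_example.csv', 'wb') as f:
    f.write(codecs.BOM_UTF8 + b'"name","value"\n"foo","1"\n')

# Reading it back with plain utf-8 leaves the BOM attached to the first header.
df = pd.read_csv('bom_example.csv', encoding='utf-8')
print(repr(df.columns[0]))  # not the clean 'name' you would expect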

@jtratner
Contributor

jtratner commented Sep 9, 2013

which version of Python are you using? (python --version)

@johnclinaa
Author

Python 2.7.5 :: Anaconda 1.6.1 (x86_64)

And the code is written in an IPython notebook using IPython 1.0.0.

@jreback
Contributor

jreback commented Sep 28, 2013

@john-orange-aa can you provide a reproducible example (link to a file if you need to)

@johnclinaa
Author

@jreback, here is the link to the folder:
https://drive.google.com/folderview?id=0BwxOyJG828PySFFQVlBSUEdlcEk&usp=sharing

BOM-temp.csv is the offending file. BOM-temp2.csv is the same file with headers removed. The "pandas BOM utf-8 bug.ipynb" is the ipython notebook that illustrates the bug.


@jseabold
Contributor

This isn't exactly the same issue, but I'm also having trouble with BOMs. File with a utf-8 BOM here.

    In [1]: pd.version.version
    Out[1]: '0.12.0-1149-g141e93a'

    In [2]: dta = pd.read_csv("bom_file.csv", encoding='utf-8')

    In [3]: dta.columns[0]
    Out[3]: u'\ufeffDate (MMM-YY)'

@jseabold
Contributor

It looks like you should use 'utf-8-sig' as the encoding for utf-8 files with a BOM, so my comment is likely invalid.
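For reference, a minimal sketch of that workaround on the file from the example above:

import pandas as pd

# 'utf-8-sig' consumes a leading BOM while decoding, so it never reaches
# the parser or the first column name.
dta = pd.read_csv("bom_file.csv", encoding='utf-8-sig')
print(dta.columns[0])  # 'Date (MMM-YY)' without the '\ufeff' prefix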

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 28, 2014
@filmackay

Is it possible for pandas to infer the encoding from the BOM automatically, or does this information really need to be passed in through encoding=? I thought providing exactly this kind of capability was the point of the BOM.
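pandas does not do this sniffing itself here, but a caller can; a rough sketch of detecting the BOM up front (the wrapper function and file name are hypothetical):

import codecs
import pandas as pd

def read_csv_bom_aware(path, **kwargs):
    # Peek at the first few bytes and choose an encoding based on any BOM found.
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        kwargs.setdefault('encoding', 'utf-8-sig')
    elif head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        kwargs.setdefault('encoding', 'utf-16')  # the utf-16 codec consumes the BOM
    else:
        kwargs.setdefault('encoding', 'utf-8')
    return pd.read_csv(path, **kwargs)

df = read_csv_bom_aware('some_file.csv')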

@ayee

ayee commented Jul 19, 2014

'utf-8-sig' does not resolve the issue for me. I faced the same problem, but using 'utf-8-sig' just gave me a different decoding error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 14: invalid start byte

@misea

misea commented Feb 20, 2015

I ran into this problem as well (version 0.15.2). I tried the 'utf-8-sig' encoding, and although I didn't see an error, the result was not quite right: the first key keeps its quotes while none of the other keys do, even though all column headers and values are quoted throughout the file.

f = open('data/bomex.csv')
header = f.read(10)
f.close()
header

out: '\xef\xbb\xbf"Name",'

f = codecs.open('data/bomex.csv', encoding='utf-8-sig')
header = f.read(10)
f.close()
header

out: u'"Name","Team","G"'

df = pd.read_csv('data/bomex.csv', encoding='utf-8-sig')
df.keys()[0]

out: u'"Name"'

df.keys()[1]

out: u'Team'

Note the extra set of quotes on the first key
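Until this is fixed, a workaround that seems to help in this situation (a sketch, not the eventual fix) is to strip any leftover quotes from the header after reading:

import pandas as pd

df = pd.read_csv('data/bomex.csv', encoding='utf-8-sig')
# Strip the quotes the parser failed to remove from the first header.
df.columns = [c.strip('"') for c in df.columns]
df.keys()[0]  # u'Name'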

@mizzao

mizzao commented Mar 3, 2015

With 0.15.2, I am able to use encoding="utf-8-sig" and the BOM disappears from the first column header.

@misea

misea commented Mar 3, 2015

In 0.15.2, I find that the BOM disappears, but the quotes around the first column header are erroneously preserved, while quotes around all other column headers (and all other values) are stripped.

So the problem with utf-8-sig seems to only affect quoted column headers. Here's an example file to try

https://dl.dropboxusercontent.com/u/27287953/bom.csv

...

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@zelite

zelite commented Mar 26, 2015

Hi,

I'm using pandas 0.16 in python 3.4 (anaconda distro).
I was having a problem with some files, and it seems to be related to the BOM of the file.
I have this file: https://www.dropbox.com/s/nced7whmt2rr0c8/sample_file2.txt

I'm reading the file with:

import pandas as pd
test = pd.read_table("sample_file2.txt", decimal = ",", 
                     parse_dates = True)

When I print the column names:

>>> print(test.columns)
Index(['TimeStamp', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14',
 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29',
 'V30', 'V31'], dtype='object')

So, there seems to be a 'TimeStamp' column. Let me check what is in there:

>>> test["TimeStamp"] 
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1787, in __getitem__
[truncated]
File "pandas\hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item
(pandas\hashtable.c:12300)
KeyError: 'TimeStamp'

And I can't seem to find out what the real name of the TimeStamp column is =/ I had this in a longer script and it took me forever to understand where the problem came from. Adding encoding="utf-8-sig" to the read_table call solves the issue. Can pandas figure out the encoding on its own, to avoid such problems?

@shoyer
Member

shoyer commented Mar 26, 2015

@zelite In this example, your column was actually spelled ' TimeStamp' (yes, with the preceding space). I don't know if it's possible to detect encodings like this automatically... in general, that can be a hard problem.

@mizzao

mizzao commented Mar 26, 2015

@shoyer I don't think that was a space thing; it was definitely the BOM. If you do something like df.columns[0] you will see the utf-8 characters.
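Concretely, a small diagnostic sketch using zelite's DataFrame from above:

print(repr(test.columns[0]))
# likely '\ufeffTimeStamp' -- the BOM is glued to the name, which is why
# test["TimeStamp"] raises a KeyError.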

@shoyer
Member

shoyer commented Mar 26, 2015

OK, I'm clearly lost. Just looked up exactly what "BOM" is :).

@Zeugma4242

I can reproduce the issue with pandas 0.16.0. I get the same error on the first column name with read_csv, or an erroneous first value in the first column when the file has no header row and the names= argument is passed to read_csv.

It sounds like read_csv correctly interprets encoding='utf-8-sig' by skipping the first 3 bytes of the file and then decoding the rest as UTF-8. However, the behaviour I see makes me think pandas "forgets" to skip those first 3 bytes when it starts to parse the file and build the DataFrame: something like the offset of the effective start of the data never getting +len(UTF8_BOM), which leaves the BOM included in the first column name or in the first cell of the DataFrame. The most misleading part is that the characters do not show up when the DataFrame or the column names are displayed in IPython, but the BOM is clearly still attached to that string, as pointed out in a previous comment above.

HTH

@dr-leo
Contributor

dr-leo commented Jun 23, 2016

With encoding 'utf-8-sig', the BOM is correctly skipped rather than being prepended to the first column label. However, as described by others, the quotes around the first label remain.

The zip archive at
https://www.dropbox.com/s/kcbh7fbsj9fwh13/sample.zip?dl=0

contains a script for demonstration as well as 2 CSV files that differ only by having / not having a BOM. The one without the BOM is parsed correctly; with the BOM, the first label remains quoted.

Python 3.5.1 x86, pandas 0.18.1, Win7 x64

@gfyoung
Member

gfyoung commented Aug 2, 2016

A minimal example for future reference:

>>> from pandas.compat import BytesIO
>>> from pandas import read_csv
>>> import codecs
>>>
>>> BOM = codecs.BOM_UTF8
>>> data = '"name"\n"foo"'.encode('utf-8')
>>>
>>> read_csv(BytesIO(data), encoding='utf-8', engine='c')
# same result if engine='python'
  name
0  foo
>>>
>>> read_csv(BytesIO(BOM + data), encoding='utf-8', engine='c')
# same result if engine='python'
  "name"
0     foo

While I agree that there is a bug in the C engine, I don't believe the same can be said of the Python engine, as csv.reader (the foundation of the Python engine) cannot parse the BOM correctly (this snippet cannot be run on Python 2.x):

>>> from io import TextIOWrapper
>>> from csv import reader
>>> for row in reader(TextIOWrapper(BytesIO(BOM + data), encoding='utf-8')): print(row)
['\ufeff"name"']
['foo']
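For comparison, wrapping the stream with 'utf-8-sig' lets csv.reader see clean input (a sketch with the same imports as above):

>>> for row in reader(TextIOWrapper(BytesIO(BOM + data), encoding='utf-8-sig')): print(row)
['name']
['foo']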

Since the Python engine failure is beyond our control, the question then is: can this issue be closed once we patch the C engine?

@jreback
Contributor

jreback commented Aug 2, 2016

This is mainly an issue on Windows, where these BOM markers easily end up in files, so a patch would be good if possible.

gfyoung added a commit to forking-repos/pandas that referenced this issue Aug 5, 2016
@jreback jreback modified the milestones: 0.19.0, Next Major Release Aug 5, 2016
@jreback jreback closed this as completed in e5ee5d2 Aug 5, 2016
@OlafEichstaedt

OlafEichstaedt commented Feb 1, 2017

Sorry for my ignorance, not sure why this issue is closed.

I have a UTF-8 file with a BOM. The (quoted) content is:

'"node_id";"name_en"\n'
'"";"English name"\n'
'"39252";"TEST USE CASE 1"\n'

x = pd.read_csv('UTF8withBOM.csv', encoding='utf-8-sig', delimiter=";", quotechar='"')
print(x)

  "node_id"          name_en
0       NaN     English name
1   39252.0  TEST USE CASE 1

... with the problem that the first header 'node_id' has kept its quotes, the same as in dr-leo's comment from Jun 23, 2016 above. All other quotes were removed correctly.

If I use a UTF-8 file without a BOM:

x = pd.read_csv('UTF8withoutBOM.csv', encoding='utf-8', delimiter=";", quotechar='"')
print(x)

   node_id          name_en
0      NaN     English name
1  39252.0  TEST USE CASE 1

the quotes are correctly removed from the first header.

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

@jreback
Contributor

jreback commented Feb 1, 2017

the issue was closed in 0.19.0

try using a more recent version

@jcrotinger

I'm using pandas.read_fwf in 0.19.0 and seeing a similar issue.

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.25.2
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.3.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.1
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.3.2
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: 0.7.9.None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

@dr-leo
Contributor

dr-leo commented Oct 18, 2017 via email

@Bhaskar-Arya

Use sep='\t':

df = pd.read_csv('titanic.csv', sep='\t')

@ingomueller-net

ingomueller-net commented Feb 1, 2019

I am still experiencing this problem on Windows with Pandas 0.24.0.

(False alarm -- I was opening the file with argparse.)

@jaspergrond

I ran into this problem as well. I exported a table as a .csv file from DBeaver. There is an option to "Insert BOM" that was selected. The header was not read in properly by read_csv() (only the first column name was read). When I exported with "Insert BOM" un-selected read_csv() worked correctly.

I am running Python 3.7.2 with Pandas 0.23.4 on Linux Mint 19.1.

@fentonscode

fentonscode commented May 22, 2020

The bug is still present in 1.0.3 as of today. I replicated it with all combinations of the Python UTF-8 encoding string, with and without hyphens, underscores, and the "sig" suffix. Stepping through some of the code showed that the encoding was stuck at code page 1252 for a long time before the file was read as UTF-8. Is it not being set early enough?

  • OS: Windows 10 x64
  • Python: 3.7.4
  • Version: pandas 1.0.3, installed via pip 20.1.1

ETA: Nothing online appears to cover this, but it seems the problem can be replicated (and solved) as follows:

import pandas as pd
...
with open(filename, encoding="xxx") as f_handle:
    data = pd.read_csv(f_handle, encoding="yyy")

It was not clear from the documentation that encoding xxx overrides encoding yyy when reading the header names, but does not override yyy when reading the rows. This matters because not setting encoding xxx lets the default value override yyy. I might have missed this in the docs; if I haven't, it should be added as a note somewhere.
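For anyone hitting the same interaction, the simplest way to sidestep it (a sketch, assuming the file really is UTF-8 with a BOM) is to pass the path and a single encoding instead of an already-opened handle:

import pandas as pd

# Let pandas open the file itself so one encoding applies to both the
# header names and the rows.
data = pd.read_csv(filename, encoding='utf-8-sig')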
