read_csv includes the BOM of an utf8 file into the first column label #13497

dr-leo · 2016-06-22T20:42:58Z

Consider the following script:

import pandas as pd

df = pd.read_csv('sample.csv', encoding='utf8')
print(df.columns[:5])

dr-leo · 2016-06-22T20:50:51Z

Attached is a csv file encoded as utf8 or iso-8859-1. It is part of the IMF financial data from data.imf.org.

The following code shows that the first column label starts with the file's BOM and includes the '"' characters, even though all other labels correctly omit them, since '"' is the quote character.

import pandas as pd

df = pd.read_csv('sample.csv', encoding='utf8')
print(df.columns[:5])

The BOM is in fact prepended to the first column label even if utf8 is specified.
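
To check whether a file such as sample.csv really starts with a UTF-8 BOM, one can inspect its first three bytes. The following is a minimal sketch using only the standard library. The byte string is a made-up stand-in for the attached file, and has_utf8_bom is a hypothetical helper name, not part of pandas.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Made-up stand-in for the first bytes of sample.csv (an assumption).
sample = codecs.BOM_UTF8 + b'"Country Name","Country Code"\r\n'

print(has_utf8_bom(sample))
# The plain utf-8 codec does not drop the BOM; it decodes it to U+FEFF.
print(sample.decode("utf-8").startswith("\ufeff"))
```

If the check is true, the BOM will show up as the character U+FEFF in front of whatever text follows it whenever the bytes are decoded with the plain utf-8 codec.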

sample.zip

dr-leo · 2016-06-22T20:53:54Z

I should add that this bug occurs on Win7x64, Python 3.5.1, 32bit, pandas 0.18.1.

TomAugspurger · 2016-06-22T23:49:39Z

This looks like a duplicate of #4793. Any interest in submitting a fix?

Also @dr-leo, for this to work you would need pd.read_csv('sample.csv', encoding='utf-8-sig'), correct?
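
The difference between the two codecs can be illustrated without pandas. This is a hedged sketch using Python's own csv module, which is not the parser pandas uses, so it only shows what the codecs do and why a leading BOM can keep the quotes from being recognised. The sample bytes and the first_row helper are assumptions made for illustration.

```python
import csv
import io

# Made-up stand-in for sample.csv: UTF-8 BOM, then quoted column labels.
SAMPLE = b'\xef\xbb\xbf"Country Name","Country Code"\r\n"Germany","DE"\r\n'


def first_row(data: bytes, encoding: str) -> list:
    """Decode data with the given codec and return the first CSV row."""
    text = data.decode(encoding)
    return next(csv.reader(io.StringIO(text, newline="")))


# utf-8 keeps the BOM as U+FEFF, so the first field no longer starts
# with a quote character and its quotes are kept as ordinary characters.
with_bom = first_row(SAMPLE, "utf-8")

# utf-8-sig drops the BOM while decoding, so the quotes are parsed normally.
without_bom = first_row(SAMPLE, "utf-8-sig")

print(repr(with_bom[0]))
print(repr(without_bom[0]))
```

In this sketch only the first label is affected in the utf-8 case; the second label is parsed the same way under both codecs.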

dr-leo · 2016-06-23T05:16:10Z

When I use utf-8-sig, the BOM is gone, but the first column label is
'"Country Name"' instead of 'Country Name'. The others are ok though.

So I'll open a new issue. Or is there a workaround?
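
One possible workaround, sketched here as an assumption rather than something proposed in this thread, is to clean the labels after reading. clean_label is a hypothetical helper, and the label list is made up to mirror the output described above.

```python
def clean_label(label: str) -> str:
    """Strip a leading BOM character and surrounding double quotes."""
    return label.lstrip("\ufeff").strip('"')


# Made-up labels mirroring what is reported above.
labels = ['\ufeff"Country Name"', "Country Code"]
cleaned = [clean_label(label) for label in labels]
print(cleaned)
```

With a DataFrame, the same function could presumably be applied via df.rename(columns=clean_label). Note that this also strips double quotes that legitimately begin or end a label.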

jreback · 2016-06-23T05:24:15Z

this is the same exact issue as #4793

dr-leo · 2016-06-23T08:17:44Z

Right. Tom has spotted this as well. I had only searched the open
issues for duplicates. I'll make a PR to improve the docs once I find
some time.

Using utf-8-sig uncovers another issue: the quote characters are not
removed from the first column name, which starts right after the BOM.

Leo

TomAugspurger closed this as completed Jun 22, 2016

TomAugspurger added the Duplicate Report and IO CSV labels Jun 22, 2016
