read_csv includes the BOM of an utf8 file into the first column label #13497

Closed
dr-leo opened this issue Jun 22, 2016 · 6 comments
Labels
Duplicate Report Duplicate issue or pull request IO CSV read_csv, to_csv

Comments

@dr-leo
Contributor

dr-leo commented Jun 22, 2016

Consider the following script:

import pandas as PD

df = PD.read_csv('sample.csv', encoding='utf8')
print(df.columns[:5])

@dr-leo
Contributor Author

dr-leo commented Jun 22, 2016

Attached is a utf8 or iso-8859-1 encoded csv file. It is part of the IMF financial data from data.imf.org.

The following code shows that the first column label starts with the file's BOM and includes the '"', even though all other labels rightly do not contain '"', as it is the quote character.

import pandas as PD

df = PD.read_csv('sample.csv', encoding='utf8')
print(df.columns[:5])

The BOM is in fact prepended to the first column label even if utf8 is specified.

sample.zip

@dr-leo
Contributor Author

dr-leo commented Jun 22, 2016

I should add that this bug occurs on Win7x64, Python 3.5.1, 32bit, pandas 0.18.1.

@TomAugspurger
Contributor

This looks like a duplicate of #4793. Any interest in submitting a fix?

Also @dr-leo, for this to work you would need pd.read_csv('sample.csv', encoding='utf-8-sig'), correct?
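For reference, a minimal sketch of the difference between the two codecs, using only the standard library rather than pandas. The header bytes below are made up to mimic the attached file; they are an assumption, not the file's actual contents.

```python
import codecs

# Hypothetical header line: a UTF-8 BOM followed by quoted column labels.
raw = codecs.BOM_UTF8 + b'"Country Name","Country Code"\n'

# Plain utf-8 keeps the BOM as the character U+FEFF at the start of the text.
with_bom = raw.decode("utf-8")

# utf-8-sig drops a leading BOM while decoding.
without_bom = raw.decode("utf-8-sig")

print(repr(with_bom[0]))
print(repr(without_bom[0]))
```

This only shows what the decoders do with the BOM; how a given pandas version then parses the first label is a separate question.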

@TomAugspurger TomAugspurger added Duplicate Report Duplicate issue or pull request IO CSV read_csv, to_csv labels Jun 22, 2016
@dr-leo
Contributor Author

dr-leo commented Jun 23, 2016

When I use utf-8-sig, the BOM is gone, but the first column label is
'"Country Name"' instead of 'Country Name'. The others are ok though.

So I'll open a new issue. Or is there a workaround?
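One possible workaround, sketched under assumptions: clean the column labels after reading. The helper name clean_label is hypothetical and is not part of pandas. It assumes that no label legitimately begins or ends with a double quote.

```python
def clean_label(label):
    """Strip a leading BOM and any surrounding double quotes from a label."""
    # Remove U+FEFF first so that a quote hidden behind it becomes strippable.
    return label.lstrip("\ufeff").strip('"')


# Made-up labels covering the cases described in this thread.
labels = ['\ufeff"Country Name"', '"Country Name"', 'Country Code']
cleaned = [clean_label(name) for name in labels]
print(cleaned)
```

With a DataFrame this could be applied as df.columns = [clean_label(c) for c in df.columns].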


@jreback
Contributor

jreback commented Jun 23, 2016

this is the same exact issue as #4793

@dr-leo
Contributor Author

dr-leo commented Jun 23, 2016

Right. Tom has spotted this as well. I had only searched the open
issues for duplicates. I'll make a PR to improve the docs once I find
some time.

Using utf-8-sig uncovers another issue: the quote characters are not removed
from the first column name, which starts right after the BOM.

Leo

