Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Ignore the BOM in UTF8 BOM CSV files #13885

Closed
wants to merge 1 commit into from

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Aug 3, 2016

Title is self-explanatory. Closes #4793.

# BOM character (byte order mark)
# This exists at the beginning of a file to indicate endianness
# of a file (stream). Unfortunately, this marker causes parsing
# to screw up parsing, so we need to remove it if we see it.
Copy link
Member Author

@gfyoung gfyoung Aug 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"causes parsing to screw up parsing" --> "screws up parsing"

@jreback jreback added IO CSV read_csv, to_csv Bug labels Aug 3, 2016
not the middle of it.
"""
# first_row will be a list, so we need to check
# that that list is not empty before proceeding.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems amazingly complicated!

Copy link
Member Author

@gfyoung gfyoung Aug 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, more than I thought. Currently, tests are failing for Python 2.x because it doesn't see \ufeff as a single character. However, the quickfix of importing unicode_literals from __future__ breaks everything in parsers...

Converting the BOM character to unicode (it then is correctly seen as one character) fails Python's csv can't seem to read those characters at the moment...

@codecov-io
Copy link

codecov-io commented Aug 4, 2016

Current coverage is 85.30% (diff: 80.64%)

Merging #13885 into master will decrease coverage by <.01%

@@             master     #13885   diff @@
==========================================
  Files           139        139          
  Lines         50108      50138    +30   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42744      42768    +24   
- Misses         7364       7370     +6   
  Partials          0          0          

Powered by Codecov. Last update 2beab41...34bc8e5

@gfyoung gfyoung changed the title BUG: Ignore the BOM in BOM CSV files BUG: Ignore the BOM in UTF8 BOM CSV files Aug 4, 2016
@@ -704,6 +704,11 @@ static int parser_buffer_bytes(parser_t *self, size_t nbytes) {
self->datapos = i; \
TRACE(("_TOKEN_CLEANUP: datapos: %d, datalen: %d\n", self->datapos, self->datalen));

#define CHECK_FOR_BOM() \
if (*buf == '\xef' && *(buf + 1) == '\xbb' && *(buf + 2) == '\xbf') { \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you have to handle the quote char like you do in the python parser?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we parse char by char, so the quotation mark will be handled in _tokenize_bytes like any other character.

@jreback
Copy link
Contributor

jreback commented Aug 4, 2016

was just about to merge (testing on macosx / windows). and on mac got this warning (though no error)?

\[jreback-~/pandas] nosetests  -A 'not slow and not network' pandas/io/tests/parser/
.......................................................................................................................................................................................S.............S........................................................................................................................................................................................S.............S...S......................................................................................................................................................................./Users/jreback/pandas/pandas/io/parsers.py:2192: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if not first_row[0] or first_row[0][0] != _BOM:
.S.............S..S...............S.............................
----------------------------------------------------------------------
Ran 632 tests in 63.239s

@gfyoung
Copy link
Member Author

gfyoung commented Aug 5, 2016

@jreback : It's a Python 2.x thing because you're comparing with a string that cannot be cast to Unicode. I made an explicit check for it that should hopefully get rid of the warning.

@jreback jreback added this to the 0.19.0 milestone Aug 5, 2016
@jreback
Copy link
Contributor

jreback commented Aug 5, 2016

thanks!

@jreback
Copy link
Contributor

jreback commented Aug 5, 2016

closed by e5ee5d2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants