BUG: Ignore the BOM in UTF8 BOM CSV files #13885

gfyoung · 2016-08-03T07:35:54Z

Title is self-explanatory. Closes #4793.

gfyoung · 2016-08-03T07:37:18Z

pandas/io/parsers.py

+# BOM character (byte order mark)
+# This exists at the beginning of a file to indicate endianness
+# of a file (stream). Unfortunately, this marker causes parsing
+# to screw up parsing, so we need to remove it if we see it.


"causes parsing to screw up parsing" --> "screws up parsing"
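
For context, a minimal sketch of the symptom using only the Python 3 standard library. This is not the pandas code path, and the sample bytes are made up for illustration: decoding with `utf-8` leaves the BOM attached to the first header field, whereas `utf-8-sig` drops it.

```python
import csv
import io

# Hypothetical sample: a UTF-8 CSV that starts with a BOM (EF BB BF).
raw = b"\xef\xbb\xbfa,b\n1,2\n"

# Plain "utf-8" keeps the BOM as U+FEFF at the start of the decoded text.
header_plain = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# "utf-8-sig" strips a leading BOM while decoding.
header_sig = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

print(header_plain)  # the first field still carries the BOM
print(header_sig)    # the first field is clean
```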

jreback · 2016-08-03T10:20:13Z

pandas/io/parsers.py

+        not the middle of it.
+        """
+        # first_row will be a list, so we need to check
+        # that that list is not empty before proceeding.


seems amazingly complicated!

Indeed, more than I thought. Currently, tests are failing for Python 2.x because it doesn't see \ufeff as a single character. However, the quickfix of importing unicode_literals from __future__ breaks everything in parsers...

Converting the BOM character to unicode (so that it is correctly seen as one character) fails, because Python's csv module can't seem to read those characters at the moment...
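
The one-character versus several-bytes distinction can be sketched as follows. This is a Python 3 illustration only, not the code in this PR: as text the BOM is a single code point, while its UTF-8 encoding is three bytes.

```python
import codecs

# As text, the BOM is the single code point U+FEFF.
bom_text = "\ufeff"

# Encoded as UTF-8, the same mark occupies three bytes.
bom_bytes = bom_text.encode("utf-8")

print(len(bom_text))
print(len(bom_bytes))
print(bom_bytes == codecs.BOM_UTF8)
```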

codecov-io · 2016-08-04T04:55:13Z

Current coverage is 85.30% (diff: 80.64%)

Merging #13885 into master will decrease coverage by <.01%

@@             master     #13885   diff @@
==========================================
  Files           139        139          
  Lines         50108      50138    +30   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42744      42768    +24   
- Misses         7364       7370     +6   
  Partials          0          0

Powered by Codecov. Last update 2beab41...34bc8e5

jreback · 2016-08-04T10:29:28Z

pandas/src/parser/tokenizer.c

@@ -704,6 +704,11 @@ static int parser_buffer_bytes(parser_t *self, size_t nbytes) {
    self->datapos = i;                                                  \
    TRACE(("_TOKEN_CLEANUP: datapos: %d, datalen: %d\n", self->datapos, self->datalen));

+#define CHECK_FOR_BOM()                                                   \
+    if (*buf == '\xef' && *(buf + 1) == '\xbb' && *(buf + 2) == '\xbf') { \


why don't you have to handle the quote char like you do in the python parser?

Because we parse char by char, so the quotation mark will be handled in _tokenize_bytes like any other character.
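
A rough Python analogue of the byte-level check, as a sketch only: `skip_utf8_bom` is a hypothetical helper, not a function in pandas or in tokenizer.c. It drops the three BOM bytes and leaves everything after them, including a quote character, for the normal tokenizing pass.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def skip_utf8_bom(buf: bytes) -> bytes:
    """Return buf without a leading UTF-8 BOM, if one is present.

    Hypothetical helper for illustration; it mirrors the three-byte
    comparison in the CHECK_FOR_BOM macro above.
    """
    if buf[:3] == UTF8_BOM:
        return buf[3:]
    return buf


# The quote after the BOM is untouched and would be handled by the
# regular character-by-character parse.
data = b'\xef\xbb\xbf"a",b\n'
print(skip_utf8_bom(data))
```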

jreback · 2016-08-04T21:20:25Z

Was just about to merge (testing on macOS / Windows), and on Mac I got this warning (though no error)?

[jreback-~/pandas] nosetests  -A 'not slow and not network' pandas/io/tests/parser/
.......................................................................................................................................................................................S.............S........................................................................................................................................................................................S.............S...S......................................................................................................................................................................./Users/jreback/pandas/pandas/io/parsers.py:2192: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if not first_row[0] or first_row[0][0] != _BOM:
.S.............S..S...............S.............................
----------------------------------------------------------------------
Ran 632 tests in 63.239s

Closes pandas-devgh-4793.

gfyoung · 2016-08-05T05:36:44Z

@jreback : It's a Python 2.x thing because you're comparing with a string that cannot be cast to Unicode. I made an explicit check for it that should hopefully get rid of the warning.
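
A sketch of what such guarded checking could look like, written for Python 3 and under assumptions: `strip_bom` is a hypothetical name, this is not the code merged in the PR, and the quote-character handling discussed earlier is left out. The idea is to confirm that the first element is a string before comparing it with the BOM.

```python
BOM = "\ufeff"


def strip_bom(first_row):
    """Return first_row with a leading BOM removed from its first field.

    Hypothetical helper for illustration only.
    """
    # An empty row has nothing to inspect.
    if not first_row:
        return first_row

    first = first_row[0]

    # Only compare against the BOM when the element is really a string.
    if not isinstance(first, str) or not first.startswith(BOM):
        return first_row

    return [first[len(BOM):]] + list(first_row[1:])


print(strip_bom(["\ufeffa", "b"]))
```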

jreback · 2016-08-05T10:35:52Z

thanks!

jreback · 2016-08-05T10:37:31Z

closed by e5ee5d2

gfyoung reviewed Aug 3, 2016

jreback added IO CSV read_csv, to_csv Bug labels Aug 3, 2016

jreback reviewed Aug 3, 2016

gfyoung force-pushed the bom-read-csv branch from 2e3a9fb to cef94f8 Compare August 4, 2016 04:55

gfyoung changed the title ~~BUG: Ignore the BOM in BOM CSV files~~ BUG: Ignore the BOM in UTF8 BOM CSV files Aug 4, 2016

jreback reviewed Aug 4, 2016

BUG: Ignore the BOM in BOM UTF-8 CSV files

34bc8e5

Closes pandas-devgh-4793.

gfyoung force-pushed the bom-read-csv branch from cef94f8 to 34bc8e5 Compare August 5, 2016 05:35

jreback added this to the 0.19.0 milestone Aug 5, 2016

jreback closed this Aug 5, 2016

gfyoung deleted the bom-read-csv branch August 5, 2016 15:47

TomAugspurger mentioned this pull request Aug 17, 2016

groupby count throws an encoding error: can't decode byte at position 0 dask/dask#1476

Closed

TomAugspurger mentioned this pull request Jun 13, 2017

KeyError from pandas DataFrame groupby for Windows based csv files #16690

Closed
