BUG: Ignore the BOM in UTF8 BOM CSV files #13885
Conversation
# BOM character (byte order mark)
# This exists at the beginning of a file to indicate endianness
# of a file (stream). Unfortunately, this marker causes parsing
# to screw up parsing, so we need to remove it if we see it.
"causes parsing to screw up parsing" --> "screws up parsing"
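A minimal illustration of the problem the comment above describes, using only the standard `csv` module: when the BOM is not stripped, it gets glued onto the first field of the first row, so a header named `a` comes back as `\ufeffa` and lookups by column name fail.

```python
import csv
import io

# A UTF-8 stream whose decoded text still begins with the BOM.
raw = "\ufeffa,b\n1,2\n"

rows = list(csv.reader(io.StringIO(raw)))
print(rows[0])  # ['\ufeffa', 'b'] -- the BOM is stuck to the first header
```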
not the middle of it.
"""
# first_row will be a list, so we need to check
# that that list is not empty before proceeding.
seems amazingly complicated!
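For reference, the BOM-stripping logic being discussed can be sketched in pure Python. This is a simplified, hypothetical `check_for_bom`, assuming the quoted-field case behaves as described in the diff; the real pandas helper may differ in name and details.

```python
_BOM = "\ufeff"

def check_for_bom(first_row, quotechar='"'):
    """Strip a leading UTF-8 BOM from the first field of the first row.

    Hypothetical sketch of the approach in this PR, not the actual
    pandas implementation.
    """
    # first_row will be a list, so we need to check
    # that that list is not empty before proceeding.
    if not first_row:
        return first_row

    # Only the first field can carry the BOM, and only if it is a
    # string that actually begins with the BOM.
    first = first_row[0]
    if not isinstance(first, str) or not first.startswith(_BOM):
        return first_row

    first = first[1:]  # drop the BOM itself

    # If the field was quoted, the BOM sat *outside* the quotes, so the
    # tokenizer kept the quote characters literally: '\ufeff"a"'.
    # After dropping the BOM, strip the now-leading quote pair as well.
    if len(first) > 1 and first[0] == quotechar and first[-1] == quotechar:
        first = first[1:-1]

    return [first] + first_row[1:]

print(check_for_bom(["\ufeffa", "b"]))    # ['a', 'b']
print(check_for_bom(['\ufeff"a"', "b"]))  # ['a', 'b']
```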
Indeed, more than I thought. Currently, tests are failing for Python 2.x because it doesn't see `\ufeff` as a single character. However, the quick fix of importing `unicode_literals` from `__future__` breaks everything in `parsers`... Converting the BOM character to unicode (it is then correctly seen as one character) fails as well: Python's `csv` module can't seem to read those characters at the moment...
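The Python 2 confusion above comes down to the two representations of the BOM: a single code point (U+FEFF) in unicode text, but three bytes in UTF-8, which is what a byte-oriented parser (or a Python 2 `str`) actually sees at the start of the file. In Python 3 terms:

```python
# As text, the BOM is one code point: U+FEFF.
bom = "\ufeff"
print(len(bom))  # 1

# As UTF-8 bytes, it is three bytes: EF BB BF.
encoded = bom.encode("utf-8")
print(encoded)       # b'\xef\xbb\xbf'
print(len(encoded))  # 3
```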
Current coverage is 85.30% (diff: 80.64%)

@@            master   #13885   diff @@
==========================================
  Files          139      139
  Lines        50108    50138      +30
  Methods          0        0
  Messages         0        0
  Branches         0        0
==========================================
+ Hits         42744    42768      +24
- Misses        7364     7370       +6
  Partials         0        0
@@ -704,6 +704,11 @@ static int parser_buffer_bytes(parser_t *self, size_t nbytes) {
    self->datapos = i; \
    TRACE(("_TOKEN_CLEANUP: datapos: %d, datalen: %d\n", self->datapos, self->datalen));

#define CHECK_FOR_BOM() \
    if (*buf == '\xef' && *(buf + 1) == '\xbb' && *(buf + 2) == '\xbf') { \
why don't you have to handle the quote char like you do in the python parser?
Because we parse char by char, the quotation mark will be handled in `_tokenize_bytes` like any other character.
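The byte-level test in the `CHECK_FOR_BOM` macro above can be mirrored in Python. These are hypothetical helper names, sketching the same three-byte comparison; any quote character after the BOM is simply left in the buffer for the tokenizer, as the answer above explains.

```python
UTF8_BOM = b"\xef\xbb\xbf"

def has_utf8_bom(buf: bytes) -> bool:
    # Same check as the C macro: the UTF-8 encoding of U+FEFF
    # is the byte sequence EF BB BF.
    return buf[:3] == UTF8_BOM

def skip_utf8_bom(buf: bytes) -> bytes:
    # Advance past the BOM if present; everything after it,
    # including a quote character, is tokenized normally.
    return buf[3:] if has_utf8_bom(buf) else buf

print(skip_utf8_bom(b'\xef\xbb\xbf"a",b'))  # b'"a",b'
```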
Was just about to merge (testing on macosx / windows), and on mac got this warning (though no error)?
@jreback : It's a Python 2.x thing because you're comparing with a string that cannot be cast to
thanks!
closed by e5ee5d2
Title is self-explanatory. Closes #4793.