fix(csv): multiple fixes #129
Codecov Report
@@            Coverage Diff            @@
##        version/0.7     #129   +/-   ##
=============================================
  Coverage    100.00%  100.00%
=============================================
  Files            18       18
  Lines           770      777    +7
=============================================
+ Hits            770      777    +7
@@ -80,7 +80,7 @@ def _line_count(filepath_or_buffer: "FilePathOrBuffer") -> int:
 def csv_meta(
     filepath_or_buffer: "FilePathOrBuffer", reader_kwargs: Dict[str, Any]
 ) -> Dict[str, Any]:
-    total_rows = _line_count(filepath_or_buffer)
+    total_rows = _line_count(filepath_or_buffer, reader_kwargs.get("encoding"))
So this will only work when the encoding is explicitly defined, without going through any auto-detection step?
(I guess it's fine, but we need to keep that in mind if any other issue arises.)
According to the open() docs: "In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding."
So I guess we just let Python do the auto-detect for us?
I thought so, but it doesn't :/ It still tries to decode the file as UTF-8 and fails miserably.
The only logic it seems to have is to use sys.getdefaultencoding(), which is what pandas also does.
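To make the point of this exchange concrete, here is a minimal sketch (not peakina code; the helper names `default_text_encodings` and `read_with` are made up for illustration). It shows that the two defaults mentioned above are fixed settings of the interpreter and platform, that neither looks at the file's bytes, and that a Latin-1 file only decodes when the encoding is passed explicitly.

```python
import locale
import sys
import tempfile
from pathlib import Path
from typing import Dict, Optional


def default_text_encodings() -> Dict[str, str]:
    """Return the two defaults discussed in the thread (hypothetical helper)."""
    return {
        # What the open() docs quote for text mode without an encoding.
        "locale": locale.getpreferredencoding(False),
        # Always "utf-8" on Python 3.
        "sys_default": sys.getdefaultencoding(),
    }


def read_with(path: Path, encoding: Optional[str] = None) -> str:
    """Read a text file, forwarding the encoding to open() (hypothetical helper)."""
    with open(path, encoding=encoding) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "latin1.csv"
    # 0xE9 ("é" in Latin-1) followed by a newline is not valid UTF-8.
    path.write_bytes("name\ncafé\n".encode("latin-1"))

    # Explicit encoding: decodes correctly.
    text = read_with(path, encoding="latin-1")

    # Decoding as UTF-8 fails; nothing is inferred from the file contents.
    try:
        read_with(path, encoding="utf-8")
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True
```

So forwarding `reader_kwargs.get("encoding")` fixes the case where the caller knows the encoding; a file in an unexpected encoding with no `encoding` given would still fail, as noted in the first comment.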
Thanks for the tests! LGTM, except for a missing variable initialization, which I suppose may break if the file is empty.
peakina/readers/csv.py
Outdated
-    total_rows = _line_count(filepath_or_buffer)
+    total_rows = _line_count(filepath_or_buffer, reader_kwargs.get("encoding"))

     if not reader_kwargs.get("names") and (total_rows > 0):  # No header row
-    if not reader_kwargs.get("names") and (total_rows > 0):  # No header row
+    if "names" not in reader_kwargs and total_rows > 0:  # No header row
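Note that the two conditions in this suggestion are not strictly equivalent. The sketch below (hypothetical helper names, plain dictionaries standing in for `reader_kwargs`) compares them on four inputs: they agree when the key is absent or holds a non-empty list, and differ when `names` is present but `None` or empty.

```python
from typing import Any, Dict, Tuple


def no_names_by_truthiness(reader_kwargs: Dict[str, Any]) -> bool:
    """Original condition: a missing key, None and [] all count as 'no names'."""
    return not reader_kwargs.get("names")


def no_names_by_key(reader_kwargs: Dict[str, Any]) -> bool:
    """Suggested condition: only a missing key counts as 'no names'."""
    return "names" not in reader_kwargs


cases: Dict[str, Dict[str, Any]] = {
    "absent": {},
    "none": {"names": None},
    "empty": {"names": []},
    "given": {"names": ["a", "b"]},
}

# Map each case to (truthiness check, key check).
results: Dict[str, Tuple[bool, bool]] = {
    label: (no_names_by_truthiness(kwargs), no_names_by_key(kwargs))
    for label, kwargs in cases.items()
}
```

Which behaviour is wanted depends on whether callers ever pass `names=None` explicitly; the thread does not say, so this is only something to keep in mind.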
peakina/readers/csv.py
Outdated
-def _line_count(filepath_or_buffer: "FilePathOrBuffer") -> int:
-    with open(filepath_or_buffer) as f:
+def _line_count(filepath_or_buffer: "FilePathOrBuffer", encoding: Optional[str]) -> int:
+    with open(filepath_or_buffer, encoding=encoding) as f:
         lines = 0
         buf_size = 1024 * 1024
         read_f = f.read  # loop optimization

         buf = read_f(buf_size)
         while buf:
-        while buf:
+        finish_by_line_break = False
+        while buf:
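The reason for initializing the flag before the loop is that, with an empty file, the loop body never runs and any variable assigned only inside it would be undefined afterwards. A minimal sketch of a buffered line counter that handles this, assuming the goal is to count a final line even when it lacks a trailing newline (this is not the actual peakina implementation; `count_lines`, `saw_data` and `trailing_newline` are illustrative names):

```python
import io
from typing import TextIO


def count_lines(f: TextIO, buf_size: int = 1024 * 1024) -> int:
    """Count lines in a text stream by reading fixed-size chunks."""
    lines = 0
    # Both flags are set before the loop, so an empty stream is handled
    # without touching an unassigned variable.
    saw_data = False
    trailing_newline = False

    buf = f.read(buf_size)
    while buf:
        saw_data = True
        lines += buf.count("\n")
        # Remember whether the most recent chunk ended with a line break.
        trailing_newline = buf.endswith("\n")
        buf = f.read(buf_size)

    # A last line without a final "\n" still counts as a line.
    if saw_data and not trailing_newline:
        lines += 1
    return lines


empty_count = count_lines(io.StringIO(""))
with_newline = count_lines(io.StringIO("a\nb\n"))
without_newline = count_lines(io.StringIO("a\nb"))
small_buffer = count_lines(io.StringIO("a\nb\nc"), buf_size=2)
```

The `buf_size=2` call exercises the case where the data spans several chunks, so the flag has to reflect the last chunk only.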
LGTM 👍
and deduplicate buffer reading and rename to trailing newline
Fix csv metadata:
Done on the 0.7 branch, should be forward-ported afterwards