fix(csv): multiple fixes #129

davinov · 2022-04-11T14:38:24Z

Fix CSV metadata:

- a trailing line should not alter the count
- CSVs with no header row should have the correct number of lines
- a custom encoding should not cause a failure when counting the lines
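
The three cases above can be illustrated with a minimal sketch. It is not peakina's implementation: `data_row_count` and `demo` are hypothetical names, the function iterates over lines instead of reading buffered chunks, and it assumes that the first line is a header whenever no `names` are supplied.

```python
import os
import tempfile
from typing import Dict, Optional, Sequence


def data_row_count(
    path: str,
    encoding: Optional[str] = None,
    names: Optional[Sequence[str]] = None,
) -> int:
    """Count data rows in a CSV file (hypothetical helper, not peakina's code)."""
    # Decode with the caller's encoding so non UTF-8 files do not fail.
    with open(path, encoding=encoding, newline="") as f:
        # Line iteration yields the same number of lines whether or not
        # the file ends with a newline character.
        total = sum(1 for _ in f)
    # Assumption: without explicit column names, the first line is a header.
    if names is None and total > 0:
        total -= 1
    return total


def demo() -> Dict[str, int]:
    """Write small fixtures to a temporary directory and count their rows."""
    fixtures = {
        "trailing_newline": (b"x,y\n1,2\n3,4\n", "utf-8", None),
        "no_trailing_newline": (b"x,y\n1,2\n3,4", "utf-8", None),
        "no_header": (b"1,2\n3,4\n", "utf-8", ["x", "y"]),
        "latin_1": ("nom\nJosé\nZoé\n".encode("latin-1"), "latin-1", None),
    }
    counts = {}
    with tempfile.TemporaryDirectory() as directory:
        for label, (payload, encoding, names) in fixtures.items():
            path = os.path.join(directory, label + ".csv")
            with open(path, "wb") as f:
                f.write(payload)
            counts[label] = data_row_count(path, encoding=encoding, names=names)
    return counts
```

Calling `demo()` gives a count of 2 for every fixture, since each one holds exactly two data rows. Note that this simple line-based count would be wrong for CSVs whose quoted fields contain newlines.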

Done on the 0.7 branch; it should be forward-ported afterwards.

codecov · 2022-04-11T14:39:43Z

Codecov Report

Merging #129 (b9934b5) into version/0.7 (622524c) will not change coverage.
The diff coverage is 100.00%.

❗ Current head b9934b5 differs from pull request most recent head ad56924. Consider uploading reports for the commit ad56924 to get more accurate results

@@              Coverage Diff              @@
##           version/0.7      #129   +/-   ##
=============================================
  Coverage       100.00%   100.00%           
=============================================
  Files               18        18           
  Lines              770       777    +7     
=============================================
+ Hits               770       777    +7

| Impacted Files | Coverage Δ |
|---|---|
| peakina/readers/csv.py | `100.00% <100.00%> (ø)` |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 622524c...ad56924.

austil · 2022-04-11T14:45:54Z

peakina/readers/csv.py

@@ -80,7 +80,7 @@ def _line_count(filepath_or_buffer: "FilePathOrBuffer") -> int:
 def csv_meta(
    filepath_or_buffer: "FilePathOrBuffer", reader_kwargs: Dict[str, Any]
 ) -> Dict[str, Any]:
-    total_rows = _line_count(filepath_or_buffer)
+    total_rows = _line_count(filepath_or_buffer, reader_kwargs.get("encoding"))


So this will only work when the encoding is explicitly defined, without going through any auto-detection step?
(I guess it's fine, but we need to keep that in mind if any other issue arises.)

According to the `open` docs: "In text mode, if encoding is not specified the encoding used is platform dependent: `locale.getpreferredencoding(False)` is called to get the current locale encoding."

So I guess we just let Python do the auto-detection for us?

I thought so, but it doesn't :/ It still tries to decode the file as UTF-8 and fails miserably.
The only logic it seems to have is to use `sys.getdefaultencoding()`, which is also what pandas does.
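
To make the point of this thread concrete, here is a small sketch of what happens when a Latin-1 file is decoded as UTF-8 versus with its real encoding. `try_read` and `demo` are hypothetical helpers written for this illustration. The sketch passes `"utf-8"` explicitly so that the result does not depend on the machine's locale; when `encoding` is left as `None`, `open` falls back to a locale-dependent default and never inspects the file's bytes.

```python
import locale
import os
import tempfile
from typing import Optional, Tuple


def try_read(path: str, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the bytes are invalid for `encoding`."""
    try:
        with open(path, encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError:
        return None


def demo() -> Tuple[Optional[str], Optional[str]]:
    # "é" is the single byte 0xE9 in Latin-1, which is not valid UTF-8 here.
    payload = "café\n".encode("latin-1")
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "latin1.csv")
        with open(path, "wb") as f:
            f.write(payload)
        return try_read(path, "utf-8"), try_read(path, "latin-1")


# The fallback used by open() when no encoding is given; its value
# depends on the platform and locale settings.
default_text_encoding = locale.getpreferredencoding(False)
```

In this sketch the first element returned by `demo()` is `None` (decoding failed) and the second is the original text, which is why the caller's `encoding` has to be forwarded to `open`.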

PrettyWood

Thanks for the tests! LGTM except for a missing initialization of a variable, which may break if the file is empty, I suppose.

PrettyWood · 2022-04-11T14:48:50Z

peakina/readers/csv.py

-    total_rows = _line_count(filepath_or_buffer)
+    total_rows = _line_count(filepath_or_buffer, reader_kwargs.get("encoding"))
+
+    if not reader_kwargs.get("names") and (total_rows > 0):  # No header row


Suggested change

-    if not reader_kwargs.get("names") and (total_rows > 0):  # No header row
+    if "names" not in reader_kwargs and total_rows > 0:  # No header row
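
The two conditions in this suggestion are close but not strictly equivalent: they differ when `names` is present with a falsy value. The sketch below compares them on a few sample dictionaries; the function names are hypothetical and only wrap the two expressions.

```python
from typing import Any, Dict


def header_expected_get(reader_kwargs: Dict[str, Any]) -> bool:
    # Truthiness check: a missing, None, or empty `names` all count as "not given".
    return not reader_kwargs.get("names")


def header_expected_in(reader_kwargs: Dict[str, Any]) -> bool:
    # Membership check: only a missing key counts as "not given".
    return "names" not in reader_kwargs


for kwargs in ({}, {"names": ["a", "b"]}, {"names": None}, {"names": []}):
    print(kwargs, header_expected_get(kwargs), header_expected_in(kwargs))
```

Both forms agree for `{}` and `{"names": ["a", "b"]}`. They disagree for `{"names": None}` and `{"names": []}`, so which one is right depends on whether callers ever pass `names` explicitly as `None`.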

PrettyWood · 2022-04-11T14:54:46Z

peakina/readers/csv.py

-def _line_count(filepath_or_buffer: "FilePathOrBuffer") -> int:
-    with open(filepath_or_buffer) as f:
+def _line_count(filepath_or_buffer: "FilePathOrBuffer", encoding: Optional[str]) -> int:
+    with open(filepath_or_buffer, encoding=encoding) as f:
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.read  # loop optimization

        buf = read_f(buf_size)
        while buf:


Suggested change

-        while buf:
+        finish_by_line_break = False
+
+        while buf:
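
The reason for this suggestion is that a variable assigned only inside the `while` body is never bound when the file is empty, so reading it after the loop raises `UnboundLocalError`. Below is a minimal sketch of a buffered line counter with the flag initialised up front. `count_lines` and `last_chunk_ends_with_newline` are hypothetical names, and the initial value `True` is this sketch's own choice (so that an empty stream counts as zero lines), not necessarily what the PR does.

```python
import io
from typing import TextIO


def count_lines(stream: TextIO, buf_size: int = 1024 * 1024) -> int:
    """Count lines by reading fixed-size chunks (illustrative sketch)."""
    lines = 0
    # Bound before the loop so it exists even if the loop body never runs.
    last_chunk_ends_with_newline = True
    buf = stream.read(buf_size)
    while buf:
        lines += buf.count("\n")
        last_chunk_ends_with_newline = buf.endswith("\n")
        buf = stream.read(buf_size)
    # A final line without a terminating newline still counts as a line.
    if not last_chunk_ends_with_newline:
        lines += 1
    return lines


empty = count_lines(io.StringIO(""))
with_newline = count_lines(io.StringIO("a\nb\n"))
without_newline = count_lines(io.StringIO("a\nb"))
tiny_buffer = count_lines(io.StringIO("a\nb"), buf_size=1)
```

Here the empty stream yields 0, and the other three calls yield 2, regardless of the trailing newline or the chunk size.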

tests/fixtures/trailing_line_empty.csv

lukapeschke · 2022-04-11T14:53:31Z

peakina/readers/csv.py

According to the `open` docs: "In text mode, if encoding is not specified the encoding used is platform dependent: `locale.getpreferredencoding(False)` is called to get the current locale encoding."

So I guess we just let Python do the auto-detection for us?


lukapeschke

LGTM 👍


davinov added 3 commits April 11, 2022 16:36

fix(csv): total rows with non utf-8 encoding

00d5a19

fix(csv): don't count header row in total number of rows

2c41064

fix(csv): don't count last empty line

9360ed9

davinov added the bug Something isn't working label Apr 11, 2022

davinov self-assigned this Apr 11, 2022

davinov changed the title ~~fix(csv):~~ fix(csv): multiple fixes Apr 11, 2022

davinov changed the base branch from main to version/0.7 April 11, 2022 14:44

austil reviewed Apr 11, 2022


PrettyWood requested changes Apr 11, 2022


lukapeschke reviewed Apr 11, 2022


davinov force-pushed the fix/meta-csv branch from 04006e8 to 8b49029 Compare April 11, 2022 15:17

lukapeschke approved these changes Apr 11, 2022


davinov requested a review from PrettyWood April 11, 2022 15:32

davinov added 2 commits April 11, 2022 17:41

style(csv): style of reader_kwargs access

561f8d6

fix(csv): prevent undef var if file is empty

9846dc5

and deduplicate buffer reading and rename to trailing newline

davinov force-pushed the fix/meta-csv branch from 8b49029 to b9934b5 Compare April 11, 2022 15:41

davinov added 2 commits April 11, 2022 18:17

fix: auto-detect encoding for metadata

b1287a4

chore: v0.7.14

ad56924

davinov force-pushed the fix/meta-csv branch from b9934b5 to ad56924 Compare April 11, 2022 16:17

PrettyWood approved these changes Apr 11, 2022


davinov merged commit ad2ee24 into version/0.7 Apr 11, 2022

davinov deleted the fix/meta-csv branch April 11, 2022 16:33

This was referenced Apr 11, 2022

fix(csv): multiple fixes #131

Closed

fix(csv): use proper encoding and fix lines number in get_metadata #132

Merged
