Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(csv): multiple fixes #129

Merged
merged 7 commits into from
Apr 11, 2022
Merged

fix(csv): multiple fixes #129

merged 7 commits into from
Apr 11, 2022

Conversation

davinov
Copy link
Member

@davinov davinov commented Apr 11, 2022

Fix csv metadata:

  • trailing line should not alter the count
  • CSVs with no header column should have the correct number of lines
  • a custom encoding should not cause a failure in counting the lines

Done on the 0.7 branch, should be forward-ported afterwards

@davinov davinov added the bug Something isn't working label Apr 11, 2022
@davinov davinov self-assigned this Apr 11, 2022
@codecov
Copy link

codecov bot commented Apr 11, 2022

Codecov Report

Merging #129 (b9934b5) into version/0.7 (622524c) will not change coverage.
The diff coverage is 100.00%.

❗ Current head b9934b5 differs from pull request most recent head ad56924. Consider uploading reports for the commit ad56924 to get more accurate results

@@              Coverage Diff              @@
##           version/0.7      #129   +/-   ##
=============================================
  Coverage       100.00%   100.00%           
=============================================
  Files               18        18           
  Lines              770       777    +7     
=============================================
+ Hits               770       777    +7     
Impacted Files Coverage Δ
peakina/readers/csv.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 622524c...ad56924. Read the comment docs.

@davinov davinov changed the title fix(csv): fix(csv): multiple fixes Apr 11, 2022
@davinov davinov changed the base branch from main to version/0.7 April 11, 2022 14:44
@@ -80,7 +80,7 @@ def _line_count(filepath_or_buffer: "FilePathOrBuffer") -> int:
def csv_meta(
filepath_or_buffer: "FilePathOrBuffer", reader_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
total_rows = _line_count(filepath_or_buffer)
total_rows = _line_count(filepath_or_buffer, reader_kwargs.get("encoding"))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this will work only when encoding is explicitly defined, not going through any auto detection step ?
(guess it's fine but need to keep that in mind if any other issue arise)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the open docs: In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

So I guess we just let Python do the auto-detect for us ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so but it doesn't :/ It still tries to decode the file as utf-8 and miserably fails.
The only logic it seems to have is to use the sys.getdefaultencoding(), which is what pandas also does.

Copy link
Member

@PrettyWood PrettyWood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tests! LGTM except a missing init of a variable which may break if the file is empty I suppose

total_rows = _line_count(filepath_or_buffer)
total_rows = _line_count(filepath_or_buffer, reader_kwargs.get("encoding"))

if not reader_kwargs.get("names") and (total_rows > 0): # No header row
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not reader_kwargs.get("names") and (total_rows > 0): # No header row
if "names" not in reader_kwargs and total_rows > 0: # No header row

Copy link
Member Author

@davinov davinov Apr 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def _line_count(filepath_or_buffer: "FilePathOrBuffer") -> int:
with open(filepath_or_buffer) as f:
def _line_count(filepath_or_buffer: "FilePathOrBuffer", encoding: Optional[str]) -> int:
with open(filepath_or_buffer, encoding=encoding) as f:
lines = 0
buf_size = 1024 * 1024
read_f = f.read # loop optimization

buf = read_f(buf_size)
while buf:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
while buf:
finish_by_line_break = False
while buf:

Copy link
Member Author

@davinov davinov Apr 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests/fixtures/trailing_line_empty.csv Outdated Show resolved Hide resolved
@@ -80,7 +80,7 @@ def _line_count(filepath_or_buffer: "FilePathOrBuffer") -> int:
def csv_meta(
filepath_or_buffer: "FilePathOrBuffer", reader_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
total_rows = _line_count(filepath_or_buffer)
total_rows = _line_count(filepath_or_buffer, reader_kwargs.get("encoding"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the open docs: In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

So I guess we just let Python do the auto-detect for us ?

peakina/readers/csv.py Outdated Show resolved Hide resolved
Copy link
Contributor

@lukapeschke lukapeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@davinov davinov requested a review from PrettyWood April 11, 2022 15:32
davinov added 2 commits April 11, 2022 17:41
and deduplicate buffer reading
and rename to trailing newline
@davinov davinov merged commit ad2ee24 into version/0.7 Apr 11, 2022
@davinov davinov deleted the fix/meta-csv branch April 11, 2022 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants