fix(csv): multiple fixes #129

Merged: 7 commits, merged on Apr 11, 2022
Changes from 1 commit
peakina/readers/csv.py (3 additions, 3 deletions)

@@ -63,8 +63,8 @@ def read_csv(
     )


-def _line_count(filepath_or_buffer: "FilePathOrBuffer") -> int:
-    with open(filepath_or_buffer) as f:
+def _line_count(filepath_or_buffer: "FilePathOrBuffer", encoding: Optional[str]) -> int:
+    with open(filepath_or_buffer, encoding=encoding) as f:
         lines = 0
         buf_size = 1024 * 1024
         read_f = f.read  # loop optimization
@@ -80,7 +80,7 @@ def _line_count(filepath_or_buffer: "FilePathOrBuffer") -> int:
 def csv_meta(
     filepath_or_buffer: "FilePathOrBuffer", reader_kwargs: Dict[str, Any]
 ) -> Dict[str, Any]:
-    total_rows = _line_count(filepath_or_buffer)
+    total_rows = _line_count(filepath_or_buffer, reader_kwargs.get("encoding"))
Review comment: So this will work only when encoding is explicitly defined, not going through any auto-detection step?
(Guess it's fine, but we need to keep that in mind if any other issues arise.)

Contributor: According to the open() docs: "In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding."

So I guess we just let Python do the auto-detection for us?

Member Author: I thought so, but it doesn't :/ It still tries to decode the file as UTF-8 and miserably fails.
The only logic it seems to have is to use sys.getdefaultencoding(), which is what pandas also does.


if "nrows" in reader_kwargs:
return {
Expand Down
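For reference, the diff above collapses most of `_line_count`'s body. Below is a minimal sketch of what the full helper plausibly looks like after this change, assuming the hidden part simply counts newlines chunk by chunk; the loop body, the `str` annotation, and the `None` default are assumptions, only the signature change and the first three visible lines come from the diff.

```python
from typing import Optional


def _line_count(filepath_or_buffer: str, encoding: Optional[str] = None) -> int:
    """Count the lines of a file by reading it in 1 MiB chunks."""
    # When encoding is None, open() falls back to the locale/platform default;
    # forwarding reader_kwargs["encoding"] avoids decode errors on
    # non-UTF-8 files such as the windows-1252 fixtures below.
    with open(filepath_or_buffer, encoding=encoding) as f:
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.read  # loop optimization
        buf = read_f(buf_size)  # assumed reconstruction of the collapsed loop
        while buf:
            lines += buf.count("\n")
            buf = read_f(buf_size)
    return lines
```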
tests/fixtures/encoded_western_clrf_short.csv (3 additions, 0 deletions)

@@ -0,0 +1,3 @@
"aaaa";"aaaaa";"aaaaaaa";"aa";"aaaaaaa_aa";"aaaa_aa";"aaaa-aaaa_aa";"aaa_aaaaaaaaaa";"aaaaaaa_aaaaaaaaaa";"aaa_aaaaa";"aaaaaaa_aaaaa";"aaaa";"aaaaaaaaa";"aaaaaaa";"aaa-aaa";"aaa_aaaaaaaaa";"aaaaaaa_aaaaaaaaa";"aaaaaa";"aaaaaaaaaaa"
"aaaa-aa-aa aa:aa:aa";"aaaa";"aaaa-aaa";"aaaa_a";"aaaaaaaaaaa aaaaaaaaaaa";"aaa";"aaa aaaaaa";"aaaaaaaa aaaa";"aaaaaaaa aaaaaaa";"a_aaaaa_aa";"aaa a aaa";"aaaaa";"aaaaaaaaaa";"aaaaaaaaaa";"aaa";"aaaaaa aa";"aaaaaa aa";"aaaa";" "
"aaaa-aa-aa aa:aa:aa";"aaaa";"aaaa-aaa";"aaaa_a";"aaaaaaaaaa a�aaaaa�a aa";"aaa";"";"aaaaaaaa aaaa";"aaaaaaaa aaaaaaa";"a_aaaaa_aa";"aaa a aaa";"aaaaaaaaaaaaa";"aaaaaa";"aaaaaaaaaa";"";"aaaaaa aa";"aaaaaa aa";"aa.aa";""
tests/fixtures/encoded_western_short.csv (3 additions, 0 deletions)

@@ -0,0 +1,3 @@
"aaaa";"aaaaa";"aaaaaaa";"aa";"aaaaaaa_aa";"aaaa_aa";"aaaa-aaaa_aa";"aaa_aaaaaaaaaa";"aaaaaaa_aaaaaaaaaa";"aaa_aaaaa";"aaaaaaa_aaaaa";"aaaa";"aaaaaaaaa";"aaaaaaa";"aaa-aaa";"aaa_aaaaaaaaa";"aaaaaaa_aaaaaaaaa";"aaaaaa";"aaaaaaaaaaa"
"aaaa-aa-aa aa:aa:aa";"aaaa";"aaaa-aaa";"aaaa_a";"aaaaaaaaaaa aaaaaaaaaaa";"aaa";"aaa aaaaaa";"aaaaaaaa aaaa";"aaaaaaaa aaaaaaa";"a_aaaaa_aa";"aaa a aaa";"aaaaa";"aaaaaaaaaa";"aaaaaaaaaa";"aaa";"aaaaaa aa";"aaaaaa aa";"aaaa";" "
"aaaa-aa-aa aa:aa:aa";"aaaa";"aaaa-aaa";"aaaa_a";"aaaaaaaaaa a�aaaaa�a aa";"aaa";"";"aaaaaaaa aaaa";"aaaaaaaa aaaaaaa";"a_aaaaa_aa";"aaa a aaa";"aaaaaaaaaaaaa";"aaaaaa";"aaaaaaaaaa";"";"aaaaaa aa";"aaaaaa aa";"aa.aa";""
tests/test_datasource.py (20 additions, 0 deletions)

@@ -77,6 +77,26 @@ def test_csv_default_encoding(path):
     assert df.shape == (486, 19)


+def test_csv_western_encoding(path):
+    """
+    It should be able to use a specific encoding
+    """
+    ds = DataSource(path("encoded_western_short.csv"), reader_kwargs={"encoding": "windows-1252"})
+    df = ds.get_df()
+    assert df.shape == (2, 19)
+    df_meta = ds.get_metadata()
+    assert df_meta == {"df_rows": 2, "total_rows": 2}
+
+    # with CRLF line-endings
+    ds = DataSource(
+        path("encoded_western_clrf_short.csv"), reader_kwargs={"encoding": "windows-1252"}
+    )
+    df = ds.get_df()
+    assert df.shape == (2, 19)
+    df_meta = ds.get_metadata()
+    assert df_meta == {"df_rows": 2, "total_rows": 2}
+
+
 def test_csv_with_sep_and_encoding(path):
     """It should be able to detect everything"""
     ds = DataSource(path("latin_1_sep.csv"))
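As a quick follow-up to the encoding discussion above, this is the behaviour the fix addresses: open() does no content-based auto-detection, it only falls back to the locale default, so the windows-1252 fixture cannot be decoded as UTF-8. The relative path and the presence of non-ASCII bytes in the fixture are assumptions in this sketch.

```python
import locale

# What open() uses when no encoding is passed: the locale default,
# not any content-based detection.
print(locale.getpreferredencoding(False))

path = "tests/fixtures/encoded_western_short.csv"  # assumed path, relative to the repo root

try:
    with open(path, encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError as exc:
    # The kind of failure _line_count hit before the encoding was forwarded.
    print(f"utf-8 decode failed: {exc}")

# With the encoding forwarded from reader_kwargs, the same file reads fine.
with open(path, encoding="windows-1252") as f:
    print(sum(1 for _ in f))  # 3 physical lines, matching the fixture
```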