FIX-#3884: Fix read_excel() dropping empty rows #4161

Merged
merged 16 commits into from
Feb 23, 2022
2 changes: 2 additions & 0 deletions docs/release_notes/release_notes-0.14.0.rst
@@ -14,6 +14,7 @@ Key Features and Updates
* FIX-#4177: Support read_feather from pathlike objects (#4177)
* FIX-#4234: Upgrade pandas to 1.4.1 (#4235)
* FIX-#4057: Allow reading an empty parquet file (#4075)
+* FIX-#3884: Fix read_excel() dropping empty rows (#4161)
* Performance enhancements
* FIX-#4138, FIX-#4009: remove redundant sorting in the internal '.mask()' flow (#4140)
* Benchmarking enhancements
@@ -63,3 +64,4 @@ Contributors
@dchigarev
@Garra1980
@mvashishtha
+@naren-ponder
10 changes: 6 additions & 4 deletions modin/core/storage_formats/pandas/parsers.py
@@ -564,12 +564,14 @@ def update_row_nums(match):
has_index_names=is_list_like(header) and len(header) > 1,
skiprows=skiprows,
usecols=usecols,
+skip_blank_lines=False,
**kwargs,
)
-# In excel if you create a row with only a border (no values), this parser will
-# interpret that as a row of NaN values. pandas discards these values, so we
-# also must discard these values.
-pandas_df = parser.read().dropna(how="all")
+pandas_df = parser.read()
+if len(pandas_df) > 1 and pandas_df.isnull().all().all():
+# Drop NaN rows at the end of the DataFrame
+pandas_df = pandas.DataFrame(columns=pandas_df.columns)
+
# Since we know the number of rows that occur before this partition, we can
# correctly assign the index in cases of RangeIndex. If it is not a RangeIndex,
# the index is already correct because it came from the data.
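For readers following the change to parsers.py, the difference between the removed and the added post-processing can be sketched in plain pandas. This is an illustration only, not the Modin implementation: the helper names `old_postprocess` and `new_postprocess` and the sample frames are hypothetical, and the functions merely mirror the removed `dropna(how="all")` line and the added all-NaN check from the hunk above.

```python
import numpy as np
import pandas


def old_postprocess(df):
    # Mirrors the removed line: every all-NaN row is dropped,
    # including empty rows in the middle of the data.
    return df.dropna(how="all")


def new_postprocess(df):
    # Mirrors the added lines: the frame is replaced by an empty one
    # (same columns) only when it has more than one row and every
    # value in it is NaN; otherwise it is returned unchanged.
    if len(df) > 1 and df.isnull().all().all():
        return pandas.DataFrame(columns=df.columns)
    return df


# Hypothetical sample data: one empty row between two populated rows.
mixed = pandas.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, np.nan, 6.0]})
# Hypothetical sample data: a chunk that contains nothing but NaN rows.
all_nan = pandas.DataFrame({"a": [np.nan, np.nan], "b": [np.nan, np.nan]})

print(len(old_postprocess(mixed)))    # empty middle row is lost
print(len(new_postprocess(mixed)))    # empty middle row is kept
print(len(new_postprocess(all_nan)))  # all-NaN chunk collapses to no rows
```

With the old behavior the empty middle row disappears, which is the symptom named in the PR title; with the new check it survives, and only a chunk made up entirely of NaN rows is reduced to an empty frame.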
Binary file added modin/pandas/test/data/every_other_row_nan.xlsx
Binary file not shown.
Binary file added modin/pandas/test/data/test_border_rows.xlsx
Binary file not shown.
Binary file added modin/pandas/test/data/test_empty_rows.xlsx
Binary file not shown.
24 changes: 24 additions & 0 deletions modin/pandas/test/test_io.py
@@ -1600,6 +1600,30 @@ def test_excel_empty_line(self):
modin_df = pd.read_excel(path)
assert str(modin_df)

+@check_file_leaks
+def test_read_excel_empty_rows(self):
+# Test parsing empty rows in middle of excel dataframe as NaN values
+eval_io(
+fn_name="read_excel",
+io="modin/pandas/test/data/test_empty_rows.xlsx",
+)
+
+@check_file_leaks
+def test_read_excel_border_rows(self):
+# Test parsing border rows as NaN values in excel dataframe
+eval_io(
+fn_name="read_excel",
+io="modin/pandas/test/data/test_border_rows.xlsx",
+)
+
+@check_file_leaks
+def test_read_excel_every_other_nan(self):
+# Test for reading excel dataframe with every other row as a NaN value
+eval_io(
+fn_name="read_excel",
+io="modin/pandas/test/data/every_other_row_nan.xlsx",
+)
+
@pytest.mark.parametrize(
"sheet_name",
[
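The three new tests all follow the same pattern: read the same file through two code paths and require identical results. A minimal sketch of that pattern, assuming this is roughly what `eval_io` does, is given below. The helper `compare_readers`, its parameters, and the CSV sample are hypothetical stand-ins; the sketch uses `pandas.read_csv` on both sides so it runs without Modin or an Excel engine.

```python
import io

import pandas
from pandas.testing import assert_frame_equal


def compare_readers(reference_reader, candidate_reader, make_source, **kwargs):
    # Hypothetical helper: call both readers with the same arguments
    # and fail if the resulting frames differ.
    expected = reference_reader(make_source(), **kwargs)
    actual = candidate_reader(make_source(), **kwargs)
    assert_frame_equal(actual, expected)
    return actual


# Hypothetical sample: the second data row has only empty fields,
# so it is parsed as a row of NaN values.
CSV_TEXT = "a,b\n1,2\n,\n3,4\n"

result = compare_readers(
    pandas.read_csv,   # stands in for the reference (pandas) reader
    pandas.read_csv,   # stands in for the reader under test
    lambda: io.StringIO(CSV_TEXT),
    skip_blank_lines=False,
)
print(result)
```

In a real test the second reader would be the implementation under test, so a reader that silently dropped the NaN row would fail the comparison.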