pl.read_excel should have a skip_rows argument for "openpyxl" engine #13879

ldacey · 2024-01-20T21:48:33Z

Description

I attached an example file that I needed to use pandas instead of polars for.

If I use the default engine with read_csv_options to skip rows, the issue is that the HH:MM:SS columns are truncated somehow.

pl.read_excel(path, read_csv_options={"skip_rows": 9} (for some reason we only skip 9 rows with this engine instead of 10 rows?)

For example, 08:45:13 is only 08:45 when the data is read into memory. I need to convert these values to seconds so I can't lose informations. Also, the timestamp/date columns are read as text whereas "openpyxl" reads the dates natively - not a huge deal since I can parse these. Here is a screenshot of the data using "read_csv_options":

If I read the file with pl.read_excel(path, engine="openpyxl") the dataframe only includes the summary table up top. This is a system generated file that I need to ingest. Here is a screenshot:

Ultimately, the only way to get the data in the format I need was to use pd.read_excel(path, skip_rows=10, dtype=str) and convert that to polars. (I have to use dtype=str or else the columns are inferred to be pl.Time but they can exceed 24 hours, the actual data is more of a duration type in HH:MM:SS format). Ideally, I would like to use polars from end to end.

The text was updated successfully, but these errors were encountered:

alexander-beedie · 2024-01-21T18:03:23Z

Yup, though rather than special-case this param/engine combination, we should look to better unify the common/generic options for all of the supported engines; I'll add it to the pile 😉

ldacey · 2024-01-22T15:15:06Z

Is it an xlsx2csv bug that it is somehow truncating the seconds values from those strings and only return HH:MM? It seems like weird behavior.

ldacey · 2024-01-27T16:22:12Z

@alexander-beedie

df = pl.read_excel(
    "example-skip-rows.xlsx",
    engine="calamine",
    engine_options={"skip_rows": 10},
)

TypeError: read_excel() got an unexpected keyword argument 'skip_rows'

Is that how we pass read options to fastexcel?

I am able to read the file with fastexcel directly like this:

excel_reader = fastexcel.read_excel(local_path)
pd_df = excel_reader.load_sheet(idx_or_name=0, header_row=9).to_pandas()
pl_df = excel_reader.load_sheet(idx_or_name=0, header_row=9).to_polars()

(Unfortunately it seems to be returning my duration columns as nulls for some reason, unlike pandas with openpyxl which returns the correct data)

I tried to use the "openpyxl" engine again to pass engine_options, but there does not appear to be a direct skiprows argument. It seems like I need to use pandas for this particular file for the time being.

alexander-beedie · 2024-01-27T20:15:32Z

Is that how we pass read options to fastexcel?

@ldacey: I'll be exposing this feature properly in an upcoming PR (that's largely done; expecting to commit tomorrow). Also planning to unify/improve some of the most common options at the top level instead of having to pass down kwargs, which should help (but will come after tomorrow's PR).

Update: see #14039, which now exposes a read_options parameter - this is where you'd pass the specific sheet-reading options for a given engine (including calamine).

PrettyWood · 2024-02-14T17:37:44Z

@ldacey we released fastexcel 0.9.0 that uses a bigger sample to determine the schema of the columns. Does everything work as expected? If not do not hesitate to open a new issue on fastexcel side with an example file and I'll have a look

ldacey · 2024-02-19T15:51:53Z

Nice - it works as expected for most files. I ran into an issue just now while testing though. I'll comment on the fastexcel original issue right now with a sample.

ldacey added the enhancement New feature or an improvement of an existing feature label Jan 20, 2024

ldacey mentioned this issue Jan 20, 2024

pl.read_excel with openpyxl engine removes integer values in mixed columns #13462

Closed

2 tasks

stinodego added the A-io Area: reading and writing data label Jan 21, 2024

alexander-beedie self-assigned this Jan 21, 2024

alexander-beedie added the accepted Ready for implementation label Jan 21, 2024

stinodego added A-io-excel Area: reading/writing Excel files and removed A-io Area: reading and writing data labels Jan 21, 2024

ldacey closed this as completed Feb 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pl.read_excel should have a skip_rows argument for "openpyxl" engine #13879

pl.read_excel should have a skip_rows argument for "openpyxl" engine #13879

ldacey commented Jan 20, 2024

alexander-beedie commented Jan 21, 2024 •

edited

Loading

ldacey commented Jan 22, 2024

ldacey commented Jan 27, 2024 •

edited

Loading

alexander-beedie commented Jan 27, 2024 •

edited

Loading

PrettyWood commented Feb 14, 2024 •

edited

Loading

ldacey commented Feb 19, 2024

pl.read_excel should have a skip_rows argument for "openpyxl" engine #13879

pl.read_excel should have a skip_rows argument for "openpyxl" engine #13879

Comments

ldacey commented Jan 20, 2024

Description

alexander-beedie commented Jan 21, 2024 • edited Loading

ldacey commented Jan 22, 2024

ldacey commented Jan 27, 2024 • edited Loading

alexander-beedie commented Jan 27, 2024 • edited Loading

PrettyWood commented Feb 14, 2024 • edited Loading

ldacey commented Feb 19, 2024

alexander-beedie commented Jan 21, 2024 •

edited

Loading

ldacey commented Jan 27, 2024 •

edited

Loading

alexander-beedie commented Jan 27, 2024 •

edited

Loading

PrettyWood commented Feb 14, 2024 •

edited

Loading