Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pl.read_excel should have a skip_rows argument for "openpyxl" engine #13879

Closed
ldacey opened this issue Jan 20, 2024 · 6 comments
Closed

pl.read_excel should have a skip_rows argument for "openpyxl" engine #13879

ldacey opened this issue Jan 20, 2024 · 6 comments
Assignees
Labels
A-io-excel Area: reading/writing Excel files accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@ldacey
Copy link

ldacey commented Jan 20, 2024

Description

example-skip-rows.xlsx

I attached an example file that I needed to use pandas instead of polars for.

If I use the default engine with read_csv_options to skip rows, the issue is that the HH:MM:SS columns are truncated somehow.

pl.read_excel(path, read_csv_options={"skip_rows": 9} (for some reason we only skip 9 rows with this engine instead of 10 rows?)

For example, 08:45:13 is only 08:45 when the data is read into memory. I need to convert these values to seconds so I can't lose informations. Also, the timestamp/date columns are read as text whereas "openpyxl" reads the dates natively - not a huge deal since I can parse these. Here is a screenshot of the data using "read_csv_options":

image

If I read the file with pl.read_excel(path, engine="openpyxl") the dataframe only includes the summary table up top. This is a system generated file that I need to ingest. Here is a screenshot:

image

Ultimately, the only way to get the data in the format I need was to use pd.read_excel(path, skip_rows=10, dtype=str) and convert that to polars. (I have to use dtype=str or else the columns are inferred to be pl.Time but they can exceed 24 hours, the actual data is more of a duration type in HH:MM:SS format). Ideally, I would like to use polars from end to end.

@ldacey ldacey added the enhancement New feature or an improvement of an existing feature label Jan 20, 2024
@stinodego stinodego added the A-io Area: reading and writing data label Jan 21, 2024
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 21, 2024

Yup, though rather than special-case this param/engine combination, we should look to better unify the common/generic options for all of the supported engines; I'll add it to the pile 😉

@alexander-beedie alexander-beedie self-assigned this Jan 21, 2024
@alexander-beedie alexander-beedie added the accepted Ready for implementation label Jan 21, 2024
@stinodego stinodego added A-io-excel Area: reading/writing Excel files and removed A-io Area: reading and writing data labels Jan 21, 2024
@ldacey
Copy link
Author

ldacey commented Jan 22, 2024

Is it an xlsx2csv bug that it is somehow truncating the seconds values from those strings and only return HH:MM? It seems like weird behavior.

@ldacey
Copy link
Author

ldacey commented Jan 27, 2024

@alexander-beedie

df = pl.read_excel(
    "example-skip-rows.xlsx",
    engine="calamine",
    engine_options={"skip_rows": 10},
)

TypeError: read_excel() got an unexpected keyword argument 'skip_rows'

Is that how we pass read options to fastexcel?

I am able to read the file with fastexcel directly like this:

excel_reader = fastexcel.read_excel(local_path)
pd_df = excel_reader.load_sheet(idx_or_name=0, header_row=9).to_pandas()
pl_df = excel_reader.load_sheet(idx_or_name=0, header_row=9).to_polars()

(Unfortunately it seems to be returning my duration columns as nulls for some reason, unlike pandas with openpyxl which returns the correct data)
image

I tried to use the "openpyxl" engine again to pass engine_options, but there does not appear to be a direct skiprows argument. It seems like I need to use pandas for this particular file for the time being.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 27, 2024

Is that how we pass read options to fastexcel?

@ldacey: I'll be exposing this feature properly in an upcoming PR (that's largely done; expecting to commit tomorrow). Also planning to unify/improve some of the most common options at the top level instead of having to pass down kwargs, which should help (but will come after tomorrow's PR).

Update: see #14039, which now exposes a read_options parameter - this is where you'd pass the specific sheet-reading options for a given engine (including calamine).

@PrettyWood
Copy link

PrettyWood commented Feb 14, 2024

@ldacey we released fastexcel 0.9.0 that uses a bigger sample to determine the schema of the columns. Does everything work as expected? If not do not hesitate to open a new issue on fastexcel side with an example file and I'll have a look

@ldacey
Copy link
Author

ldacey commented Feb 19, 2024

Nice - it works as expected for most files. I ran into an issue just now while testing though. I'll comment on the fastexcel original issue right now with a sample.

@ldacey ldacey closed this as completed Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-io-excel Area: reading/writing Excel files accepted Ready for implementation enhancement New feature or an improvement of an existing feature
Projects
Archived in project
Development

No branches or pull requests

4 participants