pl.read_excel should have a skip_rows argument for "openpyxl" engine #13879
Comments
Yup, though rather than special-case this param/engine combination, we should look to better unify the common/generic options for all of the supported engines; I'll add it to the pile 😉
Is it an xlsx2csv bug that it is somehow truncating the seconds values from those strings and only returning HH:MM? It seems like weird behavior.
@ldacey: I'll be exposing this feature properly in an upcoming PR (that's largely done; expecting to commit tomorrow). Also planning to unify/improve some of the most common options at the top level instead of having to pass down kwargs, which should help (but will come after tomorrow's PR). Update: see #14039, which now exposes a |
@ldacey we released fastexcel 0.9.0, which uses a bigger sample to determine the schema of the columns. Does everything work as expected? If not, do not hesitate to open a new issue on the fastexcel side with an example file and I'll have a look.
Nice - it works as expected for most files. I ran into an issue just now while testing though. I'll comment on the original fastexcel issue right now with a sample.
Description
example-skip-rows.xlsx
I attached an example file that I needed to use pandas instead of polars for.
If I use the default engine with read_csv_options to skip rows, the issue is that the HH:MM:SS columns are truncated somehow.
pl.read_excel(path, read_csv_options={"skip_rows": 9})
(For some reason we only skip 9 rows with this engine instead of 10 rows?)
For example, 08:45:13 is only 08:45 when the data is read into memory. I need to convert these values to seconds, so I can't lose information. Also, the timestamp/date columns are read as text, whereas "openpyxl" reads the dates natively - not a huge deal since I can parse these. Here is a screenshot of the data using "read_csv_options":
If I read the file with
pl.read_excel(path, engine="openpyxl")
the dataframe only includes the summary table up top. This is a system generated file that I need to ingest. Here is a screenshot:
Ultimately, the only way to get the data in the format I need was to use
pd.read_excel(path, skiprows=10, dtype=str)
and convert that to polars. (I have to use dtype=str or else the columns are inferred to be pl.Time
but they can exceed 24 hours; the actual data is more of a duration type in HH:MM:SS format.) Ideally, I would like to use polars from end to end.
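For reference, here is a minimal sketch of the seconds conversion I need once the values arrive as untruncated strings. The helper name hms_to_seconds is hypothetical (it is not part of polars or pandas), and it assumes every value is a plain HH:MM:SS string where the hours part may exceed 23:

```python
def hms_to_seconds(value: str) -> int:
    """Convert an 'HH:MM:SS' duration string to total seconds.

    Hypothetical helper for illustration. Hours may exceed 23, which is
    why a time-of-day type is a poor fit for this data.
    """
    # Unpacking raises ValueError if there are not exactly three parts,
    # e.g. for a truncated 'HH:MM' value.
    hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"not a valid HH:MM:SS duration: {value!r}")
    return hours * 3600 + minutes * 60 + seconds


print(hms_to_seconds("08:45:13"))  # 31513
print(hms_to_seconds("30:15:20"))  # 108920
```

On a polars string column this could be applied with something like pl.col("duration").map_elements(hms_to_seconds, return_dtype=pl.Int64), where "duration" is a placeholder column name. A truncated value such as 08:45 raises a ValueError here instead of silently producing a wrong number.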