Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DataFrame.write_excel #5568

Closed
rth opened this issue Nov 21, 2022 · 23 comments · Fixed by #6183 or #7251
Closed

Add DataFrame.write_excel #5568

rth opened this issue Nov 21, 2022 · 23 comments · Fixed by #6183 or #7251
Assignees
Labels
enhancement New feature or an improvement of an existing feature

Comments

@rth
Copy link

rth commented Nov 21, 2022

Problem description

The read_excel file was added in #3567 and it would be nice if there was also a DataFrame.write_excel function, maybe using xlsxwriter-rs.

@rth rth added the enhancement New feature or an improvement of an existing feature label Nov 21, 2022
@rth rth changed the title Add write_excel Add DataFrame.write_excel Nov 21, 2022
@dominikpeter
Copy link

We have this requirement as well. Currently, we convert the Polars dataframe to Pandas and export it from there. Would be great, of course, if we didn't have to take this detour.

@jmcnamara
Copy link

jmcnamara commented Jan 11, 2023

Another option would be to use the pure rust rust_xlsxwriter library. The rust_xlsxwriter roadmap explains the rationale and the current features.

I wrote the initial xlsxwriter integration to Pandas so I could try a PR with some input from the core devs.

@bvanelli
Copy link
Contributor

bvanelli commented Jan 11, 2023

Hello,

I'm currently working on this feature using a wrapper around opepyxl. MR should come still this week, as all is working and I'm fixing tests.

I'm also trying to leave the API exatcly the same as write_csv with the same transformations.

Currently the blocking point is that the read_excel previously implemented cannot handle the inputs well.

@bvanelli
Copy link
Contributor

Made a draft merge request, if any of you can take a look and suggest changes/improvements.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jan 12, 2023

Another option would be to use the pure rust rust_xlsxwriter library. The rust_xlsxwriter roadmap explains the rationale and the current features.

I wrote the initial xlsxwriter integration to Pandas so I could try a PR with some input from the core devs.

@jmcnamara / @ritchie46: I believe I have a one line PR in the python XlsxWriter repository, heh ;) Kudos on making a Rust port - the python version is excellent! I used it to add a surprisingly comprehensive Excel export option to one of our major internal data APIs when I worked back at JPMorgan, and it was very well thought of.

I'd certainly be interested in helping shape our usage, having done it once before - could start with the python API, and once that looks good we could think about how best to adapt it on the Rust side. Having the same core library features available in both languages seems like a win, and I can speak to the quality/utility of XlsxWriter.

@jmcnamara
Copy link

I believe I have a one line PR in the python XlsxWriter repository, heh ;)

Cool. :-)

I'd certainly be interested in helping shape our usage, having done it once before - could start prototyping something with the python API, and once that looks good we can see how best to adapt it on the Rust side.

That sounds like a good approach.

@alexander-beedie alexander-beedie self-assigned this Feb 24, 2023
@alexander-beedie
Copy link
Collaborator

Finally started on this (in Python) ...

@jmcnamara
Copy link

@alexander-beedie good news. If there are any enhancements to XlsxWriter (within reason) that would make integrations with Polars easier/better let me know.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Feb 26, 2023

@jmcnamara: so far it's a breeze, much as I remember ;)

Have integrated dtype and/or per-column formatting, float precision, conditional formatting (all flavours), table styling, total row, autoformat/autofit, and so on... Full xlsxwriter API can be employed if setting-up the Workbook object outside of the polars write_excel call, for more advanced use-cases.

Almost ready for a first cut; need to polish-up what's there and then take care of docs and do some more validation/testing. Sample output from a single call to df.write_excel(...)

polars_excel_integration

(Sparklines can probably wait for a second iteration, though I definitely want to integrate those too).

@jmcnamara
Copy link

Wow. Looks great. I'm looking forward to it. :-)

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Feb 28, 2023

Wow. Looks great. I'm looking forward to it. :-)

Added a few last features today, and polished it all up along with reasonably detailed docstrings and some tests... First iteration is ready to ship: #7251 :)

@jmcnamara
Copy link

@alexander-beedie That is great work. Really strong option support from the start.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Mar 1, 2023

@jmcnamara: I am a conduit for your remarkable library, orchestrating access across the breadth of the xlsxwriter API... ;)

Once it has settled in a version or two, what would you think about adding some Polars-specific help pages to the xlsxwriter site, equivalent to the existing Pandas ones? (I'd be more than happy to write/commit them, assuming that the docs are part of the repository).

@jmcnamara
Copy link

what would you think about adding some Polars-specific help pages to the xlsxwriter site, equivalent to the existing Pandas ones?

Absolutely. I had actually typed a suggestion like that with my previous comment and then thought that might be insensitive because you have already provided some nice examples in your polar docs. :-) From my point of view the Working with Python Pandas and XlsxWriter were necessary because I kept seeing/answering the same types of questions on StackOverflow.

I'll take a stab at creating a "Working with Polars and XlsxWriter" chaper in the next week or two and hook you in. When do you think this feature will be in a public Polars release?

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Mar 1, 2023

I'll take a stab at creating a "Working with Polars and XlsxWriter" chaper in the next week or two and hook you in. When do you think this feature will be in a public Polars release?

Perfect; shouldn't be more than a few days until 0.16.10 drops 👍

@alexander-beedie
Copy link
Collaborator

@jmcnamara: it's out now - just had time to squeeze-in sparkline support too ;)

@jmcnamara
Copy link

@alexander-beedie Excellent. I'll start on the docs and hook you in once I have a basic framework (in the next couple of days).

@jmcnamara
Copy link

I've added initial docs for this at Working with Polars and XlsxWriter in the main documentation. See also jmcnamara/XlsxWriter#961

@leonkosak
Copy link

Any performance measurements against xlsxwriter (in pandas) and especially against PyExcelerate?
Python really needs a way faster xlsx export library.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Apr 27, 2023

Any performance measurements against xlsxwriter (in pandas) and especially against PyExcelerate?

@leonkosak: Feel free to run some and let us know how it goes ;)
(If there's anything obviously suboptimal I can try and optimise further).

Update: a quick & dirty timing check vs pandas shows we're roughly 50-60% faster to write the same amount of data (with default settings) while also creating a "real" Excel table object (with autofilter/etc) and adding default column formats, which pandas doesn't seem to do. (I made sure not to write the extra index col from pandas, so as to make it a fair comparison).

from codetiming import Timer
from datetime import date
import polars as pl

# quickly spin-up a 1,000,000 element DataFrame
df = pl.DataFrame({
    "idx": range(250_000),
    "x": 123.456789, 
    "y": date.today(), 
    "z":"testing,1.2.3.4."}
)

# export to Excel from polars
with Timer():
    df.write_excel( "dataframe_pl.xlsx" )
    
# export to Excel from pandas
pf = df.to_pandas()
with Timer():
    pf.to_excel( "dataframe_pd.xlsx", index=False )

Results:
(Writing 1,000,000 mixed-type elements)

library time taken
polars 3.43 secs
pandas 5.50 secs

Polars ~60% faster.

@jmcnamara
Copy link

jmcnamara commented May 12, 2023

I've been working on a data handling section for the rust_xlsxwriter docs and wrote a Polar dataframe to Excel wrapper function here: jmcnamara/rust_xlsxwriter#39

It doesn't have a fraction of the functionality of write_excel() and doesn't intend to but from a quick and dirtier implementation of your benchmark above it is ~7x faster than the Python backed writer.

@jmcnamara
Copy link

I've uploaded a new Rust crate called polars_excel_writer for serializing Polars dataframes into Excel Xlsx files using rust_xlsxwriter as a backend engine.

It provides two interfaces for writing a Polars Rust dataframe to an Excel Xlsx file:

  • ExcelWriter a simple Excel serializer that implements the Polars SerWriter trait to write a dataframe to an Excel Xlsx file. This is similar to the CsvWriter interface.

  • PolarsXlsxWriter a more configurable Excel serializer that resembles the interface options provided by the Polars Python write_excel() dataframe method. There is still work in progress for this interface.

    One useful feature of PolarsXlsxWriter is that you can mix Polars and rust_xlsxwriter code to access Excel features not available in the current interface.

Note, this is for Rust dataframes rather than Python dataframes so folks on this thread may not be as interested. If you are and you try it out you can leave some feedback here.

@jmcnamara
Copy link

Just to note, in case anyone finds it useful, with version 3.2.2 of XlsxWriter you can format boolean values in Polars dataframes as checkboxes in Excel:

import polars as pl

# Create a Pandas dataframe with some sample data.
df = pl.DataFrame(
    {
        "Region": ["North", "South", "East", "West"],
        "Target": [100, 70, 90, 120],
        "On-track": [False, True, True, False],
    }
)

# Write the dataframe to a new Excel file with formatting options.
df.write_excel(
    workbook="polars_checkbox.xlsx",

    # Set the checkbox format for the "On-track" boolean column.
    column_formats={"On-track": {"checkbox": True}},

    # Set an alternative table style.
    table_style="Table Style Light 9",

    # Autofit the column widths.
    autofit=True,
)

Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
6 participants