Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Write data at table level in write_excel #17757

Merged
merged 4 commits into from
Jul 25, 2024

Conversation

mcrumiller
Copy link
Contributor

@mcrumiller mcrumiller commented Jul 20, 2024

Resolves #17756.

Previously, write_column was used to write data to the table, which writes formatting to each cell individually. This update writes the data in the add_table step, which is much simpler, and which also supplies formatting to the table, which in turn formats the columns. The result is simpler data write, simpler formatting, and (very minor) reduction in file size.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Jul 20, 2024
@mcrumiller mcrumiller marked this pull request as draft July 20, 2024 19:35
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jul 20, 2024

Ahh... the contention is that write_column actually writes formats at the cell-level? I'm fine with the change if you can make it work - the failing test implies that those formats are not being applied properly for datetime columns (I eyeballed the sheet output and the datetime col "f" looks like a date 🤔)

Co-authored-by: Alexander Beedie <alexander-beedie@users.noreply.github.com>
@mcrumiller
Copy link
Contributor Author

mcrumiller commented Jul 20, 2024

write_column actually writes formats at the cell-level

Yeah it does (which is a bit weird, no?). I'm looking into the date formatting issue, it's coming back as YYYY-MM-DD regardless of output format. I'm off to see a movie now but I'll look into this later.

@alexander-beedie
Copy link
Collaborator

Yeah it does (which is a bit weird, no?).

It's coming back to me now; write_column decomposes to cell-level writes, therefore also cell-level formatting: https://xlsxwriter.readthedocs.io/worksheet.html#write_column

I'm looking into the date formatting issue, it's coming back as YYYY-MM-DD regardless of output format. I'm off to see a movie now but I'll look into this later.

Have fun ;) If you can find a way to address this issue then it looks like a sensible update to me.

@mcrumiller
Copy link
Contributor Author

mcrumiller commented Jul 21, 2024

@alexander-beedie I figured it out, and might be useful info for you in the future, and doesn't really make sense hence why it's confusing: when the workbook's default_date_format is supplied, it overrides column-level date format unless the data is supplied with the table. See these examples, three attempts, first two fail.

(failure) Write data (with no table), set column format

from datetime import date
import xlsxwriter

# specify default workbook date format
wb = xlsxwriter.Workbook("date_format.xlsx", {"default_date_format": "yyyy.dd.mm"})
ws = wb.add_worksheet("Date")

data = [date(2024, 1, 1), date(2024, 1, 2)]
ws.write_column(0, 0, [date(2024, 1, 1), date(2024, 1, 2)])

# assign column format (has no effect)
date_format = wb.add_format({"num_format": "mm-dd-yyyy"})
ws.set_column(0, 0, 10, date_format)

wb.close()

Result: column format completely ignored.
image

(failure) Write empty table, then write data, then set column format

Next, if we set the default date format, and use add_table but add the data later, it still fails, even if we supply a format:

from datetime import date
import xlsxwriter

wb = xlsxwriter.Workbook("date_format.xlsx", {"default_date_format": "yyyy.dd.mm"})
ws = wb.add_worksheet("Date")

data = [[date(2024, 1, 1)], [date(2024, 1, 2)]]
date_format = wb.add_format({"num_format": "mm-dd-yyyy"})
ws.add_table("A1:A2", options={
    "header_row": False,
    "columns": [{"format": date_format}]
})

# write the data afterwards
ws.write_column(0, 0, [date(2024, 1, 1), date(2024, 1, 2)])
ws.set_column(0, 0, 10, date_format)

wb.close()

same thing:
image

(success) Supply the data with the table.

If we supply the data to the add_table function, our formatting is finally applied:

from datetime import date
import xlsxwriter

wb = xlsxwriter.Workbook("date_format.xlsx", {"default_date_format": "yyyy.dd.mm"})
ws = wb.add_worksheet("Date")

data = [[date(2024, 1, 1)], [date(2024, 1, 2)]]
date_format = wb.add_format({"num_format": "mm-dd-yyyy"})
ws.add_table(
    "A1:A2",
    options={
        "data": data,  # !!! SEE HERE !!
        "header_row": False,
        "columns": [{"format": date_format}],
    },
)

wb.close()

image

Moving the data write to the add_table function simplifies everything here (removes a loop) and solves the issue.

@mcrumiller mcrumiller marked this pull request as ready for review July 21, 2024 18:22
Copy link

codecov bot commented Jul 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.53%. Comparing base (1df3b0b) to head (038c66d).
Report is 48 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17757      +/-   ##
==========================================
+ Coverage   80.40%   80.53%   +0.12%     
==========================================
  Files        1502     1503       +1     
  Lines      197041   197026      -15     
  Branches     2794     2800       +6     
==========================================
+ Hits       158439   158676     +237     
+ Misses      38088    37830     -258     
- Partials      514      520       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mcrumiller mcrumiller changed the title feat(python): Write format at column level feat(python): Write format at column level in write_excel Jul 21, 2024
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jul 22, 2024

Moving the data write to the add_table function simplifies everything here (removes a loop) and solves the issue.

Nice discovery - any potential downsides to this approach? The tests can't really validate formatting, so do we see the expected formatting when eyeballing some of the more sophisticated examples manually? (I'll start double-checking some ;)

@mcrumiller
Copy link
Contributor Author

I don't think so. The "heavily customized formatting/definition" parameter set in test_spreadsheet.test_excel_round_trip() still produces the same result, which is this:

image

which includes the top/bottom formatting, column-specific formatting, and dtype-specific formatting.

Also, I left in the column-formatting section which is probably not strictly necessary any more, I but I feel probably can't hurt if someone decides to manually add data below the table for whatever reason.

Copy link
Collaborator

@alexander-beedie alexander-beedie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I've tested on some other cases and it looks like the new code is not behaving quite the same in several instances - specifically, issues with the table header.

For example:

pl.DataFrame({
    "id": ["a123", "b345", "c567", "d789"],
    "values": [99, 45, 50, 85],
    "misc": [1.2, 3.4, 5.6, 7.8],
}).write_excel(
    "~/output.xlsx",
    table_style={"style": "Table Style Medium 15"},
)

With this patch the second and third column header names end up black; they are present, but do not conform to the given table_style, which defines them as bold/white, so they look like they aren't there 🤔

Before:

Screenshot 2024-07-23 at 17 19 08

After:

Screenshot 2024-07-23 at 17 14 24

Will need some further tinkering :)


(Interestingly there is also a 2-pixel per column increase in column width with the new code, though that's not important - just an odd observation!)

@mcrumiller
Copy link
Contributor Author

mcrumiller commented Jul 23, 2024

Thanks--I'll take a look at this tonight and do some more thorough testing. I wonder if applying the column-level formatting overrides the table formatting, since the numerical columns are applying a num_format (which defaults to black) whereas the id column is not. If that's the case, the fix is to probably not apply the column-level formatting and leave it to the table formatter.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jul 23, 2024

Excel is a dark art 😆

@mcrumiller
Copy link
Contributor Author

mcrumiller commented Jul 23, 2024

@alexander-beedie I reproduced, and indeed removing the column formatting fixed it.

I think that we should simply leave the formatting to add_table, which seems to be sufficient and simplest. I'll rename the PR since now we're not actually touching the column-level format.

@mcrumiller mcrumiller changed the title feat(python): Write format at column level in write_excel feat(python): Write format at table level in write_excel Jul 23, 2024
@alexander-beedie
Copy link
Collaborator

@alexander-beedie I reproduced, and indeed removing the column formatting fixed it.

Nice; will take another run through it today 👍

@mcrumiller mcrumiller changed the title feat(python): Write format at table level in write_excel feat(python): Write data at table level in write_excel Jul 24, 2024
@alexander-beedie alexander-beedie added the A-io-excel Area: reading/writing Excel files label Jul 25, 2024
@alexander-beedie
Copy link
Collaborator

Looks good :)

@alexander-beedie alexander-beedie merged commit 3016c07 into pola-rs:main Jul 25, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-io-excel Area: reading/writing Excel files enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

write_excel: write column formats for column, not individual cells within column
2 participants