Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coercing dates to strings adds time of the form '00:00:00' #252

Open
severinh opened this issue Jul 17, 2024 · 2 comments
Open

Coercing dates to strings adds time of the form '00:00:00' #252

severinh opened this issue Jul 17, 2024 · 2 comments
Labels
🗒️ calamine 🗒️ Calamine-related issue

Comments

@severinh
Copy link

severinh commented Jul 17, 2024

How to reproduce

Suppose you have an Excel file with the following mixed-type data:

Header
=DATE(2024, 7, 1)
=DATE(2024, 7, 2)
"some string"

Now lets read this Excel file into a Polars dataframe, which coerces the column to strings:
excel_reader.load_sheet(0).to_polars()

This produces the following data frame:

Header
"2024-07-01 00:00:00"
"2024-07-02 00:00:00"
"some string"

Expected behavior

What I would have expected fastexcel to not include the time 00:00:00 for these dates. That is,

Header
"2024-07-01"
"2024-07-02"
"some string"

Test case

Excel sheet: sheet-date.xlsx

def test_date_casting_to_string() -> None:
    excel_reader = fastexcel.read_excel(path_for_fixture("sheet-date.xlsx"))
    sheet = excel_reader.load_sheet(0, column_names=["col1"])
    expected = {
        "col1": ["2024-07-01", "2024-07-02", "some string"],
    }
    pl_assert_frame_equal(sheet.to_polars(), pl.DataFrame(expected))
@PrettyWood
Copy link
Member

Yup the issue is that it's detected as datetime by calamine. We can try to do a as_date() afterwards but it would be better if calamine could handle directly dates.
I'll dig into into it this weekend but it won't be a quick fix like the bool to string. I'll probably need to open a PR on calamine side

@severinh
Copy link
Author

severinh commented Jul 17, 2024

Thanks @PrettyWood! I suspected that this one would not be as straight-forward.

In any case, no pressure on this. In contrast to #250, this issue here is not blocking our adoption of fastexcel. We can work around this for now by trimming 00:00:00 as a post-processing step. Something like:

sheet_df.with_columns(pl.all().str.replace_all(r"^([0-9]{4}-[0-9]{2}-[0-9]{2}) 00:00:00$", "$1"))

@lukapeschke lukapeschke added the 🗒️ calamine 🗒️ Calamine-related issue label Jul 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🗒️ calamine 🗒️ Calamine-related issue
Projects
None yet
Development

No branches or pull requests

3 participants