Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When coercing columns to strings, boolean cells turn into null #250

Closed
severinh opened this issue Jul 17, 2024 · 5 comments · Fixed by #251
Closed

When coercing columns to strings, boolean cells turn into null #250

severinh opened this issue Jul 17, 2024 · 5 comments · Fixed by #251
Labels
bug Something isn't working 🦀 rust 🦀 Pull requests that edit Rust code 🗒️ calamine 🗒️ Calamine-related issue
Milestone

Comments

@severinh
Copy link

severinh commented Jul 17, 2024

In our project, we have Excel files with mixed data types. Because of this, we need to coerce all columns to strings, and interpret the data downstream. We're currently blocked from adopting fastexcel because it does not coerce booleans as expected.

How to reproduce

Suppose you have an Excel file with the following data:

Header
=TRUE
=FALSE
"some string"

Now lets read this Excel file into a Polars dataframe, while coercing the column to strings:
excel_reader.load_sheet(0, header_row=1, dtypes={"Header": "string"}).to_polars()

This produces the following data frame:

Header
null
null
"some string"

Expected behavior

What I would have expected fastexcel to coerce the boolean values to "0" and "1" instead of losing them. That is:

Header
"1"
"0"
"some string"

Test case

I've cloned the fastexcel repo and wrote a new unit test for this, which currently fails.

Excel sheet: sheet-bool.xlsx

def test_bool_casting_to_string_for_polars() -> None:
    excel_reader = fastexcel.read_excel(path_for_fixture("sheet-bool.xlsx"))

    actual_polars_df = excel_reader.load_sheet(
        0, header_row=None, dtypes={0: "string"}, column_names=["0"]
    ).to_polars()
    expected_polars_df = pl.DataFrame(
        {
            "0": ["1", "0", "some string"],
        }
    )

    pl_assert_frame_equal(actual_polars_df, expected_polars_df)

Unfortunately, I'm not experienced with Rust, so I did not yet figure out where/how to make the change in the Rust code to make the test pass.

Closing words

First of all, big thanks for building fastexcel! I'm eager to migrate to fastexcel due to its ability to directly output Polars, which our downstream pipelines are based on, which will give us a massive performance boost.

It would be much appreciated if you either gave me some pointers on where/how to make this change in Rust code (so I can open a PR), or made the fix yourself.

@PrettyWood
Copy link
Member

Hi @severinh and thanks for your kind words
I'm on my phone right now so hard to help more but I would start with this PR
You probably want to update STRING_TYPES_CELL to support booleans
If that doesn't work we can make a quick fix soon. I'm planning on working on fastexcel this weekend and release a new version v0.11.0 soon, mostly for @lukapeschke amazing work on #147

@PrettyWood
Copy link
Member

Btw @lukapeschke and myself tend to prefer the "true" / "false" representation (the classic to_string() for a boolean)
Would that be ok for you @severinh ?
I'll let you work on it if you want. If you prefer I can open a fix at lunch. You tell me

@lukapeschke lukapeschke added bug Something isn't working 🦀 rust 🦀 Pull requests that edit Rust code 🗒️ calamine 🗒️ Calamine-related issue labels Jul 17, 2024
@lukapeschke lukapeschke added this to the v0.11.0 milestone Jul 17, 2024
@severinh
Copy link
Author

severinh commented Jul 17, 2024

@PrettyWood Thanks for the quick response!.

Btw @lukapeschke and myself tend to prefer the "true" / "false" representation (the classic to_string() for a boolean)
Would that be ok for you @severinh ?

Sure! Makes sense to be more explicit. The main thing I care about is that we don't lose legitimate data.

You probably want to update STRING_TYPES_CELL to support booleans

Tried that, and unfortunately that did not seem to be enough yet. The output was still null. Since I won't have time this afternoon to look further into it, I would appreciate you going ahead and making the change.

PS: Another problem we ran into: A cell like =DATE(2024, 7, 17) is coerced into '2024-07-01 00:00:00'. I would have expected '2024-07-01' for dates. For now, we're trimming ' 00:00:00' in post-processing, which is not ideal. Is this behavior expected and intentional? If not, I can file a separate ticket for this.

@PrettyWood
Copy link
Member

PrettyWood commented Jul 17, 2024

Just made a quick fix @severinh
For the question on date to string, can you please open a new issue with another example file?
Thank you

@severinh
Copy link
Author

severinh commented Jul 17, 2024

Amazing! Thank you.

I've opened #252 for the issue with the date coercion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working 🦀 rust 🦀 Pull requests that edit Rust code 🗒️ calamine 🗒️ Calamine-related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants