Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(rust): Refactor AnyValue casting logic #13140

Merged
merged 2 commits into from
Feb 26, 2024

Conversation

mcrumiller
Copy link
Contributor

@mcrumiller mcrumiller commented Dec 19, 2023

This PR improves the organization of the AnyValue casting logic. The current logic uses a few helper functions that are somewhat sporadically called, and it makes it a little hard to detect where holes in our casting logic exist.

This implementation has the general format of match (from_dtype, to_dtype), sorted by from_dtype, with logic to deal with special cases first, and the non-special cases at the end to catch all remaining. This both removes a few function calls and makes it clearer where future casts should be placed. Overall, the result is a bit cleaner with fewer LOC, fewer function calls, and fewer total matche branches.

@github-actions github-actions bot added internal An internal refactor or improvement rust Related to Rust Polars labels Dec 19, 2023
.into_date()
.into_series()),
},
(Datetime(tu, _), Date) => Ok(match tu {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a minor ergonomic improvement.

@stinodego
Copy link
Member

I'm not sure about disallowing those casts. In the end, they're just integers under the hood.

I'll mark this as a bug fix, though I'm not sure this should be 'fixed'.

@stinodego stinodego changed the title refactor(rust): organize literal casting fix: Organize literal casting and disallow certain casts Dec 20, 2023
@github-actions github-actions bot added fix Bug fix python Related to Python Polars labels Dec 20, 2023
@mcrumiller
Copy link
Contributor Author

@stinodego isn't that just an implementation detail though? I'm noticing that we do allow casting from categorical to date and such, despite that not making much sense. Should we impose rules for logical casting based on whether they make sense, or whether they're possible due to the underlying structure?

@mcrumiller
Copy link
Contributor Author

(also, I'm willing to add those casts back in--the main point of this PR was to organize the code and to improve the cast testing).

@stinodego
Copy link
Member

(also, I'm willing to add those casts back in--the main point of this PR was to organize the code and to improve the cast testing).

It would be really helpful if each PR does only one thing. So if you would limit this PR to the refactor, and then followup with a PR for the casting update (or create an issue to discuss first), that would be great!

@mcrumiller
Copy link
Contributor Author

@stinodego Absolutely, thanks for the helpful comments!

@mcrumiller
Copy link
Contributor Author

@stinodego I just simplified the code a bit and re-added back the disabled casts. I think the resulting code is a bit easier to follow and manage, as it removed some indirection and additional functions/macros.

@mcrumiller mcrumiller changed the title fix: Organize literal casting and disallow certain casts fix: Organize literal casting Dec 20, 2023
@stinodego stinodego changed the title fix: Organize literal casting refactor: Organize literal casting Dec 21, 2023
@stinodego stinodego removed the fix Bug fix label Dec 21, 2023
@stinodego
Copy link
Member

@mcrumiller would you mind rebasing this? I recently ran into this mess - it would be nice if the casting logic was a bit cleaner. I think you made some good improvements here.

@mcrumiller
Copy link
Contributor Author

Hi @stinodego sure, it may take a little while though as I think there have been a few changes to this since I've updated it and I'll have to understand how they impact the logic.

@stinodego stinodego changed the title refactor: Organize literal casting refactor(rust): Refactor AnyValue casting logic Feb 14, 2024
@stinodego
Copy link
Member

Hi @stinodego sure, it may take a little while though as I think there have been a few changes to this since I've updated it and I'll have to understand how they impact the logic.

Thanks, and apologies for getting back to this so late.

@mcrumiller
Copy link
Contributor Author

Thanks, and apologies for getting back to this so late.

Not necessary! There is a sea of PRs.

@mcrumiller mcrumiller marked this pull request as draft February 14, 2024 19:56
@stinodego stinodego removed the python Related to Python Polars label Feb 14, 2024
@mcrumiller
Copy link
Contributor Author

mcrumiller commented Feb 14, 2024

@stinodego any idea how to run a test on these? Cast pl.lit(x).cast(dt) uses the AnyValue path but it gets overridden somewhere else. For example, casting from string to int shouldn't work:

            AnyValue::String(s) if dtype == &DataType::Binary => AnyValue::Binary(s.as_bytes()),
            _ => {
                polars_bail!(ComputeError: "cannot cast any-value '{:?}' to '{:?}'", self.dtype(), dtype)
            },

And indeed, when I run a simple script to cast a literal string to int:

import polars as pl
print(pl.select(pl.lit("1").cast(pl.Int32)))

That code path is hit:
image
But it's clear that result is ignored and discarded, because the script ends up working:

shape: (1, 1)
┌─────────┐
│ literal │
│ ---     │
│ i32     │
╞═════════╡
│ 1       │
└─────────┘

This all happens through the optimization steps, and so my guess is that it's trying to perform the literal cast, can't, and then later expands the literal to a series and then performs the cast, which is why it works. What other code paths reach and rely on the AnyValue cast?

@stinodego
Copy link
Member

The constructors will rely on this in the future (see py-polars/tests/unit/constructors/test_any_value_fallbacks.py). They already partially do.

But it's hard to test this directly.

@mcrumiller
Copy link
Contributor Author

I'm also noting (for the future) that there's no currently no quick exit for the identity map, e.g. Date.cast(Date) will bail, although maybe that check happens upstream.

@mcrumiller
Copy link
Contributor Author

@stinodego I did my best to align what I had with the existing cast structure. It's possible there may be some minor changes, but I did not touch the test suite and it looks like everything is passing.

I left quite a few // TODOs in there where I think we should add code paths (along with commented code that, when enabled, will provide those paths). These are mostly self maps that may or may not involve a change in a dtype parameter (e.g. Duration -> Duration can change either time unit or time zone), and also casts from Logical dtypes to booleans, which work on Series and make physical sense (ints underneath the hood) but not necessarily logical sense (what is special about Jan 1, 1970 that makes it false, but all other dates true? However, durations of 0 makes sense map to false).

I'm not completely convinced that the new structure in this refactor is actually better than what was there before--it's at least organized, but the tradeoff for reducing the number of function calls is that our matches now require (usually) two matches per arm. If we only matched on the from dtype or to dtype we'd probably end up with a lot more duplicated code but fewer total branches.

Anyway, let me know of any suggested changes/improvements! I can remove the TODO code to clean up for now.

@mcrumiller mcrumiller marked this pull request as ready for review February 14, 2024 22:32
Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments on the tests. If you could update this, I will take another look and probably shuffle some things around and merge it.

py-polars/tests/unit/operations/test_cast.py Outdated Show resolved Hide resolved
py-polars/tests/unit/operations/test_cast.py Outdated Show resolved Hide resolved
py-polars/tests/unit/operations/test_cast.py Outdated Show resolved Hide resolved
crates/polars-core/src/datatypes/any_value.rs Outdated Show resolved Hide resolved
@mcrumiller mcrumiller marked this pull request as draft February 15, 2024 14:50
@mcrumiller mcrumiller marked this pull request as ready for review February 15, 2024 15:38
@mcrumiller
Copy link
Contributor Author

@stinodego I believe I have addressed your issues.

@stinodego stinodego force-pushed the lit-cast-refactor branch 2 times, most recently from 1894ce0 to 071d7eb Compare February 26, 2024 12:21
@stinodego stinodego merged commit d01c4bf into pola-rs:main Feb 26, 2024
17 of 18 checks passed
@mcrumiller mcrumiller deleted the lit-cast-refactor branch February 26, 2024 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants