`LazyFrame` should not be generic #4256
Comments
I fully agree with you. This generic business increases the complexity of the classes and types for no apparent benefit. I personally do not know a single DataFrame library that defines their core DataFrame class as a generic type. Maybe @JakobGM or @ritchie46 care to elaborate here.
A good read @matteosantama and I agree. Let's undo the generic
Since this would be a breaking change, it would be nice if we could include this in the upcoming release. @matteosantama, do you want to pick this up, or should I take a stab at this?
@stinodego I've been working on a PR actually. I'll put the finishing touches on it and open up the PR today.
Sorry for not participating in this discussion originally, especially since I was the one who argued for and introduced these changes! 😅 Having read your points @matteosantama (thanks for the links as well!), I agree with these general changes. I have one question, though: do we think we could preserve the class, not in roundtrips, but in general simple methods? Take this example:

    def wrap_ldf(ldf: PyLazyFrame) -> LazyFrame:
        return LazyFrame._from_pyldf(ldf)

    class LazyFrame:
        @classmethod
        def read_json(
            cls,
            file: str | Path | IOBase,
        ) -> LazyFrame:
            # --- /snip ---
            return wrap_ldf(PyLazyFrame.read_json(file))

Is there a reason for us not doing it in this way?

    class LazyFrame:
        @classmethod
        def _wrap_ldf(cls: Type[LDF], ldf: PyLazyFrame) -> LDF:
            return cls._from_pyldf(ldf)

        @classmethod
        def read_json(
            cls: Type[LDF],
            file: str | Path | IOBase,
        ) -> LDF:
            # --- /snip ---
            return cls._wrap_ldf(PyLazyFrame.read_json(file))

That way we preserve the original class, while not using generics of any kind. What do you think? In my mind, placing this alternative class constructor on the class itself also makes sense, without complicating the code.
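
The classmethod-based alternative can be tried out in isolation. The following is a minimal sketch, assuming a stand-in `PyLazyFrame` class in place of the real compiled binding and a hypothetical subclass `MyLazyFrame`. The `LDF` type variable is assumed to be bound to `LazyFrame`, which the comment above does not spell out, and the body of `_from_pyldf` is likewise an assumption.

```python
from __future__ import annotations

from typing import Type, TypeVar


class PyLazyFrame:
    """Stand-in for the compiled binding (assumption, not the real class)."""

    @staticmethod
    def read_json(file: str) -> PyLazyFrame:
        return PyLazyFrame()


# Assumed definition: a type variable bound to LazyFrame.
LDF = TypeVar("LDF", bound="LazyFrame")


class LazyFrame:
    _ldf: PyLazyFrame

    @classmethod
    def _from_pyldf(cls: Type[LDF], ldf: PyLazyFrame) -> LDF:
        # Bypass __init__ and attach the wrapped object directly.
        self = cls.__new__(cls)
        self._ldf = ldf
        return self

    @classmethod
    def _wrap_ldf(cls: Type[LDF], ldf: PyLazyFrame) -> LDF:
        # Using cls instead of a hard-coded LazyFrame keeps the caller's class.
        return cls._from_pyldf(ldf)

    @classmethod
    def read_json(cls: Type[LDF], file: str) -> LDF:
        return cls._wrap_ldf(PyLazyFrame.read_json(file))


class MyLazyFrame(LazyFrame):
    """Hypothetical user subclass."""


frame = MyLazyFrame.read_json("data.json")
print(type(frame).__name__)  # MyLazyFrame
```

Because every constructor goes through `cls`, a subclass gets instances of itself back, and a type checker can infer the subclass from the annotated `cls` parameter without `LazyFrame` being a generic class.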
Seems reasonable to me! We can strive to maintain the proper classes without making a total guarantee. I'm sure there are other places that can be cleaned up in a similar manner.
Alright! I can create a PR when I get the time, and then we can take a look at how it ends up looking in practice before deciding. I think it might be done cleanly in most cases, and the rest we leave just as is. If we do it that way I might be able to salvage https://github.com/kolonialno/patito by just overwriting
`LazyFrame` is currently defined as a generic class. The motivation for this implementation was raised in #2862, but I think it was a mistake. Allow me to list my reasons.

The case for a generic `LazyFrame` was very weak to begin with. It was noted that an `isinstance` check against a `DataFrame` subclass fails after a roundtrip through `LazyFrame`. Although this may be slightly unexpected behavior, it is by no means wrong if properly documented. I claim that supporting this very specific use case at the library level is overly burdensome for the marginal benefit it provides.
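
To make the disputed behavior concrete, here is a minimal sketch with stand-in classes rather than the real polars ones (all class names are illustrative assumptions). Without any generic machinery, `collect` always builds the base `DataFrame`, so the subclass is lost in the roundtrip.

```python
from __future__ import annotations


class LazyFrame:
    """Stand-in lazy wrapper, not the polars class."""

    def __init__(self, data: dict[str, list[int]]) -> None:
        self._data = data

    def collect(self) -> DataFrame:
        # Always constructs the base class: a strict one-to-one pairing.
        return DataFrame(self._data)


class DataFrame:
    """Stand-in eager frame, not the polars class."""

    def __init__(self, data: dict[str, list[int]]) -> None:
        self._data = data

    def lazy(self) -> LazyFrame:
        return LazyFrame(self._data)


class MyDataFrame(DataFrame):
    """Hypothetical user subclass."""


df = MyDataFrame({"a": [1, 2, 3]})
roundtrip = df.lazy().collect()

print(isinstance(df, MyDataFrame))         # True
print(isinstance(roundtrip, MyDataFrame))  # False
print(isinstance(roundtrip, DataFrame))    # True
```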
We have arbitrarily chosen `LazyFrame` to be generic over the type of `DataFrame` it constructs, but in truth `DataFrame` should also be generic over the type of `LazyFrame` it generates. This of course creates a circular relationship, which is why we don't use generics and resort to metaclass "hacking" on the `DataFrame` side. IMO, this is an indication that these two classes should not be generic over one another, and instead should have a strict one-to-one relationship.

There have been significant, ongoing efforts to enforce `mypy --strict=true` [ref Additional lints for the Python code base #4044]. Part of this work includes setting `disallow_any_generics=true`, meaning every reference to `LazyFrame` will have to be given an explicit generic parameter. Consider the implications for the `io` methods. `polars.io.scan_csv` is typed to return a `LazyFrame`, but what should the generic parameter be?

- `pl.DataFrame`: No, not necessarily. We may be reading in a subclassed `DataFrame` instead.
- `Any`: This will silence mypy but is wrong because it can't truly be `Any`. `pl.LazyFrame[int]` makes no sense!

This specific problem exists for

- `polars.io.scan_csv`
- `polars.io.scan_ipc`
- `polars.io.scan_parquet`
- `polars.io.scan_ds`

but there are literally hundreds of other instances where we must decide what generic to use. This problem was also raised in #4052.
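
The annotation dilemma can be reproduced with stand-in classes. This is a sketch under stated assumptions, not the polars source: `DF`, `scan_csv_bare`, and `scan_csv_pinned` are hypothetical names, and the functions return empty placeholder objects instead of reading anything.

```python
from __future__ import annotations

from typing import Generic, TypeVar, get_type_hints


class DataFrame:
    """Stand-in eager frame, not the polars class."""


# Assumed type variable: the DataFrame type a LazyFrame collects into.
DF = TypeVar("DF", bound=DataFrame)


class LazyFrame(Generic[DF]):
    """Stand-in generic lazy frame."""


def scan_csv_bare(path: str) -> LazyFrame:
    # Bare generic: runs fine, but this is the kind of annotation that
    # mypy's disallow_any_generics setting is meant to reject.
    return LazyFrame()


def scan_csv_pinned(path: str) -> LazyFrame[DataFrame]:
    # Pinned parameter: satisfies the checker, but hard-codes the base
    # DataFrame even when a subclass is what the caller has in mind.
    return LazyFrame()


print(get_type_hints(scan_csv_bare)["return"])
print(get_type_hints(scan_csv_pinned)["return"])
```

Neither annotation is satisfying: the first leaves the parameter unspecified, the second claims more than the function can promise. Every function that returns a `LazyFrame` faces the same choice.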
I am not totally against supporting the roundtrip `isinstance` check, but the current implementation falls short. In particular, it relies too heavily on duck typing (the `# type: ignore` comments are a sure sign). Duck typing has fallen out of favor in modern Python, replaced by strict typing and static analyzers. Adhering to strict typing greatly benefits the health of a code base, something especially crucial in a large project like this. I think it makes sense to revert "Preserve DataFrame type after LazyFrame roundtrips" (#2862) until we identify a more compelling reason to support a roundtrip.

As a final thought, the desired roundtrip behavior could alternatively be achieved on the user side if we provided a helper method `to_subclass`.

tl;dr

`polars` should not make `LazyFrame` generic. It is impossible to simultaneously type a `DataFrame` generic over `LazyFrame` along with a `LazyFrame` generic over `DataFrame`. Since we cannot properly support the two-way generics with type annotations, I don't think our code should support it. All internal operations should only deal with the `DataFrame`/`LazyFrame` pair, and any conversions should be done on the user side.
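
One way the suggested `to_subclass` helper might look is sketched below. The issue only names the helper, so its signature and behavior here are assumptions, and the classes are stand-ins rather than the polars ones.

```python
from __future__ import annotations

from typing import Type, TypeVar

# Assumed type variable bound to the stand-in DataFrame below.
DF = TypeVar("DF", bound="DataFrame")


class LazyFrame:
    """Stand-in lazy frame, strictly paired with the base DataFrame."""

    def __init__(self, data: dict[str, list[int]]) -> None:
        self._data = data

    def collect(self) -> DataFrame:
        return DataFrame(self._data)


class DataFrame:
    """Stand-in eager frame, not the polars class."""

    def __init__(self, data: dict[str, list[int]]) -> None:
        self._data = data

    def lazy(self) -> LazyFrame:
        return LazyFrame(self._data)

    def to_subclass(self, subclass: Type[DF]) -> DF:
        # Hypothetical helper: rebuild the frame as the requested subclass.
        return subclass(self._data)


class MyDataFrame(DataFrame):
    """Hypothetical user subclass."""


df = MyDataFrame({"a": [1, 2, 3]})
restored = df.lazy().collect().to_subclass(MyDataFrame)
print(type(restored).__name__)  # MyDataFrame
```

The conversion is explicit and happens in user code, so the library's own types stay non-generic while the caller still ends up with the subclass they asked for.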