Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LazyFrame should not be generic #4256

Closed
matteosantama opened this issue Aug 4, 2022 · 7 comments · Fixed by #4309
Closed

LazyFrame should not be generic #4256

matteosantama opened this issue Aug 4, 2022 · 7 comments · Fixed by #4309
Labels
bug Something isn't working

Comments

@matteosantama
Copy link
Contributor

matteosantama commented Aug 4, 2022

LazyFrame is currently defined as a generic class:

DF = TypeVar("DF", bound="pli.DataFrame")

class LazyFrame(Generic[DF]):
    ...

The motivation for this implementation was raised in #2862, but I think it was a mistake. Allow me to list my reasons.

  1. Start by considering that the motivation for making LazyFrame generic was very weak to begin with. It was noted that
import polars as pl

class MyDataFrame(pl.DataFrame):
    pass

assert isinstance(MyDataFrame().lazy().collect(), MyDataFrame)

fails. And although this may be slightly unexpected behavior, it is by no means wrong if properly documented. I claim that supporting this very specific use-case at the library level is overly burdensome for the marginal benefit it provides.

  1. We have arbitrarily chosen LazyFrame to be generic over the type of DataFrame it constructs, but in truth DataFrame should also be generic over the type of LazyFrame it generates. This of course creates a circular relationship, which is why we don't use generics and resort to metaclass "hacking" on the DataFrame side. IMO, this is an indication that these two classes should not be generic over one another, and instead should have a strict one-to-one relationship.

  2. There has been significant, ongoing efforts to enforce mypy --strict=true [ref Additional lints for the Python code base #4044]. Part of this work includes setting disallow_any_generics=true, meaning every reference to LazyFrame will have to be given an explicit generic parameter. Consider the implications of the io methods. polars.io.scan_csv is typed to return a LazyFrame, but what should the generic parameter be?

    • pl.DataFrame: No, not necessarily. We may be reading in a subclassed DataFrame instead.
    • Any: This will silence mypy but is wrong because it can't truly be Any. pl.LazyFrame[int] makes no sense!

This specific problem exists for

  • polars.io.scan_csv
  • polars.io.scan_ipc
  • polars.io.scan_parquet
  • polars.io.scan_ds

but there are literally hundreds of other instances where we must decide what generic to use. This problem was also raised in #4052.

  1. I am not totally against supporting the roundtrip isinstance check, but the current implementation falls short. In particular, it relies too heavily on duck-typing (the # type: ignore comments are a sure sign). The usage of duck-typing has fallen out of favor in modern Python -- replaced by strict typing and static analyzers. Adhering to strict typing greatly benefits the health of a code base, something especially crucial in a large project like this. I think it makes sense to revert Preserve DataFrame type after LazyFrame roundtrips #2862 until we identify a more compelling reason to support a roundtrip.

  2. As a final thought, the desired round trip behavior could alternately be achieved on the user side if we provided a helper method to_subclass.

T = TypeVar("T", bound=pl.DataFrame)

class MyDataFrame(pl.DataFrame):
    @classmethod
    def _from_dataframe(cls: type[T], df: pl.DataFrame) -> T:
        # user defined function that transforms a polars df to a custom df
        ...

class DataFrame:
    ...
    @classmethod
    def _from_dataframe(cls: type[T], df: pl.DataFrame) -> T:
        return df

    def to_subclass(self, subclass_type: type[T]) -> T:
        return subclass_type._from_dataframe(self)


assert isinstance(MyDataFrame().lazy().collect().to_subclass(MyDataFrame), MyDataFrame)

tl;dr

polars should not make LazyFrame generic. It is impossible to simultaneously type a DataFrame generic over LazyFrame along with a LazyFrame generic over DataFrame. Since we cannot properly support the two-way generics with type annotations, I don't think our code should support it. All internal operations should only deal with the DataFrame / LazyFrame pair and any conversions be done on the user-side.

@matteosantama matteosantama added the bug Something isn't working label Aug 4, 2022
@stinodego
Copy link
Contributor

I fully agree with you. This generic business increases the complexity of the classes and types for no apparent benefit. I personally do not know a single DataFrame library that defines their core DataFrame class as a generic type.

Maybe @JakobGM or @ritchie46 care to elaborate here.

@ritchie46
Copy link
Member

A good read @matteosantama and I agree. Let's undo the generic LazyFrame.

@stinodego
Copy link
Contributor

Since this would be a breaking change, it would be nice if we could include this in the upcoming 0.14.0 release.

@matteosantama do you want to pick this up, or should I take a stab at this?

@matteosantama
Copy link
Contributor Author

@stinodego I've been working on a PR actually. I'll put the finishing touches and open up the PR today

matteosantama pushed a commit to matteosantama/polars that referenced this issue Aug 8, 2022
@JakobGM
Copy link
Contributor

JakobGM commented Aug 11, 2022

Sorry for not participating in this discussion originally, especially since I was the one that argued for and introduced these changes! 😅 Having read your your points @matteosantama (thanks for the links as well!), I agree with these general changes ☺️ Not having DataFrame and LazyFrame being generic, especially due to the circularity of their relations, seems reasonable.

I have one question, though! Do we think we could preserve the class, not in roundtrips, but in general simple methods? Take this example:

def wrap_ldf(ldf: PyLazyFrame) -> LazyFrame:
    return LazyFrame._from_pyldf(ldf)

class LazyFrame:
    @classmethod
    def read_json(
        cls,
        file: str | Path | IOBase,
    ) -> LazyFrame:
		# --- /snip ---
        return wrap_ldf(PyLazyFrame.read_json(file))

Is there a reason for us not doing it in this way?

class LazyFrame:
    @classmethod
    def _wrap_ldf(cls: Type[LDF], ldf: PyLazyFrame) -> LDF:
        return cls._from_pyldf(ldf)

    @classmethod
    def read_json(
        cls: Type[LDF],
        file: str | Path | IOBase,
    ) -> LDF:
		# --- /snip ---
        return cls._wrap_ldf(PyLazyFrame.read_json(file))

That way we preserve the original class, while not using generics of any kind. What do you think? In my mind, placing this alternative class constructor on the class itself also makes sense, without complicating the code.

@matteosantama
Copy link
Contributor Author

Seems reasonable to me! We can strive to maintain the proper classes without making a total guarantee. I'm sure there are other places that can be cleaned up in a similar manner.

@JakobGM
Copy link
Contributor

JakobGM commented Aug 11, 2022

Seems reasonable to me! We can strive to maintain the proper classes without making a total guarantee. I'm sure there are other places that can be cleaned up in a similar manner.

Alright! I can create a PR when I get the time, and then we can take a look at how it ends up looking in practice before deciding. I think it might be done cleanly in most cases, and the rest we leave just as is.

If we do it that way I might be able to salvage https://github.com/kolonialno/patito by just overwriting DataFrame.lazy() and LazyFrame.collect() in order to preserve the runtime type (not static analysis type) of patito's custom DataFrame class, without forcing polars to accommodate for such a niche use of the library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants