Rename entrypoint to __consortium_api__? #323

Closed
MarcoGorelli opened this issue Nov 16, 2023 · 7 comments

@MarcoGorelli
Contributor

If #308 goes in, then the return value of Column.get_value will change. It will no longer be a Python scalar, but a Scalar

This means I'll have to update the tests in pandas/Polars:

https://github.com/pandas-dev/pandas/blob/f777e67d2b29cda5b835d30c855b633269f5e8e8/pandas/tests/test_downstream.py#L340-L344

I'll change it to something much simpler that realistically will never break, like asserting something about result.name

If I'm going to have to change things upstream, I'd like to take the chance to rename the entrypoint

__dataframe_consortium_standard__ is just...long. Originally we'd suggested __dataframe_standard__, but Brock correctly pointed out that this has normative connotations

We're starting to get positive responses (see koaning/scikit-lego#597, skrub-data/skrub#786), so the time to make changes is running out

My hope is that this would then need to be the last upstream update. The rest, we can handle here / in dataframe-api-compat
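
For context, a minimal sketch of what the entrypoint call looks like today, and how the same call would read after the proposed rename (the exact `api_version` string is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Today: opt in to the Standard via the long-named entrypoint.
df_std = df.__dataframe_consortium_standard__(api_version='2023.11-beta')

# After the proposed rename, the same call would read:
# df_std = df.__consortium_api__(api_version='2023.11-beta')
```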

@MarcoGorelli
Contributor Author

Slightly dreading starting the conversation though, and the downside is that the minimum pandas version supported by the standard would have to rise to 2.2

An alternative could be that in dataframe-api-compat I just make a decorator, so people can write df-agnostic functions like this:

from typing import Any

from dataframe_api_compat import dataframe_api

# `DataFrame` in the annotation below refers to the Standard's DataFrame class.


@dataframe_api(api_version='2023.11-beta')
def my_dataframe_agnostic_function(df: DataFrame) -> Any:
    # Append a zero-mean, unit-variance copy of every column.
    for column_name in df.column_names:
        new_column = df.col(column_name)
        new_column = (new_column - new_column.mean()) / new_column.std()
        df = df.assign(new_column.rename(f'{column_name}_scaled'))

    # Hand back the underlying (native) dataframe.
    return df.dataframe

Then we don't need to bother pandas, and this looks pretty clean anyway
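
For illustration, here's a minimal sketch of how such a decorator might be implemented inside dataframe-api-compat. `convert_to_standard_compliant_dataframe` is a hypothetical helper name, and the type-based dispatch is an assumption, not the package's actual API:

```python
from functools import wraps
from typing import Any, Callable


def dataframe_api(*, api_version: str) -> Callable:
    """Sketch only: wrap a native dataframe in a standard-compliant object."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(func)
        def wrapper(df: Any) -> Any:
            # Dispatch on the concrete dataframe type so the decorated
            # function only ever sees the Standard's API.
            module = type(df).__module__
            if module.startswith("pandas"):
                from dataframe_api_compat import pandas_standard
                df = pandas_standard.convert_to_standard_compliant_dataframe(
                    df, api_version=api_version
                )
            elif module.startswith("polars"):
                from dataframe_api_compat import polars_standard
                df = polars_standard.convert_to_standard_compliant_dataframe(
                    df, api_version=api_version
                )
            else:
                raise TypeError(f"Unsupported dataframe type: {type(df)}")
            return func(df)

        return wrapper

    return decorator
```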

@kkraus14
Collaborator

Folks may not want to take on the dataframe-api-compat package as a dependency, even though it's small, pure Python, and vendorable.

I have no objections to the name change other than it may be a bit confusing when working across arrays, dataframes, and other future types that may have efforts to standardize APIs.

We should probably also have our spec include this dunder method as part of the DataFrame, Column, and maybe Scalar classes?

@MarcoGorelli
Contributor Author

It's already mentioned here:

The signatures should be (note: docstring is optional):
```python
def __dataframe_consortium_standard__(
    self, *, api_version: str
) -> Any:

def __column_consortium_standard__(
    self, *, api_version: str
) -> Any:
```
`api_version` is a string representing the version of the dataframe API specification to be returned, in ``'YYYY.MM'`` form, for example, ``'2023.04'``. If the given version is invalid or not implemented for the given module, an error should be raised. It is suggested to use the earliest API version required for maximum compatibility.
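
A minimal sketch of how an implementing library might satisfy that signature; the wrapper class, the library class, and the supported version strings below are all placeholders, not part of the spec:

```python
from typing import Any

SUPPORTED_API_VERSIONS = {'2023.08', '2023.11-beta'}  # illustrative only


class StandardCompliantDataFrame:
    """Placeholder for the standard-compliant wrapper a library would return."""

    def __init__(self, native_df: Any, *, api_version: str) -> None:
        self._df = native_df
        self._api_version = api_version


class MyDataFrame:
    """Placeholder for a library's native dataframe class."""

    def __dataframe_consortium_standard__(self, *, api_version: str) -> Any:
        # Per the quoted text: raise if the requested version is invalid or
        # not implemented, otherwise return a standard-compliant object.
        if api_version not in SUPPORTED_API_VERSIONS:
            raise ValueError(f"Unsupported api_version: {api_version!r}")
        return StandardCompliantDataFrame(self, api_version=api_version)
```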

I don't think DataFrame / Column / Scalar need it, this is just the entry-point for going from "non-necessarily-standard-compliant" to "standard-compliant"

If you have a DataFrame as defined in our spec, it's already standard-compliant, and you'd have no need to call __dataframe_consortium_standard__ on it

@kkraus14
Collaborator

> I don't think DataFrame / Column / Scalar need it, this is just the entry-point for going from "non-necessarily-standard-compliant" to "standard-compliant"
>
> If you have a DataFrame as defined in our spec, it's already standard-compliant, and you'd have no need to call __dataframe_consortium_standard__ on it

If I get an arbitrary dataframe as input and I want to confirm it's standard-compliant, how do I do that today? In my mind the easiest way would be to have standard-compliant classes implement __dataframe_consortium_standard__ and have it return self.

@MarcoGorelli
Contributor Author

there's __dataframe_namespace__ for that

@kkraus14
Collaborator

kkraus14 commented Nov 16, 2023

> there's __dataframe_namespace__ for that

That returns the namespace and not a compliant dataframe object. So the code would end up looking like:

def get_compliant_dataframe(df):
    # Standard-compliant objects already expose __dataframe_namespace__.
    if hasattr(df, "__dataframe_namespace__"):
        return df
    else:
        # Otherwise, convert via the entrypoint dunder.
        return df.__dataframe_consortium_standard__(...)

It feels a bit clunky but I guess it's not too bad?
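
A quick usage sketch of that helper. This assumes the `...` above has been filled in with a concrete `api_version` (e.g. `api_version='2023.11-beta'`) and that the installed pandas exposes the `__dataframe_consortium_standard__` entrypoint:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

df_std = get_compliant_dataframe(df)      # converted via the entrypoint dunder
df_std = get_compliant_dataframe(df_std)  # already compliant: returned as-is
```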

@MarcoGorelli
Contributor Author

> It feels a bit clunky but I guess it's not too bad?

Yeah, and as Ralf said, in the end people will probably just write their own helper functions

Might as well close this then, it isn't too bad
