Add details to expectations for scalars #308

Merged
MarcoGorelli merged 31 commits into data-apis:main on Nov 17, 2023

Conversation

MarcoGorelli
Contributor

closes #305

@kkraus14
Collaborator

kkraus14 commented Oct 31, 2023

I think we need to have more details than this, i.e. is inheriting from float / int / bool required? Does isinstance(..., float) return True? Does the scalar need to support pickle? If so, can it be namespace-specific, or does it need to return general Python scalars to work across namespaces? Which methods need to be supported and which don't?

@MarcoGorelli
Contributor Author

I don't really have too strong an opinion on these. I can make a list of the methods which need to be supported. Do you have preferences on the rest?

@kkraus14
Collaborator

kkraus14 commented Nov 1, 2023

I don't really have too strong an opinion on these. I can make a list of the methods which need to be supported. Do you have preferences on the rest?

I'm still generally -1 on ducktyping Python scalars because of the messiness that will come when doing things like LIST / STRUCT / DECIMAL types where I'd be in much stronger favor of having a Scalar class that can have a much smaller API area that implementations need to conform to.

@MarcoGorelli
Contributor Author

agree - besides, the longer users stay within classes defined by the standard, the longer they know exactly what to expect

will update then, thanks

@MarcoGorelli
Contributor Author

I've added a list of required methods

Comment on lines 72 to 73
- `__int__`
- `__float__`
Contributor Author

punt on these

@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 10, 2023

I've updated to add a Scalar class and to note that __bool__ may force computation (or raise, depending on the implementation). For now I'm leaving out __int__ / __float__ / any other method to get a Python scalar out
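
A minimal sketch of what "__bool__ may force compute (or raise)" could look like. `LazyScalar` and its `allow_compute` flag are hypothetical names made up for illustration; nothing here is taken from the spec or from any implementation.

```python
from typing import Callable


class LazyScalar:
    """Hypothetical scalar wrapping a deferred computation (illustration only)."""

    def __init__(self, thunk: Callable[[], object], allow_compute: bool = True) -> None:
        self._thunk = thunk
        self._allow_compute = allow_compute

    def __gt__(self, other: float) -> "LazyScalar":
        # The comparison stays lazy: it only builds another deferred scalar.
        return LazyScalar(lambda: self._thunk() > other, self._allow_compute)

    def __bool__(self) -> bool:
        # Python needs a concrete True/False here, so the implementation
        # has to either compute now or refuse.
        if not self._allow_compute:
            raise TypeError("truth value of a lazy scalar is not available")
        return bool(self._thunk())


computing = LazyScalar(lambda: 2.5)                     # computes on bool()
strict = LazyScalar(lambda: 2.5, allow_compute=False)   # raises on bool()
```

With this sketch, `if computing > 1.0:` triggers the computation, while `if strict > 1.0:` raises `TypeError`. Both behaviours would be allowed under the wording above.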

OK to move forwards here?

@MarcoGorelli
Contributor Author

I think everyone agreed with at least the content of this PR

There is further discussion to be had, but I think we can defer it until later. I suggest we start by merging the part which we agree on

__all__ = ['Scalar']


class Scalar(Protocol):
Collaborator

I think we need a dtype property similar to what we have with Columns?

Contributor Author

@MarcoGorelli MarcoGorelli Nov 14, 2023

to be honest I'd be fine with departing completely from the idea of ducktyped Python scalars and just adding extra things (like dtype or persist) if they're useful

let's discuss this part in the next call

this would mean that Python scalars would no longer implement the Scalar Protocol
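
To make the trade-off concrete, here is a small sketch using `typing.Protocol` with `runtime_checkable`. The two protocol names are hypothetical and their method lists are deliberately tiny; the protocol in this PR lists more methods.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DuckScalar(Protocol):
    # Only methods that Python's float already provides.
    def __lt__(self, other: Any) -> Any: ...
    def __add__(self, other: Any) -> Any: ...


@runtime_checkable
class RichScalar(Protocol):
    # Same methods, plus the extras discussed above.
    def __lt__(self, other: Any) -> Any: ...
    def __add__(self, other: Any) -> Any: ...

    @property
    def dtype(self) -> Any: ...

    def persist(self) -> "RichScalar": ...


duck_ok = isinstance(3.0, DuckScalar)   # float has __lt__ and __add__
rich_ok = isinstance(3.0, RichScalar)   # float has no dtype / persist
```

A plain `float` satisfies the first protocol structurally but not the second, which is the consequence described above.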

__all__ = ['Scalar']


class Scalar(Protocol):
Collaborator

Is there a clean way from a typing perspective to allow people to pass either Scalar objects or typical Python scalars that we can handle constructing Scalars from without issue? Without that I think the API will be a bit of a pain with people having to sprinkle Scalar(...) or Scalar.from_py(...) all over.

We should also have some way to go from Python scalars to these Scalar objects.

Contributor Author

Yes, I've added an example (spec/API_specification/examples/05_scalars_example.py)

Python scalars do in fact implement the Scalar protocol, so they can be passed without any issue

I think this is an argument against adding dtype and other methods to our scalars which aren't supported by Python ones

Contributor Author

Python scalars do in fact implement the Scalar protocol, so they can be passed without any issue

at least, Python floats do

trying to do fill_null('foo') wouldn't be fine

I'm starting to see the writing on the wall for FloatScalar / IntScalar / StringScalar ...

Collaborator

I imagine Scalars would be typed following the Column dtypes, which then allows us to define a consistent set of type handling and type promotion rules. Otherwise, if Scalars are not typed similarly to Columns, you may need to introspect the value of a Scalar, for example when calling fill_null on an Int8 column with an IntScalar whose value is above what Int8 supports.
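
A rough sketch of the difference between the two approaches. `TypedScalar`, `INT_RANGES` and both helper functions are hypothetical and exist only to illustrate the point; the dtype names are plain strings chosen for the example.

```python
from dataclasses import dataclass

# Value ranges for two integer dtypes (names are illustrative).
INT_RANGES = {"int8": (-128, 127), "int64": (-2**63, 2**63 - 1)}


@dataclass(frozen=True)
class TypedScalar:
    """Hypothetical scalar that carries a dtype alongside its value."""

    value: int
    dtype: str


def fits_by_value(value: int, column_dtype: str) -> bool:
    # Untyped scalar: the only option is to inspect the value itself,
    # which may force computation for a lazy or device-backed scalar.
    low, high = INT_RANGES[column_dtype]
    return low <= value <= high


def fits_by_dtype(scalar: TypedScalar, column_dtype: str) -> bool:
    # Typed scalar: compare dtype ranges without ever touching the value.
    s_low, s_high = INT_RANGES[scalar.dtype]
    c_low, c_high = INT_RANGES[column_dtype]
    return c_low <= s_low and s_high <= c_high
```

Under these assumptions, `fits_by_value(300, "int8")` needs the concrete value 300, whereas `fits_by_dtype(TypedScalar(100, "int64"), "int8")` can reject (or promote) based on dtypes alone.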

Comment on lines 51 to 53
# null is a special object which represents a missing value.
# It is not valid as a type.
NullType = Any
Collaborator

I wonder if we can just have the special NULL scalar singleton be of type Scalar instead of a special type here? Not sure what NULL.dtype would yield though since we don't have a null or empty dtype

Contributor Author

yes, brilliant point, thanks

I guess it could just return None?

Contributor Author

from #308 (comment) , I think we may need to end up with NullScalar

@@ -18,7 +18,7 @@ class DataFrame:
...

class Column:
-    def mean(self, skip_nulls: bool = True) -> float | NullType:
+    def mean(self, skip_nulls: bool = True) -> Scalar | NullType:
Collaborator

I think it's weird that this can return either a Scalar which is a protocol or a NullType which doesn't guarantee the same interface. It makes it hard to guarantee the ability to nicely method chain here.

Contributor Author

thanks, you're right - it should just be -> Scalar, correct? Because Scalar could be backed by a null value

Contributor Author

if indeed we did go with #308 (comment), then Scalar could be a union of FloatScalar, IntScalar, NullScalar, ...

Collaborator

I don't know if we would want Scalar to be type erased similar to Column

Contributor Author

Could you clarify please? You could do isinstance(value, namespace.FloatScalar) to check the dtype

Collaborator

Apologies, we're aligned here. We have different classes per type as opposed to just a top level Scalar class where type specific functionality is hidden underneath the class (similar to our Column class).
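
For readers following along, the "different classes per type" idea from this thread might look roughly like the sketch below. The class bodies and the `kind_of` helper are hypothetical; only the names FloatScalar / IntScalar / NullScalar come from the comments above, and this is the design being floated at this point in the thread, not necessarily what was merged.

```python
class Scalar:
    """Hypothetical common base class for every typed scalar."""

    def __init__(self, value: object) -> None:
        self.value = value


class FloatScalar(Scalar):
    pass


class IntScalar(Scalar):
    pass


class NullScalar(Scalar):
    def __init__(self) -> None:
        super().__init__(None)


def kind_of(value: Scalar) -> str:
    # The class itself answers "what type is this?", so the check
    # never needs to look at the wrapped value.
    if isinstance(value, NullScalar):
        return "null"
    if isinstance(value, FloatScalar):
        return "float"
    if isinstance(value, IntScalar):
        return "int"
    raise TypeError(f"unsupported scalar class: {type(value).__name__}")
```

A function returning `Scalar` can then hand back a `NullScalar` without breaking method chaining, since every result shares the same base interface.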

@kkraus14 kkraus14 mentioned this pull request Nov 14, 2023
@MarcoGorelli MarcoGorelli marked this pull request as draft November 15, 2023 11:53
@MarcoGorelli
Contributor Author

MarcoGorelli commented Nov 15, 2023

If we add dtype / persist to the Scalar protocol, then Python scalars will no longer implement that protocol. But maybe that's OK

All we need is some separate type hints:

  • BoolScalar: either bool or Scalar
  • FloatScalar: either float or Scalar
  • ...

Then:

  • return values will typically be typed as Scalar
  • arguments where the value can either be Scalar or PythonScalar can be typed as AnyScalar

@MarcoGorelli MarcoGorelli marked this pull request as ready for review November 15, 2023 13:51
@@ -169,15 +167,31 @@ def __column_consortium_standard__(
...


PythonScalar = Union[str, int, float, bool]
Collaborator

Need date and datetime values in here as well as DateScalar and DatetimeScalar I believe.

Contributor Author

there's already namespace.date to create a scalar which can be used for date columns (e.g. see how it's used in TPCH Q1:

    mask = lineitem.col("l_shipdate") <= namespace.date(1998, 9, 2)

), I think that's enough?

NumericScalar = Union[FloatScalar, IntScalar]
StringScalar = Union[str, Scalar]
AnyScalar = Union[PythonScalar, Scalar]
NullType = Namespace.NullType
Collaborator

Is this a null scalar singleton or a null dtype or something else? If it's a null scalar singleton then I think it needs to be added to a bunch of these Union types.

PythonScalar = Union[str, int, float, bool]
BoolScalar = Union[bool, Scalar]
FloatScalar = Union[float, Scalar]
IntScalar = Union[int, Scalar]
Collaborator

Without Int8Scalar, Int16Scalar, Int32Scalar, I think we need to do value introspection to figure out how to handle type promotion or throwing or whatnot.

Should we have a Scalar class for each dtype?

Contributor Author

These are just type hints

The idea is:

  • Scalar is the Standard class for scalars. It has dtype and persist methods

  • FloatScalar is just for type hint purposes and means "either a Python float or a standard Scalar"

I could simplify and just have PythonScalar and Scalar; this is becoming too complex

@MarcoGorelli
Contributor Author

Have updated. We now have:

  • Scalar: class backed by a scalar, which allows data to reside on GPU or stay lazy. Has dtype property and persist method
  • AnyScalar: type hint meaning "either a Python scalar, or a Scalar". This makes explicit that both
    df: DataFrame
    df > 3  # supported
    df > df.col('a').mean()  # also supported
    are supported (3 is a Python scalar, df.col('a').mean() is a Scalar)
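
A minimal sketch of how an implementation might accept AnyScalar arguments internally. The `PythonScalar` and `AnyScalar` aliases mirror the ones in the diff; the toy `Scalar` class and the `as_scalar` helper are hypothetical and not part of the spec.

```python
from typing import Union

PythonScalar = Union[str, int, float, bool]


class Scalar:
    """Toy stand-in for the standard's Scalar class (illustration only)."""

    def __init__(self, value: PythonScalar) -> None:
        self.value = value


AnyScalar = Union[PythonScalar, Scalar]


def as_scalar(value: AnyScalar) -> Scalar:
    """Hypothetical helper: wrap Python scalars, pass Scalars through."""
    if isinstance(value, Scalar):
        return value
    if isinstance(value, (str, int, float, bool)):
        return Scalar(value)
    raise TypeError(f"not a scalar: {type(value).__name__}")
```

With a helper like this, users can write `df > 3` directly and the implementation normalises the argument itself, so nobody has to wrap Python values by hand at call sites.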

null is just a singleton null value. It is not a Scalar, as it does not have a dtype defined. So, for fill_nan, the fill value can be float | NullType | Scalar, meaning you can do:

df: DataFrame
ns = df.__dataframe_namespace__()
df.col('a').fill_nan(3.)  # supported (`float`)
df.col('a').fill_nan(ns.null)  # also supported (`NullType`)
df.col('a').fill_nan(df.col('b').mean())  # also supported (`Scalar`)
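
A sketch of how the three accepted fill types could be handled, using a plain Python list in place of a column. The `NullType` singleton, the toy `Scalar` and the free function `fill_nan` are all hypothetical stand-ins written for this example, not the spec's definitions.

```python
import math
from typing import List, Union


class NullType:
    """Hypothetical singleton standing in for the namespace's `null`."""

    _instance = None

    def __new__(cls) -> "NullType":
        # Always hand back the same object so `is` comparisons work.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self) -> str:
        return "null"


null = NullType()


class Scalar:
    """Toy stand-in for the standard's Scalar class."""

    def __init__(self, value: float) -> None:
        self.value = value


def fill_nan(
    values: List[float], fill: Union[float, NullType, Scalar]
) -> List[Union[float, NullType]]:
    # Unwrap a Scalar; a Python float or the null singleton is used as-is.
    if isinstance(fill, Scalar):
        fill = fill.value
    return [fill if isinstance(v, float) and math.isnan(v) else v for v in values]
```

Note that `null` carries no dtype in this sketch, which matches the reasoning above for keeping it outside the Scalar class.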

From the discussion in #294, I've also added Scalar.persist, which seems very much needed.
If people are OK with this design, then I think #294 will finally be ready (sorry @cbourjau it's taken so long, and thanks for all the discussion which your example has spurred!)
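
A rough sketch of why a persist method helps with lazy scalars. It assumes persist means "compute once and reuse the result", which is an assumption and not something this thread spells out. `LazyScalar`, `value` and `expensive_mean` are hypothetical names.

```python
from typing import Callable, List, Optional


class LazyScalar:
    """Hypothetical lazy scalar whose value is computed on demand."""

    def __init__(self, thunk: Callable[[], float]) -> None:
        self._thunk = thunk
        self._cached: Optional[float] = None
        self._persisted = False

    def persist(self) -> "LazyScalar":
        # Assumed semantics: run the computation once and keep the result.
        if not self._persisted:
            self._cached = self._thunk()
            self._persisted = True
        return self

    def value(self) -> float:
        # Without persist, every access recomputes from scratch.
        if self._persisted:
            assert self._cached is not None
            return self._cached
        return self._thunk()


calls: List[int] = []


def expensive_mean() -> float:
    calls.append(1)  # record each time the computation actually runs
    return 4.0


scalar = LazyScalar(expensive_mean).persist()
total = scalar.value() + scalar.value()
```

Here the expensive computation runs once even though the scalar is read twice; without the `persist()` call it would run on every read.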

@MarcoGorelli
Contributor Author

thanks for your review! leaving open a bit, if others have objections please speak up - else, really excited to be moving forwards, so glad we've found common ground on the design here

@MarcoGorelli MarcoGorelli merged commit 1f81476 into data-apis:main Nov 17, 2023
4 checks passed
Successfully merging this pull request may close these issues.

What's the deal with Scalars?
2 participants