-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add details to expectations for scalars #308
Conversation
90f1645
to
834e978
Compare
834e978
to
751a131
Compare
I think we need to have more details than this. I.E. is inheriting from |
I don't really have to strong an opinion on these. I can make a list of the methods which need to be supported, do you have preferences on the rest? |
I'm still generally -1 on ducktyping Python scalars because of the messiness that will come when doing things like |
agree - besides, the longer users stay within classes defined by the standard, the longer they know exactly what to expect will update then, thanks |
I've added a list of required methods |
- `__int__` | ||
- `__float__` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
punt on these
I've updated to add a Scalar class, to note that OK to move forwards here? |
I think everyone agreed with at least the content of this PR There is further discussion to be had, but I think we can defer it until later, I suggest we start by merging the part which we agree on |
__all__ = ['Scalar'] | ||
|
||
|
||
class Scalar(Protocol): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a dtype
property similar to what we have with Columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be honest I'd be fine with departing completely from the idea of ducktyped Python scalars and just adding extra things (like dtype
or persist
) if they're useful
let's discuss this part in the next call
this would mean that Python scalars would no longer implement the Scalar Protocol
__all__ = ['Scalar'] | ||
|
||
|
||
class Scalar(Protocol): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a clean way from a typing perspective to allow people to pass either Scalar
objects or typical Python scalars that we can handle constructing Scalar
s from without issue? Without that I think the API will be a bit of a pain with people having to sprinkle Scalar(...)
or Scalar.from_py(...)
all over.
We should also have some way to go from Python scalars to these Scalar
objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've added an example (spec/API_specification/examples/05_scalars_example.py
)
Python scalars do in fact implement the Scalar protocol, so they can be passed without any issue
I think this is an argument against adding dtype
and other methods to our scalars which aren't supported by Python ones
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python scalars do in fact implement the Scalar protocol, so they can be passed without any issue
at least, Python floats do
trying to do fill_null('foo')
wouldn't be fine
I'm starting to see the writing on the wall for FloatScalar
/ IntScalar
/ StringScalar
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imagine Scalars would be typed following the Column dtypes, which then allows us to define a consistent set of type handling and type promotion rules. Otherwise, if Scalars are not typed similarly to Columns, you may need to introspect the value of a Scalar for example in calling fill_null
against an Int8
dtype column with an IntScalar
where the value is above what is supported by Int8
.
# null is a special object which represents a missing value. | ||
# It is not valid as a type. | ||
NullType = Any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can just have the special NULL
scalar singleton be of type Scalar
instead of a special type here? Not sure what NULL.dtype
would yield though since we don't have a null
or empty
dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, brilliant point, thanks
I guess it could just return None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from #308 (comment) , I think we may need to end up with NullScalar
@@ -18,7 +18,7 @@ class DataFrame: | |||
... | |||
|
|||
class Column: | |||
def mean(self, skip_nulls: bool = True) -> float | NullType: | |||
def mean(self, skip_nulls: bool = True) -> Scalar | NullType: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's weird that this can return either a Scalar
which is a protocol or a NullType
which doesn't guarantee the same interface. It makes it hard to guarantee the ability to nicely method chain here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, you're right - it should just be -> Scalar
, correct? Because Scalar
could be backed by a null value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if indeed we did go with #308 (comment), then Scalar
could be a union of FloatScalar
, IntScalar
, NullScalar
, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we would want Scalar
to be type erased similar to Column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify please? You could do isinstance(value, namespace.FloatScalar)
to check the dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, we're aligned here. We have different classes per type as opposed to just a top level Scalar
class where type specific functionality is hidden underneath the class (similar to our Column
class).
If we add All we need is some separate type hints:
Then:
|
@@ -169,15 +167,31 @@ def __column_consortium_standard__( | |||
... | |||
|
|||
|
|||
PythonScalar = Union[str, int, float, bool] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need date
and datetime
values in here as well as DateScalar
and DatetimeScalar
I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's already namespace.date
to create a scalar which can be used for date columns (e.g. see how it's used in TPCH Q1:
mask = lineitem.col("l_shipdate") <= namespace.date(1998, 9, 2)
), I think that's enough?
NumericScalar = Union[FloatScalar, IntScalar] | ||
StringScalar = Union[str, Scalar] | ||
AnyScalar = Union[PythonScalar, Scalar] | ||
NullType = Namespace.NullType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a null scalar singleton or a null dtype or something else? If it's a null scalar singleton then I think it needs to be added to bunch of these Union
types.
PythonScalar = Union[str, int, float, bool] | ||
BoolScalar = Union[bool, Scalar] | ||
FloatScalar = Union[float, Scalar] | ||
IntScalar = Union[int, Scalar] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without Int8Scalar
, Int16Scalar
, Int32Scalar
, I think we need to do value introspection to figure out how to handle type promotion or throwing or whatnot.
Should we have a Scalar class for each dtype?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are just type hints
The idea is:
- Scalar is the Standard class for scalars. It has
dtype
andpersist
methods
FloatScalar is just for type hint purposes and means "either a Python float or a standard Scalar
I could simplify and just have PythonScalar and Scaler, this is becoming too complex
Have updated. We now have:
df: DataFrame
ns = df.__dataframe_namespace__()
df.col('a').fill_nan(3.) # supported (`float`)
df.col('a').fill_nan(ns.null) # also supported (`NullType`)
df.col('a').fill_nan(df.col('b').mean()) # also supported (`Scalar`) From the discussion in #294, I've also added |
thanks for your review! leaving open a bit, if others have objections please speak up - else, really excited to be moving forwards, so glad we've found common ground on the design here |
closes #305