What's the deal with Scalars? #305

Closed · MarcoGorelli opened this issue Oct 30, 2023 · 15 comments · Fixed by #308
@MarcoGorelli
Contributor

MarcoGorelli commented Oct 30, 2023

Is the following pattern supported by the Standard?

df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features


I think people expect that it should be, but there's currently no guarantee that it is.

All the API currently says about Column.std is that the return type should be a ducktyped float scalar. Let's try this:

In [1]: class FancyFloatScalar:
   ...:     def __init__(self, value):
   ...:         self._value = value
   ...:     def __float__(self):
   ...:         return self._value
   ...:

In [2]: scalar = FancyFloatScalar(0.5)

In [3]: scalar > 0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 scalar > 0

TypeError: '>' not supported between instances of 'FancyFloatScalar' and 'int'

In [4]: scalar > 0.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 scalar > 0.

TypeError: '>' not supported between instances of 'FancyFloatScalar' and 'float'

What's the way out of this?

cc @kkraus14 as you've said you wanted ducktyped scalars here

@kkraus14
Collaborator

I believe a ducktyped float scalar should be expected to implement all of the operator methods that floats support, like __add__, __sub__, __radd__, __rsub__, etc.
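
For illustration, a minimal sketch of what that would mean for the FancyFloatScalar above (a hypothetical class, not part of any standard; only a couple of the required methods are shown):

class FancyFloatScalar:
    """Hypothetical ducktyped float scalar implementing (some of) the float operators."""

    def __init__(self, value):
        self._value = value

    def __float__(self):
        return self._value

    def __gt__(self, other):
        return self._value > float(other)

    def __lt__(self, other):
        return self._value < float(other)

    def __add__(self, other):
        return FancyFloatScalar(self._value + float(other))

    __radd__ = __add__  # likewise __sub__/__rsub__, __mul__/__rmul__, __ge__, __le__, ...

scalar = FancyFloatScalar(0.5)
assert scalar > 0        # works now
assert (1 + scalar) > 1  # __radd__ handles the reflected case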

For what it's worth, I wanted a Scalar class that we used as opposed to using ducktyped Python scalars but there was seemingly consensus on using Python scalars 😄.

@MarcoGorelli
Contributor Author

Was this before I joined? If so, I'd way prefer a Scalar class tbh

@cbourjau
Contributor

cbourjau commented Nov 6, 2023

There are two different usages of Scalar in the current API draft: in argument position (i.e. as function arguments) and in return position (i.e. as the return type of a function). The example in this issue uses a scalar in return position, but it might be good to keep this separation in mind when looking for a solution.

Going through the current draft, there are nine usages of Scalar in return position among the Column member functions (get_value, max, mean, median, min, prod, std, sum, var), but I think this list should also include any and all.

It appears to me that all of these cases, with the exception of get_value (let's ignore it for now), can be served by the subset of data types provided by the array-API standard, provided immediate support for datetimes is removed (which may be the right thing to do anyway, given the intricate details of datetimes). The datetime use cases can be unambiguously handled by first converting to Unix timestamps, which are of integer data type. If we are fine with working on a subset of the array-API data types, then we can simply say that these functions return a standard rank-0 array. Returning standard arrays would immediately make the provided example work (for array implementations that are able to collect).

If we are comfortable with constraining get_value to columns of array-API-compatible data types, then we have "solved" the problem for scalars in the return position.
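
As a minimal sketch of why a rank-0 array would make the original example work (using NumPy as a stand-in for an array-API-compatible implementation; nothing here is specific to the dataframe standard):

import numpy as np

values = np.asarray([1.0, 2.0, 2.0, 3.0])
std = np.asarray(values.std())  # NumPy gives a scalar; wrap it to mimic the rank-0 array an array-API reduction returns
assert std.ndim == 0
if std > 0:  # the comparison yields a rank-0 bool array, and bool() of a rank-0 array is well-defined
    print("keep this column as a feature")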

The argument position appears somewhat obvious in conjunction with the return position: we want the following to work:

col: Column  # Assuming an array-API data type
col / col.mean()  # Scaling with a rank-0 array
col / 42  # Scaling with a Python scalar (with implicit type promotion)

Both calls desugar to a call to __truediv__(self, other), which suggests that other ought to be of type Union[int, float, Array]. Note that either call may remain lazy if a lazy array implementation is used. Other dunder methods may include bool, str, date, datetime, and/or duration in the Union, but I don't see the necessity for an explicit Scalar type in the argument position.
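
In signature form, a sketch of the above (names are illustrative, not the standard's):

from typing import Union

class Array:
    """Stand-in for an array-API (possibly rank-0) array."""

class Column:
    # Both `col / col.mean()` and `col / 42` desugar to this call;
    # a lazy implementation is free to keep either variant deferred.
    def __truediv__(self, other: Union[int, float, Array]) -> "Column":
        ...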

tl;dr: It seems within grasp that this issue can be deferred to the array API, does it not?

@kkraus14
Collaborator

kkraus14 commented Nov 6, 2023

It appears to me that all of these cases, with the exception of get_value (let's ignore it for now), can be served by the subset of data types provided by the array-API standard, provided immediate support for datetimes is removed (which may be the right thing to do anyway, given the intricate details of datetimes).

These can all return NULL, so we can't use the array-API data types.

@cbourjau
Contributor

cbourjau commented Nov 6, 2023

Oh my! The good old Null strikes again! Technically, they all (with the possibly erroneous exception of get_value) return Scalar | NullType. NullType objects are inherently eager, so one way or another, handling the null case is tricky!

@kkraus14
Collaborator

kkraus14 commented Nov 6, 2023

I believe the idea behind ducktyping was to allow having, say, an Int32Scalar whose value could be NULL, without needing to introspect the value to check whether it's NULL and coerce it to NullType.
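
A sketch of that idea (Int32Scalar is hypothetical; the point is that NULL propagates through operations instead of being coerced to NullType up front):

class Int32Scalar:
    """Hypothetical typed scalar whose value may be NULL (represented as None)."""

    def __init__(self, value):
        self._value = value  # an int, or None for NULL

    def __gt__(self, other):
        if self._value is None:
            return Int32Scalar(None)  # NULL comparisons stay NULL (Kleene-style)
        return self._value > other

    def __repr__(self):
        return f"Int32Scalar({self._value!r})"

print(Int32Scalar(3) > 0)     # True
print(Int32Scalar(None) > 0)  # Int32Scalar(None) -- still NULL, no introspection needed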

@MarcoGorelli
Contributor Author

If you want to do

if df.col('a').std() > 0:
   ...

then you'll have had to address the nulls beforehand anyway.

@cbourjau is your suggestion that the example from #315 would become

df: DataFrame
df = df.may_execute()
arr = df.col('a').fill_null(float('nan')).to_array()
array_namespace = arr.__array_namespace__()
if array_namespace.std(arr) > 0:
    ...

, where may_execute would really just be a hint that libraries with lazy arrays could choose to ignore, and bool(scalar) would be deferred to the array API?

@cbourjau
Contributor

cbourjau commented Nov 6, 2023

Rather something like this, which would be a bit more user-friendly, IMHO:

df: DataFrame
df = df.may_execute() # see PR #307 
arr: Array = df.col('a').fill_null(float('nan')).std()
# or maybe better
arr: Array = df.col('a').std(default='nan')
if arr > 0:
    ...

@kkraus14
Collaborator

kkraus14 commented Nov 6, 2023

If you want to do

if df.col('a').std() > 0:
   ...

then you'll have had to address the nulls anyway beforehand

I would expect calling __bool__ against a NULL value to throw, due to it being ambiguous. That being said, pandas has the skipna parameter and PyArrow has the skip_nulls parameter for APIs like min, max, std, etc. to handle this general case.

You can't fill the nulls beforehand because you can't always pick a good sentinel value to get correct results, and you always need to be able to handle the situation where all of your values are NULL.
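
For reference, both of those knobs in action (as of recent pandas / PyArrow versions; note their defaults differ: pandas' std uses ddof=1 while PyArrow's stddev uses ddof=0, so ddof is aligned explicitly below):

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

s = pd.Series([1.0, 2.0, None])
print(s.std(skipna=True))   # ignores the null
print(s.std(skipna=False))  # nan: a single null poisons the result

arr = pa.array([1.0, 2.0, None])
print(pc.stddev(arr, skip_nulls=True, ddof=1))   # ignores the null
print(pc.stddev(arr, skip_nulls=False, ddof=1))  # a null scalar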

@MarcoGorelli
Contributor Author

Right, so if we can't piggyback off of the array API for bool(scalar), then... what do we do?

How would you deal with

df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features

in a standard-compliant Dask implementation?

@kkraus14
Collaborator

kkraus14 commented Nov 6, 2023

Assuming df.col(column_name).std() returns a Scalar, then Scalar.__gt__ would also return a Scalar (let's ignore being able to handle the Python scalar here for the time being). Then we have Scalar.__bool__, which is the point of contention. I think the question is whether we wish to allow Scalar.__bool__ to throw in the case that something is lazy, or whether we want Scalar.__bool__ to force computation in order to resolve the local Python bool value.
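
A sketch of the two options (LazyScalar is an illustrative name, not a proposed API):

class LazyScalar:
    """Hypothetical lazy scalar backed by a deferred computation."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable producing the concrete value

    # Option 1: __bool__ forces computation and returns a local Python bool.
    def __bool__(self):
        return bool(self._compute())

    # Option 2 (the mutually exclusive alternative): __bool__ throws,
    # forcing the user to collect explicitly first:
    #
    # def __bool__(self):
    #     raise TypeError("lazy scalar; collect it explicitly before calling bool()")

s = LazyScalar(lambda: 0.5 > 0)
if s:  # under Option 1, computation is triggered right here
    print("computed")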

To add some fuel to the fire... PyArrow's Scalar class doesn't implement __bool__, so it hits the default Python implementation, which always returns True simply because the instance exists...

import pyarrow as pa

a = pa.scalar(False, type=pa.bool_())
if a:
  print("Uh oh 1!")
  
b = pa.scalar(None, type=pa.bool_())
if b:
  print("Uh oh 2!")
  
# Uh oh 1!
# Uh oh 2!

So yea... we should probably figure out what we want to do here.

How would you deal with

df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features

in a standard-compliant dask implementation?

You can't execute this today without possibly recomputing df at every loop iteration unless Dask did some kind of fancy tracking / caching internally. Based on the current standard and current Dask, I'd use df.std() instead of looping over the columns, and then compare each column's single value in order to get the list of features. This is unique to std(); I'm sure we could work up a different example that requires looping over the columns as opposed to using a DataFrame-level function.

Outside of the standard, I'd add a df = df.persist() call before the loop, which doesn't make things eager but prevents recalculating df at each iteration.
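
Concretely, that workaround would look something like this (schematic, in the style of the examples above; persist is Dask-specific and outside the standard):

df: DataFrame  # backed by Dask
df = df.persist()  # materialize the graph once; the API stays lazy, but results are cached
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:  # each bool() still collects, but no longer recomputes df
        features.append(column_name)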

@MarcoGorelli
Contributor Author

thanks

Based on the current standard and current Dask, I'd use df.std() instead of looping over the columns and then compare each column's single value in order to get the list of features

Could you show the code, please? I think you'll still run into issues, but I may have misunderstood.

@kkraus14
Collaborator

kkraus14 commented Nov 6, 2023

You are correct: you still run into the problem when you try to get the scalars after using df.std(). Without some kind of persist() / compute() / collect() mechanism, the dataframe can be recomputed each time. Dask being able to cross the DataFrame / array border lazily actually hurts it here...

df: DataFrame
feature_df = df.std() > 0  # Yields a 1 row dataframe with each column having a `True` / `False` value, lazily in the case of Dask
features = []

# feature_array = feature_df.to_array()  # This would be an escape hatch to only trigger computation once after doing all of the "expensive" stuff, but in Dask's case it stays lazy
# feature_array = feature_array[0]  # Slice to a 1d-array since feature_df was 1 row
# for i in range(feature_df.shape[1]):
#     if feature_array[i]:  # Not sure if `__bool__` works on 0d arrays per the standard either?
#         features.append(df.column_names[i])

for column_name in feature_df.column_names:
    if feature_df.col(column_name).get_value(0):  # Assuming `get_value(0)` is lazy, the subsequent `__bool__` call will trigger computation here, which would cause recomputing `feature_df` each time
        features.append(column_name)
return features

@MarcoGorelli
Contributor Author

Indeed, thanks for looking into this

What do you suggest as a solution?

@kkraus14
Collaborator

kkraus14 commented Nov 7, 2023

Indeed, thanks for looking into this

What do you suggest as a solution?

I don't have a suggestion here currently unfortunately. I agree this is problematic for lazy libraries. I'd suggest we take the conversation to #307 to brainstorm on a solution as this goes beyond just Scalars.
