What's the deal with Scalars? #305

Closed · MarcoGorelli opened this issue Oct 30, 2023 · 15 comments · Fixed by #308
@MarcoGorelli
Contributor

MarcoGorelli commented Oct 30, 2023

Is the following pattern supported by the Standard?

df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features


I think people expect that it should be, but there's currently no guarantee that it is.

All the API currently says about Column.std is that the return type should be a ducktyped float scalar. Let's try this:

In [1]: class FancyFloatScalar:
   ...:     def __init__(self, value):
   ...:         self._value = value
   ...:     def __float__(self):
   ...:         return self._value
   ...:

In [2]: scalar = FancyFloatScalar(0.5)

In [3]: scalar > 0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 scalar > 0

TypeError: '>' not supported between instances of 'FancyFloatScalar' and 'int'

In [4]: scalar > 0.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 scalar > 0.

TypeError: '>' not supported between instances of 'FancyFloatScalar' and 'float'

What's the way out of this?

cc @kkraus14 as you've said you wanted ducktyped scalars here

@kkraus14
Collaborator

I believe a ducktyped float scalar should be expected to implement all of the operator methods that floats support, like __add__, __sub__, __radd__, __rsub__, etc.
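
For illustration, a minimal sketch of what that would mean for the FancyFloatScalar above (a hypothetical class, not part of any standard; only a couple of the required methods are shown):

class FancyFloatScalar:
    """Hypothetical ducktyped float scalar implementing (some of) the float operators."""

    def __init__(self, value):
        self._value = value

    def __float__(self):
        return self._value

    def __gt__(self, other):
        return self._value > float(other)

    def __lt__(self, other):
        return self._value < float(other)

    def __add__(self, other):
        return FancyFloatScalar(self._value + float(other))

    __radd__ = __add__  # likewise __sub__/__rsub__, __mul__/__rmul__, __ge__, __le__, ...

scalar = FancyFloatScalar(0.5)
assert scalar > 0        # works now
assert (1 + scalar) > 1  # __radd__ handles the reflected case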

For what it's worth, I wanted a Scalar class that we used as opposed to using ducktyped Python scalars but there was seemingly consensus on using Python scalars 😄.

@MarcoGorelli
Contributor Author

Was this before I joined? If so, I'd way prefer a Scalar class tbh

@cbourjau
Contributor

cbourjau commented Nov 6, 2023

There are two different usages of Scalar in the current API draft: in argument position (i.e. as function arguments) and in return position (i.e. as the return type of a function). The example in this issue uses a scalar in return position, but it might be good to keep this separation in mind when looking for a solution.

Going through the current draft, there are nine usages of Scalar in return position among the Column member functions (get_value, max, mean, median, min, prod, std, sum, var), but I think this list should also include any and all.

It appears to me that all of these cases, with the exception of get_value (let's ignore it for now), can be served by the subset of data types provided by the array-API standard, provided immediate support for datetimes is removed (which may be the right thing to do anyway, given the intricate details of datetimes). The datetime use cases can be unambiguously handled by first converting to Unix timestamps, which are of integer data type. If we are fine with working on a subset of the array-API data types, then we can simply say that these functions return a standard rank-0 array. Returning standard arrays would immediately make the provided example work (for array implementations that are able to collect).

If we are comfortable with constraining get_value to columns of array-API-compatible data types, then we have "solved" the problem for scalars in the return position.
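
As a minimal sketch of why a rank-0 array would make the original example work (using NumPy as a stand-in for an array-API-compatible implementation; nothing here is specific to the dataframe standard):

import numpy as np

values = np.asarray([1.0, 2.0, 2.0, 3.0])
std = np.asarray(values.std())  # NumPy gives a scalar; wrap it to mimic the rank-0 array an array-API reduction returns
assert std.ndim == 0
if std > 0:  # the comparison yields a rank-0 bool array, and bool() of a rank-0 array is well-defined
    print("keep this column as a feature")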

The argument position appears somewhat obvious in conjunction with the return position: we want the following to work:

col: Column  # Assuming an array-API data type
col / col.mean()  # Scaling with a rank-0 array
col / 42  # Scaling with a Python scalar (with implicit type promotion)

Both calls desugar to a call to __truediv__(self, other), which suggests that other ought to be of type Union[int, float, Array]. Note that either call may remain lazy if a lazy array implementation is used. Other dunder methods may include bool, str, date, datetime, and/or duration in the Union, but I don't see the necessity for an explicit Scalar type in the argument position.
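
In signature form, a sketch of the above (names are illustrative, not the standard's):

from typing import Union

class Array:
    """Stand-in for an array-API (possibly rank-0) array."""

class Column:
    # Both `col / col.mean()` and `col / 42` desugar to this call;
    # a lazy implementation is free to keep either variant deferred.
    def __truediv__(self, other: Union[int, float, Array]) -> "Column":
        ...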

tl;dr: It seems within grasp that this issue can be deferred to the array API, does it not?

@kkraus14
Collaborator

kkraus14 commented Nov 6, 2023

It appears to me that all of these cases, with the exception of get_value (let's ignore it for now), can be served by the subset of data types provided by the array-API standard, provided immediate support for datetimes is removed (which may be the right thing to do anyway, given the intricate details of datetimes).

These can all return NULL, so we can't use the array-API data types.

@cbourjau
Contributor

cbourjau commented Nov 6, 2023

Oh my! The good old Null strikes again! Technically, they all (with the possibly erroneous exception of get_value) return Scalar | NullType. NullType objects are inherently eager, so one way or another, handling the null case is tricky!

@kkraus14
Collaborator

kkraus14 commented Nov 6, 2023

I believe the idea behind ducktyping was to allow having, say, an Int32Scalar whose value could be NULL, without needing to introspect the value to check whether it's NULL and coerce it to NullType.
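
A sketch of that idea (Int32Scalar is hypothetical; the point is that NULL propagates through operations instead of being coerced to NullType up front):

class Int32Scalar:
    """Hypothetical typed scalar whose value may be NULL (represented as None)."""

    def __init__(self, value):
        self._value = value  # an int, or None for NULL

    def __gt__(self, other):
        if self._value is None:
            return Int32Scalar(None)  # NULL comparisons stay NULL (Kleene-style)
        return self._value > other

    def __repr__(self):
        return f"Int32Scalar({self._value!r})"

print(Int32Scalar(3) > 0)     # True
print(Int32Scalar(None) > 0)  # Int32Scalar(None) -- still NULL, no introspection needed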

@MarcoGorelli
Contributor Author

If you want to do

if df.col('a').std() > 0:
   ...

then you'll have had to address the nulls beforehand anyway.

@cbourjau is your suggestion that the example from #315 would become

df: DataFrame
df = df.may_execute()
arr = df.col('a').fill_null(float('nan')).to_array()
array_namespace = arr.__array_namespace__()
if array_namespace.std(arr) > 0:
    ...

, where may_execute would really just be a hint that libraries with lazy arrays could choose to ignore, and bool(scalar) would be deferred to the array API?

@cbourjau
Contributor

cbourjau commented Nov 6, 2023

Rather something like this, which would be a bit more user-friendly, IMHO:

df: DataFrame
df = df.may_execute() # see PR #307 
arr: Array = df.col('a').fill_null(float('nan')).std()
# or maybe better
arr: Array = df.col('a').std(default='nan')
if arr > 0:
    ...

@kkraus14
Collaborator

kkraus14 commented Nov 6, 2023

If you want to do

if df.col('a').std() > 0:
   ...

then you'll have had to address the nulls anyway beforehand

I would expect calling __bool__ against a NULL value to throw, due to it being ambiguous. That being said, pandas has the skipna parameter and PyArrow has the skip_nulls parameter for APIs like min, max, std, etc. to handle this general case.

You can't fill the nulls beforehand because you can't always pick a good sentinel value to get correct results, and you always need to be able to handle the situation where all of your values are NULL.
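
For reference, both of those knobs in action (as of recent pandas / PyArrow versions; note their defaults differ: pandas' std uses ddof=1 while PyArrow's stddev uses ddof=0, so ddof is aligned explicitly below):

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

s = pd.Series([1.0, 2.0, None])
print(s.std(skipna=True))   # ignores the null
print(s.std(skipna=False))  # nan: a single null poisons the result

arr = pa.array([1.0, 2.0, None])
print(pc.stddev(arr, skip_nulls=True, ddof=1))   # ignores the null
print(pc.stddev(arr, skip_nulls=False, ddof=1))  # a null scalar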

@MarcoGorelli
Contributor Author

Right, so if we can't piggyback off of the array API for bool(scalar), then... what do we do?

How would you deal with

df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features

in a standard-compliant Dask implementation?

@kkraus14
Collaborator

kkraus14 commented Nov 6, 2023

Assuming df.col(column_name).std() returns a Scalar, then Scalar.__gt__ would also return a Scalar (let's ignore being able to handle the Python scalar here for the time being). Then we have Scalar.__bool__, which is the point of contention. I think the question is whether we wish to allow Scalar.__bool__ to throw in the case that something is lazy, or whether we want Scalar.__bool__ to force computation in order to resolve the local Python bool value.
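
A sketch of the two options (LazyScalar is an illustrative name, not a proposed API):

class LazyScalar:
    """Hypothetical lazy scalar backed by a deferred computation."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable producing the concrete value

    # Option 1: __bool__ forces computation and returns a local Python bool.
    def __bool__(self):
        return bool(self._compute())

    # Option 2 (the mutually exclusive alternative): __bool__ throws,
    # forcing the user to collect explicitly first:
    #
    # def __bool__(self):
    #     raise TypeError("lazy scalar; collect it explicitly before calling bool()")

s = LazyScalar(lambda: 0.5 > 0)
if s:  # under Option 1, computation is triggered right here
    print("computed")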

To add some fuel to the fire... PyArrow's Scalar class doesn't implement __bool__, so it hits the default Python implementation, which always returns True simply because the instance exists...

import pyarrow as pa

a = pa.scalar(False, type=pa.bool_())
if a:
  print("Uh oh 1!")
  
b = pa.scalar(None, type=pa.bool_())
if b:
  print("Uh oh 2!")
  
# Uh oh 1!
# Uh oh 2!

So yea... we should probably figure out what we want to do here.

How would you deal with

df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features

in a standard-compliant dask implementation?

You can't execute this today without possibly recomputing df at every loop iteration unless Dask did some kind of fancy tracking / caching internally. Based on the current standard and current Dask, I'd use df.std() instead of looping over the columns, and then compare each column's single value in order to get the list of features. This is unique to std(); I'm sure we could work up a different example that requires looping over the columns as opposed to using a DataFrame-level function.

Outside of the standard, I'd add a df = df.persist() call before the loop, which doesn't make things eager but prevents recalculating df at each iteration.
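
Concretely, that workaround would look something like this (schematic, in the style of the examples above; persist is Dask-specific and outside the standard):

df: DataFrame  # backed by Dask
df = df.persist()  # materialize the graph once; the API stays lazy, but results are cached
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:  # each bool() still collects, but no longer recomputes df
        features.append(column_name)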

@MarcoGorelli
Contributor Author

thanks

Based on the current standard and current Dask, I'd use df.std() instead of looping over the columns and then compare each column's single value in order to get the list of features

Could you show the code, please? I think you'll still run into issues, but I may have misunderstood.

@kkraus14
Collaborator

kkraus14 commented Nov 6, 2023

You are correct: you still run into the problem when you try to get the scalars after using df.std(). Without some kind of persist() / compute() / collect() mechanism, the dataframe can be recomputed each time. Dask being able to cross the DataFrame / array border lazily actually hurts it here...

df: DataFrame
feature_df = df.std() > 0  # Yields a 1 row dataframe with each column having a `True` / `False` value, lazily in the case of Dask
features = []

# feature_array = feature_df.to_array()  # This would be an escape hatch to only trigger computation once after doing all of the "expensive" stuff, but in Dask's case it stays lazy
# feature_array = feature_array[0]  # Slice to a 1d-array since feature_df was 1 row
# for i in range(feature_df.shape[1]):
#     if feature_array[i]:  # Not sure if `__bool__` works on 0d arrays per the standard either?
#         features.append(df.column_names[i])

for column_name in feature_df.column_names:
    if feature_df.col(column_name).get_value(0):  # Assuming `get_value(0)` is lazy, the subsequent `__bool__` call will trigger computation here, which would cause recomputing `feature_df` each time
        features.append(column_name)
return features

@MarcoGorelli
Contributor Author

Indeed, thanks for looking into this

What do you suggest as a solution?

@kkraus14
Collaborator

kkraus14 commented Nov 7, 2023

Indeed, thanks for looking into this

What do you suggest as a solution?

I don't have a suggestion here currently unfortunately. I agree this is problematic for lazy libraries. I'd suggest we take the conversation to #307 to brainstorm on a solution as this goes beyond just Scalars.
