What's the deal with Scalars? #305
I mean a ducktyped float scalar I believe should be expected to implement all of the operator methods that floats support like For what it's worth, I wanted a |
Was this before I joined? If so I'd way prefer a Scalar class tbh |
There are two different usages of Going through the current draft, there are currently nine usages of It appears to me that all these cases with the exception of If we are comfortable with constraining The argument-position appears somewhat obvious in conjunction with the return-position: We want the following to work: col: Column # Assuming an array-api data type
col / col.mean() # Scaling with a rank-0 array
col / 42 # scaling with a Python scalar (with implicit type promotion) Both calls desugar to a call to tl;dr: It seems within grasp that this issue can be deferred to the array API, does it not? |
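As a rough illustration of the two calls above: in Python, both `col / col.mean()` and `col / 42` dispatch to the left operand's `__truediv__`. The sketch below uses a toy `Column` and a hypothetical `Rank0` stand-in for a rank-0 array; neither is the standard's actual class, and the names are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Rank0:
    """Hypothetical stand-in for a rank-0 array holding a single float."""

    value: float

    def __float__(self) -> float:
        return self.value


class Column:
    """Toy column for illustration; not the standard's Column class."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def mean(self) -> Rank0:
        # The reduction returns a rank-0 value rather than a Python float.
        return Rank0(sum(self.values) / len(self.values))

    def __truediv__(self, other: Union[int, float, Rank0]) -> "Column":
        # float() accepts Python numbers and anything implementing __float__,
        # so both operand kinds go through the same code path.
        divisor = float(other)
        return Column(v / divisor for v in self.values)


col = Column([2, 4, 6])
scaled_by_mean = col / col.mean()  # rank-0 operand
scaled_by_int = col / 42           # Python scalar operand
```

Under these assumptions the argument position and the return position line up: whatever a reduction returns only needs to be acceptable as an operand.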
These can all return |
Oh my! The good old Null strikes again! Technically, they all (with the possibly erroneous exception of |
I believe the idea behind ducktyping was to allow having say an |
If you want to do if df.col('a').std() > 0:
... then you'll have had to address the nulls anyway beforehand @cbourjau is your suggestion that the example from #315 would become df: DataFrame
df = df.may_execute()
arr = df.col('a').fill_null(float('nan')).to_array()
array_namespace = arr.__array_namespace__()
if array_namespace.std(arr) > 0:
... , where |
Rather something like this, which would be a bit more user-friendly, IMHO: df: DataFrame
df = df.may_execute() # see PR #307
arr: Array = df.col('a').fill_null(float('nan')).std()
# or maybe better
arr: Array = df.col('a').std(default='nan')
if arr > 0:
... |
Calling You can't fill the nulls beforehand because you can't always pick a good sentinel value to get correct results, and you always need to be able to handle the situation where all of your values are |
Right, so if we can't piggy-back off of the Array API for How would you deal with df: DataFrame
features = []
for column_name in df.column_names:
    if df.col(column_name).std() > 0:
        features.append(column_name)
return features in a standard-compliant dask implementation? |
Assuming To add some fuel to the fire... PyArrow's import pyarrow as pa
a = pa.scalar(False, type=pa.bool_())
if a:
    print("Uh oh 1!")
b = pa.scalar(None, type=pa.bool_())
if b:
    print("Uh oh 2!")
# Uh oh 1!
# Uh oh 2! So yea... we should probably figure out what we want to do here.
You can't execute this today without possibly recomputing Outside of the standard, I'd add a |
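One possible direction for the truthiness problem shown with the PyArrow scalars, sketched in plain Python: objects that define neither `__bool__` nor `__len__` are always truthy, so a scalar wrapper has to define `__bool__` explicitly to behave sensibly. The `NullableBool` class below is hypothetical; it is neither PyArrow's scalar nor anything the standard specifies, and raising on null is just one of several defensible choices.

```python
from typing import Optional


class NullableBool:
    """Hypothetical null-aware boolean scalar, for illustration only."""

    def __init__(self, value: Optional[bool]):
        self._value = value

    def is_null(self) -> bool:
        return self._value is None

    def __bool__(self) -> bool:
        # Refuse to guess: a null is neither True nor False.
        if self._value is None:
            raise TypeError("cannot convert a null scalar to bool")
        return self._value


a = NullableBool(False)
if a:
    print("not reached: a wraps False")

b = NullableBool(None)
try:
    if b:
        print("not reached: b is null")
except TypeError as exc:
    print(f"refused: {exc}")
```

With this design the caller is forced to handle the null case (for example via `is_null()`) before branching, instead of silently taking the `True` branch.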
thanks
Could you show the code please? I think you'll still run into issues, but I may have misunderstood |
You are correct, you still run into the problem when you try to get the scalars after using df: DataFrame
feature_df = df.std() > 0 # Yields a 1 row dataframe with each column having a `True` / `False` value, lazily in the case of Dask
features = []
# feature_array = feature_df.to_array() # This would be an escape hatch to only trigger computation once after doing all of the "expensive" stuff, but in Dask's case it stays lazy
# feature_array = feature_array[0] # Slice to a 1d-array since feature_df was 1 row
# for i in range(feature_df.shape[1]):
#     if feature_array[i]: # Not sure if `__bool__` works on 0d arrays per the standard either?
#         features.append(df.column_names[i])
for column_name in feature_df.column_names:
    if feature_df.col(column_name).get_value(0): # Assuming `get_value(0)` is lazy, the subsequent `__bool__` call will trigger computation here, which would cause recomputing `feature_df` each time
        features.append(column_name)
return features |
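The recomputation concern in the snippet above can be made concrete with a toy model. The sketch below is not Dask and not the standard's API: `LazyFrame`, `LazyFlag`, `lazy_flag`, and `compute_all` are hypothetical names, and the "expensive" step is a plain Python loop with a counter attached. It only shows that calling `__bool__` once per column runs the whole computation once per column, while materializing all flags together runs it once.

```python
class LazyFrame:
    """Toy lazy frame that counts how often the expensive step runs."""

    def __init__(self, data):
        self._data = data
        self.compute_count = 0

    def _compute_std_positive(self):
        # Stand-in for executing an expensive task graph.
        self.compute_count += 1
        result = {}
        for name, values in self._data.items():
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            result[name] = variance > 0
        return result

    def lazy_flag(self, name):
        # Returns a lazy scalar; nothing is computed yet.
        return LazyFlag(self, name)

    def compute_all(self):
        # Materialize every column's flag in a single pass.
        return self._compute_std_positive()


class LazyFlag:
    """Lazy boolean scalar whose __bool__ triggers a full computation."""

    def __init__(self, frame, name):
        self._frame = frame
        self._name = name

    def __bool__(self):
        return self._frame._compute_std_positive()[self._name]


data = {"a": [1.0, 2.0], "b": [3.0, 3.0], "c": [0.0, 5.0]}

# One implicit computation per column.
per_column = LazyFrame(data)
features = [name for name in data if per_column.lazy_flag(name)]

# One explicit computation for all columns.
batched = LazyFrame(data)
flags = batched.compute_all()
features_batched = [name for name, keep in flags.items() if keep]
```

Both loops select the same columns; the difference is only in `compute_count`, which is what makes an implicit `__bool__` on lazy scalars costly.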
Indeed, thanks for looking into this What do you suggest as a solution? |
I don't have a suggestion here currently unfortunately. I agree this is problematic for lazy libraries. I'd suggest we take the conversation to #307 to brainstorm on a solution as this goes beyond just Scalars. |
Is the following pattern supported by the Standard?
?
I think people are expecting that it should, but there's currently really no guarantee that it does.
All the API currently says about Column.std is that the return type should be a ducktyped float scalar. Let's try this: What's the way out of this?
cc @kkraus14 as you've said you wanted ducktyped scalars here