Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: DataFrame.agg has no partial failure #40211

Closed
rhshadrach opened this issue Mar 4, 2021 · 2 comments · Fixed by #40238 or #40288
Closed

API: DataFrame.agg has no partial failure #40211

rhshadrach opened this issue Mar 4, 2021 · 2 comments · Fixed by #40238 or #40288
Labels
API - Consistency Internal Consistency of API/Behavior Apply Apply, Aggregate, Transform, Map Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@rhshadrach
Copy link
Member

Synthetic example:

def foo(x):
    if x.name == "a":
        raise ValueError
    return x

def bar(x):
    if x.name == "a":
        raise ValueError
    return x.sum()

df = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [1, 2, 3]}
)
print(df.transform([foo]))
print(df.agg([bar]))

The transform call results in

    b
  foo
0   1
1   2
2   3

whereas the agg call fails outright due to the raise ValueError. Series/DataFrame apply just calls agg. I also tried but couldn't find an example similar to transform's behavior of partial failure using DataFrame.groupby with apply/agg/transform, but wasn't able to (the code paths are a bit complex here, so I've resorted to blackbox testing).

My thinking here is having transform fail outright in the example above is the good way to go. It would be simpler from a code perspective and avoids silent failure, although perhaps it would make a user drop nuisance columns. I'll also mention that one of the things I'd like to work toward is having e.g.

df.transform(['mean'])

have the same performance as

df.transform('mean')

for which having transform allow partial failure like this means there would need to be a fallback.

cc @jorisvandenbossche @jreback @jbrockmendel

@rhshadrach rhshadrach added Needs Discussion Requires discussion from core team before further action Apply Apply, Aggregate, Transform, Map API - Consistency Internal Consistency of API/Behavior labels Mar 4, 2021
@jbrockmendel
Copy link
Member

My thinking here is having transform fail outright in the example above is the good way to go. It would be simpler from a code perspective and avoids silent failure

Agreed.

DataFrame.groupby

This becomes a giant PITA bc we have a bunch of places where we pin the 'name' attribute to an object before passing it to a UDF. I'm not aware of any documentation for this. But it seems like the kind of thing that would mess with your examples.

@rhshadrach
Copy link
Member Author

Reopening because I missed the behavior difference when raising a TypeError instead of a ValueError (see below). We deprecated any partial-failure in #40238 for Series/DataFrame.transform, but we could have instead only deprecated when the exception raised is not a TypeError.

While I personally prefer not allowing any partial failure (i.e. as a user, I think it is my responsibility to select the columns that I want an operation to work on), I think we should partially revert #40238 so that only TypeErrors can give rise to partial failure in Series/DataFrame.transform (any other exception type would instead raise in a future version). This would then be consistent with most of the groupby behavior. If we want to take it a step further and also not allow TypeErrors, that can be done later.

Summary with TypeError (code below):

DataFrame.agg with callable: raised TypeError()
DataFrame.agg with list: partial failure allowed
DataFrame.transform with callable: raised AssertionError()
DataFrame.transform with list: partial failure allowed (but is deprecated in 1.3)
DataFrame.groupby(...).agg with callable: partial failure allowed
DataFrame.groupby(...).agg with list: partial failure allowed
DataFrame.groupby(...).transform with callable: partial failure allowed
DataFrame.groupby(...).transform with list: raised TypeError("unhashable type: 'list'")

Note: df.transform(func) where func is a callable will fallback to func(df) which is why the assertion error is raised.

Code:

def foo(x):
    if np.sum(np.sum(x)) < 10:
        raise TypeError
    return x

def bar(x):
    if np.sum(np.sum(x)) < 10:
        raise TypeError
    return x.sum()

df = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [400, 500, 600]}
)

for use_gb in (False, True):
    for method in ('agg', 'transform'):
        for is_list in (False, True):
            obj = df
            if use_gb:
                obj = obj.groupby(level=0)
            func = foo if method == 'transform' else bar
            if is_list:
                func = [func]
            try:
                res = getattr(obj, method)(func)
                assert res.shape[1] == 1
                result = 'partial failure allowed'
            except Exception as err:
                result = f'raised {repr(err)}'
            print(
                f'DataFrame{".groupby(...)" if use_gb else ""}.{method} '
                f'{"with list" if is_list else "with callable"}: {result}'
            )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Apply Apply, Aggregate, Transform, Map Needs Discussion Requires discussion from core team before further action
Projects
None yet
3 participants