Better error for str.cat with listlike of wrong dtype. #26607

Merged
merged 9 commits into from
Jun 14, 2019
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.25.0.rst
@@ -575,7 +575,7 @@ Strings
^^^^^^^

- Bug in the ``__name__`` attribute of several methods of :class:`Series.str`, which were set incorrectly (:issue:`23551`)
-
- Improved error message when passing ``Series`` of wrong dtype to :meth:`Series.str.cat` (:issue:`22722`)
jreback marked this conversation as resolved.
Contributor

use :class:`Series`

-


56 changes: 39 additions & 17 deletions pandas/core/strings.py
@@ -2280,6 +2280,23 @@ def cat(self, others=None, sep=None, na_rep=None, join=None):
'must all be of the same length as the '
'calling Series/Index.')

        # data has already been checked by _validate to be of correct dtype,
        # but others could still have Series of dtypes (e.g. integers) which
        # will necessarily fail in concatenation. To avoid deep and confusing
        # traces, we raise here for anything that's not object or all-NA float.
        def _legal_dtype(series):
Contributor

this is way complicated, what exactly are you trying to do here

Contributor Author

I originally had an inline condition within any, but @simonjayhawkins found this too complex, so I broke out that condition into a function.

Basically, I want to fail early for any Series that will necessarily fail concatenation (based on dtype). Object must obviously be allowed, as must all-NA float (which can happen if two Series completely misalign), and categoricals need separate handling.

Contributor

do you not already have the inferred types at this point? and if you don't, why not just infer them, then this condition becomes easier

Member

I originally had an inline condition within any, but @simonjayhawkins found this too complex, so I broke out that condition into a function.

The any is applied to a generator expression with a for loop, and then you raise, so I'm not sure the use of any here is a benefit. Could you not just use a for loop and do away with the separate function?

Contributor Author

@jreback: do you not already have the inferred types at this point? and if you don't, why not just infer them, then this condition becomes easier

The dtype has only been inferred for data at this point, not for others. I want to avoid inferring it for others, as that would lead to another (not insignificant) perf hit, whereas reading out the dtypes is trivial.
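The trade-off discussed in this thread can be sketched as follows. This is a minimal illustration, not the code under review: `first_illegal_dtype` is a hypothetical helper, it uses the for loop suggested above instead of `any`, and it swaps `is_categorical_dtype` for an `isinstance` check against `pd.CategoricalDtype`. The example Series are built with explicit `dtype=object` so the result does not depend on the default string dtype of the installed pandas version.

```python
import numpy as np
import pandas as pd


def first_illegal_dtype(others):
    """Return the dtype of the first Series that cannot hold strings, else None.

    Hypothetical helper (not part of pandas). It only reads ``.dtype``;
    values are touched solely for the all-NA check on float Series.
    """
    for series in others:
        dtype = series.dtype
        if isinstance(dtype, pd.CategoricalDtype):
            # for categoricals, the dtype of the categories is what matters
            dtype = series.cat.categories.dtype
        if dtype == object:
            continue
        if dtype == "float64" and series.isna().all():
            continue
        return series.dtype  # stop at the first offender
    return None


strings = pd.Series(["a", "b"], dtype=object)
all_na = pd.Series([np.nan, np.nan])          # float64, nothing but NaN
ints = pd.Series([1, 2])
mixed = pd.Series([1, 2, "b"], dtype=object)  # ints hidden in object dtype

ok = first_illegal_dtype([strings, all_na])       # None: both are allowed
offender = first_illegal_dtype([strings, ints])   # the integer dtype
hidden = first_illegal_dtype([mixed])             # None: dtype alone looks fine
scanned = pd.api.types.infer_dtype(mixed)         # requires scanning the values
```

The `mixed` case shows why a dtype check alone cannot catch everything: non-string values inside an object-dtype Series are only detectable by scanning the values (as `infer_dtype` does) or by attempting the concatenation and catching the failure.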

            # unify dtype handling between categorical/non-categorical
            dtype = (series.dtype if not is_categorical_dtype(series)
                     else series.cat.categories.dtype)
            legal = dtype == 'O' or (dtype == 'float' and series.isna().all())
            return legal
        err_wrong_dtype = ('Can only concatenate list-likes containing only '
                           'strings (or missing values).')
        if any(not _legal_dtype(x) for x in others):
            raise TypeError(err_wrong_dtype + ' Received list-like of dtype: '
                            '{}'.format([x.dtype for x in others
                                         if not _legal_dtype(x)][0]))

        if join is None and warn:
            warnings.warn("A future version of pandas will perform index "
                          "alignment when `others` is a Series/Index/"
@@ -2307,23 +2324,28 @@ def cat(self, others=None, sep=None, na_rep=None, join=None):
        na_masks = np.array([isna(x) for x in all_cols])
        union_mask = np.logical_or.reduce(na_masks, axis=0)

        if na_rep is None and union_mask.any():
            # no na_rep means NaNs for all rows where any column has a NaN
            # only necessary if there are actually any NaNs
            result = np.empty(len(data), dtype=object)
            np.putmask(result, union_mask, np.nan)

            not_masked = ~union_mask
            result[not_masked] = cat_core([x[not_masked] for x in all_cols],
                                          sep)
        elif na_rep is not None and union_mask.any():
            # fill NaNs with na_rep in case there are actually any NaNs
            all_cols = [np.where(nm, na_rep, col)
                        for nm, col in zip(na_masks, all_cols)]
            result = cat_core(all_cols, sep)
        else:
            # no NaNs - can just concatenate
            result = cat_core(all_cols, sep)
        # if there are any non-string, non-null values hidden within an object
        # dtype, cat_core will fail; catch error and return with better message
        try:
            if na_rep is None and union_mask.any():
                # no na_rep means NaNs for all rows where any column has a NaN
                # only necessary if there are actually any NaNs
                result = np.empty(len(data), dtype=object)
                np.putmask(result, union_mask, np.nan)

                not_masked = ~union_mask
                result[not_masked] = cat_core([x[not_masked]
                                               for x in all_cols], sep)
Contributor

can you limit the try/except to the relevant code

Contributor Author

I am; cat_core is called in all three branches, and may raise in any one of them.

Contributor

so rather than having a giant try/except, either add this check in the cat_core, or write a function which calls cat_core and catches and formats a nicer error

maybe try to combine the _legal_dtype check with this, IOW. you can just try to cat them then catch an error, at which point you can then infer and give a nice message.

Contributor Author

@jreback: so rather than having a giant try/except, either add this check in the cat_core, or write a function which calls cat_core and catches and formats a nicer error
maybe try to combine the _legal_dtype check with this, IOW. you can just try to cat them then catch an error, at which point you can then infer and give a nice message.

Moved try/except to cat_safe, which wraps around cat_core to do what you describe in the second sentence.
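The wrapper described in this reply could look roughly like the sketch below. It is deliberately pandas-free: `cat_core` here is a stand-in that joins plain Python lists, and the body of `cat_safe` and the wording of the error message are assumptions made for illustration, not the code that was merged.

```python
def cat_core(columns, sep):
    """Stand-in for the concatenation kernel (assumption: plain lists).

    Joins the i-th elements of all columns with ``sep``; ``str.join`` raises
    TypeError as soon as it meets a non-string element.
    """
    return [sep.join(row) for row in zip(*columns)]


def cat_safe(columns, sep):
    """Call cat_core and translate a bare TypeError into a clearer message."""
    try:
        return cat_core(columns, sep)
    except TypeError as err:
        # Only inspect the values once concatenation has actually failed,
        # so the common all-strings path pays no extra cost.
        for column in columns:
            for value in column:
                if not isinstance(value, str):
                    raise TypeError(
                        "Concatenation requires list-likes containing only "
                        "strings (or missing values). Offending element "
                        "has type: {}".format(type(value).__name__)
                    ) from err
        raise  # no offending element found: re-raise the original error


joined = cat_safe([["a", "b"], ["x", "y"]], "-")

try:
    cat_safe([["a", "b"], [1, 2]], "-")
except TypeError as exc:
    message = str(exc)
```

Keeping the try/except inside a single wrapper means each branch of the calling code can call `cat_safe` directly, so no large try block around all three branches is needed.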

            elif na_rep is not None and union_mask.any():
                # fill NaNs with na_rep in case there are actually any NaNs
                all_cols = [np.where(nm, na_rep, col)
                            for nm, col in zip(na_masks, all_cols)]
                result = cat_core(all_cols, sep)
            else:
                # no NaNs - can just concatenate
                result = cat_core(all_cols, sep)
        except TypeError:
            raise TypeError(err_wrong_dtype)

        if isinstance(self._orig, Index):
            # add dtype for case that result is all-NA
16 changes: 16 additions & 0 deletions pandas/tests/test_strings.py
@@ -420,6 +420,22 @@ def test_str_cat_categorical(self, box, dtype_caller, dtype_target, sep):
        result = s.str.cat(t, sep=sep)
        assert_series_or_index_equal(result, expected)

    # test integer/float dtypes (inferred by constructor) and mixed
    @pytest.mark.parametrize('data', [[1, 2, 3], [.1, .2, .3], [1, 2, 'b']],
                             ids=['integers', 'floats', 'mixed'])
    # without dtype=object, np.array would cast [1, 2, 'b'] to ['1', '2', 'b']
    @pytest.mark.parametrize('box', [Series, Index, list,
                                     lambda x: np.array(x, dtype=object)],
                             ids=['Series', 'Index', 'list', 'np.array'])
    def test_str_cat_wrong_dtype_raises(self, box, data):
        # GH 22722
        s = Series(['a', 'b', 'c'])
        t = box(data)

        msg = 'Can only concatenate list-likes containing only strings.*'
simonjayhawkins marked this conversation as resolved.
        with pytest.raises(TypeError, match=msg):
            s.str.cat(t, join='left')

    @pytest.mark.parametrize('box', [Series, Index])
    def test_str_cat_mixed_inputs(self, box):
        s = Index(['a', 'b', 'c', 'd'])