Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better error for str.cat with listlike of wrong dtype. #26607

Merged
merged 9 commits into from
Jun 14, 2019

Conversation

h-vetinari
Copy link
Contributor

This had been blocked on #23167.

@codecov
Copy link

codecov bot commented Jun 1, 2019

Codecov Report

Merging #26607 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26607      +/-   ##
==========================================
- Coverage   91.85%   91.84%   -0.01%     
==========================================
  Files         174      174              
  Lines       50707    50709       +2     
==========================================
- Hits        46578    46576       -2     
- Misses       4129     4133       +4
Flag Coverage Δ
#multiple 90.39% <100%> (ø) ⬆️
#single 41.76% <0%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/strings.py 98.92% <100%> (ø) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0dbb99e...81451bd. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 1, 2019

Codecov Report

Merging #26607 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26607      +/-   ##
==========================================
- Coverage   91.86%   91.85%   -0.01%     
==========================================
  Files         179      179              
  Lines       50700    50709       +9     
==========================================
+ Hits        46574    46579       +5     
- Misses       4126     4130       +4
Flag Coverage Δ
#multiple 90.44% <100%> (ø) ⬆️
#single 41.09% <21.42%> (-0.1%) ⬇️
Impacted Files Coverage Δ
pandas/core/strings.py 98.93% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.88% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a60888c...9752aa7. Read the comment docs.

@gfyoung gfyoung added Error Reporting Incorrect or improved errors from pandas Series Series data structure labels Jun 1, 2019
for x in others):
# data has already been checked by str-accessor
raise TypeError('Can only concatenate list-likes containing only '
'strings (or missing values)!')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No exclamation points

pandas/core/strings.py Outdated Show resolved Hide resolved
Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@h-vetinari my main concern is the complexity of the condition.

pandas/core/strings.py Outdated Show resolved Hide resolved
pandas/tests/test_strings.py Outdated Show resolved Hide resolved
pandas/tests/test_strings.py Outdated Show resolved Hide resolved
@simonjayhawkins simonjayhawkins added this to the 0.25.0 milestone Jun 1, 2019
@@ -420,6 +420,16 @@ def test_str_cat_categorical(self, box, dtype_caller, dtype_target, sep):
result = s.str.cat(t, sep=sep)
assert_series_or_index_equal(result, expected)

@pytest.mark.parametrize('box', [Series, Index, np.array, list])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do these cover the different conditions for raising?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO yes. It would be possible to add more dtypes, but I don't consider this a great improvement.

Happy to add if you don't mind combinatorially more tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if i modify the condition, we have tests that fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what you mean here. Adding different containers (as long as they are legal inputs to .str.cat) is possible, but I've already added the most import ones.

It would be possible add a further parametrization over the dtype and not just test that integer Series fail (as currently), but also floats, complex, datetime etc. This would work without failure.

You'd have to be more concrete with "modify the condition" for me to answer more clearly. This test serves as a relatively minimal example from #22722. If you want more dtypes, I'm happy to add them. Anything else you'll need to explain to me in more detail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not raising the message on an invalid case would just result in the old message be raised. rasing on valid cases is a regression.

would it not be simpler to catch the message and reraise?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@h-vetinari thanks for the explanation. i'm not convinced that for the sake of a better error message we should be adding this complexity, especially if the old message is still raised in some cases.

@jreback

Copy link
Contributor Author

@h-vetinari h-vetinari Jun 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not convinced that for the sake of a better error message we should be adding this complexity, especially if the old message is still raised in some cases.

Looking at #22722, I think this is a clear win - it covers 99% (guesstimate) of real-world cases where this error happens (putting a Series with a transparently incompatible dtype into .str.cat). If someone actually has mixed strings / integers in their Series, it just won't fail as gracefully. But that doesn't imply (to me) that we shouldn't aim to do better in the 99%.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: if your only worry is an inconsistent error message, I could have the 99%-fail-early-check and catch-and-reraise afterwards (with the same error message).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

best to wait for feedback from other maintainers before you make any changes on my account.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too late. ;-)

# no NaNs - can just concatenate
result = cat_core(all_cols, sep)
except TypeError as exc:
if re.match((r'can only concatenate str \(not [\w\"]+) to str'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we reach here, we already have a user-facing TypeError? and we are just replacing the message.

are any other TypeErrors expected, i'd be inclined to not check the message check.

if it is necessary, maybe have a _is_not_string_exception in pandas\core\dtypes\common.py similar to _is_unorderable_exception rather than have the condition here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are any other TypeErrors expected, i'd be inclined to not check the message check.

Fair enough, that's a not gonna happen.

doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved

not_masked = ~union_mask
result[not_masked] = cat_core([x[not_masked]
for x in all_cols], sep)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you limit the try/except to the relevant code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am; cat_core is called in all three branches, and may raise in any one of them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so rather than having a giant try/except, either add this check in the cat_core, or write a function which calls cat_core and catches and formats a nicer error

maybe try to combine the _legal_dtype check with this, IOW. you can just try to cat them then catch an error, at which point you can then infer and give a nice message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback: so rather than having a giant try/except, either add this check in the cat_core, or write a function which calls cat_core and catches and formats a nicer error
maybe try to combine the _legal_dtype check with this, IOW. you can just try to cat them then catch an error, at which point you can then infer and give a nice message.

Moved try/except to cat_safe, which wraps around cat_core to do what you describe in the second sentence.

# but others could still have Series of dtypes (e.g. integers) which
# will necessarily fail in concatenation. To avoid deep and confusing
# traces, we raise here for anything that's not object or all-NA float.
def _legal_dtype(series):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is way complicated, what exactly are you trying to do here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had an inline condition within any, but @simonjayhawkins found this too complex, so I broke out that condition into a function.

Basically, I want to fail early for any Series that will necessarily fail concatenation (based on dtype). Object must obviously be allowed, but also all-NA float (which can happen if two Series completely misalign), plus needs handling of categorical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you not already have the inferred types at this point? and if you don't, why not just infer them, then this condition becomes easier

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had an inline condition within any, but @simonjayhawkins found this too complex, so I broke out that condition into a function.

the any is applied to a generator expesssion with a for loop and then raising. so i'm not sure the use of any here is a benefit. could you not just use a for loop and do away with the separate function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback: do you not already have the inferred types at this point? and if you don't, why not just infer them, then this condition becomes easier

The dtype has only been inferred for data at this point, not for others. I want to avoid inferring for other, as that would lead to another (not insignificant) perf-hit, whereas reading out the dtypes is trivial.

@jreback jreback removed this from the 0.25.0 milestone Jun 2, 2019
Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responses

doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
# but others could still have Series of dtypes (e.g. integers) which
# will necessarily fail in concatenation. To avoid deep and confusing
# traces, we raise here for anything that's not object or all-NA float.
def _legal_dtype(series):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had an inline condition within any, but @simonjayhawkins found this too complex, so I broke out that condition into a function.

Basically, I want to fail early for any Series that will necessarily fail concatenation (based on dtype). Object must obviously be allowed, but also all-NA float (which can happen if two Series completely misalign), plus needs handling of categorical.


not_masked = ~union_mask
result[not_masked] = cat_core([x[not_masked]
for x in all_cols], sep)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am; cat_core is called in all three branches, and may raise in any one of them.

# but others could still have Series of dtypes (e.g. integers) which
# will necessarily fail in concatenation. To avoid deep and confusing
# traces, we raise here for anything that's not object or all-NA float.
def _legal_dtype(series):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you not already have the inferred types at this point? and if you don't, why not just infer them, then this condition becomes easier


not_masked = ~union_mask
result[not_masked] = cat_core([x[not_masked]
for x in all_cols], sep)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so rather than having a giant try/except, either add this check in the cat_core, or write a function which calls cat_core and catches and formats a nicer error

maybe try to combine the _legal_dtype check with this, IOW. you can just try to cat them then catch an error, at which point you can then infer and give a nice message.

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback @simonjayhawkins

I moved the try/except + raise into a separate wrapper, as @jreback suggested. This allows to remove the extra check block I had introduced.

However, removing that extra check has the edge case that wrongly dtyped data that is completely misaligned will slip through (i.e. not raise):

>>> s = pd.Series(['a', 'b', 'c'])
>>> t = pd.Series([0, 1, 2], index=[10, 11, 12])
>>> s.str.cat(t, join='left')
0    NaN
1    NaN
2    NaN
dtype: object

It is a downside, but I would be fine with it, not least considering the much less complicated code now.

# but others could still have Series of dtypes (e.g. integers) which
# will necessarily fail in concatenation. To avoid deep and confusing
# traces, we raise here for anything that's not object or all-NA float.
def _legal_dtype(series):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback: do you not already have the inferred types at this point? and if you don't, why not just infer them, then this condition becomes easier

The dtype has only been inferred for data at this point, not for others. I want to avoid inferring for other, as that would lead to another (not insignificant) perf-hit, whereas reading out the dtypes is trivial.


not_masked = ~union_mask
result[not_masked] = cat_core([x[not_masked]
for x in all_cols], sep)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback: so rather than having a giant try/except, either add this check in the cat_core, or write a function which calls cat_core and catches and formats a nicer error
maybe try to combine the _legal_dtype check with this, IOW. you can just try to cat them then catch an error, at which point you can then infer and give a nice message.

Moved try/except to cat_safe, which wraps around cat_core to do what you describe in the second sentence.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks much more reasonable. typing & doc-string comments, ping on green.

@@ -53,6 +53,27 @@ def cat_core(list_of_columns, sep):
return np.sum(list_with_sep, axis=0)


def cat_safe(list_of_columns, sep):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type the args & return value & add a Parameters / Returns section

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's your expectation for typing the args? Just List? It would strictly speaking be List[np.array], but AFAICT, mypy resp. the typing module doesn't yet support numpy stubs natively.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that would be fine

Same signature as cat_core, but handles TypeErrors in concatenation, which
happen if the Series in list_of columns have the wrong dtypes or content.
"""
# if there are any non-string values (wrong dtype or hidden behind object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move the comment to the except

@@ -607,7 +607,7 @@ Strings
^^^^^^^

- Bug in the ``__name__`` attribute of several methods of :class:`Series.str`, which were set incorrectly (:issue:`23551`)
-
- Improved error message when passing ``Series`` of wrong dtype to :meth:`Series.str.cat` (:issue:`22722`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use :class:`Series`

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typing comments

@@ -53,6 +53,27 @@ def cat_core(list_of_columns, sep):
return np.sum(list_with_sep, axis=0)


def cat_safe(list_of_columns, sep):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's your expectation for typing the args? Just List? It would strictly speaking be List[np.array], but AFAICT, mypy resp. the typing module doesn't yet support numpy stubs natively.

@simonjayhawkins
Copy link
Member

@h-vetinari if you merge master the checks ci failure should be fixed. I think i've seen the Windows py37_np141 failure before. probably a flaky test. likely pass next time around.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typing request, otherwise lgtm. ping on green.

@@ -53,6 +53,27 @@ def cat_core(list_of_columns, sep):
return np.sum(list_with_sep, axis=0)


def cat_safe(list_of_columns, sep):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that would be fine

@jreback jreback added this to the 0.25.0 milestone Jun 13, 2019
Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much more comfortable without the pre-check to fail early and simpler code.

Thanks @h-vetinari

raise TypeError(
'Concatenation requires list-likes containing only '
'strings (or missing values). Offending values found in '
'column {}'.format(dtype)) from None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth having a raise outside the for loop to ensure we don't slip through, or is that not going to happen?

@jreback jreback merged commit 5d0ff69 into pandas-dev:master Jun 14, 2019
@jreback
Copy link
Contributor

jreback commented Jun 14, 2019

thanks @h-vetinari

@h-vetinari h-vetinari deleted the str_cat_err branch June 14, 2019 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Series Series data structure
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve TypeError message for str.cat
4 participants