str.contains - returns series of zeroes instead of series of bools when all values are NaNs. #9184

Sereger13 · 2015-01-02T14:22:34Z

import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['x', 'y', 'z'], 'b': [np.nan, np.nan, np.nan]})

Applying to string column - produces correct result:

df['a'].str.contains('c', na=False)
0    False
1    False
2    False
Name: a, dtype: bool

Applying to float column - returns zeroes instead of bools and return type is float64:

df['b'].str.contains('c', na=False)
0    0
1    0
2    0
Name: b, dtype: float64


Sereger13 · 2015-01-02T14:32:45Z

INSTALLED VERSIONS

commit: None
python: 2.7.5.final.0
python-bits: 32
OS: Linux
OS-release: 2.6.18-238.9.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.15.0
nose: 1.3.0
Cython: 0.20
numpy: 1.7.1
scipy: 0.13.0
statsmodels: 0.5.0
IPython: 2.2.0
sphinx: 1.1.3
patsy: 0.2.1
dateutil: 1.5
pytz: 2013b
bottleneck: None
tables: 3.1.0
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.6.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: None
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.8.3
pymysql: None
psycopg2: None

jreback · 2015-01-02T17:03:25Z

Hmm. The behavior of .str on non-object/string-like columns is actually suspect in general. We could raise, as we already do with .cat when an operation does not apply to that type.

Thoughts?

@jorisvandenbossche @shoyer @cpcloud

shoyer · 2015-01-02T18:57:13Z

I agree, .str probably should raise on non-string columns instead of the operations returning vectors of NaN. But all NaN columns even with float dtype are somewhat ambiguous, so it would be reasonable to still interpret them as (plausibly) string-like.

cpcloud · 2015-01-04T17:42:41Z

IMO cat and str shouldn't even be attributes on a series unless it has the proper dtype. For example a float64 column should throw an attribute error on str access. This can be achieved by overriding getattr behavior.

I also don't think an all nan column of float64 is ambiguous. @shoyer can you elaborate on why you think that's ambiguous?
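To make the attribute-error idea concrete, here is a minimal pure-Python sketch. It is not pandas code: `ToySeries`, `StrAccessor`, and `DtypeGatedAccessor` are hypothetical names invented for illustration, and the dtype is modelled as a plain string. It uses a descriptor rather than a `__getattr__` override, and also trims `dir()` so the attribute does not show up in tab completion.

```python
class StrAccessor:
    """Hypothetical stand-in for a string-methods namespace."""

    def __init__(self, values):
        self._values = values

    def contains(self, pat):
        # Plain substring test on each element.
        return [pat in v for v in self._values]


class DtypeGatedAccessor:
    """Descriptor that exposes an accessor only for allowed dtypes."""

    def __init__(self, name, accessor_cls, allowed_dtypes):
        self._name = name
        self._accessor_cls = accessor_cls
        self._allowed = frozenset(allowed_dtypes)

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # accessed on the class, not an instance
        if obj.dtype not in self._allowed:
            # AttributeError makes hasattr() report False as well.
            raise AttributeError(
                "Can only use .%s with dtypes %s, not %r"
                % (self._name, sorted(self._allowed), obj.dtype)
            )
        return self._accessor_cls(obj.values)


class ToySeries:
    """Toy container; dtype is just a label in this sketch."""

    str = DtypeGatedAccessor("str", StrAccessor, {"object"})

    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype

    def __dir__(self):
        # Hide the accessor from dir()/tab completion when it would raise.
        names = set(super().__dir__())
        if self.dtype != "object":
            names.discard("str")
        return sorted(names)


strings = ToySeries(["x", "y", "z"], "object")
floats = ToySeries([float("nan")] * 3, "float64")

print(strings.str.contains("x"))
print(hasattr(floats, "str"))
print("str" in dir(floats))
```

With this layout a float-labelled series behaves as if `.str` did not exist, much like `s.blarg`, while a string-labelled one keeps the accessor.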

jreback · 2015-01-04T18:03:23Z

.cat does this now

but .str is still the original code

cpcloud · 2015-01-04T18:04:49Z

What I'm saying is that it shouldn't even show up in tab completion, just like if you did s.blarg

jreback · 2015-01-04T18:06:45Z

hmm it should be taken out of local_dir then

shoyer · 2015-01-04T19:27:42Z

@cpcloud the all nan column of float64 is somewhat ambiguous only because pandas presumes that all NA lists, for example, should be floats. For example, consider the original example here:

df = pd.DataFrame({'a': ['x', 'y', 'z'], 'b': [np.nan, np.nan, np.nan]})

We don't have any clues for the type of column b. To indicate that it's a string type, you would need to write something like np.array([np.nan, np.nan, np.nan], dtype=object).

That said... this is an edge case that probably isn't worth worrying about. Likely only expert users even realize that we use nan to mark missing values in string arrays.
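A short sketch of the inference point above. It passes `dtype=object` to the `Series` constructor instead of building an `np.array` first, which is an assumption made for brevity; the idea is the same.

```python
import numpy as np
import pandas as pd

# No type hints at all: pandas infers float64 for an all-NaN list.
inferred = pd.Series([np.nan, np.nan, np.nan])
print(inferred.dtype)  # float64

# Stating the dtype explicitly keeps the column object (string-like).
explicit = pd.Series([np.nan, np.nan, np.nan], dtype=object)
print(explicit.dtype)  # object

# On the object column, missing entries are filled by na=False.
result = explicit.str.contains('c', na=False)
print(result.tolist())  # [False, False, False]
```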

dalejung · 2015-01-04T19:32:46Z

We should patch pandas to send anonymous usage statistics. Would help answer the frequency of edge cases :/

jreback · 2015-01-04T19:49:58Z

@shoyer presuming an all-NA list is float-like has been long-standing, and is the most likely case. Agreed that the user would have to explicitly specify another dtype.

Ok, so this issue is one of fixing the visibility of .str/.cat in the Series namespace when the dtype is not appropriate (which, as a by-product, fixes operating on a non-object series for strings).

jorisvandenbossche · 2015-01-05T22:52:12Z

I fully agree that .str should raise an error on a non-string series. It should also give an informative message saying that it is only available for string types, and that you can use "astype(str)" to obtain one.

But I don't know if it is worth the effort to also have it not visible in the Series namespace. Still seeing it also on non-string series can make people more aware of its existence, and maybe remind them they can make it strings to use that function.
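A minimal sketch of the "astype(str)" route for the column from the original report, assuming the caller really wants to treat the values as text:

```python
import numpy as np
import pandas as pd

# All-NaN column, inferred as float64 as in the original report.
b = pd.Series([np.nan, np.nan, np.nan], name='b')

# Convert explicitly before using the string methods.
as_text = b.astype(str)
mask = as_text.str.contains('c', na=False)
print(mask.tolist())  # [False, False, False]
```

One caveat: depending on the pandas version, astype(str) may turn NaN into the literal text 'nan', which a pattern such as 'n' would then match, so it is worth checking how missing values come out before relying on this.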

Sereger13 · 2015-01-06T09:24:38Z

Is it feasible to convert series to string type (i.e. apply "astype(str)") internally when calling .str? I always thought that is what was going on under the hood.

jreback added the API Design, Dtype Conversions, and Strings labels Jan 2, 2015

jreback added this to the 0.16.0 milestone Jan 2, 2015

jreback mentioned this issue Jan 22, 2015

ENH/DOC: reimplement Series delegates/accessors using descriptors #9322

Merged

shoyer closed this as completed in #9322 Jan 25, 2015
