str.contains - returns series of zeroes instead of series of bools when all values are NaNs. #9184

Closed
Sereger13 opened this issue Jan 2, 2015 · 12 comments · Fixed by #9322
Labels
API Design, Dtype Conversions (unexpected or buggy dtype conversions), Strings (string extension data type and string data)

@Sereger13
Contributor

import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['x', 'y', 'z'], 'b': [np.nan, np.nan, np.nan]})

Applying to string column - produces correct result:

df['a'].str.contains('c', na=False)
0    False
1    False
2    False
Name: a, dtype: bool

Applying to float column - returns zeroes instead of bools and return type is float64:

df['b'].str.contains('c', na=False)
0    0
1    0
2    0
Name: b, dtype: float64
@Sereger13
Contributor Author

INSTALLED VERSIONS

commit: None
python: 2.7.5.final.0
python-bits: 32
OS: Linux
OS-release: 2.6.18-238.9.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.15.0
nose: 1.3.0
Cython: 0.20
numpy: 1.7.1
scipy: 0.13.0
statsmodels: 0.5.0
IPython: 2.2.0
sphinx: 1.1.3
patsy: 0.2.1
dateutil: 1.5
pytz: 2013b
bottleneck: None
tables: 3.1.0
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.6.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: None
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.8.3
pymysql: None
psycopg2: None

@jreback
Contributor

jreback commented Jan 2, 2015

Hmm. The behavior of .str on columns that are not object/string-like is suspect in general. We could raise instead, as we already do when an operation does not apply to the dtype (e.g. with .cat).

Thoughts?

@jorisvandenbossche @shoyer @cpcloud

@jreback jreback added the API Design, Dtype Conversions, and Strings labels Jan 2, 2015
@jreback jreback added this to the 0.16.0 milestone Jan 2, 2015
@shoyer
Member

shoyer commented Jan 2, 2015

I agree, .str probably should raise on non-string columns instead of having the operations return vectors of NaN. But all-NaN columns, even with float dtype, are somewhat ambiguous, so it would be reasonable to still interpret them as (plausibly) string-like.

@cpcloud
Member

cpcloud commented Jan 4, 2015

IMO cat and str shouldn't even be attributes on a series unless it has the proper dtype. For example a float64 column should throw an attribute error on str access. This can be achieved by overriding getattr behavior.

I also don't think an all nan column of float64 is ambiguous. @shoyer can you elaborate on why you think that's ambiguous?
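
The idea of gating the accessor on dtype can be sketched in plain Python. This is only an illustration under stated assumptions: ToySeries and ToyStringMethods are hypothetical stand-ins, not pandas classes, and the dtype is modeled as a plain string.

```python
class ToyStringMethods:
    """Hypothetical string accessor holding the wrapped values."""

    def __init__(self, values):
        self._values = values

    def contains(self, pat, na=False):
        # Non-string entries (e.g. NaN) are filled with `na`.
        return [pat in v if isinstance(v, str) else na for v in self._values]


class ToySeries:
    """Hypothetical stand-in for a Series; not pandas code."""

    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so `str` is
        # deliberately not defined as a regular attribute.
        if name == "str":
            if self.dtype != "object":
                raise AttributeError(
                    f"'str' accessor is not available for dtype {self.dtype!r}"
                )
            return ToyStringMethods(self.values)
        raise AttributeError(name)

    def __dir__(self):
        # Advertise `str` (e.g. for tab completion) only when it applies.
        names = set(super().__dir__())
        if self.dtype == "object":
            names.add("str")
        return sorted(names)


strings = ToySeries(["x", "y", "z"], "object")
floats = ToySeries([float("nan")] * 3, "float64")

print(strings.str.contains("c"))
print(hasattr(floats, "str"))
print("str" in dir(strings), "str" in dir(floats))
```

In this sketch the float64 object neither exposes `str` through attribute access nor lists it in dir(), which is the behavior being proposed here.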

@jreback
Contributor

jreback commented Jan 4, 2015

.cat does this now

but .str is still the original code

@cpcloud
Member

cpcloud commented Jan 4, 2015

What I'm saying is that it shouldn't even show up in tab completion, just like if you did s.blarg

@jreback
Contributor

jreback commented Jan 4, 2015

hmm it should be taken out of local_dir then

@shoyer
Member

shoyer commented Jan 4, 2015

@cpcloud the all-NaN float64 column is somewhat ambiguous only because pandas presumes that an all-NA list should be float. Consider the original example here:

df = pd.DataFrame({'a': ['x', 'y', 'z'], 'b': [np.nan, np.nan, np.nan]})

We don't have any clues for the type of column b. To indicate that it's a string type, you would need to write something like np.array([np.nan, np.nan, np.nan], dtype=object).
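
The inference described above can be checked directly. A minimal sketch, assuming pandas and NumPy are installed; the variable names are illustrative, and dtype=object is passed to the Series constructor rather than through np.array:

```python
import numpy as np
import pandas as pd

# With no other clues, an all-NaN list is inferred as float64.
inferred = pd.Series([np.nan, np.nan, np.nan])
print(inferred.dtype)

# Passing dtype=object explicitly marks the column as potentially string-like.
explicit = pd.Series([np.nan, np.nan, np.nan], dtype=object)
print(explicit.dtype)
```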

That said... this is an edge case that probably isn't worth worrying about. Likely only expert users even realize that we use nan to mark missing values in string arrays.

@dalejung
Contributor

dalejung commented Jan 4, 2015

We should patch pandas to send anonymous usage statistics. Would help answer the frequency of edge cases :/

@jreback
Contributor

jreback commented Jan 4, 2015

@shoyer presuming an all-NA list is float-like has been long-standing, and is the most likely case. Agreed that the user would have to explicitly specify another dtype.

Ok, so this issue is one of fixing the visibility of .str/.cat in the Series namespace when the dtype is not appropriate (which, as a by-product, fixes string operations on a non-object series).

@jorisvandenbossche
Member

I fully agree that .str should raise an error on a non-string series. It should also give an informative message saying that it is only available for string types, and that you can use "astype(str)" to convert.

But I don't know if it is worth the effort to also hide it from the Series namespace. Seeing it on non-string series as well can make people more aware of its existence, and may remind them that they can convert to strings in order to use it.

@Sereger13
Contributor Author

Is it feasible to convert series to string type (i.e. apply "astype(str)") internally when calling .str? I always thought that is what was going on under the hood.
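
One consideration for an implicit conversion: Python's str() turns a float NaN into the text 'nan', so missing values would start matching patterns. A plain-Python sketch, assuming an implicit cast would behave like str() applied to each value (whether pandas' astype(str) does so may depend on the version):

```python
values = [float("nan"), float("nan"), float("nan")]

# Convert each value to text the way str() does.
as_text = [str(v) for v in values]

# The pattern 'c' still finds nothing, but 'n' now matches every row,
# because the missing values have become the literal text 'nan'.
matches_c = ["c" in s for s in as_text]
matches_n = ["n" in s for s in as_text]

print(as_text)
print(matches_c)
print(matches_n)
```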
