Disable .str-accessor for byte data #23011

h-vetinari · 2018-10-05T21:29:03Z

This supersedes #22721.

Pandas is trying to straddle many different chasms, which leads to undesirable behaviour on the fringes. For the purpose of this issue, I'm talking mainly about

supporting python 2/3 (will be over soon...)
being largely based on numpy's type system

From the first point, we have the inconsistent handling of str vs. bytes, so having the Series-concatenator work with bytes is a necessity in Python 2.

Mostly due to the second point, there's no proper string dtype, it's just hiding in the object dtype. I started #22721 as a side issue which came up while refactoring in #22725. Then I got told that:

We do NOT handle bytes in .str if you want to add tests and raise, pls do so, but not going to 'make it work better'. It is amazingly confusing and causes all sorts of errors. We probably don't have explicit checks on this (though I thought that we always infer on the strings that must be string/unicode and never bytes).

However, it works already -- the Series.str-accessor already checks that it can only be called on an object column, but there's not much more it can do (not least because inspecting every element of a Series would be very performance-intense). Consequently, .str.cat currently does work on bytes data, and easily at that:

>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series(np.array(list('abc'), 'S1').astype(object))
>>> t = pd.Series(np.array(list('def'), 'S1').astype(object))
>>> s.str.cat(t, sep=b'')
0    b'ad'
1    b'be'
2    b'cf'
dtype: object
>>> s.str.cat(t, sep=b',')
0    b'a,d'
1    b'b,e'
2    b'c,f'
dtype: object

Long story short - this issue supersedes #22721, and should serve as a long term goal to disable .str once Python 2 gets dropped and/or there is a string dtype.

The text was updated successfully, but these errors were encountered:

jreback · 2018-10-06T15:05:10Z

we already have the capability to infer what object types are, and use this in many places in the codebase, so this is pretty easy to do. all byte operations should be disabled explicity.

In [12]: s = pd.Series(np.array(list('abc'), 'S1').astype(object))
    ...: t = pd.Series(list('def'))
    ...: 

In [13]: from pandas.api.types import infer_dtype

In [14]: infer_dtype(s)
Out[14]: 'bytes'

In [15]: infer_dtype(t)
Out[15]: 'string'

h-vetinari · 2018-10-06T15:25:51Z

@jreback

Then it should be a pretty simple fix in the __init__ of the .str-accessor to check that infer_dtype != 'bytes'... And for the list of Series in .str.cat after _get_series_list(other).

h-vetinari mentioned this issue Oct 5, 2018

str.cat not working with binary data on Python3 #22721

Closed

This was referenced Oct 15, 2018

.str._validate should infer for Series, not raise for all-na Index #23163

Closed

API: Series.str-accessor infers dtype (and Index.str does not raise on all-NA) #23167

Merged

jreback added API Design Error Reporting Incorrect or improved errors from pandas Strings String extension data type and string data labels Oct 17, 2018

jreback added this to the Contributions Welcome milestone Oct 17, 2018

h-vetinari mentioned this issue Jan 28, 2019

RLS: 0.25.0 #24950

Closed

jreback modified the milestones: Contributions Welcome, 0.25.0 May 19, 2019

jreback closed this as completed in #23167 Jun 1, 2019

h-vetinari mentioned this issue Jul 26, 2019

DEPR: execute deprecations for str.cat in v1.0 #27611

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Disable .str-accessor for byte data #23011

Disable .str-accessor for byte data #23011

h-vetinari commented Oct 5, 2018

jreback commented Oct 6, 2018

h-vetinari commented Oct 6, 2018 •

edited

Loading

Disable .str-accessor for byte data #23011

Disable .str-accessor for byte data #23011

Comments

h-vetinari commented Oct 5, 2018

jreback commented Oct 6, 2018

h-vetinari commented Oct 6, 2018 • edited Loading

h-vetinari commented Oct 6, 2018 •

edited

Loading