Series.str.rsplit not working with regex patterns (v 0.24.2 and 0.25.2) #29633

jamespreed · 2019-11-15T11:36:55Z

Code Sample, a copy-pastable example if possible

import pandas as pd

S = pd.Series(['1+1=2'])
S.str.rsplit(r'\+|=')
# returns:
0    [1+1=2]

S.str.rsplit(r'\+|=', expand=True)
# returns:
       0
0  1+1=2

Problem description

The str.rsplit method does not recognize regex patterns; the pattern is treated as a literal string, so nothing is split. The example above is from the documentation.

Expected Output

S.str.rsplit(r'\+|=')
0    [1, 1, 2]

S.str.rsplit(r'\+|=', expand=True)
     0    1    2
0    1    1    2

Output of `pd.show_versions()`

Version 0.24.2:

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: 5.0.1
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.12
numpy: 1.16.4
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.6.1
sphinx: 2.1.2
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.2
numexpr: 2.6.9
feather: None
matplotlib: 3.1.0
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.3.4
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.5
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Version 0.25.2:

INSTALLED VERSIONS
------------------
commit : None
python : 3.6.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.2
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None


asishm · 2019-11-17T01:08:19Z

Had a look at it,

the rsplit method (https://github.com/pandas-dev/pandas/blob/master/pandas/core/strings.py#L2535-L2539) calls str_rsplit (pandas/pandas/core/strings.py, lines 1407 to 1413 in 94412ee):

def str_rsplit(arr, pat=None, n=None):
    if n is None or n == 0:
        n = -1
    f = lambda x: x.rsplit(pat, n)
    res = _na_map(f, arr)
    return res

which just seems to be applying str.rsplit, which of course doesn't use regex.
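The difference described here can be reproduced in plain Python, without pandas. This is a minimal sketch using the string and pattern from the original report; the variable names are illustrative only.

```python
import re

s = "1+1=2"
pat = r"\+|="

# str.rsplit looks for the pattern as a literal substring. The text "\+|="
# never occurs in s, so the string comes back unsplit.
literal_result = s.rsplit(pat, -1)

# re.split interprets the same pattern as a regular expression and splits
# on either "+" or "=".
regex_result = re.split(pat, s)

print(literal_result)  # ['1+1=2']
print(regex_result)    # ['1', '1', '2']
```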

jamespreed · 2019-11-20T12:44:03Z

Yeah, I saw that as well. Need to either update the function or update the docs.

jamespreed · 2019-11-20T13:43:02Z

This is not the most efficient method, but it does accomplish a regex rsplit. It splits the string with the regex and keeps the last n pieces as they are. It then finds the index in the original string where the last of the remaining (non-included) pieces ends, and slices the string up to that location to form the first element of the result.

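import re

def str_rsplit(arr, pat=None, n=None):
    if pat is None or len(pat) == 1:
        # Whitespace or single-character separator: plain str.rsplit is enough.
        if n is None or n == 0:
            n = -1
        f = lambda x: x.rsplit(pat, n)
    else:
        # Longer patterns are treated as regular expressions.
        if n is None or n == -1:
            n = 0
        regex = re.compile(pat)
        def f(x):
            s = regex.split(x)
            # a: leading pieces to be merged back together; b: the last n pieces.
            # With n == 0, a is empty and b holds every piece.
            a, b = s[:-n], s[-n:]
            if not a:
                return b
            # Walk through x to find where the last piece of a ends.
            ix = 0
            for a_ in a:
                ix = x.find(a_, ix) + len(a_)
            x_ = [x[:ix]]
            return x_ + b
    # Returning f directly makes the per-string function easy to test below;
    # inside pandas the function would instead end with:
    #     res = _na_map(f, arr)
    #     return res
    return f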

Here is a test similar to that in the documentation:

for n in range(-1, 6):
    print(f'#{n:>2}:', str_rsplit(None,  r'\+|=', n=n)('1+1+1+1=4'))

#-1: ['1', '1', '1', '1', '4']
# 0: ['1', '1', '1', '1', '4']
# 1: ['1+1+1+1', '4']
# 2: ['1+1+1', '1', '4']
# 3: ['1+1', '1', '1', '4']
# 4: ['1', '1', '1', '1', '4']
# 5: ['1', '1', '1', '1', '4']
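For comparison, here is a hedged sketch of an alternative that works from the delimiter match positions rather than re-searching for the pieces with str.find. The name `regex_rsplit` is hypothetical and not part of pandas. It assumes that "the last n delimiters found by a left-to-right scan" is an acceptable definition of a right split, which may differ from a true right-to-left search for patterns whose matches could overlap, and it ignores capture groups.

```python
import re

def regex_rsplit(text, pat, n=-1):
    """Split text on regex pat, keeping at most n splits counted from the right.

    Hypothetical helper for illustration; n of None, 0, or a negative
    number means "no limit".
    """
    # Locate every delimiter once, scanning left to right.
    matches = list(re.finditer(pat, text))
    if n is not None and n > 0:
        # Keep only the rightmost n delimiters.
        matches = matches[-n:]
    parts = []
    start = 0
    for m in matches:
        parts.append(text[start:m.start()])
        start = m.end()
    # Whatever follows the last delimiter is the final piece.
    parts.append(text[start:])
    return parts

for n in range(-1, 6):
    print(f'#{n:>2}:', regex_rsplit('1+1+1+1=4', r'\+|=', n))
```

On the test string above this should reproduce the same lists as the earlier loop, for example `['1+1+1', '1', '4']` for n=2.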

Reference #3584

This PR adds 4 new libcudf strings APIs for split.

- `cudf::strings::split_re` - split using regex to locate delimiters with table output like `cudf::strings::split`.
- `cudf::strings::rsplit_re` - same as `split_re` but delimiter search starts from the end of each string
- `cudf::strings::split_record_re` - same as `split_re` but returns a list column like `split_record` does
- `cudf::strings::rsplit_record_re` - same as `split_record_re` but delimiter search starts from the end of each string

Like `split/rsplit` the results try to match Pandas behavior for these. The `record` results are similar to specifying `expand=False` in the Pandas `split/rsplit` APIs. Python/Cython updates for cuDF will be in a follow-on PR.

Currently, Pandas does not support regex for its `rsplit` even though it has been documented and there is an issue [here](pandas-dev/pandas#29633).

New gtests have been added for these along with some additional tests that were missing for the non-regex versions of these APIs.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- AJ Schmidt (https://github.com/ajschmidt8)
- https://github.com/nvdbaranec
- Andy Grove (https://github.com/andygrove)
- Nghia Truong (https://github.com/ttnghia)

URL: #10128

jbrockmendel added the Strings (String extension data type and string data) label Nov 30, 2019

mroeschke added the Bug label May 3, 2020

davidwendt mentioned this issue Jan 31, 2022

Add libcudf strings split API that accepts regex pattern rapidsai/cudf#10128

Merged

davidwendt mentioned this issue Feb 17, 2022

Add regex flags parameter to python cudf strings split rapidsai/cudf#10185

Merged

yeandy mentioned this issue Mar 3, 2022

[BEAM-13947] Add split() and rsplit(), non-deferred column operations on categorical columns apache/beam#16677

Merged

4 tasks

damccorm mentioned this issue Jun 4, 2022

Update DataFrame rsplit() api once pandas rsplit() supports regex apache/beam#20962

Open
