[FEA] Support regular expression patterns for string split #3584

beckernick · 2019-12-11T16:08:45Z

I'd like to be able to pass a regular expression pattern for string splitting, like I could do in base Python or pandas. This is relevant for tasks like trying to do sentence tokenization with regular expressions, as sentences can end with more than one type of punctuation (for example).

import pandas
import cudf
import re

s = cudf.Series(['this is. an? example'])
ps = s.to_pandas()

print(s)
0    this is. an? example
dtype: object

print(ps.str.split('[.?!] ', expand=True))
         0   1        2
0  this is  an  example

print(re.split('[.?!] ', ps.iloc[0]))
['this is', 'an', 'example']

# cudf
print(s.str.split('[.?!] '))
                      0
0  this is. an? example

randerzander · 2020-01-28T14:52:06Z

@beckernick we created nvstrings.replace_multi to support a performant way to make many string substitutions in a single pass (as opposed to passing a lengthy regex string with multiple patterns).

I wonder if a similarly behaved nvstrings.split_multi would suffice? I suspect it would be more performant than a regex-based split.

@davidwendt thoughts?

davidwendt · 2020-01-28T15:29:47Z

I would certainly recommend implementing a split_multi over adding regex to split.

beckernick · 2020-01-28T16:42:46Z

I agree with both of you. However, I'm curious to get your thoughts on the following hypothetical:

What if I wanted to split on my own regular-expression-based definition of "End of Sentence", one slightly more complex than the toy example above? It would be difficult to enumerate all the possible "states" explicitly, and if the dependency is more than 1 character long (i.e., more than just a punctuation mark) the state space actually explodes.

replace_multi appears to support regex. If split_multi also supported multiple regex patterns, that would likely allow this behavior. But, at that point, I'm not sure what additional value it would offer over a standard split + regex in the above scenario.
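To make the trade-off concrete, here is a minimal CPU-only sketch in plain Python. The `split_multi` helper below is hypothetical (it is not an nvstrings or cudf API); it only illustrates that splitting on a fixed list of literal delimiters covers the toy example, while a pattern with a repetition such as `[.?!]+` has no finite list of literal equivalents.

```python
import re


def split_multi(text, delimiters):
    """Split text on any of several literal delimiters (hypothetical helper)."""
    # Try longer delimiters first so one never loses to its own prefix.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    return re.split(pattern, text)


text = "this is. an? example"

# The toy case: three literal delimiters are enough.
literal = split_multi(text, [". ", "? ", "! "])

# The same result from a single regex.
regex = re.split(r"[.?!] ", text)

# Runs of punctuation: "[.?!]+" matches "... " and "?! " alike, which a
# literal list could only cover by enumerating every combination.
runs = re.split(r"[.?!]+ ", "wait... what?! ok")

print(literal, regex, runs)
```

For the toy input both approaches agree; the regex only becomes necessary once the delimiter definition stops being enumerable.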

Reference #3584

This PR adds 4 new libcudf strings APIs for split.

- `cudf::strings::split_re` - split using regex to locate delimiters with table output like `cudf::strings::split`.
- `cudf::strings::rsplit_re` - same as `split_re` but delimiter search starts from the end of each string
- `cudf::strings::split_record_re` - same as `split_re` but returns a list column like `split_record` does
- `cudf::strings::rsplit_record_re` - same as `split_record_re` but delimiter search starts from the end of each string

Like `split/rsplit` the results try to match Pandas behavior for these. The `record` results are similar to specifying `expand=False` in the Pandas `split/rsplit` APIs. Python/Cython updates for cuDF will be in a follow-on PR.

Currently, Pandas does not support regex for its `rsplit` even though it has been documented and there is an issue [here](pandas-dev/pandas#29633).

New gtests have been added for these along with some additional tests that were missing for the non-regex versions of these APIs.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- AJ Schmidt (https://github.com/ajschmidt8)
- https://github.com/nvdbaranec
- Andy Grove (https://github.com/andygrove)
- Nghia Truong (https://github.com/ttnghia)

URL: #10128
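The difference between the table output and the `record` output described above can be sketched in plain Python. This is only an illustration of the two result shapes, not the libcudf implementation; the helper names `split_record` and `split_table` are made up for the example, and padding with `None` is an assumption standing in for null entries.

```python
import re


def split_record(strings, pattern):
    """One list of pieces per input string (similar to expand=False)."""
    return [re.split(pattern, s) for s in strings]


def split_table(strings, pattern):
    """One column per piece position, padded with None (similar to expand=True)."""
    records = split_record(strings, pattern)
    width = max((len(r) for r in records), default=0)
    # Column i holds the i-th piece of every row, or None if the row is shorter.
    return [[r[i] if i < len(r) else None for r in records] for i in range(width)]


data = ["this is. an? example", "one! two"]

records = split_record(data, r"[.?!] ")
columns = split_table(data, r"[.?!] ")

print(records)
print(columns)
```

The record form keeps each row's pieces together with a variable length per row, while the table form fixes the number of columns to the longest row and fills the gaps.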

Closes #3584

This depends on libcudf changes in PR #10128

This adds the regex parameter to the cudf strings `split()` function similar to the 1.4.0 Pandas one documented [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html). The main difference is that the `pat` parameter will only be interpreted as regex if the `pat` string has more than 1 character and the `regex` parameter is set to `True`. This is to help with consistency and migration from the previous implementation.

The 1.3.x Pandas version does not have a `regex` parameter for `split()` but instead will try to interpret the intention of the `pat` parameter without it. This seems a bit dangerous since regex would be much slower for us here. Therefore, the `regex` parameter is required to be set to `True` in the cudf implementation in order to use the regex logic path.

Pandas does not support regex for its `rsplit` even though it has been documented and there is an issue here.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10185
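The dispatch rule described in this PR (regex only when `regex=True` and `pat` is longer than one character) can be sketched on the CPU as follows. The function `split_like_cudf` is a hypothetical stand-in written with the standard library; it mirrors the rule as stated above and is not the cudf code itself.

```python
import re


def split_like_cudf(text, pat, regex=False):
    """Hypothetical model of the rule: regex path is strictly opt-in."""
    # Regex is used only when requested AND pat has more than one character.
    if regex and len(pat) > 1:
        return re.split(pat, text)
    # Otherwise pat is treated as a literal delimiter.
    return text.split(pat)


text = "this is. an? example"

# Default: the pattern is a literal string, so nothing matches.
as_literal = split_like_cudf(text, "[.?!] ")

# Opt-in: the same pattern is interpreted as a regex.
as_regex = split_like_cudf(text, "[.?!] ", regex=True)

# A single character stays literal even with regex=True.
single_char = split_like_cudf("a.b", ".", regex=True)

print(as_literal, as_regex, single_char)
```

On the GPU, the equivalent call for the example from the original report would presumably be `s.str.split('[.?!] ', regex=True)`, assuming a cudf version that includes this change.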

beckernick added the feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code.), and strings (strings issues (C++ and Python)) labels Dec 11, 2019

davidwendt mentioned this issue Dec 7, 2021

[FEA] Implement version of string::split that accepts a regular expression for the delimiter #9862

Closed

davidwendt self-assigned this Dec 7, 2021

andygrove mentioned this issue Dec 7, 2021

[FEA] Add regular expression support to GPU implementation of StringSplit NVIDIA/spark-rapids#4003

Closed

This was referenced Jan 26, 2022

Add libcudf strings split API that accepts regex pattern #10128

Merged

Add regex flags parameter to python cudf strings split #10185

Merged

rapids-bot bot closed this as completed in #10185 Feb 19, 2022
