[FEA] Support regular expression patterns for string split #3584

Closed
beckernick opened this issue Dec 11, 2019 · 3 comments · Fixed by #10185
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@beckernick
Member

I'd like to be able to pass a regular expression pattern for string splitting, like I could do in base Python or pandas. This is relevant for tasks like trying to do sentence tokenization with regular expressions, as sentences can end with more than one type of punctuation (for example).

import pandas
import cudf
import re

s = cudf.Series(['this is. an? example'])
ps = s.to_pandas()

print(s)
0    this is. an? example
dtype: object

print(ps.str.split('[.?!] ', expand=True))
         0   1        2
0  this is  an  example

print(re.split('[.?!] ', ps.iloc[0]))
['this is', 'an', 'example']

# cudf
print(s.str.split('[.?!] '))
                      0
0  this is. an? example
@beckernick beckernick added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) labels Dec 11, 2019
@randerzander
Contributor

@beckernick we created nvstrings.replace_multi to support a performant way to make many string substitutions in a single pass (as opposed to passing a lengthy regex string with multiple patterns).

I wonder if a similarly behaved nvstrings.split_multi would suffice? I suspect it would be more performant than a regex-based split.

@davidwendt thoughts?

@davidwendt
Contributor

I would certainly recommend implementing a split_multi over adding regex to split.
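
For readers following along, here is a rough pure-Python sketch of what a multi-delimiter split could look like. `split_multi` did not exist when this was discussed; the name, signature, and longest-delimiter-first rule below are assumptions made for illustration only, not an nvstrings or cuDF API.

```python
def split_multi(text, delimiters):
    """Split `text` on any of several literal delimiters in one left-to-right pass.

    Hypothetical helper: longer delimiters are tried first so that
    overlapping delimiters resolve to the longest match (an assumption).
    """
    delims = sorted((d for d in delimiters if d), key=len, reverse=True)
    pieces, start, i = [], 0, 0
    while i < len(text):
        # Find the first (longest) delimiter that begins at position i, if any.
        hit = next((d for d in delims if text.startswith(d, i)), None)
        if hit is None:
            i += 1
        else:
            pieces.append(text[start:i])
            i += len(hit)
            start = i
    pieces.append(text[start:])
    return pieces


print(split_multi("this is. an? example", [". ", "? ", "! "]))
```

Every delimiter has to be listed literally, which is the limitation the next comment raises.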

@beckernick
Member Author

beckernick commented Jan 28, 2020

I agree with both of you. However, I'm curious to get your thoughts on the following hypothetical:

What if I wanted to split on my own regular-expression-based definition of "End of Sentence", one slightly more complex than the toy example above? It would be difficult to enumerate all the possible "states" explicitly, and if the dependency is more than 1 character long (i.e., more than just a punctuation mark), the state space explodes.

replace_multi appears to support regex. If split_multi also supported multiple regex patterns, that would likely allow this behavior. But at that point, I'm not sure what additional value it would offer over a standard split + regex in the above scenario.
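
To make the hypothetical concrete, here is a small sketch using Python's standard `re` module. The end-of-sentence rule (punctuation, then whitespace, then a capital letter) is an invented example, and the lookbehind/lookahead syntax is a feature of Python's `re`; it is not assumed to be supported by other regex engines.

```python
import re

# Hypothetical rule: a sentence ends at '.', '?' or '!' only when it is
# followed by whitespace and then an uppercase letter. The punctuation is
# kept with the sentence because only the whitespace is consumed.
END_OF_SENTENCE = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences using the context-dependent rule above."""
    return END_OF_SENTENCE.split(text)


print(split_sentences("Dr. smith arrived at 3 p.m. today. He left? Yes!"))
```

Listing every literal delimiter for a rule like this would mean enumerating each punctuation, whitespace, and capital-letter combination, which is the state-space growth described above.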

@davidwendt davidwendt self-assigned this Dec 7, 2021
rapids-bot bot pushed a commit that referenced this issue Feb 11, 2022
Reference #3584

This PR adds 4 new libcudf strings APIs for split.
- `cudf::strings::split_re` - split using regex to locate delimiters with table output like `cudf::strings::split`.
- `cudf::strings::rsplit_re` - same as `split_re` but delimiter search starts from the end of each string
- `cudf::strings::split_record_re` - same as `split_re` but returns a list column like `split_record` does
- `cudf::strings::rsplit_record_re` - same as `split_record_re` but delimiter search starts from the end of each string

Like `split/rsplit` the results try to match Pandas behavior for these. The `record` results are similar to specifying `expand=False` in the Pandas `split/rsplit` APIs. Python/Cython updates for cuDF will be in a follow-on PR.
Currently, Pandas does not support regex for its `rsplit` even though it has been documented and there is an issue [here](pandas-dev/pandas#29633).
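
The difference between the table-style and record-style results, and between forward and reverse splitting, can be sketched in plain Python. This is only an emulation built on the standard `re` module: the function names mirror the libcudf ones for readability, but the signatures, the use of `None` for nulls and padding, and the right-to-left approximation are assumptions, not the libcudf implementation.

```python
import re


def split_record_re(strings, pattern, maxsplit=-1):
    """One list of pieces per input string (like expand=False). maxsplit <= 0 means no limit."""
    regex = re.compile(pattern)  # assumes the pattern has no capturing groups
    count = maxsplit if maxsplit > 0 else 0  # re.split treats 0 as "no limit"
    return [None if s is None else regex.split(s, maxsplit=count) for s in strings]


def split_re(strings, pattern, maxsplit=-1):
    """Table-like result: every row padded with None to the widest row."""
    records = split_record_re(strings, pattern, maxsplit)
    width = max((len(r) for r in records if r is not None), default=0)
    return [
        [None] * width if r is None else r + [None] * (width - len(r))
        for r in records
    ]


def rsplit_record_re(strings, pattern, maxsplit=-1):
    """Like split_record_re, but keeps only the LAST `maxsplit` delimiters.

    Approximation: delimiters are still located left to right and then
    counted from the end, which may differ from a true reverse search.
    """
    regex = re.compile(pattern)
    out = []
    for s in strings:
        if s is None:
            out.append(None)
            continue
        matches = [m for m in regex.finditer(s) if m.end() > m.start()]
        if maxsplit > 0:
            matches = matches[-maxsplit:]
        pieces, start = [], 0
        for m in matches:
            pieces.append(s[start:m.start()])
            start = m.end()
        pieces.append(s[start:])
        out.append(pieces)
    return out


data = ["this is. an? example", "one", None]
print(split_record_re(data, r"[.?!] "))
print(split_re(data, r"[.?!] "))
print(rsplit_record_re(data, r"[.?!] ", maxsplit=1))
```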

New gtests have been added for these along with some additional tests that were missing for the non-regex versions of these APIs.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - https://github.com/nvdbaranec
  - Andy Grove (https://github.com/andygrove)
  - Nghia Truong (https://github.com/ttnghia)

URL: #10128
rapids-bot bot pushed a commit that referenced this issue Feb 19, 2022
Closes #3584 

This depends on libcudf changes in PR #10128 

This adds the regex parameter to the cudf strings `split()` function similar to the 1.4.0 Pandas one documented [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html). 

The main difference is that the `pat` parameter will only be interpreted as regex if the `pat` string has more than 1 character and the `regex` parameter is set to `True`. This is to help with consistency and migration from the previous implementation.

The 1.3.x Pandas version does not have a `regex` parameter for `split()` but instead will try to interpret the intention of the `pat` parameter without it. This seems a bit dangerous since regex would be much slower for us here. Therefore, the `regex` parameter is required to be set to `True` in the cudf implementation in order to use the regex logic path.
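
The dispatch rule described above can be sketched for a single string in plain Python. `split_one` is a hypothetical helper written for illustration; it is not the cuDF code path, and treating an empty or missing `pat` as a whitespace split is an assumption.

```python
import re


def split_one(text, pat=None, regex=False):
    """Emulate the described rule: use regex only if regex=True AND len(pat) > 1."""
    if not pat:
        # Assumption: no pattern means split on runs of whitespace.
        return text.split()
    if regex and len(pat) > 1:
        return re.split(pat, text)
    # Otherwise the pattern is a literal delimiter, even if it looks like regex.
    return text.split(pat)


text = "this is. an? example"
print(split_one(text, "[.?!] "))              # literal: delimiter not found
print(split_one(text, "[.?!] ", regex=True))  # regex path
print(split_one("a.b", ".", regex=True))      # single character stays literal
```

Going by the description above, the equivalent cuDF call for the example in this issue would be `s.str.split('[.?!] ', regex=True)`.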

Pandas does not support regex for its `rsplit` even though it has been documented, and there is an open issue for this (pandas-dev/pandas#29633).

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10185