[FEA] Support regular expression patterns for string split #3584
Comments
@beckernick we created I wonder if a similarly behaved @davidwendt thoughts?
I would certainly recommend implementing a
I agree with both of you. However, I'm curious to get your thoughts on the following hypothetical: What if I wanted to split on my own regular expression based definition of "End of Sentence", but slightly more complex than the toy example above. It would be difficult to enumerate all the possible "states" explicitly, and if the dependency is more than 1 character long (i.e., more than just a punctuation mark) the state space actually explodes.
Reference #3584

This PR adds 4 new libcudf strings APIs for split.

- `cudf::strings::split_re` - split using regex to locate delimiters with table output like `cudf::strings::split`.
- `cudf::strings::rsplit_re` - same as `split_re` but delimiter search starts from the end of each string
- `cudf::strings::split_record_re` - same as `split_re` but returns a list column like `split_record` does
- `cudf::strings::rsplit_record_re` - same as `split_record_re` but delimiter search starts from the end of each string

Like `split/rsplit` the results try to match Pandas behavior for these. The `record` results are similar to specifying `expand=False` in the Pandas `split/rsplit` APIs. Python/Cython updates for cuDF will be in a follow-on PR.

Currently, Pandas does not support regex for its `rsplit` even though it has been documented and there is an issue [here](pandas-dev/pandas#29633).

New gtests have been added for these along with some additional tests that were missing for the non-regex versions of these APIs.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- AJ Schmidt (https://github.com/ajschmidt8)
- https://github.com/nvdbaranec
- Andy Grove (https://github.com/andygrove)
- Nghia Truong (https://github.com/ttnghia)

URL: #10128
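To make the difference between the table output and the `record` output concrete, here is a rough pure-Python sketch. It does not call libcudf; the helper names `split_record` and `split_table` are hypothetical, and the padding of short rows with `None` (standing in for nulls) is an assumption about the table layout rather than something stated in the PR.

```python
import re

def split_record(strings, pattern):
    """Hypothetical helper: one list of tokens per input string (like expand=False)."""
    return [re.split(pattern, s) for s in strings]

def split_table(strings, pattern):
    """Hypothetical helper: column i holds the i-th token of every input string.

    Rows with fewer tokens are padded with None (an assumed stand-in for nulls).
    """
    rows = split_record(strings, pattern)
    width = max((len(row) for row in rows), default=0)
    padded = [row + [None] * (width - len(row)) for row in rows]
    # Transpose the padded rows into columns.
    return [list(column) for column in zip(*padded)]

strings = ["a1b22c", "x9y"]
records = split_record(strings, r"\d+")
table = split_table(strings, r"\d+")
```

The record form keeps each string's tokens together in a variable-length list, while the table form needs as many columns as the longest split.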
Closes #3584

This depends on libcudf changes in PR #10128

This adds the regex parameter to the cudf strings `split()` function similar to the 1.4.0 Pandas one documented [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html). The main difference is that the `pat` parameter will only be interpreted as regex if the `pat` string has more than 1 character and the `regex` parameter is set to `True`. This is to help with consistency and migration from the previous implementation.

The 1.3.x Pandas version does not have a `regex` parameter for `split()` but instead will try to interpret the intention of the `pat` parameter without it. This seems a bit dangerous since regex would be much slower for us here. Therefore, the `regex` parameter is required to be set to `True` in the cudf implementation in order to use the regex logic path.

Pandas does not support regex for its `rsplit` even though it has been documented and there is an issue here.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10185
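The dispatch rule described in this PR (regex only when `pat` is longer than one character and `regex=True`) can be sketched in plain Python. This is an illustration of the rule, not the cudf implementation; `split_value` is a hypothetical helper operating on a single string.

```python
import re

def split_value(text, pat, regex=False):
    """Hypothetical helper mimicking the rule described above for one string."""
    if regex and len(pat) > 1:
        # Regex path: taken only when explicitly requested and pat is multi-character.
        return re.split(pat, text)
    # Literal path: pat is treated as a plain delimiter string.
    return text.split(pat)

as_regex = split_value("a1b22c", r"\d+", regex=True)
single_char = split_value("a.b.c", ".", regex=True)
as_literal = split_value("a1b22c", r"\d+", regex=False)
```

Under this rule a single-character `pat` such as `"."` stays a literal delimiter even with `regex=True`, and a multi-character pattern is not treated as regex unless the caller opts in.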
I'd like to be able to pass a regular expression pattern for string splitting, like I could do in base Python or pandas. This is relevant for tasks like trying to do sentence tokenization with regular expressions, as sentences can end with more than one type of punctuation (for example).
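For reference, this is the kind of base Python behavior the request refers to, shown with the standard library `re` module. The sample text and the pattern are illustrative assumptions, not taken from the issue.

```python
import re

# Assumed sample text: sentences ending in different punctuation marks.
text = "Is this working? Yes! It splits on several marks. Done"

# Split on any one of . ! ? followed by whitespace.
sentences = re.split(r"[.!?]\s+", text)
```

A single literal delimiter cannot express "any of these punctuation marks", which is why a regex pattern is needed here.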