Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regex flags parameter to python cudf strings split #10185

Merged
merged 40 commits into from
Feb 19, 2022

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Feb 1, 2022

Closes #3584

This depends on libcudf changes in PR #10128

This adds the regex parameter to the cudf strings split() function similar to the 1.4.0 Pandas one documented here.

The main difference is that the pat parameter will only be interpreted as regex if the pat string has more than 1 character and the regex parameter is set to True. This is to help with consistency and migration from the previous implementation.

The 1.3.x Pandas version does not have a regex parameter for split() but instead will try to interpret the intention of the pat parameter without it. This seems a bit dangerous since regex would be much slower for us here. Therefore, the regex parameter is required to be set to True in the cudf implementation in order to use the regex logic path.

Pandas does not support regex for its rsplit even though it has been documented and there is an issue here.

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress Python Affects Python cuDF API. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Feb 1, 2022
@davidwendt davidwendt self-assigned this Feb 1, 2022
@github-actions github-actions bot added CMake CMake build issue conda libcudf Affects libcudf (C++/CUDA) code. labels Feb 1, 2022
@codecov
Copy link

codecov bot commented Feb 1, 2022

Codecov Report

Merging #10185 (65d906d) into branch-22.04 (a7d88cd) will increase coverage by 0.20%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10185      +/-   ##
================================================
+ Coverage         10.42%   10.63%   +0.20%     
================================================
  Files               119      122       +3     
  Lines             20603    20954     +351     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18726     +271     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/main.py 0.00% <ø> (ø)
python/cudf/cudf/_lib/strings/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/_version.py 0.00% <ø> (ø)
python/cudf/cudf/comm/gpuarrow.py 0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/column.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <ø> (ø)
... and 58 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b3342a8...65d906d. Read the comment docs.

@github-actions github-actions bot removed libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Feb 11, 2022
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Feb 15, 2022
@davidwendt davidwendt marked this pull request as ready for review February 15, 2022 14:32
@davidwendt davidwendt requested a review from a team as a code owner February 15, 2022 14:32
@davidwendt
Copy link
Contributor Author

@shwina I added support for regex in rsplit in this PR though it is not supported in Pandas (even in 1.4.0 as far as I can tell). I made it consistent with split which accepts a regex=True parameter (default to None). Let me know what you think. We can remove the rsplit regex support if you think that it would be confusing to have it here.

@davidwendt
Copy link
Contributor Author

I added support for regex in rsplit in this PR though it is not supported in Pandas (even in 1.4.0 as far as I can tell). I made it consistent with split which accepts a regex=True parameter (default to None). Let me know what you think. We can remove the rsplit regex support if you think that it would be confusing to have it here.

@shwina Just wanted to check with you on this before merging.

@vyasr
Copy link
Contributor

vyasr commented Feb 19, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8a50a22 into rapidsai:branch-22.04 Feb 19, 2022
@davidwendt davidwendt deleted the python-split-with-regex branch May 5, 2022 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support regular expression patterns for string split
4 participants