Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust,python): add new str.find expression, returning the index of a regex pattern or literal substring #13561

Merged
merged 5 commits into from
Jan 10, 2024

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Jan 9, 2024

Implements expressified str.find string functionality for Expr and Series, returning the index/position of a given pattern (regex or literal) in the underlying string. Returns -1 (updated) None if the pattern is not found.

(Also closes #13552 by adding a note about requiring a capture group in the given regex in the extract and extract_groups docstrings).

Example

import polars as pl

df = pl.DataFrame(
    data=[
        ("Dubai", 3564931, "b[ai]"),
        ("Abu Dhabi", 1807000, "b[ai]"),
        ("Sharjah", 1405000, "[ai]n"),
        ("Al Ain", 846747, "[ai]+n"),
        ("Ajman", 490035, "[ai]n"),
        ("Ras Al Khaimah", 191753, "a.+a"),
        ("Fujairah", 118933, "a.+a"),
        ("Umm Al Quwain", 59098, "a.+a"),
    ],
    schema={"city": pl.String, "population": pl.Int32, "pattern": pl.String},
)

df.with_columns(
    rx_a = pl.col("city").str.find("(?i)a"),
    lit_a = pl.col("city").str.find("a", literal=True),
    lit_00 = pl.col("population").cast(pl.String).str.find("00", literal=True),
    find_col = pl.col("city").str.find(pl.col("pattern")),
)
# shape: (8, 7)
# ┌────────────────┬────────────┬─────────┬──────┬───────┬────────┬──────────┐
# │ city           ┆ population ┆ pattern ┆ rx_a ┆ lit_a ┆ lit_00 ┆ find_col │
# │ ---            ┆ ---        ┆ ---     ┆ ---  ┆ ---   ┆ ---    ┆ ---      │
# │ str            ┆ i32        ┆ str     ┆ u32  ┆ u32   ┆ u32    ┆ u32      │
# ╞════════════════╪════════════╪═════════╪══════╪═══════╪════════╪══════════╡
# │ Dubai          ┆ 3564931    ┆ b[ai]   ┆ 3    ┆ 3     ┆ null   ┆ 2        │
# │ Abu Dhabi      ┆ 1807000    ┆ b[ai]   ┆ 0    ┆ 6     ┆ 4      ┆ 7        │
# │ Sharjah        ┆ 1405000    ┆ [ai]n   ┆ 2    ┆ 2     ┆ 4      ┆ null     │
# │ Al Ain         ┆ 846747     ┆ [ai]n   ┆ 0    ┆ null  ┆ null   ┆ 4        │
# │ Ajman          ┆ 490035     ┆ [ai]n   ┆ 0    ┆ 3     ┆ 2      ┆ 3        │
# │ Ras Al Khaimah ┆ 191753     ┆ a.+a    ┆ 1    ┆ 1     ┆ null   ┆ 1        │
# │ Fujairah       ┆ 118933     ┆ a.+a    ┆ 3    ┆ 3     ┆ null   ┆ 3        │
# │ Umm Al Quwain  ┆ 59098      ┆ a.+a    ┆ 4    ┆ 10    ┆ null   ┆ null     │
# └────────────────┴────────────┴─────────┴──────┴───────┴────────┴──────────┘

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jan 9, 2024
@alexander-beedie alexander-beedie changed the title feat(rust,python): add support for str.find expression, returning the index of a regex pattern or literal substring feat(rust,python): add new str.find expression, returning the index of a regex pattern or literal substring Jan 9, 2024
@Julian-J-S
Copy link
Contributor

nice!!
But one thing I would definetely change is returning -1 when the pattern is not found.

Returning those sentinal values is very dangenous because it hides errors (if you are not 100% familiar with the function you might miss is because there is no Error/None)!

Rust (using Option/Result) and polars handle this generally very good!
I recommend returning null if the pattern is not found. This is also in line with other functions like Expr.list.get which return None/null if out of bounds.

Or is there anything I miss?

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 9, 2024

One thing I would definitely change is returning -1
Or is there anything I miss?

If you do that then you can't distinguish between not finding the pattern in a valid string and applying find to null values. Both cases would return null, but I think these should probably be different results as we are operating in frame context rather than pure Rust-only context (where returning Option/None would indeed be the expected behaviour) 🤔

For example, calling "find" against a string in Python also returns -1 if a match is not found:

"abc".find("x")
-1

SQL's instr and/or position methods also distinguish between the two cases; applied to NULL they return NULL, but if they can't find a valid match in a string they return zero (as SQL is 1-indexed - which I really dislike, but such is life ;)

@Julian-J-S
Copy link
Contributor

ah I see, thanks for explaining 😉

@orlp
Copy link
Collaborator

orlp commented Jan 9, 2024

I'm not completely convinced that returning -1 when the match is not found is superior to returning null. The distinguishing argument is kind of moot since you can just use .is_null() on the input if necessary. I also find it sketchy when it is combined with offsets, e.g. x.find("foo") + 10.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 9, 2024

I'm not completely convinced that returning -1 when the match is not found is superior to returning null. The distinguishing argument is kind of moot since you can just use .is_null() on the input if necessary. I also find it sketchy when it is combined with offsets, e.g. x.find("foo") + 10.

True enough; it's not a hill I'd die on - there's also an .index method in Python that raises an error if the substring doesn't exist, which is (hand-wavingly) analogous to returning null (though I'm not at all convinced that we should follow suit and have two almost identical methods ;). Would CSE eliminate the cost of running a potenitially expensive expression twice? (Once to check if it's null, then once more to apply the find if it isn't?)

@ritchie46, want to tie-break this one? There are pros/cons to both (I like distinguishing between "not found" and "applied to null" at a glance), but @orlp's point about combining with offsets is quite compelling 🤔

@orlp
Copy link
Collaborator

orlp commented Jan 9, 2024

@alexander-beedie Another argument in favour of null is if you try to compute any statistics on the found index, e.g. 'average text length before the first link' with pl.col(c).str.find("https?://").mean(). Including -1 there would give bad results.

@alexander-beedie
Copy link
Collaborator Author

@alexander-beedie Another argument in favour of null is if you try to compute any statistics on the found index, e.g. 'average text length before the first link' with pl.col(c).str.find("https?://").mean(). Including -1 there would give bad results.

I feel myself being persuaded 🤣

@orlp
Copy link
Collaborator

orlp commented Jan 9, 2024

Even more disastrous is the minimum, e.g. trying to find the webpage with the shortest header: pl.col("html").str.find("</head>").arg_min().

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Jan 9, 2024

The tipping point has been reached; I'm changing it 😆 (For bonus points the result type can be downsized from an i64 to a u32 now it doesn't have to handle a negative value).

…he index of a regex pattern or literal substring
@alexander-beedie
Copy link
Collaborator Author

Done; further streamlined a few things while I was at it...
Thanks for the suggestions guys 👌

@ritchie46 ritchie46 merged commit 94306a4 into pola-rs:main Jan 10, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Documentation for pattern parameter in str.extract
4 participants