str.replace_many to take a dictionary that defines a replacement mapping. #17220

Closed
avhz opened this issue Jun 26, 2024 · 8 comments · Fixed by #18214
Labels
A-dtype-string Area: string data type A-input-parsing Area: parsing input arguments accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@avhz

avhz commented Jun 26, 2024

Description

Currently the str.replace_many method takes two lists for the original and replacement strings.

df = polars.DataFrame({
    "A": ["a", "b", "c", "d", "e"],
})

df.with_columns(
    polars.col("A").str.replace_many(
        patterns=["a", "b", "c"],
        replace_with=["x", "y", "z"],
    )
)
┌─────┐
│ A   │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
│ c   │
│ d   │
│ e   │
└─────┘

┌─────┐
│ A   │
│ --- │
│ str │
╞═════╡
│ x   │
│ y   │
│ z   │
│ d   │
│ e   │
└─────┘

It would be handy to include the ability to just pass a dictionary which defines the replacement mapping:

map = {
    "a": "x",
    "b": "y",
    "c": "z",
}

df.with_columns(
    polars.col("A").str.replace_many(map)
)
@avhz avhz added the enhancement New feature or an improvement of an existing feature label Jun 26, 2024
@Arengard

just use

map = {
    "a": "x",
    "b": "y",
    "c": "z",
}

df.with_columns(
    polars.col("A").replace(map)
)

@stinodego stinodego added the A-input-parsing Area: parsing input arguments label Jun 26, 2024
@stinodego
Member

Good suggestion - we can do the same 'trick' as we do in replace.

@stinodego stinodego added accepted Ready for implementation A-dtype-string Area: string data type labels Jun 26, 2024
@avhz
Author

avhz commented Jun 27, 2024

Is there a way to make this work with regex?

I have tried something like:

import regex
import polars

df = polars.DataFrame({
    "x": ["a", "b", "c", "1", "2", "3"],
})

map = {
    regex.compile(r"[a-z]"): "alpha",
    regex.compile(r"[0-9]"): "digit",
}

df.with_columns(
    polars.col("x").replace(map)
)

For my personal use case, I need to replace a large number of regex patterns, and it's not very ergonomic to use two lists because it can be hard to keep track of what is replacing what.

Another possibility is a list of tuples:

map = [
    (r"[a-z]", "alpha"),
    (r"[0-9]", "digit"),
]

This (in my opinion) is nicer to follow than something like:

old = [r"[a-z]", r"[0-9]"]
new = ["alpha", "digit"]

@stinodego
Member

A dict is just going to be syntactic sugar. You can just define your map and then call str.replace_many(map.keys(), map.values()).

@avhz
Author

avhz commented Jun 27, 2024

That gives me:

TypeError: cannot create expression literal for value of type dict_keys: 
...

Hint: Pass `allow_object=True` to accept any value and create a literal of type Object.

@stinodego
Member

Call list on each input then. I'm just saying: this is just some syntactic sugar that you can do yourself. You don't need us to take care of it. Though it would be nice if we did.
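
A minimal sketch of the "call list on each input" suggestion, in plain Python. The helper name `split_mapping` is hypothetical and not part of Polars; it only turns one dict into the two parallel lists that `str.replace_many` is described as accepting earlier in this thread.

```python
def split_mapping(mapping: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split a replacement mapping into parallel pattern/replacement lists.

    Hypothetical helper for illustration; dicts preserve insertion order,
    so the two lists stay aligned pair by pair.
    """
    patterns = list(mapping.keys())
    replacements = list(mapping.values())
    return patterns, replacements


mapping = {"a": "x", "b": "y", "c": "z"}
patterns, replacements = split_mapping(mapping)

# With Polars installed, the lists could then be passed as in the issue
# description (not executed here):
#   polars.col("A").str.replace_many(patterns, replacements)
```

Keeping the mapping as a single dict and splitting it only at the call site avoids the bookkeeping problem of maintaining two separate lists by hand.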

@avhz
Author

avhz commented Jun 27, 2024

I must be missing something, as none of the following work for me:

## ============================================================================

import regex
import polars

## ============================================================================

df = polars.DataFrame(
    {
        "x": ["a", "b", "c", "1", "2", "3"],
    }
)

## ============================================================================

map = {
    regex.compile(r"[a-z]"): "alpha",
    regex.compile(r"[0-9]"): "digit",
}

df.with_columns(polars.col("x").replace(map))
df.with_columns(polars.col("x").replace(map.keys(), map.values()))
df.with_columns(polars.col("x").replace(list(map.keys()), list(map.values())))

## ============================================================================

map = {
    r"[a-z]": "alpha",
    r"[0-9]": "digit",
}

df.with_columns(polars.col("x").str.replace_many(map))
df.with_columns(polars.col("x").str.replace_many(map.keys(), map.values()))
df.with_columns(polars.col("x").str.replace_many(list(map.keys()), list(map.values())))

## ============================================================================

All of them throw an exception except the third form (the one calling list()), which does not throw, but also does not match the regex patterns, so I am left with the original dataframe.

@cmdlineluser
Contributor

cmdlineluser commented Jun 27, 2024

There are a few different issues:

  1. You're passing regex.compile() objects - which Polars does not understand.

Polars uses the Rust crate https://github.com/rust-lang/regex - so you must pass "strings" when using the regex functions.

  2. .str.replace_many does not work with regular expressions. (perhaps a note could be added to the docs?)

It uses https://github.com/BurntSushi/aho-corasick which works with "literal strings" only.
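
The literal-versus-regex distinction can be illustrated with the Python standard library alone. This is only an analogy under stated assumptions: `replace_literal` and `replace_regex` are hypothetical helpers built on `str.replace` and `re.sub`, and they do not show how Polars implements either method. They do show why a literal matcher leaves the example column unchanged when given `[a-z]` as a pattern.

```python
import re

mapping = {r"[a-z]": "alpha", r"[0-9]": "digit"}
values = ["a", "b", "c", "1", "2", "3"]


def replace_literal(text: str, mapping: dict[str, str]) -> str:
    """Treat each key as a literal substring, not as a pattern."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text


def replace_regex(text: str, mapping: dict[str, str]) -> str:
    """Treat each key as a regular expression, applied in dict order.

    Later patterns see the output of earlier ones, so order matters.
    """
    for pattern, new in mapping.items():
        text = re.sub(pattern, new, text)
    return text


literal_result = [replace_literal(v, mapping) for v in values]
regex_result = [replace_regex(v, mapping) for v in values]
```

Here `literal_result` equals the input, because none of the values contains the five-character string `[a-z]`, while `regex_result` maps the letters to `alpha` and the digits to `digit`.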

It sounds like you may really be asking for:


4 participants