Dict/Hashmap lookup expression #3789

Closed
sm-Fifteen opened this issue Jun 23, 2022 · 8 comments

@sm-Fifteen

Describe your feature request

Let's say I have a dataset like this:

import polars as pl

grades = pl.DataFrame(
    {
        "student": ["bas", "laura", "tim", "jenny", "bas", "laura", "tim", "jenny"],
        "class": ["MAT-150", "MAT-150", "MAT-210", "MAT-600", "COM-200", "COM-205", "COM-430", "COM-200"],
        "test_score": [10, 5, 6, 8, 7, 6, 10, 5],
        "test_max": [10, 10, 12, 10, 12, 10, 15, 12],
    }
)

And that I want to map each class to its respective subject matter, so I can compare grades per subject instead of per class:

class_subject = {
    "MAT-150": "Mathematics",
    "MAT-210": "Mathematics",
    "MAT-430": "Mathematics",

    "COM-200": "Programming",
    "COM-205": "Programming",
    "COM-600": "Programming",
}

With Pandas, I can use Series.map to create a series that looks up each value of the initial column as a key in a Python dictionary and holds the corresponding value.
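For reference, a minimal sketch of the Pandas behavior described above. The shortened series and dictionary are illustrative and not taken from the original post; keys absent from the dictionary come back as NaN, which is why a second filtering pass is needed.

```python
import pandas as pd

# Illustrative data: a shortened version of the "class" column.
classes = pd.Series(["MAT-150", "MAT-600", "COM-200"])

class_subject = {
    "MAT-150": "Mathematics",
    "COM-200": "Programming",
}

# Series.map looks up each value as a dictionary key.
# "MAT-600" is not in the dictionary, so its result is NaN.
subjects = classes.map(class_subject)

# Second pass: drop the entries that had no match.
known_subjects = subjects.dropna()
```

Here `known_subjects` keeps only the rows whose class code was found in the dictionary.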

Using Polars, that's doable, but a fair amount more involved, because I need to cast both columns to Categorical and perform the join within the same StringCache context manager:

with pl.StringCache():
    class_subject_df = pl.from_records(list(class_subject.items()), columns=['class_code', 'class_subject'], orient='row')
    class_subject_df = class_subject_df.with_column(pl.col('class_code').cast(pl.Categorical))

    grades = grades.with_column(pl.col("class").cast(pl.Categorical))
    grades = grades.join(class_subject_df, left_on='class', right_on='class_code')

A new expression method, maybe something like Expr.lookup(map: dict[str | int, ...]), would make this sort of operation doable in a single step. An extra argument, like lookup(map, on_missing: Literal['omit','null','error']), could also be useful to specify the behavior when the hashmap does not contain a given key. Pandas instead relies on the use of defaultdict and the user running a second pass to filter out the NaNs that were inserted for missing entries.
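To make the proposed semantics concrete, here is a plain-Python sketch of what the three `on_missing` modes could mean. The function `lookup` is hypothetical and is not a Polars API, and the reading of 'omit' as "drop the element" is an assumption, since the proposal does not spell it out.

```python
from typing import Any, Hashable, List, Mapping, Sequence


def lookup(
    values: Sequence[Hashable],
    mapping: Mapping[Hashable, Any],
    on_missing: str = "null",
) -> List[Any]:
    """Hypothetical stand-in for the proposed Expr.lookup, on plain lists."""
    if on_missing not in ("omit", "null", "error"):
        raise ValueError(f"unknown on_missing mode: {on_missing!r}")

    out: List[Any] = []
    for value in values:
        if value in mapping:
            out.append(mapping[value])
        elif on_missing == "null":
            out.append(None)  # keep the row, mark it as missing
        elif on_missing == "error":
            raise KeyError(value)  # fail on the first unmapped key
        # "omit" (assumed meaning): skip the element entirely
    return out


class_subject = {"MAT-150": "Mathematics", "COM-200": "Programming"}
codes = ["MAT-150", "MAT-600", "COM-200"]

with_nulls = lookup(codes, class_subject, on_missing="null")
omitted = lookup(codes, class_subject, on_missing="omit")
```

Note that 'omit' changes the length of the result, which would matter for an expression that has to line up with the other columns of a DataFrame.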

If this is restricted to dicts and not lambda functions, it should be possible to copy the dict into a Rust HashMap and perform the operation without needing Python-owned resources.

@ritchie46
Member

We have this functionality. This is a join. I don't see much benefit of adding more code we must maintain, binary bloat etc, for something that maps column a to column b.

@sm-Fifteen
Author

sm-Fifteen commented Jun 23, 2022

We have this functionality. This is a join. I don't see much benefit of adding more code we must maintain, binary bloat etc, for something that maps column a to column b.

Join is unwieldy for this operation, since it can't be expressed in-line in a select/with_column.

It's possible to perform this as an expression, but since Expr.map() materializes all the selected columns and Series.apply() invokes a user-defined Python function, this is liable to perform poorly on larger datasets:

grades.with_column(pl.col("class").map(lambda series: series.apply(lambda x: class_subject.get(x))))

EDIT: It turns out that you can use Expr.apply() directly, the doc just makes it seem like apply() should be reserved for groupby contexts.

csv_data = grades.with_column(pl.col("class").apply(class_subject.get))

@ritchie46
Member

EDIT: It turns out that you can use Expr.apply() directly, the doc just makes it seem like apply() should be reserved for groupby contexts.

Yeap, apply is elementwise (in the select context).

@Arengard

Yeap, apply is elementwise (in the select context).

What???

@cbilot

cbilot commented Jun 28, 2022

This section from the User Guide might help:

apply works on the smallest logical elements for that operation.
That is:
select context -> single elements
groupby context -> single groups
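As a rough plain-Python analogy of the two cases quoted above (this is not Polars code, and the data and functions are made up for illustration): in the first case the user function is called once per value, in the second it is called once per group and receives all of that group's values.

```python
from collections import defaultdict

# Illustrative (student, test_score) rows.
rows = [("bas", 10), ("laura", 5), ("bas", 7), ("laura", 6)]


def scale(score):
    # Receives a single element.
    return score * 10


def best(scores):
    # Receives every value belonging to one group.
    return max(scores)


# "select context" analogy: one call per element.
elementwise = [scale(score) for _, score in rows]

# "groupby context" analogy: one call per group.
groups = defaultdict(list)
for student, score in rows:
    groups[student].append(score)
per_group = {student: best(scores) for student, scores in groups.items()}
```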

@ghuls
Collaborator

ghuls commented Feb 10, 2023

@sm-Fifteen
This is now supported:

In [6]: grades.with_columns(pl.col("class").map_dict(class_subject, default="No Known Class").alias("class_code"))
Out[6]: 
shape: (8, 5)
┌─────────┬─────────┬────────────┬──────────┬────────────────┐
│ student ┆ class   ┆ test_score ┆ test_max ┆ class_code     │
│ ---     ┆ ---     ┆ ---        ┆ ---      ┆ ---            │
│ str     ┆ str     ┆ i64        ┆ i64      ┆ str            │
╞═════════╪═════════╪════════════╪══════════╪════════════════╡
│ bas     ┆ MAT-150 ┆ 10         ┆ 10       ┆ Mathematics    │
│ laura   ┆ MAT-150 ┆ 5          ┆ 10       ┆ Mathematics    │
│ tim     ┆ MAT-210 ┆ 6          ┆ 12       ┆ Mathematics    │
│ jenny   ┆ MAT-600 ┆ 8          ┆ 10       ┆ No Known Class │
│ bas     ┆ COM-200 ┆ 7          ┆ 12       ┆ Programming    │
│ laura   ┆ COM-205 ┆ 6          ┆ 10       ┆ Programming    │
│ tim     ┆ COM-430 ┆ 10         ┆ 15       ┆ No Known Class │
│ jenny   ┆ COM-200 ┆ 5          ┆ 12       ┆ Programming    │
└─────────┴─────────┴────────────┴──────────┴────────────────┘

Closed by #5899.

@ghuls closed this as completed Feb 10, 2023

@sm-Fifteen
Author

sm-Fifteen commented Feb 10, 2023

@ghuls: Oh, wow, thanks, that's great! I'd actually given up on this, but for my use cases, it's a huge improvement in ergonomics and readability.

@sezanzeb

sezanzeb commented Dec 7, 2023

Since I found this feature via this thread, I'd like to mention that from 0.19.16 on, this method is called "replace".
