Dict/Hashmap lookup expression #3789

sm-Fifteen · 2022-06-23T16:39:02Z

Describe your feature request

Let's say I have a dataset like this:

import polars as pl

grades = pl.DataFrame(
    {
        "student": ["bas", "laura", "tim", "jenny", "bas", "laura", "tim", "jenny"],
        "class": ["MAT-150", "MAT-150", "MAT-210", "MAT-600", "COM-200", "COM-205", "COM-430", "COM-200"],
        "test_score": [10, 5, 6, 8, 7, 6, 10, 5],
        "test_max": [10, 10, 12, 10, 12, 10, 15, 12],
    }
)

And that I want to map each class to its subject matter, so I can compare grades per subject instead of per class:

class_subject = {
    "MAT-150": "Mathematics",
    "MAT-210": "Mathematics",
    "MAT-430": "Mathematics",

    "COM-200": "Programming",
    "COM-205": "Programming",
    "COM-600": "Programming",
}

With Pandas, I can use Series.map to look up each value of the original column as a key in a Python dictionary and build a new series from the matching values.

Using Polars, that's doable, but a fair amount more involved, because I need to cast both columns as Categorical and perform a join within the same context manager:

with pl.StringCache():
    class_subject_df = pl.from_records(list(class_subject.items()), columns=['class_code', 'class_subject'], orient='row')
    class_subject_df = class_subject_df.with_column(pl.col('class_code').cast(pl.Categorical))

    grades = grades.with_column(pl.col("class").cast(pl.Categorical))
    grades = grades.join(class_subject_df, left_on='class', right_on='class_code')

A new expression method, maybe something like Expr.lookup(map: dict[str | int, ...]), would make this sort of operation doable in a single step. An extra argument, like lookup(map, on_missing: Literal['omit','null','error']), could also be useful to specify the behavior when the hashmap does not contain a given key. Pandas instead relies on the use of defaultdict and on the user running a second pass to filter out the NaNs that were inserted for missing entries.

If this is restricted to dicts and not lambda functions, it should be possible to copy the dict into a Rust HashMap and perform the operation without needing Python-owned resources.
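To make the three proposed on_missing behaviors concrete, here is a minimal pure-Python sketch over a plain list. The lookup function below is hypothetical: it only illustrates the semantics suggested above and is not an existing Polars or Pandas API.

```python
from typing import Any, Hashable, List, Mapping, Sequence

# Sentinel so that a mapping value of None is not mistaken for a missing key.
_MISSING = object()


def lookup(
    values: Sequence[Hashable],
    mapping: Mapping[Hashable, Any],
    on_missing: str = "null",
) -> List[Any]:
    """Hypothetical helper: map each value through `mapping`.

    on_missing="omit"  drops values that have no entry in the mapping,
    on_missing="null"  emits None for them,
    on_missing="error" raises KeyError on the first one.
    """
    if on_missing not in ("omit", "null", "error"):
        raise ValueError(f"unknown on_missing mode: {on_missing!r}")

    out: List[Any] = []
    for key in values:
        found = mapping.get(key, _MISSING)
        if found is not _MISSING:
            out.append(found)
        elif on_missing == "null":
            out.append(None)
        elif on_missing == "error":
            raise KeyError(key)
        # on_missing == "omit": skip the value entirely
    return out


# Sample data modelled on the issue: "MAT-600" has no entry in the mapping.
classes = ["MAT-150", "MAT-600", "COM-200"]
class_subject = {"MAT-150": "Mathematics", "COM-200": "Programming"}

with_nulls = lookup(classes, class_subject, on_missing="null")
omitted = lookup(classes, class_subject, on_missing="omit")
```

Note that "omit" changes the length of the result, so in a DataFrame setting it would behave like a filter rather than a column-wise mapping.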


ritchie46 · 2022-06-23T18:21:17Z

We have this functionality. This is a join. I don't see much benefit of adding more code we must maintain, binary bloat etc, for something that maps column a to column b.

sm-Fifteen · 2022-06-23T19:40:32Z

> We have this functionality. This is a join. I don't see much benefit of adding more code we must maintain, binary bloat etc, for something that maps column a to column b.

Join is unwieldy for this operation, since it can't be expressed in-line in a select/with_column.

It's possible to perform this as an expression, but since ~~Expr.map() materializes all the selected columns and~~ Series.apply() invokes a user-defined Python function, this is liable to perform poorly on larger datasets:

grades.with_column(pl.col("class").map(lambda series: series.apply(lambda x: class_subject.get(x))))

EDIT: It turns out that you can use Expr.apply() directly, the doc just makes it seem like apply() should be reserved for groupby contexts.

csv_data = grades.with_column(pl.col("class").apply(class_subject.get))

ritchie46 · 2022-06-23T20:12:03Z

> EDIT: It turns out that you can use Expr.apply() directly, the doc just makes it seem like apply() should be reserved for groupby contexts.

Yeap, apply is elementwise (in the select context).

Arengard · 2022-06-28T13:17:47Z

> Yeap, apply is elementwise (in the select context).

What???

cbilot · 2022-06-28T13:52:10Z

This section from the User Guide might help:

> apply works on the smallest logical elements for that operation.
> That is:
> select context -> single elements
> groupby context -> single groups

ghuls · 2023-02-10T14:43:24Z

@sm-Fifteen
This is now supported:

In [6]: grades.with_columns(pl.col("class").map_dict(class_subject, default="No Known Class").alias("class_code"))
Out[6]: 
shape: (8, 5)
┌─────────┬─────────┬────────────┬──────────┬────────────────┐
│ student ┆ class   ┆ test_score ┆ test_max ┆ class_code     │
│ ---     ┆ ---     ┆ ---        ┆ ---      ┆ ---            │
│ str     ┆ str     ┆ i64        ┆ i64      ┆ str            │
╞═════════╪═════════╪════════════╪══════════╪════════════════╡
│ bas     ┆ MAT-150 ┆ 10         ┆ 10       ┆ Mathematics    │
│ laura   ┆ MAT-150 ┆ 5          ┆ 10       ┆ Mathematics    │
│ tim     ┆ MAT-210 ┆ 6          ┆ 12       ┆ Mathematics    │
│ jenny   ┆ MAT-600 ┆ 8          ┆ 10       ┆ No Known Class │
│ bas     ┆ COM-200 ┆ 7          ┆ 12       ┆ Programming    │
│ laura   ┆ COM-205 ┆ 6          ┆ 10       ┆ Programming    │
│ tim     ┆ COM-430 ┆ 10         ┆ 15       ┆ No Known Class │
│ jenny   ┆ COM-200 ┆ 5          ┆ 12       ┆ Programming    │
└─────────┴─────────┴────────────┴──────────┴────────────────┘

Closed by #5899.

sm-Fifteen · 2023-02-10T14:48:59Z

@ghuls : Oh, wow, thanks, that's great! I'd actually given up on this, but for my use cases, it's actually a huge improvement in ergonomics and readability.

sezanzeb · 2023-12-07T17:13:13Z

Since I found this feature via this thread, I'd like to mention that from 0.19.16 on, this method is called "replace".

sm-Fifteen added the feature label Jun 23, 2022

ghuls closed this as completed Feb 10, 2023

sm-Fifteen mentioned this issue Feb 10, 2023

feat(python): Add map_dict expression. #5899

Merged

sm-Fifteen mentioned this issue Sep 7, 2023

Change default behavior map_dict #10755

Closed
