fix(python): Address incorrect align_frames result when the alignment column contains NULL values #18521

Merged 1 commit on Sep 4, 2024

Conversation

alexander-beedie (Collaborator) commented Sep 2, 2024

Internally, align_frames uses joins, and by default a join does not match NULL values against each other. If the alignment column(s) contain NULLs, the NULL rows from each frame are treated as distinct keys, so extraneous/incorrect rows are generated during frame alignment.

This PR fixes that oversight 👌

Example

import polars as pl

df1 = pl.DataFrame({
  "key": ["x", "y", None],
  "value": [1, 2, 0],
})
df2 = pl.DataFrame({
  "key": ["x", None, "z", "y"],
  "value": [4, 3, 6, 5],
})

# ┌──────┬───────┐   ┌──────┬───────┐   
# │ key  ┆ value │   │ key  ┆ value │   
# │ ---  ┆ ---   │   │ ---  ┆ ---   │   
# │ str  ┆ i64   │   │ str  ┆ i64   │   
# ╞══════╪═══════╡   ╞══════╪═══════╡   
# │ x    ┆ 1     │   │ x    ┆ 4     │   
# │ y    ┆ 2     │   │ null ┆ 3     │   
# │ null ┆ 0     │   │ z    ┆ 6     │   
# └──────┴───────┘   │ y    ┆ 5     │   
#                    └──────┴───────┘   

Before

An additional row is (incorrectly) introduced, because the NULL keys in the two frames are not matched:

pl.align_frames(df1, df2, on="key")
# ┌──────┬───────┐   ┌──────┬───────┐
# │ key  ┆ value │   │ key  ┆ value │
# │ ---  ┆ ---   │   │ ---  ┆ ---   │
# │ str  ┆ i64   │   │ str  ┆ i64   │
# ╞══════╪═══════╡   ╞══════╪═══════╡
# │ null ┆ null  │   │ null ┆ 3     │
# │ null ┆ 0     │   │ null ┆ null  │
# │ x    ┆ 1     │   │ x    ┆ 4     │
# │ y    ┆ 2     │   │ y    ┆ 5     │
# │ z    ┆ null  │   │ z    ┆ 6     │
# └──────┴───────┘   └──────┴───────┘

After

Correct number of appropriately aligned rows:

pl.align_frames(df1, df2, on="key")
# ┌──────┬───────┐   ┌──────┬───────┐
# │ key  ┆ value │   │ key  ┆ value │
# │ ---  ┆ ---   │   │ ---  ┆ ---   │
# │ str  ┆ i64   │   │ str  ┆ i64   │
# ╞══════╪═══════╡   ╞══════╪═══════╡
# │ null ┆ 0     │   │ null ┆ 3     │
# │ x    ┆ 1     │   │ x    ┆ 4     │
# │ y    ┆ 2     │   │ y    ┆ 5     │
# │ z    ┆ null  │   │ z    ┆ 6     │
# └──────┴───────┘   └──────┴───────┘
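
For readers who want to see why the row count changes from 5 to 4, here is a minimal pure-Python sketch of a full join on the key column. It is a toy model, not the Polars implementation: toy_full_join and its nulls_match flag are hypothetical names made up for this illustration, and the data simply mirrors df1 and df2 from the example above.

```python
# Same data as df1 and df2 above, as plain (key, value) pairs.
left = [("x", 1), ("y", 2), (None, 0)]
right = [("x", 4), (None, 3), ("z", 6), ("y", 5)]


def toy_full_join(left, right, nulls_match):
    """Hypothetical helper: full join of (key, value) pairs on key.

    Returns (key, left_value, right_value) rows. When nulls_match is
    False, a None key never equals another None key, which models the
    default join behaviour described in this PR.
    """

    def keys_equal(a, b):
        if a is None or b is None:
            return nulls_match and a is None and b is None
        return a == b

    rows = []
    matched_right = set()
    for lkey, lval in left:
        hit = False
        for i, (rkey, rval) in enumerate(right):
            if keys_equal(lkey, rkey):
                rows.append((lkey, lval, rval))
                matched_right.add(i)
                hit = True
        if not hit:
            # Left row with no partner: right side is filled with None.
            rows.append((lkey, lval, None))
    for i, (rkey, rval) in enumerate(right):
        if i not in matched_right:
            # Right row with no partner: left side is filled with None.
            rows.append((rkey, None, rval))
    # Put None keys first, then sort the remaining keys alphabetically.
    return sorted(rows, key=lambda r: (r[0] is not None, r[0] or ""))


before = toy_full_join(left, right, nulls_match=False)
after = toy_full_join(left, right, nulls_match=True)

print(len(before), len(after))  # 5 4
print(after)
# [(None, 0, 3), ('x', 1, 4), ('y', 2, 5), ('z', None, 6)]
```

With nulls_match=False the two None keys each produce their own row, which corresponds to the extra row in the Before output; with nulls_match=True they pair up, as in the After output. The relative order of the two unmatched None rows in this sketch is not meant to reproduce the order Polars emits.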

github-actions bot added the fix (Bug fix) and python (Related to Python Polars) labels Sep 2, 2024
alexander-beedie changed the title from "fix(python): Address incorrect align_frames result when frames contain NULL values" to "fix(python): Address incorrect align_frames result when the alignment column contains NULL values" Sep 2, 2024

codecov bot commented Sep 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.86%. Comparing base (ab25b3e) to head (e69118e).
Report is 6 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #18521   +/-   ##
=======================================
  Coverage   79.85%   79.86%           
=======================================
  Files        1501     1501           
  Lines      201829   201829           
  Branches     2868     2868           
=======================================
+ Hits       161174   161187   +13     
+ Misses      40109    40096   -13     
  Partials      546      546           


ritchie46 merged commit 6d4b79d into pola-rs:main Sep 4, 2024
20 checks passed
alexander-beedie deleted the align-frames-null-fix branch September 4, 2024 07:41