Polars vs Pandas forward fill time comparison #20669

Closed
2 tasks done
francescomandruvs opened this issue Jan 11, 2025 · 5 comments · Fixed by #20689
Labels: accepted (Ready for implementation), performance (Performance issues or improvements), python (Related to Python Polars)

francescomandruvs commented Jan 11, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import numpy as np
import pandas as pd
import time

def polars_ffill(df, grp_cfg):
    return (
        df.with_columns(
            pl.col(grp_cfg['features'])
            .forward_fill()
            .over(grp_cfg['groupby'])
        )
    )

def pandas_ffill(df, grp_cfg):
    df[grp_cfg["features"]] = df.groupby(grp_cfg["groupby"])[grp_cfg["features"]].ffill()
    return df
    
# Test script:
num_rows = 100_000  
num_features_list = [10, 50, 100, 200, 500, 1000, 5000] 

for num_features in num_features_list:
    data = {f'feature_{i}': np.random.randn(num_rows) for i in range(num_features)}
    data['group'] = np.random.randint(0, 10_000, size=num_rows)
    df = pl.DataFrame(data)
    
    grp_cfg = {
        'features': [f'feature_{i}' for i in range(num_features)],
        'groupby': ['group']
    }
    
    start_time = time.time()
    polars_ffill(df, grp_cfg)
    end_time = time.time()    
    elapsed_time = end_time - start_time  
    print(f'Polars: # features: {num_features}, Time: {elapsed_time:.4f}s')

    df_pd = df.to_pandas()
    start_time = time.time()
    pandas_ffill(df_pd, grp_cfg)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f'Pandas: # features: {num_features}, Time: {elapsed_time:.4f}s')

Log output

Polars: # features: 10, Time: 0.1858s
Pandas: # features: 10, Time: 0.0319s
Polars: # features: 50, Time: 1.3480s
Pandas: # features: 50, Time: 0.0643s
Polars: # features: 100, Time: 2.6218s
Pandas: # features: 100, Time: 0.1119s
Polars: # features: 200, Time: 4.6255s
Pandas: # features: 200, Time: 0.1750s
Polars: # features: 500, Time: 11.3868s
Pandas: # features: 500, Time: 0.3941s
Polars: # features: 1000, Time: 24.1346s
Pandas: # features: 1000, Time: 0.8298s

Issue description

I don't know whether this counts as a bug or whether my test is simply unfair; if it's the latter, sorry for the noise. I'm trying to forward fill a large dataset (2-3M rows), and my Polars implementation of the forward fill is much slower than its older Pandas counterpart.
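On the "unfair test" worry: the script above times a single run with `time.time()`. A minimal sketch of a steadier measurement follows, assuming a warm-up run plus a handful of repeats is enough to smooth out one-off costs. The helper name `time_call` and its parameters are hypothetical, not part of Polars or Pandas.

```python
import time
from statistics import median


def time_call(fn, *args, repeats=5, warmup=1):
    """Return (best, median) wall-clock seconds over `repeats` calls of fn(*args)."""
    for _ in range(warmup):
        fn(*args)  # untimed warm-up call, so one-off setup costs are not measured
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(*args)
        samples.append(time.perf_counter() - start)
    return min(samples), median(samples)


if __name__ == "__main__":
    # Stand-in workload; any callable can be timed the same way.
    best, mid = time_call(sum, range(1_000_000))
    print(f"best: {best:.4f}s, median: {mid:.4f}s")
```

With the functions from the reproducible example this would be called as `time_call(polars_ffill, df, grp_cfg)` and `time_call(pandas_ffill, df_pd, grp_cfg)`. Note that `pandas_ffill` assigns into its input frame, so repeated calls operate on already-filled data.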

Expected behavior

Polars should perform on par with Pandas, or better.

Installed versions

--------Version info---------
Polars:              1.17.1
Index type:          UInt32
Platform:            Windows-10-10.0.26100-SP0
Python:              3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
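adbc_driver_manager  <not installed>
altair               <not installed>
boto3                1.35.77
cloudpickle          3.1.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.10.0
gevent               <not installed>
google.auth          2.36.0
great_tables         <not installed>
matplotlib           3.9.3
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.3
pyarrow              18.1.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>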
francescomandruvs added the bug, needs triage, and python labels on Jan 11, 2025
@etiennebacher

This comment was marked as outdated.

@ritchie46
Member

Yes, we can do much better here. There is lots of duplicate work and no parallelism. Need to change the design a bit, will come back to this.
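To make the "duplicate work" remark concrete, here is a rough NumPy sketch of the general idea: derive the grouping once and reuse it for every column, instead of regrouping per column. This is an illustration under that assumption only; it is not the design adopted in #20689, and the function name `grouped_ffill` is hypothetical.

```python
import numpy as np


def grouped_ffill(groups, columns):
    """Forward-fill NaNs within each group, for every column in `columns`.

    groups  : 1-D array of group keys, length n
    columns : dict mapping column name -> 1-D float array, length n
    """
    n = len(groups)
    # Shared work, done once for all columns: a stable sort makes each group
    # contiguous while preserving the original row order inside the group.
    order = np.argsort(groups, kind="stable")
    sorted_keys = groups[order]
    is_start = np.ones(n, dtype=bool)
    is_start[1:] = sorted_keys[1:] != sorted_keys[:-1]
    positions = np.arange(n)

    out = {}
    for name, values in columns.items():
        vals = values[order]
        # A row points at itself if it holds a value or opens a group, else at 0.
        anchor = np.where(~np.isnan(vals) | is_start, positions, 0)
        # The running maximum is the most recent valid row, reset at group starts.
        src = np.maximum.accumulate(anchor)
        filled = np.empty_like(vals)
        filled[order] = vals[src]  # scatter back to the original row order
        out[name] = filled
    return out


if __name__ == "__main__":
    groups = np.array([0, 1, 0, 1, 0, 1])
    x = np.array([1.0, np.nan, np.nan, 20.0, 3.0, np.nan])
    print(grouped_ffill(groups, {"x": x})["x"])
```

The per-column loop only does a gather, a running maximum, and a scatter; the sort and the group boundaries are computed once. The loop iterations are independent of each other, which is also where parallelism could come in.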

@francescomandruvs
Author

"The group2 variable doesn't exist in the data you create (but once I remove it I see similar timings)"

Yeah, sorry about that. It was left over from another test I tried.

ritchie46 added the performance label and removed the bug and needs triage labels on Jan 12, 2025
@ritchie46
Member

ritchie46 commented Jan 13, 2025

After #20689

Polars: # features: 10, Time: 0.0145s
Pandas: # features: 10, Time: 0.0197s
Polars: # features: 50, Time: 0.0234s
Pandas: # features: 50, Time: 0.0513s
Polars: # features: 100, Time: 0.0426s
Pandas: # features: 100, Time: 0.0762s
Polars: # features: 200, Time: 0.0801s
Pandas: # features: 200, Time: 0.1681s
Polars: # features: 500, Time: 0.2497s
Pandas: # features: 500, Time: 0.6723s
Polars: # features: 1000, Time: 0.7436s
Pandas: # features: 1000, Time: 1.1574s
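To read the two logs side by side, the small script below computes the Polars/Pandas time ratio for each feature count, using only the numbers posted in this thread. The two logs come from separate runs, so the sketch compares Polars against Pandas within each log rather than across logs; the variable and function names are illustrative.

```python
# Seconds copied from the logs in this thread: {num_features: (polars, pandas)}
before = {
    10: (0.1858, 0.0319),
    50: (1.3480, 0.0643),
    100: (2.6218, 0.1119),
    200: (4.6255, 0.1750),
    500: (11.3868, 0.3941),
    1000: (24.1346, 0.8298),
}
after = {
    10: (0.0145, 0.0197),
    50: (0.0234, 0.0513),
    100: (0.0426, 0.0762),
    200: (0.0801, 0.1681),
    500: (0.2497, 0.6723),
    1000: (0.7436, 1.1574),
}


def polars_over_pandas(log):
    """Ratio of Polars time to Pandas time; above 1 means Polars is slower."""
    return {n: polars_s / pandas_s for n, (polars_s, pandas_s) in log.items()}


if __name__ == "__main__":
    r_before = polars_over_pandas(before)
    r_after = polars_over_pandas(after)
    for n in before:
        print(f"{n:>5} features: before {r_before[n]:6.2f}x, after {r_after[n]:5.2f}x")
```

In the original log Polars is slower than Pandas at every size, by roughly 29x at 1000 features; in the log after #20689 the ratio is below 1 at every size.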

@francescomandruvs
Author

Thanks @ritchie46 :)

4 participants