Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(python): Use underlying fileno for Python files when possible #17315

Merged
merged 5 commits into from
Jul 3, 2024

Conversation

ruihe774
Copy link
Contributor

@ruihe774 ruihe774 commented Jul 1, 2024

If a Python file object is passed as the argument of IO functions (e.g. read_* & write_*), currently it is wrapped into a PyFileLikeObject, and each read & write will acquire the GIL and call into Python, which is inefficient. This PR introduces a improvement that extracts the underlying fd through fileno() and opens the fd as a Rust File, which does not require GIL and have better performance.

@github-actions github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Jul 1, 2024
Copy link

codecov bot commented Jul 1, 2024

Codecov Report

Attention: Patch coverage is 93.27731% with 8 lines in your changes missing coverage. Please review.

Project coverage is 80.70%. Comparing base (59d2529) to head (ca6a1f6).
Report is 9 commits behind head on main.

Files Patch % Lines
py-polars/src/file.rs 91.30% 6 Missing ⚠️
py-polars/src/functions/io.rs 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17315      +/-   ##
==========================================
- Coverage   80.73%   80.70%   -0.03%     
==========================================
  Files        1475     1475              
  Lines      193238   193388     +150     
  Branches     2760     2760              
==========================================
+ Hits       156013   156080      +67     
- Misses      36715    36798      +83     
  Partials      510      510              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stinodego stinodego changed the title perf: Use underlying fileno for Python files when possible perf(python): Use underlying fileno for Python files when possible Jul 1, 2024
@stinodego stinodego removed the rust Related to Rust Polars label Jul 1, 2024
@ghuls
Copy link
Collaborator

ghuls commented Jul 1, 2024

You can't always skip the python read functions as they might do something more complex than just reading the file descriptor as it is: e.g. reading a compressed file format not supported by Polars, or nested compressed files.

$ head -n 1000 test.csv | gzip | gzip | gzip > test.csv.gz.gz.gz


In [19]: with gzip.open("test.csv.gz.gz.gz") as fh:
    ...:     with gzip.open(fh) as fh2:
    ...:         df = pl.read_csv(fh2)

@ruihe774
Copy link
Contributor Author

ruihe774 commented Jul 1, 2024

You can't always skip the python read functions as they might do something more complex than just reading the file descriptor as it is: e.g. reading a compressed file format not supported by Polars, or nested compressed files.

You're right. I limit this optimization to builtin file IO types in 4257873.

@ruihe774 ruihe774 force-pushed the py-fileno branch 3 times, most recently from 8e06b6b to 52f4c24 Compare July 1, 2024 12:27
@ruihe774 ruihe774 force-pushed the py-fileno branch 2 times, most recently from caf7d0e to ee0deda Compare July 1, 2024 13:31
@ruihe774
Copy link
Contributor Author

ruihe774 commented Jul 1, 2024

I also add an improvement: use inode number in MMapSemaphore instead of path, which better handles hard links and symbolic links.

@ruihe774 ruihe774 force-pushed the py-fileno branch 2 times, most recently from 30f937d to 6f3dc35 Compare July 2, 2024 05:12
@ruihe774
Copy link
Contributor Author

ruihe774 commented Jul 2, 2024

I also add an improvement: use inode number in MMapSemaphore instead of path, which better handles hard links and symbolic links.

A better way is to use open file description locks, and apply it to all file IO, not only mmap, to avoid data race in multithread/multiprocess conditions. I can implement it in the future.

@ruihe774 ruihe774 force-pushed the py-fileno branch 2 times, most recently from fcb4ce2 to 8cc24f4 Compare July 2, 2024 08:40
Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ruihe774, this seems like an improvement indeed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance issues or improvements python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants