perf(python): Use underlying fileno for Python files when possible #17315
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #17315 +/- ##
==========================================
- Coverage 80.73% 80.70% -0.03%
==========================================
Files 1475 1475
Lines 193238 193388 +150
Branches 2760 2760
==========================================
+ Hits 156013 156080 +67
- Misses 36715 36798 +83
Partials 510 510
☔ View full report in Codecov by Sentry.
You can't always skip the Python read functions, as they might do something more complex than just reading the file descriptor as-is: e.g. reading a compressed file format not supported by Polars, or nested compressed files.
$ head -n 1000 test.csv | gzip | gzip | gzip > test.csv.gz.gz.gz
In [19]: with gzip.open("test.csv.gz.gz.gz") as fh:
    ...:     with gzip.open(fh) as fh2:
    ...:         df = pl.read_csv(fh2)
You're right. I've limited this optimization to builtin file IO types in 4257873.
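To illustrate why the check has to look at the object's type and not only at whether `fileno()` exists, here is a minimal Python sketch. The helper `builtin_fileno` is hypothetical and is not the code from this PR; it only assumes that "builtin file IO types" means `io.FileIO` and the standard buffered/text wrappers around it. Note that `gzip.GzipFile` also exposes `fileno()`, but that descriptor points at the compressed bytes, so using it directly would read the wrong data.

```python
import gzip
import io
import os
import tempfile

# Buffered wrappers from the standard library that sit on top of a raw file.
_BUFFERED = (io.BufferedReader, io.BufferedWriter, io.BufferedRandom)


def builtin_fileno(f):
    """Hypothetical helper: return the fd only for plain builtin file objects.

    Anything else (gzip, BytesIO, custom wrappers) returns None, meaning the
    caller should fall back to calling the object's Python read/write methods.
    """
    if isinstance(f, io.TextIOWrapper):
        f = f.buffer  # unwrap the text layer
    if isinstance(f, _BUFFERED):
        f = f.raw  # unwrap the buffering layer
    if isinstance(f, io.FileIO):
        return f.fileno()
    return None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv.gz")
    with gzip.open(path, "wb") as gz:
        gz.write(b"a,b\n1,2\n")

    with open(path, "rb") as plain, gzip.open(path, "rb") as gz:
        plain_fd = builtin_fileno(plain)               # an int: safe to use
        gz_has_fileno = isinstance(gz.fileno(), int)   # True, but misleading
        gz_fd = builtin_fileno(gz)                     # None: must go through Python
```

A real implementation would also have to account for data already buffered by the Python object and for text encodings, which this sketch ignores.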
I also added an improvement: use inode number in
A better way is to use open file description locks and apply them to all file IO, not only mmap, to avoid data races in multithreaded/multiprocess conditions. I can implement that in the future.
this can handle hardlinks and symlinks
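As a rough illustration of why an inode number identifies a file more reliably than its path, the Python sketch below compares `(st_dev, st_ino)` pairs. The helper name `file_identity` and the file names are made up for the example; the PR itself does this on the Rust side, and this sketch assumes a POSIX filesystem.

```python
import os
import tempfile


def file_identity(path):
    """Hypothetical helper: identify a file by device and inode number."""
    st = os.stat(path)  # os.stat follows symlinks to the target file
    return (st.st_dev, st.st_ino)


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "data.csv")
    hardlink = os.path.join(tmp, "hard.csv")
    symlink = os.path.join(tmp, "soft.csv")
    other = os.path.join(tmp, "other.csv")

    for p in (original, other):
        with open(p, "w") as f:
            f.write("a,b\n1,2\n")

    os.link(original, hardlink)    # second directory entry, same inode
    os.symlink(original, symlink)  # separate entry that points at the path

    same_hard = file_identity(original) == file_identity(hardlink)
    same_soft = file_identity(original) == file_identity(symlink)
    different = file_identity(original) != file_identity(other)

    # The same identity is available from an open descriptor via fstat.
    with open(symlink, "rb") as f:
        st = os.fstat(f.fileno())
        same_fd = (st.st_dev, st.st_ino) == file_identity(original)
```

All three paths compare equal here even though their names differ, while a separate file with identical contents does not.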
Thanks @ruihe774, this seems like an improvement indeed.
If a Python file object is passed as the argument of IO functions (e.g. `read_*` & `write_*`), it is currently wrapped into a `PyFileLikeObject`, and each read & write acquires the GIL and calls into Python, which is inefficient. This PR introduces an improvement that extracts the underlying fd through `fileno()` and opens the fd as a Rust `File`, which does not require the GIL and has better performance.
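The mechanism can be sketched from the Python side as follows. This is only an illustration of reading through the descriptor instead of through the file object's `read()` method, not the Rust code in this PR; duplicating the descriptor with `os.dup` and using `os.pread` are assumptions made for the example, and it assumes a POSIX platform.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "wb") as f:
        f.write(b"a,b\n1,2\n")

    with open(path, "rb") as f:
        # Duplicate the descriptor so that closing our copy does not
        # close the file that the Python object still owns.
        fd = os.dup(f.fileno())
        try:
            size = os.fstat(fd).st_size
            # pread reads at an explicit offset and leaves the shared
            # file position untouched.
            via_fd = os.pread(fd, size, 0)
        finally:
            os.close(fd)

        # The Python object still works and sees the same bytes.
        via_python = f.read()
```

In Rust, the equivalent of `via_fd` is obtained without calling back into the interpreter, which is where the saving described above comes from.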