Performance drop in collect() for LazyFrame with pl.scan_ndjson in 0.20.25 #16141
Closed
2 tasks done
Labels: bug, needs triage, python
Checks
Reproducible example
Log output
Issue description
This is a follow-up from the previous issue #16067
Compared with Polars 0.19.12, I was expecting the newer version to be faster. Instead, the collection process is far slower: 1.7 sec on 0.19.12 vs. 54 sec on 0.20.25.
I am working with a relatively large dataset of around 700k JSON files totalling ~3 GB, and this issue makes 0.20.25 unusable for me. The old version, by contrast, was fast, as you can see in the log output above.
I tested it on two different machines, both running Linux. Using a different version of Python does not seem to make any appreciable difference.
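For anyone who wants to compare the two steps on their own data, here is a minimal sketch of how the scan and the collect could be timed separately. It is not the reproducible example from this report: the helper names `timed` and `time_scan_and_collect` and the `source` argument are hypothetical, and which kinds of input `pl.scan_ndjson` accepts depends on the installed Polars version.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn() once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def time_scan_and_collect(source):
    """Time the lazy scan and the collect() step separately.

    `source` is a placeholder for whatever is passed to pl.scan_ndjson;
    the accepted input types depend on the Polars version.
    """
    import polars as pl  # imported lazily so timed() is usable without Polars

    # Building the LazyFrame (the "initial scan" step).
    lazy_frame, scan_seconds = timed(lambda: pl.scan_ndjson(source))
    # Materializing the LazyFrame into a DataFrame (the "collect" step).
    data_frame, collect_seconds = timed(lazy_frame.collect)
    return data_frame, scan_seconds, collect_seconds
```

Running `time_scan_and_collect` on the same input in two environments, one with each Polars version installed, should show whether the slowdown sits in the scan or in the collect.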
Expected behavior
I was expecting Polars version 0.20.25 to be faster than 0.19.12. Instead, the collection process is far slower (1.7 sec on 0.19.12 vs. 54 sec on 0.20.25).
Note that in the new version, the initial scan process (pl.scan_ndjson) is much faster than in the old version (1.2 sec -> 0.25 sec) and does not seem to scale with the number of files.
Installed versions
For version 0.20.25: