use arrow::read_parquet instead of nanoparquet #462
Comments
@BenoitLondon Thank you for the benchmark. As you asked the "why" question: long story short of #315, we wanted Parquet support by default. At first, R 1.0.0 with [...] We are somewhat reluctant to introduce options for choosing which package to use. We are still cleaning those up from the pre-1.0 era. I don't mind switching back to [...]
Can you share the code for the benchmark? Some notes:
Not really a good benchmark, but I just ran arrow and nanoparquet on the mentioned 33 million row data set (10x [...]). It would be great to have a proper benchmark, but nevertheless I'll update the note in the nanoparquet README, because it is actually competitive in terms of speed. I suspect that it is also competitive in terms of memory, but we'd need a better way to measure that.
Oh, thanks guys for the explanations, very much appreciated! "median" is the median time of 3 iterations, so yes, in the small dataset case nano is 8 times faster than arrow. I'm very happy to use nanoparquet if there's no downside. My use case is basically writing and reading biggish files (1-5 GB) in R and also reading them in Python or Julia, so I wanted compatibility, speed, and low RAM usage if possible. Thanks again.
It is a question how much this generalizes, but nanoparquet does not look bad at all:
Thanks @gaborcsardi, I find similar results, but I'm not sure why I still find arrow to read faster in my benchmarks, though nano is faster at writing and at doing a full read + write cycle. I think I disabled ALTREP properly; maybe the number of cores (I have 72 there) makes a difference. Anyway, I'm very happy to use nanoparquet through rio, as performance looks on par!
Here's my script for info:
@BenoitLondon With which versions of the packages? I am also not sure if you can just run [...]. But yes, it is also true in general that the results will vary among systems. In particular, the concurrent I/O in Arrow will probably take advantage of more advanced I/O architectures.
And I agree it's likely the reason arrow looks faster at reading, since the difference no longer shows when I do a full cycle. :)
You need to run the dev version of nanoparquet, from the GitHub repo.
In my benchmarks I've found nanoparquet to be much less efficient than arrow in terms of speed and RAM usage.
Speed and RAM usage when reading big files are not very good.
On the nanoparquet repo they say:
rio already uses arrow for feather, so I'm not sure why we rely on nanoparquet for parquet.
If you keep nanoparquet as the default, maybe we could have an option to use arrow instead?
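For anyone who wants to check this on their own machine, here is a minimal sketch of a read-time comparison. It is not the script used in this thread. The helper name `median_elapsed` and the path `data.parquet` are hypothetical, and it uses the median of 3 runs only because that is the statistic mentioned in the discussion. It assumes both packages are installed and calls only `arrow::read_parquet()` and `nanoparquet::read_parquet()`.

```r
# Hypothetical helper (not part of rio, arrow, or nanoparquet):
# runs `fun` `times` times and returns the median elapsed wall-clock seconds.
median_elapsed <- function(fun, times = 3L) {
  elapsed <- vapply(
    seq_len(times),
    function(i) system.time(fun())[["elapsed"]],
    numeric(1)
  )
  stats::median(elapsed)
}

# Hypothetical path; point this at a real Parquet file before running.
path <- "data.parquet"

# Only run the comparison when the file and both packages are available.
if (file.exists(path) &&
    requireNamespace("arrow", quietly = TRUE) &&
    requireNamespace("nanoparquet", quietly = TRUE)) {
  timings <- c(
    arrow       = median_elapsed(function() arrow::read_parquet(path)),
    nanoparquet = median_elapsed(function() nanoparquet::read_parquet(path))
  )
  print(timings)
}
```

Treat the numbers as indicative only. As noted in the comments, results vary with package versions, ALTREP settings, core count, and the I/O setup, and this sketch measures read time only, not memory.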