
use arrow::read_parquet instead of nanoparquet #462

Open · BenoitLondon opened this issue Jan 16, 2025 · 8 comments

@BenoitLondon commented Jan 16, 2025

In my benchmarks I've found nanoparquet to be much less efficient than arrow in terms of speed and RAM usage:

        expression median mem_alloc   name   size
            <char>  <num>     <num> <char> <char>
 1:     df_parquet  1.153     5.578  write  small
 2: df_nanoparquet  0.674   183.986  write  small
 3:     dt_parquet  5.172     0.018  write  small
 4: dt_nanoparquet  0.656   183.876  write  small
 5:     df_parquet 10.878     0.015  write    big
 6: df_nanoparquet 10.182  2068.884  write    big
 7:     dt_parquet 11.461     0.015  write    big
 8: dt_nanoparquet 10.038  2068.947  write    big
 9:     df_parquet  0.088    34.901   read  small
10: df_nanoparquet  0.414   183.187   read  small
11:     df_parquet  1.187     0.009   read    big
12: df_nanoparquet  5.180  1324.072   read    big

Speed and RAM usage when reading big files are not very good.

On the nanoparquet repo they say:

    Being single-threaded and not fully optimized, nanoparquet is probably
    not suited well for large data sets. It should be fine for a couple of
    gigabytes. Reading or writing a ~250MB file that has 32 million rows
    and 14 columns takes about 10-15 seconds on an M2 MacBook Pro.
    For larger files, use Apache Arrow or DuckDB.

rio uses arrow for feather already, so I'm not sure why we rely on nanoparquet for parquet.

If you keep nanoparquet as the default, maybe we could have an option to use arrow instead?
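Until such an option exists, a caller can always bypass rio's default and pick the backend explicitly; a minimal sketch using the exported readers of each package (the file name is illustrative):

```r
# Read the same Parquet file with either backend explicitly,
# instead of relying on rio's default choice.
path <- "data.parquet"

# arrow: multi-threaded C++ reader, returns a tibble by default
df_arrow <- arrow::read_parquet(path)

# nanoparquet: dependency-free, single-threaded, returns a data.frame
df_nano <- nanoparquet::read_parquet(path)
```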

@chainsawriot (Collaborator) commented Jan 16, 2025

@BenoitLondon Thank you for the benchmark.

As you asked the why question: long story short of #315, we wanted Parquet support by default. At first, rio 1.0.0 shipped with arrow, but that was quickly reverted due to installation concerns. Later, nanoparquet by @gaborcsardi was supposed to be installed by default, because it is dependency-free. But that was, again, reverted due to insufficient support for big-endian platforms (r-lib/nanoparquet#21). And therefore we have the funny state of reading parquet with nanoparquet but feather with arrow. Going back to pre-1.1, arrow was used for both parquet and feather in the so-called Suggests tier.

We are somewhat reluctant to introduce options for choosing which package to use; we are still cleaning those up from the pre-1.0 era. I don't mind switching back to arrow altogether. At the same time, I also believe that @gaborcsardi is actively developing nanoparquet to make it more efficient.

@gaborcsardi

Can you share the code for the benchmark?

Some notes:

  • The dev version of nanoparquet has a completely rewritten read_parquet(), which is much faster. (See below)
  • I suspect that you can't really compare mem_alloc, because it only includes memory allocated within R, and arrow probably allocates most of its memory in C/C++.
  • I am not totally sure how to interpret the results. E.g. does
            expression median mem_alloc   name   size
     3:     dt_parquet  5.172     0.018  write  small
     4: dt_nanoparquet  0.656   183.876  write  small
    
    mean that nanoparquet is 8 times faster here? Or 8 times slower?

Not really a good benchmark, but I just ran arrow and nanoparquet on the mentioned 33 million row data set (10x flights from nycflights13), and nanoparquet is about 2 times faster when writing, and about the same when reading. (This is with options(arrow.use_altrep = FALSE), so that arrow actually reads the data.)

It would be great to have a proper benchmark, but nevertheless I'll update the note in the nanoparquet README, because it is actually competitive in terms of speed. I suspect that it is also competitive in terms of memory, but we'd need a better way to measure that.
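A comparison along these lines can be reproduced roughly as below; `options(arrow.use_altrep = FALSE)` is the setting mentioned above, and `bench::mark()` is one way to time the two readers (the file name and iteration count are illustrative, not from the original benchmark):

```r
# Disable arrow's ALTREP vectors so the data is fully materialized on read,
# rather than lazily wrapped around Arrow's own memory.
options(arrow.use_altrep = FALSE)

path <- "flights10x.parquet"  # illustrative file name

bench::mark(
  arrow = arrow::read_parquet(path),
  nano  = nanoparquet::read_parquet(path),
  check = FALSE,    # the readers return different classes (tibble vs data.frame)
  iterations = 3
)
```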

@BenoitLondon
Author

BenoitLondon commented Jan 17, 2025

Oh thanks guys for the explanations, very much appreciated!
I guess my benchmarks were not very well designed. I suspected there was some ALTREP magic behind those numbers, and the RAM figures didn't look correct either. I will run a summary after reading to make sure the data is actually loaded into R.
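One way to do that is to touch every column after reading, so lazily-backed (ALTREP) vectors cannot hide the real cost; a hedged sketch (the helper name is made up):

```r
# Force full materialization of every column after reading, so that
# lazily-backed (ALTREP) vectors cannot skew timing or memory numbers.
force_load <- function(df) {
  invisible(lapply(df, function(col) sum(is.na(col))))
  df
}

df <- force_load(arrow::read_parquet("data.parquet"))
```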

median is the median time of 3 iterations, so yes, in the small-dataset case nano is 8 times faster than arrow.

I'm very happy to use nanoparquet if there's no downside (my use case is basically writing/reading biggish files (1-5 GB) in R and also reading them in Python or Julia, so I wanted compatibility, speed, and low RAM usage if possible).

Thanks again.
I will share my benchmark when fixed ;)

@gaborcsardi

How much this generalizes is an open question, but nanoparquet does not look bad at all:
https://nanoparquet.r-lib.org/dev/articles/benchmarks.html#parquet-implementations-1

@BenoitLondon
Author

BenoitLondon commented Jan 28, 2025

Thanks @gaborcsardi, I find similar results, though I'm not sure why I still find arrow to read faster in my benchmarks; nano is faster at writing and at a full read + write cycle.

I think I disabled ALTREP properly; maybe the number of cores (I have 72 there) makes a difference.

Anyway, I'm very happy to use nanoparquet through rio as performance looks on par!

        expression   size      median    mem_alloc   name                       fn filesize
            <char> <char>       <num>        <num> <char>                   <char>    <num>
 1: df_nanoparquet    big 21.28752221 4.548025e+03   full df_big_test_nano.parquet      238
 2:     df_parquet    big 26.38431233 2.466664e+03   full   df_big_test_ar.parquet      240
 3: df_nanoparquet    big  5.89444197 2.967702e+03   read df_big_test_nano.parquet      238
 4:     df_parquet    big  2.52957607 2.466656e+03   read   df_big_test_ar.parquet      240
 5: df_nanoparquet    big  8.89001248 1.580325e+03  write df_big_test_nano.parquet      238
 6:     df_parquet    big 10.45254921 1.748657e-02  write   df_big_test_ar.parquet      240
 7: dt_nanoparquet    big  8.61057447 1.580388e+03  write dt_big_test_nano.parquet      238
 8:     dt_parquet    big 10.74620440 1.768494e-02  write   dt_big_test_ar.parquet      240
 9: df_nanoparquet  small  0.55284996 1.519936e+02   full     df_test_nano.parquet        8
10:     df_parquet  small  0.45033587 8.611523e+01   full       df_test_ar.parquet        8
11: df_nanoparquet  small  0.17726478 9.927212e+01   read     df_test_nano.parquet        8
12:     df_parquet  small  0.09908626 8.806947e+01   read       df_test_ar.parquet        8
13: df_nanoparquet  small  0.24138912 5.308556e+01  write     df_test_nano.parquet        8
14:     df_parquet  small  0.34817094 5.590614e+00  write       df_test_ar.parquet        8
15: dt_nanoparquet  small  0.24777139 5.294093e+01  write     dt_test_nano.parquet        8
16:     dt_parquet  small  0.41581183 1.983643e-02  write       dt_test_ar.parquet        8

Here's my script for info:
file_format_benchmark.txt


@gaborcsardi

@BenoitLondon With which versions of the packages?

I am also not sure if you can just run bench::mark(), because Arrow or the OS may reuse already open memory maps, so reading the same file a second time will not actually read it again.

But yeah, it is also true in general that the results will vary among systems. In particular, the concurrent I/O in Arrow will take advantage of more advanced I/O architectures, probably.
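One way around the memory-map reuse caveat is to copy the file to a fresh temporary path before each read, so the previous path's mappings cannot be reused; a sketch under that assumption (the helper and file names are made up):

```r
# Copy the file to a fresh temporary path before each read, so memory maps
# held for the previously-read path cannot be reused across iterations.
read_fresh <- function(path, reader) {
  tmp <- tempfile(fileext = ".parquet")
  file.copy(path, tmp)
  on.exit(unlink(tmp))
  reader(tmp)
}

bench::mark(
  arrow = read_fresh("data.parquet", arrow::read_parquet),
  nano  = read_fresh("data.parquet", nanoparquet::read_parquet),
  check = FALSE, iterations = 3
)
```

Note this does not defeat the OS page cache for the source file itself, only per-path memory maps, so cold-cache timings would still need the cache dropped between runs.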

@BenoitLondon
Author

BenoitLondon commented Jan 28, 2025

> packageVersion("nanoparquet")
[1] ‘0.3.1’
> packageVersion("arrow")
[1] ‘17.0.0.1’
> R.version
               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          3.2                         
year           2023                        
month          10                          
day            31                          
svn rev        85441                       
language       R                           
version.string R version 4.3.2 (2023-10-31)
nickname       Eye Holes  

And I agree, that's likely the reason arrow looks faster at reading: when I do a full cycle it doesn't show anymore. :)
Thanks for your package!

@gaborcsardi

You need to run the dev version of nanoparquet, from the GitHub repo.
