Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slower than CRuby while parsing files through YARP/Prism gem #3293

Closed
faraazahmad opened this issue Oct 12, 2023 · 19 comments
Closed

Slower than CRuby while parsing files through YARP/Prism gem #3293

faraazahmad opened this issue Oct 12, 2023 · 19 comments
Assignees

Comments

@faraazahmad
Copy link

I've got a file where I'm trying to parallely parse all the Ruby files in the discourse codebase.

require 'parallel'
require 'prism'
require 'benchmark'

GLOB = '/Users/faraaz/oss/discourse/**/*.rb'

def bench
    Parallel.each(Dir.glob(GLOB), in_threads: 8) do |file|
        Prism.parse_file(file)
    end
end

Benchmark.bm do |x|
    x.report("Attempt 1") { bench }
    x.report("Attempt 2") { bench }
    x.report("Attempt 3") { bench }
end

I am also using the Prism and Parallel gem to do so. Here's my Gemfile.lock:

GEM
  remote: https://rubygems.org/
  specs:
    parallel (1.23.0)
    prism (0.13.0)

PLATFORMS
  x86_64-darwin

DEPENDENCIES
  parallel (~> 1.23)
  prism (~> 0.13.0)

BUNDLED WITH
   2.4.13

I ran this code using CRuby 3.3.0-preview1, truffleruby-23.0.0 (native) truffleruby-23.1.0 (native). Here are the results:

ruby 3.3.0preview1 (2023-05-12 master a1b01e7701) [x86_64-darwin22]

       user     system      total        real
Attempt 1  1.849014   0.270058   2.119072 (  2.136894)
Attempt 2  1.823381   0.245018   2.068399 (  2.070048)
Attempt 3  1.781599   0.245300   2.026899 (  2.028847)

---
truffleruby 23.0.0, like ruby 3.1.3, Oracle GraalVM Native [x86_64-darwin]

       user     system      total        real
Attempt 1 61.490037  11.793482  73.283519 ( 14.299126)
Attempt 2 47.419287  12.269673  59.688960 ( 11.679390)
Attempt 3 37.302392  14.540728  51.843120 ( 10.538771)

---
truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-darwin]

       user     system      total        real
Attempt 1 60.098041  11.371509  71.469550 ( 14.642849)
Attempt 2 48.722091  13.202782  61.924873 ( 12.418994)
Attempt 3 48.938087  11.993571  60.931658 ( 12.815382)

I threw in the GraalVM version for good measure and it seemed to be JIT compiling pretty well but is still slower than CRuby.

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-darwin]

       user     system      total        real
Attempt 1 92.124363  11.339075 103.463438 ( 14.062655)
Attempt 2 46.448333   9.770680  56.219013 (  7.803680)
Attempt 3 43.827920  10.542195  54.370115 (  7.585570)
Attempt 4 42.939988  10.571172  53.511160 (  7.550312)
Attempt 5 41.170301  10.427718  51.598019 (  7.432247)
Attempt 6 41.673918  10.460389  52.134307 (  7.421614)

TruffleRuby seems considerably slower than CRuby in these benchmarks. Is this a bug?

@eregon
Copy link
Member

eregon commented Oct 12, 2023

Thank you for the report, we'll investigate.
What changed compared to https://ruby.social/@faraaz/111217973733226631 where you mentioned 60 seconds for 23.0?

@faraazahmad
Copy link
Author

Over there i was running the benchmarks manually so I was parsing each file 5 times, to see how big of a time difference there is, hence the 60 seconds

@eregon
Copy link
Member

eregon commented Oct 12, 2023

Here are the results I got, discourse=5d632fd30a50747ec9d1d1f0940428a15df781a2.
With:

source "https://rubygems.org"

gem 'parallel'
gem 'prism'
require 'parallel'
require 'prism'
require 'benchmark'

puts RUBY_DESCRIPTION

GLOB = "#{Dir.home}/code/discourse/**/*.rb"

def bench
  Parallel.each(Dir.glob(GLOB), in_threads: 8) do |file|
    Prism.parse_file(file)
  end
end

3.times do
  puts Benchmark.measure { bench }
end
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
  2.021850   0.158122   2.179972 (  2.191742)
  1.910226   0.123122   2.033348 (  2.045011)
  1.934800   0.136041   2.070841 (  2.082889)

ruby 3.3.0preview1 (2023-05-12 master a1b01e7701) [x86_64-linux]
  2.037098   0.149761   2.186859 (  2.198666)
  1.922396   0.119217   2.041613 (  2.053616)
  1.983681   0.105089   2.088770 (  2.100635)

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
 59.349771   5.405589  64.755360 (  5.632704)
 51.509924   2.931996  54.441920 (  4.769414)
 51.910072   2.149475  54.059547 (  4.772589)

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
 78.095965   5.389323  83.485288 (  6.333837)
 52.825928   3.358561  56.184489 (  4.465966)
 49.908065   3.105139  53.013204 (  4.212045)
 49.285708   3.119749  52.405457 (  4.172708)
 42.386188   2.482132  44.868320 (  4.026233)
 31.300244   1.631678  32.931922 (  3.646213)
 28.311627   1.933647  30.245274 (  3.637510)
 24.894860   1.711678  26.606538 (  3.543606)
 24.876030   1.451286  26.327316 (  3.287879)
 24.713629   1.640005  26.353634 (  3.416895)

My CRuby results are close to yours.
However TruffleRuby Native results are much faster (~3x), which is most likely due to TruffleRuby Native using Native Image G1 on Linux and that's not available on macOS yet for Native Image.
TruffleRuby JVM is also much faster (~2x) compared to your results, I'm not sure why.

Do you have 8 physical cores? I guess probably 4 which could explain that last difference.
On my linux machine I have 8 physical cores.
In general it's best to not count hyperthreads as those provide very little parallelism.

Talking of parallelism, I also ran the benchmark without Parallel.each:

require 'parallel'
require 'prism'
require 'benchmark'

puts RUBY_DESCRIPTION

GLOB = "#{Dir.home}/code/discourse/**/*.rb"

def bench
  Dir.glob(GLOB) do |file|
    Prism.parse_file(file)
  end
end

3.times do
  puts Benchmark.measure { bench }
end

and that gives:

ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
  1.959116   0.133027   2.092143 (  2.104274)
  1.856273   0.124001   1.980274 (  1.991644)
  1.869340   0.115209   1.984549 (  1.995774)

ruby 3.3.0preview1 (2023-05-12 master a1b01e7701) [x86_64-linux]
  1.935369   0.131056   2.066425 (  2.078142)
  1.848565   0.103314   1.951879 (  1.962795)
  1.846412   0.107251   1.953663 (  1.964553)

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
 45.736257   2.614669  48.350926 ( 12.628425)
 15.363574   0.553676  15.917250 ( 10.551924)
 12.899212   0.578461  13.477673 ( 10.534125)

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
 74.832125   3.108128  77.940253 ( 14.254892)
 29.320961   1.172517  30.493478 ( 11.530793)
 19.883169   0.831649  20.714818 ( 11.318680)
 14.190805   0.639462  14.830267 ( 11.210458)
 11.789571   0.529738  12.319309 ( 11.132813)
 12.350481   0.629661  12.980142 ( 11.234333)

We see basically no difference on CRuby, as expected due to the GIL.
We see a speedup on TruffleRuby (2.2x on Native, 3.2x on JVM) but it's not linear to the number of cores.
I think in this case it is due to this benchmark being very heavy on allocations, and while TruffleRuby/the JVM can allocate in parallel, the GC (well, except SerialGC on SVM on macOS) runs in parallel regardless of the number of Ruby threads. So the GC has the same amount of work with more Ruby threads but does not get faster (or not much, it might help to have multiple TLAB though). So that is the likely bottleneck for scaling here.

That leaves us with why this workload is quite a bit slower on TruffleRuby, I'll keep digging, it's probably something related to deserialization in Prism.

@eregon
Copy link
Member

eregon commented Oct 12, 2023

Here is a flamegraph from 23.1 Native with be ruby --cpusampler=flamegraph bench.rb 3 (and without parallel):
flamegraph

To get the interactive version you need to download the svg file and open it in a browser, GitHub disables the JavaScript parts of it.

@eregon
Copy link
Member

eregon commented Oct 12, 2023

The most obvious thing there is we see 35% of the time is spent in Prism::Source#compute_offsets which is something the Prism C extension on CRuby does not need to do, instead it uses the offsets from the C parser. We should serialize those offsets in Prism so they don't need to be recomputed when deserializing.

@eregon
Copy link
Member

eregon commented Oct 12, 2023

Something else we see when pressing c twice to see the warmup colors is that load_node is never compiled.
That's because of:

$ be ruby --engine.TraceCompilation bench.rb 3 |& grep 'opt failed'
[engine] opt failed id=12203 Prism::Serialize::Loader#load_node                 |Tier 1|Time  344(-18388702+18389045)ms|Reason: org.graalvm.compiler.truffle.compiler.GraphTooBigBailoutException: Graph too big to safely compile. Node count: 113349. Graph Size: 150001. Limit: 150000.|Timestamp 18389045465543|Src serialize.rb:202
[engine] opt failed id=12203 Prism::Serialize::Loader#load_node                 |Tier 1|Time  328(-18389046+18389374)ms|Reason: org.graalvm.compiler.truffle.compiler.GraphTooBigBailoutException: Graph too big to safely compile. Node count: 113233. Graph Size: 150007. Limit: 150000.|Timestamp 18389373861547|Src serialize.rb:202

So that's a big bottleneck, as the interpreter is much slower than JITed code, and also it allocates a lot more.
load_node looks like this: https://gist.github.com/eregon/a0ed168abcc6c1671d85d9199ff49b72#file-serialize-rb-L203-L493 (this is from prism master FWIW)

And TruffleRuby does not currently optimize case n when 1 body1 when 2 body2 when 3 body3 to jump to body3 without checking n == 1 and n == 2 before.

I'll try a different approach in Prism and see if that helps.

@eregon
Copy link
Member

eregon commented Oct 12, 2023

Using the simplified single-threaded benchmark (no parallel gem, so no need to run this with Bundler, so it can be run like ruby -I../prism/lib bench.rb 3):

require 'prism'
require 'benchmark'

puts RUBY_DESCRIPTION

GLOB = "#{Dir.home}/code/discourse/**/*.rb"

def bench
  Dir.glob(GLOB) do |file|
    Prism.parse_file(file)
  end
end

N = Integer(ARGV.first || 3)
N.times do
  puts Benchmark.measure { bench }
end

and using prism master.
I get:

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
 46.107483   3.218483  49.325966 ( 12.917207)
 18.562016   0.626841  19.188857 ( 10.795933)
 12.500946   0.552576  13.053522 ( 10.574041)

(basically the same as above)

And by changing load_node to be a lambda per node type I get:

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
 61.172579   2.250903  63.423482 (  8.873968)
 37.859864   2.043808  39.903672 (  7.481593)
 28.697565   0.795674  29.493239 (  6.525924)
 16.106719   0.630867  16.737586 (  6.211813)
 13.460437   0.432854  13.893291 (  6.023979)
  7.502166   0.442928   7.945094 (  5.842586)
  9.473145   0.528005  10.001150 (  5.932664)
  8.603267   0.413352   9.016619 (  5.841062)
  7.952528   0.433553   8.386081 (  5.785633)
  6.912824   0.363138   7.275962 (  5.723259)

Almost a 2x speedup!
And no more opt failed.

@eregon
Copy link
Member

eregon commented Oct 12, 2023

Here is the flamegraph, with ruby -I../prism/lib --cpusampler=flamegraph bench.rb 3:
flamegraph

Now everything hot in the histogram is compiled and 60% is spent in Prism::Source#compute_offsets.

@faraazahmad
Copy link
Author

by changing load_node to be a lambda per node type

I'm curious as to how this helps. Does it help us get around this limitation?:

And TruffleRuby does not currently optimize case n when 1 body1 when 2 body2 when 3 body3 to jump to body3 without checking n == 1 and n == 2 before.

eregon added a commit to eregon/yarp that referenced this issue Oct 12, 2023
* Otherwise load_node is too big to compile and is forced to run in interpreter:
  oracle/truffleruby#3293 (comment)
* For the benchmark at oracle/truffleruby#3293 (comment)
  Before: 10.574041 After: 5.592436
@eregon
Copy link
Member

eregon commented Oct 12, 2023

I've pushed my changes so far to ruby/prism@main...eregon:yarp:tweaks.

I'm curious as to how this helps. Does it help us get around this limitation?

Yes, since the lambdas are in a Array so that's just O(1) Array#[].
But also the lambdas are small enough that they can be JIT-compiled, the 300-lines load_node was too big to be JIT-compiled.


About Prism::Source#compute_offsets I'm not convinced that's the biggest bottleneck because I also did a small benchmark doing only that part:

require 'prism'
require 'benchmark'

puts RUBY_DESCRIPTION

GLOB = "#{Dir.home}/code/discourse/**/*.rb"

def bench
  Dir.glob(GLOB) do |file|
    code = File.binread(file)
    Prism::Source.new(code)
  end
end

N = Integer(ARGV.first || 3)
N.times do
  puts Benchmark.measure { bench }
end

But that is much faster and does not take 60% of the ~5.7 seconds/iteration above:

$ ruby -I../prism/lib compute_offsets.rb 10
truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
  4.779324   0.474053   5.253377 (  0.766970)
  3.351564   0.134098   3.485662 (  0.499833)
  1.764714   0.208363   1.973077 (  0.286704)
  1.472017   0.086878   1.558895 (  0.224722)
  2.584139   0.075350   2.659489 (  0.377355)
  2.326343   0.271429   2.597772 (  0.380175)
  1.767078   0.085944   1.853022 (  0.264943)
  1.954663   0.088937   2.043600 (  0.282534)
  1.541035   0.083524   1.624559 (  0.230307)
  1.543260   0.074147   1.617407 (  0.228676)

So I wonder if the profiler is misleading there.
But it could also be a more complicated interaction where the Regexp used for compute_offsets is JIT-compiled late because there is a lot more things to compile on the full benchmark.
In any case it seems worth to avoid recomputing the line offsets again, so I'll work on a way to serialize+deserialize those.

@eregon eregon self-assigned this Oct 13, 2023
@faraazahmad
Copy link
Author

faraazahmad commented Oct 13, 2023

Not sure if this is helps, but if I just create new Source, like you did in the last example, Array#<< barely takes any time. However, if I get to this method by calling Prism#parse only then does it take this long, my amateur guess is that is since its going through call chain: Prism#parse -> Prism#load -> Serialize#load -> Source#new, theres a lot of string allocation happening here, could it be an issue with insufficient TLAB space?

@faraazahmad
Copy link
Author

It definitely takes slightly longer, when I use Parallel.map vs Parallel.each, I'm guessing because in Parallel.map, it stores the result to return it to the main thread. Here's some benchmarks:

Parallel.each

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-darwin]
 86.077456  10.028982  96.106438 ( 13.411521)
 52.458730   9.163018  61.621748 (  8.776318)
 50.073878   9.241329  59.315207 (  8.576140)
 50.344392   9.272874  59.617266 (  8.583368)
 45.094790   9.200160  54.294950 (  8.307919)

Parallel.map

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-darwin]
117.025375  11.182852 128.208227 ( 17.811838)
 82.708486  10.822153  93.530639 ( 13.395469)
 79.538424  12.084877  91.623301 ( 13.385311)
 72.459464  11.287978  83.747442 ( 12.670169)
 71.959355  10.746600  82.705955 ( 12.457304)

and if I simply use the Thread class and store the result of Prism.parse using Thread.current[]=, it takes even longer:

require 'parallel'
require 'prism'
require 'benchmark'

puts RUBY_DESCRIPTION

GLOB = "#{Dir.home}/oss/discourse/**/*.rb"

def bench_prism
  Parallel.map(Dir.glob(GLOB), in_threads: 4) do |file|
    Prism.parse_file(file)
  end
end

def bench_prism_thread
  threads = []
  Dir.glob(GLOB).each do |file|
    threads << Thread.new { Thread.current[:tree] = Prism.parse_file(file) }
  end

  threads.each(&:join)
end

n = Integer(ARGV[0] || 3)

n.times do
  puts Benchmark.measure { bench_prism_thread }
end

results:

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-darwin]
149.042894  16.755828 165.798722 ( 23.293313)
 81.487260  25.510818 106.998078 ( 18.358081)
 80.045864  23.758906 103.804770 ( 15.674315)
 73.355320  29.309650 102.664970 ( 17.059532)
 70.558519  25.700033  96.258552 ( 15.057084)

Following is similar to the above but just without Thread.current[]=:

truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-darwin]
170.567457  17.339795 187.907252 ( 25.313308)
 76.937372  19.420802  96.358174 ( 13.899857)
 71.617192  17.945206  89.562398 ( 12.449902)
 73.299595  21.708841  95.008436 ( 13.741568)
 68.685054  19.843938  88.528992 ( 12.565670)

@faraazahmad
Copy link
Author

faraazahmad commented Oct 13, 2023

I seem to be getting similar times in all 3 versions, when using FixedThreadPool from concurrent gem:

require 'parallel'
require 'prism'
require 'benchmark'
require 'concurrent'

puts RUBY_DESCRIPTION

GLOB = "#{Dir.home}/oss/discourse/**/*.rb"

def thread_pool
  pool = Concurrent::FixedThreadPool.new(4)
  Dir.glob(GLOB).each do |file|
    pool.post { Prism.parse_file(file) }
  end
  pool.shutdown
end

n = Integer(ARGV[0] || 3)

n.times do
  puts Benchmark.measure { thread_pool }
end
ruby 3.3.0preview2 (2023-09-14 master e50fcca9a7) [x86_64-darwin22]
  0.053338   0.098111   0.151449 (  0.151505)
  2.020494   0.264774   2.285268 (  2.286550)
  1.847566   0.235160   2.082726 (  2.083523)
  2.036501   0.249373   2.285874 (  2.287998)
  1.811042   0.232954   2.043996 (  2.044978)
                        
truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-darwin]
  8.664981   0.387264   9.052245 (  1.492148)
  8.049466   0.674221   8.723687 (  1.126538)
  9.955973   0.848196  10.804169 (  1.389465)
  7.546957   0.678019   8.224976 (  1.063006)
 11.698075   1.076263  12.774338 (  1.645159)
                 
truffleruby 23.1.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-darwin]
  2.262696   0.141141   2.403837 (  0.570721)
  4.783651   0.380178   5.163829 (  1.037267)
  6.059194   0.504666   6.563860 (  1.324881)
  7.596199   0.825799   8.421998 (  1.777206)
  8.759423   1.159531   9.918954 (  1.962771)

Is this legit or am I exiting before all the execution can complete?

@eregon
Copy link
Member

eregon commented Oct 13, 2023

I think shutdown will not wait for the threads or tasks to be done: https://ruby-concurrency.github.io/concurrent-ruby/master/Concurrent/ThreadPoolExecutor.html#shutdown-instance_method
You'd probably need wait_for_termination for that.

eregon added a commit to eregon/yarp that referenced this issue Oct 18, 2023
* Otherwise load_node is too big to compile and is forced to run in interpreter:
  oracle/truffleruby#3293 (comment)
* For the benchmark at oracle/truffleruby#3293 (comment)
  TruffleRuby Native 23.1.0:
  Before: 10.574041 After: 5.592436
  JRuby 9.4.3.0:
  Before: 7.037780 After: 3.995317
  JRuby 9.4.3.0 -Xcompile.invokedynamic=true:
  Before: 7.047832 After: 2.269294
eregon added a commit to eregon/yarp that referenced this issue Oct 26, 2023
* Otherwise load_node is too big to compile and is forced to run in interpreter:
  oracle/truffleruby#3293 (comment)
* For the benchmark at oracle/truffleruby#3293 (comment)
  TruffleRuby Native 23.1.0:
  Before: 10.574041 After: 5.592436
  JRuby 9.4.3.0:
  Before: 7.037780 After: 3.995317
  JRuby 9.4.3.0 -Xcompile.invokedynamic=true:
  Before: 7.047832 After: 2.269294
matzbot pushed a commit to ruby/ruby that referenced this issue Oct 26, 2023
* Otherwise load_node is too big to compile and is forced to run in interpreter:
  oracle/truffleruby#3293 (comment)
* For the benchmark at oracle/truffleruby#3293 (comment)
  TruffleRuby Native 23.1.0:
  Before: 10.574041 After: 5.592436
  JRuby 9.4.3.0:
  Before: 7.037780 After: 3.995317
  JRuby 9.4.3.0 -Xcompile.invokedynamic=true:
  Before: 7.047832 After: 2.269294

ruby/prism@a592ec346a
@eregon
Copy link
Member

eregon commented Feb 14, 2024

From ruby/prism#2402 (comment)

ruby/prism#2381 helps a ton, it's around 2.7s per iteration before (and also faster with YJIT).

Such a big speedup was unexpected, so I investigated.
The problem is before ruby/prism#2381 both the filename String and the contents String were copied multiple times from managed byte[] to native char* and back.
This is because pm_serialize_parse is declared with :pointer argument and we pass it a Ruby String.
And in that case TruffleRuby (at least currently) mutates the String in-place to use a native representation (Truffle::CExt.string_to_ffi_pointer).
But then just after serialization is done there would be a copy due to some .b call which uses dup and internally String#initialize_copy for a native string makes a copy of it (needs to as it is a MutableTruffleString).
And then later it would need to be managed again, for Regexp matching.

Changing the code in Prism ffi.rb to avoid using that Ruby String for pm_serialize_parse for the Prism.parse_file case it looks a lot better.
On 24aae7aa51ce2877adc0718cbe6cedca532f01fe (before ruby/prism#2381):

truffleruby 24.1.0-dev-93693825, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
$ ruby -Ilib -rprism -rbenchmark -e '30.times { p Benchmark.realtime { Dir.glob("lib/**/*.rb") { |f| Prism.parse_file(f) } } }'
5.08163639200211
3.2185959659982473
2.7246216619969346
2.493555128996377
2.5304376729982323
2.626141545995779
2.561074762001226
2.6626226649968885
2.451639045000775
2.442975819998537
2.37616028800403
2.342963579998468
2.492586234002374
2.4081202760062297
2.2901282259990694
2.247561003001465
2.216280453998479
2.2374196180026047
2.20327866399748
2.1944359460030682
2.227331458001572
2.290716622999753
2.23880834600277
2.266132271004608
2.235398062999593
2.2192147110035876
2.2535331350009074
2.1951458570038085
2.170285389998753
2.146207225996477

Still 24aae7aa51ce2877adc0718cbe6cedca532f01fe but with the change in ffi.rb:

2.47568983000383
0.568289217997517
0.30265256599523127
0.38206838199403137
0.46137602499948116
0.3988921729978756
0.31418989599478664
0.19933188000140944
0.20126643000548938
0.2556454750010744
0.3183238799974788
0.22124192000046605
0.20216184399760095
0.1891287110047415
0.18731482300063362
0.21746326200081967
0.14462887900299393
0.1365925350037287
0.17602010799600976
0.14211552999768173
0.13566503999754786
0.23790449000080116
0.14238863500213483
0.12545616099669132
0.1537287609971827
0.1526584540042677
0.1223834429983981
0.12089217000175267
0.15295368800434517
0.11997614600113593

CRuby 3.3:

ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [x86_64-linux]
$ ruby --yjit -Ilib -rprism -rbenchmark -e '10.times { p Benchmark.realtime { Dir.glob("lib/**/*.rb") { |f| Prism.parse_file(f) } } }'                    
0.14406701100233477
0.0955199759991956
0.10004908599512419
0.10911789700185182
0.10410506899643224
0.09660588300175732
0.11423506699793506
0.09826613799668849
0.08650715499970829
0.10329519699735101
$ PRISM_FFI_BACKEND=1 ruby --yjit -Ilib -rprism -rbenchmark -e '10.times { p Benchmark.realtime { Dir.glob("lib/**/*.rb") { |f| Prism.parse_file(f) } } }'
0.24653413399937563
0.2043960579976556
0.16812661400035722
0.1922464890012634
0.1833490009958041
0.16330030000244733
0.19229596100194613
0.1730906020020484
0.17741290900448803
0.17981221600348363

@eregon
Copy link
Member

eregon commented Feb 15, 2024

And in that case TruffleRuby (at least currently) mutates the String in-place to use a native representation (Truffle::CExt.string_to_ffi_pointer).

I checked and this is necessary for the FFI :pointer type but not for the :string type: ffi/ffi#1080
So one good fix there is to change TruffleRuby's FFI backend to not mutate the Ruby String for the :string type.

It also needs changes in Prism so that Prism uses :string instead of :pointer for const char*, that's trivial.

@eregon
Copy link
Member

eregon commented Feb 15, 2024

Argh that idea sort of falls apart because

PRISM_EXPORTED_FUNCTION void pm_serialize_parse(pm_buffer_t *buffer, const uint8_t *source, size_t size, const char *data);

So both source and data could be :string, but that's a problem because :string in FFI only accepts a String with no null bytes ("\0"), and data has null bytes.
We could workaround by manually declaring the types, but it's not so nice.
It'd be nice if FFI would let String with null bytes through or have another type allowing it: ffi/ffi#1081

@eregon
Copy link
Member

eregon commented Feb 15, 2024

With this change in Prism:

diff --git a/lib/prism/ffi.rb b/lib/prism/ffi.rb
index 3418d787..639185f3 100644
--- a/lib/prism/ffi.rb
+++ b/lib/prism/ffi.rb
@@ -19,14 +19,23 @@ module Prism
     # Convert a native C type declaration into a symbol that FFI understands.
     # For example:
     #
-    #     const char * -> :pointer
+    #     const char * -> :string
+    #     pm_buffer_t *-> :pointer
     #     bool         -> :bool
     #     size_t       -> :size_t
     #     void         -> :void
     #
-    def self.resolve_type(type)
+    def self.resolve_type(type, return_type)
       type = type.strip
-      type.end_with?("*") ? :pointer : type.delete_prefix("const ").to_sym
+      if !return_type && (type == "const char *" || type == "const uint8_t *")
+        # This type avoids extra copies when passing Ruby Strings arguments to native code.
+        # We don't use it for the return type because we need the pointer e.g. for pm_string_source().
+        :string
+      elsif type.end_with?("*")
+        :pointer
+      else
+        type.delete_prefix("const ").to_sym
+      end
     end
 
     # Read through the given header file and find the declaration of each of the
@@ -56,10 +65,14 @@ module Prism
 
         # Resolve the type of the argument by dropping the name of the argument
         # first if it is present.
-        arg_types.map! { |type| resolve_type(type.sub(/\w+$/, "")) }
+        arg_types.map! { |type| resolve_type(type.sub(/\w+$/, ""), false) }
+
+        if name == 'pm_serialize_parse'
+          arg_types = [:pointer, :string, :size_t, :pointer]
+        end
 
         # Attach the function using the FFI library.
-        attach_function name, arg_types, resolve_type(return_type)
+        attach_function name, arg_types, resolve_type(return_type, true)
       end
 
       # If we didn't find all of the functions, raise an error.

And fixes in TruffleRuby to optimize :string, we get the expected performance:

$ ruby -Ilib -rprism -rbenchmark -e '30.times { p Benchmark.realtime { Dir.glob("lib/**/*.rb") { |f| Prism.parse_file(f) } } }'
...
0.11793946800025878
0.11649627699807752
0.11632051299966406

(vs ~2.5 before, this is on jvm-ee)

graalvmbot pushed a commit that referenced this issue Feb 15, 2024
…f a FFI call

* See #3293 (comment)
  and ffi/ffi#1080
* As that could cause extra conversions to managed later on.
@eregon
Copy link
Member

eregon commented Feb 15, 2024

Using the single-thread benchmark from #3293 (comment) and ruby/prism#2426

$ ruby -I../prism/lib bench.rb           
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
  2.190895   0.117062   2.307957 (  2.321824)
  2.141946   0.133974   2.275920 (  2.290421)
  2.233112   0.120049   2.353161 (  2.368065)

$ ruby -I../prism/lib bench.rb 10         
truffleruby 24.1.0-dev-93693825, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
 28.447562   1.234428  29.681990 (  4.226516)
 18.374132   0.209007  18.583139 (  2.654558)
 17.659575   0.203733  17.863308 (  2.496900)
 18.335423   0.628889  18.964312 (  2.595021)
 19.378060   0.839551  20.217611 (  2.658917)
 17.002796   0.214466  17.217262 (  2.470285)
 15.576381   0.179887  15.756268 (  2.257773)
 11.338724   0.178438  11.517162 (  2.168090)
  8.109505   0.173258   8.282763 (  2.157561)
  6.059230   0.138583   6.197813 (  2.138763)

So it's about the same performance as CRuby 3.3.0 single-threaded.

Now with the parallel benchmark from #3293 (comment):

$ ruby -I../prism/lib bench_parallel.rb 5
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
  2.244563   0.164503   2.409066 (  2.425282)
  2.192646   0.130990   2.323636 (  2.339039)
  2.047115   0.147734   2.194849 (  2.206869)
  2.246744   0.125164   2.371908 (  2.387293)
  2.127249   0.138846   2.266095 (  2.280787)

$ ruby -I../prism/lib bench_parallel.rb 10
truffleruby 24.1.0-dev-93693825, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
 22.552076   1.075381  23.627457 (  1.969642)
 14.322148   0.595175  14.917323 (  1.229991)
  9.793991   0.343925  10.137916 (  0.825214)
  8.694108   0.291115   8.985223 (  0.760018)
  8.244202   0.221496   8.465698 (  0.700302)
  7.944729   0.268573   8.213302 (  0.689711)
  8.077947   0.286265   8.364212 (  0.697297)
  7.915530   0.271384   8.186914 (  0.668313)
  7.710315   0.256563   7.966878 (  0.649437)
  7.871159   0.258862   8.130021 (  0.671472)

$ ruby -I../prism/lib bench_parallel.rb 30
truffleruby 24.1.0-dev-0e0df31f, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
...
  5.939668   0.246157   6.185825 (  0.497786)
  6.813118   0.250171   7.063289 (  0.569874)
  6.008750   0.352802   6.361552 (  0.514425)
  5.993626   0.230494   6.224120 (  0.515127)
  6.387444   0.261654   6.649098 (  0.531733)
  5.805497   0.216181   6.021678 (  0.490501)
  5.956375   0.254065   6.210440 (  0.508663)
  5.919400   0.244198   6.163598 (  0.502847)

So that's about 3.5x-5x as fast! (with 8 threads)
It's not linear scaling, I suspect due to the reasons mentioned at the end of #3293 (comment).

Out of curiosity I also tried with in_processes: 8 on CRuby, and that gives timings like 0.515503.

@eregon eregon closed this as completed Feb 15, 2024
graalvmbot pushed a commit that referenced this issue Feb 15, 2024
…f a FFI call

* See #3293 (comment)
  and ffi/ffi#1080
* As that could cause extra conversions to managed later on.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants