Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serious performance regression for method pow(a, m) #3544

Closed
jzakiya opened this issue Apr 20, 2024 · 7 comments
Closed

Serious performance regression for method pow(a, m) #3544

jzakiya opened this issue Apr 20, 2024 · 7 comments
Assignees

Comments

@jzakiya
Copy link

jzakiya commented Apr 20, 2024

The code below shows this issue.
On Truffleruby, using pow(a, m) is orders of magnitude slower than on [C|J]Ruby, where it's the fastest.

# Enable YJIT if using CRuby >= 3.3"
RubyVM::YJIT.enable if RUBY_ENGINE == 'ruby' and RUBY_VERSION.to_f >= 3.3

def modinv(a0, m0)
  return 1 if m0 == 1
  a, m = a0, m0
  x0, inv = 0, 1
  while a > 1
    inv -= (a / m) * x0
    a, m = m, a % m
    x0, inv = inv, x0
  end
  inv += m0 if inv < 0
  inv
end

def modinv1(r, m)
  inv = r
  while (r * inv) % m != 1; inv %= m; inv *= r end
  inv % m
end

def modinv2(r, m)
  inv = r
  while (r * inv) % m != 1; inv = (inv * r) end
  inv #% m
end

def modinv3(r, m, rescnt)
  (r ** (rescnt-1)) % m
end

def modinv4(r, m, rescnt)
  r.pow(rescnt-1, m)
end

def tm; t = Time.now; yield; Time.now - t end

resp7 = [11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,
          97,101,103,107,109,113,121,127,131,137,139,143,149,151,157,163,
         167,169,173,179,181,187,191,193,197,199,209,211]

print resp7.map { |r| modinv(r, 210) }
puts
print resp7.map { |r| modinv1(r, 210) }
puts
print resp7.map { |r| modinv2(r, 210) }
puts
print resp7.map { |r| modinv3(r, 210, 48) }
puts
print resp7.map { |r| modinv4(r, 210, 48) }
puts

puts tm{ 100000.times {resp7.each { |r| modinv(r, 210) } } }

puts tm{ 100000.times {resp7.each { |r| modinv1(r, 210) } } }

puts tm{ 100000.times {resp7.each { |r| modinv2(r, 210) } } }

puts tm{ 100000.times {resp7.each { |r| modinv3(r, 210, 48) } } }

puts tm{ 100000.times {resp7.each { |r| modinv4(r, 210, 48) } } }
@andrykonchin
Copy link
Member

Thank you for reporting the issue!

Could you please share your benchmarking results?

@eregon
Copy link
Member

eregon commented Apr 22, 2024

Which version of TruffleRuby are you using?

Related: #1999
We do modular exponentiation now:

Primitive.mod_pow(self, e, m)

But always convert to BigInteger, maybe that's the overhead?
If that's the case we could use similar logic to CRuby/JRuby for the small integers cases, but would be good to make sure that's the issue and how much slower it is before importing that code.

@jzakiya
Copy link
Author

jzakiya commented Apr 22, 2024

First code correction.

def modinv2(r, m)
  inv = r
  while (r * inv) % m != 1; inv = (inv * r) % m end
  inv #% m
end

Here are times on my system for Ruby 3.3 w/wo YJIT, JRuby 9.4.6.0 and TruffleRuby 24.0.0, via rvm.
Total times in secs, smaller the faster.

Ruby 3.3 w/o YJIT
1.544888113
1.054920322
0.954105064
3.531902812
0.369263064

Ruby 3.3 w/YJIT
0.375549548
0.383313453
0.36227265
3.213001315
0.235034272

JRuby 9.4.6.0
0.9543389999999999
0.99756
0.870005
1.6784890000000001
0.266971

TruffleRuby 24.0.0
0.222029
0.33000100000000004
0.19171500000000002
2.120432
4.006869

@eregon
Copy link
Member

eregon commented Apr 23, 2024

With a slightly modified script which:

  • uses Benchmark.realtime to use a monotonic clock
  • reports 3 timings for each method (to get an idea if things are warmed up or not yet, and get an idea of the error in the measurements)
  • uncomments the #% m for modinv2, otherwise the result differs from other methods
  • cleans up the check for RubyVM::YJIT.enable
require 'benchmark'

# Enable YJIT if using CRuby >= 3.3"
RubyVM::YJIT.enable if defined?(RubyVM::YJIT.enable)
puts RUBY_DESCRIPTION

def modinv(a0, m0)
  return 1 if m0 == 1
  a, m = a0, m0
  x0, inv = 0, 1
  while a > 1
    inv -= (a / m) * x0
    a, m = m, a % m
    x0, inv = inv, x0
  end
  inv += m0 if inv < 0
  inv
end

def modinv1(r, m)
  inv = r
  while (r * inv) % m != 1; inv %= m; inv *= r end
  inv % m
end

def modinv2(r, m)
  inv = r
  while (r * inv) % m != 1; inv = (inv * r) % m end
  inv % m
end

def modinv3(r, m, rescnt)
  (r ** (rescnt-1)) % m
end

def modinv4(r, m, rescnt)
  r.pow(rescnt-1, m)
end

def tm(&b)
  puts
  3.times do
    puts Benchmark.realtime(&b)
  end
end

resp7 = [11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,
          97,101,103,107,109,113,121,127,131,137,139,143,149,151,157,163,
         167,169,173,179,181,187,191,193,197,199,209,211]

print resp7.map { |r| modinv(r, 210) }
puts
print resp7.map { |r| modinv1(r, 210) }
puts
print resp7.map { |r| modinv2(r, 210) }
puts
print resp7.map { |r| modinv3(r, 210, 48) }
puts
print resp7.map { |r| modinv4(r, 210, 48) }
puts

tm{ 100000.times {resp7.each { |r| modinv(r, 210) } } }

tm{ 100000.times {resp7.each { |r| modinv1(r, 210) } } }

tm{ 100000.times {resp7.each { |r| modinv2(r, 210) } } }

tm{ 100000.times {resp7.each { |r| modinv3(r, 210, 48) } } }

tm{ 100000.times {resp7.each { |r| modinv4(r, 210, 48) } } }

I get:

On Intel:
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [x86_64-linux]
1.347028863034211
1.3474754459457472
1.3452114629908465

1.2710674100089818
1.2754993910202757
1.2733817050466314

1.3521089899586514
1.351813138986472
1.3528226349735633

7.390963806945365
7.495342062029522
7.4823689730255865

0.8994149370118976
0.8945350690046325
0.9004962909966707

truffleruby 24.0.0, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
0.3218597859959118
0.24441085202852264
0.21928530297009274

0.20125181798357517
0.18273464200319722
0.1594632159685716

0.1859166320064105
0.16585971100721508
0.13765042601153255

4.247611464001238
4.127755031979177
4.079420825000852

8.350773187004961
8.356869303970598
8.30608620395651
On AMD Ryzen 7 3700X 8-Core Processor
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) +YJIT [x86_64-linux]
0.6688195530005032
0.6663745859996197
0.6647927419999178

0.7047924210000929
0.7027385810006308
0.70747054200001

0.6540912820000813
0.6548124430000826
0.6545112009998775

5.6871033949992125
5.716956408999977
5.703890113999478

0.46341663200018957
0.46297978599977796
0.46605407200058835

truffleruby 24.0.1, like ruby 3.2.2, Oracle GraalVM Native [x86_64-linux]
0.2904605989997435
0.207193976999406
0.18607605099987268

0.1610029299999951
0.13540141500016034
0.11250887000005605

0.13582420100010495
0.10616527599995607
0.08965392699974473

2.6358709440000894
2.63863470600063
2.630714566000279

5.454730677999578
5.439872395000748
5.455230954999934

truffleruby 24.0.1, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
0.30870855699959066
0.19738048799990793
0.18506349099970976

0.17732479499954934
0.12824568500036548
0.12031344499973784

0.13241845700031263
0.09785182599989639
0.09070625899948936

2.0937121340002705
1.7953665820004971
1.7865803080003388

2.732251799000551
2.6282954050002445
2.6161076440002944

I noticed YJIT is much slower in my measurements, maybe you are on a different architecture, I guess darwin-aarch64?

But overall it's similar results, TruffleRuby is fastest everywhere except for the last method, modinv4.
You mentioned regression, do you mean a regression compared to an older version of TruffleRuby, or you mean regression as "slower than CRuby/JRuby"?

Either way modinv4 amounts to (r between 11 and 211).pow(47, 210) so that's all small integers as inputs, so we should optimize that case too like CRuby/JRuby, it seems the java.math.BigInteger#modPow code doesn't optimize that case well unfortunately.

@eregon
Copy link
Member

eregon commented Apr 23, 2024

@andrykonchin Could you port https://github.com/jruby/jruby/blob/9.4.4.0/core/src/main/java/org/jruby/RubyInteger.java#L935-L944 to TruffleRuby? (or from CRuby, but probably more work)

@andrykonchin andrykonchin self-assigned this Apr 23, 2024
@jzakiya
Copy link
Author

jzakiya commented Apr 23, 2024

My system is a Lenovo Legion Slimjet laptop (2022), AMD Ryzen 9 5900HX, with Linux.

The issue is that pow, which is used in modinv4, is significantly slower in Truffleruby than in [C|J]Ruby.
In [C|J]Ruby using pow is the fastest among different implementations, but is slowest in Truffleruby.

@eregon
Copy link
Member

eregon commented Apr 25, 2024

Fixed in #3552

Now modinv4 is as fast as modinv1 and modinv2:

On AMD Ryzen 7 3700X 8-Core Processor

truffleruby 24.1.0-dev-52596ca3, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
0.35175757099932525
0.21934590799719444
0.19075998200059985

0.20475720999820624
0.13995916100247996
0.11859457800164819

0.16480916699947556
0.10902587799864705
0.09228703799817595

2.012781329001882
1.7254886529990472
1.7228568719983741

0.21183235100033926
0.13329798599806963
0.11889045199859538

Same but running 5 times to ensure enough warmup since this is on jargraal and not libgraal:
truffleruby 24.1.0-dev-52596ca3, like ruby 3.2.2, Oracle GraalVM JVM [x86_64-linux]
0.36994781300018076
0.221259508998628
0.18763062200014247
0.19057154599795467
0.18439673299872084

0.19098085299992817
0.1328596249986731
0.11979981799959205
0.11720118900120724
0.11687311399873579

0.1548915470011707
0.10623463299998548
0.10084365199872991
0.09816986200166866
0.09051773600003798

2.040468414998031
1.7358531079989916
1.755651747000229
1.6976114799981588
1.7026588670014462

0.22171137699842802
0.13052473000061582
0.11817796299874317
0.11785270799737191
0.11395291900043958

As a bonus the new code is written in Ruby (which in this case is a lot faster than Java's BigInteger#modPow!) and IMO looks a lot cleaner than the equivalents in CRuby / JRuby.

@eregon eregon closed this as completed Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants