Remove `String#casecmp` part #123

AlexWayfer · 2017-04-14T14:55:50Z

The String#casecmp method does not work with Unicode (even in Ruby 2.4.1), this is written in the documentation.

Example:

'Привет'.casecmp('привет') # => -1

There is a method String#casecmp?, which works with Unicode.

Example:

'Привет'.casecmp?('привет') # => true

But String.casecmp? is slower:

Warming up --------------------------------------
String#downcase + ==   233.440k i/100ms
      String#casecmp   274.247k i/100ms
     String#casecmp?   219.906k i/100ms
Calculating -------------------------------------
String#downcase + ==      5.746M (± 1.7%) i/s -     28.947M in   5.039252s
      String#casecmp      6.942M (± 1.8%) i/s -     34.829M in   5.019073s
     String#casecmp?      4.517M (± 2.6%) i/s -     22.650M in   5.017864s

Comparison:
      String#casecmp:  6941676.9 i/s
String#downcase + ==:  5745893.8 i/s - 1.21x  slower
     String#casecmp?:  4517314.3 i/s - 1.54x  slower

Code

require 'benchmark/ips'

SLUG = 'ABCD'

def slow
  SLUG.downcase == 'abcd'
end

def fast
  SLUG.casecmp('abcd') == 0
end

def another
  SLUG.casecmp?('abcd')
end

Benchmark.ips do |x|
  x.report('String#downcase + ==') { slow }
  x.report('String#casecmp')       { fast }
  x.report('String#casecmp?')      { another }
  x.compare!
end

So, String#downcase + == is good compromise.

The text was updated successfully, but these errors were encountered:

kbrock · 2018-04-27T15:06:18Z

maybe a compromise here is to have String#casecmp (no unicode support)

so we have the comparison, but the problem is front and center?

AlexWayfer · 2018-04-27T15:16:47Z

maybe a compromise here is to have String#casecmp (no unicode support)

so we have the comparison, but the problem is front and center?

What? I think guides should promotes international (Unicode) ways, not ASCII-only (read as English-only).

nirvdrum · 2018-07-21T15:33:40Z

Many use cases are ASCII-only. I think it's still valid to retain benchmarks for those situations. Ruby even provides String#ascii_only? to help guard such cases.

As an addendum, every Ruby implementation optimizes for ASCII. It's not a value statement. It's just that ASCII characters are byte wide, which makes optimizing for them very easy. UTF-16 and UTF-32 are also somewhat optimized given they have fixed width encodings. UTF-8 complicates things by virtue of variable width characters. As a simple test, benchmark String#length with a string with all ASCII characters and the same string with the last character replaced by a variable-width one; you'll see a massive difference.

I'm sensitive to ensuring code works with all character sets. But to the extent that this is a set of benchmarks that are context-sensitive, I think it makes sense to lay out that context. Really, all the String benchmarks should be enhanced to run with ASCII-only strings and strings with multi-byte characters in UTF-8.

mateusdeap · 2022-11-04T00:22:01Z

I agree with @kbrock and @nirvdrum. There's no good reason to restrict things to UTF-8 because all things should be international. They should follow the actual use cases.

@AlexWayfer do you think you can edit your PR to keep the old casecmp entry and add the case that is compatible with UTF-8?

AlexWayfer · 2022-11-04T03:15:24Z

There's no good reason to restrict things to UTF-8 because all things should be international.

The strings in Ruby are UTF-8 by default.

Do you know about Ruby's # encoding: UTF-8 magic comment? It's not here anymore because of the changed default.

I think it's a very good reason.

They should follow the actual use cases.

And rules about different non-clear use cases should not exist, at least be turned off by default (if we're talking about linting).

@AlexWayfer do you think you can edit your PR to keep the old casecmp entry and add the case that is compatible with UTF-8?

I can, but I don't want. For what? The suggested "by international community" rule doesn't work with a lot of lanuagues, hungreds probably. The significant majority of our world.

I don't want to suggest to use a method which doesn't work correctly in most languages (strings are about texts, texts are written in languages) for a little performance. And I don't want to receive such suggestions.

In the same project in one case you'll be warned and it'll work and you're fine. In another, with another language, probably dynamically (what is difficult to investigate and analyze) it will be broken (just false for no reason, when it's obviously true).

@mateusdeap doi you think you can create a PR to Ruby to fix the casecmp behaviour for Unicode (the Ruby's default)? Then I'm gonna edit my PR. Otherwise… it just doesn't work, "there is no good reason to suggest things to performance because some of them are in ASCII".

nirvdrum · 2022-11-05T06:37:52Z

To clear up confusion, String#casecmp works fine with UTF-8 strings. What it ignores is non-ASCII characters. You can check if a string consists only of ASCII data via String#ascii_only?. This boils down to Ruby's Unicode case mapping support. String#casecmp should arguably take the case mapping options that String#downcase or String#upcase do. It implicitly works like String#downcase(:ascii).

I just want to clear up that "ASCII" here doesn't literally mean the US-ASCII encoding, but rather UTF-8 with characters that are drawn from ASCII (as UTF-8 is ASCII-compatible).

I have a few other parting thoughts:

I don't think this repo should be a considered recommendation for anything. If you want to enforce one method over the other, Rubocop would be a better tool for doing that. As far as I can tell, this project's benchmarks are intended to show the performance difference between calls under different contexts. In most cases, it's a curiosity at best and maybe gives an indication to Ruby developers where things could be sped up.
On that note, there should be no difference in using String#casecmp? and String#casecmp with ASCII-only data, which would eliminate the need to choose. Currently, there is, so this benchmark has value in that sense. This would be a good target for any of the various Ruby JITs to optimize.
Like it or not, there's a huge body of ASCII text out there -- it's a big part of why UTF-8 even exists. CRuby embraces that and optimizes for ASCII-only text throughout most String operations. While I agree the default operation should be the one that works for all characters, both exist for different use cases. To the best of my knowledge, String#casecmp is not deprecated in Ruby.

AlexWayfer mentioned this issue Apr 14, 2017

Remove (disable) Perfomace/Casecmp (at least for Ruby 2.4) rubocop/rubocop#4277

Closed

ouranos mentioned this issue Jun 13, 2017

Ability - case insensitive check of admin_role maestrano/mno-enterprise#316

Merged

AlexWayfer mentioned this issue Jul 19, 2018

Remove String#casecmp rule #160

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove `String#casecmp` part #123

Remove `String#casecmp` part #123

AlexWayfer commented Apr 14, 2017

kbrock commented Apr 27, 2018

AlexWayfer commented Apr 27, 2018 •

edited

Loading

nirvdrum commented Jul 21, 2018

mateusdeap commented Nov 4, 2022

AlexWayfer commented Nov 4, 2022

nirvdrum commented Nov 5, 2022

Remove String#casecmp part #123

Remove String#casecmp part #123

Comments

AlexWayfer commented Apr 14, 2017

kbrock commented Apr 27, 2018

AlexWayfer commented Apr 27, 2018 • edited Loading

nirvdrum commented Jul 21, 2018

mateusdeap commented Nov 4, 2022

AlexWayfer commented Nov 4, 2022

nirvdrum commented Nov 5, 2022

Remove `String#casecmp` part #123

Remove `String#casecmp` part #123

AlexWayfer commented Apr 27, 2018 •

edited

Loading