Add :fast_ascii mode to String.valid?/2 #12360

Merged: 5 commits merged into elixir-lang:main on Feb 10, 2023

Conversation

mtrudel
Contributor

mtrudel commented Jan 23, 2023

Following the discussion in #12354, this PR adds an optional :fast_ascii mode to String.valid?/2 (using the bit56 algorithm discussed there). I've confirmed that this implementation yields the same benefits as observed previously:
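
For anyone who hasn't read #12354: the core of the bit56 trick is to walk the binary 56 bits (7 bytes) at a time and test the high bit of every byte with a single mask, falling back to the usual per-codepoint UTF-8 match whenever the mask fails. A minimal standalone sketch of the idea (illustrative only; the module and function names here are made up and this is not the exact code in the patch):

  defmodule FastAsciiSketch do
    # 7 bytes per step: 56 bits keeps the chunk within the BEAM's small-integer
    # range, so the guard below stays a cheap integer test.
    def valid?(<<chunk::56, rest::bits>>) when Bitwise.band(chunk, 0x80808080808080) == 0 do
      valid?(rest)
    end

    # Any chunk containing a byte with the high bit set (or a tail shorter than
    # 7 bytes) falls through to the normal per-codepoint UTF-8 clauses.
    def valid?(<<_::utf8, rest::bits>>), do: valid?(rest)
    def valid?(<<>>), do: true
    def valid?(_), do: false
  end

  FastAsciiSketch.valid?(String.duplicate("a", 10_000))
  #=> true
  FastAsciiSketch.valid?("ascii prefix" <> <<0xFF>>)
  #=> false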

Benchmark (OTP26, ARM)
iex(3)> Benchee.run(
...(3)>   %{
...(3)>     "stock" => fn {valid, input} -> ^valid = String.valid?(input) end,
...(3)>     "fast_ascii" => fn {valid, input} -> ^valid = String.valid?(input, :fast_ascii) end
...(3)>   },
...(3)>   time: 10,
...(3)>   memory_time: 2,
...(3)>   inputs: %{
...(3)>     1 => {false, String.duplicate("a", 0) <> <<128::8>>},
...(3)>     4 => {false, String.duplicate("a", 3) <> <<128::8>>},
...(3)>     8 => {false, String.duplicate("a", 7) <> <<128::8>>},
...(3)>     16 => {false, String.duplicate("a", 15) <> <<128::8>>},
...(3)>     32 => {false, String.duplicate("a", 31) <> <<128::8>>},
...(3)>     64 => {false, String.duplicate("a", 63) <> <<128::8>>},
...(3)>     128 => {false, String.duplicate("a", 127) <> <<128::8>>},
...(3)>     256 => {false, String.duplicate("a", 255) <> <<128::8>>},
...(3)>     512 => {false, String.duplicate("a", 511) <> <<128::8>>},
...(3)>     1024 => {false, String.duplicate("a", 1023) <> <<128::8>>},
...(3)>     2048 => {false, String.duplicate("a", 2047) <> <<128::8>>},
...(3)>     4096 => {false, String.duplicate("a", 4095) <> <<128::8>>}
...(3)>   }
...(3)> )
Operating System: macOS
CPU Information: Apple M1
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.15.0-dev
Erlang 26.0-rc0

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: 1, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096
Estimated total run time: 5.60 min

Benchmarking fast_ascii with input 1 ...
Benchmarking fast_ascii with input 4 ...
Benchmarking fast_ascii with input 8 ...
Benchmarking fast_ascii with input 16 ...
Benchmarking fast_ascii with input 32 ...
Benchmarking fast_ascii with input 64 ...
Benchmarking fast_ascii with input 128 ...
Benchmarking fast_ascii with input 256 ...
Benchmarking fast_ascii with input 512 ...
Benchmarking fast_ascii with input 1024 ...
Benchmarking fast_ascii with input 2048 ...
Benchmarking fast_ascii with input 4096 ...
Benchmarking stock with input 1 ...
Benchmarking stock with input 4 ...
Benchmarking stock with input 8 ...
Benchmarking stock with input 16 ...
Benchmarking stock with input 32 ...
Benchmarking stock with input 64 ...
Benchmarking stock with input 128 ...
Benchmarking stock with input 256 ...
Benchmarking stock with input 512 ...
Benchmarking stock with input 1024 ...
Benchmarking stock with input 2048 ...
Benchmarking stock with input 4096 ...

##### With input 1 #####
Name                 ips        average  deviation         median         99th %
stock             2.33 M      429.50 ns  ±8326.31%         333 ns         500 ns
fast_ascii        1.75 M      571.29 ns  ±7050.58%         417 ns        1375 ns

Comparison:
stock             2.33 M
fast_ascii        1.75 M - 1.33x slower +141.79 ns

Memory usage statistics:

Name          Memory usage
stock              0.95 KB
fast_ascii         1.21 KB - 1.28x memory usage +0.27 KB

**All measurements for memory usage were the same**

##### With input 4 #####
Name                 ips        average  deviation         median         99th %
stock             2.27 M      440.72 ns  ±8391.93%         333 ns         500 ns
fast_ascii        1.70 M      588.49 ns  ±5044.09%         458 ns        1375 ns

Comparison:
stock             2.27 M
fast_ascii        1.70 M - 1.34x slower +147.77 ns

Memory usage statistics:

Name          Memory usage
stock              0.95 KB
fast_ascii         1.21 KB - 1.28x memory usage +0.27 KB

**All measurements for memory usage were the same**

##### With input 8 #####
Name                 ips        average  deviation         median         99th %
stock             2.22 M      449.63 ns  ±8262.54%         333 ns         500 ns
fast_ascii        1.75 M      571.52 ns  ±6966.38%         417 ns        1417 ns

Comparison:
stock             2.22 M
fast_ascii        1.75 M - 1.27x slower +121.89 ns

Memory usage statistics:

Name          Memory usage
stock              0.95 KB
fast_ascii         1.21 KB - 1.28x memory usage +0.27 KB

**All measurements for memory usage were the same**

##### With input 16 #####
Name                 ips        average  deviation         median         99th %
stock             2.12 M      472.24 ns  ±8088.62%         375 ns         542 ns
fast_ascii        1.74 M      575.32 ns  ±6022.48%         417 ns        1375 ns

Comparison:
stock             2.12 M
fast_ascii        1.74 M - 1.22x slower +103.08 ns

Memory usage statistics:

Name          Memory usage
stock              0.95 KB
fast_ascii         1.21 KB - 1.28x memory usage +0.27 KB

**All measurements for memory usage were the same**

##### With input 32 #####
Name                 ips        average  deviation         median         99th %
stock             1.91 M      523.51 ns  ±7420.64%         417 ns         583 ns
fast_ascii        1.69 M      591.28 ns  ±4932.73%         458 ns        1416 ns

Comparison:
stock             1.91 M
fast_ascii        1.69 M - 1.13x slower +67.78 ns

Memory usage statistics:

Name          Memory usage
stock              0.95 KB
fast_ascii         1.21 KB - 1.28x memory usage +0.27 KB

**All measurements for memory usage were the same**

##### With input 64 #####
Name                 ips        average  deviation         median         99th %
fast_ascii        1.74 M      575.27 ns  ±5064.74%         417 ns        1416 ns
stock             1.69 M      591.09 ns  ±4528.66%         500 ns         667 ns

Comparison:
fast_ascii        1.74 M
stock             1.69 M - 1.03x slower +15.82 ns

Memory usage statistics:

Name          Memory usage
fast_ascii         1.21 KB
stock              0.95 KB - 0.78x memory usage -0.26563 KB

**All measurements for memory usage were the same**

##### With input 128 #####
Name                 ips        average  deviation         median         99th %
fast_ascii        1.63 M      612.55 ns  ±4946.12%         459 ns        1333 ns
stock             1.33 M      751.79 ns  ±3756.57%         666 ns         833 ns

Comparison:
fast_ascii        1.63 M
stock             1.33 M - 1.23x slower +139.23 ns

Memory usage statistics:

Name          Memory usage
fast_ascii         1.21 KB
stock              0.95 KB - 0.78x memory usage -0.26563 KB

**All measurements for memory usage were the same**

##### With input 256 #####
Name                 ips        average  deviation         median         99th %
fast_ascii        1.56 M        0.64 μs  ±4380.88%        0.50 μs        0.79 μs
stock             0.93 M        1.07 μs  ±2111.72%        0.96 μs        1.17 μs

Comparison:
fast_ascii        1.56 M
stock             0.93 M - 1.67x slower +0.43 μs

Memory usage statistics:

Name          Memory usage
fast_ascii         1.21 KB
stock              0.95 KB - 0.78x memory usage -0.26563 KB

**All measurements for memory usage were the same**

##### With input 512 #####
Name                 ips        average  deviation         median         99th %
fast_ascii        1.45 M        0.69 μs  ±4132.72%        0.54 μs        0.79 μs
stock             0.57 M        1.76 μs  ±1141.26%        1.63 μs        1.83 μs

Comparison:
fast_ascii        1.45 M
stock             0.57 M - 2.54x slower +1.07 μs

Memory usage statistics:

Name          Memory usage
fast_ascii         1.21 KB
stock              0.95 KB - 0.78x memory usage -0.26563 KB

**All measurements for memory usage were the same**

##### With input 1024 #####
Name                 ips        average  deviation         median         99th %
fast_ascii        1.18 M        0.85 μs  ±3472.48%        0.71 μs        0.88 μs
stock            0.194 M        5.15 μs  ±2568.66%        3.04 μs           5 μs

Comparison:
fast_ascii        1.18 M
stock            0.194 M - 6.09x slower +4.30 μs

Memory usage statistics:

Name          Memory usage
fast_ascii         1.21 KB
stock              0.95 KB - 0.78x memory usage -0.26563 KB

**All measurements for memory usage were the same**

##### With input 2048 #####
Name                 ips        average  deviation         median         99th %
fast_ascii        1.01 M        0.99 μs  ±2266.22%        0.88 μs        1.17 μs
stock            0.169 M        5.93 μs   ±239.39%        5.54 μs       17.62 μs

Comparison:
fast_ascii        1.01 M
stock            0.169 M - 5.96x slower +4.94 μs

Memory usage statistics:

Name          Memory usage
fast_ascii         1.21 KB
stock              0.95 KB - 0.78x memory usage -0.26563 KB

**All measurements for memory usage were the same**

##### With input 4096 #####
Name                 ips        average  deviation         median         99th %
fast_ascii      699.67 K        1.43 μs  ±1280.99%        1.33 μs        1.54 μs
stock            92.49 K       10.81 μs    ±75.20%       10.62 μs       12.16 μs

Comparison:
fast_ascii      699.67 K
stock            92.49 K - 7.56x slower +9.38 μs

Memory usage statistics:

Name          Memory usage
fast_ascii         1.21 KB
stock              0.95 KB - 0.78x memory usage -0.26563 KB

**All measurements for memory usage were the same**

@sabiwara
Contributor

Hi @mtrudel! These optimizations look quite promising 🤩

I was thinking of an alternative where we wouldn't need to introduce an extra mode argument, by just optimistically running the ASCII-only loop first and switching to the slower loop on the first mismatch (basically, ASCII until proven otherwise). The obvious downside is that it won't try to optimize mixed inputs that might still contain a lot of ASCII.
The benefit would be that the user doesn't need to be concerned about it, and it should hopefully pick a reasonable strategy for both ASCII and mixed inputs.

  # optimistic loop, able to process big ASCII-only binaries very fast
  def valid?(<<a::56, rest::bits>>) when Bitwise.band(0x80808080808080, a) == 0 do
    valid?(rest)
  end

  # slower loop for other cases
  def valid?(other) when is_binary(other), do: valid_non_only_ascii?(other)

  defp valid_non_only_ascii?(<<_::utf8, rest::bits>>), do: valid_non_only_ascii?(rest)
  defp valid_non_only_ascii?(<<>>), do: true
  defp valid_non_only_ascii?(_), do: false

These early benchmarks look promising, especially with inlining, but I didn't check with various inputs and haven't installed OTP26.

WDYT?

@mtrudel
Contributor Author

mtrudel commented Jan 24, 2023

Hi @mtrudel! These optimizations look quite promising 🤩

I was thinking of an alternative where we wouldn't need to introduce an extra mode argument, by just optimistically running the ASCII-only loop first and switching to the slower loop on the first mismatch (basically, ASCII until proven otherwise). The obvious downside is that it won't try to optimize mixed inputs that might still contain a lot of ASCII. The benefit would be that the user doesn't need to be concerned about it, and it should hopefully pick a reasonable strategy for both ASCII and mixed inputs.

  # optimistic loop, able to process big ASCII-only binaries very fast
  def valid?(<<a::56, rest::bits>>) when Bitwise.band(0x80808080808080, a) == 0 do
    valid?(rest)
  end

  # slower loop for other cases
  def valid?(other) when is_binary(other), do: valid_non_only_ascii?(other)

  defp valid_non_only_ascii?(<<_::utf8, rest::bits>>), do: valid_non_only_ascii?(rest)
  defp valid_non_only_ascii?(<<>>), do: true
  defp valid_non_only_ascii?(_), do: false

These early benchmarks look promising, especially with inlining, but I didn't check with various inputs and haven't installed OTP26.

WDYT?

That's a really interesting idea! I like it, but there are three aspects of it that may preclude it:

  1. I suspect a lot of strings are almost entirely ASCII, with only a few non-ASCII characters (such as serialized JSON where the only non-ASCII content is a diacritic or two in a 'name' field, or a comment field with a single emoji in it). The approach you outline would bail after encountering the first such character, which would narrow the benefit to truly all-ASCII strings. By way of illustration, passing your comment through this version of the function would have caused validation to switch to the slow path on the first line (this one too: 🤣). See the snippet after this list for a concrete illustration.

  2. There's an aspect of non-determinism to the approach that feels out of place in an 'explicit is better than implicit' worldview (even more so in a standard library function). Users may find that runtimes vary wildly on seemingly similar inputs, and it may be difficult to document clearly which kinds of input hit which path and what runtime to expect.

  3. The runtime of this approach on earlier OTPs is generally worse than that of the existing implementation for all inputs. Without giving users a switch, we'd be degrading performance for a sizeable set of users.
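
To make point 1 concrete, here is a made-up payload (illustrative values only, not from the benchmark above):

  # Mostly ASCII, with a single non-ASCII character near the start.
  payload = "José says: " <> String.duplicate("a", 10_000)

  # :fast_ascii drops to the per-codepoint clause only around the "é" and then
  # resumes the 56-bit chunk loop; the optimistic variant above would validate
  # the remaining ~10,000 bytes one codepoint at a time.
  String.valid?(payload, :fast_ascii)
  #=> true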

Happy to hew either way here, depending on how others feel!

@sabiwara
Contributor

@mtrudel thank you for the detailed answers; these are all great points.

  1. Indeed, I see how it would fail to optimize a lot of legit cases (good job using my own comment to convince me 😂)
  2. Great point; I can see the determinism argument applying to :fast_ascii too, actually. Engineers might assume :fast_ascii makes sense after an early benchmark using English text, and later get degraded performance when non-English speakers start using the software and JSON payloads fill up with large chunks of non-ASCII text (Arabic, Chinese...). Maybe if the input can contain any non-ASCII at all (e.g. user input), we should assume there could be a lot of it?
  3. I see. Maybe this could be addressed by adding a compile-time check on System.otp_release(), I suppose? (rough sketch below)
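
Something along these lines is what I have in mind; a rough, untested sketch with a made-up module name, not part of this patch:

  defmodule MaybeFastValid do
    # Evaluated when the module compiles: only emit the 56-bit fast clause on
    # OTP >= 26, where (per the benchmarks above) it is actually a win.
    if String.to_integer(System.otp_release()) >= 26 do
      def valid?(<<chunk::56, rest::bits>>) when Bitwise.band(chunk, 0x80808080808080) == 0,
        do: valid?(rest)
    end

    # On older OTPs the module compiles down to just the plain per-codepoint loop.
    def valid?(<<_::utf8, rest::bits>>), do: valid?(rest)
    def valid?(<<>>), do: true
    def valid?(_), do: false
  end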

Happy to hew either way here, depending on how others feel!

Same for me, I don't have any strong opinion, just wanted to share the idea.

@josevalim
Member

I think this patch is good to go, thank you! One last concern: if the Erlang/OTP team decides to accept an optimized UTF-8 validation (from the simdutf8 library), then this patch may become pointless. According to the paper, simdutf8 is several times faster than the ASCII check implemented in C. So my suggestion is to keep this PR around until Erlang/OTP 26 is out. :)

@dvic
Contributor

dvic commented Jan 25, 2023

Just dropping my 2 cents regarding the naming of the argument mode: even though the documentation says that the validation result should be the same, mode makes it sound like it changes the behaviour and not the implementation. Not a huge deal, but maybe a better name for the argument is algorithm?

@mtrudel
Contributor Author

mtrudel commented Jan 25, 2023

Not a huge deal, but maybe a better name for the argument is algorithm?

I'd taken the naming from String.downcase/2, though as you mention that actually changes the behaviour, not just the implementation. I'll update it here.

@mtrudel
Contributor Author

mtrudel commented Jan 25, 2023

if the Erlang/OTP team decide to accept an optimized UTF-8 validation (from the simdutf8 library), then this patch may be pointless.

Agreed! The absolute best approach here would be to have a :unicode.valid?/1 function backed by a native implementation such as simdutf8 (or, frankly, even any of the other implementations as a pure-NIF). If / when that happens (I'm still planning on working it up and submitting it upstream, but it's not going to happen 'soon'), we'd still want to keep these versions around as long as Elixir supports Erlang/OTP versions that predate its addition. So I don't think this work is wasted in the meantime.

So my suggestion is to keep this around until Erlang/OTP 26 is out. :)

Sure! Their release milestone doesn't mention anything about it, but seeing as this work is of no benefit on earlier versions, there's no real rush. Whatever you think is easiest!

@josevalim
Member

Oh, I thought you were planning to submit a PR with simdutf8 for Erlang/OTP 26. I typically do a draft PR, only to show the numbers, and if they approve it I tidy everything up with tests, docs, and so on. But no worries.

@mtrudel
Contributor Author

mtrudel commented Jan 25, 2023

Oh, I thought you were planning to submit a PR with simdutf8 for Erlang/OTP 26. I typically do a draft PR, only to show the numbers, and if they approve it I tidy everything up with tests, docs, and so on. But no worries.

I very much do plan to submit such a PR, but between everything I'm trying to get done for ElixirConf EU (big news on the Bandit front!) and my real job 😄, I'm not going to be able to do it on that timeline. I'm hoping, roughly, to get this done May-ish.


Note that the `:fast_ascii` algorithm does not affect correctness, you can expect the output of
`String.valid?/2` to be the same regardless of algorithm. The only difference to be expected is
one of performance, which can be expected to improve roughly quadratically in string length
Contributor

improve roughly quadratically

I'm struggling a bit to understand this one, given that both algorithms are linear in string length.
I would have assumed the ratio to be capped by a maximum constant value?
Sorry if I misunderstood.

Contributor Author

That's a great point! I messed up here by taking an engineering approach ('the graph's fitting curve is quadratic!') rather than a first principles approach (in which both approaches are obviously linear by inspection, as you correctly point out).

My error was basically failing 'chart literacy 101'. Here's the chart that I based my conclusion on:

[chart omitted: the benchmark data plotted against the doubling input sizes, i.e. a non-linear x axis, fit with a quadratic trend line]

Looks quadratic, is fit well by a quadratic trend line, so it must be quadratic. That's as far as my thought process went. But look at the x axis! It's not linear! If I graph the same data as a proper scatter plot (i.e. on a linear x axis), we get:

[chart omitted: the same data replotted as a scatter plot on a linear x axis, with a linear fit]

which fits a linear trend reasonably well (it doesn't look like a great fit visually, but the R value is pretty strong).

So yeah. The improvement is actually linear as you suspect.

I'll update the docs to reflect this. Good catch!

Contributor

Thank you for clarifying ❤️
Based on theory and the new graph, I wonder if we won't see a glass ceiling appear if we add a couple of orders of magnitude to the string length?
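
My rough mental model (back-of-the-envelope, with assumed constants, nothing measured):

$$\frac{T_{\text{default}}(n)}{T_{\text{fast\_ascii}}(n)} \approx \frac{c_d\,n + k_d}{c_f\,n + k_f} \longrightarrow \frac{c_d}{c_f} \quad (n \to \infty)$$

where c_d and c_f are the per-byte costs of the per-codepoint and 56-bit loops and k_d, k_f are fixed per-call overheads. The ratio climbs while the fixed overheads dominate and should flatten out at roughly c_d / c_f once the linear terms take over, so the ~7.6x observed at 4096 bytes may already be on the approach to that plateau on this hardware.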

Co-authored-by: peter madsen <petermm@gmail.com>
@mtrudel
Contributor Author

mtrudel commented Feb 10, 2023

Is this issue deadlocked? Just to be clear, from my perspective this is ready to go. Happy to do more work here if there's something missing...

josevalim merged commit d83c57f into elixir-lang:main on Feb 10, 2023
@josevalim
Member

💚 💙 💜 💛 ❤️

@mtrudel
Contributor Author

mtrudel commented Mar 8, 2023

Note to future other humans interested in this: http://0x80.pl/notesen/2023-03-06-swar-find-any.html

@codeadict

Is adding the simdutf8 algorithm to OTP still desired? I can dedicate some time to this.

@mtrudel
Contributor Author

mtrudel commented Feb 2, 2024

Is adding the simdutf8 algorithm to OTP still desired? I can dedicate some time to this

Yes! That would be extremely welcome!
