Add SIMD escape #160

zzau13 · 2018-11-16T02:00:22Z

askama_escape/src/lib.rs

zzau13 · 2018-11-17T18:58:43Z

Wait alignment solution, please comment unsafe solution.

djc · 2018-11-17T19:03:11Z

So I was thinking about this last night and I think we should separate out the long and short benchmarks so we can compare the effect of using AVX2 instructions on them separately (based on my reading, I think AVX2 stuff might actually slow down short string escaping). Can you do that? I'll also want to chew a bit on this code since I myself don't have any experience with AVX2 and I want to really understand it before we merge this.

Also I think next we might want to explore searching based on the PCMPESTRI instruction from SSE 4.2, which might be faster than AVX2 according to what's described here.

zzau13 · 2018-11-17T19:42:51Z

Ok, I do the separation in the bench.
picohttpparser is implemented with that I look at it.

djc · 2018-11-21T14:55:31Z

So, high level conclusions just from looking at your benchmark data:

The SSE implementation doesn't really pay for itself in terms of complexity / performance
AVX2 is faster even for the shortest strings you've tested (5 characters)
AVX2 is only neglibly faster if there's not a lot to escape

Very good stuff! In light of these findings, I'd like to change the benchmarks a little:

Lengths of 1 bytes, 10 bytes, and 5 MB (to have more granularity about smaller strings)
At 10 bytes and 5 MB, add strings that have 3% escaped characters

My question here is if the AVX2 stuff is even worth having, or if it's really only effective at levels of escaped characters that are unlikely to come up in the wild.

(I'm really sorry if we end up not using some of this painstakingly developed code, but I also want to be realistic about the complexity it adds and whether it's worth it in terms of real-world performance.)

djc · 2018-11-21T14:56:11Z

Also, can you squash the code prettification from the last commit into the related earlier commits?

zzau13 · 2018-11-21T21:25:48Z

If I always finish squash, if I leave the commit is so you can see the changes easily.

As I mentioned, it is less effective for string length less than the instruction, 16 or 32.
The result of the benchs: vectorized is log constant as a function of to_string(), so it is much more convenient for a production environment than the scalar.

If the e2e would stop escaping numbers or relocate string, duplicated to_string(), is very difficult to have the pointer aligned, would go much faster. But we need to reimplement From of the generic type Display for Num, String, str... with rust-lang/rust#31844
Let's put a bench of a forum or blog and the result will be spectacular. But in these cases it is not appreciated, it penalizes as I told you before starting. As I said, the easiest thing is to make an template operator to mark long strings, than more 10-16 characters.

My opinion is that we should improve MarkupDisplay and limit the extreme cases.

The benchs are already uploaded.
I do not know what happens to AppVeyor.

zzau13 · 2018-11-22T14:02:40Z

We are comparing the efficiency between a truck and a motorcycle to carry a candy.

It does not make any sense and it is also that this result warned by gitter when you asked me. Now you can not tell that all the work does not work because what I said has happened.

djc · 2018-11-22T14:20:57Z

Yes, you were right. Just to make sure I understand, are you angry and/or frustrated with me for this?

zzau13 · 2018-11-22T15:52:11Z

I'm not angry nor frustrated, but is very unefficient that after talking and knowing about that possible issue you comment on it like a problem or something unexpected.

zzau13 · 2018-11-22T18:09:51Z

Wtf, Bill, what do we do with that memory? I have icons but not in my memory, perfect.

djc · 2018-11-22T18:51:31Z

Wtf, Bill, what do we do with that memory? I have icons but not in my memory, perfect.

Huh? Not sure what you mean.

zzau13 · 2018-11-22T18:59:00Z

In Windows, AppVeyor, it gives an equality error in two strings with the same bytes. But I still have 4k icons and these useful things. I was messing with Windows. I removed the simd support to windows.

zzau13 · 2018-11-23T00:26:35Z

Well, what do you want to do with this pull request?
For my part, I do not have any more changes.

djc · 2018-11-23T07:06:43Z

I'm not angry nor frustrated, but is very unefficient that after talking and knowing about that possible issue you comment on it like a problem or something unexpected.

It's not a problem per se, but I still didn't know what to expect. That might be because I don't have a lot of experience doing actual performance optimization, even though I know quite a bit about what kinds of things can cause performance issues. So I had expected that AVX2 would be substantially faster at large strings even if there wasn't a lot to escape. Now that it isn't, I'm no longer sure if there's value in it.

What do you think we should do? Do you think adding the SSE and AVX versions of the code will lead to better performance on real-world workloads? For that matter, have you tried and compared this branch on an actual project you've used Askama for?

My inclination is to see if I can figure out a better benchmark, in the sense that it should be a more reflective mix of longer and shorter strings as well as the relative frequencies of characters from the different relative spaces. For example, it occurs to me now with the benchmark results we have that the trick of ignoring byte values from outside the range of escapable characters is very effective, but that means that digits and non-escaped interpunction will have a different effect on the benchmark results -- yet, the current benchmark data does not contain this class of character at all.

zzau13 · 2018-11-23T19:32:20Z

When I do not like sse, but the avx if I like it. It is still very green and does not give the expected performance.

I will continue investigating vectorized operations with @tizgafa and later we will try again on askama.

Viewing the results, prioritize askama in the recognition of elements not escapeable and the heuristic of output size.

How about?

zzau13 · 2018-11-23T22:26:53Z

I close it. The experimentation of the escape with simd a new repo. In addition to Askama, we will devote ourselves to implement all kinds of algorithms with this tool, in case you want to raise another algorithms.

djc reviewed Nov 16, 2018

View reviewed changes

askama_escape/src/lib.rs Outdated Show resolved Hide resolved

zzau13 changed the title ~~Add avx escape~~ Add SIMD escape Nov 19, 2018

bott added 3 commits November 22, 2018 18:04

Add sized cases in escape bench

c35091c

Add test cases in askama_escape

f187b51

Add avx2 escape

f0bfd73

Add sse4.2 escape

a900a6f

zzau13 closed this Nov 23, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add SIMD escape #160

Add SIMD escape #160

zzau13 commented Nov 16, 2018 •

edited

Loading

zzau13 commented Nov 17, 2018

djc commented Nov 17, 2018

zzau13 commented Nov 17, 2018

djc commented Nov 21, 2018

djc commented Nov 21, 2018

zzau13 commented Nov 21, 2018 •

edited

Loading

zzau13 commented Nov 22, 2018

djc commented Nov 22, 2018

zzau13 commented Nov 22, 2018

zzau13 commented Nov 22, 2018

djc commented Nov 22, 2018

zzau13 commented Nov 22, 2018

zzau13 commented Nov 23, 2018

djc commented Nov 23, 2018

zzau13 commented Nov 23, 2018

zzau13 commented Nov 23, 2018

Add SIMD escape #160

Add SIMD escape #160

Conversation

zzau13 commented Nov 16, 2018 • edited Loading

zzau13 commented Nov 17, 2018

djc commented Nov 17, 2018

zzau13 commented Nov 17, 2018

djc commented Nov 21, 2018

djc commented Nov 21, 2018

zzau13 commented Nov 21, 2018 • edited Loading

zzau13 commented Nov 22, 2018

djc commented Nov 22, 2018

zzau13 commented Nov 22, 2018

zzau13 commented Nov 22, 2018

djc commented Nov 22, 2018

zzau13 commented Nov 22, 2018

zzau13 commented Nov 23, 2018

djc commented Nov 23, 2018

zzau13 commented Nov 23, 2018

zzau13 commented Nov 23, 2018

zzau13 commented Nov 16, 2018 •

edited

Loading

zzau13 commented Nov 21, 2018 •

edited

Loading