ugrep can run faster by refactoring the search logic to break up the large code block in advance() into separate functions that get called quicker e.g. by a switch or function pointer to skip conditionals. Breaking up this large function helps the compiler a lot to optimize this code better than having to analyze a large function body.
A bit of experimentation shows significant speed improvements are attainable on ARM64 NEON at least. So it is worth the effort to refactor this code that is not fully optimized by the compiler.
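To illustrate the kind of refactoring described here, below is a minimal sketch, not ugrep's actual code. The names `ToyMatcher`, `advance_none`, `advance_char`, `advance_string` and `kNoMatch` are hypothetical. The idea is to pick one small specialized routine when the pattern is set up and call it through a member function pointer, so the hot path no longer re-evaluates the same conditionals inside one large function body.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical sentinel meaning "no candidate match found".
constexpr std::size_t kNoMatch = static_cast<std::size_t>(-1);

// Hypothetical toy matcher for a literal needle; it only illustrates dispatch.
class ToyMatcher {
 public:
  explicit ToyMatcher(const std::string& needle) : needle_(needle) {
    // Select the specialized routine once, based on pattern characteristics.
    if (needle_.empty())
      adv_ = &ToyMatcher::advance_none;
    else if (needle_.size() == 1)
      adv_ = &ToyMatcher::advance_char;
    else
      adv_ = &ToyMatcher::advance_string;
  }

  // Returns the offset of the next match at or after pos, or kNoMatch.
  std::size_t advance(const char* buf, std::size_t len, std::size_t pos) const {
    assert(pos <= len);
    return (this->*adv_)(buf, len, pos);  // one indirect call, no branching
  }

 private:
  // Empty needle: every position matches.
  std::size_t advance_none(const char*, std::size_t, std::size_t pos) const {
    return pos;
  }

  // Single-character needle: delegate to memchr.
  std::size_t advance_char(const char* buf, std::size_t len, std::size_t pos) const {
    if (pos >= len) return kNoMatch;
    const void* p = std::memchr(buf + pos, needle_[0], len - pos);
    return p ? static_cast<std::size_t>(static_cast<const char*>(p) - buf) : kNoMatch;
  }

  // Longer needle: memchr for the first character, then memcmp to verify.
  std::size_t advance_string(const char* buf, std::size_t len, std::size_t pos) const {
    const std::size_t m = needle_.size();
    while (pos + m <= len) {
      const void* p = std::memchr(buf + pos, needle_[0], len - pos - m + 1);
      if (!p) return kNoMatch;
      pos = static_cast<std::size_t>(static_cast<const char*>(p) - buf);
      if (std::memcmp(buf + pos, needle_.data(), m) == 0) return pos;
      ++pos;
    }
    return kNoMatch;
  }

  std::string needle_;
  std::size_t (ToyMatcher::*adv_)(const char*, std::size_t, std::size_t) const;
};
```

Each specialized routine is small enough for the compiler to analyze on its own, which is the effect the refactoring aims for. Whether a function pointer or a `switch` is faster in practice depends on the compiler and target, so it would need measuring.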
Even adding a dummy printf() statement runs the code faster (!) despite the overhead of IO. So yeah, compiler optimizations aren't kicking in as much as I want them to at the moment. On a more serious note, this is not new to me. I taught several years of graduate-level high-performance computing. I will more closely follow (my own) advice with the next release cycles. It's just work, not difficult to do.
With these optimizations and omitting line counting when possible, such as for option -c, when searching a 13GB file we can go from
$ time ugrep -c rol en.txt
1171415
        4.54 real         2.86 user         1.40 sys
to a much lower timing
$ time ugrep -c rol en.txt
1171415
        2.40 real         0.83 user         1.39 sys
which runs about 90% faster on AArch64/NEON (4.54 s down to 2.40 s real time, roughly 1.9x). Other search options will see speedups anywhere from 20% to 100% on AArch64/NEON. Because the compiler's register allocation, instruction scheduling and alias analysis are improved, I expect these changes will also speed up searching with SSE2/AVX2. A quick test confirms this, with the same runs on Intel macOS giving a 15% speedup, and a 90% speedup when searching for the word the.
Now I have to find time to work on this. Stay tuned!
OK, implemented and mostly tested over the weekend. Still some work to do. The executable is not larger, but faster. This update will be a lot faster on ARM devices that support NEON and AArch64.
updated SIMD algorithms
improved selection and specialization based on pattern characteristics
faster line counting, especially NEON/AArch64 is now super fast with new vector code that I came up with, including a fast alternative for vaddvq_s8 for horizontal vector addition on NEON
fix an obscure pattern match bug I found today in testing using a large generative test set I wrote some time ago to hit ugrep hard (that's how I found a bug in rg which I mention in one of my articles)
This shows that ugrep is one of the fastest grep tools. Please note that no grep can (or should) claim to always be the fastest, because different algorithms are involved, each with pros and cons.
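As a rough illustration of vectorized line counting, here is a minimal sketch that counts newline bytes 16 at a time on AArch64/NEON and falls back to a scalar loop elsewhere. This is not the vector code from the update: the function name `count_newlines` is hypothetical, and the horizontal add uses the standard `vaddlvq_u8` intrinsic rather than the faster alternative mentioned in the list above, which is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: counts '\n' bytes in s[0..n).
std::size_t count_newlines(const char* s, std::size_t n) {
  assert(s != nullptr || n == 0);
  std::size_t count = 0;
  std::size_t i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
  const uint8x16_t nl = vdupq_n_u8('\n');
  while (i + 16 <= n) {
    // Per-lane 8-bit counters; flush before any lane can exceed 255.
    uint8x16_t acc = vdupq_n_u8(0);
    for (int blocks = 0; blocks < 255 && i + 16 <= n; ++blocks, i += 16) {
      uint8x16_t v = vld1q_u8(reinterpret_cast<const uint8_t*>(s + i));
      uint8x16_t eq = vceqq_u8(v, nl);  // 0xFF in lanes that hold '\n'
      acc = vsubq_u8(acc, eq);          // subtracting 0xFF adds 1 (mod 256)
    }
    count += vaddlvq_u8(acc);           // widening horizontal add of 16 lanes
  }
#endif
  // Scalar tail, and the whole input on non-NEON targets.
  for (; i < n; ++i) count += (s[i] == '\n');
  return count;
}
```

Keeping the per-lane counters in 8-bit lanes and doing the horizontal add only once per 255 blocks keeps the expensive reduction out of the inner loop, which is the usual reason a cheaper horizontal addition matters for this kind of code.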