make it easier to maintain fork with additional regex engines #1488
Thanks very much for filing this issue. I'm glad you did before doing more work. […] So firstly, the regex-geek side of me finds this very exciting. I've always […] OK, so I'm just going to start by responding to a few points.
While true, it can be trivially increased via
Oh interesting. Is
Hmm. Did this work well? It seems like this should be the right way to go..?
This is a good idea.
Ah yeah, that sounds annoying. Unfortunately, this is correct and intended. In general, the […]
Your TODO list looks like a great start. There might be more things, e.g.,
My guess is that Hyperscan doesn't have Unicode mode enabled by default, where
Looks great! I'm impressed if this is your first Rust project. I don't see
Right, so... Unfortunately, I'm not sure it's a great idea. I think it would
I would really hate to see a world that didn't combine ripgrep with Hyperscan.
Thanks for the answer!
Hah, good idea, I missed that option. Increasing the size limit does allow the regexps to compile, but it just makes performance worse (at least 2 times worse; I stopped the process after a while).
Yes, this is unfortunate. Since the goal of hyperscan is mainly to detect whether a payload is malicious or not, it makes sense not to care about SOM... but not in our case.
Hah, fair enough, I didn't see it that way. I'll roll the crate back to using callback return values then. At this speed the real bottleneck is the disk read speed anyway, so it won't make that huge of a difference!
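For reference, Hyperscan's C API reports matches through a callback, and returning a non-zero value from that callback terminates the scan (`hs_scan` then returns `HS_SCAN_TERMINATED`). A rough std-only model of that contract, with all names made up for illustration:

```rust
// Sketch only: models Hyperscan's match-callback contract, not its API.
// The C API invokes a callback per match and stops scanning as soon as
// the callback returns non-zero.

/// A fake "scan" over precomputed match offsets. Returns how many
/// matches were reported before the callback asked to stop.
fn scan_with_callback<F>(matches: &[(usize, usize)], mut on_match: F) -> usize
where
    F: FnMut(usize, usize) -> i32, // non-zero return value stops the scan
{
    let mut reported = 0;
    for &(from, to) in matches {
        reported += 1;
        if on_match(from, to) != 0 {
            break; // mirrors hs_scan returning HS_SCAN_TERMINATED
        }
    }
    reported
}

fn main() {
    let matches = [(0, 3), (5, 9), (12, 20)];

    // find_at-style use: stop after the first match.
    let mut first: Option<(usize, usize)> = None;
    let n = scan_with_callback(&matches, |from, to| {
        first = Some((from, to));
        1 // stop immediately
    });
    assert_eq!(n, 1);
    assert_eq!(first, Some((0, 3)));

    // Collect everything by always returning 0.
    let mut all = Vec::new();
    let n = scan_with_callback(&matches, |from, to| {
        all.push((from, to));
        0
    });
    assert_eq!(n, 3);
    println!("ok");
}
```

The trade-off discussed above follows from this: stopping at the first match means re-entering the scan once per match, while letting the scan run collects everything in one pass.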
Yes, this is indeed better, thanks (like the old […]
I'm disappointed by this answer! But I understand you; maintaining free software is a thankless task. I could maintain a patchset, but in the current state of things it's going to be messy. A good compromise would be to have the code infrastructure to handle tasks 1 and 2 (without the hyperscan bits) merged in the main project first, and then have a maintainable patch.
Interesting. As a sanity check, did you also set
Task 1 sounds good. I think the values should be […] For the patchset, I think it would be helpful to try to put as much of the Hyperscan bits in a separate file as possible, and as little as you can in […] Anyway, good luck. I'm happy to accept PRs that will make maintaining a patchset easier, so long as they aren't too disruptive. I do really want to support your use case because I think it's awesome to integrate ripgrep and Hyperscan.
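One common way to keep such a patch surface small (this is a generic sketch, not ripgrep's actual internals) is to route engine selection through a single dispatch point, with the extra engine behind a Cargo feature; the `EngineChoice` type and `hyperscan` feature name here are hypothetical:

```rust
// Illustrative sketch: a single dispatch point for engine selection.
// With the (hypothetical) "hyperscan" Cargo feature disabled, the extra
// variant and match arm compile out entirely, so the diff against
// upstream stays confined to one spot plus a separate module.

#[derive(Debug, PartialEq)]
enum EngineChoice {
    Default,
    Pcre2,
    #[cfg(feature = "hyperscan")]
    Hyperscan,
}

fn parse_engine(s: &str) -> Result<EngineChoice, String> {
    match s {
        "default" => Ok(EngineChoice::Default),
        "pcre2" => Ok(EngineChoice::Pcre2),
        #[cfg(feature = "hyperscan")]
        "hyperscan" => Ok(EngineChoice::Hyperscan),
        other => Err(format!("unrecognized engine: {}", other)),
    }
}

fn main() {
    assert_eq!(parse_engine("default"), Ok(EngineChoice::Default));
    assert_eq!(parse_engine("pcre2"), Ok(EngineChoice::Pcre2));
    // Without the feature enabled, the extra engine simply doesn't
    // exist, so upstream behavior is unchanged.
    assert!(parse_engine("hyperscan").is_err());
    println!("ok");
}
```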
Sorry for the response delay.
This highly depends on the regexps used, I guess. In my tests, I guess it's just more efficient to send to the default regexp engine
I'll send a MR for task 1 in a couple of days, thanks. For task 2, what was worrying me was the […] I guess I'll just do in the patch a hackish […]
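One hackish heuristic for letting a single `-f` flag accept either a text pattern list or a serialized binary database might look like the sketch below. This is purely illustrative; a real implementation would more likely just attempt `hs_deserialize_database` and fall back to treating the file as text on error:

```rust
// Hypothetical sketch: distinguish a plain-text pattern list from a
// serialized (binary) database by peeking at the bytes. Text pattern
// files are newline-separated valid UTF-8 with no NUL bytes; serialized
// databases are binary and fail this check almost immediately.

fn looks_like_pattern_list(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).map_or(false, |s| !s.contains('\0'))
}

fn main() {
    // A small list of text regexps passes the check.
    assert!(looks_like_pattern_list(b"foo.*bar\n\\w+@example\\.com\n"));
    // Binary junk (stand-in for a compiled database) fails it.
    assert!(!looks_like_pattern_list(&[0x00, 0xde, 0xad, 0xbe, 0xef]));
    println!("ok");
}
```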
Yes, thanks indeed.
Please do not do this. It's undefined behavior to create a […] I see what you mean, though, and how that could be annoying. I'm trying to think of an easy way to maintain a patch here. Hmmm... There is definitely a fairly strong assumption that […]
This is in preparation for adding a new --engine flag which is intended to eventually supplant --auto-hybrid-regex. While there are no immediate plans to add more regex engines to ripgrep, this is intended to make it easier to maintain a patch to ripgrep with an additional regex engine. See #1488 for more details.
Thanks!
For future reference, the patchset is here: https://git.sr.ht/~pierrenn/ripgrep
I needed a CLI tool to parse a massive amount of regexps (and improve my Rust at the same time), so I made a crate implementing the `Matcher` trait: https://git.sr.ht/~pierrenn/grep-hyperscan

From my (sporadic) tests, it starts to be useful when you have at least 1000 regexps to parse more than 10GB of data. I had 4.5k on 150GB, so... Since the data comes from disk reads, using `hyperscan` basically limits your speed to your disk speed. Plus there is a limit to the size of the compiled expressions in `ripgrep`, so using `hyperscan` allows bypassing that.

Ideally it would be cool if it could be integrated in `ripgrep`. I preferred to open this issue before doing a PR to talk about it and gauge interest. Details are below.

Implementation
It's just an implementation of `find_at`, since `hyperscan` doesn't support groups (`new_captures` is implemented using `NoCaptures`). I thought of 3 possible ways to implement `find_at`:

- `hyperscan` has a `HS_FLAG_SINGLEMATCH` flag which would be great for `find_at`. However, it is incompatible with the flag `HS_FLAG_SOM_LEFTMOST`, which is required to get the `from` of the `Match`.
- Stop the scan via the return value of the match callback given to `hyperscan`. However, this means a call to hyperscan each time we have a match. An (outdated...) implementation of this idea is available in the branch `ideas/single_match`.
- For each new haystack given to `find_at`, scan it in one go with `hyperscan`, remember the matches in a `VecDeque`, and consume the deque at each new call. I first guessed that when we return `Ok(None)` from `find_at`, the next haystack will be a new one. However (and this is weird?), sometimes `find_at` gets sent a new haystack while the "current" one is not terminated. From testing, it seems to only be the same haystack with an EOL added at the end, or the right-most part of the original haystack. Thus, we also start a new `hyperscan` run when we see a new haystack length. According to my (sporadic...) benchmarks, avoiding the successive calls to `hyperscan` speeds up the overall match by 10-20%. Ideally, it would be great if we could require `ripgrep` to only send each haystack once (or the minimal amount of data), but I have no idea how to do that.

Things to do for integration
I think the following tasks should be done for integration:

- A flag to select the engine: `-Y/--hyperscan` works, but this is kind of clumsy... IMHO it would be best to have an option `--engine=` which accepts `default|pcre2|hyperscan` (and this would allow other engines such as chimera to be added easily). (It would default to `default`, and the `-e` shortcut is already taken, so... ?)
- Allow `-f` to read a text file OR a `hyperscan` database. Most of the running time spent by hyperscan is actually to compile the list of text regexp patterns into its own DB format (see benchmarks below). Plus, a lot of DBs come in already-compiled form. Plus, sometimes you want to rerun the same regexps on different files... (`-d/--hyper-write=filename`, disabled by default)
- Expose `HS_FLAG_ALLOWEMPTY` from https://intel.github.io/hyperscan/dev-reference/api_constants.html#pattern-flags (`--hyper-allow-empty`, default false)
- Expose `HS_FLAG_UTF8` from the same URL (`--hyper-utf8`, default false)
- Expose `HS_FLAG_UCP` from the same URL (`--hyper-unicode-property`, default false)

Do you see something else? Is there anything to change/which is not OK? I'll edit the tasks accordingly.
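The third strategy described under Implementation (scan once, buffer matches in a `VecDeque`, drain one per `find_at` call) can be sketched with a stand-in matcher. This uses naive substring search in place of `hyperscan` and a simplified signature instead of the real `grep-matcher` trait; all names are illustrative:

```rust
use std::collections::VecDeque;

// Simplified model of the buffered find_at strategy: scan the whole
// haystack once, queue all the matches, and hand one back per call.
// A real implementation would run hyperscan here; this one just does
// naive substring search so it is self-contained.
struct BufferedMatcher {
    needle: Vec<u8>,
    pending: VecDeque<(usize, usize)>,
    last_len: Option<usize>, // new haystack length => start a fresh scan
}

impl BufferedMatcher {
    fn new(needle: &[u8]) -> Self {
        BufferedMatcher {
            needle: needle.to_vec(),
            pending: VecDeque::new(),
            last_len: None,
        }
    }

    // Stand-in for a single hyperscan run over the whole haystack.
    fn scan_all(&mut self, haystack: &[u8]) {
        self.pending.clear();
        let n = self.needle.len();
        if n == 0 || haystack.len() < n {
            return;
        }
        for i in 0..=haystack.len() - n {
            if &haystack[i..i + n] == self.needle.as_slice() {
                self.pending.push_back((i, i + n));
            }
        }
    }

    fn find_at(&mut self, haystack: &[u8], at: usize) -> Option<(usize, usize)> {
        // Re-scan only when the haystack length changes, as the issue
        // suggests, instead of re-running the scan on every call.
        if self.last_len != Some(haystack.len()) {
            self.scan_all(haystack);
            self.last_len = Some(haystack.len());
        }
        // Drain queued matches until one starts at or after `at`.
        while let Some(&(from, _)) = self.pending.front() {
            if from >= at {
                return self.pending.pop_front();
            }
            self.pending.pop_front();
        }
        None
    }
}

fn main() {
    let hay = b"abcabcabc";
    let mut m = BufferedMatcher::new(b"abc");
    assert_eq!(m.find_at(hay, 0), Some((0, 3)));
    assert_eq!(m.find_at(hay, 3), Some((3, 6)));
    println!("ok");
}
```

The length-change heuristic mirrors the workaround described above for haystacks arriving with an EOL appended or truncated to their right-most part; it is a sketch of the idea, not a robust identity check.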
Benchmarking
I had to parse around ~150GB of HTML-scraped webpages through ~4500 regexps, so that was my benchmark. The format for hyperscan regexps and default regexps is different, so I used 2 sets of regexps for benchmarking. Plus there is a limit to the amount of regexps that the default engine can handle, and since my list of regexps is in the shape:

where I have a 4.5k list of web domains (this is to find possible fediverse accounts in a webpage). Using a basic list like that is too big for the default engine, so I used:

`(some.domain1.com|some.other.domain2.net|...)/@[\w.\+\-]+`

The default regexps are here: https://termbin.com/xdse, and the hyperscan regexps are here: https://termbin.com/62ov

I used a subset of 15GB of data for testing. Parsing the regexps with the default engine takes around 8:20min (best case). Using the hyperscan engine, it takes less than 30 seconds to parse the files (that's basically the speed of my SSD) AND 5 minutes to compile the regexps. That's why we need a flag to serialize/deserialize regexps, so using the hyperscan engine becomes easier.
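The payoff of a serialize/deserialize flag can be modeled as a compile-once cache. In this toy sketch a `HashMap` stands in for the on-disk database file (`hs_serialize_database`/`hs_deserialize_database` in the real Hyperscan API), and the "compile" step is a placeholder for the ~5-minute compilation above:

```rust
use std::collections::HashMap;

// Toy model: compiling the pattern set is the expensive step, so cache
// the compiled artifact keyed by the pattern text. The HashMap stands
// in for a serialized database file written via --hyper-write.
struct DbCache {
    cache: HashMap<String, Vec<u8>>,
    compiles: usize, // counts how often the expensive step actually ran
}

impl DbCache {
    fn new() -> Self {
        DbCache { cache: HashMap::new(), compiles: 0 }
    }

    // Placeholder for the expensive pattern-set compilation.
    fn fake_compile(&mut self, patterns: &str) -> Vec<u8> {
        self.compiles += 1;
        patterns.bytes().rev().collect() // meaningless stand-in "database"
    }

    fn get_or_compile(&mut self, patterns: &str) -> Vec<u8> {
        if let Some(db) = self.cache.get(patterns) {
            return db.clone(); // the cheap "deserialize" path
        }
        let db = self.fake_compile(patterns);
        self.cache.insert(patterns.to_string(), db.clone());
        db
    }
}

fn main() {
    let mut cache = DbCache::new();
    let patterns = "(a.com|b.net)/@[\\w.+-]+";
    let db1 = cache.get_or_compile(patterns);
    let db2 = cache.get_or_compile(patterns); // second run: no recompile
    assert_eq!(db1, db2);
    assert_eq!(cache.compiles, 1);
    println!("ok");
}
```

With the benchmark numbers above, the cached path turns a ~5:30 run into a ~30-second run on every invocation after the first.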
Sorry for the (too!) long issue. The reason I opened this is: […]