regex::bytes::escape #451
Comments
Could you please describe your use case? Both |
Yeah, exactly. I'm using In my benchmarks such a precompiled regex turns out to be ~4x faster than native libc |
FWIW this is what my current implementation looks like:

    use std::fmt::Write; // needed for write! on a String

    fn escape(b: &[u8]) -> String {
        // 5 bytes for "(?-u:", 1 for ")", and 4 per input byte ("\xNN").
        let mut regex_str = String::with_capacity(6 + b.len() * 4);
        regex_str += "(?-u:";
        for b in b {
            // Writing to a String cannot fail, so unwrap is safe here.
            write!(regex_str, "\\x{:02X}", b).unwrap();
        }
        regex_str += ")";
        regex_str
    }

It could be somewhat optimised to not escape valid ASCII chars that can appear in regex, but such representation is even better for |
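For illustration, the ASCII optimisation mentioned above could look roughly like the sketch below. This is not part of the regex crate; the name escape_bytes_readable and the choice of which bytes stay unescaped (ASCII letters, digits and underscore) are assumptions made for this example only.

```rust
use std::fmt::Write;

/// Hypothetical variant of the `escape` function above: ASCII letters,
/// digits and `_` are emitted as-is, every other byte becomes `\xNN`.
fn escape_bytes_readable(bytes: &[u8]) -> String {
    // Worst case is still 4 output bytes per input byte plus the wrapper.
    let mut out = String::with_capacity(6 + bytes.len() * 4);
    out.push_str("(?-u:");
    for &b in bytes {
        if b.is_ascii_alphanumeric() || b == b'_' {
            // These bytes have no special meaning in a regex pattern.
            out.push(b as char);
        } else {
            // Writing to a String cannot fail.
            write!(out, "\\x{:02X}", b).unwrap();
        }
    }
    out.push(')');
    out
}
```

The output is easier to read for mostly-ASCII needles, at the cost of having to decide which bytes count as safe, which is the kind of design question raised later in the thread.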
Oh I see. Generally
That is interesting. Typically the speed comes from careful selection of a "rare" byte to give to memchr. Make sure your benchmarks give you sufficient coverage, because if they don't, you might have cases where a regex is slower than memmem. That is, memmem might be slower in some cases than regex, but it is likely much more consistent overall. Alas, there are no general principles here, since it depends a lot on your pattern and corpus. |
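To make the "rare byte" idea concrete, here is a minimal sketch of a substring search that scans for one byte of the needle and only then verifies the full match. It is not how the regex crate or libc implements this: the function names are hypothetical, the rarity heuristic (counting bytes in the haystack itself) is an assumption chosen for simplicity, and a plain iterator scan stands in for memchr.

```rust
/// Returns the offset of the needle byte that occurs least often in `sample`.
/// Hypothetical heuristic for this sketch only.
fn rarest_offset(needle: &[u8], sample: &[u8]) -> usize {
    let mut counts = [0usize; 256];
    for &b in sample {
        counts[b as usize] += 1;
    }
    let mut best = 0;
    for (i, &b) in needle.iter().enumerate() {
        if counts[b as usize] < counts[needle[best] as usize] {
            best = i;
        }
    }
    best
}

/// Finds the first occurrence of `needle` in `haystack` by scanning for the
/// needle's rarest byte and verifying each candidate position.
fn find_with_rare_byte(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    if needle.len() > haystack.len() {
        return None;
    }
    let off = rarest_offset(needle, haystack);
    let rare = needle[off];
    // Last position at which the whole needle still fits.
    let last_start = haystack.len() - needle.len();
    let mut start = 0;
    while start <= last_start {
        let from = start + off;
        let window_end = last_start + off + 1;
        // Stand-in for memchr: find the next occurrence of the rare byte.
        let pos = haystack[from..window_end].iter().position(|&b| b == rare)?;
        let cand = from + pos - off;
        if &haystack[cand..cand + needle.len()] == needle {
            return Some(cand);
        }
        start = cand + 1;
    }
    None
}
```

If the chosen byte turns out to be common in the corpus, almost every position becomes a candidate and the scan degrades towards a naive search, which is why benchmark coverage matters here.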
I tried several small (<20 bytes) as well as large (~6.2MB) inputs, although in both cases my needle is relatively small compared to the input, and saw a pretty much consistent 4x win. Maybe part of it comes from the libc case requiring a call into an extern function rather than inlinable code, but it's still a win.
Do you mean same as what I said in
or something different? Do you mind me sending a PR with such generic byte escape for now, and documenting that exact escaping rules are subject to change? Then you'd have flexibility to change it if needed. |
I think these questions are a good reason why this function might not be a good fit. You're right, of course, because if I have the pattern I guess I don't mind if you submit a PR, but I generally don't like having long standing open PRs, so I may close it eventually. If that's OK with you, then feel free to submit it! |
I meant sending a PR that would be accepted as a solution for now until someone says they need a different representation 😄 |
@RReverser I'd prefer to hold off for now. |
Okay. Do you want to keep the issue open to see if others have thoughts on this? |
Definitely. |
FWIW, i.e. desired behaviour is |
I think I agree with this, in that, this seems like the kind of behavior I might expect. @RReverser What do you think? Also, I'm not quite sure I want to call this The other concern I have here is that we're trying to bundle two distinct and orthogonal concerns into a single function:
So, what happens if someone wants (2) but not (1)? Should we expose yet another function? |
For avoidance of doubt in my previous post, this is simplified but representative of my use case:
I'm not sure I see the difference from plain Similarly
This introduces "meta bytes" in |
@alecmocatta We're talking way past each other. I'm not sure how to reconcile it, so I'll be brief:
|
I'm fine with either proposed solution; for generic bytes I do think that hex representation makes more sense, but I can also see cases where a pretty-printed output might be useful for debugging ASCII strings etc. |
FWIW I found the reason of that difference later and it's coming from the fact that That said, I still think |
What about naming this function |
I don't think this is a common use case. It still seems to me like this function is pretty easy to write if you need it, and you can give it exactly the semantics you want by doing so. |
I am going to close this issue for now. It feels to me like the design space is a little too big and the use case is a little too niche. I'd rather see folks write their own version of this function if they need it. |
Would it make sense to provide regex::bytes::escape, similar to regex::escape, but so that it could accept any &[u8], not &str? Currently the quick workaround is to manually collect bytes into a regex string using write!(s, "\\x{:02X}", b), but it would be nice to have a built-in method for that.