Add support for working with bytes and bytearray #42
Conversation
Thanks for the PR; seems like a good extended use-case, since now that I check, the underlying Rust library does actually work with arbitrary byte slices. Here are my initial thoughts; I did not read the code in detail, just looked at the general approach. Feel free to push back on any of them if you disagree.

1. Using the buffer API? Possibly, instead of hard-coding support for particular types. A relevant example of using the buffer API from PyO3; probably something similar would be the right thing here: https://gitlab.com/tahoe-lafs/pycddl/-/blob/main/src/lib.rs#L49 This is more nice-to-have than a blocker, but it is nice to have.

2. API design. I would be happier having two separate classes, one for strings and one for (byte) buffers. Allowing a string needle against a byte haystack, or vice versa, is a little weird. And you could document the semantics, but ... there's also just a lot of nested logic in here, which makes me worry more about testing. Say:

import ahocorasick_rs
ac = ahocorasick_rs.BytesAhoCorasick([b"xxx", b"yyy"])

3. Tests. There should be some new tests, once the previous items are settled.
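To make the buffer-API suggestion concrete from the Python side (an illustrative sketch by the editor, not ahocorasick_rs code): several built-in types implement the buffer protocol, so a consumer written against it handles all of them uniformly instead of hard-coding the accepted types.

```python
import array

# bytes, bytearray, memoryview, and array.array all implement the buffer
# protocol; wrapping each in a memoryview gives uniform, zero-copy access
# to the underlying bytes, with no per-type special cases.
buffers = [
    b"\x01\x02\x03",
    bytearray(b"\x01\x02\x03"),
    memoryview(b"\x01\x02\x03"),
    array.array("B", [1, 2, 3]),
]
for obj in buffers:
    view = memoryview(obj)
    assert view.tobytes() == b"\x01\x02\x03"
```

This is the Python-level view of what a PyO3 extension sees when it accepts a buffer rather than a concrete type.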
Thinking out loud without necessarily strongly advocating one way or the other... FWIW, the main Rust aho-corasick API already searches arbitrary &[u8] haystacks. Now in Rust, converting a &str to a &[u8] is free. The other thing that sticks out to me is that this library reports codepoint offsets when searching strings, but this PR, I believe, returns byte offsets when searching bytes.
Python strings internally are not UTF-8; they are either 1, 2, or 4 bytes wide depending on contents (IIRC). That is an unexposed implementation detail, though, and one that has changed at least once. So there is a UTF-8 conversion step necessary for Rust interop. I had not thought of the offsets issue, but yeah, for bytes you'd want byte offsets, not UTF-8 offsets. I can imagine a third use case of "here is a haystack which is UTF-8-encoded bytes, search it with these string needles", but that's a different use case than bytes+bytes, and probably deserves its own explicit API.
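A short illustration of the two points above (editor's example, not library code): encoding a Python string to UTF-8 is an explicit, allocating step, and once a codepoint needs more than one UTF-8 byte, codepoint offsets and byte offsets diverge.

```python
# One codepoint can occupy several UTF-8 bytes, so codepoint offsets
# and byte offsets diverge as soon as the text leaves ASCII.
s = "naïve"
b = s.encode("utf-8")        # explicit conversion step (allocates a copy)
assert len(s) == 5           # 5 codepoints
assert len(b) == 6           # 6 UTF-8 bytes: "ï" encodes as 0xC3 0xAF
assert s.index("v") == 3     # codepoint offset of "v"
assert b.index(b"v") == 4    # byte offset, shifted by the 2-byte "ï"
```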
Right. That was my understanding.
Can you say why this is a different use case than "just search bytes"? If both cases are just "here's some bytes, search them," then I think they work the same. And you'd want byte offsets for both. This issue on the Rust aho-corasick library tracker might help provide some additional context (apologies if you already know all of it): BurntSushi/aho-corasick#72
Assuming needles are strings, one difference between passing in a UTF-8-encoded byte array as the haystack vs. a Python string as the haystack is that the latter requires allocating a UTF-8-encoded copy of the string. And maybe, since that's what we currently do, having an optimized non-allocating path for an extended use case is no problem. Mostly I think it's just API design scars from the Python 2 to 3 transition, when the string type went from bytes (with no particular encoding!) to unicode. I just don't like APIs that can accept either bytes or strings in a Python context.
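The allocating vs. non-allocating distinction can be seen directly in Python (editor's sketch): a memoryview over a bytes-like haystack is a zero-copy view, while encoding a string necessarily produces a fresh bytes object.

```python
# A memoryview is a zero-copy view over an existing buffer ...
data = bytearray(b"haystack")
view = memoryview(data)
data[0:3] = b"HAY"                    # same-length slice assignment: no resize
assert view.tobytes() == b"HAYstack"  # the view sees the mutation: no copy made

# ... whereas encoding a str allocates a new, independent bytes object.
s = "haystack"
encoded = s.encode("utf-8")
assert encoded == b"haystack"
```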
Indeed. I have similar scars. Painful.
I didn't know about the buffer protocol. Thanks for the pointer! However, as implemented, this PR actually only allows using a
As noted above, matching
Will do, once the correct approach is agreed upon.
Yeah, I thought about it some more, and I still think I'd rather have separate classes for strings and bytes. Docs are simpler, type signatures are clearer, and the code will run (slightly?) faster.
Also note I am about to merge another PR that will make some relevant changes.
Any news on this?
@insightfulbit what is your use case, out of curiosity?
I need to look for patterns in network packet data captured during network traffic analysis ("pcap"). Currently I'd need to decode each network packet in order to search it. By having the ability to search in bytes form directly, I'd avoid this.
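This use case is a good illustration of why decoding is not just overhead but often impossible (editor's example with a made-up packet): raw packet bytes are frequently not valid UTF-8 at all, while searching the bytes directly needs no decode step.

```python
# A fabricated packet-like payload: binary header bytes followed by ASCII.
packet = b"\x00\x1a\xff\xfeGET /index.html"

# Decoding arbitrary binary data as UTF-8 routinely fails ...
try:
    packet.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
assert not decoded_ok

# ... but a byte-level search works on the raw data, with byte offsets.
assert packet.find(b"GET /") == 4
```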
Might need to redo this with the new approach, given I want separate classes.
Sorry for not following up on this for so long. I refactored the implementation to support various byte sequences through the buffer API, using a different type as requested. I also added tests, mostly based on the existing ones, but also covering some edge cases that are specific to this type. However, I'm not comfortable with the safety caveat around getting a
This allows the library to be used for searching raw binary data as well.
Thank you! I will try to review soon.
Thank you! I will start pushing updates to your branch, assuming I don't hit permission issues.
Gotta run, will get back to this soon. Notes to myself:
The constructor should be fine because of the way |
Thank you, and sorry this took so long to get to. I will address the review comments and then merge.
Also need to file a follow-up issue about the edge case where releasing the GIL is possible.
I filed #94 as a follow-up.
This makes it useful for searching raw binary data as well, at the cost of having to do type checking manually.
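The "type checking manually" cost mentioned above can be sketched like this (a hypothetical helper by the editor, not part of ahocorasick_rs): with separate classes, the bytes-oriented one must itself reject str so that byte offsets stay unambiguous, while accepting anything buffer-shaped.

```python
def as_byte_view(haystack):
    """Hypothetical helper: manual type check for a bytes-oriented API.

    Rejects str explicitly (a str haystack belongs to the string class),
    then relies on memoryview() to accept any buffer-protocol object.
    """
    if isinstance(haystack, str):
        raise TypeError("expected a bytes-like object, not str")
    return memoryview(haystack)  # bytes, bytearray, memoryview, ...

assert as_byte_view(b"abc").tobytes() == b"abc"
assert as_byte_view(bytearray(b"abc")).tobytes() == b"abc"
try:
    as_byte_view("abc")
except TypeError:
    pass  # str is deliberately refused
```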