Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for working with bytes and bytearray #42

Merged
merged 22 commits into from
Jan 5, 2024

Conversation

iscgar
Copy link

@iscgar iscgar commented Jan 26, 2023

This makes it useful for searching raw binary data as well, at the cost of having to do type checking manually.

@itamarst
Copy link
Collaborator

Thanks for the PR; seems like a good extended use-case, since now that I check the underlying Rust library does actually work with arbitrary byte slices.

Here are my initial thoughts; I did not read the code in detail, just looked at the general approach. Feel free to push back on any of them if you disagree.

1. Using the buffer API?

Possibly instead of hard-coding support for bytes and bytearray, a better approach might be to use the Python buffer API. That way you'd support both of those, plus NumPy arrays, etc. with just one handler.

A relevant example of using the buffer API from PyO3, probably something similar would be the right thing here: https://gitlab.com/tahoe-lafs/pycddl/-/blob/main/src/lib.rs#L49

This is more nice-to-have than a blocker, but it is nice to have:

  1. For haystacks, for example, you could mmap() a file and pass that in.
  2. It would also reduces the testing surface, since you don't need separate tests for different types.

2. API design

I would be happier having two separate classes, one for strings and one for (byte) buffers. Allowing for a string against a byte haystack, or vice-versa, is a little weird. And you could document the semantics, but ... there's also just a lot of nested logic in here which makes me worry more about testing.

Say:

import ahocorasick_rs
ac = ahocorasick_rs.BytesAhoCorasick([b"xxx", b"yyy"])

3. Tests

There should be some new tests, once previous items are settled.

@BurntSushi
Copy link

Thinking out loud without necessarily strongly advocating one way or the other...

FWIW, the main aho-corasick crate doesn't really separate the "bytes" from the "string" use case. Instead, it just defines all of the search routines as generic over anything that can be cheaply converted to a &[u8]. (A &str in Rust is just a &[u8], but required to be valid UTF-8.) That means you can actually build an Aho-Corasick automaton with &[u8] patterns and search them in &str haystacks, or build an Aho-Corasick automaton with &str patterns and search them in &[u8] haystacks. That is, you can freely mix and match.

Now in Rust, converting &str to &[u8] is quite literally free. In Python, I believe that is not the case, at least, not in all circumstances. (Please correct me if I'm wrong about that.)

The other thing that sticks out to me is that this library reports codepoint offsets when searching strings, but this PR I believe returns byte offsets when searching bytes. I believe that sounds right---given the design of Python strings and their affinity for codepoint offsets---but that indeed might be a good reason to split the top-level API into two types. Namely, if you keep one type, you'll get subtly different results depending on whether the haystack is or isn't a Unicode string. Unless... Python already has APIs like that. (It's been a while since I've written Python, and most of my time was spent with Python 2. So I'm not fully up to date on Python 3 idioms, especially around strings.)

@itamarst
Copy link
Collaborator

Python strings internally are not UTF-8, they are either 1 or 2 or 4 bytes wide depending on contents (IIRC). This is an unexposed implementation detail, though, which has changed at least once. So there is a UTF-8 conversion step necessary for Rust interop.

I had not thought of the offsets issue but yeah, for bytes you'd want byte offsets, not UTF-8 offsets.

I can imagine a third use case of "here is a haystack which is UTF-8-encoded bytes, search it with these string needles", but that's a different use case than bytes+bytes, and probably deserves its own explicit find_* function call.

@BurntSushi
Copy link

Python strings internally are not UTF-8

Right. That was my understanding.

I can imagine a third use case of "here is a haystack which is UTF-8-encoded bytes, search it with these string needles", but that's a different use case than bytes+bytes, and probably deserves its own explicit find_* function call.

Can you say why this is a different use case than "just search bytes"? If both cases are just "here's some bytes, search them," then I think they work the same. And you're want byte offsets for both.

This issue on the Rust aho-corasick library tracker might help provide some additional context (apologies if you already know all of it): BurntSushi/aho-corasick#72

@itamarst
Copy link
Collaborator

itamarst commented Jan 26, 2023

Assuming needles are strings, one difference between passing in a UTF-8-encoded byte array as haystack vs a Python string as a haystack is that dealing with the latter requires allocating a UTF-encoded copy of the string. And maybe since that's what we currently do, having an optimized non-allocating path for an extended use case is no problem.

Mostly I think it's just API design scars from the Python 2 to 3 transition, when string type went from bytes (with no particular encoding!) to unicode. I just don't like APIs that can accept either bytes or strings in a Python context.

@BurntSushi
Copy link

Mostly I think it's just API design scars from the Python 2 to 3 transition, when string type went from bytes (with no particular encoding!) to unicode. I just don't like APIs that can accept either bytes or strings in a Python context.

Indeed. I have similar scars. Painful.

@iscgar
Copy link
Author

iscgar commented Jan 27, 2023

1. Using the buffer API?

Possibly instead of hard-coding support for bytes and bytearray, a better approach might be to use the Python buffer API. That way you'd support both of those, plus NumPy arrays, etc. with just one handler.

I didn't know about the buffer protocol. Thanks for the pointer! However, as implemented, this PR actually only allows using a bytes haystack with bytes needles, and an str haystack with str needles, but explicitly disallows mixing and matching str and bytes. The reason for that is the need to return code point offsets for an str haystack, which may not be possible if e.g. the bytes needles are not valid UTF-8 sequences, as a match can then happen in the middle of a codepoint. While using the buffer protocol might simplify things, I think that even if we start returning byte offsets everywhere, there is no sane way to handle being handed arbitrary bytes through the buffer protocol, as the offsets that we will return might not make sense in the context of the original type.

2. API design

I would be happier having two separate classes, one for strings and one for (byte) buffers. Allowing for a string against a byte haystack, or vice-versa, is a little weird. And you could document the semantics, but ... there's also just a lot of nested logic in here which makes me worry more about testing.

As noted above, matching bytes against str and vice versa is explicitly disallowed by this PR and a user trying to do this will get a TypeError.

3. Tests

There should be some new tests, once previous items are settled.

Will do, once the correct approach is agreed upon.

@itamarst
Copy link
Collaborator

Yeah, thought about it some more, and I still think I'd rather have separate classes for strings and bytes. Docs are simpler, type signatures are clearer, and the code will run (slightly?) faster.

@itamarst
Copy link
Collaborator

Also note I am about to merge another PR that will make some relevant changes.

@prorealize
Copy link

Any news on this?

@itamarst
Copy link
Collaborator

itamarst commented Oct 1, 2023

@insightfulbit what is your use case, out of curiosity?

@prorealize
Copy link

I need to look for patterns from network packet data captured during network traffic analysis ("pcap")

Now I'd need to decode each network packet in order to search on it. By having the ability to search from a bytes format I'd avoid this

@itamarst
Copy link
Collaborator

itamarst commented Oct 5, 2023

Might need to do this with new approach, given I want separate classes.

@iscgar iscgar force-pushed the support_bytes_and_bytearray branch from 4ca8c26 to 867cbc7 Compare October 5, 2023 21:13
@iscgar
Copy link
Author

iscgar commented Oct 5, 2023

Sorry for not following up on this for so long. I refactored the implementation to support various byte sequences through the buffer API, using a different type as requested. I also added tests, mostly based on the existing ones, but also covering some edge cases that are specific to this type.

However, I'm not comfortable with the safety caveat around getting a &[u8] from PyBuffer, even though the current implementation is safe, because nothing actually prevents a misuse and that might bite us in the future. I went ahead with it anyway because I couldn't think of a better solution that avoids having to create expensive Vec<u8> copies (especially of the haystack, which in my use case can be a few hundred MiBs), but I would appreciate ideas about ways to solve it without sacrificing performance.

@iscgar iscgar force-pushed the support_bytes_and_bytearray branch 3 times, most recently from b2adc3a to 232d26f Compare October 6, 2023 13:13
This allows the library to be used for searching raw binary data as well.
@iscgar iscgar force-pushed the support_bytes_and_bytearray branch from 232d26f to 1d3fb37 Compare October 6, 2023 13:14
@itamarst
Copy link
Collaborator

itamarst commented Oct 6, 2023

Thank you! I will try to review soon.

@itamarst
Copy link
Collaborator

itamarst commented Oct 6, 2023

Thank you! I will start pushing updates to your branch, assuming I don't hit permission issues.

@itamarst
Copy link
Collaborator

itamarst commented Oct 6, 2023

Gotta run, will get back to this soon. Notes to myself:

  • mypy failures can be fixed with type: ignore, it's weird I can't reproduce locally though since CI is clearly correct.
  • add changelog, add docs, including safety invariants
  • explain why I think allowing passing in potentially-mutable byte buffers to constructor is fine, and/or consider safer version where we don't release GIL.

@itamarst itamarst temporarily deployed to release October 6, 2023 21:12 — with GitHub Actions Inactive
@iscgar
Copy link
Author

iscgar commented Oct 18, 2023

The constructor should be fine because of the way PyBufferBytes is used in the iterator, so unless we explicitly call .as_ref() inside one of the iterator closures and then release the GIL while the &[u8] we got is alive (or spawn a thread and release the GIL there while the iterator is being consumed), the only use of the AsRef impl is when AhoCorasickBuilder actually consumes the iterator. However, the haystack is a bit more problematic, because the matches iterator doesn't even know about PyBufferBytes and only holds a &[u8] which we got by explicitly calling .as_ref() on it, and since &[u8] is Ungil, the current code is only safe because we currently don't release the GIL while the &[u8] is alive, but there's nothing preventing us from doing so.

@itamarst itamarst changed the base branch from main to 50-port-to-ahocorasick-rs-v10 January 5, 2024 13:01
@itamarst itamarst changed the base branch from 50-port-to-ahocorasick-rs-v10 to main January 5, 2024 13:01
Copy link
Collaborator

@itamarst itamarst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, and sorry this take so long to get to. I will address the review comments and then merge.

tests/test_ac_bytes.py Outdated Show resolved Hide resolved
tests/test_ac_bytes.py Outdated Show resolved Hide resolved
tests/test_ac_bytes.py Outdated Show resolved Hide resolved
tests/test_ac_bytes.py Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
@itamarst
Copy link
Collaborator

itamarst commented Jan 5, 2024

Also need to file follow-up issue about the edge case where releasing the GIL is possible.

@itamarst
Copy link
Collaborator

itamarst commented Jan 5, 2024

I filed #94 as a followup.

@itamarst itamarst merged commit 58545e9 into G-Research:main Jan 5, 2024
21 checks passed
@iscgar iscgar deleted the support_bytes_and_bytearray branch March 8, 2024 11:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants