Add support for working with `bytes` and `bytearray` #42

iscgar · 2023-01-26T13:17:33Z

This makes it useful for searching raw binary data as well, at the cost of having to do type checking manually.

itamarst · 2023-01-26T16:19:39Z

Thanks for the PR; seems like a good extended use-case, since now that I check the underlying Rust library does actually work with arbitrary byte slices.

Here are my initial thoughts; I did not read the code in detail, just looked at the general approach. Feel free to push back on any of them if you disagree.

1. Using the buffer API?

Possibly instead of hard-coding support for bytes and bytearray, a better approach might be to use the Python buffer API. That way you'd support both of those, plus NumPy arrays, etc. with just one handler.

A relevant example of using the buffer API from PyO3, probably something similar would be the right thing here: https://gitlab.com/tahoe-lafs/pycddl/-/blob/main/src/lib.rs#L49

This is more nice-to-have than a blocker, but it is nice to have:

For haystacks, for example, you could mmap() a file and pass that in.
It would also reduces the testing surface, since you don't need separate tests for different types.

2. API design

I would be happier having two separate classes, one for strings and one for (byte) buffers. Allowing for a string against a byte haystack, or vice-versa, is a little weird. And you could document the semantics, but ... there's also just a lot of nested logic in here which makes me worry more about testing.

Say:

import ahocorasick_rs
ac = ahocorasick_rs.BytesAhoCorasick([b"xxx", b"yyy"])

3. Tests

There should be some new tests, once previous items are settled.

BurntSushi · 2023-01-26T16:30:52Z

Thinking out loud without necessarily strongly advocating one way or the other...

FWIW, the main aho-corasick crate doesn't really separate the "bytes" from the "string" use case. Instead, it just defines all of the search routines as generic over anything that can be cheaply converted to a &[u8]. (A &str in Rust is just a &[u8], but required to be valid UTF-8.) That means you can actually build an Aho-Corasick automaton with &[u8] patterns and search them in &str haystacks, or build an Aho-Corasick automaton with &str patterns and search them in &[u8] haystacks. That is, you can freely mix and match.

Now in Rust, converting &str to &[u8] is quite literally free. In Python, I believe that is not the case, at least, not in all circumstances. (Please correct me if I'm wrong about that.)

The other thing that sticks out to me is that this library reports codepoint offsets when searching strings, but this PR I believe returns byte offsets when searching bytes. I believe that sounds right---given the design of Python strings and their affinity for codepoint offsets---but that indeed might be a good reason to split the top-level API into two types. Namely, if you keep one type, you'll get subtly different results depending on whether the haystack is or isn't a Unicode string. Unless... Python already has APIs like that. (It's been a while since I've written Python, and most of my time was spent with Python 2. So I'm not fully up to date on Python 3 idioms, especially around strings.)

itamarst · 2023-01-26T17:11:00Z

Python strings internally are not UTF-8, they are either 1 or 2 or 4 bytes wide depending on contents (IIRC). This is an unexposed implementation detail, though, which has changed at least once. So there is a UTF-8 conversion step necessary for Rust interop.

I had not thought of the offsets issue but yeah, for bytes you'd want byte offsets, not UTF-8 offsets.

I can imagine a third use case of "here is a haystack which is UTF-8-encoded bytes, search it with these string needles", but that's a different use case than bytes+bytes, and probably deserves its own explicit find_* function call.

BurntSushi · 2023-01-26T17:24:04Z

Python strings internally are not UTF-8

Right. That was my understanding.

I can imagine a third use case of "here is a haystack which is UTF-8-encoded bytes, search it with these string needles", but that's a different use case than bytes+bytes, and probably deserves its own explicit find_* function call.

Can you say why this is a different use case than "just search bytes"? If both cases are just "here's some bytes, search them," then I think they work the same. And you're want byte offsets for both.

This issue on the Rust aho-corasick library tracker might help provide some additional context (apologies if you already know all of it): BurntSushi/aho-corasick#72

itamarst · 2023-01-26T18:26:57Z

Assuming needles are strings, one difference between passing in a UTF-8-encoded byte array as haystack vs a Python string as a haystack is that dealing with the latter requires allocating a UTF-encoded copy of the string. And maybe since that's what we currently do, having an optimized non-allocating path for an extended use case is no problem.

Mostly I think it's just API design scars from the Python 2 to 3 transition, when string type went from bytes (with no particular encoding!) to unicode. I just don't like APIs that can accept either bytes or strings in a Python context.

BurntSushi · 2023-01-26T18:30:20Z

Mostly I think it's just API design scars from the Python 2 to 3 transition, when string type went from bytes (with no particular encoding!) to unicode. I just don't like APIs that can accept either bytes or strings in a Python context.

Indeed. I have similar scars. Painful.

iscgar · 2023-01-27T00:13:30Z

1. Using the buffer API?

Possibly instead of hard-coding support for bytes and bytearray, a better approach might be to use the Python buffer API. That way you'd support both of those, plus NumPy arrays, etc. with just one handler.

I didn't know about the buffer protocol. Thanks for the pointer! However, as implemented, this PR actually only allows using a bytes haystack with bytes needles, and an str haystack with str needles, but explicitly disallows mixing and matching str and bytes. The reason for that is the need to return code point offsets for an str haystack, which may not be possible if e.g. the bytes needles are not valid UTF-8 sequences, as a match can then happen in the middle of a codepoint. While using the buffer protocol might simplify things, I think that even if we start returning byte offsets everywhere, there is no sane way to handle being handed arbitrary bytes through the buffer protocol, as the offsets that we will return might not make sense in the context of the original type.

2. API design

I would be happier having two separate classes, one for strings and one for (byte) buffers. Allowing for a string against a byte haystack, or vice-versa, is a little weird. And you could document the semantics, but ... there's also just a lot of nested logic in here which makes me worry more about testing.

As noted above, matching bytes against str and vice versa is explicitly disallowed by this PR and a user trying to do this will get a TypeError.

3. Tests

There should be some new tests, once previous items are settled.

Will do, once the correct approach is agreed upon.

itamarst · 2023-01-30T14:43:37Z

Yeah, thought about it some more, and I still think I'd rather have separate classes for strings and bytes. Docs are simpler, type signatures are clearer, and the code will run (slightly?) faster.

itamarst · 2023-01-30T17:17:33Z

Also note I am about to merge another PR that will make some relevant changes.

prorealize · 2023-10-01T12:57:31Z

Any news on this?

itamarst · 2023-10-01T19:05:06Z

@insightfulbit what is your use case, out of curiosity?

prorealize · 2023-10-01T19:36:26Z

I need to look for patterns from network packet data captured during network traffic analysis ("pcap")

Now I'd need to decode each network packet in order to search on it. By having the ability to search from a bytes format I'd avoid this

itamarst · 2023-10-05T14:35:00Z

Might need to do this with new approach, given I want separate classes.

iscgar · 2023-10-05T21:24:14Z

Sorry for not following up on this for so long. I refactored the implementation to support various byte sequences through the buffer API, using a different type as requested. I also added tests, mostly based on the existing ones, but also covering some edge cases that are specific to this type.

However, I'm not comfortable with the safety caveat around getting a &[u8] from PyBuffer, even though the current implementation is safe, because nothing actually prevents a misuse and that might bite us in the future. I went ahead with it anyway because I couldn't think of a better solution that avoids having to create expensive Vec<u8> copies (especially of the haystack, which in my use case can be a few hundred MiBs), but I would appreciate ideas about ways to solve it without sacrificing performance.

This allows the library to be used for searching raw binary data as well.

itamarst · 2023-10-06T18:12:41Z

Thank you! I will try to review soon.

itamarst · 2023-10-06T18:22:17Z

Thank you! I will start pushing updates to your branch, assuming I don't hit permission issues.

…earray

itamarst · 2023-10-06T18:30:57Z

Gotta run, will get back to this soon. Notes to myself:

mypy failures can be fixed with type: ignore, it's weird I can't reproduce locally though since CI is clearly correct.
add changelog, add docs, including safety invariants
explain why I think allowing passing in potentially-mutable byte buffers to constructor is fine, and/or consider safer version where we don't release GIL.

iscgar · 2023-10-18T09:30:01Z

The constructor should be fine because of the way PyBufferBytes is used in the iterator, so unless we explicitly call .as_ref() inside one of the iterator closures and then release the GIL while the &[u8] we got is alive (or spawn a thread and release the GIL there while the iterator is being consumed), the only use of the AsRef impl is when AhoCorasickBuilder actually consumes the iterator. However, the haystack is a bit more problematic, because the matches iterator doesn't even know about PyBufferBytes and only holds a &[u8] which we got by explicitly calling .as_ref() on it, and since &[u8] is Ungil, the current code is only safe because we currently don't release the GIL while the &[u8] is alive, but there's nothing preventing us from doing so.

itamarst

Thank you, and sorry this take so long to get to. I will address the review comments and then merge.

tests/test_ac_bytes.py

src/lib.rs

itamarst · 2024-01-05T13:52:38Z

Also need to file follow-up issue about the edge case where releasing the GIL is possible.

…earray

itamarst · 2024-01-05T14:19:25Z

I filed #94 as a followup.

iscgar force-pushed the support_bytes_and_bytearray branch from 4ca8c26 to 867cbc7 Compare October 5, 2023 21:13

iscgar force-pushed the support_bytes_and_bytearray branch 3 times, most recently from b2adc3a to 232d26f Compare October 6, 2023 13:13

Support working with byte sequences such as bytes and bytearray

1d3fb37

This allows the library to be used for searching raw binary data as well.

iscgar force-pushed the support_bytes_and_bytearray branch from 232d26f to 1d3fb37 Compare October 6, 2023 13:14

Merge remote-tracking branch 'origin/main' into support_bytes_and_byt…

d360983

…earray

pythonspeed added 6 commits October 6, 2023 16:51

Fix type errors

3479d9a

Get correct type checks on older versions of Python.

3c59f28

Also check pysrc

d972995

Newer maturin

1654543

Don't support 3.7 anymore

e14b034

Fix maturin version

ab50970

Correctly type check on older versions of Python

fc579d8

itamarst temporarily deployed to release October 6, 2023 21:12 — with GitHub Actions Inactive

itamarst changed the base branch from main to 50-port-to-ahocorasick-rs-v10 January 5, 2024 13:01

itamarst changed the base branch from 50-port-to-ahocorasick-rs-v10 to main January 5, 2024 13:01

itamarst approved these changes Jan 5, 2024

View reviewed changes

pythonspeed added 10 commits January 5, 2024 08:54

Address minor review comments.

2912cc8

Address minor review comments.

aa8ac0d

Small simplification suggested by clippy.

27156e4

Simplify.

4c781e3

More extensive checking.

3292bef

Having checked-in examples is useful

8aaf6d3

More examples.

dc8db86

Merge remote-tracking branch 'origin/main' into support_bytes_and_byt…

df58d05

…earray

Update dependencies.

7731f77

Switch to latest Rust.

ab625f6

pythonspeed added 3 commits January 5, 2024 09:20

Document the new feature.

a4b9edf

Make the test more robust.

3d5cf90

Unused import

410d55f

itamarst temporarily deployed to release January 5, 2024 14:43 — with GitHub Actions Inactive

itamarst merged commit 58545e9 into G-Research:main Jan 5, 2024
21 checks passed

iscgar deleted the support_bytes_and_bytearray branch March 8, 2024 11:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for working with `bytes` and `bytearray` #42

Add support for working with `bytes` and `bytearray` #42

iscgar commented Jan 26, 2023

itamarst commented Jan 26, 2023

BurntSushi commented Jan 26, 2023

itamarst commented Jan 26, 2023

BurntSushi commented Jan 26, 2023

itamarst commented Jan 26, 2023 •

edited

Loading

BurntSushi commented Jan 26, 2023

iscgar commented Jan 27, 2023 •

edited

Loading

1. Using the buffer API?

2. API design

3. Tests

itamarst commented Jan 30, 2023

itamarst commented Jan 30, 2023

prorealize commented Oct 1, 2023

itamarst commented Oct 1, 2023

prorealize commented Oct 1, 2023

itamarst commented Oct 5, 2023

iscgar commented Oct 5, 2023

itamarst commented Oct 6, 2023

itamarst commented Oct 6, 2023

itamarst commented Oct 6, 2023

iscgar commented Oct 18, 2023

itamarst left a comment

itamarst commented Jan 5, 2024

itamarst commented Jan 5, 2024

Add support for working with bytes and bytearray #42

Add support for working with bytes and bytearray #42

Conversation

iscgar commented Jan 26, 2023

itamarst commented Jan 26, 2023

1. Using the buffer API?

2. API design

3. Tests

BurntSushi commented Jan 26, 2023

itamarst commented Jan 26, 2023

BurntSushi commented Jan 26, 2023

itamarst commented Jan 26, 2023 • edited Loading

BurntSushi commented Jan 26, 2023

iscgar commented Jan 27, 2023 • edited Loading

1. Using the buffer API?

2. API design

3. Tests

itamarst commented Jan 30, 2023

itamarst commented Jan 30, 2023

prorealize commented Oct 1, 2023

itamarst commented Oct 1, 2023

prorealize commented Oct 1, 2023

itamarst commented Oct 5, 2023

iscgar commented Oct 5, 2023

itamarst commented Oct 6, 2023

itamarst commented Oct 6, 2023

itamarst commented Oct 6, 2023

iscgar commented Oct 18, 2023

itamarst left a comment

Choose a reason for hiding this comment

itamarst commented Jan 5, 2024

itamarst commented Jan 5, 2024

Add support for working with `bytes` and `bytearray` #42

Add support for working with `bytes` and `bytearray` #42

itamarst commented Jan 26, 2023 •

edited

Loading

iscgar commented Jan 27, 2023 •

edited

Loading