Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Permit use of (?-u) in byte-regex strategies (#336) #337

Merged
merged 2 commits into from
Sep 24, 2023

Conversation

zackw
Copy link
Contributor

@zackw zackw commented Jun 29, 2023

It is desirable to be able to generate, from a regex, byte sequences that are not necessarily valid UTF-8. For example, suppose you have a parser that accepts any string generated by the regex

[0-9]+(\.[0-9]*)?

Then, in your test suite, you might want to generate strings from the complementary regular language, which you could do with the regex

(?s:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)

However, this will still only generate valid UTF-8 strings. Maybe you are parsing directly from byte sequences read from disk, in which case you want to test the parser’s ability to reject invalid UTF-8 as well as valid UTF-8 but not within the accepted language. Then you want this slight variation:

(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)

But this regex will be rejected by bytes_regex, because by default regex_syntax::Parser errors out on any regex that potentially matches invalid UTF-8. The application — i.e. proptest — must opt into use of such regexes. This patch makes proptest do just that, for bytes_regex only.

There should be no change to the behavior of any existing test suite, because opting to allow use of (?-u) does not change the semantics of any regex that doesn’t contain (?-u), and any existing regex that does contain (?-u) must be incapable of generating invalid UTF-8 for other reasons, or regex_syntax::Parser would be rejecting it. (For example, (?-u:[a-z]) cannot generate invalid UTF-8.)

This patch also adds a bunch of tests for bytes_regex, which AFAICT was not being tested at all. Some of these use the new functionality and others don’t. There is quite a bit of code duplication in the test helper functions — do_test and do_test_bytes are almost identical, as are generate_values_matching_regex and generate_byte_values_matching_regex. I am not good enough at generic metaprogramming in Rust to factor out the duplication.

@cameron1024
Copy link
Member

Thanks for the PR 😅 I'll have a look at this more closely when I have time, but upon first inspection it looks sensible.

If I haven't reviewed by the end of this weekend feel free to ping me 😅

Copy link
Contributor

@tzemanovic tzemanovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is sound. Even though as @matthew-russo pointed out there is bytes_regex_parsed already, it requires to pull in regex_syntax to be used (side note, this crate should probably be re-exported as it appears in the public api), so extending the fn bytes_regex without a breaking change is a nice solution to allow this to be done directly. Additionally, having the extra doc on how to do this might help someone in future

@matthew-russo
Copy link
Member

Agh thanks for reminding me I need to look at this -- will do it this evening after work

It is desirable to be able to generate, from a regex, byte sequences
that are not necessarily valid UTF-8.  For example, suppose you have a
parser that accepts any string generated by the regex

```
[0-9]+(\.[0-9]*)?
```

Then, in your test suite, you might want to generate strings from the
complementary regular language, which you could do with the regex

```
(?s:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)
```

However, this will still only generate valid UTF-8 strings.  Maybe you
are parsing directly from byte sequences read from disk, in which case
you want to test the parser’s ability to reject invalid UTF-8 _as well
as_ valid UTF-8 but not within the accepted language.  Then you want
this slight variation:

```
(?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*)
```

But this regex will be rejected by `bytes_regex`, because by default
`regex_syntax::Parser` errors out on any regex that potentially
matches invalid UTF-8.  The application — i.e. proptest — must opt
into use of such regexes.  This patch makes proptest do just that, for
`bytes_regex` only.

There should be no change to the behavior of any existing test suite,
because opting to allow use of `(?-u)` does not change the semantics
of any regex that _doesn’t_ contain `(?-u)`, and any existing regex
that _does_ contain `(?-u)` must be incapable of generating invalid
UTF-8 for other reasons, or `regex_syntax::Parser` would be rejecting it.
(For example, `(?-u:[a-z])` cannot generate invalid UTF-8.)

This patch also adds a bunch of tests for `bytes_regex`, which AFAICT
was not being tested at all.  Some of these use the new functionality
and others don’t.  There is quite a bit of code duplication in the
test helper functions — `do_test` and `do_test_bytes` are almost
identical, as are `generate_values_matching_regex` and
`generate_byte_values_matching_regex`.  I am not good enough at
generic metaprogramming in Rust to factor out the duplication.
@zackw
Copy link
Contributor Author

zackw commented Sep 18, 2023

Rebased on latest trunk, sorry for the delay.

Commit
rust-lang/regex@706b07d
renamed ParserBuilder::allow_invalid_utf8
to ParserBuilder::utf8
and inverted the sense of its argument.

Separate commit for review purposes; should be squashed before landing
to preserve bisectability of trunk.
@zackw
Copy link
Contributor Author

zackw commented Sep 18, 2023

Added another commit to fix up the code for an API change. I cannot run the test suite locally right now due to what looks like a bug in message-io...

error[E0716]: temporary value dropped while borrowed
   --> .../.cargo/registry/src/index.crates.io-6f17d22bba15001f/message-io-0.17.0/src/adapters/udp.rs:241:22
    |
241 |                 &mut [io::IoSliceMut::new(&mut input_buffer)],
    |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ creates a temporary value which is freed while still in use
...
244 |             );
    |              - temporary value is freed at the end of this statement
245 |
246 |             match result {
    |                   ------ borrow later used here
    |
    = note: consider using a `let` binding to create a longer lived value

... but this is sufficient to make proptest build again as a dependency of a larger project that uses the feature this PR adds, and that project's testsuite passes.

The additional commit should be squashed before merging.

@zackw zackw closed this Sep 18, 2023
@zackw zackw reopened this Sep 18, 2023
@zackw
Copy link
Contributor Author

zackw commented Sep 18, 2023

Fatfingered the close button, sorry for the noise.

@matthew-russo matthew-russo merged commit 370b3a0 into proptest-rs:master Sep 24, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants