-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Permit use of (?-u) in byte-regex strategies (#336) #337
Conversation
Thanks for the PR 😅 I'll have a look at this more closely when I have time, but upon first inspection it looks sensible. If I haven't reviewed by the end of this weekend feel free to ping me 😅 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is sound. Even though as @matthew-russo pointed out there is bytes_regex_parsed
already, it requires to pull in regex_syntax
to be used (side note, this crate should probably be re-exported as it appears in the public api), so extending the fn bytes_regex
without a breaking change is a nice solution to allow this to be done directly. Additionally, having the extra doc on how to do this might help someone in future
Agh thanks for reminding me I need to look at this -- will do it this evening after work |
It is desirable to be able to generate, from a regex, byte sequences that are not necessarily valid UTF-8. For example, suppose you have a parser that accepts any string generated by the regex ``` [0-9]+(\.[0-9]*)? ``` Then, in your test suite, you might want to generate strings from the complementary regular language, which you could do with the regex ``` (?s:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*) ``` However, this will still only generate valid UTF-8 strings. Maybe you are parsing directly from byte sequences read from disk, in which case you want to test the parser’s ability to reject invalid UTF-8 _as well as_ valid UTF-8 but not within the accepted language. Then you want this slight variation: ``` (?s-u:|[^0-9].*|[0-9]+[^0-9.].*|[0-9]+\.[0-9]*[^0-9].*) ``` But this regex will be rejected by `bytes_regex`, because by default `regex_syntax::Parser` errors out on any regex that potentially matches invalid UTF-8. The application — i.e. proptest — must opt into use of such regexes. This patch makes proptest do just that, for `bytes_regex` only. There should be no change to the behavior of any existing test suite, because opting to allow use of `(?-u)` does not change the semantics of any regex that _doesn’t_ contain `(?-u)`, and any existing regex that _does_ contain `(?-u)` must be incapable of generating invalid UTF-8 for other reasons, or `regex_syntax::Parser` would be rejecting it. (For example, `(?-u:[a-z])` cannot generate invalid UTF-8.) This patch also adds a bunch of tests for `bytes_regex`, which AFAICT was not being tested at all. Some of these use the new functionality and others don’t. There is quite a bit of code duplication in the test helper functions — `do_test` and `do_test_bytes` are almost identical, as are `generate_values_matching_regex` and `generate_byte_values_matching_regex`. I am not good enough at generic metaprogramming in Rust to factor out the duplication.
Rebased on latest trunk, sorry for the delay. |
Commit rust-lang/regex@706b07d renamed ParserBuilder::allow_invalid_utf8 to ParserBuilder::utf8 and inverted the sense of its argument. Separate commit for review purposes; should be squashed before landing to preserve bisectability of trunk.
Added another commit to fix up the code for an API change. I cannot run the test suite locally right now due to what looks like a bug in message-io...
... but this is sufficient to make proptest build again as a dependency of a larger project that uses the feature this PR adds, and that project's testsuite passes. The additional commit should be squashed before merging. |
Fatfingered the close button, sorry for the noise. |
It is desirable to be able to generate, from a regex, byte sequences that are not necessarily valid UTF-8. For example, suppose you have a parser that accepts any string generated by the regex
Then, in your test suite, you might want to generate strings from the complementary regular language, which you could do with the regex
However, this will still only generate valid UTF-8 strings. Maybe you are parsing directly from byte sequences read from disk, in which case you want to test the parser’s ability to reject invalid UTF-8 as well as valid UTF-8 but not within the accepted language. Then you want this slight variation:
But this regex will be rejected by
bytes_regex
, because by defaultregex_syntax::Parser
errors out on any regex that potentially matches invalid UTF-8. The application — i.e. proptest — must opt into use of such regexes. This patch makes proptest do just that, forbytes_regex
only.There should be no change to the behavior of any existing test suite, because opting to allow use of
(?-u)
does not change the semantics of any regex that doesn’t contain(?-u)
, and any existing regex that does contain(?-u)
must be incapable of generating invalid UTF-8 for other reasons, orregex_syntax::Parser
would be rejecting it. (For example,(?-u:[a-z])
cannot generate invalid UTF-8.)This patch also adds a bunch of tests for
bytes_regex
, which AFAICT was not being tested at all. Some of these use the new functionality and others don’t. There is quite a bit of code duplication in the test helper functions —do_test
anddo_test_bytes
are almost identical, as aregenerate_values_matching_regex
andgenerate_byte_values_matching_regex
. I am not good enough at generic metaprogramming in Rust to factor out the duplication.