Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
automata: fix incorrect offsets reported by reverse inner optimization
Sadly it seems that my days of squashing optimization bugs are still before me. In this particular case, the reverse inner literal optimization (which is a new optimization introduced in regex 1.9) resulted in reporting incorrect match offsets in some cases. The offending case here is: $ regex-cli find match meta --no-table -p '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' -y '888:77:66' 0:1:9:888:77:66 The above reports a match at 1..9, but the correct match is 0..9. The problem here is that the reverse inner literal optimization is being applied, which splits the regex into three (conceptual) pieces: 1. `(?:(\d+)[:.])?(\d{1,2})` 2. `[:.]` 3. `(\d{2})` The reverse inner optimization works by looking for occurrences of (2) first, then matching (1) in reverse to find the start position of the match and then searching for (3) in the forward direction to find the end of the match. The problem in this particular case is that (2) matches at position `3` in the `888:77:66` haystack. Since the first section of numbers is optional, the reverse inner optimization believes a match exists at offset `1` by virtue of matching (1) in reverse. That is, the `(\d{1,2})` matches at 1..3 while the `(?:(\d+)[:.])?` doesn't match at all. The reverse search here is correct in isolation, but it leads to an overall incorrect result by stopping the search early. The issue is that the true leftmost match requires (2) to match at 6..7, but since it matched at 3..4 first, it is considered first and leads to an incorrect overall match. To fix this, we add another "trip wire" to the reverse inner optimization (of which there are already several) that tries to detect cases where it cannot prove that the match it found is actually the leftmost match. Namely, if it reports a match offset greater than the start of the search and otherwise *could* have kept searching, then we don't know whether we have the true leftmost match. In that case, we bail on the optimization and let a slower path take over. This is yet another example of how the nature of regex searching, and in particular leftmost searching, inhibits the composition of different regex strategies. Or at least, makes them incredibly subtle. Fixes #1060
- Loading branch information