The loosey-goosey "base64-anything" matching introduced in 3.0.0 turned
out to be problematic: we found several examples of strings that looked
like base64 encodings but weren't, and which generated new issues thanks
to the extended combined alphabet. Simple tweaks looked risky
(read "nondeterministic") going forward.
This fix is a little bit more invasive.
* We scan for BASE64 and BASE64URL encodings separately (using the same
flaky expressions as pre-3.0.0 did) so they cannot bleed into each
other. The cost of this is that we now make 3 (vs 2) regex scans of
input text. A number of unit tests required revision to account for
the additional calls.
* To offset this cost, we further adjust the regexes to match
a minimum of 20 characters, offloading much of the work from
`util.find_strings_by_regex()`, which previously spent most of its time
throwing away short strings. Nothing in the tartufo codebase allows users to
supply a different minimum length; I added a warning for folks who
might be building other code on top of tartufo.
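As a sketch of the idea (the patterns below are illustrative, not the actual tartufo expressions): each alphabet gets its own pattern, and the minimum length lives in the regex itself via a `{20,}` quantifier, so short runs never reach the caller at all.

```python
import re

# Illustrative patterns only -- the real tartufo expressions differ.
# Each alphabet is scanned separately so the two cannot bleed together,
# and the {20,} quantifier enforces the 20-character minimum inside the
# regex instead of in post-processing.
BASE64_PATTERN = re.compile(r"[A-Za-z0-9+/=]{20,}")
BASE64URL_PATTERN = re.compile(r"[A-Za-z0-9\-_=]{20,}")

def find_candidates(text, pattern):
    """Return candidate strings; no post-hoc length filtering needed."""
    return pattern.findall(text)
```

For example, `find_candidates("key QWxhZGRpbjpvcGVuIHNlc2FtZQ== x", BASE64_PATTERN)` returns only the one long run; the short tokens never match.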
* Additionally, `scanner.scan_entropy()` used to split lines into words
and scan the words individually. Since the supplied regexes do
this by themselves anyway, we eliminate the `line.split()` step and
eliminate a pass over the input text. This required fixes to a number
of unit tests where we generally make fewer calls on longer targets.
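Since neither alphabet includes whitespace, a single regex pass over the whole line finds the same candidates the per-word scan did; a sketch (the pattern is illustrative, not the actual tartufo expression):

```python
import re

# Illustrative base64 pattern with the 20-character minimum built in.
BASE64_PATTERN = re.compile(r"[A-Za-z0-9+/=]{20,}")

line = "token QWxhZGRpbjpvcGVuIHNlc2FtZQ== trailing words"

# Old approach (sketch): split first, then scan each word.
per_word = [m for word in line.split() for m in BASE64_PATTERN.findall(word)]

# New approach (sketch): one scan over the whole line.
per_line = BASE64_PATTERN.findall(line)

assert per_word == per_line  # same candidates, one fewer pass
```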
* But wait! There's more! Many strings are both base64 and base64url,
and we don't want to generate duplicate (or overlapping) issues.
That resulted in the following two changes...
* `scanner.evaluate_entropy_string()` now returns an optional issue
instead of being a generator (that yielded either zero or one issues).
This allows its only caller to inspect the return before passing it
back to higher-level callers.
* `scanner.scan_entropy()` now jumps through deduplication hoops:
- results from both base64 and base64url scans are combined in a set
to eliminate strict duplicates; as a side effect, a line with multiple
issues may now return those issues in a different order.
- target strings are evaluated from longest to shortest, and we no
longer report an issue for a substring if we already reported an issue
on the longer string containing it. We do not strictly test that the
substring is located within the longer string, so a line that contained
"bad-string even-more-bad-string" would not generate an issue for
the initial "bad-string" even though it appears before the subsequent
"even-more-bad-string" (which would get an issue).
- We also test for duplicate hex encoding issues (because a hex string
is also a base64 string) -- this could not happen with default
sensitivity (backstory omitted) but is nice to close off.
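Putting the deduplication rules above together, a minimal sketch (the function name and shape are hypothetical, not the actual tartufo code):

```python
def dedupe_candidates(base64_hits, base64url_hits, hex_hits=()):
    # Combine all scans in a set to drop strict duplicates; set ordering
    # is why issues within a single line may now come back reordered.
    candidates = set(base64_hits) | set(base64url_hits) | set(hex_hits)
    reported = []
    # Longest first; a candidate is skipped if it is a substring of any
    # string already reported. Note this is a plain substring test, not
    # a positional containment test.
    for cand in sorted(candidates, key=len, reverse=True):
        if any(cand in longer for longer in reported):
            continue
        reported.append(cand)
    return reported
```

For the example line above, `dedupe_candidates([], ["even-more-bad-string", "bad-string"])` reports only `"even-more-bad-string"`, because `"bad-string"` occurs inside it.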
My testing shows variations in the noise level when tartufo scans itself,
but it seems like the extra overhead should be approximately balanced by
the efficiency gains and we'll see a wash. Time will tell.
Note it is *still* possible to get new issues (not reported by 2.x) if a
string begins or ends with `-` or `_` and the old string didn't generate
an issue but the new longer string does.
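To illustrate (again with illustrative patterns, not the actual expressions): a base64 scan stops at `_`, while the base64url scan keeps it, so the latter can produce a longer candidate whose entropy may cross the reporting threshold even when the shorter one did not.

```python
import re

# Illustrative patterns; the real tartufo expressions differ.
BASE64_PATTERN = re.compile(r"[A-Za-z0-9+/=]{20,}")
BASE64URL_PATTERN = re.compile(r"[A-Za-z0-9\-_=]{20,}")

s = "_QWxhZGRpbjpvcGVuIHNlc2FtZQ"

# The base64 scan sees only the inner run; the base64url scan also
# captures the leading underscore, yielding a longer candidate.
print(BASE64_PATTERN.findall(s))     # ['QWxhZGRpbjpvcGVuIHNlc2FtZQ']
print(BASE64URL_PATTERN.findall(s))  # ['_QWxhZGRpbjpvcGVuIHNlc2FtZQ']
```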