The loosey-goosey "base64-anything" matching introduced in 3.0.0 turned
out to be problematic: we found several examples of strings that looked
like base64 encodings but weren't, and which generated new issues thanks
to the extended combined alphabet. Simple tweaks looked risky
(read "nondeterministic") going forward.
This fix is a little bit more invasive.
* We scan for BASE64 and BASE64URL encodings separately (using the same
flaky expressions as pre-3.0.0 did) so they cannot bleed into each
other. The cost of this is that we now make 3 (vs 2) regex scans of
input text. A number of unit tests required revision to account for
the additional calls.
* To offset this cost, we further adjust the regexes to match
a minimum of 20 characters, offloading much of the work from
`util.find_strings_by_regex()`, which previously spent most of its time
throwing away short strings. Nothing in the tartufo codebase allows users to
supply a different minimum length; I added a warning for folks who
might be building other code on top of tartufo.
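As a sketch of the idea (the patterns below are illustrative, not the actual tartufo expressions): each alphabet gets its own pattern, and the minimum length lives in the regex itself via a `{20,}` quantifier, so short runs never reach the caller at all.

```python
import re

# Illustrative patterns only -- the real tartufo expressions differ.
# Each alphabet is scanned separately so the two cannot bleed together,
# and the {20,} quantifier enforces the 20-character minimum inside the
# regex instead of in post-processing.
BASE64_PATTERN = re.compile(r"[A-Za-z0-9+/=]{20,}")
BASE64URL_PATTERN = re.compile(r"[A-Za-z0-9\-_=]{20,}")

def find_candidates(text, pattern):
    """Return candidate strings; no post-hoc length filtering needed."""
    return pattern.findall(text)
```

For example, `find_candidates("key QWxhZGRpbjpvcGVuIHNlc2FtZQ== x", BASE64_PATTERN)` returns only the one long run; the short tokens never match.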
* Additionally, `scanner.scan_entropy()` used to split lines into words
and scan the words individually. Since the supplied regexes do
this by themselves anyway, we eliminate the `line.split()` step and
eliminate a pass over the input text. This required fixes to a number
of unit tests where we generally make fewer calls on longer targets.
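Since neither alphabet includes whitespace, a single regex pass over the whole line finds the same candidates the per-word scan did; a sketch (the pattern is illustrative, not the actual tartufo expression):

```python
import re

# Illustrative base64 pattern with the 20-character minimum built in.
BASE64_PATTERN = re.compile(r"[A-Za-z0-9+/=]{20,}")

line = "token QWxhZGRpbjpvcGVuIHNlc2FtZQ== trailing words"

# Old approach (sketch): split first, then scan each word.
per_word = [m for word in line.split() for m in BASE64_PATTERN.findall(word)]

# New approach (sketch): one scan over the whole line.
per_line = BASE64_PATTERN.findall(line)

assert per_word == per_line  # same candidates, one fewer pass
```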
* But wait! There's more! Many strings are both base64 and base64url,
and we don't want to generate duplicate (or overlapping) issues.
That resulted in the following two changes...
* `scanner.evaluate_entropy_string()` now returns an optional issue
instead of being a generator (that yielded either zero or one issues).
This allows its only caller to inspect the return before passing it
back to higher-level callers.
* `scanner.scan_entropy()` now jumps through deduplication hoops:
- results from both base64 and base64url scans are combined in a set
to eliminate strict duplicates; as a side effect, a line with multiple
issues may now return those issues in a different order.
- target strings are evaluated from longest to shortest, and we no
longer report an issue for a substring if we already reported an issue
on the longer string containing it. We do not strictly test that the
substring is located within the longer string, so a line that contained
"bad-string even-more-bad-string" would not generate an issue for
the initial "bad-string" even though it appears before the subsequent
"even-more-bad-string" (which would get an issue).
- We also test for duplicate hex encoding issues (because a hex string
is also a base64 string) -- this could not happen with default
sensitivity (backstory omitted) but is nice to close off.
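Putting the deduplication rules above together, a minimal sketch (the function name and shape are hypothetical, not the actual tartufo code):

```python
def dedupe_candidates(base64_hits, base64url_hits, hex_hits=()):
    # Combine all scans in a set to drop strict duplicates; set ordering
    # is why issues within a single line may now come back reordered.
    candidates = set(base64_hits) | set(base64url_hits) | set(hex_hits)
    reported = []
    # Longest first; a candidate is skipped if it is a substring of any
    # string already reported. Note this is a plain substring test, not
    # a positional containment test.
    for cand in sorted(candidates, key=len, reverse=True):
        if any(cand in longer for longer in reported):
            continue
        reported.append(cand)
    return reported
```

For the example line above, `dedupe_candidates([], ["even-more-bad-string", "bad-string"])` reports only `"even-more-bad-string"`, because `"bad-string"` occurs inside it.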
My testing shows variations in the noise level when tartufo scans itself,
but it seems like the extra overhead should be approximately balanced by
the efficiency gains and we'll see a wash. Time will tell.
Note it is *still* possible to get new issues (not reported by 2.x) if a
string begins or ends with `-` or `_` and the old string didn't generate
an issue but the new longer string does.
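To illustrate (again with illustrative patterns, not the actual expressions): a base64 scan stops at `_`, while the base64url scan keeps it, so the latter can produce a longer candidate whose entropy may cross the reporting threshold even when the shorter one did not.

```python
import re

# Illustrative patterns; the real tartufo expressions differ.
BASE64_PATTERN = re.compile(r"[A-Za-z0-9+/=]{20,}")
BASE64URL_PATTERN = re.compile(r"[A-Za-z0-9\-_=]{20,}")

s = "_QWxhZGRpbjpvcGVuIHNlc2FtZQ"

# The base64 scan sees only the inner run; the base64url scan also
# captures the leading underscore, yielding a longer candidate.
print(BASE64_PATTERN.findall(s))     # ['QWxhZGRpbjpvcGVuIHNlc2FtZQ']
print(BASE64URL_PATTERN.findall(s))  # ['_QWxhZGRpbjpvcGVuIHNlc2FtZQ']
```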