Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entropy scanning fixups #319

Closed
wants to merge 2 commits into from
Closed

Entropy scanning fixups #319

wants to merge 2 commits into from

Commits on Jan 11, 2022

  1. Entropy scanning fixups

    The loosey-goosey "base64-anything" matching introduced in 3.0.0 turned
    out to be problematic because we several examples of things that looked
    like base64 encodings but weren't and which generated new issues thanks
    to the extended combined alphabet. Simple tweaks looked problematic
    (read "nondeterministic") going forward.
    
    This fix is a little bit more invasive.
    
    * We scan for BASE64 and BASE64URL encodings separately (using the same
      flakey expressions as pre-3.0.0 did) so they cannot bleed into each
      other.  The cost of this is that we now make 3 (vs 2) regex scans of
      input text. A number of unit tests required revision to account for
      the additional calls.
    * To offset this cost, we further adjust the regex expressions to match
      a minimum of 20 characters, offloading much of the work from function
      `util.find_strings_by_regex()` which spent most of its time throwing
      away short strings. Nothing in the tartufo codebase allows users to
      supply a different minimum length; I added a warning for folks who
      might be building other code on top of tartufo.
    * Additionally, `scanner.scan_entropy()` used to split lines into words
      and scan the words individually. Since the supplied regexs will do
      this by themselves anyway, we eliminate the `line.split()` step and
      eliminate a pass over the input text. This required fixes to a number
      of unit tests where we generally make fewer calls on longer targets.
    * But wait! There's more! Many strings are both base64 and base64url,
      and we don't want to generate duplicate (or overlapping) issues.
      That resulted in the following two changes...
    * `scanner.evaluate_entropy_string()` now returns an optional issue
      instead of being a generator (that returned either 1 or 0 issues).
      This allows its only caller to inspect the return before passing it
      back to higher-level callers.
    * `scanner.scan_entropy()` now jumps through deduplication hoops:
      - results from both base64 and base64url scans are combined in a set
        to eliminate strict duplicates; as a side effect, it is possible that
        lines with multiple issues (in the same line) may return the issues
        in a different order.
      - target strings are evaluated from longest to shortest, and we no
        longer report an issue for a substring if we already reported an issue
        on the longer string containing it. We do not strictly test that the
        substring is located within the longer string, so a line that contained
        "bad-string even-more-bad-string" would not generate an issue for
        the initial "bad-string" even though it appears before the subsequent
        "even-more-bad-string" (which would get an issue).
      - We also test for duplicate hex encoding issues (because a hex string
        is also a base64 string) -- this could not happen with default
        sensitivity (backstory omitted) but is nice to close off.
    
    My testing shows variations in the noise level when tartufo scans itself,
    but it seems like the extra overhead should be approximately balanced by
    the efficiency gains and we'll see a wash. Time will tell.
    
    Note it is *still* possible to get new issues (not reported by 2.x) if a
    string begins or ends with `-` or `_` and the old string didn't generate
    an issue but the new longer string does.
    rscottbailey committed Jan 11, 2022
    Configuration menu
    Copy the full SHA
    d4c3f07 View commit details
    Browse the repository at this point in the history

Commits on Jan 12, 2022

  1. CHANGELOG updated

    rscottbailey committed Jan 12, 2022
    Configuration menu
    Copy the full SHA
    90cd64e View commit details
    Browse the repository at this point in the history