Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entropy scanning optimizations and base64url support #274

Merged
merged 13 commits into from
Nov 15, 2021
Merged

Conversation

rbailey-godaddy
Copy link
Contributor

@rbailey-godaddy rbailey-godaddy commented Nov 9, 2021

To help us get this pull request reviewed and merged quickly, please be sure to include the following items:

  • Tests (if applicable)
  • Documentation (if applicable)
  • Changelog entry
  • A full explanation here in the PR description of the work done

PR Type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update (formatting, local variables)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Backward Compatibility

Is this change backward compatible with the most recently released version? Does it introduce changes which might change the user experience in any way? Does it alter the API in any way?

  • Yes (backward compatible) -- with minor/rare corner cases noted below
  • No (breaking changes)

Issue Linking

closes #177
closes #273

What's new?

Several key components have been rewritten for better performance and improved maintainability:

  • get_strings_of_set() has been replaced with find_string_encodings() -- Previously we performed matching ourselves, a character at a time. I'm betting a standard regex will be at least as or more efficient. It's also more accurate; in particular the base64 pad character = now is allowed only at the end of an encoding, as specified by RFC4648.
  • calculate_entropy() -- Previously we used the encoding alphabet (unnecessarily) as a loop range, and scanned the input data once for every character in the alphabet. Now we scan the input once and use a Counter to generate probability data.
  • evaluate_entropy_string() no longer requires the encoding alphabet as input
  • scan_entropy() was reorganized to avoid buffering intermediate results
  • BASE64_CHARS and HEX_CHARS have been replaced by BASE64_REGEX and HEX_REGEX respectively.
  • Finally, BASE64_REGEX has been extended to include - and _ so it discovers base64url strings in addition to base64 strings.

All unit tests pass after only changes to fix up call signatures. Possible regressions are these, none of which are expected to appear in real life:

  • A string containing = in the middle would have been recognized as base64 but now it will be recognized as two base64 strings (the first ending with the = and the second containing the remainder of the string). = is not allowed to appear in a base64 (or base64url) string except at the end.
  • A base64 string adjacent to either - or _ will now be recognized as a longer string containing those characters. These are not valid base64 but are valid base64url and the scanner does not distinguish between the two (or a mutant combination of both of them).

@rbailey-godaddy rbailey-godaddy requested a review from a team as a code owner November 9, 2021 21:14
@rbailey-godaddy rbailey-godaddy requested review from irodelta, sushantmimani and mayuriesha and removed request for a team November 9, 2021 21:14
Copy link
Contributor

@tarkatronic tarkatronic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in the review comments, I think we need a whole lot more investigation around this before steaming ahead with it.

Although the regex-based word splitting, if that can be pulled out into a discrete enhancement, that would clearly be faster!

tartufo/scanner.py Show resolved Hide resolved
tartufo/scanner.py Show resolved Hide resolved
@rbailey-godaddy rbailey-godaddy changed the title Entropy scanning optimizations Entropy scanning optimizations and base64url support Nov 12, 2021
Several key components have been rewritten for better performance and
improved maintainability:

* `get_strings_of_set()` has been replaced with `find_string_encodings()` --
  Previously we performed matching ourselves, a character at a time. I'm
  betting a standard regex will be at least as or more efficient. It's
  also more accurate; in particular the base64 pad character `=` now is
  allowed only at the end of an encoding, as specified by RFC4648.
* `calculate_entropy()` -- Previously we used the encoding alphabet
  (unnecessarily) as a loop range, and scanned the input data once for
  every character in the alphabet. Now we scan the input once and use a
  Counter to generate probability data.
* `evaluate_entropy_string()` no longer requires the encoding alphabet
  as input
* `scan_entropy()` was reorganized to avoid buffering intermediate
  results
* `BASE64_CHARS` and `HEX_CHARS` have been replaced by `BASE64_REGEX`
  and `HEX_REGEX` respectively.

All unit tests pass after only changes to fix up call signatures. The
only functional difference is that the old implementation believed `=`
could appear in the middle of a base64 encoding, and the new
implementation does not.
Clean up linter complaint caused by previous typing fix
This might seem weird and out of place, but I think it's useful because
it explains some underlying code behavior that is not at all obvious at
first glance.
These tests assert the basic mathematical properties of entropy scoring.
* Remove extraneous comment for reuse elsewhere
* Rename `find_string_encodings()` to `find_strings_by_regex()`
* Use `string.*` constants for building test cases
 Closes #177

 We extend the "base64" regex so it recognizes base64url also. As a side
 effect, some mutant strings that are almost but not quite base64 may
 be recognized also, but this is expected to be rare in real life and at
 worst results in analysis of substrings that would otherwise be
 ignored.
You saw it here. pylint: "You have too many tests!"

Split test class to keep the number of test methods in each below pylint's
maximum limit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants