Entropy scanning optimizations and base64url support #274

rbailey-godaddy · 2021-11-09T21:14:58Z

To help us get this pull request reviewed and merged quickly, please be sure to include the following items:

Tests (if applicable)
Documentation (if applicable)
Changelog entry
A full explanation here in the PR description of the work done

PR Type

What kind of change does this PR introduce?

Backward Compatibility

Is this change backward compatible with the most recently released version? Does it introduce changes which might change the user experience in any way? Does it alter the API in any way?

Yes (backward compatible) -- with minor/rare corner cases noted below
No (breaking changes)

Issue Linking

closes #177
closes #273

What's new?

Several key components have been rewritten for better performance and improved maintainability:

get_strings_of_set() has been replaced with find_string_encodings() -- Previously we performed matching ourselves, a character at a time. I'm betting a standard regex will be at least as or more efficient. It's also more accurate; in particular the base64 pad character = now is allowed only at the end of an encoding, as specified by RFC4648.
calculate_entropy() -- Previously we used the encoding alphabet (unnecessarily) as a loop range, and scanned the input data once for every character in the alphabet. Now we scan the input once and use a Counter to generate probability data.
evaluate_entropy_string() no longer requires the encoding alphabet as input
scan_entropy() was reorganized to avoid buffering intermediate results
BASE64_CHARS and HEX_CHARS have been replaced by BASE64_REGEX and HEX_REGEX respectively.
Finally, BASE64_REGEX has been extended to include - and _ so it discovers base64url strings in addition to base64 strings.

All unit tests pass after only changes to fix up call signatures. Possible regressions are these, none of which are expected to appear in real life:

A string containing = in the middle would have been recognized as base64 but now it will be recognized as two base64 strings (the first ending with the = and the second containing the remainder of the string). = is not allowed to appear in a base64 (or base64url) string except at the end.
A base64 string adjacent to either - or _ will now be recognized as a longer string containing those characters. These are not valid base64 but are valid base64url and the scanner does not distinguish between the two (or a mutant combination of both of them).

tarkatronic

As noted in the review comments, I think we need a whole lot more investigation around this before steaming ahead with it.

Although the regex-based word splitting, if that can be pulled out into a discrete enhancement, that would clearly be faster!

tartufo/scanner.py

Several key components have been rewritten for better performance and improved maintainability: * `get_strings_of_set()` has been replaced with `find_string_encodings()` -- Previously we performed matching ourselves, a character at a time. I'm betting a standard regex will be at least as or more efficient. It's also more accurate; in particular the base64 pad character `=` now is allowed only at the end of an encoding, as specified by RFC4648. * `calculate_entropy()` -- Previously we used the encoding alphabet (unnecessarily) as a loop range, and scanned the input data once for every character in the alphabet. Now we scan the input once and use a Counter to generate probability data. * `evaluate_entropy_string()` no longer requires the encoding alphabet as input * `scan_entropy()` was reorganized to avoid buffering intermediate results * `BASE64_CHARS` and `HEX_CHARS` have been replaced by `BASE64_REGEX` and `HEX_REGEX` respectively. All unit tests pass after only changes to fix up call signatures. The only functional difference is that the old implementation believed `=` could appear in the middle of a base64 encoding, and the new implementation does not.

Clean up linter complaint caused by previous typing fix

This might seem weird and out of place, but I think it's useful because it explains some underlying code behavior that is not at all obvious at first glance.

These tests assert the basic mathematical properties of entropy scoring.

* Remove extraneous comment for reuse elsewhere * Rename `find_string_encodings()` to `find_strings_by_regex()` * Use `string.*` constants for building test cases

Closes #177 We extend the "base64" regex so it recognizes base64url also. As a side effect, some mutant strings that are almost but not quite base64 may be recognized also, but this is expected to be rare in real life and at worst results in analysis of substrings that would otherwise be ignored.

You saw it here. pylint: "You have too many tests!" Split test class to keep the number of test methods in each below pylint's maximum limit.

rbailey-godaddy requested a review from a team as a code owner November 9, 2021 21:14

rbailey-godaddy requested review from irodelta, sushantmimani and mayuriesha and removed request for a team November 9, 2021 21:14

tarkatronic suggested changes Nov 10, 2021

View reviewed changes

tartufo/scanner.py Show resolved Hide resolved

tartufo/scanner.py Show resolved Hide resolved

rbailey-godaddy requested a review from tarkatronic November 10, 2021 02:23

rbailey-godaddy mentioned this pull request Nov 10, 2021

Improving TruffleHog's Entropy Calculation trufflesecurity/trufflehog#168

Closed

rbailey-godaddy force-pushed the base64url branch from fcda3f3 to 0ddca92 Compare November 12, 2021 23:06

rbailey-godaddy changed the title ~~Entropy scanning optimizations~~ Entropy scanning optimizations and base64url support Nov 12, 2021

rscottbailey added 11 commits November 15, 2021 10:52

Linting cleanups

957b917

Clean up overlooked docstring

6beb4ff

Typing fix for python 3.6 compatibility

4f9d81d

Shoot me now

0f8d54a

Clean up linter complaint caused by previous typing fix

Add some more unit tests

077325d

Linter fixups

5875ea6

Add a theory comment

b61f571

This might seem weird and out of place, but I think it's useful because it explains some underlying code behavior that is not at all obvious at first glance.

Add more entropy unit tests

7d1415a

These tests assert the basic mathematical properties of entropy scoring.

PR review cleanups

fdafd30

* Remove extraneous comment for reuse elsewhere * Rename `find_string_encodings()` to `find_strings_by_regex()` * Use `string.*` constants for building test cases

rbailey-godaddy force-pushed the base64url branch from 6e2fca9 to 9f10019 Compare November 15, 2021 15:54

rscottbailey added 2 commits November 15, 2021 11:00

Merge fixups

8ae1820

Unit test fixups

ecc5dd0

You saw it here. pylint: "You have too many tests!" Split test class to keep the number of test methods in each below pylint's maximum limit.

tarkatronic approved these changes Nov 15, 2021

View reviewed changes

tarkatronic merged commit 28813a9 into v3.x Nov 15, 2021

tarkatronic deleted the base64url branch November 15, 2021 17:26

This was referenced Nov 15, 2021

tartufo entropy scanning does not handle base64url encoding #177

Closed

Optimize entropy measuring functions #273

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Entropy scanning optimizations and base64url support #274

Entropy scanning optimizations and base64url support #274

rbailey-godaddy commented Nov 9, 2021 •

edited

Loading

tarkatronic left a comment

Entropy scanning optimizations and base64url support #274

Entropy scanning optimizations and base64url support #274

Conversation

rbailey-godaddy commented Nov 9, 2021 • edited Loading

PR Type

Backward Compatibility

Issue Linking

What's new?

tarkatronic left a comment

Choose a reason for hiding this comment

rbailey-godaddy commented Nov 9, 2021 •

edited

Loading