-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Entropy scanning optimizations and base64url support #274
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
tarkatronic
suggested changes
Nov 10, 2021
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted in the review comments, I think we need a whole lot more investigation around this before steaming ahead with it.
Although the regex-based word splitting, if that can be pulled out into a discrete enhancement, that would clearly be faster!
fcda3f3
to
0ddca92
Compare
Several key components have been rewritten for better performance and improved maintainability: * `get_strings_of_set()` has been replaced with `find_string_encodings()` -- Previously we performed matching ourselves, a character at a time. I'm betting a standard regex will be at least as or more efficient. It's also more accurate; in particular the base64 pad character `=` now is allowed only at the end of an encoding, as specified by RFC4648. * `calculate_entropy()` -- Previously we used the encoding alphabet (unnecessarily) as a loop range, and scanned the input data once for every character in the alphabet. Now we scan the input once and use a Counter to generate probability data. * `evaluate_entropy_string()` no longer requires the encoding alphabet as input * `scan_entropy()` was reorganized to avoid buffering intermediate results * `BASE64_CHARS` and `HEX_CHARS` have been replaced by `BASE64_REGEX` and `HEX_REGEX` respectively. All unit tests pass after only changes to fix up call signatures. The only functional difference is that the old implementation believed `=` could appear in the middle of a base64 encoding, and the new implementation does not.
Clean up linter complaint caused by previous typing fix
This might seem weird and out of place, but I think it's useful because it explains some underlying code behavior that is not at all obvious at first glance.
These tests assert the basic mathematical properties of entropy scoring.
* Remove extraneous comment for reuse elsewhere * Rename `find_string_encodings()` to `find_strings_by_regex()` * Use `string.*` constants for building test cases
Closes #177 We extend the "base64" regex so it recognizes base64url also. As a side effect, some mutant strings that are almost but not quite base64 may be recognized also, but this is expected to be rare in real life and at worst results in analysis of substrings that would otherwise be ignored.
6e2fca9
to
9f10019
Compare
You saw it here. pylint: "You have too many tests!" Split test class to keep the number of test methods in each below pylint's maximum limit.
tarkatronic
approved these changes
Nov 15, 2021
This was referenced Nov 15, 2021
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
To help us get this pull request reviewed and merged quickly, please be sure to include the following items:
PR Type
What kind of change does this PR introduce?
Backward Compatibility
Is this change backward compatible with the most recently released version? Does it introduce changes which might change the user experience in any way? Does it alter the API in any way?
Issue Linking
closes #177
closes #273
What's new?
Several key components have been rewritten for better performance and improved maintainability:
get_strings_of_set()
has been replaced withfind_string_encodings()
-- Previously we performed matching ourselves, a character at a time. I'm betting a standard regex will be at least as or more efficient. It's also more accurate; in particular the base64 pad character=
now is allowed only at the end of an encoding, as specified by RFC4648.calculate_entropy()
-- Previously we used the encoding alphabet (unnecessarily) as a loop range, and scanned the input data once for every character in the alphabet. Now we scan the input once and use a Counter to generate probability data.evaluate_entropy_string()
no longer requires the encoding alphabet as inputscan_entropy()
was reorganized to avoid buffering intermediate resultsBASE64_CHARS
andHEX_CHARS
have been replaced byBASE64_REGEX
andHEX_REGEX
respectively.BASE64_REGEX
has been extended to include-
and_
so it discovers base64url strings in addition to base64 strings.All unit tests pass after only changes to fix up call signatures. Possible regressions are these, none of which are expected to appear in real life:
=
in the middle would have been recognized as base64 but now it will be recognized as two base64 strings (the first ending with the=
and the second containing the remainder of the string).=
is not allowed to appear in a base64 (or base64url) string except at the end.-
or_
will now be recognized as a longer string containing those characters. These are not valid base64 but are valid base64url and the scanner does not distinguish between the two (or a mutant combination of both of them).