-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: report issues as discovered instead of buffering #227
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🦅
cccc2b1
to
d07ca54
Compare
d07ca54
to
032ed0b
Compare
b9eebb3
to
9952ab5
Compare
mypy is happy and unit tests have been fixed up so they pass again (mostly by liberal application of `list()` to force execution of the generator). This still needs work to fix the logic where it is assumed that the `issues` property is a list, namely where we try to take the length of it or consume it more than once.
Add new `issue_count` property to scanner class. This is incremented as issues are yielded by the low level routines, and allows the caller to tell how many issues we got without inspecting the pipeline multiple times. We needed one non-obvious fix in util.py to ensure we have consumed all issues before checking the count. (Before, it would have been zero nearly always.)
Previously we marshalled all issues into a single `echo()` call, which resulted in buffering all output until the completion of the scan. Now we use a loop (like "compact") and issue a separate `echo()` for each issue, allowing early output to be produced prior to the generation of later issues. This also gets rid of the unwanted "empty output for no issues" behavior introduced by an earlier commit.
This works because only one part of the output is generated, so we can output a (partial) static blob to start, then an issue at a time, and finally terminate the object to make it valid. Because this is a little hinky, the corresponding unit tests have been adjusted to actually generate (and capture) the entire stdout stream. We then deserialize (to verify syntax) and compare the resulting dict with the original input data to confirm there was no content corruption.
Address the last remaining wart; make `echo_result()` buffer issues (even as it streams them to console) and return the list so it's possible to write all issues to files at the end if desired. This returns `write_outputs()` to its original semantics, namely consumption of a List rather than a Generator. Arguably it might be nicer to create files as we go, but that would constitute a behavior change (files not all done right at the end) and need some code refactoring.
Revise test to reflect use of `-1` vs `None` to indicate scan not started
This commit restores the `._issues` member (a list) and reverts the `.issues` property to a list instead of a generator. Several instances of caller references to `.issues` have been restored to their original form. Behind the covers, a new `._completed` member (and matching property) tracks whether or not the scan has completed; `.issues` checks this and forces the scan to run to completion if needed. The low level generators now append to `._issues` instead of incrementing `._issue_count` (which is no longer needed), and `scan()` clears state information when it starts and then sets `._completed` when finished. To make it all work, `echo_result()` now loops on `.scan()` instead of `.issues` so output is still "live" although `.issues` is not.
If `scan()` is called (again) after having completed a scan, return immediately without doing anything. The proper way to look at issues (again) is to consult `.issues`. This limits, but does not eliminate, the potential for craziness if a caller decides to interleave calls to `.scan()` and references to `.issues`.
We squelch "too many instance attributes" on `ScannerBase` because we really really need a `_completed` flag so we tell pylint to suck it up.
Slow burn from @tarkatronic review comment. If `scan()` is called again, yield from the cached `._issues` list instead of doing nothing. This looks like the old behavior except it doesn't actually rescan.
477a1fd
to
1f43307
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of this is probably just a bit of pontification (specifically the thread safety stuff), but I do want to be sure we don't add more side-effects where they did not previously exist.
The real object doesn't have `issue_count` any longer.
Cache issues inside `scan()` instead of the lower-level routines (which now are generators without side effects).
Multiple competing calls to `scan()` are now guaranteed to execute sequentially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting in all the hard work on this @rbailey-godaddy!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🦅
This is a reimplementation of PR #108 which streams output (including JSON).
To help us get this pull request reviewed and merged quickly, please be sure to include the following items:
PR Type
What kind of change does this PR introduce?
Backward Compatibility
Is this change backward compatible with the most recently released version? Does it introduce changes which might change the user experience in any way? Does it alter the API in any way?
Issue Linking
What's new?
Issues are now written to the console as soon as they are detected, while the scan is in progress, instead of buffering all of them and reporting only when the scan has completed. This is generally backward compatible:
issues
property (and not thescan()
method) will see no changes in behavior.scan()
method (and not theissues
property) will see no changes in behavior, although the method is now a generator that returns issues in succession rather than a function which returns a list. A caller who doeslen(scanner.scan())
(or anything else that assumes the result is a list) should usescanner.issues
(which is a list) instead.scan()
andissues
, where allscan()
references occur before allissues
references, will see no changes in behavior.These cases are not backward compatible:
scan()
multiple times, expecting to see the scan re-run each time, will observe different behavior. The first series of invocations identifies (and caches) issues, but subsequent invocations re-use the cached issues instead of actually rescanning the target. It looks the same (if you allow for generator vs. list) but you are not actually doing repeated scans.issues
and then callsscan()
will see that thescan()
call now will iterate over the cached issue list instead of rescanning. Once the caller has referenced theissues
property, any subsequentscan()
calls should be replaced with references toissues
.scan()
and then referencingissues
will return the full list of issues, but causescan()
to generate from the cached issues list, meaning that the initial issues will appear to be duplicated. The caller should either useissues
exclusively, or ensure all references toissues
occur after all calls toscan()
.This change was accomplished using the following strategy:
ScannerBase
methods which previously appended to the_issues
member have been converted to generators which yield individual issues; the lowest-level methods still append to_issues
(so as to cache a record of all issues) in addition to yielding the issue. In particular,scan()
is now a generator._completed
member (and matchingcompleted
property) track whether a scan is still in progress (i.e. whether or not thescan()
generator is exhausted); this is initiallyFalse
and set toTrue
byscan()
when it finally exits.issues
property now tests to see if the scan is still in progress, and forcibly drains the generator if needed prior to returning the issues list (which otherwise could be incomplete).echo_output()
now loops onscan()
instead ofissues
in order to obtain issues individually as they are detected, allowing output to be produced as soon as it is encountered; some code reordering was performed to ensure no references toissues
occur before all references toscan()
have been made.echo_output()
were refactored to output individual issues instead of consolidating output to a single large result at the end -- in the most complicated of these, JSON output is handled by generating it in pieces: an initial part, a part for each issue, and a terminal part -- the resulting output is unchanged, but it now appears incrementally;