ERROR: failed to run post-scan plugin: consolidate: and False positive on very long lines #2726
Comments
After running the script again, I received more info, this time pointing to the file that caused the error:
That file happens to be the largest one, at 548 MB. This time, scancode actually finished and produced a summary and output for the other files.
@FrankHeimes Thank you for the report. That's a sizeable DLL indeed and the likely cause of the trouble. The difficulty here is finding a delicate balance between skipping such a file entirely, and so possibly missing important information, and finding a way to get some scan data (possibly DLL metadata and basic file info) but not other data (such as license and copyright details). Another approach could be to split such a large file into arbitrary chunks (say 5 to 10 MB), run the scans as usual and more efficiently on these fragments, and add a special check for scannable data and results near the chunk boundaries, which would require restitching and rescanning some chunk regions. Yet another could be a command line option to skip files above a certain size entirely. What would be your take there?
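The chunking idea in the comment above could be sketched roughly as follows. This is only an illustration, not ScanCode code: the function name `iter_chunks`, the 8 MB chunk size, and the 4 KB overlap are assumptions. Consecutive chunks share a few bytes so that text straddling a chunk boundary still appears whole in at least one chunk, provided it is no longer than the overlap plus one byte.

```python
import os
from typing import Iterator, Tuple


def iter_chunks(
    path: str,
    chunk_size: int = 8 * 1024 * 1024,  # assumed value, within the "5 to 10 MB" range
    overlap: int = 4096,                # assumed value; bytes shared by neighbouring chunks
) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, data) pairs that cover the whole file.

    Each chunk starts `chunk_size - overlap` bytes after the previous one,
    so the last `overlap` bytes of a chunk are repeated at the start of the next.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            f.seek(offset)
            data = f.read(chunk_size)
            if not data:
                break
            yield offset, data
            # Stop once this chunk has reached the end of the file.
            if offset + len(data) >= size:
                break
            offset += chunk_size - overlap
```

A caller would scan each `data` fragment independently and add `offset` to any match position to map it back to the original file. Matches found inside the overlapping region may be reported twice and would need deduplication.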
@pombredanne IMHO, binary files warrant type-specific scanners, because they usually have a specific structure. So copyright data can't just appear at random locations in those files. And if it does, then it is random data! For example, the (C) character followed by some arbitrary printable characters notoriously triggers false positive matches when using trivial scanners. Taking the structure of a file into account, it may be possible to just seek past 99.9% of the contents of a file and examine only the relevant parts. This way it doesn't matter if the file is 4 kB or 4 GB in size.
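As a minimal sketch of the "seek past most of the file" idea: the snippet below reads only a window at the start and at the end of a file, so the cost is independent of the file size. Which regions are actually relevant depends on the file format; looking at the head and tail is purely an assumption for illustration, and `read_head_and_tail` and the 64 KB window are hypothetical names and values, not part of ScanCode.

```python
import os
from typing import Tuple


def read_head_and_tail(path: str, window: int = 64 * 1024) -> Tuple[bytes, bytes]:
    """Return up to `window` bytes from the start and from the end of a file.

    The two regions never overlap: for small files the tail only contains
    the bytes that were not already returned in the head.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(window)
        if size <= window:
            # The head already covers the whole file.
            return head, b""
        # Jump directly to the tail without reading the bytes in between.
        f.seek(max(window, size - window))
        tail = f.read()
    return head, tail
```

A format-aware scanner would replace the fixed head and tail windows with offsets taken from the file's own headers or section tables.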
Last night, I ran scancode on the boost sources. As a result, it reported the consolidate error on these files:
These files are just 1.3 MB and 2.2 MB in size and appear to have "innocent" content.
@FrankHeimes you are nailing it! The thing is that each format may need its own specific handling. In general, compressed data does not have much one can squeeze out... but as it happens, I once found GPL references in the paths from a compressed and unextracted Zip central file directory. And I routinely find proper license and copyright information in ELFs and DLLs. I guess one approach is to at least find a way to ignore most compressed files.
The culprit is the copyright detection on these large files. The process for this is roughly explained here: The process consists in:
The issue is that candidate detection is line-based, and a very long line means a very long time to lex and parse, possibly to find nothing. One solution would be to break very long lines into chunks, a strategy already adopted for license detection and seen in action here: https://github.com/nexB/scancode-toolkit/blob/c09309f99c27de4ddb0c1e6e3619b833ceb2aa6e/src/textcode/analysis.py#L138 In the short term, adding a On my laptop (Intel(R) Xeon(R) CPU E3-1505M v6 @ 3.00GHz, 32GB RAM) I got vector200.hpp to scan alright with a timeout of 300 seconds.
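The line-breaking strategy mentioned in this comment could look roughly like the sketch below. It is not the implementation in `textcode/analysis.py`; the function name `split_long_line` and the 1000-character limit are assumptions chosen for illustration. Long lines are cut into pieces of bounded length, preferring to cut just after a space so that words are not split when that can be avoided.

```python
from typing import List


def split_long_line(line: str, max_len: int = 1000) -> List[str]:
    """Split `line` into pieces of at most `max_len` characters.

    Concatenating the returned pieces gives back the original line.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(line) <= max_len:
        return [line]

    pieces = []
    start = 0
    while start < len(line):
        end = min(start + max_len, len(line))
        if end < len(line):
            # Prefer to cut right after the last space inside the window.
            cut = line.rfind(" ", start + 1, end)
            if cut != -1:
                end = cut + 1
        pieces.append(line[start:end])
        start = end
    return pieces
```

Each piece can then be lexed and parsed on its own, which bounds the work per unit of input. The trade-off is that a copyright statement spanning a cut point may be missed or truncated.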
Description
After scanning the Qt code base, scancode failed to run the final steps:
How To Reproduce
C:\My\scancode-toolkit-30.0.0
"C:\My\scancode-toolkit-30.0.0\Scripts\scancode" -clpieu --license-score 60 --license-text --license-text-diagnostics --only-findings --strip-root --classify --json D:\SourceCodeLicenses.json --summary --generated --consolidate -n 30 --ignore-author "\.rc$|::|\(User Name\) CString|AppDomainManager|Read\(\)\. DataRecord|^the [A-Z][A-Za-z ]+$|Fred Flintstone\(FST\)|Microsoft Visual|Cortana" --ignore-copyright-holder "Microsoft|BCGSoft|Cortana|Basler.*(Basler|Vision)|Allied Vision|Stemmer" --ignore */BCG/* --ignore */ConfigurationManagement/Certificate.pfx/* --ignore */Salut/* --ignore */doc/* --ignore */tutorials/* --ignore *.acf --ignore *.appxmanifest --ignore *.aux --ignore *.bin --ignore *.bmp --ignore *.config --ignore *.cur --ignore *.dat --ignore *.db --ignore *.def --ignore *.hlsl --ignore *.hlsli --ignore *.ico --ignore *.ifc --ignore *.ilk --ignore *.ipch --ignore *.ism --ignore *.jpg --ignore *.lib --ignore *.manifest --ignore *.mc --ignore *.metagen --ignore *.mp4 --ignore *.nls --ignore *.obj --ignore *.pch --ignore *.pchast --ignore *.pdb --ignore *.pfx --ignore *.png --ignore *.pri --ignore *.resfiles --ignore *.resources --ignore *.rh --ignore *.rsp --ignore *.ruleset --ignore *.snk --ignore *.svd --ignore *.tlb --ignore *.tlh --ignore *.tli --ignore *.tlog --ignore *.ver --ignore *.winmd --ignore *.xbf --ignore *.xdc --ignore *.xsd D:\ExtractedQtPackage\Qt
Note that scanning other packages (e.g. MKL from Intel) using the same command succeeds.
System configuration
AMD Ryzen 9 3950X (16 core, 32 thread), 32GB RAM, M.2 SSD 970 EVO Plus 1TB
Windows 10 Enterprise LTSC (1809), Python 3.9, Scancode 30.0.0, downloaded and extracted to C:\My