Treat 'comments' and 'actual code' differently while scanning a file. #1995
Comments
That's a great idea!
You have to be extremely careful when you use synthetic, made-up examples: in most cases, these do not exist in the real world. Using them, you may end up not solving any problem and/or overfitting your solution to a non-problem. That said, I am all for it: you can split things into possibly three categories: The difficulty to gauge is whether this will help with either speed or accuracy for actual license detection on real-world data.
That's a really great insight to learn! I'll be more careful next time.
I guess if we gather enough evidence that almost all the licence statements are found in comments, we may avoid scanning the actual code (not sure if this would be good; maybe add something like a 'shallow scan' option?). This would speed things up, as most of a file is made up of code rather than comments.
@MankaranSingh the idea of a lighter, lower-assurance scan driven by a command option can make sense. Sometimes it is useful to get a quicker feel for what's in there.
@pombredanne yes, some codebases would specifically benefit from this. There can be the following outcomes:
The percentage of each outcome is unknown, though.
This is something to think about when fixing the issue: #2013
Short Description
Examine the following code sample:
They don't really signify those licences, but scancode outputs the following:
It would be better if scancode separated comments/docstrings from actual code in a file before scanning, since licence notices and similar text are almost always found in comments/docstrings.
For this we need to accurately detect the programming language (pygments doesn't) and scan accordingly, since we would then know which character(s) that language uses to mark a comment/docstring.
Also, this would result in faster scans.
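As a rough illustration of the separation step for one language, here is a minimal sketch that pulls comments and docstring-like strings out of Python source with the standard `tokenize` module. It is not how scancode works today: the function name `extract_comment_text` and the sample input are hypothetical, and the docstring test (a string that begins a statement) is a simplifying assumption that would also catch expressions such as `"x".join(...)`.

```python
import ast
import io
import tokenize

# Tokens that do not change whether the next token begins a statement.
_SKIP = {tokenize.NL, tokenize.COMMENT}
# Tokens after which a new statement begins.
_STATEMENT_START = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def extract_comment_text(source):
    """Return comments and docstring-like strings from Python source text.

    Hypothetical helper: identifiers and ordinary string values are dropped,
    so only the extracted text would be handed to a licence scanner.
    """
    pieces = []
    at_statement_start = True  # the beginning of the file starts a statement
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            pieces.append(tok.string.lstrip("#").strip())
        elif tok.type == tokenize.STRING and at_statement_start:
            # Assumption: a string that begins a statement is a docstring.
            try:
                value = ast.literal_eval(tok.string)
            except (ValueError, SyntaxError):
                value = tok.string
            if isinstance(value, str):
                pieces.append(value.strip())
        if tok.type not in _SKIP:
            at_statement_start = tok.type in _STATEMENT_START
    return "\n".join(piece for piece in pieces if piece)


if __name__ == "__main__":
    # Made-up sample, not the one from this issue.
    sample = (
        "# SPDX-License-Identifier: MIT\n"
        "def gpl():\n"
        '    """Docstring mentioning Apache-2.0."""\n'
        '    license = "gpl"  # trailing note\n'
        "    return license\n"
    )
    print(extract_comment_text(sample))
```

The function name `gpl` and the string value `"gpl"` never reach the output, which is the kind of false positive described above. Other languages would need their own comment rules, so this only covers the case where the file is already known to be Python.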
Similar issues:
#1933