Treat 'comments' and 'actual code' differently while scanning a file. #1995

Open
MankaranSingh opened this issue Apr 2, 2020 · 6 comments

Comments

@MankaranSingh
Contributor

MankaranSingh commented Apr 2, 2020

Short Description

Examine the following code sample:

def main():
    for i in range(20):
        print("cc", "by", "nd")


gpl = 20
print(gpl*80)

Although the strings here don't really signify those licences, scancode outputs the following:

  • cc-by-4.0
  • GPL 2.0
  • GPL 1.0 or later

It would be better if scancode separated comments/docstrings from actual code in a file before scanning, since licence notices and the like are almost always found in comments/docstrings.

For this we need to accurately detect the programming language (pygments doesn't) and scan accordingly, since we would then know which character(s) that language uses to mark comments/docstrings.
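For Python files specifically, the separation could be sketched with the standard library alone. This is only an illustration under assumptions: `comments_and_docstrings` is a made-up helper, not part of scancode, and the sample is the code from above with a comment and a docstring added so there is something to extract.

```python
import ast
import io
import tokenize


def comments_and_docstrings(source):
    """Return the text of all comments and docstrings in Python source."""
    texts = []

    # Comments: the tokenizer reports each one as a COMMENT token.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            texts.append(tok.string.lstrip("#").strip())

    # Docstrings: module, class and function nodes may each carry one.
    holders = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, holders):
            doc = ast.get_docstring(node)
            if doc:
                texts.append(doc)
    return texts


# The sample from this issue, plus an illustrative comment and docstring.
sample = '''# SPDX-License-Identifier: Apache-2.0
def main():
    """Print some words."""
    for i in range(20):
        print("cc", "by", "nd")


gpl = 20
print(gpl*80)
'''

print(comments_and_docstrings(sample))
# → ['SPDX-License-Identifier: Apache-2.0', 'Print some words.']
```

Only the extracted text would be passed to license detection, so the `"cc", "by", "nd"` literals and the `gpl` variable never reach it. Other languages would each need their own comment rules, which is where language detection comes in.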

Also, this would result in faster scans.

Similar issues:
#1933

Possible Labels

  • new feature

Select Category

  • Enhancement [x]
  • Add License/Copyright []
  • Scan Feature [x]
  • Packaging []
  • Documentation []
  • Expand Support []
  • Other []
@MankaranSingh MankaranSingh changed the title Treat 'comments' and 'actual code' differently while scanning. Treat 'comments' and 'actual code' differently while scanning a file. Apr 2, 2020
@pombredanne
Member

That's a great idea!

Although the strings here don't really signify those licences, scancode outputs the following:

You have to be extremely careful when you use synthetic, made-up examples: in most cases, these do not exist in the real world. Using them, you may end up not solving any problem and/or overfitting your solution to a non-problem.

That said, I am all for it. You could split things into possibly three categories:
code proper, literals and variables in code, and comments, though license statements may span code and literals.
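For Python sources, a rough version of that split can be sketched with the standard `tokenize` module. The bucket names and the `split_by_category` helper are assumptions made for illustration, not scancode code, and counting numbers as literals is one possible choice.

```python
import io
import tokenize

# Token types that only describe layout and carry no text worth scanning.
LAYOUT = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def split_by_category(source):
    """Sort the tokens of Python source into code, literal and comment buckets."""
    buckets = {"code": [], "literal": [], "comment": []}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in LAYOUT:
            continue
        if tok.type == tokenize.COMMENT:
            buckets["comment"].append(tok.string)
        elif tok.type in (tokenize.STRING, tokenize.NUMBER):
            buckets["literal"].append(tok.string)
        else:
            # Names, keywords and operators.
            buckets["code"].append(tok.string)
    return buckets


# The sample from the issue description.
sample = '''def main():
    for i in range(20):
        print("cc", "by", "nd")


gpl = 20
print(gpl*80)
'''

result = split_by_category(sample)
print(result["literal"])  # → ['20', '"cc"', '"by"', '"nd"', '20', '80']
print(result["comment"])  # → []
```

Docstrings are plain string tokens, so they land in the literal bucket here, which is one concrete case of a license statement living outside comments. On Python 3.12 and later, f-strings are tokenized into separate pieces and would need their own branch.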

The difficulty to gauge is whether this will help with either speed or accuracy for actual license detection on real world data.

@MankaranSingh
Contributor Author

You have to be extremely careful when you use synthetic, made-up examples: in most cases, these do not exist in the real world.

That's a really great insight! I'll be more careful next time.

The difficulty to gauge is whether this will help with either speed or accuracy for actual license detection on real world data.

I guess if we gather enough evidence that almost all licence statements are found in comments, we may avoid scanning the actual code (not sure if this would be good; maybe add something like a 'shallow scan' option?). This would speed things up, as most of a file is made up of code rather than comments.

@pombredanne
Member

@MankaranSingh the idea of a lighter, lower-assurance scan driven by a command-line option can make sense. Sometimes it helps to get a quicker feel for what's in there.

@MankaranSingh
Contributor Author

@pombredanne yes, some codebases would specifically benefit from this. There can be the following outcomes:

  • Fewer false positives with improved speed
  • No effect on accuracy with improved speed
  • Less accuracy but improved speed
  • Full scan with extra info on the category (comment, variable, literal, actual code) so users can decide for themselves.
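The last outcome, a full scan that also reports the category, might look roughly like this for Python files. Everything here is hypothetical: the match dictionaries and their `start_line`/`end_line` keys are made up for the sketch and are not presented as scancode's output schema, and `categories_by_line` and `annotate` are illustrative helpers.

```python
import io
import tokenize


def categories_by_line(source):
    """Map each 1-based line number to the set of categories found on it."""
    lines = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            category = "comment"
        elif tok.type in (tokenize.STRING, tokenize.NUMBER):
            category = "literal"
        elif tok.type in (tokenize.NAME, tokenize.OP):
            category = "code"
        else:
            continue  # skip layout tokens such as NEWLINE and INDENT
        # A token may span several lines, e.g. a triple-quoted string.
        for lineno in range(tok.start[0], tok.end[0] + 1):
            lines.setdefault(lineno, set()).add(category)
    return lines


def annotate(match, line_categories):
    """Return a copy of a match dict with the categories its lines touch."""
    found = set()
    for lineno in range(match["start_line"], match["end_line"] + 1):
        found |= line_categories.get(lineno, set())
    return {**match, "categories": sorted(found)}


source = '''# Licensed under the MIT license
gpl = 20
print(gpl * 80)
'''

cats = categories_by_line(source)

# Hypothetical detections, written by hand for this sketch.
in_comment = annotate({"license": "mit", "start_line": 1, "end_line": 1}, cats)
in_code = annotate({"license": "gpl", "start_line": 2, "end_line": 3}, cats)

print(in_comment["categories"])  # → ['comment']
print(in_code["categories"])     # → ['code', 'literal']
```

With the category attached, a user could choose to ignore or down-rank matches that never touch a comment, without the scanner having to drop them.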

The relative likelihood of these outcomes is unknown, though.
But I can really see that this could be worth working on.

@chinyeungli
Contributor

chinyeungli commented Apr 20, 2020

This is something to think of when fixing the issue: #2013

@armijnhemel
Contributor


4 participants