Treat 'comments' and 'actual code' differently while scanning a file. #1995

MankaranSingh · 2020-04-02T08:01:30Z

Short Description

Examine the following code sample:

def main():
    for i in range(20):
        print("cc", "by", "nd")


gpl = 20
print(gpl*80)

Although these lines don't really signify those licences, scancode outputs the following:

cc-by-4.0
GPL 2.0
GPL 1.0 or later

It would be better if scancode separated comments/docstrings from actual code in a file before scanning, since licence notices and similar statements are almost always found in comments/docstrings.

For this, we need to detect the programming language accurately (pygments doesn't) and scan accordingly, since we would then know which character(s) that language uses to mark comments and docstrings.

Also, this would result in faster scans.
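
A minimal sketch of the idea, assuming the file is already known to be Python (so it sidesteps language detection entirely). It uses only the standard library `tokenize` and `ast` modules and is not how scancode works today; the function name `extract_comments_and_docstrings` is made up for illustration.

```python
import ast
import io
import tokenize


def extract_comments_and_docstrings(source):
    """Return the comment and docstring text of a Python source string."""
    pieces = []

    # Comments are only visible at the token level.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            pieces.append(tok.string.lstrip("#").strip())

    # Docstrings are attached to module, class and function nodes.
    doc_nodes = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, doc_nodes):
            doc = ast.get_docstring(node)
            if doc:
                pieces.append(doc)
    return pieces


# The sample from this issue: no comments or docstrings, so nothing to scan.
sample = '''\
def main():
    for i in range(20):
        print("cc", "by", "nd")


gpl = 20
print(gpl*80)
'''

# A hypothetical file with a real notice in a comment.
licensed = '''\
# SPDX-License-Identifier: MIT
def main():
    """Entry point."""
    return 0
'''

print(extract_comments_and_docstrings(sample))    # []
print(extract_comments_and_docstrings(licensed))  # ['SPDX-License-Identifier: MIT', 'Entry point.']
```

Only the returned text would then be handed to license detection. Other languages would need their own tokenizer or comment rules.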

Similar issues:
#1933

Possible Labels

new feature

Select Category

Enhancement [x]
Add License/Copyright []
Scan Feature [x]
Packaging []
Documentation []
Expand Support []
Other []

pombredanne · 2020-04-02T08:55:15Z

That's a great idea!

Although these lines don't really signify those licences, scancode outputs the following:

You have to be extremely careful when you use synthetic, made-up examples: in most cases, these do not exist in the real world. And by using them, you may end up not solving any problem and/or overfitting your solution to a non-problem.

That said, I am all for it: you can eventually split things into three categories:
code proper, literals and variables in code, and comments, though license statements may span code and literals.
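
As a rough illustration of that three-way split, here is a sketch for Python sources using the standard library `tokenize` module. The bucket names and the function `split_categories` are assumptions made for this example, and treating every `NAME` token as "code" is a simplification. On Python 3.12+ f-strings are tokenized differently and would need extra handling.

```python
import io
import tokenize


def split_categories(source):
    """Bucket tokens of a Python source string into code, literals, comments."""
    buckets = {"code": [], "literals": [], "comments": []}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            buckets["comments"].append(tok.string)
        elif tok.type == tokenize.STRING:
            buckets["literals"].append(tok.string)
        elif tok.type == tokenize.NAME:
            # Identifiers and keywords stand in for "code proper" here.
            buckets["code"].append(tok.string)
    return buckets


buckets = split_categories('print("cc", "by", "nd")  # not a license\n')
print(buckets["code"])      # ['print']
print(buckets["literals"])  # ['"cc"', '"by"', '"nd"']
print(buckets["comments"])  # ['# not a license']
```

Each bucket could then be scanned separately, or with different confidence thresholds.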

The difficulty to gauge is whether this will help with either speed or accuracy for actual license detection on real world data.

MankaranSingh · 2020-04-02T09:27:45Z

You have to be extremely careful when you use synthetic, made-up examples: in most cases, these do not exist in the real world.

That's a really great insight! I'll be more careful next time.

The difficulty to gauge is whether this will help with either speed or accuracy for actual license detection on real world data.

I guess if we gather enough evidence that almost all licence statements are found in comments, we could avoid scanning the actual code (not sure if this would be good; maybe add it as a 'shallow scan' option?). This would speed things up, as most of a file is made up of code rather than comments.

pombredanne · 2020-04-02T09:53:04Z

@MankaranSingh the idea of a lighter, lower-assurance scan driven by a command-line option can make sense. Sometimes it is useful to get a quicker feel for what's in there.

MankaranSingh · 2020-04-02T12:15:49Z

@pombredanne Yes, some codebases would specifically benefit from this. The possible outcomes are:

Fewer false positives with improved speed
No effect on accuracy with improved speed
Less accuracy but improved speed
A full scan with extra category info (comment, variable, literal, actual code) so users can decide for themselves

How often each outcome would occur is unknown, though.
But I can really see this being worth working on.
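
The last outcome (a full scan plus category info) could look roughly like the sketch below. The match dictionaries with `license` and `line` keys are a hypothetical shape invented for this example, not scancode's actual output format, and the approach only covers Python sources via the standard library `tokenize` module.

```python
import io
import tokenize
from collections import defaultdict

# Simplified mapping from token type to category (an assumption of this sketch).
CATEGORY_BY_TOKEN = {
    tokenize.COMMENT: "comment",
    tokenize.STRING: "literal",
    tokenize.NAME: "code",
}


def categories_by_line(source):
    """Map each 1-based line number to the set of categories found on it."""
    lines = defaultdict(set)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        category = CATEGORY_BY_TOKEN.get(tok.type)
        if category is None:
            continue
        # Multi-line strings cover every line from start to end.
        for line in range(tok.start[0], tok.end[0] + 1):
            lines[line].add(category)
    return dict(lines)


def annotate(matches, source):
    """Attach the categories of the matched line to each match."""
    by_line = categories_by_line(source)
    return [
        dict(match, categories=sorted(by_line.get(match["line"], ())))
        for match in matches
    ]


source = 'gpl = 20  # GPL-2.0-only\nprint("cc", "by", "nd")\n'
matches = [
    {"license": "gpl-2.0", "line": 1},
    {"license": "cc-by-nd", "line": 2},
]
for match in annotate(matches, source):
    print(match)
```

A user could then filter out matches whose categories do not include `comment`, or simply review them with lower priority.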

chinyeungli · 2020-04-20T23:59:09Z

This is something to think of when fixing the issue: #2013

armijnhemel · 2020-08-27T09:54:05Z

Extreme example: https://github.com/jslicense/spdx-correct.js/blob/master/index.js

MankaranSingh added the new feature label Apr 2, 2020

MankaranSingh changed the title ~~Treat 'comments' and 'actual code' differently while scanning.~~ Treat 'comments' and 'actual code' differently while scanning a file. Apr 2, 2020

pombredanne mentioned this issue Mar 5, 2022

RFC: a plan for false positive license detection #2878

Open
