Support for Code Size in MB? #183

Closed
DarwinJS opened this issue Jul 30, 2020 · 21 comments
Labels
enhancement New feature or request

Comments

@DarwinJS

Some commercial security scanning tools now charge by file volume in MB (and some charge by lines).

I would like to use this tool in CI to do an assessment of possible costs to scan my code in given security tools.

I am wondering if that measurement could be taken and reported as well since this engine is already iterating through the code for purposes of counting?

I was also thinking of building a CI plugin around this similar to these: https://gitlab.com/guided-explorations/ci-cd-plugin-extensions.

Author

DarwinJS commented Aug 1, 2020

I'm not a Go expert, but from looking through the code I think a first iteration on just the counting part might be:

  1. For each case statement in switch currentState record the current Location in a global (maybe something like: LastStateChangeLocation = fileJob.Location)
  2. In the same case statement, for case SCode, SString, SCommentCode, SMulticommentCode: subtract the last recorded position from the current position and add it to a counter. (maybe something like: fileJob.CodeBytesCounter = fileJob.CodeBytesCounter + (fileJob.Location - LastStateChangeLocation))

I don't understand if summing location will be roughly equivalent to bytes or if a conversion is needed.
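
The two steps above could be sketched roughly as follows. This is a toy scanner, not scc's actual state machine: it only knows `//` line comments, and the names `state`, `sCode`, `sComment` and `countCodeBytes` are hypothetical. Because it indexes a `[]byte`, the difference between two offsets is already a byte count, so in this sketch no conversion is needed.

```go
package main

import "fmt"

// Hypothetical states for a toy scanner; a real counter tracks many more.
type state int

const (
	sCode state = iota
	sComment
)

// countCodeBytes returns how many bytes of src fall outside `//` line comments.
// It remembers the offset of the last state change and, whenever it leaves the
// code state, adds (current offset - last change offset) to the counter.
func countCodeBytes(src []byte) int {
	current := sCode
	lastChange := 0
	codeBytes := 0

	for i := 0; i < len(src); i++ {
		switch current {
		case sCode:
			if src[i] == '/' && i+1 < len(src) && src[i+1] == '/' {
				codeBytes += i - lastChange
				lastChange = i
				current = sComment
			}
		case sComment:
			if src[i] == '\n' {
				// The newline is attributed to the comment; code resumes after it.
				lastChange = i + 1
				current = sCode
			}
		}
	}
	// Flush the trailing code span, if the input ended while in the code state.
	if current == sCode {
		codeBytes += len(src) - lastChange
	}
	return codeBytes
}

func main() {
	src := []byte("x := 1 // set x\ny := 2\n")
	fmt.Println(countCodeBytes(src)) // prints 14
}
```

Note that a multi-byte character such as `é` contributes two bytes here, which is what a vendor charging by MB would presumably see as well.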

Author

DarwinJS commented Aug 1, 2020

Forgot to mention that the GitLab CI/CD plugin example was roughed out last week: https://gitlab.com/guided-explorations/ci-cd-plugin-extensions/ci-cd-plugin-extension-scc

Owner

boyter commented Aug 2, 2020

Lot to go through there!

The number of bytes is already there, so assuming they charge even for lines that are comments, that should allow you to get the result you want from this.

If however they only charge per code line, there would need to be some changes. I would need to know this before knowing which one to do. In either case there might need to be a modified output to give you this value. Certainly something that seems useful and worth adding in.

BTW thanks for writing that plugin, and I have added a link on the main page https://github.com/boyter/scc/#interesting-use-cases for others to get the benefit.

boyter added the enhancement label Aug 2, 2020
Author

DarwinJS commented Aug 3, 2020

@boyter - I just looked for the most well known scanner that counts MBs. It was actually not easy to find the information, but a FAQ answer indicated that they only charge for the code part of files.

My guess is that most others who charge just for lines of code will be careful not to count non-code lines since there will be concern about cost.

I had thought of this and maybe the following might be a path forward:

  1. Provide the data size of just the lines of code.
  2. If there is a user request citing a tool that does just file size, have a parameter to add file sizes in case some scanners count that way?

For the one vendor, part of the reason for charging for MBs is that they also do static analysis of built binaries.

Owner

boyter commented Aug 3, 2020

Are you looking to consume this information through the JSON output or some other one? It's certainly easier to add this information to those than to the default stdout. Although, there are the COCOMO stats, so it might be worth adding another section below them, which would also be a good place to include things like #177.

Author

DarwinJS commented Aug 4, 2020

It would be great to have it in all. HTML displays the smoothest when it is uploaded to GitLab artifacts storage. JSON would be great for further programmatic analysis.

FYI I just submitted a requested article to Acloud.guru's guest blog that includes scc. They have over 2 million subscribers ;)

Owner

boyter commented Aug 5, 2020

Yeah, fair enough. Adding to the HTML and JSON isn't so hard, but the stdout is a bit more problematic. I'll have a look at implementing it though.

Oh, that would be neat :) Let me know if it makes it out and I'll be sure to share as widely as I can.

Owner

boyter commented Aug 5, 2020

So you should now be able to get the bytes per file and rollups for all the usual suspects: JSON, HTML, CSV.

For the moment this is only on a branch https://github.com/boyter/scc/compare/Issue183 and does not include byte counts just for code (it's everything).

I'll be adding the code ones soon on the same branch, which will just be the byte count of anything not considered a comment, and then merge in when it's looking good.

Owner

boyter commented Aug 6, 2020

@DarwinJS does the output for what is there work for your use case?

It's not the count of the code itself, just the file, but should give you a reasonable idea...

I'm inclined to add the latter part after getting this release out due to how much work it turns out to be.

Owner

boyter commented Aug 6, 2020

This has been merged as is into master. So if you build from there you should get the byte count out of JSON, CSV and HTML.
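
For anyone wanting to consume the byte counts programmatically, here is a minimal sketch of summing them from the JSON output. It assumes the JSON is an array of per-language objects with `Name` and `Bytes` fields; check the actual output of your build before relying on those names. The sample data is hand-written for illustration, not real scc output, and `languageSummary` and `totalBytes` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// languageSummary holds only the fields this sketch needs.
// The field names are an assumption about the JSON layout.
type languageSummary struct {
	Name  string `json:"Name"`
	Bytes int64  `json:"Bytes"`
}

// totalBytes decodes a JSON array of language summaries and sums their bytes.
func totalBytes(data []byte) (int64, error) {
	var summaries []languageSummary
	if err := json.Unmarshal(data, &summaries); err != nil {
		return 0, fmt.Errorf("decode json: %w", err)
	}
	var total int64
	for _, s := range summaries {
		total += s.Bytes
	}
	return total, nil
}

func main() {
	// Hand-written sample input, for illustration only.
	sample := []byte(`[{"Name":"Go","Bytes":156232},{"Name":"HAML","Bytes":1000}]`)
	total, err := totalBytes(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(total) // prints 157232
}
```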

Author

DarwinJS commented Aug 6, 2020

Thank you!

Do files that have two code types get counted in both?

I am having a challenge judging the accuracy - I've been using the www-gitlab-com repo - but I see it has .haml files and some other formats that are text, but maybe not counted by scc.

So when I try to compare scc totals on lines to the output of this command I am getting vastly different total lines for my whole code base versus scc:

find . -type f -exec wc -l {} \; | awk '{ SUM += $0} END { print SUM }'

Do you use a sample repository where the above line count should be very close to total lines by scc and where du -sh would show the same as total bytes because all the code files in the repo are supported languages?

This is a great start. I need to find out if tools that charge by MB are for sure adding up lines versus files.

Owner

boyter commented Aug 6, 2020

I don't think scc has .haml files :) If you can point out a spec I'll add it though.

This seems pretty good actually: https://haml.info/docs/yardoc/file.REFERENCE.html

Generally any repository will not match because of the .git folder though. The values scc spits out are based on the bytes it actually read, i.e. it opened the file and read that many bytes, so it should be 100% accurate for what it processed.

Owner

boyter commented Aug 6, 2020

HAML support added in to master. Try running again.

Author

DarwinJS commented Aug 6, 2020

This is great thanks!

It would be great to get MBs in stdout. It is quite common for all of us to suffer from "what you see is all there is" (WYSIATI) - so that default display can play a big role in whether folks ever understand that you have the useful feature of counting MBs.

What do you think about dropping one or more of the less useful columns in favor of MBs and requiring a command-line switch to swap back? So display MBs instead of raw "Lines" (just picked it because some other common CLOC utils don't have that) and then a switch to get raw lines back?

I have a workaround in the form of using w3m (13MB) for the console output with this (assumes an Alpine golang container):

apk update; apk add w3m
time scc . --not-match '.*md' -f html -o loc.html # Many CI systems can show this kind of artifact in the browser.
time scc . --not-match '.*md' -f json -o loc.json # For further data analysis.
w3m -dump loc.html
echo "Total MBs of files in checkout:"
du -sh

Author

DarwinJS commented Aug 6, 2020 via email

Owner

boyter commented Aug 7, 2020

So adding the MBs to the normal output is totally possible, assuming you don't want it per type, so something like

───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Go                          34      7793     1277       329     6187       1291
───────────────────────────────────────────────────────────────────────────────
Total                       34      7793     1277       329     6187       1291
───────────────────────────────────────────────────────────────────────────────
Processed 156232 bytes == 0.156232 Megabytes == 0.000156232 Gigabytes
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $183,083
Estimated Schedule Effort 8.048178 months
Estimated People Required 2.694676
───────────────────────────────────────────────────────────────────────────────

Might do the trick, although maybe just the byte to megabyte conversion is all that's needed. Glad to hear HAML solved a bunch of issues there. Of course there is also the question of which version of megabyte to use there, although I suspect that could be solved with another flag.

Owner

boyter commented Aug 7, 2020

$ scc -i go
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Go                          34      7800     1278       329     6193       1291
───────────────────────────────────────────────────────────────────────────────
Total                       34      7800     1278       329     6193       1291
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $183,270
Estimated Schedule Effort 8.051291 months
Estimated People Required 2.696377
───────────────────────────────────────────────────────────────────────────────
Processed 313926 bytes, 0.314 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────

A quick implementation preview.

Owner

boyter commented Aug 7, 2020

Try having a look at what's in master, where it has this as an output for you.

I think I want to make SI optional so you can choose either SI (1000) or 1024 as the divisor for KB, and add in these: https://xkcd.com/394/

Owner

boyter commented Aug 7, 2020

OK only took a little bit of effort so now you can change the type from SI to binary to mixed, and of course all the XKCD ones :)

Author

DarwinJS commented Aug 7, 2020

Nice - this shows that it can do MBs on default output!

What do you think about a note under the MBs count that says "For bytes count per language, use a data output format."

Owner

boyter commented Aug 9, 2020

I think that might be better off covered in the documentation personally. I don't think that cluttering the stdout for this is acceptable, as I would hope anyone looking into integrations is checking the options anyway.

boyter closed this as completed Sep 7, 2020
2 participants