scc ignores files with no extension, such as libc++ header files #162

Closed
tbodt opened this issue Mar 13, 2020 · 14 comments
Labels
enhancement New feature or request

Comments

@tbodt

tbodt commented Mar 13, 2020

Describe the bug
Running scc on libcxx/include ignores most of the files in it, because they have no extension. They are C++ header files, though. I can't find any way to get it to count these files.

To Reproduce

  1. Clone libcxx
  2. scc --by-file libcxx/include

Expected behavior
I would expect to see files such as vector appear in the result, but they don't.

Desktop (please complete the following information):

  • OS: Linux
  • Version: Debian testing
@boyter
Owner

boyter commented Mar 15, 2020

So not a bug per se (working as intended), but certainly not what I would like to see reported.

The issue is that scc uses file extensions to determine the file type. The exception is #! files, where the content is inspected for a shebang that maps to a known type. You can see this with the verbose flag set, where it's trying to identify each file using this logic.

$ scc -v
 WARN 2020-03-15T21:07:17Z: possible #! file: algorithm
 WARN 2020-03-15T21:07:17Z: possible #! file: any
 WARN 2020-03-15T21:07:17Z: possible #! file: array
 WARN 2020-03-15T21:07:17Z: possible #! file: atomic
 WARN 2020-03-15T21:07:17Z: possible #! file: bit
 WARN 2020-03-15T21:07:17Z: possible #! file: bitset
 WARN 2020-03-15T21:07:17Z: possible #! file: cassert
 WARN 2020-03-15T21:07:17Z: possible #! file: ccomplex
 WARN 2020-03-15T21:07:17Z: possible #! file: cctype
 WARN 2020-03-15T21:07:17Z: possible #! file: cerrno
 WARN 2020-03-15T21:07:17Z: possible #! file: cfenv
 WARN 2020-03-15T21:07:17Z: possible #! file: cfloat
 WARN 2020-03-15T21:07:17Z: possible #! file: charconv
 WARN 2020-03-15T21:07:17Z: possible #! file: chrono
 WARN 2020-03-15T21:07:17Z: unable to determine #! language for algorithm
 WARN 2020-03-15T21:07:17Z: unable to determine #! language for array
 WARN 2020-03-15T21:07:17Z: unable to determine #! language for cfloat
 WARN 2020-03-15T21:07:17Z: unable to determine #! language for any
 WARN 2020-03-15T21:07:17Z: unable to determine #! language for cerrno
 WARN 2020-03-15T21:07:17Z: unable to determine #! language for cassert
 WARN 2020-03-15T21:07:17Z: unable to determine #! language for cctype

It might be possible to extend the #! rules logic to include this, but I will need some assistance.

I am not a C++ developer (very little experience with it), so is there some reliable way to determine that these files are headers using, say, the first 255 bytes of the file? It looks like -*- C++ -*- might do it, but I have no idea if this is just an LLVM convention.

If there is I can modify scc to include this check.
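
A minimal sketch of the check being asked about, in Python for illustration only (scc does not necessarily implement it this way). The names `CPP_MODE_LINE`, `has_cpp_mode_line` and `file_has_cpp_mode_line` are hypothetical, and the 255-byte window is simply the figure mentioned above.

```python
# Hypothetical sketch, not scc's code: look for the Emacs-style mode line
# "-*- C++ -*-" within the first `limit` bytes of a file's content.
CPP_MODE_LINE = b"-*- C++ -*-"


def has_cpp_mode_line(head: bytes, limit: int = 255) -> bool:
    """Return True if the marker lies entirely within the first `limit` bytes."""
    return CPP_MODE_LINE in head[:limit]


def file_has_cpp_mode_line(path: str, limit: int = 255) -> bool:
    """Read at most `limit` bytes from `path` and check them for the marker."""
    with open(path, "rb") as f:
        return has_cpp_mode_line(f.read(limit), limit)
```

A marker that starts after the window (for example, below a long license block) is missed, which is the weakness of any fixed-size prefix check.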

boyter added the enhancement (New feature or request) and question (Further information is requested) labels on Mar 15, 2020
@dbaggerman
Collaborator

I don't think there is anything that is guaranteed to be in the first 255 bytes. The -*- C++ -*- convention appears to be present in the GNU C++ headers I've got, so it's not just LLVM - but it might just be a case where LLVM followed what GNU was doing for compatibility.

Another thing to look for could be the #ifndef SOME_IDENTIFIER convention. This tends to be near the start of the file, but could come after block comments of arbitrary length. This convention applies to C headers as well as C++ though, so it wouldn't be possible to distinguish between them.
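
The include-guard idea could be sketched roughly as follows. This is an illustrative Python heuristic under stated assumptions, not anything scc does: `has_include_guard`, `GUARD_RE` and `COMMENT_RE` are hypothetical names, and the comment stripping is naive (it would also eat `//` inside string literals).

```python
import re

# Hypothetical heuristic: "#ifndef NAME" immediately followed by
# "#define NAME" using the same identifier.
GUARD_RE = re.compile(
    r"^\s*#\s*ifndef\s+(\w+)\s*\n\s*#\s*define\s+\1\b", re.MULTILINE
)
# Naive removal of /* ... */ block comments and // line comments.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def has_include_guard(source: str) -> bool:
    """Return True if the text, with comments removed, contains an include guard."""
    stripped = COMMENT_RE.sub("", source)
    return GUARD_RE.search(stripped) is not None
```

Because comments are removed first, a leading comment block of arbitrary length does not hide the guard. As noted above, a match only says "C or C++ header", not which of the two.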

@boyter
Owner

boyter commented Mar 15, 2020

Had a feeling that might be the case.

A best-effort, imperfect check might be a reasonable solution, or rules that can be modified to suit this case: something similar to how the remap rules work, but matching on the file's content, similar to how the generated check works.

@tbodt
Author

tbodt commented Mar 17, 2020

I think it would be enough for me to have some way to override the language for files like this with no extension.

@boyter
Owner

boyter commented Mar 18, 2020

Interestingly, I just tried to do that and may have run into a bug:

$ scc -v --by-file "utility:java" utility
file or directory does not exist: utility:java

I'll have a look. In theory you should be able to map every extension like this, but since it isn't working at all, that is an issue.

boyter added the bug (Something isn't working) label on Mar 18, 2020
@tbodt
Author

tbodt commented Mar 18, 2020

That looks like you have to specify a new --by-file override for every file. Can I specify just one for all files without an extension?

@boyter
Owner

boyter commented Mar 18, 2020

Not currently. It was not designed for this particular use case. I think being able to override the #! rules would work though, as you could then define -*- C++ -*- as the match rule.

Still thinking about it though.

@boyter
Owner

boyter commented Mar 18, 2020

Interesting to note that none of the other tools work in this case. Neither Tokei, cloc, polyglot, nor any other counter handles this situation, and I don't see any obvious way to achieve it with them either. So this is treading new ground.

@boyter
Owner

boyter commented Jun 3, 2020

So, thinking about this: I think that giving you the ability to, say, scan the first 1000 bytes of the file for a specific string and then override the language that way might be a reasonable workaround. It's either that or expand out the #! logic, but that feels more fragile. At least with the string option you can, in theory, build your own language logic into things.
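
A rough Python sketch of that string-scan override, for illustration only. `override_language` and the shape of `rules` (search string mapped to language name) are assumptions made for this example, not scc's internals; the 1000-byte window is the figure suggested above.

```python
from typing import Dict, Optional


def override_language(
    head: bytes, rules: Dict[str, str], limit: int = 1000
) -> Optional[str]:
    """Return the language of the first rule whose search string occurs in
    the first `limit` bytes of `head`, or None if no rule matches."""
    window = head[:limit]
    for needle, language in rules.items():
        if needle.encode("utf-8") in window:
            return language
    return None
```

For example, `override_language(b"// -*- C++ -*-\n", {"-*- C++ -*-": "C Header"})` yields `"C Header"`, while content with no matching string yields `None` and would fall through to whatever detection ran before.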

@boyter
Owner

boyter commented Aug 3, 2020

So I have been thinking about this enough. Time for action.

Going to add a new flag,

--remap or some such (looking for name suggestions here).

You will call it like so,

scc --remap "-*- C++ -*-":"C Header"

The only question I have is where this should happen. There are a lot of things that affect the outcome here. Just a few,

  • #! checks already apply here because the files have no extension
  • files with unknown extensions are ignored by default with the "unknown extension" warning
  • can you remap files that are already identified through the extension?

My thinking is that this is a hard remap. My expectation would be, if I were to have a file with -*- C++ -*- defined in it, even if it was an unknown extension, it would be read and remapped. Also any files with known extensions get remapped. In other words if that string is found it gets remapped ignoring all other logic.

This seems to be the most sensible option to me because it does what I would expect best.

EDIT - Although this might be an issue in this case, because even a header file with a .h extension has the same string in it... https://github.com/llvm-mirror/libcxx/blob/master/include/complex.h

So it either needs to be against all files it was unable to detect, or everything... and I'm not sure if the former is a good idea, at least not unless the second option exists as well.

@boyter
Owner

boyter commented Aug 3, 2020

Thinking a bit more, I think there need to be two options.

A hard remap, which applies to all files containing the string, and a soft remap, which only looks in files that cannot be identified through extension or that would be looked at using the #! logic (i.e. when there is no extension at all).

@boyter
Owner

boyter commented Aug 6, 2020

Yep so I think the solution is two flags

--missing-remap
--hard-remap

Still not sure on the names yet... remap feels wrong. Perhaps inspect?

The first will apply ONLY to files that were not identified using file extension and means that scc will attempt to process every file (binary logic checks will still kick in) unless it's in the explicit denylist.

The latter will do everything the previous does but in addition will inspect every file it opens looking to do a remap.
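
The difference between the two flags can be sketched in Python as below. This is a hypothetical model of the behaviour described above, not scc's implementation: `EXTENSIONS`, `classify`, `soft_rules` and `hard_rules` are illustrative names, and the extension table is a tiny stand-in.

```python
import os
from typing import Dict, Optional

# Tiny stand-in for an extension-to-language table (assumption).
EXTENSIONS = {".h": "C Header", ".cpp": "C++", ".py": "Python"}


def _first_match(content: bytes, rules: Dict[str, str]) -> Optional[str]:
    """Return the language of the first rule whose string occurs in content."""
    for needle, language in rules.items():
        if needle.encode("utf-8") in content:
            return language
    return None


def classify(
    filename: str,
    content: bytes,
    soft_rules: Optional[Dict[str, str]] = None,
    hard_rules: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Soft rules apply only when the extension is unknown or missing;
    hard rules apply to every file and win over the extension."""
    _, ext = os.path.splitext(filename)
    by_extension = EXTENSIONS.get(ext.lower())

    hard = _first_match(content, hard_rules or {})
    if hard is not None:
        return hard
    if by_extension is None:
        return _first_match(content, soft_rules or {})
    return by_extension
```

With a soft rule, `vector` (no extension) is remapped while `complex.h` keeps the language implied by `.h`; with a hard rule, both are remapped because the string match overrides everything else.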

boyter removed the bug (Something isn't working) and question (Further information is requested) labels on Sep 1, 2020
@boyter
Owner

boyter commented Sep 7, 2020

@tbodt Would it be possible for you to have a look at master and try this out? You should be able to count this correctly now,

# bb100123 @ VCOANSYD256197 in ~/Documents/kablamo/libcxx on git:master x [10:12:25]
$ scc
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C++                       6426    594693    67947     82426   444320      75139
C Header                   149     23209     3560      3577    16072       1341
Python                      35      7995      496       767     6732        594
CMake                       27      3886      411       460     3015        375
ReStructuredText            16      3060      778         0     2282          0
C++ Header                  13      1016      180       121      715         16
HTML                        11      3015      318        87     2610          0
Plain Text                  11      1265      199         0     1066          0
Shell                        9       736      127        67      542         67
Expect                       8       995       23        16      956          0
Autoconf                     5       103       12        27       64          5
JSON                         5       415        0         0      415          0
Markdown                     5      1418      255         0     1163          0
Objective C++                5       116       21        44       51          2
YAML                         4       366       32        22      312          0
CSS                          2        66       10         9       47          0
gitignore                    2       119       22        26       71          0
Batch                        1        53        7        12       34         11
License                      1       202       33         0      169          0
───────────────────────────────────────────────────────────────────────────────
Total                     6735    642728    74431     87661   480636      77550
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $17,680,821
Estimated Schedule Effort 40.969907 months
Estimated People Required 38.340111
───────────────────────────────────────────────────────────────────────────────
Processed 22366235 bytes, 22.366 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────


# bb100123 @ VCOANSYD256197 in ~/Documents/kablamo/libcxx on git:master x [10:24:12]
$ scc --remap-unknown "-*- C++ -*-":"C Header"
───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C++                       6426    594693    67947     82426   444320      75139
C Header                   283    155910    17186     24769   113955       9424
Python                      35      7995      496       767     6732        594
CMake                       27      3886      411       460     3015        375
ReStructuredText            16      3060      778         0     2282          0
C++ Header                  13      1016      180       121      715         16
HTML                        11      3015      318        87     2610          0
Plain Text                  11      1265      199         0     1066          0
Shell                        9       736      127        67      542         67
Expect                       8       995       23        16      956          0
Autoconf                     5       103       12        27       64          5
JSON                         5       415        0         0      415          0
Markdown                     5      1418      255         0     1163          0
Objective C++                5       116       21        44       51          2
YAML                         4       366       32        22      312          0
CSS                          2        66       10         9       47          0
gitignore                    2       119       22        26       71          0
Batch                        1        53        7        12       34         11
License                      1       202       33         0      169          0
───────────────────────────────────────────────────────────────────────────────
Total                     6869    775429    88057    108853   578519      85633
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop $21,479,730
Estimated Schedule Effort 44.114871 months
Estimated People Required 43.257333
───────────────────────────────────────────────────────────────────────────────
Processed 26826042 bytes, 26.826 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────

The option you want to use for your case is --remap-unknown "-*- C++ -*-":"C Header". However, there is also --remap-all, which will override anything, just in case.
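
Because the shell strips the quotes, a program given `"-*- C++ -*-":"C Header"` receives the single argument `-*- C++ -*-:C Header`. How scc itself splits that argument is not shown in this thread; the Python sketch below assumes a split at the last colon, and `parse_remap_rule` is a hypothetical helper.

```python
from typing import Tuple


def parse_remap_rule(arg: str) -> Tuple[str, str]:
    """Split 'search string:Language' at the last colon (an assumption made
    for this sketch) and reject arguments missing either part."""
    needle, sep, language = arg.rpartition(":")
    if not sep or not needle or not language:
        raise ValueError(f"expected 'string:language', got {arg!r}")
    return needle, language
```

Splitting at the last colon lets the search string itself contain colons, at the cost of not supporting language names that contain one.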

@boyter
Owner

boyter commented Sep 7, 2020

Closing. Pretty sure this works and I want to get the release out ASAP :)

boyter closed this as completed on Sep 7, 2020