Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add simple static checker based on clang-query #833

Closed

Conversation

real-or-random
Copy link
Contributor

@real-or-random real-or-random commented Oct 15, 2020

This add a simple static checker based on clang-query, which is a
tool that could be described as a "clever grep" for abstract syntax
trees (ASTs).

As an initial proof of usefulness, this commit adds these checks:

The checks are easily extensible.

The main purpose is to run the checker on CI, and this commit adds
the checker to the Travis CI script.

This currently requires clang-query version at least 10. (However,
it's not required not compile with clang version 10, or with clang
at all. Just the compiler flags must be compatible with clang.)
Clang-query simply uses the clang compiler as a backend for
generating the AST.

In order to determine the compile in which the code is supposed to be
compiled (e.g., compiler flags such as -D defines and include paths),
it reads a compilation database in JSON format. There are multiple ways
to generate this database. The easiest way to obtain such a database is
to use a tool that intercepts the make process and build the database.

On Travis CI, we currently use "bear" for this purpose. It's a natural
choice because there is an Ubuntu package for it. If you want to run
this locally, bear is a good choice but other tools such as compiledb
(Python) are available.

@real-or-random real-or-random force-pushed the 202010-clang-query branch 3 times, most recently from b603548 to 3b05077 Compare October 15, 2020 19:37
@real-or-random
Copy link
Contributor Author

real-or-random commented Oct 15, 2020

Here's another interesting one:
match binaryOperator(allOf(unless(isExpansionInSystemHeader()), hasOperatorName("/")))

Unfortunately we use quite a few divisions, so this does not lead anywhere.

We could also check easily for reserved identifiers.

More suggestions?

@real-or-random
Copy link
Contributor Author

grml, apparently clang-query uses a compile_commands.json file to infer compile options/flags, which I had lying around in my local tree from some old experiments in 2018... So I need to rework this.

@gmaxwell
Copy link
Contributor

gmaxwell commented Oct 15, 2020

Unfortunately we use quite a few divisions, so this does not lead anywhere.

Is there any way to tell if the divisor is a compile time constant? We generally don't want divisions ending up in the machine code (though if its in some setup routine it's not that big a deal) as they're non-constant time[1] and slow (e.g. 50-100 cycles on x86, worse on a lot of arm). ... but divisions with constants get converted by the compiler to multiples via strength reduction. I believe the codebase is free or nearly free of divisions with non-constant divisors (maybe there is one in scratch space sizing or something like that).

I've failed to yet fully impress my practice of always using masking over division onto Pieter yet... but there are some cases where masking doesn't easily work.

[1] also, stock valgrind ctime check will not catch divisions involving secret data. I have a pretty trivial patch to valgrind on my laptop so that it does... and it passed for me as of the last time I ran it. I'm not really eager to figure out how to make travis use a custom compiled copy of valgrind...

@real-or-random
Copy link
Contributor Author

Unfortunately we use quite a few divisions, so this does not lead anywhere.

Is there any way to tell if the divisor is a compile time constant? We generally don't want divisions ending up in the machine code (though if its in some setup routine it's not that big a deal) as they're non-constant time[1] and slow (e.g. 50-100 cycles on x86, worse on a lot of arm). ... but divisions with constants get converted by the compiler to multiples via strength reduction. I believe the codebase is free or nearly free of divisions with non-constant divisors (maybe there is one in scratch space sizing or something like that).

Right, for example.

Match #103:

./src/ecmult_impl.h:1030:27: note: "root" binds here
    *n_batch_points = 1 + (n - 1) / *n_batches;
                          ^~~~~~~~~~~~~~~~~~~~

This is the reference for the matcher:
https://clang.llvm.org/docs/LibASTMatchersReference.html

You can match on "constantExpr" but this is not what we want. This matches constant expression as seen from the point of view of the AST, i.e., this matches places where only constant expressions are allowed (such as after case).

This technique in general does not enable us to get insights on what the compiler will do with the code. It's really just pattern matching on the AST, so it's merely a clever grep. (One can do more advanced things with clang tooling but this requires more effort than writing matching expressions.)

What we could do is to check if the divisor is an integer literal, but even this is not precise we have also a few (x-1) divisors for example.

@real-or-random real-or-random force-pushed the 202010-clang-query branch 3 times, most recently from 931cc9b to 3ecd681 Compare October 16, 2020 14:37
@real-or-random
Copy link
Contributor Author

This uses https://github.com/rizsotto/Bear, which intercepts make, to generate a JSON compilation database (see https://clang.llvm.org/docs/JSONCompilationDatabase.html). Bear is nice because we can easily install it on Travis. There are other tools that may work better when used locally, e.g., https://github.com/nickdiego/compiledb .

@real-or-random real-or-random force-pushed the 202010-clang-query branch 10 times, most recently from 0858d47 to eed297b Compare October 16, 2020 21:49
@real-or-random
Copy link
Contributor Author

real-or-random commented Oct 16, 2020

Ready for review.

Travis fails but this is intentionally. See it in action here:

https://travis-ci.org/github/bitcoin-core/secp256k1/jobs/736511002#L1254 (complains about memczero declaration, see #829)
https://travis-ci.org/github/bitcoin-core/secp256k1/jobs/736511002#L1313 (complains about reserved _t type declared in benchmarks)

Edit: argh, the links to the exact lines don't work due to the output folding.
I might add a commit that restructures the Travis output in general by adding folds to the main commands:
travis-ci/docs-travis-ci-com#949 (comment)
Then all the main commands be folded unless they fail, so see only the errors. Moreover we don't need the cats at the end (which have confused people in the past) and we don't need to ugly hack that I introduced here to make the color output work with cat.)

@real-or-random real-or-random marked this pull request as ready for review October 16, 2020 21:56
@sipa
Copy link
Contributor

sipa commented Oct 17, 2020

This is pretty cool. I'll have to play with it a bit.

@real-or-random real-or-random changed the title Add simple static checker using clang-query Add simple static checker based on clang-query Oct 19, 2020
@real-or-random
Copy link
Contributor Author

I added a long commit message and updated the PR description to this message.

real-or-random added a commit to real-or-random/secp256k1 that referenced this pull request Oct 20, 2020
As identified in bitcoin-core#829 and bitcoin-core#833. Fixes bitcoin-core#829.

Since we touch this anyway, this commit additionally makes the
identifiers in the benchmark files a little bit more consistent.
This add a simple static checker based on clang-query, which is a
tool that could be described as a "clever grep" for abstract syntax
trees (ASTs).

As an initial proof of usefulness, this commit adds these checks:
  - No uses of floating point types.
  - No use of certain reserved identifiers (e.g., "mem(...)", bitcoin-core#829).
  - No use of memcmp (bitcoin-core#823).

The checks are easily extensible.

The main purpose is to run the checker on CI, and this commit adds
the checker to the Travis CI script.

This currently requires clang-query version at least 10. (However,
it's not required not compile with clang version 10, or with clang
at all. Just the compiler flags must be compatible with clang.)
Clang-query simply uses the clang compiler as a backend for
generating the AST.

In order to determine the compile in which the code is supposed to be
compiled (e.g., compiler flags such as -D defines and include paths),
it reads a compilation database in JSON format. There are multiple ways
to generate this database. The easiest way to obtain such a database is
to use a tool that intercepts the make process and build the database.

On Travis CI, we currently use "bear" for this purpose. It's a natural
choice because there is an Ubuntu package for it. If you want to run
this locally, bear is a good choice but other tools such as compiledb
(Python) are available.
@elichai
Copy link
Contributor

elichai commented Oct 22, 2020

FWIW

Names beginning with a capital ‘E’ followed a digit or uppercase letter
Names that begin with either ‘is’ or ‘to’ followed by a lowercase letter
Names that begin with ‘LC_’ followed by an uppercase letter
Names of all existing mathematics functions suffixed with ‘f’ or ‘l’
Names that begin with ‘SIG’ followed by an uppercase letter
Names that begin with ‘SIG_’ followed by an uppercase letter
Names beginning with ‘str’, ‘mem’, or ‘wcs’ followed by a lowercase letter
Names that end with ‘_t’

@real-or-random
Copy link
Contributor Author

real-or-random commented Oct 22, 2020

@elichai Yeah, where did you get this list from? Is this really all about all names? I thought E is only for macros. We can't check for macros using clang-query but we could add a simple grep for #define[[:space:]+]E etc. I also have uncommitted changes to add https://clang.llvm.org/docs/DiagnosticsReference.html#wreserved-id-macro if the compiler supports it (this covers only _ and __ macros). Not sure if overkill but it's not a lot of work.

@elichai
Copy link
Contributor

elichai commented Oct 22, 2020

@elichai Yeah, where did you get this list from? Is this really all about all names? I thought E is only for macros.

https://www.gnu.org/software/libc/manual/html_node/Reserved-Names.html

About E it's weird, in the gcc docs it says for additional error code names which are macros, and the exact wording of the standard is:(7.1.4)

Macros that begin with E and a digit or E and an uppercase letter (followed by any combination of digits, letters, and underscore) may be added to the declarations in the <ermo. h> header.

But on 7.1.3 "Reserved identifies" it says:

Each header declares or defines all identifiers listed in its associated subclause, and optionally declares or defines identifiers listed in its associated future library directions subclause and identifiers which are always reserved either for any use or for use as file scope identifiers.

This paper proposing to change that also agrees on that interpretation: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2572.pdf (See "Future Library Directions")

However, p1 makes it clear that all identifiers reserved from this subclause are reserved identifiers regardless of what header files are included, meaning that these rules apply to all C code

In effect, these identifiers are reserved for all uses in C regardless of what header files (if any) are included

EDIT: I realize now that you asked about identifier vs macro, and not about if it's related to errno.h or not. it sounds like it's only for macros but the lawyering here is delicate

@elichai
Copy link
Contributor

elichai commented Oct 22, 2020

Standard on 6.8.3 Semantics:

The identifier immediately following the define is called the macro name

and also Aaron Ballman(the author of that paper, and part of WG14 committee) told me:

macros are defined using an identifier, so "macro name" and "identifier" are interchangable
"reserved for use as a macro name" is just saying how it's intended to be used, not that macro names are a special thing

So yeah :/

@gmaxwell
Copy link
Contributor

It makes sense: if some macro is going to get called E1234 then you obviously can't have a variable named E1234 without trouble.

sipa added a commit that referenced this pull request Nov 4, 2020
…ify_t

1f4dd03 Typedef (u)int128_t only when they're not provided by the compiler (Tim Ruffing)
e89278f Don't use reserved identifiers memczero and benchmark_verify_t (Tim Ruffing)

Pull request description:

  As identified in #829 and #833. Fixes #829.

  Since we touch this anyway, this commit additionally makes the
  identifiers in the benchmark files a little bit more consistent.

  This is necessary before we can merge #833. I preferred a separate PR because it makes it easier to see the results of Travis in #833.

ACKs for top commit:
  sipa:
    utACK 1f4dd03
  jonasnick:
    ACK 1f4dd03

Tree-SHA512: c0ec92798f3c94f3ef6ac69b3f0f39a39257a32be9d9a068832cece1ebe64c89848b70e44652fc397004b8b240883ac4bc0c8f95abbe4ba4b028de120e6734bf
This warns when we define macros with some reserved identifiers, e.g.,
with underscores. It does not catch all reserved identifiers though.
@real-or-random
Copy link
Contributor Author

Hm, closing for now... clang-query is a neat idea, but I'm not convinced that the checks introduced here (namely mostly "reserved identifier" stuff) are worth maintaining this, in particular because clang-query is a bit finicky.

This conclusion could certainly change if we want to use something like this here for tracking magnitude (#1001) or other advanced stuff. And then, we may also want to look into other similar tools such as CodeQL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants