Command line deduplication

In the command line (diff and print commands), results found in the same header file should be shown and counted only once, even if they were found multiple times because several C/C++ source files include that header.
Ericsson · Apr 19, 2018 · c89b8c6
1 parent 3bc5544
Showing 10 changed files with 172 additions and 66 deletions.
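
The change below boils down to hashing each report's bug path and keeping a single set of already seen hashes across all plist files. The following is a minimal, standalone sketch of that idea, not the code from this commit: `path_hash` and `deduplicate` are hypothetical names, and reports are assumed to be plain lists of event dictionaries rather than `Report` objects. Unlike the diff, the sketch falls back to the string `0` when an event has no location, which is an assumption made for illustration.

```python
import hashlib
import os


def path_hash(bug_path, files):
    """Hash a bug path from the line, column, message and file basename
    of every 'event' element (hypothetical helper, for illustration)."""
    parts = []
    for item in bug_path:
        if item.get('kind') != 'event':
            continue  # control-flow elements do not take part in the hash
        loc = item.get('location', {})
        # Only the basename is used, so the same header reached through
        # different source files produces the same hash.
        name = os.path.basename(files[loc['file']]) if 'file' in loc else ''
        parts.append('{}|{}|{}{}'.format(
            loc.get('line', 0), loc.get('col', 0), item['message'], name))
    return hashlib.md5(''.join(parts).encode()).hexdigest()


def deduplicate(plists):
    """Keep the first report for every distinct path hash.

    `plists` is an iterable of (files, reports) pairs, one per plist file.
    The `seen` set is shared across all pairs, which is what removes the
    duplicates coming from a header included by several source files."""
    seen = set()
    unique = []
    for files, reports in plists:
        for bug_path in reports:
            digest = path_hash(bug_path, files)
            if digest in seen:
                continue
            seen.add(digest)
            unique.append(bug_path)
    return unique
```

Because the set outlives a single plist file, the order in which the files are processed decides which copy of a duplicated report is kept; the remaining copies are skipped.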
diff --git a/docs/usage.md b/docs/usage.md
@@ -423,26 +423,19 @@ enclosing scope of the bug location (function signature, class, namespace).
 ## <a name="how-report-are-counted"></a> How reports are counted?
 
 You can list analysis reports in two ways:
-1. Using the **`CodeChecker parse`** command, which **does not do deduplication**.
-2. Reports view of the **Web UI**, which **does deduplication**.
+1. Using the **`CodeChecker parse`** command.
+2. Reports view of the **Web UI**.
 
-These two views may show slightly different report list and counts based on how 
-duplicate findings or findings with the same hash identifier are rendered.
+Both of them do **deduplication**: it will not show the same bug report multiple
+times even if the analyzer found it multiple times.
 
-The `CodeChecker parse` command does not do deduplication. 
-It lists reports simply as found by the
-analyzers and always lists all duplicate and similar findings.
-
 You may find the same bug report multiple times for two reasons:
 1) The same source file is analyzed multiple times 
 (because the `compile_commmands.json` contains the build command multiple times)
 then the same findings will be listed multiple times. 
 2) All findings that are found in headers 
 will be shown as many times as many source file include that header.
 
-Web UI reports view on the other hand does deduplication: It will not show
-the same bug report two times even if the analyzer found it multiple times.
-
 **Example:**
 ```c++
 //lib.h:
@@ -501,15 +494,7 @@ Found no defects while analyzing a.c
     3, lib.c:2:1: Entered call from 'h'
     4, lib.c:3:11: Division by zero
 
-[HIGH] lib.h:1:30: Dereference of undefined pointer value [core.NullDereference]
-inline int div_h(){int *p; *p=4;};
-                             ^
-  Report hash: 6e7a6b71ac1a26751b7a7f7eea80f5da
-  Steps:
-    1, lib.h:1:20: 'p' declared without an initial value
-    2, lib.h:1:30: Dereference of undefined pointer value
-
-Found 2 defect(s) while analyzing b.c
+Found 1 defect(s) while analyzing b.c
 
 [HIGH] lib.c:3:11: Division by zero [core.DivideZero]
   return 1/b;
@@ -521,15 +506,7 @@ Found 2 defect(s) while analyzing b.c
     3, lib.c:2:1: Entered call from 'f'
     4, lib.c:3:11: Division by zero
 
-[HIGH] lib.h:1:30: Dereference of undefined pointer value [core.NullDereference]
-inline int div_h(){int *p; *p=4;};
-                             ^
-  Report hash: 6e7a6b71ac1a26751b7a7f7eea80f5da
-  Steps:
-    1, lib.h:1:20: 'p' declared without an initial value
-    2, lib.h:1:30: Dereference of undefined pointer value
-
-Found 2 defect(s) while analyzing a.c
+Found 1 defect(s) while analyzing a.c
 
 Found no defects while analyzing b.c
 Found no defects while analyzing lib.c
@@ -538,16 +515,15 @@ Found no defects while analyzing lib.c
 -----------------------
 Filename | Report count
 -----------------------
-lib.h    |            3
+lib.h    |            1
 lib.c    |            2
 -----------------------
 ```
 
-These results are printed without deduplication and uniqueing.
+These results are printed by doing deduplication and without uniqueing.
 As you can see the *dereference of undefined pointer value* error in the 
-`lib.h` is printed 3 times, because the header is included from 
-`a.c, b.c, lib.c`. All three findings have the same Report Identifier value.
-The two division by zero errors from `a.c` and `b.c` are printed also separately.
+`lib.h` is printed only once, even if the header is included from
+`a.c, b.c, lib.c`.
 
 In deduplication mode and without uniqueing (in the Web UI) the reports
 in lib.h would be shown only once, as all three findings are identical. So in

diff --git a/libcodechecker/analyze/plist_parser.py b/libcodechecker/analyze/plist_parser.py
@@ -40,8 +40,8 @@
 
 from libcodechecker import util
 from libcodechecker.logger import get_logger
-from libcodechecker.report import Report
-from libcodechecker.report import generate_report_hash
+from libcodechecker.report import Report, generate_report_hash, \
+    get_report_path_hash
 from libcodechecker.source_code_comment_handler import \
     SourceCodeCommentHandler, skip_suppress_status
 
@@ -283,13 +283,15 @@ def __init__(self,
                  src_comment_handler,
                  skip_handler,
                  severity_map,
+                 processed_path_hashes,
                  analyzer_type="clangsa"):
 
         self.__analyzer_type = analyzer_type
         self.__severity_map = severity_map
         self.__print_steps = False
         self.src_comment_handler = src_comment_handler
         self.skiplist_handler = skip_handler
+        self._processed_path_hashes = processed_path_hashes
 
     @property
     def print_steps(self):
@@ -373,6 +375,16 @@ def write(self, files, reports, analyzed_source_file, output=sys.stdout):
 
         non_suppressed = 0
         for report in reports:
+            path_hash = get_report_path_hash(report, files)
+            if path_hash in self._processed_path_hashes:
+                LOG.debug("Not showing report because it is a deduplication "
+                          "of an already processed report!")
+                LOG.debug("Path hash: %s", path_hash)
+                LOG.debug(report)
+                continue
+
+            self._processed_path_hashes.add(path_hash)
+
             events = [i for i in report.bug_path if i.get('kind') == 'event']
             f_path = files[events[-1]['location']['file']]
             if self.skiplist_handler and \

diff --git a/libcodechecker/cmd/cmd_line_client.py b/libcodechecker/cmd/cmd_line_client.py
@@ -25,7 +25,7 @@
 from libcodechecker.libclient.client import handle_auth
 from libcodechecker.libclient.client import setup_client
 from libcodechecker.output_formatters import twodim_to_str
-from libcodechecker.report import Report
+from libcodechecker.report import Report, get_report_path_hash
 from libcodechecker.source_code_comment_handler import SourceCodeCommentHandler
 from libcodechecker.util import split_server_url
 
@@ -288,16 +288,27 @@ def get_diff_results(client, baseids, cmp_data):
 
     def get_report_dir_results(reportdir):
         all_reports = []
+        processed_path_hashes = set()
         for filename in os.listdir(reportdir):
             if filename.endswith(".plist"):
                 file_path = os.path.join(reportdir, filename)
                 LOG.debug("Parsing:" + file_path)
                 try:
                     files, reports = plist_parser.parse_plist(file_path)
                     for report in reports:
+                        path_hash = get_report_path_hash(report, files)
+                        if path_hash in processed_path_hashes:
+                            LOG.debug("Not showing report because it is a "
+                                      "deduplication of an already processed "
+                                      "report!")
+                            LOG.debug("Path hash: %s", path_hash)
+                            LOG.debug(report)
+                            continue
+
+                        processed_path_hashes.add(path_hash)
                         report.main['location']['file_name'] = \
                             files[int(report.main['location']['file'])]
-                    all_reports.extend(reports)
+                        all_reports.append(report)
 
                 except Exception as ex:
                     LOG.error('The generated plist is not valid!')

diff --git a/libcodechecker/libhandlers/parse.py b/libcodechecker/libhandlers/parse.py
@@ -158,7 +158,8 @@ def arg_match(options):
     parser.set_defaults(func=__handle)
 
 
-def parse(f, context, metadata_dict, suppress_handler, skip_handler, steps):
+def parse(f, context, metadata_dict, suppress_handler, skip_handler,
+          processed_path_hashes, steps):
     """
     Prints the results in the given file to the standard output in a human-
     readable format.
@@ -174,7 +175,8 @@ def parse(f, context, metadata_dict, suppress_handler, skip_handler, steps):
 
     rh = plist_parser.PlistToPlaintextFormatter(suppress_handler,
                                                 skip_handler,
-                                                context.severity_map)
+                                                context.severity_map,
+                                                processed_path_hashes)
 
     rh.print_steps = steps
 
@@ -260,6 +262,8 @@ def main(args):
     if 'skipfile' in args:
         skip_handler = SkipListHandler(args.skipfile)
 
+    processed_path_hashes = set()
+
     for input_path in args.input:
 
         input_path = os.path.abspath(input_path)
@@ -314,6 +318,7 @@ def main(args):
                                            metadata_dict,
                                            suppress_handler,
                                            skip_handler,
+                                           processed_path_hashes,
                                            'print_steps' in args)
             file_change = file_change.union(f_change)
 

diff --git a/libcodechecker/report.py b/libcodechecker/report.py
@@ -15,7 +15,6 @@
 import json
 import os
 
-import libcodechecker.util as util
 from libcodechecker.logger import get_logger
 from libcodechecker.util import get_line
 
@@ -147,6 +146,27 @@ def compare_ctrl_sections(curr, prev):
         return ''
 
 
+def get_report_path_hash(report, files):
+    report_path_hash = ''
+    events = filter(lambda i: i.get('kind') == 'event', report.bug_path)
+
+    for event in events:
+        file_name = os.path.basename(files[event['location']['file']])
+        line = str(event['location']['line']) if 'location' in event else 0
+        col = str(event['location']['col']) if 'location' in event else 0
+
+        report_path_hash += line + '|' + col + '|' + event['message'] + \
+            file_name
+
+    if not report_path_hash:
+        LOG.error('Failed to generate report path hash!')
+        LOG.error(report)
+        LOG.error(events)
+
+    LOG.debug(report_path_hash)
+    return hashlib.md5(report_path_hash.encode()).hexdigest()
+
+
 class Report(object):
     """
     Just a minimal separation of the main section

diff --git a/libcodechecker/server/api/report_server.py b/libcodechecker/server/api/report_server.py
@@ -34,6 +34,7 @@
 from libcodechecker.analyze import plist_parser
 from libcodechecker.logger import get_logger
 from libcodechecker.profiler import timeit
+from libcodechecker.report import get_report_path_hash
 from libcodechecker.server import permissions
 from libcodechecker.server.database import db_cleanup
 from libcodechecker.server.database.config_db_model import Product
@@ -389,27 +390,6 @@ def sort_results_query(query, sort_types, sort_type_map, order_type_map,
     return query
 
 
-def get_report_path_hash(report, files):
-    report_path_hash = ''
-    events = filter(lambda i: i.get('kind') == 'event', report.bug_path)
-
-    for event in events:
-        file_name = os.path.basename(files[event['location']['file']])
-        line = str(event['location']['line']) if 'location' in event else 0
-        col = str(event['location']['col']) if 'location' in event else 0
-
-        report_path_hash += line + '|' + col + '|' + event['message'] + \
-            file_name
-
-    if not len(report_path_hash):
-        LOG.error('Failed to generate report path hash!')
-        LOG.error(report)
-        LOG.error(events)
-
-    LOG.debug(report_path_hash)
-    return hashlib.md5(report_path_hash.encode()).hexdigest()
-
-
 class ThriftRequestHandler(object):
     """
     Connect to database and handle thrift client requests.
@@ -1854,8 +1834,7 @@ def __store_reports(self, session, report_dir, source_root, run_id,
                 bug_paths, bug_events = \
                     store_handler.collect_paths_events(report, file_ids,
                                                        files)
-                report_path_hash = get_report_path_hash(report,
-                                                        files)
+                report_path_hash = get_report_path_hash(report, files)
                 if report_path_hash in already_added:
                     LOG.debug('Not storing report. Already added')
                     LOG.debug(report)

diff --git a/tests/functional/analyze_and_parse/test_files/Makefile b/tests/functional/analyze_and_parse/test_files/Makefile
@@ -13,4 +13,7 @@ tidy_check:
 saargs_forward:
 	$(CXX) -w -std=c++11 saargs_forward.cpp -o /dev/null
 source_code_comments:
-	$(CXX) -w source_code_comments.cpp -o /dev/null
+	$(CXX) -w source_code_comments.cpp -o /dev/null
+deduplication:
+	$(CXX) -w -DVAR=1 simple1.cpp -o /dev/null
+	$(CXX) -w -DVAR=2 simple1.cpp -o /dev/null
diff --git a/tests/functional/analyze_and_parse/test_files/simple1.deduplication.output b/tests/functional/analyze_and_parse/test_files/simple1.deduplication.output
@@ -0,0 +1,42 @@
+NORMAL#CodeChecker log --output $LOGFILE$ --build "make deduplication" --quiet
+NORMAL#CodeChecker analyze $LOGFILE$ --output $OUTPUT$ --analyzers clangsa
+NORMAL#CodeChecker parse $OUTPUT$
+CHECK#CodeChecker check --build "make deduplication" --output $OUTPUT$ --quiet --analyzers clangsa
+--------------------------------------------------------------------------------
+[] - Starting build ...
+[] - Build finished successfully.
+[] - Starting static analysis ...
+[] - [1/2] clangsa analyzed simple1.cpp successfully.
+[] - [2/2] clangsa analyzed simple1.cpp successfully.
+[] - ----==== Summary ====----
+[] - Total analyzed compilation commands: 2
+[] - Successfully analyzed
+[] -   clangsa: 2
+[] - ----=================----
+[] - Analysis finished.
+[] - To view results in the terminal use the "CodeChecker parse" command.
+[] - To store results use the "CodeChecker store" command.
+[] - See --help and the user guide for further options about parsing and storing the reports.
+[] - ----=================----
+[HIGH] simple1.cpp:18:15: Division by zero [core.DivideZero]
+  return 2015 / x;
+              ^
+
+Found 1 defect(s) while analyzing simple1.cpp
+
+Found no defects while analyzing simple1.cpp
+
+----==== Summary ====----
+--------------------------
+Filename    | Report count
+--------------------------
+simple1.cpp |            1
+--------------------------
+-----------------------
+Severity | Report count
+-----------------------
+HIGH     |            1
+-----------------------
+----=================----
+Total number of reports: 1
+----=================----
diff --git a/tests/projects/cpp/Makefile b/tests/projects/cpp/Makefile
@@ -10,7 +10,8 @@ all:
 	$(CXX) -c skip_header.cpp
 	$(CXX) -c path_begin1.cpp
 	$(CXX) -c path_begin2.cpp
-	$(CXX) -c path_begin.cpp
+	$(CXX) -c -DVAR=2 path_begin.cpp
+	$(CXX) -c -DVAR=1 path_begin.cpp
 clean:
 	rm -f call_and_message.o
 	rm -f divide_zero.o