Tom Sellers edited this page Sep 3, 2021 · 6 revisions

The data referenced in this document can be found in the SSL Certificates section of the Rapid7 Open Data website.

Summary

Project Sonar produces an SSL certificate dataset for every TLS-enabled study that we perform.

This data is gathered by first performing a TCP SYN scan across the entire IPv4 address space for the study's network port and then running a collection script against every system that returned a positive response. The data is then compared against the previous scan, and any new entries (hosts, names, or certificates) are uploaded to Open Data.
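The comparison step can be pictured as a simple difference between two scans. The sketch below is only an illustration of that idea, not Project Sonar's actual implementation; the function name `new_entries` and the tuple layout are assumptions made for the example.

```python
def new_entries(previous, current):
    """Return rows from the current scan that did not appear in the previous scan.

    Each row is a tuple such as (ip, sha1). The order of `current` is preserved.
    """
    seen = set(previous)
    return [row for row in current if row not in seen]


# Hypothetical example rows: (ip, sha1)
previous_scan = [("192.0.2.1", "aaa")]
current_scan = [("192.0.2.1", "aaa"), ("192.0.2.1", "bbb")]
published = new_entries(previous_scan, current_scan)
```

Because only the difference is published, a consumer who reads a single file sees just the entries that were new in that scan.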

Data format / schema

The data consists of three files per study:

  • ${date}_certs.gz
  • ${date}_hosts.gz
  • ${date}_names.gz

All three files are gzip-compressed CSVs.

The certs file contains a SHA1 hash of the X509 certificate followed by the base64-encoded X509 certificate itself.

The hosts file contains an IP address and the SHA1 hash of the certificate that was found on that IP. If multiple certificates are found on a host, their hashes are listed in the order they were seen. It is common for an SSL/TLS server to provide multiple certificates in the response, typically consisting of the server's certificate, followed by a certificate authority's intermediate ("glue") certificate, and finally the root certificate.

The names file contains a SHA1 hash of the certificate followed by the Common Name or one of the SubjectAltName entries. It is common for a single certificate to have many names associated with it.
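As a rough illustration of reading these files, the sketch below uses only the Python standard library. It assumes the files have no header row, that columns appear in the order described above, and that the SHA1 hash is computed over the decoded (DER) certificate bytes; none of these details are confirmed by this page. The helper names `read_rows`, `group_hosts`, and `sha1_of_cert` are hypothetical.

```python
import base64
import csv
import gzip
import hashlib


def read_rows(path):
    """Yield each row of a gzip-compressed CSV file as a list of strings."""
    with gzip.open(path, "rt", newline="", encoding="utf-8", errors="replace") as handle:
        yield from csv.reader(handle)


def group_hosts(rows):
    """Map each IP address to its certificate hashes, in the order seen.

    Accepts one hash per row or several hashes on the same row.
    """
    chains = {}
    for row in rows:
        if not row:
            continue  # skip blank lines
        ip, *hashes = row
        chains.setdefault(ip, []).extend(hashes)
    return chains


def sha1_of_cert(cert_b64):
    """Return the hex SHA1 digest of a base64-encoded certificate.

    Assumption: the published hash covers the decoded certificate bytes.
    """
    return hashlib.sha1(base64.b64decode(cert_b64)).hexdigest()
```

Comparing `sha1_of_cert` against the hash column of a few certs rows is a quick way to check whether the hash assumption holds for the files you downloaded.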

Due to the incremental nature of published data, it is necessary to process all historical data files in order to obtain a complete picture of the latest scan. A reasonable approach is to download all data files and process them sequentially, loading the certs, hosts, and names into separate database tables. The X509 certificates will need to be parsed and possibly stored as multiple fields within the certs table. The date of the scan (represented by the file name) should be stored as a column within each table.
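One possible way to do the sequential load is sketched below with SQLite from the Python standard library. The table and column names are hypothetical, and the sketch assumes that the part of the file name before `_certs.gz`, `_hosts.gz`, or `_names.gz` sorts chronologically as plain text, that files have no header row, and that each row carries two columns. It stores the certificate as its base64 text; parsing the X509 fields is left out.

```python
import csv
import gzip
import sqlite3
from pathlib import Path

# Hypothetical schema: one table per file type, each with a scan_date column.
SCHEMA = """
CREATE TABLE IF NOT EXISTS certs (sha1 TEXT, cert_b64 TEXT, scan_date TEXT);
CREATE TABLE IF NOT EXISTS hosts (ip TEXT, sha1 TEXT, scan_date TEXT);
CREATE TABLE IF NOT EXISTS names (sha1 TEXT, name TEXT, scan_date TEXT);
"""


def split_name(path):
    """Split '<date>_<kind>.gz' into (date, kind)."""
    scan_date, rest = Path(path).name.rsplit("_", 1)
    return scan_date, rest.split(".", 1)[0]


def load_file(conn, path):
    """Insert every row of one data file, tagged with its scan date."""
    scan_date, kind = split_name(path)
    if kind not in ("certs", "hosts", "names"):
        raise ValueError(f"unexpected file name: {path}")
    with gzip.open(path, "rt", newline="", encoding="utf-8", errors="replace") as handle:
        rows = ((row[0], row[1], scan_date) for row in csv.reader(handle) if len(row) >= 2)
        with conn:  # one transaction per file
            conn.executemany(f"INSERT INTO {kind} VALUES (?, ?, ?)", rows)


def load_all(conn, directory):
    """Load every .gz file in a directory, oldest file name first."""
    conn.executescript(SCHEMA)
    for path in sorted(Path(directory).glob("*.gz")):
        load_file(conn, path)
```

For example, `load_all(sqlite3.connect("sonar_ssl.db"), "downloads")` would load every file found in a local `downloads` directory; both names are placeholders.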

Once all of the bulk data has been loaded in the correct order, it becomes easy to determine which certificates and names correspond to which IP addresses and vice versa. Depending on available memory and storage speed, it may make sense to create join tables or just add indexes to the fields used for lookups, such as the SHA1 hash.
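The lookups described above might look like the following sketch, which assumes a SQLite database with two hypothetical tables, `hosts(ip, sha1)` and `names(sha1, name)`. The index names and function names are likewise made up for the example.

```python
import sqlite3


def add_indexes(conn):
    """Index the columns used for joins and lookups."""
    conn.executescript("""
    CREATE INDEX IF NOT EXISTS idx_hosts_ip ON hosts (ip);
    CREATE INDEX IF NOT EXISTS idx_hosts_sha1 ON hosts (sha1);
    CREATE INDEX IF NOT EXISTS idx_names_sha1 ON names (sha1);
    CREATE INDEX IF NOT EXISTS idx_names_name ON names (name);
    """)


def names_for_ip(conn, ip):
    """Return the distinct certificate names seen on an IP address."""
    query = """
        SELECT DISTINCT names.name
        FROM hosts JOIN names ON names.sha1 = hosts.sha1
        WHERE hosts.ip = ?
        ORDER BY names.name
    """
    return [row[0] for row in conn.execute(query, (ip,))]


def ips_for_name(conn, name):
    """Return the distinct IP addresses that served a certificate with this name."""
    query = """
        SELECT DISTINCT hosts.ip
        FROM names JOIN hosts ON hosts.sha1 = names.sha1
        WHERE names.name = ?
        ORDER BY hosts.ip
    """
    return [row[0] for row in conn.execute(query, (name,))]
```

Indexing the SHA1 column in both tables keeps the join from scanning every row, which matters once many scans have been loaded.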

The incremental data format is time-intensive to set up, but is much faster to keep updated, because only the relatively small weekly data files need to be processed rather than the complete raw dataset.