This repository has been archived by the owner on May 23, 2019. It is now read-only.

CIF Architecture Overview

Wes edited this page Oct 22, 2015 · 34 revisions

Introduction

This page presents an overview of the CIF architecture and explains how data moves through the system.

                                                  cif-worker
                                                     ^  +
                                                     |  |
                                                    ZMQ-PUB 
                                                     |  |
                                                     +  v
cif-smrt +---> apache2  <---> cif-starman  <--->  cif-router
                ^  +                                 +  ^
                |  |                                 |  |
                HTTP                                 HTTP
                |  |                                 |  |
                +  v                                 v  +
               client                            elasticsearch

How CIF fetches, parses and normalizes data

cif-smrt is a service that runs every hour, with a random start time within a thirty-minute window. It uses the configuration files in /etc/cif/rules/default as instructions that specify what to download, how to parse it, and how to normalize it.

  • cif-smrt uses LWP::UserAgent to fetch the data
  • cif-smrt uses regular expressions, HTML::TableExtract, JSON::XS, XML::RSS, String::Tokenizer, and XML::LibXML to parse the data
  • cif-smrt normalizes the data to a JSON data structure
  • cif-smrt submits the JSON data structure to the CIF RESTful API interface

/etc/cif/rules/default/*.cfg

             +
             |
             |
             |
             v

          cif-smrt  +--->  apache2

                              +
                              |
                              |
                              |
                              v

                          cif-router
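
The parse and normalize steps above can be sketched roughly as follows. This is a Python illustration, not cif-smrt's Perl code: the rule keys (`pattern`, `values`, `defaults`) and the record fields are hypothetical stand-ins rather than the real rule-file or CIF schema, and the fetch step is replaced by an in-memory string.

```python
import json
import re

# Hypothetical rule, loosely modelled on the idea of a .cfg rule file.
# These keys are illustrative assumptions, not the real CIF rule schema.
RULE = {
    "pattern": r"^(\S+)\s+(\S+)$",       # one observable and one tag per line
    "values": ["observable", "tags"],     # names for the captured groups
    "defaults": {"provider": "example.org", "confidence": 65},
}


def parse_feed(text, rule):
    """Parse raw feed text line by line and normalize matches to dicts."""
    regex = re.compile(rule["pattern"])
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comments
            continue
        match = regex.match(line)
        if match is None:                      # skip lines the rule cannot parse
            continue
        record = dict(rule["defaults"])        # start from the rule's defaults
        record.update(zip(rule["values"], match.groups()))
        records.append(record)
    return records


# Stand-in for downloaded data; a real run would fetch this over HTTP.
SAMPLE = "# comment\nexample.com malware\n192.0.2.1 scanner\n"

records = parse_feed(SAMPLE, RULE)
payload = json.dumps(records)   # JSON body that would be submitted to the API
```

Keeping the rule as plain data mirrors the idea that the configuration file, not the code, decides what is extracted from each source.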

How CIF post-processes data

cif-worker is responsible for post-processing data. CIF ships with four post-processors:

  • UrlResolver - extract the FQDN from a URL
  • Resolver - resolve DNS records from a FQDN
  • Spamhaus - query Spamhaus
  • BGPWhitelist - create whitelisted CIDR ranges from IP addresses resolved from FQDNs tagged as "whitelist"

https://example.com/evil.htm    +--->     cif-worker

                                               +
                                               |
                                               |
                                               v

               cif-router  <----------+   example.com [lower confidence]
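
The flow in the diagram, where a URL yields an FQDN record at lower confidence, might look like the sketch below. It is a Python illustration only: the field names (`observable`, `otype`, `confidence`) and the size of the confidence penalty are assumptions, not values taken from CIF.

```python
from urllib.parse import urlsplit


def derive_fqdn(record, penalty=10):
    """Return a new fqdn record derived from a url record, or None.

    The penalty of 10 is an assumed value; the source only says the
    derived record carries a lower confidence.
    """
    host = urlsplit(record["observable"]).hostname
    if host is None:                      # no host component found
        return None
    derived = dict(record)                # copy so the original is untouched
    derived["observable"] = host
    derived["otype"] = "fqdn"
    derived["confidence"] = max(record["confidence"] - penalty, 0)
    return derived


url_record = {
    "observable": "https://example.com/evil.htm",
    "otype": "url",
    "confidence": 85,
}
fqdn_record = derive_fqdn(url_record)
```

For simplicity the sketch does not check whether the host is an IP address literal rather than a domain name.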

How CIF stores data

CIF uses ElasticSearch for its data warehouse. ElasticSearch is a JSON document store where every field is indexed and searchable.

How the CIF API allows data to be queried and submitted

CIF uses Mojo::Base and Apache as the core of its RESTful API (PSGI). The CIF API sits on top of the ElasticSearch API, enforcing things like:

  • User Permissions
  • Data Limits

network +--> client +--> apache2 <--> cif-starman <--> cif-router

How CIF permissions data

CIF stamps each record with a group ID. CIF tokens (API keys) are associated with groups and carry read and write attributes. The CIF API ensures that a user (API key) can retrieve only data it has been given read access to, and restricts which users may write to the CIF data store.
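
A minimal sketch of this permission model, in Python, is shown below. The `Token` class, its fields, and the helper functions are hypothetical names chosen for illustration; they are not part of the CIF API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical API key with group membership and read/write flags."""
    key: str
    groups: frozenset
    read: bool = False
    write: bool = False


def readable(token, records):
    """Return only the records whose group the token is allowed to read."""
    if not token.read:
        return []
    return [r for r in records if r["group"] in token.groups]


def can_write(token, record):
    """A write is allowed only with the write flag and a matching group."""
    return token.write and record["group"] in token.groups


records = [
    {"observable": "example.com", "group": "everyone"},
    {"observable": "192.0.2.1", "group": "partners"},
]
reader = Token(key="abc", groups=frozenset({"everyone"}), read=True)
visible = readable(reader, records)
```

Here the read-only token sees only the record stamped with the "everyone" group and cannot write to any group.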

How CIF produces feeds of data

The CIF SDK (client) is responsible for generating CIF feeds. The primary attributes of a feed are:

  • Filtered by observable type (ipv4, fqdn, url, ipv6, email)
  • De-duplicated or aggregated by observable
  • Whitelisting data-sets applied

The CIF client queries the CIF server to retrieve an overly broad data set, then reduces that data set by the attributes above before returning the data to the user.

Note: In an all-in-one CIF server where the CIF client is on the CIF server, all the processing is completed on a single host. In a distributed environment, the CIF client is able to reduce load on the CIF server by processing data on a separate client host.
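
The client-side reduction described above (filter by type, de-duplicate, apply whitelists) can be sketched as follows. This is an illustrative Python outline, not the CIF SDK: the field names are hypothetical, keeping the highest-confidence record per observable is an assumed de-duplication policy, and the whitelist is a plain set of exact matches rather than CIDR or domain-aware matching.

```python
def build_feed(records, otype, whitelist):
    """Reduce a broad result set to a feed for one observable type."""
    best = {}
    for record in records:
        if record["otype"] != otype:           # filter by observable type
            continue
        obs = record["observable"]
        if obs in whitelist:                   # drop whitelisted observables
            continue
        current = best.get(obs)
        # De-duplicate: keep the highest-confidence record per observable
        # (an assumed policy, chosen for this sketch).
        if current is None or record["confidence"] > current["confidence"]:
            best[obs] = record
    return list(best.values())


results = [
    {"observable": "192.0.2.1", "otype": "ipv4", "confidence": 65},
    {"observable": "192.0.2.1", "otype": "ipv4", "confidence": 85},
    {"observable": "192.0.2.2", "otype": "ipv4", "confidence": 75},
    {"observable": "example.com", "otype": "fqdn", "confidence": 75},
]
feed = build_feed(results, "ipv4", whitelist={"192.0.2.2"})
```

Because this work happens after the query returns, it can run on whichever host the client is installed on.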
