GSoC 2020 Work Product Submission

Mehtab Zafar edited this page Aug 26, 2020 · 4 revisions

Hey everyone, I have worked on improving the storage functionality of TANNER.

Here are the details about all the work that I have done:

Adding support for persistent storage

Background

Until now, the only way for TANNER to store session data (analyzed and unanalyzed) was to keep it all in Redis. Unanalyzed session data is the raw, unprocessed data sent by SNARE; it contains information like the time of the session, the user agent, etc. Analyzed sessions are those that TANNER generates after processing the unanalyzed data; the analyzed data contains information like the possible owner (i.e. whether the session was created by a normal user or by a crawler/tool), the type of attack, etc.

Back in 2016, when new features were being added to TANNER, Redis was selected as the best choice because of the speed it offers. Redis achieves this speed by keeping data in memory instead of on disk, similar to how a system cache works.

As the project grew we gathered more and more data, and storing everything in Redis caused it to consume a lot of memory, which resulted in unexpected crashes on low-spec systems.

Below is a diagram which shows how SNARE & TANNER work with Redis only:

Current tanner setup

Disadvantages of this setup

As mentioned above, high memory consumption on a low-spec system can result in an unexpected crash of the TANNER server. It can also cause other problems, like slowing down other applications running on the same system.

The Solution

First, we decided to add support for Postgres and give users the option to run TANNER with either Redis or Postgres. After some discussion, however, we decided to use a combination of Redis and Postgres.

The way we decided to use them can be seen in the following diagram:

New tanner setup

As the diagram above shows, after analysis we store the data in Postgres and then delete the analyzed data from Redis.

The advantages of using the combination of Postgres and Redis are:

  • When TANNER receives a session from SNARE it will be able to store that session data in Redis with much higher speed.

  • Once the session is analyzed then it can be stored in persistent disk storage like Postgres.
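The flow above can be sketched in a few lines of Python. This is an illustration only, not the actual TANNER code: a plain dict stands in for Redis, an in-memory sqlite3 database stands in for Postgres, and the table, column, and function names are hypothetical.

```python
import sqlite3

redis_stub = {}  # stand-in for Redis: fast in-memory store for raw sessions

postgres_stub = sqlite3.connect(":memory:")  # stand-in for Postgres
postgres_stub.execute(
    "CREATE TABLE sessions (uuid TEXT PRIMARY KEY, owner TEXT)"
)

def store_raw_session(uuid, data):
    """Step 1: SNARE sends a session; keep the raw data in Redis (fast)."""
    redis_stub[uuid] = data

def analyze_and_persist(uuid):
    """Step 2: analyze the session, write the result to persistent
    storage, then delete the analyzed session from Redis to free memory."""
    raw = redis_stub[uuid]
    # Toy "analysis": guess the owner type from the user agent.
    owner = "crawler" if "bot" in raw.get("user_agent", "") else "user"
    postgres_stub.execute(
        "INSERT INTO sessions VALUES (?, ?)", (uuid, owner)
    )
    postgres_stub.commit()
    del redis_stub[uuid]  # raw data no longer needs to live in memory

store_raw_session("s1", {"user_agent": "googlebot"})
analyze_and_persist("s1")
```

The key point is the last line of `analyze_and_persist`: once a session is safely in disk storage, it is removed from the in-memory store, so Redis only ever holds the sessions still waiting for analysis.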

Change in Data format

One of the major advantages of having Postgres as the storage backend is that it lets us run queries on the data before presenting it to the user.

The TANNER API is used to access the data stored in the DB. Let's take an example to see how the API has changed: the /snare-stats/<snare-uuid> endpoint returns stats like the number of attacks, total sessions, etc. for a given SNARE instance.

In the old (Redis-only) setup, we extracted all the data from the DB and then selected only the keys required by that endpoint. With the new (Redis + Postgres) setup, we don't extract all the data: we run SQL queries on the data before actually extracting it, which saves us unnecessary processing.
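The difference can be shown with a small sketch. Again this is illustrative only: sqlite3 stands in for Postgres, and the table and column names (`sessions`, `snare_uuid`, `attack`) are hypothetical, not TANNER's real schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sessions (snare_uuid TEXT, attack TEXT)")
db.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [("snare-1", "sqli"), ("snare-1", "xss"), ("snare-2", "lfi")],
)

def stats_old_style(snare_uuid):
    """Old approach: pull every row, then filter and count in Python."""
    rows = db.execute("SELECT snare_uuid, attack FROM sessions").fetchall()
    mine = [r for r in rows if r[0] == snare_uuid]
    return {"total_sessions": len(mine)}

def stats_new_style(snare_uuid):
    """New approach: let the database filter and count, and transfer
    only the final number."""
    (count,) = db.execute(
        "SELECT COUNT(*) FROM sessions WHERE snare_uuid = ?", (snare_uuid,)
    ).fetchone()
    return {"total_sessions": count}
```

Both functions return the same answer, but the second one never ships the full table to the application, which is what saves the unnecessary processing mentioned above.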

If you’d like to see the queries we are executing then please check out the code here

Detail about the format

We decided to store all the data in four tables, named sessions, cookies, owners, and paths. Below you can see the schema of all the tables:

postgres table

If you'd like to see the initial discussion about this format then check out this small gist that I made.

Pull Request submitted

Major changes

  • Add support for PostgreSQL: PR-388

  • Update TANNER API and TANNER Web to make them compatible with the new Postgres setup. PR-391

  • Update existing tests according to new changes. PR-392

  • Make the analysis of a session a separate background task. In the old setup, it happened within the session-deletion task. PR-395

  • Add support for multiple filters in API and WEB. PR-396

Minor changes

  • Drop support for Python 3.6. PR-397

  • Deploy the master branch directly to the server using a GitHub Action. PR-390

  • Add a migration script for easy migration from the old setup to the new one. PR-399

  • Use aioftp instead of FTP, continuing the work from PR-385. PR-398

  • Add support for Twig template injection. PR-401