- Get
docker
anddocker-compose
- You will probably need to log out and back in again the first time (this is from experience)
From this git directory:
- change the password
redis
in./conf/redis.conf
- set the same password in
web/defaults.py
- If you have a
secrets.py
, don't forget to change the password there too - run
build.sh
(check what it does first, maybe you want to tweak things/do them manually)
Bring it back down with:
docker-compose down
If all goes well, you should see something like:
./build.sh
true
kathe_app running
Stopping kathe_nginx ... done
Stopping kathe_app ... done
Stopping kathe_redis ... done
Removing kathe_nginx ... done
Removing kathe_app ... done
Removing kathe_redis ... done
Generating new kathe SSL cert...
Generating a RSA private key
................................................+++++
..........................+++++
writing new private key to 'conf/cert/kathekey.pem'
-----
Sending build context to Docker daemon 564MB
Step 1/19 : FROM ubuntu:latest
---> adafef2e596e
[...]
Step 19/19 : CMD ["run", "/web/app.py"]
---> Using cache
---> 3152d7f359c6
Successfully built 3152d7f359c6
Successfully tagged kathe:latest
Creating kathe_redis ... done
Creating kathe_app ... done
Creating kathe_nginx ... done
Afterwards, you can just bring things up with:
docker-compose up -d
Creating kathe_redis ... done
Creating kathe_app ... done
Creating kathe_nginx ... done
And if you visit https://localhost/kathe/ (alternatively: http://localhost:8000/kathe), you should see something like this:
If anything goes wrong (like the "500" error code in this screen-shot), hovering over the HTTP response code will tell you what. Clicking it will take you to the page throwing the error. In this case, "dbtimestamp
" doesn't exist yet because the datastore is empty, so there's your error.
If you do have some data in your kathe
system, it should look like this if your search is successful:
If you click the HTTP 200 code, it will open a new tab for you to the successful response that generated the graph. I found this sometimes comes in handy.
For what kathe is, how it works and why I bothered building it, I kindly refer you to My slides for Bsides Cymru 2019.
Without data everything is useless. I've created kathe-cli.py
to load data sets in different formats into kathe. With the docker configuration as we currently have it, the data (at rest) ends up (after a background sync) in /var/lib/docker/volumes/kathe_redis-data/_data/
on the machine running the docker containers. When you rebuild a docker image, you should not lose previously loaded data. If you want to keep this data somewhere else, be sure to change the corresponding line in the docker-compose.yml file.
kathe-cli.py
stores ssdeep hashes in Redis in such a way that correlation (ssdeep compares) between all relevant hashes is possible. Because the comparison is done during storage, retrieving all similar ssdeep hashes later is cheap.
kathe-cli.py
also stores cross-linked info to redis: any of inputname, ssdeep, sha256 and context has pointers to the other info available.
Additionally, unique lists of sha256 hashes, ssdeep hashes, inputnames and contexts are created. These lists function as "Indexes" which can help you access all data. The names:context list is stored as a sorted set (zset
). The score is a counter of occurrence of a context value. Besides the obvious benefit of knowing the size of each stored context group, the zset
also serves as a numbered dictionary in app.py
.
(set) names:sha256
(set) names:ssdeep
(set) names:filename
(zset) names:context
Besides the rolling_window keys, which you probably won't need, there are four additional key types you could use:
info:inputname:<name of the binary input>
info:sha256:<sha256>
info:ssdeep:<ssdeep>
info:context:<context>
Please be aware that kathe-cli.py
removes unwanted characters like non-UTF8 and control characters from inputnames and contexts. In addition the following characters (python list) are also removed:
[':', '\\', '"', '\'', '|', ' ', '/']
The real magic of kathe-cli.py is hidden behind the sorted sets with the name <ssdeep hash>
. The score associated with every value in the zset
, holds the result of a ssdeep_compare
between the "parent" ssdeep and any partially similar ssdeep hashes I've named "siblings".
Because of weirdness in later python-redis versions, you need to install with pip3 install redis='<=3.*'
. This
I have a file with lines containing what you could call a bunch of JSON Arrays. They look like this:
head -1 apt28/apt28.json
["12288:25OuuqTt1WS36Lpvf9wScE1BR53LOvGV1Jww1nOXn+OCVOeXSVbHXwqdC1:25O6HVkpmSDBRBJJw0OXjCVmXw11", "12-033-1589(1).rar", "e53bd956c4ef79d54b4860e74c68e6d93a49008034afb42b092ea19344309914"]
What you can do with this, is just quick & dirty pass them through. Whether you have to use the REST interface (-a
switch) or not depends. I'm running the script on the same system that has the dockers running, so the redis instance is directly accessible. In my experience this is quicker.
cat apt28/apt28.json | while read line; do ./kathe-cli.py -r 13 -c apt28,zap -j "${line}" ; done
I have added the malpedia.py script
used below to the repository. Although specifically designed for working with @malpedia, it can possibly help to get started if you want to parse other repositories. The malpedia.py
script adds the required -c malpedia
context variable (because I'm lazy).
#!/usr/bin/env bash
# using db 13 for testing
./malpedia.py | while read line; do echo "${line}" && ./kathe.py -r 13 ${line};done
docker exec -it kathe_redis bash
redis-cli -n 13
auth <your pass>
save
(don't forget to run save
after you've just loaded new things into the db)
To get a list of all sha256 hashes of all stored info:
smembers hashes:sha256
To get a list of all ssdeep hashes of all stored info:
smembers hashes:ssdeep
To get a list of all inputnames of all stored info:
smembers names:inputname
To get a list of all contexts of all stored info:
zrange names:contexts 0 -1
An inputname can be any arbitrary identifying string. In this example I'm using a small sample set of the zekapab malware. Please feel free to abuse this field for any arbitrary identifying string.
To get information on an inputname:
smembers info:inputname:<input name stripped of "badchars", see above>
As nearly identical inputs can potentially have the same ssdeep hash, this might return a number of (unique) results. Format of the response:
smembers info:inputname:binary-147-meta.exe
1) "sha256:ca8bdaa271083cff19b98c3a46a75bc4f6d24ea483b1e296ad2647017a298e92:ssdeep:384:1gwH4hdaH5CLrowT7xprE4rUuUd989wRTp0W1u:V4XWuoUr8Hd989wRGW1u:context:apt28|zekapab"
info:inputname
always has the same format, so you can always access the ssdeep hash in the set by splitting on ':
' and getting field 3,4,5 (counting from zero).
To get information on a sha256 hash:
smembers info:sha256:<sha256 hash>
As identical input can have many names, this will possibly return a number of (unique) results. Format of the response:
smembers info:sha256:19be1aedc36a6f7d1fcbd9c689757d3d09b7dad7136b4f419a45e6187f54f772
1) "ssdeep:24576:P7AKkolpDEI+UTUqCg5D5WmQW9Ulg/bdG:PblVfP9+gzA:context:apt28|zekapab:inputname:1bcf064650aef06d83484d991bdf6750.virobj"
info:sha256
always has the same format, so you can always access the inputname in this string by splitting on ':
' and getting field 7 (counting from zero).
To get information on a ssdeep hash:
smembers info:ssdeep:<ssdeep hash>
As nearly identical files can potentially have the same ssdeep hash, this might return a number of (unique) results. Format of the response:
smembers info:ssdeep:24576:P7AKkolpDEI+UTUqCg5D5WmQW9Ulg/bdG:PblVfP9+gzA
1) "sha256:19be1aedc36a6f7d1fcbd9c689757d3d09b7dad7136b4f419a45e6187f54f772:context:apt28|zekapab:inputname:1bcf064650aef06d83484d991bdf6750.virobj"
info:ssdeep
always has the same format, so you can always access the contexts in this string by splitting on ':
' and getting field 3 (counting from zero). Contexts are separated by a "|".
You will very likely want to know which files/ssdeeps/inputnames appear in a certain context. That is why I added 'context' (and made it a MUST).
Access to all the info in a context is as simple as:
smembers info:context:apt28
1) "sha256:e7dd9678b0a1c4881e80230ac716b21a41757648d71c538417755521438576f6:ssdeep:24576:ybvZoVeeYPVvwrWmQFVHaf9P3lgtgZBJJw0OXjCVmXw11:ya6VHal3lgtgPJJw0OXuAXwv:inputname:codexgigas_b3086b4d99288d50585d4c07a3fdd0970a9843fc:inputcontext:apt28|zekapab"
2) "sha256:7<...>
After initially working with the "raw" data in redis and some graphviz, I've found that working with an interactive gui is very handy. So we've made app.py, a python-bottle web application. For convenience and speed of development it's wrapped up in a docker configuration.
- Fill in a (large) context to get you started if you just want to look around. Or a sha256 or ssdeep hash if you want to focus on a particular thing.
- All primary contexts of all nodes are listed here. If an element has multiple primary contexts (say you added a piece of malware with apt28 as primary context, but added the same (or nearly the same) malware under the Sofacy name, it will show up here as a unique group called apt28/sofacy).
- If this says "
sample: true
" the number of associated nodes has reached it's max and the dataset has been truncated. What the max is can be adjusted inapp.py
. - The view can be zoomed/ panned and nodes can be clicked, in which case the related edges show up. Hovering over a node shows the ssddeep hash and context, hovering over an edge the ssdeep similarity score. Clicking on the background removes the focus from the clicked node. One TODO left is thinking of a way to z-index the edges up.
- If you hover over a node the left infopanel shows the details of that node.
- IF you click on a node the right infopanel shows the details of the clicked node.