It may be helpful to first understand Stratosphere's architecture.
You can access JupyterLab from the main dashboard. The notebooks located in the `webapps` subdirectory are also published as Voilà web applications.
The example notebook `01 kb overview.ipynb` shows how to query the knowledge base with SQL and Pandas. The `stratosphere` Python package is included directly from source by extending the `PYTHONPATH` environment variable; it is located at `/shared/src/stratosphere`. Modifications to the source code are hot-reloaded in the notebooks thanks to `%autoreload` (useful during development).
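As a sketch (not the project's actual bootstrap code), the effect of extending `PYTHONPATH` can be reproduced at runtime by prepending the package's parent directory to `sys.path`; `/shared/src` below is the parent of the `/shared/src/stratosphere` directory mentioned above.

```python
import sys

# Sketch: extending PYTHONPATH before launching Jupyter has the same effect
# as prepending the package's parent directory to sys.path at runtime.
PKG_PARENT = "/shared/src"

def ensure_on_path(path: str) -> None:
    """Prepend path to sys.path once, so `import stratosphere` resolves."""
    if path not in sys.path:
        sys.path.insert(0, path)

ensure_on_path(PKG_PARENT)
```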
Attention: you are warmly invited to contribute new extractors, together with test data samples. Thank you!
How things are glued together:

- Extractors are in charge of scraping and extracting knowledge from the intercepted flows.
- The mitmproxy service operates independently. It continuously intercepts the web traffic, dumping it to `probe.db`.
- The extractor service regularly pulls new flows from `probe.db`, passing them to the extractors for processing. The pipeline is retriggered every 10 seconds and prunes flows older than 10 minutes, possibly reprocessing already seen flows. This procedure ensures that recent traffic can always be inspected in `probe.db` without missing data and without retaining the complete flow history.
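The polling-and-pruning behavior can be sketched in a few lines. The function names and the flow representation below are hypothetical; only the 10-second retrigger interval and the 10-minute retention window come from the description above.

```python
from datetime import datetime, timedelta

POLL_INTERVAL_S = 10               # pipeline retriggers every 10 seconds
RETENTION = timedelta(minutes=10)  # flows older than this are pruned

def prune(flows, now):
    """Drop flows older than the retention window."""
    return [f for f in flows if now - f["ts"] <= RETENTION]

def pipeline_tick(stored_flows, new_flows, extractors, now):
    """One iteration of the extractor service loop (hypothetical names).

    Extractors see the whole recent window, so already-seen flows may be
    reprocessed; pruning keeps the database small without losing recent data.
    """
    flows = stored_flows + new_flows
    for extract in extractors:
        extract(flows)
    return prune(flows, now)
```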
To add a new extractor, follow these steps:

- Capture a sample data set of flows that contains the traffic of interest with `02 capture sample.ipynb`. With the default configuration, you have visibility on the flows intercepted in the last 10 minutes. You might want to use a separate browser instance without your authenticated extensions and tabs, so that the extracted sample data does not contain your own confidential information. Copy the `sample.db` file in the `samples` directory: you will likely need it later to run tests etc.
- Analyze the captured flows. The notebook `03 analyze sample.ipynb` shows how to use some included utility functions to inspect and review the contents of the flows. Once you can manually extract the information you are interested in, including the data to form a UUID for the entities and the relationships, you can move forward.
- Create a new module in `/shared/src/stratosphere/extractors` that defines a function `extract(rows: List[stratosphere.storage.models.Flow])`. A symbolic link to `/shared/` can be found in the top directory of JupyterLab, and you can use it to add the new module. Implement the `extract` function s.t. it processes all input flows, inserting them in the knowledge base. You can use `extractor_google_search.py` as an example. Recommendations:
  - You should use the class `DuplicateRows` to ensure that duplicate entities and relationships are handled correctly, merging the contents of the `data` fields. Depending on your use case, you might need to implement a custom merge strategy.
  - The fields of `Flow` ORM objects map approximately to the attributes of the `Flow` objects in mitmproxy (official documentation). For example, the field `flow_response_content` is documented here. The additional column `id` is a random UUID. `mitmproxy.py` currently records only flows whose response content type refers to text, to improve performance and reduce the database file size. If you want to capture images and other multimedia content, you might want to remove these filters.
- Test the new extractor by extending `04 test extractors.ipynb`.
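A minimal sketch of what such a module might look like. The field name `flow_request_url`, the entity dict shape, and the namespace choice are all hypothetical stand-ins; only the `extract(rows)` signature and the idea of deterministic UUIDs for entities come from the steps above.

```python
import uuid
from typing import Any, Dict, List

NAMESPACE = uuid.NAMESPACE_URL  # any fixed namespace yields stable IDs

def entity_id(kind: str, key: str) -> str:
    """Deterministic UUID: the same entity always maps to the same ID,
    which is what allows duplicates to be merged rather than duplicated."""
    return str(uuid.uuid5(NAMESPACE, f"{kind}:{key}"))

def extract(rows: List[Any]) -> List[Dict[str, Any]]:
    """Turn intercepted flows into knowledge-base entities (sketch).

    The real extract() would insert into the knowledge base via the
    stratosphere package; here we simply return the entities.
    """
    entities = []
    for flow in rows:
        url = getattr(flow, "flow_request_url", None)  # hypothetical column
        if not url:
            continue
        entities.append({"id": entity_id("url", url), "data": {"url": url}})
    return entities
```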
My setup:
- VSCode on the host
- Fabric script to manage the container, see `fabfile.py`
- Separate browser instance with enabled proxy to capture flows
An overview of useful Docker parameters:

- `--rm`: Docker will automatically clean up the container and remove its file system when the container exits. If you want to retain the container's file system, remove `--rm`.
- `-p`: Publish the container's port on the host. The format is `host_ip:host_port:container_port`. If you drop the `host_ip`, the port is published on all interfaces. By default, the proxy runs on all interfaces, but the web interface is accessible only from localhost.
- `--name`: Name your container.
- `-d`: Start the container in the background.
- `-it`: Allocate a pseudo-tty and keep STDIN open even if not attached. Useful if you want to access the container directly from a terminal, via `docker exec`.
- `-v`: Bind mount a volume.
- `--cap-add=SYS_PTRACE`: Allow processes to use the `ptrace` system call, required to run the `utils/fuser_dbs.sh` script.
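Putting the flags above together, a hypothetical invocation might look like the following. The image name (`stratosphere`) and the port numbers (8080 for the proxy, 8081 for the web UI) are placeholders, not taken from the project; the port bindings mirror the defaults described above (proxy on all interfaces, web UI on localhost only).

```shell
docker run --rm -d -it \
  --name stratosphere \
  -p 8080:8080 \
  -p 127.0.0.1:8081:8081 \
  -v "$PWD/shared:/shared" \
  --cap-add=SYS_PTRACE \
  stratosphere
```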