Docker updates (#138)
* Updated migrate.py and docker-entrypoint.sh to be compatible with docker
compose (input() causes error). Also updated docker-config.py.

NOTE: Docker will overwrite docker-config with config.py if it already
exists. This may be the desired behavior, but it can cause failures.

* Removing personal info.

* Changed to use config.py-example over docker-config.py
Also changed apt update && install per
"https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#run".

Noticed that the database variables are hardcoded throughout
entrypoint.sh, Dockerfile, config.py, and docker-compose.yml.
Unsure if I like using sed to update config file.

Alternatives could be using a docker .env file; would need to make
multiple updates and still would need to update config.py. The
configparser python package is possibly a better solution to the
config.py file in general, but that would require extensive updates
to 4cat. COULD possibly create a separate config file handled by
configparser and import that into config.py.

* Updates to docker config modifications. Moved docker variables to
docker_config.ini, created docker_setup.py to better utilize
configparser and avoid any accidental "sed" changes, and modified
config.py (actually config.py-example) to use docker_config.ini if
it has been specified in docker_config.ini.

Also moved setup to Dockerfile instead of docker-entrypoint.sh so that
it does not unnecessarily run every time docker containers are started.
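
docker_setup.py itself is not shown in this excerpt. As a rough sketch of the approach described above, a setup step could write the settings with configparser instead of editing config.py with sed. The function name `write_docker_config`, the host value `db`, and the temporary paths are assumptions for illustration; the section and option names mirror the ones read in config.py-example.

```python
import configparser
import os
import tempfile


def write_docker_config(ini_path, db_user, db_name, db_password, data_dir):
    """Write Docker settings to an INI file and create the data directory."""
    # interpolation=None so that a "%" in a password is stored literally
    # (an assumption of this sketch, not taken from the commit)
    config = configparser.ConfigParser(interpolation=None)
    config["DOCKER"] = {"use_docker_config": "True"}
    config["DATABASE"] = {
        "db_host": "db",  # assumed: the compose service name
        "db_port": "5432",
        "db_user": db_user,
        "db_name": db_name,
        "db_password": db_password,
    }
    config["PATHS"] = {"path_data": data_dir}

    # Create directories if needed before writing the file
    os.makedirs(data_dir, exist_ok=True)
    parent = os.path.dirname(ini_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(ini_path, "w") as handle:
        config.write(handle)
    return ini_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ini = write_docker_config(
            os.path.join(tmp, "shared", "docker_config.ini"),
            "fourcat", "fourcat", "supers3cr3t",
            os.path.join(tmp, "data"),
        )
        parsed = configparser.ConfigParser(interpolation=None)
        parsed.read(ini)
        print(parsed["DOCKER"].getboolean("use_docker_config"))
```

Because the file is generated rather than patched, running the step again simply rewrites the same keys, which avoids the accidental matches a sed substitution can produce.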

* 1. added paths to docker.ini file
2. handled paths in docker_setup.py (create directories if needed)
3. update config.py-example to use docker paths
4. trap SIGTERM in docker-entrypoint.sh for 4cat-daemon backend
5. update docker-compose to separate backend and frontend and use
shared volume for data
6. add Dockerfile_frontend to set up frontend (could use paring down)
7. rearranged Dockerfile so it doesn't rebuild python packages and
download/install chrome every time I update the config file
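
The SIGTERM trap mentioned in the list above lives in docker-entrypoint.sh, which is not part of this excerpt. The same idea can be sketched in Python: record the signal and let the main loop exit cleanly when the container is stopped. The class and function names here are hypothetical and are not taken from 4CAT.

```python
import os
import signal
import time


class GracefulShutdown:
    """Record SIGTERM so a long-running loop can stop cleanly."""

    def __init__(self):
        self.requested = False
        # Must be registered from the main thread
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.requested = True


def run_until_terminated(shutdown, max_iterations=50):
    """Loop until SIGTERM arrives or the iteration limit is reached."""
    iterations = 0
    while not shutdown.requested and iterations < max_iterations:
        iterations += 1
        time.sleep(0.01)
    return iterations


if __name__ == "__main__":
    shutdown = GracefulShutdown()
    os.kill(os.getpid(), signal.SIGTERM)  # simulate the stop signal
    time.sleep(0.05)
    print(shutdown.requested, run_until_terminated(shutdown))
```

Without a handler of this kind, a process that does not react to SIGTERM is killed outright once Docker's stop timeout expires, so it gets no chance to clean up.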

* Docker org updates to allow for rebuilding images/updating docker files.
Shared admin password someplace noticeable.

* gitignore changes

* update .gitignore (ignore venv & jupyter notebooks). 
Add 4444 port to docker.

* add port for telegram.
add modify api port to config for docker.

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Adding dynamic localhost

* Dynamic API host

* temp logging

* add test status button

* Adding docker test

* Update README.md

* Add sessions path to config files; allow it to be shared by docker 
containers

* Removed logging. Changed admin@admin.com to admin.

* Updates to Docker: user database information is set in .env file which
is then used by docker-compose.yml, passed to the Dockerfiles, 
docker-entrypoint.sh, and docker_setup.py which updates 4cat config 
files.

* TCAT to 4CAT. Cause apparently I don't even know where I am anymore!
dale-wahl authored May 27, 2021
1 parent 1825c25 commit 160909a
Showing 19 changed files with 347 additions and 220 deletions.
4 changes: 4 additions & 0 deletions .env
@@ -0,0 +1,4 @@
POSTGRES_USER=fourcat
POSTGRES_PASSWORD=supers3cr3t
POSTGRES_DB=fourcat
POSTGRES_HOST_AUTH_METHOD=trust
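
To illustrate how these four values can flow into the rest of the stack, here is a small sketch that parses the KEY=VALUE lines and builds a PostgreSQL URI from them. The parser is deliberately simplified and does not reproduce Docker Compose's full .env rules; the function names are hypothetical, and the host `db` and port 5432 are assumptions based on the compose service shown later in this commit.

```python
from urllib.parse import quote

SAMPLE_ENV = """\
POSTGRES_USER=fourcat
POSTGRES_PASSWORD=supers3cr3t
POSTGRES_DB=fourcat
POSTGRES_HOST_AUTH_METHOD=trust
"""


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks, comments and malformed lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def database_uri(env, host="db", port=5432):
    """Build a PostgreSQL URI; host and port are assumed defaults."""
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{env['POSTGRES_DB']}"


if __name__ == "__main__":
    print(database_uri(parse_env(SAMPLE_ENV)))
```

Keeping the credentials in one file means they only need to be changed in a single place before the first `docker-compose up`.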
14 changes: 14 additions & 0 deletions .github/workflows/dockerimage.yml
@@ -0,0 +1,14 @@
name: Docker Image CI

on:
  push:
    branches: dev
  pull_request:
    branches: master
jobs :
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build the stack
        run: docker-compose up -d
4 changes: 4 additions & 0 deletions .gitignore
@@ -36,7 +36,11 @@ build/
dist/
fourcat.egg-info/
webtool/venv/
.4cat_env/
.vscode*
.idea
.env/
*.ipynb

# do not ignore interface images
!webtool/static/img/*.png
51 changes: 27 additions & 24 deletions README.md
@@ -4,21 +4,23 @@
 [![License: MPL 2.0](https://img.shields.io/badge/license-MPL--2.0-informational)](https://github.com/digitalmethodsinitiative/4cat/blob/master/LICENSE)
 ![Requires Python 3.7](https://img.shields.io/badge/python-v3.8-blue)

+[![Actions Status](https://github.com/digitalmethodsinitiative/4cat/workflows/Docker%20Image%20CI/badge.svg)](https://github.com/digitalmethodsinitiative/4cat/actions)
+
 4CAT is a tool that can be used to analyse and process data from online social
-platforms. Its goal is to make the capture and analysis of data from these
+platforms. Its goal is to make the capture and analysis of data from these
 platforms accessible to people through a web interface, without requiring any
 programming or web scraping skills.

 In 4CAT, you create a dataset from a given platform according to a given set of
-parameters; the result of this (usually a CSV file containing matching items)
-can then be downloaded or analysed further with a suite of analytical
+parameters; the result of this (usually a CSV file containing matching items)
+can then be downloaded or analysed further with a suite of analytical
 'processors', which range from simple frequency charts to more advanced analyses
 such as the generation and visualisation of word embedding models.

-4CAT has a (growing) number of supported data sources corresponding to popular
-platforms that are part of the tool, but you can also [add additional data
-sources](https://github.com/digitalmethodsinitiative/4cat/wiki/How-to-make-a-data-source)
-using 4CAT's Python API. The following data sources are currently supported
+4CAT has a (growing) number of supported data sources corresponding to popular
+platforms that are part of the tool, but you can also [add additional data
+sources](https://github.com/digitalmethodsinitiative/4cat/wiki/How-to-make-a-data-source)
+using 4CAT's Python API. The following data sources are currently supported
 actively:

 * 4chan
@@ -29,62 +31,63 @@ actively:
 * Telegram
 * Twitter API (Academic Track, full-archive search)

-The following platforms are supported through other tools, from which you can
+The following platforms are supported through other tools, from which you can
 import data into 4CAT for analysis:

 * Facebook (via [CrowdTangle](https://www.crowdtangle.com) exports)
 * Instagram (via CrowdTangle)
 * TikTok (via [tiktok-scraper](https://github.com/drawrowfly/tiktok-scraper))

 A number of other platforms have built-in support that is untested, or requires
-e.g. special API access. You can view the [full list of data
-sources](https://github.com/digitalmethodsinitiative/4cat/tree/master/datasources)
+e.g. special API access. You can view the [full list of data
+sources](https://github.com/digitalmethodsinitiative/4cat/tree/master/datasources)
 in the GitHub repository.

 ## Install
 We use 4CAT for our own purposes at the University of Amsterdam but you can
-(and are encouraged to!) run your own instance. [You can find detailed
-installation instructions in our
+(and are encouraged to!) run your own instance. [You can find detailed
+installation instructions in our
 wiki](https://github.com/digitalmethodsinitiative/4cat/wiki/Installing-4CAT).

-Support for Docker is work-in-progress. You can install using
-[docker-compose](https://docs.docker.com/compose/install/) by running:
+Support for Docker is work-in-progress. You can install using
+[docker-compose](https://docs.docker.com/compose/install/) by cloning the repository and running:
 ```
 docker-compose up
 ```

-But this may currently not work in all environments. We hope to rectify this in
-the future (pull requests are very welcome).
+Your admin username and password will appear at the end of the installation and are saved as login.txt in the Docker 4cat_backend container (you should delete this afterwards; `docker exec -it 4cat_backend /bin/bash` to access container). You may also want to change your SQL database information by updating the .env file *prior* to using Docker compose.
+
+Please check our [issues](https://github.com/digitalmethodsinitiative/4cat/issues) and create one if you experience any problems (pull requests are also very welcome).

 ## Components
 4CAT consists of several components, each in a separate folder:

-- `backend`: A standalone Python 3 app that scrapes defined data sources,
-downloads and stores the relevant data and performs searches and analyses as
+- `backend`: A standalone Python 3 app that scrapes defined data sources,
+downloads and stores the relevant data and performs searches and analyses as
 queued by the front-end.
 - `webtool`: A Flask app that provides a web front-end to search and analyze
 the stored data with.
-- `datasources`: Data source definitions. This is a set of configuration
+- `datasources`: Data source definitions. This is a set of configuration
 options, database definitions and python scripts to process this data with.
 If you want to set up your own data sources, refer to the
 [wiki](https://github.com/digitalmethodsinitiative/4cat/wiki/How-to-make-a-data-source).
 - `processors`: A collection of data processing scripts that can plug into
 4CAT and manipulate or process datasets created with 4CAT. There is an API
-you can use to [make your own
+you can use to [make your own
 processors](https://github.com/digitalmethodsinitiative/4cat/wiki/How-to-make-a-processor).

 ## Credits & License
-4CAT was created by [OILab](https://oilab.eu) and the
+4CAT was created by [OILab](https://oilab.eu) and the
 [Digital Methods Initiative](https://www.digitalmethods.net) at the University
-of Amsterdam. The tool was inspired by the
+of Amsterdam. The tool was inspired by the
 [TCAT](https://wiki.digitalmethods.net/Dmi/ToolDmiTcat), a tool with comparable
 functionality that can be used to scrape and analyse Twitter data.

-4CAT development is supported by the Dutch [PDI-SSH](https://pdi-ssh.nl/en/) foundation through the [CAT4SMR project](https://cat4smr.humanities.uva.nl/).
+4CAT development is supported by the Dutch [PDI-SSH](https://pdi-ssh.nl/en/) foundation through the [CAT4SMR project](https://cat4smr.humanities.uva.nl/).

 4CAT is licensed under the Mozilla Public License, 2.0. Refer to the `LICENSE`
 file for more information.

 ## Links
 - [Open Intelligence Lab](https://www.oilab.eu)
-- [Digital Methods Initiative](https://www.digitalmethods.net)
+- [Digital Methods Initiative](https://www.digitalmethods.net)
7 changes: 4 additions & 3 deletions backend/workers/api.py
@@ -13,6 +13,7 @@ class InternalAPI(BasicWorker):
type = "api"
max_workers = 1

+host = config.API_HOST
port = config.API_PORT

def work(self):
@@ -21,7 +22,7 @@ def work(self):
Opens a socket that continuously listens for requests, and passes a
client object on to a handling method if a connection is established
:return:
"""
if self.port == 0:
@@ -44,7 +45,7 @@ def work(self):
while has_time:
has_time = start_trying > time.time() - 300 # stop trying after 5 minutes
try:
-server.bind(("localhost", self.port))
+server.bind((self.host, self.port))
break
except OSError as e:
if has_time and not self.interrupted:
@@ -59,7 +60,7 @@ def work(self):

server.listen(5)
server.settimeout(5)
-self.manager.log.info("Local API listening for requests at localhost:%s" % self.port)
+self.manager.log.info("Local API listening for requests at %s:%s" % (self.host, self.port))

# continually listen for new connections
while not self.interrupted:
2 changes: 1 addition & 1 deletion common/lib/helpers.py
@@ -253,7 +253,7 @@ def call_api(action, payload=None):
"""
connection = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
connection.settimeout(15)
-connection.connect(("localhost", config.API_PORT))
+connection.connect((config.API_HOST, config.API_PORT))

msg = json.dumps({"request": action, "payload": payload})
connection.sendall(msg.encode("ascii", "ignore"))
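
The two changes above replace the hardcoded `"localhost"` on both the listening side and the calling side with a configurable host. A socket bound to `localhost` accepts connections only from the same machine or container, so a front end in a separate container could not reach it. A self-contained sketch of the pattern follows; the reply format and the `serve_once` and `demo` helpers are hypothetical, and the sketch uses port 0 only to let the operating system pick a free port (in 4CAT's config, "0" disables the API).

```python
import json
import socket
import threading


def serve_once(server):
    """Accept one connection and reply with a JSON status message."""
    client, _ = server.accept()
    with client:
        request = json.loads(client.recv(2048).decode("ascii"))
        reply = {"status": "success", "echo": request["request"]}
        client.sendall(json.dumps(reply).encode("ascii"))


def call_api(host, port, action, payload=None):
    """Send one JSON request to host:port and return the decoded reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as connection:
        connection.settimeout(15)
        connection.connect((host, port))
        msg = json.dumps({"request": action, "payload": payload})
        connection.sendall(msg.encode("ascii", "ignore"))
        return json.loads(connection.recv(2048).decode("ascii"))


def demo(host="127.0.0.1"):
    """Start a one-shot server on a configurable host and query it."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(5)  # do not wait forever if the client never connects
    server.bind((host, 0))  # port 0 here means: OS chooses a free port
    server.listen(5)
    port = server.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(server,))
    worker.start()
    try:
        return call_api(host, port, "status")
    finally:
        worker.join()
        server.close()


if __name__ == "__main__":
    print(demo())
```

Both sides read the same host value, so changing it in one configuration file keeps the listener and its callers in agreement.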
52 changes: 44 additions & 8 deletions config.py-example
@@ -1,5 +1,8 @@
""" 4CAT configuration """
import os
+import configparser
+
+DOCKER_CONFIG_FILE = 'docker/shared/docker_config.ini'

# Data source configuration
DATASOURCES = {
@@ -8,7 +11,9 @@ DATASOURCES = {
"douban": {},
"customimport": {},
"parler": {},
-"reddit": {},
+"reddit": {
+"boards": "*",
+},
"telegram": {},
"twitterv2": {}
}
@@ -33,6 +38,7 @@ PATH_LOGS = "" # store logs here - empty means the 4CAT root folder
PATH_IMAGES = "" # if left empty or pointing to a non-existent folder, no images will be saved
PATH_DATA = "" # search results will be stored here as CSV files
PATH_LOCKFILE = "backend" # the daemon lockfile will be saved in this folder. Probably no need to change!
+PATH_SESSIONS = "sessions" # folder where API session data is stored (e.g., Telegram)

# The following two options should be set to ensure that every analysis step can
# be traced to a specific version of 4CAT. This allows for reproducible
@@ -44,6 +50,7 @@ GITHUB_URL = "https://github.com/digitalmethodsinitiative/4cat" # URL to the gi

# 4CAT has an API (available from localhost) that can be used for monitoring
# and will listen for requests on the following port. "0" disables the API.
+API_HOST = "localhost"
API_PORT = 4444

# 4CAT can anonymise author names in results and does so using a hashed version
@@ -85,12 +92,41 @@ TUMBLR_API_SECRET_KEY = ""
 REDDIT_API_CLIENTID = ""
 REDDIT_API_SECRET = ""

+# Docker setup requires matching database configuration
+use_docker_config = False
+if os.path.exists(DOCKER_CONFIG_FILE):
+    config = configparser.ConfigParser()
+    config.read(DOCKER_CONFIG_FILE)
+    if config['DOCKER'].getboolean('use_docker_config'):
+        use_docker_config = True
+        DB_HOST = config['DATABASE'].get('db_host')
+        DB_PORT = config['DATABASE'].getint('db_port')
+        DB_USER = config['DATABASE'].get('db_user')
+        DB_NAME = config['DATABASE'].get('db_name')
+        DB_PASSWORD = config['DATABASE'].get('db_password')
+
+        API_HOST = config['API'].get('api_host')
+        API_PORT = config['API'].getint('api_port')
+
+        PATH_ROOT = os.path.abspath(os.path.dirname(__file__)) # better don't change this
+        PATH_LOGS = config['PATHS'].get('path_logs', "")
+        PATH_IMAGES = config['PATHS'].get('path_images', "")
+        PATH_DATA = config['PATHS'].get('path_data', "")
+        PATH_LOCKFILE = config['PATHS'].get('path_lockfile', "")
+        PATH_SESSIONS = config['PATHS'].get('path_sessions', "")
+
+        ANONYMISATION_SALT = config['GENERATE'].get('anonymisation_salt')
+
 # Web tool settings
 class FlaskConfig:
-    FLASK_APP = 'webtool/fourcat'
-    SECRET_KEY = "REPLACE_THIS"
-    SERVER_NAME = 'localhost:5000'
-    SERVER_HTTPS = False # set to true to make 4CAT use "https" in absolute URLs
-    HOSTNAME_WHITELIST = ["localhost"] # only these may access the web tool; "*" or an empty list matches everything
-    HOSTNAME_WHITELIST_API = ["localhost"] # hostnames matching these are exempt from rate limiting
-    HOSTNAME_WHITELIST_NAME = "Automatic login"
+    FLASK_APP = 'webtool/fourcat'
+    SECRET_KEY = "REPLACE_THIS"
+    SERVER_NAME = 'localhost:5000'
+    SERVER_HTTPS = False # set to true to make 4CAT use "https" in absolute URLs
+    HOSTNAME_WHITELIST = ["localhost"] # only these may access the web tool; "*" or an empty list matches everything
+    HOSTNAME_WHITELIST_API = ["localhost"] # hostnames matching these are exempt from rate limiting
+    HOSTNAME_WHITELIST_NAME = "Automatic login"
+
+    # Docker config
+    if use_docker_config:
+        SECRET_KEY = config['GENERATE'].get('secret_key')
5 changes: 3 additions & 2 deletions datasources/telegram/search_telegram.py
@@ -17,6 +17,7 @@
from telethon.errors.rpcerrorlist import UsernameInvalidError
from telethon.tl.types import User, PeerChannel, PeerChat, PeerUser

+import config

class SearchTelegram(Search):
"""
@@ -125,7 +126,7 @@ async def execute_queries(self):

hash_base = query["api_phone"].replace("+", "") + query["api_id"] + query["api_hash"]
session_id = hashlib.blake2b(hash_base.encode("ascii")).hexdigest()
-session_path = Path(__file__).parent.joinpath("sessions", session_id + ".session")
+session_path = Path(config.PATH_ROOT).joinpath(config.PATH_SESSIONS, session_id + ".session")

client = None

@@ -427,4 +428,4 @@ def validate_query(query, request, user):
"api_id": query.get("api_id"),
"api_hash": query.get("api_hash"),
"api_phone": query.get("api_phone")
-}
+}
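
The session-path change above can be read in isolation: the session file name is a BLAKE2b hash of the API credentials, and its folder now comes from configuration rather than from the location of the module. A small sketch of that derivation, where the function name `session_path` and the example root `/usr/src/app` are assumptions for illustration:

```python
import hashlib
from pathlib import Path


def session_path(root, sessions_dir, api_phone, api_id, api_hash):
    """Derive a stable session file path from Telegram API credentials."""
    hash_base = api_phone.replace("+", "") + api_id + api_hash
    session_id = hashlib.blake2b(hash_base.encode("ascii")).hexdigest()
    return Path(root).joinpath(sessions_dir, session_id + ".session")


if __name__ == "__main__":
    path = session_path("/usr/src/app", "sessions", "+31600000000", "12345", "abcdef")
    print(path.parent, len(path.stem))
```

Since both the search worker and the web view compute the path from the same two configuration values, containers that share the configured folder resolve the same file for the same credentials.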
4 changes: 3 additions & 1 deletion datasources/telegram/webtool/views.py
@@ -10,6 +10,7 @@
from telethon.sync import TelegramClient
from telethon.errors.rpcerrorlist import FloodWaitError, ApiIdInvalidError, PhoneNumberInvalidError

+import config

def authenticate(request, user, **kwargs):
"""
@@ -42,7 +43,8 @@ def authenticate(request, user, **kwargs):

# store session ID for user so it can be found again for later queries
user.set_value("telegram.session", session_id)
-session_path = Path(__file__).parent.joinpath("..", "sessions", session_id + ".session")
+session_path = Path(config.PATH_ROOT).joinpath(config.PATH_SESSIONS, session_id + ".session")
+

client = None

52 changes: 36 additions & 16 deletions docker-compose.yml
@@ -2,32 +2,52 @@ version: '3.6'

 services:
   db:
+    container_name: db
     image: postgres:latest
     ports:
       - 5432:5432
     environment:
-      POSTGRES_USER: fourcat
-      POSTGRES_PASSWORD: supers3cr3t
-      POSTGRES_DB: fourcat
-      POSTGRES_HOST_AUTH_METHOD: trust
+      - POSTGRES_USER=${POSTGRES_USER}
+      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
+      - POSTGRES_DB=${POSTGRES_DB}
+      - POSTGRES_HOST_AUTH_METHOD=${POSTGRES_HOST_AUTH_METHOD}
     volumes:
-      - 4cat_data:/var/lib/postgresql/data/
+      - 4cat_db:/var/lib/postgresql/data/

-  api:
-    build: .
-    container_name: api
+  backend:
+    build:
+      context: .
+      dockerfile: docker/Dockerfile_backend
+      args:
+        - POSTGRES_USER=${POSTGRES_USER}
+        - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
+        - POSTGRES_DB=${POSTGRES_DB}
+    container_name: 4cat_backend
     depends_on:
       - db
-    environment:
-      STAGE: test
-      SQLALCHEMY_DATABASE_URI: postgresql+psycopg2://fourcat:supers3cr3t@db/fourcat
-    networks:
-      - default
+    ports:
+      - 4444:4444
+    command: ${POSTGRES_DB} ${POSTGRES_USER}
+    volumes:
+      - 4cat_data:/usr/src/app/data/
+      - 4cat_share:/usr/src/app/docker/shared/
+
+  frontend:
+    build:
+      context: .
+      dockerfile: docker/Dockerfile_frontend
+    container_name: 4cat_frontend
+    depends_on:
+      - db
+      - backend
     ports:
+      - 443:443
       - 5000:5000
     volumes:
-      - ./app:/usr/src/app/app
-      - ./migrations:/usr/src/app/migrations
+      - 4cat_data:/usr/src/app/data/
+      - 4cat_share:/usr/src/app/docker/shared/

 volumes:
-  4cat_data: {}
+  4cat_db: {}
+  4cat_data: {}
+  4cat_share: {}