# Enabling local data sources
Most of 4CAT's data sources use external APIs. However, the tool can also capture, store, and query locally saved data, for instance 4chan and 8kun data (see the data source overview for a list of all local data sources). These data are stored in a PostgreSQL database and can be queried with Sphinx search.
This page explains how to enable the collection and querying of local data sources.
## Enable local data source collection

The first step is to enable the collection of locally stored data.

### 1. Generate the database tables

Generate the database tables for the local data sources you want to add. This is done by running the SQL query stored in the `database.sql` file in the data source's folder under `datasources/` (e.g. `datasources/fourchan/database.sql`).
How to run this SQL query depends on your specific installation. Usually, this involves running a command through `psql`. On local 4CAT installations, you can also use the query tool in software like pgAdmin.
For example, the following commands add the database tables for 4chan collection to a `fourcat` database in a Docker installation:

```bash
docker exec -it 4cat_backend /bin/bash
cd datasources/fourchan/
psql --host=db --port=5432 --username=fourcat --dbname=fourcat -f database.sql
```
### 2. Enable the data source

Once the database tables are generated, enable the data source through 4CAT's web interface. Navigate to Control Panel -> Settings -> Data sources in 4CAT, then enable the desired data sources by ticking their checkboxes.
### 3. Configure the data source

Enabling a local data source adds a menu for that data source on the Data sources settings page in the Control Panel, where you may want to make some adjustments. For imageboards, for instance, you have to specify which boards you want to scrape. To add 4chan's /pol/: navigate to Control Panel -> Settings -> 4chan search, enable collecting, select the desired boards, and save. You can add more than one board to the list, e.g. `["pol", "v", "fit"]`. You can also specify the interval at which boards are scraped, and whether to also download images.
### 4. Restart 4CAT

Go to Control Panel -> Restart or Upgrade and click the Restart button. If you're using Docker, you can also use the Docker Desktop interface to stop and start the `4cat_backend` container. After 4CAT restarts, you should begin to see log messages showing collected data. Congrats! You're collecting data in a local PostgreSQL database.
## Enable local data source querying

The data source will now show up on the Create dataset page. However, for most local data sources, you will have to run a full-text search engine to execute text queries. To do so, install Sphinx search and index the database. The instructions differ depending on whether you use 4CAT through Docker or run it manually.

### Running Sphinx with Docker

#### Generate and edit the configuration file

Note: your desired data sources must be enabled (see the steps above) prior to creating the `sphinx.conf` file.
- Run `docker exec 4cat_backend python3 helper-scripts/generate_sphinx_config.py`.
- Copy the `sphinx.conf` file to the host machine's current working directory so you can edit it: `docker cp 4cat_backend:/usr/src/app/helper-scripts/sphinx.conf ./`. You will later copy the file to a new Sphinx container.
- Ensure `sql_host` is set to the 4CAT database container name, e.g. `sql_host = db` (older 4CAT versions did not do this automatically).
- Change the `listen` hosts from `localhost` to `0.0.0.0`, allowing Sphinx to receive connections from other containers and, if desired, your host machine:

```
listen = 0.0.0.0:9213
listen = 0.0.0.0:9306:mysql41
```
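After these edits, the relevant parts of `sphinx.conf` might look roughly like the sketch below. The source and index names are generated per data source by `generate_sphinx_config.py` and will differ in your file; the names shown here are illustrative assumptions.

```
# Sketch only: source/index names and credentials are generated
# by generate_sphinx_config.py and will differ per installation.
source 4chan_posts
{
    type     = pgsql
    sql_host = db        # name of the 4CAT database container
    sql_user = fourcat
    sql_db   = fourcat
    ...
}

searchd
{
    listen = 0.0.0.0:9213
    listen = 0.0.0.0:9306:mysql41
    ...
}
```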
#### Create a Sphinx container

This container will index your collected data and allow you to search the data with 4CAT (Docker image: `macbre/sphinxsearch`).

```bash
docker run -it --publish 9306 --name 4cat_sphinx -d macbre/sphinxsearch:3.3.1 /bin/sh
```

#### Connect the Sphinx container to the 4CAT network

- Run `docker network ls` to identify the 4CAT network, likely `4cat_default`.
- Run `docker network connect 4cat_default 4cat_sphinx`, assuming `4cat_default` is the name of your 4CAT network and you used the `--name 4cat_sphinx` option when creating the `sphinxsearch` container in the previous step.
#### Point 4CAT to the Sphinx container

Prior to 2023-07, the host for Sphinx was hard-coded to run alongside 4CAT, but it must be updated for a Docker container setup. Edit the "Sphinx host" setting in 4CAT via Control Panel -> Settings -> 4CAT Tool Settings:

- Set "Sphinx host" to the name of the `sphinxsearch` container (e.g. `4cat_sphinx`), or
- Run `docker network inspect 4cat_default` after adding the Sphinx container to the network, find the new Sphinx container in the Containers section, record its IPv4Address, and set "Sphinx host" to that IP address.
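`docker network inspect` prints JSON, so the IPv4Address can also be pulled out programmatically. A minimal Python sketch; the sample JSON below is hypothetical and trimmed to the relevant fields:

```python
import json

# Hypothetical, heavily trimmed output of `docker network inspect 4cat_default`;
# the real output contains many more fields.
inspect_output = """
[
  {
    "Name": "4cat_default",
    "Containers": {
      "abc123": {"Name": "4cat_sphinx", "IPv4Address": "172.18.0.5/16"},
      "def456": {"Name": "4cat_backend", "IPv4Address": "172.18.0.2/16"}
    }
  }
]
"""

def container_ip(inspect_json: str, container_name: str) -> str:
    """Return the bare IP of the named container (strips the /prefix)."""
    (network,) = json.loads(inspect_json)
    for container in network["Containers"].values():
        if container["Name"] == container_name:
            return container["IPv4Address"].split("/")[0]
    raise KeyError(f"{container_name} not found on network")

print(container_ip(inspect_output, "4cat_sphinx"))  # value for "Sphinx host"
```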
#### Edit the 4chan data source

The 4chan data source alone must also be edited. In `datasources/fourchan/search_4chan.py`, change the `MySQLDatabase` host (default is `localhost`) to the Sphinx container IP address found via `docker network inspect 4cat_default`.

- You can copy the file to your host directory to edit it (`docker cp 4cat_backend:/usr/src/app/datasources/fourchan/search_4chan.py ./`), or edit it directly in the container if desired.
- After updating, copy it back to the `4cat_backend` container: `docker cp datasources/fourchan/search_4chan.py 4cat_backend:/usr/src/app/datasources/fourchan/`.
#### Copy the configuration file and run the indexer

```bash
# copy sphinx.conf file
docker cp sphinx.conf 4cat_sphinx:/opt/sphinx/sphinx-3.3.1/bin/
# connect to container
docker exec -it 4cat_sphinx /bin/sh
# navigate to sphinx-3.3.1/bin/
cd /opt/sphinx/sphinx-3.3.1/bin/
# create data and data/binlog folders IN sphinx folder (sphinx-3.3.1/data/)
mkdir ../data
mkdir ../data/binlog
# run indexer
./indexer --all
# start searchd
./searchd
```
You can check what Sphinx is listening on by connecting to the Sphinx container (`docker exec -it sphinx_container_id /bin/bash`) and running `netstat -nlp`.
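If `netstat` isn't available, a quick TCP connection attempt from the host (or from another container on the network) also shows whether `searchd` is up. A minimal Python sketch, assuming the default ports from the configuration above and the `4cat_sphinx` container name:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 9213 is the Sphinx API listener and 9306 the MySQL-protocol listener
# from the sphinx.conf edits above; substitute your Sphinx host or IP.
for port in (9213, 9306):
    print(port, "open" if port_open("4cat_sphinx", port) else "closed")
```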
### Running Sphinx manually

If you're not using Docker, you can also install and run Sphinx manually.

- Follow steps 1 through 4 at Enable local data source collection above.
- Download the Sphinx 3.3.1 source code.
- Create a Sphinx directory somewhere, e.g. in the directory of your 4CAT instance: `4cat/sphinx/`. Into it, extract all the contents of the sphinx-3.3.1.zip file you just downloaded (so that it contains the directories `api`, `bin`, etc.). In the Sphinx directory, also create a folder called `data`, and in this `data` directory, one called `binlog`.
- Add a configuration file. You can generate one by running the `generate_sphinx_config.py` script in the `helper-scripts` folder. Ensure that the desired data sources requiring Sphinx (e.g. 4chan, 8chan, 8kun) are enabled in Control Panel -> Settings -> Data sources before running the script, and follow the instructions in their READMEs to create their respective SQL databases. After running `generate_sphinx_config.py`, a file called `sphinx.conf` will appear in the `helper-scripts` directory. Copy this file to the `bin` folder in your Sphinx directory (in the case of the example above: `4cat/sphinx/bin/sphinx.conf`).
- Generate indexes for the posts that you already collected (if you haven't run any scrape yet, you can do this later). Generating indexes means Sphinx will create fast lookup tables so words can be searched quickly. In your command line interface, navigate to the `bin` directory of your Sphinx installation and run `./indexer --all` (Linux) or `indexer.exe --all` (Windows). This should generate the indexes.
  - If you get the error `No such file or directory, will not index.`, make sure there's a `data` folder in the Sphinx directory.
- Finally, before executing any search queries, make sure Sphinx is active. In your command line interface, run `./searchd` (Linux) or `searchd.exe` (Windows; see known issues below if you get an error), once again from Sphinx's `bin` folder.
See the Sphinx docs for more information.
## Updating the indexes

You will need to rerun `./indexer --all` to update Sphinx's indexes with newly collected data. As your data grows, this can take a lot of time; we run it nightly (via a cronjob script).
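A nightly run could be scheduled with a crontab entry along these lines. This is a sketch: the paths are hypothetical, and the `--rotate` flag tells the indexer to rebuild indexes and hand them over to a running `searchd` without stopping it.

```
# m h dom mon dow  command  (runs nightly at 03:00; adjust paths to your install)
0 3 * * * cd /path/to/sphinx/bin && ./indexer --all --rotate >> /var/log/sphinx-indexer.log 2>&1
```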
## Known issues

- On Windows, you might encounter the error `The code execution cannot proceed because ssleay32.dll was not found` (see also this page). This can be solved by downloading Sphinx version 3.1.1 and copying the following files from the 3.1.1 `bin` directory to your 3.3.1 `bin` directory:
  - libeay32.dll
  - msvcr120.dll
  - ssleay32.dll
- On Linux, you might run into permission issues. Make sure to execute the scripts as the right user.
🐈🐈🐈🐈