
Enabling local data sources

Dale Wahl edited this page Jul 20, 2023 · 5 revisions

Most of 4CAT's data sources use external APIs. However, the tool is also capable of capturing, storing, and querying locally saved data, for instance with 4chan and 8kun data (see the data source overview for a list of all local data sources). These data are stored in a PostgreSQL database and can be queried with Sphinx search.


This page explains how to enable the collection and querying of local data sources.

Enable local data source collection

The first step is to enable the collection of locally stored data.

Step 1: Add database tables

We first need to generate the database tables for the local data sources you want to add. This is done by running the SQL query stored in the database.sql file in the data source's datasources/ folder (e.g. datasources/fourchan/database.sql). How to run this SQL query will depend on your specific installation. Usually, this involves running a command through psql from the data source folder like so:

psql -U username -d mydatabase -a -f database.sql

On manual, local 4CAT installations, you can also use the query tool in software like pgAdmin. If you're using Docker, the following code adds the database tables for 4chan collection to a fourcat database with user fourcat:

docker exec -it 4cat_backend /bin/bash
cd datasources/fourchan/
psql --host=db --port=5432 --user=fourcat --dbname=fourcat -f database.sql

Step 2: Enable the data source

Once the database tables are generated, let's enable the data source through 4CAT's Web interface.

Navigate to Control Panel -> Settings -> Data sources in 4CAT. Then, enable the desired data sources by ticking their checkboxes.

[screenshot: the data source checkboxes in the Control Panel settings]

Step 3: Set individual data source settings

Enabling a local data source generates a specific menu for that data source on the Data sources settings page in the Control Panel (e.g. "4chan search"). Here you may need to make some adjustments. For imageboard data collection, for instance, you have to specify which boards you want to scrape, e.g. by adding 4chan's /pol/ like so:

[screenshot: the board list setting for the 4chan data source]

You can add more than one board to the list, e.g. ["pol", "v", "fit"]. You can also specify the interval with which boards are scraped, and whether to download images.

Step 4: Restart 4CAT to start collection

Go to Control Panel -> Restart or Upgrade and click the Restart button. If you're using Docker, you can also use the Docker Desktop interface to stop and start the 4cat_backend container.

After 4CAT restarts, you should begin to see log messages showing collected data.
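If you're using Docker but prefer the command line over Docker Desktop, the restart and a quick log check might look like this (assuming the default container name 4cat_backend):

```shell
# Restart the 4CAT backend container (name assumed from the default Docker setup)
docker restart 4cat_backend

# Follow the log output to confirm data is being collected
docker logs -f --tail 50 4cat_backend
```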

Congrats! You're collecting data in a local PostgreSQL database. The data source will now show up on the Create dataset page.


Enabling text search

To execute queries on most local data sources, however, you will also need to run a full-text search engine. To do so, we need to install Sphinx search and index the database.

The instructions will differ based on whether you're using 4CAT through Docker or if you're running it manually.

Installing and running Sphinx via Docker

Step 1: Create a sphinx.conf file

  • Run the command docker exec 4cat_backend python3 helper-scripts/generate_sphinx_config.py to create a Sphinx configuration file, which contains information on all of the enabled local data sources (per the steps above).
  • Copy the sphinx.conf file to the host machine's current working directory so you can edit it. You can do so with the command: docker cp 4cat_backend:/usr/src/app/helper-scripts/sphinx.conf ./ You will later copy this sphinx.conf file into a new Sphinx container.

Step 2: Update sphinx.conf file

  • Ensure sql_host is the 4CAT database container name, e.g., sql_host = db (older 4CAT versions did not do this automatically).
  • Change the listen hosts to 0.0.0.0 from localhost. This allows Sphinx to receive connections from other containers and, if desired, your host machine.
listen = 0.0.0.0:9213
listen = 0.0.0.0:9306:mysql41

Step 3: Create a sphinxsearch container

This container will index your collected data and allow you to search the data with 4CAT. It uses the macbre/sphinxsearch Docker image. To create the container, run the following command:

docker run -it --publish 9306 --name 4cat_sphinx -d macbre/sphinxsearch:3.3.1 /bin/sh

Step 4: Connect the Sphinx container to the 4CAT network

  • Run docker network ls to identify the 4CAT network, likely 4cat_default.
  • Run docker network connect 4cat_default 4cat_sphinx assuming 4cat_default is the name of your 4CAT network and you used the --name 4cat_sphinx option when creating the sphinxsearch container in the previous step.

Step 5: Update Sphinx host setting in 4CAT

Edit the "Sphinx host" setting in 4CAT via Control Panel -> Settings -> 4CAT Tool Settings

Option 1
  • Set "Sphinx host" to the name of the sphinxsearch container (e.g., 4cat_sphinx).
Option 2
  • Run docker network inspect 4cat_default after adding the Sphinx container to the network. Find the new Sphinx container in the Containers section and copy its IPv4Address.
  • In the 4CAT Control Panel, go to "4CAT Tool Settings" and change the "Sphinx host" value to the Sphinx IP address you just copied.
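As a shortcut for finding the IP address, docker network inspect accepts a Go-template --format flag, so the container names and IP addresses on the network can be printed directly (assuming the network is named 4cat_default):

```shell
# Print each container on the 4CAT network with its IPv4 address
# (addresses are shown with their subnet suffix, e.g. 172.18.0.3/16);
# copy the address next to the Sphinx container
docker network inspect 4cat_default \
  --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'
```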
Note on older 4CAT versions

Prior to July 2023, the host for Sphinx was hard-coded to run alongside 4CAT, so on a Docker container setup it must be updated manually. This only affects the 4chan data source. Change the hard-coded host in datasources/fourchan/search_4chan.py to the Sphinx container's IP address:

  • Change the MySQLDatabase host (default is localhost) to the Docker IP address found by inspecting the 4CAT Docker network (docker network inspect 4cat_default). You can copy the file to your host directory to edit it via docker cp 4cat_backend:/usr/src/app/datasources/fourchan/search_4chan.py ./, or edit it directly in the container.
  • After updating, copy the file back into the 4cat_backend container (i.e., docker cp datasources/fourchan/search_4chan.py 4cat_backend:/usr/src/app/datasources/fourchan/)

Step 6: Create indexes and run Sphinx

We finally need to create full-text search indexes for any data you have already collected. Indexing means Sphinx builds fast lookup tables so words can be searched quickly. Afterwards, we start Sphinx by running ./searchd. Follow these steps:

# Copy the `sphinx.conf` file we generated above to the sphinx bin folder
docker cp sphinx.conf 4cat_sphinx:/opt/sphinx/sphinx-3.3.1/bin/
# Connect to container
docker exec -it 4cat_sphinx /bin/sh
# Navigate to sphinx-3.3.1/bin/
cd /opt/sphinx/sphinx-3.3.1/bin/
# Create data and data/binlog folders IN the sphinx folder (sphinx-3.3.1/data/)
mkdir ../data
mkdir ../data/binlog
# run indexer
./indexer --all
# start searchd
./searchd

This generates full-text search indexes for all the local data sources you enabled and activates Sphinx. Make sure to keep the container running, and restart ./searchd whenever you restart the container!

To index newly collected posts, you can run docker exec 4cat_sphinx /bin/sh -c "cd /opt/sphinx/sphinx-3.3.1/bin/ && ./indexer --all --rotate" whenever the container is running.

Docker troubleshooting

  • You can check what Sphinx is listening on by connecting to the Sphinx container (docker exec -it 4cat_sphinx /bin/sh) and running netstat -nlp.

Installing and running Sphinx manually

If you're not using Docker, you can also install and run Sphinx manually.

  1. Download the Sphinx 3.3.1 source code.
  2. Create a sphinx directory somewhere in the directory of your 4CAT instance, e.g. 4cat/sphinx/. In it, paste all the unzipped contents of the sphinx-3.3.1.zip file you just downloaded (so that it's filled with the directories api, bin, etc.). In the sphinx directory, also create a folder called data, and in this data directory, one called binlog.
  3. Add a Sphinx configuration file. You can generate one by running the generate_sphinx_config.py script in the folder helper-scripts. After running generate_sphinx_config.py, a file called sphinx.conf will appear in the helper-scripts directory. Copy-paste this file to the bin folder in your sphinx directory (in the example above: 4cat/sphinx/bin/sphinx.conf).
  4. Generate indexes for the posts that you already collected (if you haven't run any scrape yet, you can do this later). Generating indexes means Sphinx will create fast lookup tables so words can be searched quickly. In your command line interface, navigate to the bin directory of your Sphinx installation and run the command ./indexer --all (Linux) or indexer.exe --all (Windows). This should generate the indexes.
    • If you get the error No such file or directory, will not index., make sure there's a data folder in the sphinx directory.
  5. Finally, before executing any search queries, make sure Sphinx is active. In your command line interface, run ./searchd (Linux) or searchd.exe (Windows; see known issues below if you get an error), once again within Sphinx's bin folder. Make sure to leave this process running (you may want to use something like tmux).
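Put together, the numbered steps above can be sketched as a shell session (paths follow the 4cat/sphinx/ example; adjust them to your layout):

```shell
# Sketch of steps 2-5 above, run from the root of your 4CAT instance
# on Linux; assumes Sphinx 3.3.1 was unzipped into 4cat/sphinx/
cd sphinx
mkdir -p data/binlog                                         # step 2: data/ and data/binlog/
(cd ../helper-scripts && python3 generate_sphinx_config.py)  # step 3: generate sphinx.conf
cp ../helper-scripts/sphinx.conf bin/                        # step 3: config next to the binaries
cd bin
./indexer --all                                              # step 4: index collected posts
./searchd                                                    # step 5: start the search daemon
```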

See the Sphinx docs for more information.


Sphinx is now ready for search via 4CAT!

You will need to re-run the indexer (docker exec 4cat_sphinx /bin/sh -c "cd /opt/sphinx/sphinx-3.3.1/bin/ && ./indexer --all --rotate" for Docker, ./indexer --all for Linux, and indexer.exe --all for Windows) to update Sphinx's indexes with newly collected data. As your data grows, this can take a lot of time, so we run the indexer nightly via a cronjob script.
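The nightly re-index can be automated with cron; a hypothetical crontab entry (added via crontab -e) for a Docker setup, reusing the re-index command above, might look like:

```shell
# m h dom mon dow  command -- re-index Sphinx every night at 03:00
0 3 * * * docker exec 4cat_sphinx /bin/sh -c "cd /opt/sphinx/sphinx-3.3.1/bin/ && ./indexer --all --rotate"
```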

Known Issues

  • On Windows, you might encounter the error The code execution cannot proceed because ssleay32.dll was not found (see also this page). This can be solved by downloading Sphinx version 3.1.1 and copying the following files from the 3.1.1 bin directory to your 3.3.1 bin directory:
    • libeay32.dll
    • msvcr120.dll
    • ssleay32.dll
  • On Linux, you might run into permission issues. Make sure to execute the scripts with the right user.