Vectorian is a high-performance search engine for intertextual references powered by fastText, spaCy and simileco.
Installation should work on macOS and Linux. On Windows you should use Docker or a VM.
conda create --name vectorian python=3.7
conda activate vectorian
git clone https://github.com/poke1024/vectorian
cd vectorian
# install core dependencies
pip install -r requirements.txt
# download spaCy's large English model
conda activate vectorian
python -m spacy download en_core_web_lg
Download wiki-news-300d-1M-subword from https://fasttext.cc/docs/en/english-vectors.html.
Unzip it and place the vectors file at /path/to/vectorian/data/fasttext/wiki-news-300d-1M-subword.vec.
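Before the first start, it can help to confirm that the vectors file sits where Vectorian expects it. The following is a minimal sketch and not part of Vectorian: read_vec_header is a hypothetical helper. It relies on the fastText text format, in which the first line of a .vec file holds the vocabulary size and the vector dimension.

```python
from pathlib import Path


def read_vec_header(vec_path):
    """Return (word_count, dimension) from the first line of a fastText .vec file.

    Hypothetical helper for illustration; not part of Vectorian.
    """
    path = Path(vec_path)
    if not path.is_file():
        raise FileNotFoundError(f"no vectors file at {path}")
    # Only the first line is read, so this stays fast even for large files.
    with path.open("r", encoding="utf-8", errors="replace") as f:
        fields = f.readline().split()
    if len(fields) != 2:
        raise ValueError(f"{path} does not start with a '<count> <dim>' header")
    return int(fields[0]), int(fields[1])
```

Calling it on /path/to/vectorian/data/fasttext/wiki-news-300d-1M-subword.vec should report a dimension of 300. A FileNotFoundError suggests the file was unzipped into a different folder.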
There's also support for the larger crawl-300d-2M-subword. This is not recommended for standard installations, as loading and preprocessing times are high.
Eigen is needed by Vectorian's C++ backend and by simileco. Vectorian needs a current version of Eigen >= 3.90. Since this is not the stable version, you need to install it manually (note that this is a header-only library that does not need compilation):
git clone https://gitlab.com/libeigen/eigen
cd eigen
mkdir build
cd build
cmake ..
sudo make install
Pyarrow C++ headers are also needed. Install via:
conda install -c conda-forge pyarrow
On some versions of macOS, you might need to patch eigen:
Text data lives inside Vectorian's data/corpus folder. You add files and Vectorian will preprocess and load them automatically on startup.
However, you need to follow a fixed structure covering three types of files so that Vectorian can preprocess them optimally.
Files are accordingly organized into three subfolders:
data/corpus/shakespeare: receives XML Shakespeare files. The files have to be in the format used by playshakespeare.com.
data/corpus/novels: contains one folder per author, each holding plain text files of that author's novels.
data/corpus/screenplays: contains screenplays.
Not all of these folders have to exist; you can, for example, add only novels.
Here's an example layout:
corpus
    novels
        Charles Dickens
            Hard Times.txt
            The Pickwick Papers.txt
        Jane Austen
            Northanger Abbey.txt
    screenplays
        that_exciting_series
            pilot.txt
            series1
                season1.txt
    shakespeare
        ps_hamlet.xml
        ps_henry_v.xml
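To check that a corpus folder follows this layout before starting Vectorian, you could list what it contains. This is a rough sketch under the assumption that novels are stored as .txt files one level below each author folder; list_novels is a hypothetical helper, and Vectorian's own loader may behave differently.

```python
from pathlib import Path


def list_novels(corpus_root):
    """Map each author folder under <corpus_root>/novels to its sorted novel titles.

    Hypothetical helper for inspecting the layout; not Vectorian's loader.
    """
    novels_dir = Path(corpus_root) / "novels"
    if not novels_dir.is_dir():
        # The novels subfolder is optional, so a missing folder is not an error.
        return {}
    result = {}
    for author_dir in sorted(novels_dir.iterdir()):
        if author_dir.is_dir():
            # Use the file name without the .txt suffix as the title.
            result[author_dir.name] = sorted(p.stem for p in author_dir.glob("*.txt"))
    return result
```

For the example layout, list_novels("data/corpus") would map "Charles Dickens" to two titles and "Jane Austen" to one. An author who is missing from the result usually points to a misplaced folder.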
The Vectorian directory can contain an optional .config.json file that configures additional behaviour.
conda activate vectorian
cd /path/to/vectorian
python ./srv/main.py
After starting up, Vectorian should be available at http://localhost:8080/.
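Because preprocessing can take a while, the server may not answer immediately. A script that depends on it could poll the address until it responds. The following is a sketch using only the Python standard library; wait_for_server and its default timeout values are illustrative assumptions, not part of Vectorian.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:8080/", timeout=60.0, interval=1.0):
    """Poll url until it sends any HTTP response or timeout seconds have passed.

    Returns True if the server responded, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server is up, even though this path returned an error status.
            return True
        except OSError:
            # Connection refused or timed out: the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A return value of False after a generous timeout suggests checking the terminal output of python ./srv/main.py for startup errors.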
If you see 'Eigen/Core' file not found during startup, it means that the Eigen library has not been installed properly (or is not on the compiler's include path). You can configure custom paths via srv/cpp/vcore.cpp.
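To narrow down this error, you can check which directory actually contains the Eigen headers. The snippet below is a sketch only: find_eigen_include is a hypothetical helper, and the two default directories are assumed typical install locations that may not match your system.

```python
from pathlib import Path

# Assumed typical install locations; adjust these for your system.
DEFAULT_INCLUDE_DIRS = ("/usr/include/eigen3", "/usr/local/include/eigen3")


def find_eigen_include(candidates=DEFAULT_INCLUDE_DIRS):
    """Return the first candidate directory that contains Eigen/Core, or None."""
    for candidate in candidates:
        directory = Path(candidate)
        # Eigen/Core is the extensionless header named in the error message.
        if (directory / "Eigen" / "Core").is_file():
            return directory
    return None
```

If this returns None for every directory you expect, the Eigen installation step probably did not complete. If it returns a directory, that is the path the build needs to see.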
On macOS, if you observe strange crashes related to numba or llvm, you might need to pin pyarrow to an older version (see numba/numba#4256):
pip install pyarrow==0.12.1
Also see: https://www.mail-archive.com/dev@arrow.apache.org/msg13667.html
Under GCC, there might be problems with ABI compatibility of the arrow and pyarrow libs (see https://arrow.apache.org/docs/python/development.html). Setting _GLIBCXX_USE_CXX11_ABI can help.
web/build.sh
c++ -O3 -larrow -Wall -shared -std=c++17 -I/usr/include/eigen3/ -fPIC `python3 -m pybind11 --includes` src.cpp -o vcore`python3-config --extension-suffix`
on macOS:
export DEBUG_VECTORIAN=1
lldb python -- vectorian/srv/main.py
on Linux:
export DEBUG_VECTORIAN=1
export ASAN_OPTIONS=verify_asan_link_order=0
gdb --args python vectorian/srv/main.py
Here's a template for Vectorian as a systemd service:
[Unit]
Description=The Vectorian
After=multi-user.target
[Service]
Type=simple
ExecStart=/your/python3 your/vectorian/srv/main.py
WorkingDirectory=your/vectorian
Restart=always
RestartSec=10
PrivateTmp=true
StandardOutput=syslog
StandardError=syslog
Environment=OPENBLAS_NUM_THREADS=2
[Install]
WantedBy=multi-user.target
Install this as /etc/systemd/system/vectorian.service.
You can then manage the service with these commands:
systemctl daemon-reload
systemctl start vectorian.service
systemctl status vectorian.service
tail -f /var/log/syslog