This repository has been archived by the owner on Sep 9, 2021. It is now read-only.

A search engine for intertextual references based on sequence alignments of word embeddings

poke1024/vectorian-legacy

Vectorian Screenshot

Vectorian is a high-performance search engine for intertextual references powered by fastText, spaCy and simileco.

Minimal Installation

Installation should work on macOS and Linux. On Windows you should use Docker or a VM.

Python Packages

conda create --name vectorian python=3.7
conda activate vectorian

git clone https://github.com/poke1024/vectorian
cd vectorian

# install core dependencies
pip install -r requirements.txt

Necessary Data Files and Additional Dependencies

spaCy en_core_web_lg

# download spaCy's large English model
conda activate vectorian
python -m spacy download en_core_web_lg

fastText

Download wiki-news-300d-1M-subword from https://fasttext.cc/docs/en/english-vectors.html.

Unzip the archive and place the file at /path/to/vectorian/data/fasttext/wiki-news-300d-1M-subword.vec.
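
As a quick sanity check after placing the file, you can read its header line. In fastText's text format the first line holds the vocabulary size and the vector dimension. This is only a sketch: the helper name read_vec_header is hypothetical and not part of Vectorian.

```python
from pathlib import Path


def read_vec_header(path):
    """Return (vocabulary_size, dimensions) from the first line of a .vec file."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as f:
        fields = f.readline().split()
    if len(fields) != 2:
        raise ValueError(f"unexpected header in {path}: {fields!r}")
    return int(fields[0]), int(fields[1])


# Example call (path as used in the instructions above):
# read_vec_header("/path/to/vectorian/data/fasttext/wiki-news-300d-1M-subword.vec")
```

For the 300-dimensional vectors named above, the second value should be 300. A ValueError suggests the file is truncated or is not the text (.vec) variant.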

There's also support for the larger crawl-300d-2M-subword. Note that this is not recommended for standard installations, as loading and preprocessing times are high.

Installing necessary libraries

Eigen is needed by Vectorian's C++ backend and by simileco. Vectorian needs a current version of Eigen >= 3.90. Since this is not the stable version, you need to install it manually (note that this is a header-only library that does not need compilation):

git clone https://gitlab.com/libeigen/eigen
cd eigen
mkdir build
cd build
cmake ..
sudo make install

Pyarrow C++ headers are also needed. Install via:

conda install -c conda-forge pyarrow
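
To confirm where the headers ended up, you can ask pyarrow itself. The sketch below uses pyarrow.get_include() and pyarrow.get_library_dirs(); the wrapper name pyarrow_build_paths is hypothetical, and it returns None when pyarrow is not importable in the active environment.

```python
def pyarrow_build_paths():
    """Return (include_dir, library_dirs) as reported by pyarrow, or None if it is missing."""
    try:
        import pyarrow
    except ImportError:
        return None  # pyarrow is not installed in this environment
    # get_include() is the directory holding the Arrow C++ headers;
    # get_library_dirs() lists directories to search when linking.
    return pyarrow.get_include(), pyarrow.get_library_dirs()


if __name__ == "__main__":
    print(pyarrow_build_paths())
```

Run it inside the activated vectorian environment so that it reports the same pyarrow installation Vectorian will use.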

Additional installation steps on macOS

On some versions of macOS, you might need to patch Eigen:

https://stackoverflow.com/questions/46356153/xcode-9-falls-to-build-partial-template-specialization-in-c

Adding Text Data to the Vectorian

Text data lives in Vectorian's data/corpus folder. Add files there and Vectorian will preprocess and load them automatically on startup.

However, the files must follow a fixed structure of three file types so that Vectorian can preprocess each type appropriately.

Files are accordingly organized into three subfolders:

  • data/corpus/shakespeare: contains Shakespeare plays as XML files. The files have to be in the format used by playshakespeare.com.
  • data/corpus/novels: contains one folder per author, each holding plain text files of that author's novels.
  • data/corpus/screenplays: contains screenplays.

Not all of these folders have to exist; you can, for example, add only novels.

Here's an example layout:

corpus
	novels
		Charles Dickens
			Hard Times.txt
			The Pickwick Papers.txt
		Jane Austen
			Northanger Abbey.txt
	screenplays
		that_exciting_series
			pilot.txt
			series1
				season1.txt
	shakespeare
		ps_hamlet.xml
		ps_henry_v.xml
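
Before starting Vectorian, it can help to list what a corpus folder actually contains. The following is a minimal sketch, not Vectorian's own loader: the helper list_corpus is hypothetical, and the file extensions are assumptions taken from the example layout (XML for Shakespeare, .txt for novels and screenplays).

```python
from pathlib import Path

# Subfolder names from the description above; extensions are assumed
# from the example layout and may not match Vectorian's real loader.
CORPUS_KINDS = {
    "shakespeare": ".xml",
    "novels": ".txt",
    "screenplays": ".txt",
}


def list_corpus(corpus_dir):
    """Map each existing corpus subfolder to its sorted relative file paths."""
    root = Path(corpus_dir)
    found = {}
    for kind, suffix in CORPUS_KINDS.items():
        folder = root / kind
        if not folder.is_dir():
            continue  # not all folders have to exist
        found[kind] = sorted(
            p.relative_to(folder).as_posix()
            for p in folder.rglob("*" + suffix)
            if p.is_file()
        )
    return found
```

Calling list_corpus("/path/to/vectorian/data/corpus") returns a dictionary keyed by subfolder name. A missing key means the subfolder does not exist, and an empty list means it exists but holds no files with the assumed extension.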

Using Vectorian

Configuration File

The Vectorian directory can optionally contain a .config.json file that configures additional behaviour.

Launching Vectorian

conda activate vectorian

cd /path/to/vectorian
python ./srv/main.py

After starting up, Vectorian should be available at http://localhost:8080/.
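
Because Vectorian preprocesses and loads its data on startup, the web interface may take a while to respond. The sketch below polls the address until it answers. The helper wait_until_up and its default timeout are hypothetical, and only the Python standard library is used.

```python
import time
import urllib.error
import urllib.request

# Bypass any configured HTTP proxy, since the server runs locally.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_until_up(url="http://localhost:8080/", timeout=60.0, interval=1.0):
    """Poll url until it returns any HTTP response or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with _opener.open(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            return True  # the server answered, even if with an error status
        except OSError:
            pass  # connection refused or timed out: not listening yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A call such as wait_until_up(timeout=600) returns True as soon as the server responds and False if it never does within the given time.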

Developer Instructions

Troubleshooting

general

If you see 'Eigen/Core' file not found during startup, it means that the Eigen library has not been installed properly (or is not on the compiler's include path). You can configure custom paths via srv/cpp/vcore.cpp.
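
To find out where Eigen is installed, you can probe a few common include directories for the Eigen/Core header. This is a sketch under assumptions: the candidate directories are typical install locations, not paths taken from Vectorian, and the helper find_eigen is hypothetical.

```python
from pathlib import Path

# Typical install locations; these are assumptions, adjust them to your system.
DEFAULT_CANDIDATES = (
    "/usr/local/include/eigen3",
    "/usr/include/eigen3",
    "/opt/homebrew/include/eigen3",
)


def find_eigen(candidates=DEFAULT_CANDIDATES):
    """Return the first directory that contains Eigen/Core, or None."""
    for directory in candidates:
        if (Path(directory) / "Eigen" / "Core").is_file():
            return directory
    return None


if __name__ == "__main__":
    print(find_eigen())
```

If this prints a directory, that is the path to use as the Eigen include directory. If it prints None, Eigen is either not installed or lives somewhere outside the candidate list.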

on macOS

On macOS, if you observe strange crashes related to numba or llvm, you might need to pin pyarrow to an older version (see numba/numba#4256):

pip install pyarrow==0.12.1

Also see: https://www.mail-archive.com/dev@arrow.apache.org/msg13667.html

on Ubuntu

Under GCC, there might be problems with ABI compatibility of the arrow and pyarrow libs (see https://arrow.apache.org/docs/python/development.html). Setting the _GLIBCXX_USE_CXX11_ABI macro can help.

Building elm modules for frontend

web/build.sh

Manually building the C++ component

c++ -O3 -Wall -shared -std=c++17 -I/usr/include/eigen3/ -fPIC `python3 -m pybind11 --includes` src.cpp -o vcore`python3-config --extension-suffix` -larrow

Debugging the C++ component

on macOS:

export DEBUG_VECTORIAN=1
lldb python -- vectorian/srv/main.py

on Linux:

export DEBUG_VECTORIAN=1
export ASAN_OPTIONS=verify_asan_link_order=0
gdb --args python vectorian/srv/main.py

Running Vectorian as a systemd service

Here's a template for Vectorian as a systemd service:

[Unit]
Description=The Vectorian
After=multi-user.target

[Service]
Type=simple
ExecStart=/your/python3 /your/vectorian/srv/main.py
WorkingDirectory=/your/vectorian
Restart=always
RestartSec=10
PrivateTmp=true
StandardOutput=syslog
StandardError=syslog
Environment=OPENBLAS_NUM_THREADS=2

[Install]
WantedBy=multi-user.target

Save this as /etc/systemd/system/vectorian.service.

You can then reload systemd, start the service, check its status and follow its log output:

systemctl daemon-reload

systemctl start vectorian.service
systemctl status vectorian.service

tail -f /var/log/syslog
