Hyphe relies on the following main components/directories:

- `memory_structure`: a Java Lucene database designed for WebEntities & WebEntityLinks handling, served as an API using Thrift
- `hyphe_backend`: Python 2.6/2.7 controllers for the crawling and backend API, with a MongoDB buffer database to store crawled data
  - `core.tac`: a Twisted-based JSON-RPC API controller
  - `crawler`: a Scrapy spider project to build and deploy on ScrapyD
  - `lib`: shared libraries between the two
  - `memory_structure`: Thrift-generated classes for easy dialogue with the Lucene MemoryStructure from Python
- `hyphe_frontend`: a JavaScript web application powered by Angular.js to constitute and explore web corpora through the backend API
Other useful directories are:

- `bin` for the executable scripts
- `config` where all useful configuration files are
- `doc` with this documentation among a few others
Note: `hyphe_www_backend` is the source code of an older implementation of the JavaScript web frontend, meant to work with Hyphe in MonoCorpus mode (see the MULTICORPUS setting); it is not maintained anymore. `_deprecated` gathers old pieces of code and documentation from the past.
The MemoryStructure relies on a specific Lucene database made accessible through Apache Thrift, which allows the Python core to call the MemoryStructure's Java API. Building it produces both a compiled jar and a set of Python classes. The Python core starts one instance of the MemoryStructure jar for each corpus (and automatically shuts it down when inactive); see the dedicated code in `hyphe_backend/lib/corpus.py`.
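The lifecycle pattern looks roughly like this (a simplified sketch for illustration only, not the actual `corpus.py` code; the class and argument handling are assumptions, only the jar invocation matches the command documented further below):

```python
# Simplified sketch of the per-corpus MemoryStructure lifecycle described above.
# Not the actual corpus.py code: class name and argument handling are illustrative.
import subprocess

class CorpusMemoryStructure(object):
    def __init__(self, corpus_id, thrift_port, log_level="INFO"):
        self.corpus_id = corpus_id
        self.thrift_port = thrift_port
        self.log_level = log_level
        self.process = None

    def start(self):
        # One dedicated jar instance per corpus, listening on its own Thrift port
        self.process = subprocess.Popen([
            "java", "-server", "-Xms256m", "-Xmx1024m",
            "-jar", "hyphe_backend/memorystructure/MemoryStructureExecutable.jar",
            "log.level=%s" % self.log_level,
            "thrift.port=%s" % self.thrift_port,
            "corpus=%s" % self.corpus_id,
        ])

    def stop(self):
        # Called when the corpus has been inactive for too long
        if self.process:
            self.process.terminate()
            self.process = None
```
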
All of this means that whenever the code in the `memory_structure` directory is modified, the jar and the Python classes running the MemoryStructure need to be rebuilt. A dedicated script handles this:

```bash
bin/build_thrift.sh
```
The Lucene data model is defined in `src/main/java/fr/sciencespo/medialab/hci/memorystructure/index/IndexConfiguration.java`.
The Thrift API and its list of routines are defined in `src/main/java/memorystructure.thrift` and `src/main/java/fr/sciencespo/medialab/hci/memorystructure/thrift/MemoryStructureImpl.java`. All other files in `src/main/java/fr/sciencespo/medialab/hci/memorystructure/thrift` are autogenerated and shouldn't be modified, except `ThriftServer` which configures the API.
Most of the algorithmic logic lives in `memory_structure/src/main/java/fr/sciencespo/medialab/hci/memorystructure/index/` and `memory_structure/src/main/java/fr/sciencespo/medialab/hci/memorystructure/cache/`.
To run a single MemoryStructure for tryouts without starting a Hyphe corpus, you can use the following command with example arguments:

```bash
java -server -Xms256m -Xmx1024m -jar hyphe_backend/memorystructure/MemoryStructureExecutable.jar log.level=DEBUG thrift.port=13500 corpus=TEST
```
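Once such an instance is running, it can be queried from Python through the Thrift bindings generated by `bin/build_thrift.sh`. The sketch below is illustrative only: the exact import path and available routines depend on the generated classes and on `memorystructure.thrift`.

```python
# Minimal sketch of connecting to a standalone MemoryStructure over Thrift.
# The import path, service name and routine called are assumptions for
# illustration; the real bindings are the ones generated into hyphe_backend/memorystructure.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from memorystructure import MemoryStructure  # assumed generated module

transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 13500))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = MemoryStructure.Client(protocol)

transport.open()
try:
    # Call any routine declared in memorystructure.thrift here; ping() is a
    # hypothetical example, check the .thrift file for the actual method names.
    print client.ping()
finally:
    transport.close()
```
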
Hyphe's crawler is implemented as a Scrapy spider which needs to be deployed on the ScrapyD server for each corpus; the core API automatically takes care of this whenever a corpus is created (more information here).

For debugging purposes, it can be deployed as follows for a specific corpus:

```bash
bin/deploy_scrapy_spider.sh <corpus_name>
```
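To verify what actually got deployed, ScrapyD's own HTTP API can be queried directly, for instance as follows (assuming ScrapyD runs locally on its default port 6800; adjust the host and port otherwise):

```python
# Quick check of the projects deployed on the local ScrapyD instance.
# Assumes ScrapyD listens on its default port 6800.
import json
import urllib2

projects = json.load(urllib2.urlopen("http://localhost:6800/listprojects.json"))
print projects
```
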
Whenever `config.json` or the code in `hyphe_backend/crawler` or `hyphe_backend/lib/urllru.py` is modified, the spider needs to be redeployed on the ScrapyD server for the changes to be taken into account. You can do this either by hand, by running the previous command, or by calling the core API's `crawl.deploy_crawler` method (see the API documentation).
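As a sketch, such a call could be made with a plain JSON-RPC request (the API port comes from `config.json`, and the assumption that `crawl.deploy_crawler` takes only the corpus id as parameter should be checked against the API documentation):

```python
# Sketch: redeploy the crawler for a corpus through the core JSON-RPC API.
# The port and the exact signature of crawl.deploy_crawler are assumptions;
# check config.json and the API documentation for the real values.
import json
import urllib2

API_URL = "http://localhost:6978/"  # adjust to the core API port in config.json
payload = json.dumps({
    "method": "crawl.deploy_crawler",
    "params": ["test"],  # corpus id, assumed to be the only argument
    "id": 1,
})
print urllib2.urlopen(API_URL, payload).read()
```
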
The entire frontend relies on calls to the core API, which can also very well be scripted or reimplemented. This is especially useful to exploit some of Hyphe's functionalities which are not available from the web interface yet (for instance, tag all webentities from a list of URLs using a CSV of tags, crawl all IN webentities, etc.).

All of the API's functions are catalogued and described in the API documentation.
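Reusing the same kind of JSON-RPC call as above, tagging the webentities matching a list of URLs could be scripted with methods that also appear in the command-line examples below (a sketch only: the port, the method signatures and the shape of the responses should be checked against `config.json` and the API documentation):

```python
# Sketch of scripting the core API: fetch the webentity for each URL, then tag it.
# The API port, method signatures and response shape are assumptions to verify
# against config.json and the API documentation.
import json
import urllib2

API_URL = "http://localhost:6978/"  # adjust to the core API port in config.json
CORPUS = "test"

def call(method, *params):
    payload = json.dumps({"method": method, "params": list(params), "id": 1})
    return json.loads(urllib2.urlopen(API_URL, payload).read())

for url in ["http://medialab.sciences-po.fr", "http://www.sciences-po.fr"]:
    we = call("store.get_webentity_for_url", url, CORPUS)
    weid = we["result"]["id"]  # assumed shape of the JSON-RPC response
    call("store.add_webentity_tag_value", weid, "USER", "MyTags", "GreatValue", CORPUS)
```
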
A simple Python script, `hyphe_backend/test_client.py` (which could certainly be greatly improved), provides a way to call the API from the command line by stacking the arguments after the name of the called function, and using the keyword `array` before any rich argument such as an array or an object. For instance:
```bash
source $(which virtualenvwrapper.sh)
workon hyphe
./hyphe_backend/test_client.py get_status
./hyphe_backend/test_client.py create_corpus test
./hyphe_backend/test_client.py declare_page http://medialab.sciences-po.fr test
./hyphe_backend/test_client.py declare_pages array '["http://medialab.sciences-po.fr", "http://www.sciences-po.fr"]' test
WEID=$(./hyphe_backend/test_client.py store.get_webentity_for_url http://medialab.sciences-po.fr test |
  grep "u'id':" |
  sed -r "s/^.*: u'(.*)',/\1/")
./hyphe_backend/test_client.py store.add_webentity_tag_value $WEID USER MyTags GreatValue test
./hyphe_backend/test_client.py crawl_webentity $WEID 1 False IN prefixes array '{}' test
```
Multiple examples of advanced routines run directly from the shell using this command-line client can be found in `bin/samples/`, although these are presently deprecated since they were written for the old MONOCORPUS version of Hyphe and still need to be updated.
The JavaScript dependencies are currently shipped with the git sources until a proper grunt/gulp build is set up. In the meantime, to update the dependencies, you can run the following after having installed Node.js:
```bash
sudo npm install -g bower
cd hyphe_frontend
bower install
```
Build the API's documentation:

```bash
bin/build_apidoc.sh
```

Update the list of TLDs used by the frontend from Mozilla's list:

```bash
bin/update_tlds_list.sh
```

Build a release of Hyphe (with an optional version identifier):

```bash
bin/build_release.sh <optional version_id>
```