This library provides a Python REST web service that exposes a simple pipeline to create and query LDA models. Information about created models is stored in a MongoDB database, and model files are stored in a shared folder on the host filesystem.
The system is composed of two Docker containers:
- the Python container, which contains the API library code; it is mapped on `localhost` on port `5000`;
- the MongoDB container, which contains the database server; it is available with the `db` hostname on port `27017`.
Download the code from the repository, then edit the `docker-compose.yml` file to update all the mount source directories that will be shared between the containers and the host.
Then, run:
```
docker-compose build
docker-compose up
docker-compose exec web python db/load_fake_data.py
```
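Once both containers are up, a quick sanity check is to query the model list endpoint from the host. The following is a minimal sketch with the Python `requests` library, assuming the default `localhost:5000` mapping and a JSON response:

```python
import requests

# List all models; a fresh installation should return an empty collection.
response = requests.get("http://localhost:5000/models/")
response.raise_for_status()
print(response.json())
```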
The main container is composed of:
- a Flask application that exposes REST routes (see below)
- a library with algorithms to compute and query the LDA model
- some support tools to load initial data into the associated database
The following table describes the available APIs:
Endpoint | HTTP request | Description | Parameters |
---|---|---|---|
`models/` | GET | Lists all models. | - |
`models/` | PUT | Creates a new model with the provided parameters. | * `model_id`: str, the id of the model to be created; * `number_of_topics`: int, the number of topics to extract; * `language`: 'en', the language of the documents; * `use_lemmer`: bool, true to perform lemmatisation, false to perform stemming; * `min_df`: int, the minimum number of documents that must contain a term for it to be considered; * `max_df`: float, the maximum percentage of documents that may contain a term for it to be considered valid; * `chunksize`: int, the size of a chunk in LDA; * `num_passes`: int, the minimum number of passes through the dataset during LDA learning; * `waiting_seconds`: int, the number of seconds to wait before starting the learning; * `data_filename`: str, the name of a file in the 'data' folder that contains the data; the file should contain a JSON dump of documents, each with `doc_id` and `doc_content` keys; * `data`: JSON dictionary, a dictionary of documents with document ids as keys and document contents as values; * `assign_topics`: bool, true to assign topics to the newly created model and save them on the db, false to ignore assignments for the learning documents. |
`models/<model-id>` | GET | Shows detailed information about the model with id `<model-id>`. | - |
`models/<model-id>/documents/` | GET | Lists all documents assigned to the model with id `<model-id>`. | - |
`models/<model-id>/` | DELETE | Deletes the model with the specified id; stops the computation if it is scheduled or running. | - |
`models/<model-id>/documents/<doc-id>` | GET | Shows detailed information about the document with id `<doc-id>` in model `<model-id>`. | * `threshold`: float, the minimum probability that a topic must have to be returned as associated with the document. |
`models/<model-id>/neighbors/` | GET | Computes and shows documents similar to the specified text. | * `text`: str, the text to categorise; * `limit`: int, the maximum number of similar documents to extract. |
`models/<model-id>/documents/<doc-id>/neighbors/` | GET | Computes and shows documents similar to the document identified by `<doc-id>`. | * `limit`: int, the maximum number of similar documents to extract. |
`models/<model-id>/topics/` | GET | Lists all topics related to the model with id `<model-id>`, or extracts topics from a text if `text` is specified. | Only for extracting topics from a text: * `text`: str, the text to compute topics for; * `threshold`: float, the minimum weight a topic must have to be returned. |
`models/<model-id>/topics/` | SEARCH | Computes and returns all topics assigned to the text. | * `text`: str, the text to compute topics for; * `threshold`: float, the minimum weight a topic must have to be returned. |
`models/<model-id>/topics/<topic-id>` | GET | Shows detailed information about the topic with id `<topic-id>` in model `<model-id>`. | * `threshold`: float, the minimum probability that a topic must have to be returned as associated with the document. |
`models/<model-id>/topics/<topic-id>/documents` | GET | Shows all documents associated with the topic with id `<topic-id>` in model `<model-id>`. | * `threshold`: float, the minimum probability of the topic that a document must have to be returned as associated with the topic. |
`models/<model-id>/topics/<topic-id>/documents` | PUT | Computes the topics associated with the provided document(s) in model `<model-id>` (a single document if `doc_id` and `doc_content` are set, multiple documents if `documents` is set). | * `documents`: JSON dictionary, optional, keys are document ids and values are document contents; * `doc_id`: str, optional, the document id (single-document case); * `doc_content`: str, optional, the document content; * `save_on_db`: bool, default true, true to save documents and topic assignments on the db, false to return them and forget. |
`models/<model-id>/topics/<topic-id>` | PATCH | Updates optional information of the topic with id `<topic-id>` in model `<model-id>`. | * `label`: str, optional, the topic label; * `description`: str, optional, the topic description. |
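For example, a model can be created through the `PUT models/` route. The following is a sketch, not an excerpt from the codebase: it assumes the default `localhost:5000` mapping and that the route accepts a JSON body; all parameter values are illustrative.

```python
import requests

BASE_URL = "http://localhost:5000"  # assumes the default port mapping

# Parameter names come from the table above; values are illustrative.
payload = {
    "model_id": "example_model",
    "number_of_topics": 20,
    "language": "en",
    "use_lemmer": True,
    "min_df": 2,
    "max_df": 0.8,
    "chunksize": 2000,
    "num_passes": 10,
    "waiting_seconds": 0,
    "data_filename": "documents.json",  # hypothetical file in the data folder
    "assign_topics": True,
}

response = requests.put(f"{BASE_URL}/models/", json=payload)
response.raise_for_status()
print(response.json())
```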
To invoke the module that loads fake data into the database, run the following command from the machine that is running Docker:

```
docker-compose exec web python db/load_fake_data.py
```
To connect directly to the MongoDB instance:
- connect with ssh to the machine that hosts Docker (optional: only if it is not the current machine)
- the db is available on host `db`, port `27017`
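For instance, with `pymongo` installed, a connection opened from within a container on the same Docker network might look like this sketch (database and collection names are not documented here, so the inspection is kept generic):

```python
from pymongo import MongoClient

# 'db' is the hostname of the MongoDB container on the Docker network.
client = MongoClient("db", 27017)

# List the databases the service has created so far.
print(client.list_database_names())
```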
When asking for a model's detailed information, the model can be in one of the following statuses:
- `scheduled`: the model computation will start after the specified waiting period
- `computing`: the model computation has started and is currently running
- `completed`: the model computation is finished and the model is stable
- `killed`: the model computation has been interrupted by an error
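Because model computation runs asynchronously, a client typically polls the model detail route until the status settles. The sketch below assumes the detail response is JSON and exposes the status under a `status` key; both are assumptions about the response schema.

```python
import time

import requests

BASE_URL = "http://localhost:5000"  # assumes the default port mapping

def wait_for_model(model_id, poll_seconds=5):
    """Poll the model detail route until the computation ends."""
    while True:
        info = requests.get(f"{BASE_URL}/models/{model_id}").json()
        status = info.get("status")  # 'status' key is an assumed schema detail
        if status in ("completed", "killed"):
            return info
        time.sleep(poll_seconds)
```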
The language can be specified in the model creation message. Each model can handle only one language, chosen from:
- `en` for English documents
- `it` for Italian documents; stopwords are available in the `/app/resources` folder and lemmatisation is performed with MorphIt
During model creation it is possible to load documents in two ways (a request sketch for both follows this list):

- load from file: provide the `data_filename` field in the request. The file must be a JSON file located in the data folder. The JSON should be a list of dictionaries, each representing a document and containing the keys `doc_id` and `doc_content`. For example:

```json
[
  {"doc_id": "doc_1", "doc_content": "doc content 1"},
  {"doc_id": "doc_2", "doc_content": "doc content 2"},
  {"doc_id": "doc_3", "doc_content": "doc content 3"}
]
```

- load directly: provide the documents in the `data` field. This field should contain a dictionary of key:value pairs where keys are document ids and values are document contents:

```json
{
  "doc_1": "doc content 1",
  "doc_2": "doc content 2",
  "doc_3": "doc content 3"
}
```
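To make the two options concrete, here is a sketch of both requests with the Python `requests` library; the field names come from the API table above, while the base URL, the JSON body encoding, and the file name `corpus.json` are assumptions.

```python
import requests

BASE_URL = "http://localhost:5000"  # assumes the default port mapping

# Option 1: load from file. 'corpus.json' is a hypothetical file placed in
# the shared data folder, using the list-of-dictionaries format shown above.
requests.put(f"{BASE_URL}/models/", json={
    "model_id": "model_from_file",
    "number_of_topics": 10,
    "language": "en",
    "data_filename": "corpus.json",
})

# Option 2: load directly, passing the id -> content dictionary inline.
requests.put(f"{BASE_URL}/models/", json={
    "model_id": "model_inline",
    "number_of_topics": 10,
    "language": "en",
    "data": {
        "doc_1": "doc content 1",
        "doc_2": "doc content 2",
        "doc_3": "doc content 3",
    },
})
```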
To build all containers:

```
docker-compose build
```

To run all containers:

```
docker-compose up
```

To execute a command within a running container (e.g. to load fake data into the Mongo database):

```
docker-compose exec web COMMAND ARGS
```