
Latent Dirichlet Allocation REST Web Service

This library provides a Python REST web service that exposes a simple pipeline to create and query LDA models. Model metadata is stored in a MongoDB database, and model files are stored in a shared folder on the host filesystem.

Project architecture

The system is composed of two Docker containers:

  • Python container: contains the API code. This container is mapped to port 5000 on localhost.
  • MongoDB container: contains the database server. The container is reachable at the hostname db on port 27017.

Run the containers and the application

Download the code from the repository, then edit the docker-compose.yml file to update the mount source directories that are shared between the containers and the host.

Then, run:

docker-compose build
docker-compose up
docker-compose exec web python db/load_fake_data.py
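
Once the containers are up, a quick way to verify that the service is responding is to call the model-listing endpoint from the host. This is a minimal sketch using the requests library; the exact response body depends on the data loaded.

    import requests

    # The API container is mapped to port 5000 on localhost.
    response = requests.get('http://localhost:5000/models/')
    print(response.status_code)  # expect 200 once the stack is up
    print(response.json())       # should include the fake models loaded above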

Python container composition

The main container is composed of:

  • a Flask application that exposes some REST routes (see below)
  • a library with algorithms to compute and query the LDA model
  • some support tools to load initial data into the associated database
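
As an illustration of the first component, a Flask route that lists models could look roughly like the following. This is a hedged sketch, not the repository's actual code: the route mirrors the API list below and the handler body is a placeholder.

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route('/models/', methods=['GET'])
    def list_models():
        # In the real service this would read model metadata from MongoDB.
        return jsonify([])

    if __name__ == '__main__':
        # Port 5000 is the one mapped to the host in docker-compose.yml.
        app.run(host='0.0.0.0', port=5000)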

Available APIs

The following list describes the available APIs:

GET models/
Lists all models. No parameters.

PUT models/
Creates a new model with the provided parameters.
  • model_id: str, the id of the model to be created
  • number_of_topics: int, the number of topics to extract
  • language: str, the language of the documents ('en' or 'it')
  • use_lemmer: bool, true to perform lemmatisation, false to perform stemming
  • min_df: int, the minimum number of documents that must contain a term for the term to be considered
  • max_df: float, the maximum fraction of documents that may contain a term for the term to be considered valid
  • chunksize: int, the size of a chunk in LDA
  • num_passes: int, the minimum number of passes over the dataset during LDA learning
  • waiting_seconds: int, the number of seconds to wait before starting the learning
  • data_filename: str, the name of a file in the data folder containing a JSON dump of documents, each with doc_id and doc_content keys
  • data: JSON dictionary of documents, with document ids as keys and document contents as values
  • assign_topics: bool, true to assign topics to the documents of the newly created model and save them to the db, false to skip assignments for the learning documents

GET models/<model-id>
Shows detailed information about the model with id <model-id>. No parameters.

GET models/<model-id>/documents/
Lists all documents assigned to the model with id <model-id>. No parameters.

DELETE models/<model-id>/
Deletes the model with the specified id and stops its computation if it is scheduled or running. No parameters.

GET models/<model-id>/documents/<doc-id>
Shows detailed information about the document with id <doc-id> in model <model-id>.
  • threshold: float, the minimum probability a topic must have to be returned as associated with the document

GET models/<model-id>/neighbors/
Computes and shows documents similar to the specified text.
  • text: str, the text to categorize
  • limit: int, the maximum number of similar documents to return

GET models/<model-id>/documents/<doc-id>/neighbors/
Computes and shows documents similar to the document with id <doc-id>.
  • limit: int, the maximum number of similar documents to return

GET models/<model-id>/topics/
Lists all topics of the model with id <model-id>, or extracts topics from a text if text is specified. Only when extracting topics from a text:
  • text: str, the text to compute topics for
  • threshold: float, the minimum weight a topic must have to be returned

SEARCH models/<model-id>/topics/
Computes and returns all topics assigned to the given text.
  • text: str, the text to compute topics for
  • threshold: float, the minimum weight a topic must have to be returned

GET models/<model-id>/topics/<topic-id>
Shows detailed information about the topic with id <topic-id> in model <model-id>.
  • threshold: float, the minimum probability a topic must have to be returned as associated with a document

GET models/<model-id>/topics/<topic-id>/documents
Shows all documents associated with the topic with id <topic-id> in model <model-id>.
  • threshold: float, the minimum probability of the topic a document must have to be returned as associated with the topic

PUT models/<model-id>/topics/<topic-id>/documents
Computes the topics associated with the provided documents in model <model-id> (a single document if doc_id and doc_content are set, multiple documents if documents is set).
  • documents: JSON dictionary, optional, keys are document ids and values are document contents
  • doc_id: str, optional, the document id (single-document case)
  • doc_content: str, optional, the document content (single-document case)
  • save_on_db: bool, default true, true to save documents and topic assignments to the db, false to return them without saving

PATCH models/<model-id>/topics/<topic-id>
Updates optional information of the topic with id <topic-id> in model <model-id>.
  • label: str, optional, the topic label
  • description: str, optional, the topic description
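
For example, a model can be created with a PUT request to models/. The sketch below uses the requests library with inline data; the parameter values are illustrative, and sending the parameters as a JSON body is an assumption about the service's request format.

    import requests

    payload = {
        'model_id': 'demo_model',  # illustrative id
        'number_of_topics': 5,
        'language': 'en',
        'use_lemmer': True,
        'min_df': 2,
        'max_df': 0.8,
        'chunksize': 2000,
        'num_passes': 10,
        'waiting_seconds': 0,
        'assign_topics': True,
        'data': {
            'doc_1': 'content of the first document',
            'doc_2': 'content of the second document',
        },
    }

    # Create the model; the computation is scheduled server-side.
    response = requests.put('http://localhost:5000/models/', json=payload)
    print(response.status_code, response.text)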

Load sample data

To invoke the module that loads fake data into the database, run the following command from the machine that is running Docker:

docker-compose exec web python db/load_fake_data.py

Database

To connect directly to the mongodb instance:

  • connect via ssh to the machine that hosts Docker (only if it is not the current machine)
  • the db is available on host db, port 27017
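
For instance, from inside the web container (or from the host, if the port is exposed in docker-compose.yml) the instance can be inspected with pymongo. The database name used here is an assumption; check the application configuration for the real one.

    from pymongo import MongoClient

    # 'db' is the hostname of the MongoDB container on the Docker network.
    client = MongoClient('db', 27017)
    database = client['lda']  # assumed database name
    print(database.list_collection_names())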

Models

When asking for a model's detailed information, the requested model can be in one of the following statuses:

  • scheduled, the model computation will start after the specified waiting period
  • computing, the model computation has been started and is currently running
  • completed, the model computation is finished and the model is stable
  • killed, the model computation has been interrupted by an error
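
A client can poll the model detail endpoint until a final status is reached. The following is a minimal sketch; the name of the status field in the response is an assumption.

    import time
    import requests

    def wait_for_model(model_id, poll_seconds=10):
        """Poll GET models/<model-id> until the model reaches a final status."""
        while True:
            info = requests.get(f'http://localhost:5000/models/{model_id}').json()
            status = info.get('status')  # field name assumed from the list above
            if status in ('completed', 'killed'):
                return status
            time.sleep(poll_seconds)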

Languages

The language can be specified in the model creation message. Each model handles a single language, chosen from:

  • en for English documents
  • it for Italian documents; stopwords are available in the /app/resources folder and lemmatisation is performed with MorphIt
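
For example, the creation payload for an Italian model with lemmatisation could include the following parameters (values other than language are illustrative):

    # Parameters selecting Italian processing: MorphIt lemmatisation and the
    # Italian stopwords from /app/resources are used under the hood.
    italian_model_params = {
        'model_id': 'news_it',  # illustrative id
        'number_of_topics': 10,
        'language': 'it',
        'use_lemmer': True,
    }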

Documents

During model computation it is possible to load documents in two ways:

  • load from file: provide the data_filename field in the request. The file must be a JSON file located in the data folder. The JSON should be a list of dictionaries; each dictionary represents a document and contains the keys doc_id and doc_content (see also the sketch after this list). For example:

      [
      	{"doc_id": "doc_1", "doc_content": "doc content 1"},
      	{"doc_id": "doc_2", "doc_content": "doc content 2"},
      	{"doc_id": "doc_3", "doc_content": "doc content 3"}
      ]
    
  • load directly: provide the documents in the data field. This field should contain a dictionary where keys are document ids and values are document contents.

      {
      	"doc_1": "doc content 1",
      	"doc_2": "doc content 2",
      	"doc_3": "doc content 3"
      }
    
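For the file-based mode, a helper like the following can produce a file in the expected format. This is a minimal sketch: the filename is illustrative, and the file must end up in the data folder shared with the web container.

    import json

    # A JSON list of {"doc_id": ..., "doc_content": ...} dictionaries,
    # as expected by the data_filename loading mode.
    documents = [
        {'doc_id': 'doc_1', 'doc_content': 'doc content 1'},
        {'doc_id': 'doc_2', 'doc_content': 'doc content 2'},
    ]

    # Write into the shared data folder, then pass
    # data_filename='my_documents.json' in the model creation request.
    with open('data/my_documents.json', 'w') as f:
        json.dump(documents, f)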

Useful commands

To build all containers

docker-compose build

To run all containers

docker-compose up

To exec a command within a running container (e.g. to load fake data into the mongo database):

docker-compose exec web COMMAND ARGS
