This repository builds a Docker image for LinTO's NLP service for keyword and keyphrase extraction. The image can be deployed as a task on the LinTO NLP services stack or as a standalone service (see the Develop section below). It is based on the LinTO microservices template.
The folder structure is as follows:
contains Celery-related files for connectivity, registration and the task definition.
document: contains the Swagger definition file.
http_server: contains HTTP serving files, centered around the API definition.
contains the code related to the keyword extraction algorithms.
The service requires Docker to be up and running.
In task mode, the service's only entry point is tasks posted to a Redis message broker through Celery.
The service can be deployed in two different ways:
- As a standalone service through an HTTP API.
- As a micro-service connected to a task queue.
1- The first step is to build the image, or to pull it from the registry:
docker compose build
docker pull [TBR - REGISTRY URL]
Fill the .env with your values.
Variable | Description | Example |
SERVICES_BROKER | Service broker URI | redis://my_redis_broker:6379 |
BROKER_PASS | Service broker password (leave empty if there is no password) | my_password |
QUEUE_NAME | (Optional) Overrides the generated queue name (see Queue name below) | my_queue |
SERVICE_NAME | Service's name | keyword_extraction_fr |
SERVICE_MODE | Whether the service is launched as a task or standalone | task |
LANGUAGE | Language as a BCP-47 code | en-US or * or languages separated by "|" |
CONCURRENCY | Number of workers (1 worker = 1 CPU) | >1 |
TOKENIZERS_PARALLELISM | Activates parallelism for tokenizers | False |
2- Run with docker
docker run --rm \
--env-file .env \
This runs a container that serves an HTTP API bound to the host port HOST_SERVING_PORT.
⚠️ Not fully tested.
The service can be deployed as a microservice. Used this way, the container spawns Celery workers that wait for keyword extraction tasks on a dedicated task queue. Task mode requires a configured Redis broker.
You need a message broker up and running at MY_SERVICE_BROKER. Instances are typically deployed as services in a Docker Swarm using the docker compose command:
1- Fill the .env
Fill the .env with your values.
Variable | Description | Example |
SERVICES_BROKER | Service broker URI | redis://my_redis_broker:6379 |
BROKER_PASS | Service broker password (leave empty if there is no password) | my_password |
QUEUE_NAME | (Optional) Overrides the generated queue name (see Queue name below) | my_queue |
SERVICE_NAME | Service's name, uniquely identifies the task | keyword_extraction_fr |
SERVICE_MODE | Whether the service is launched as a task or standalone | task |
LANGUAGE | Language as a BCP-47 code | en-US or * or languages separated by "|" |
CONCURRENCY | Number of workers (1 worker = 1 CPU) | >1 |
TOKENIZERS_PARALLELISM | Activates parallelism for tokenizers | False |
2- Fill the docker-compose.yml
version: '3.7'
build: .
env_file: .env
replicas: 1
- linto-net
external: true
3- Run with docker compose
docker compose build
docker compose up
Queue name:
By default, the service queue name is generated from SERVICE_NAME and LANGUAGE: keyword_extraction_{LANGUAGE}_{SERVICE_NAME}
The queue name can be overridden with the QUEUE_NAME environment variable.
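The naming rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here: the helper name `resolve_queue_name` is hypothetical and not taken from this repository, and reading the values from a mapping such as `os.environ` is an assumption.

```python
import os


def resolve_queue_name(env=None):
    """Return QUEUE_NAME when it is set, else the name built from LANGUAGE and SERVICE_NAME."""
    env = os.environ if env is None else env
    override = env.get("QUEUE_NAME", "").strip()
    if override:
        return override
    # Default pattern: keyword_extraction_{LANGUAGE}_{SERVICE_NAME}
    return "keyword_extraction_{}_{}".format(env["LANGUAGE"], env["SERVICE_NAME"])


# Example with explicit values instead of the real environment.
example = resolve_queue_name({"LANGUAGE": "fr-FR", "SERVICE_NAME": "keyword_extraction_fr"})
```

Passing a plain dictionary, as in the last line, makes it easy to check the result for a given .env without starting the service.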
Service discovery:
As a micro-service, the instance registers itself in the service registry for discovery. The service information is stored as a JSON object in Redis database 0 under the key service:{HOST_NAME}
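As a rough illustration of this key scheme, the sketch below stores and reads back service records using a plain dictionary as a stand-in for Redis database 0. The functions `register` and `discover` are hypothetical names, and the real service may serialize or look up entries differently.

```python
import json


def register(store, host_name, info):
    """Store the service record as a JSON string under service:{HOST_NAME}."""
    store["service:{}".format(host_name)] = json.dumps(info)


def discover(store):
    """Return every registered service record, keyed by its registry key."""
    return {
        key: json.loads(value)
        for key, value in store.items()
        if key.startswith("service:")
    }


# A dict stands in for the Redis database (assumption for illustration only).
fake_db = {}
register(fake_db, "host-1", {"service_name": "keyword_extraction_fr", "concurrency": 1})
services = discover(fake_db)
```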
The following information is registered:
{
    "service_name": $SERVICE_NAME,
    "host_name": $HOST_NAME,
    "service_type": "[TBR-SERVICE TYPE]",
    "service_language": $LANGUAGE,
    "queue_name": $QUEUE_NAME,
    "version": "1.2.0", # This repository's version
    "info": "This specific service version does something",
    "last_alive": 65478213,
    "concurrency": 1
}
When this service is deployed as a task on the NLP services stack (hosted at [HOST] on port [PORT]), it expects the following request:
import requests

# Base URL of the NLP services stack; replace [HOST] and [PORT] with your values.
url = "[HOST]:[PORT]"
headers = {"accept": "application/json"}
data = {
    "documents": ["Document 1", "Document 2"],
    "nlpConfig": {
        "keywordExtractionConfig": {
            "enableKeywordExtraction": True,
            "serviceName": "keyword_extraction_fr",
            "method": "[METHOD]",
            "configParameter1": "value",
            "configParameter2": "value",
            # ..
        }
    },
}
# Submit the job, look up its state, then fetch the extracted keywords.
job_id = requests.post(url + "/nlp", json=data, headers=headers).json()["jobid"]
job = requests.get(url + "/job/" + job_id).json()
keywords = requests.get(url + "/results/" + job["result_id"], headers=headers).json()
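To avoid writing the nested dictionary by hand for every method, the request body can be assembled by a small helper. The sketch below is an assumption-laden illustration: `build_request` is a hypothetical name, and it assumes that method-specific parameters sit next to "method" inside "keywordExtractionConfig", as in the request above.

```python
def build_request(documents, service_name, method, **method_config):
    """Assemble the JSON body posted to the /nlp route (layout assumed from the request above)."""
    config = {
        "enableKeywordExtraction": True,
        "serviceName": service_name,
        "method": method,
    }
    # Assumption: method-specific parameters are siblings of "method".
    config.update(method_config)
    return {
        "documents": list(documents),
        "nlpConfig": {"keywordExtractionConfig": config},
    }


# "[METHOD]" is kept as a placeholder; top_n is one of the parameters listed below.
payload = build_request(
    ["Document 1", "Document 2"], "keyword_extraction_fr", "[METHOD]", top_n=5
)
```

The resulting `payload` can be passed as the `json` argument of the POST request.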
The supported methods are listed below, as well as their method-specific configurations.
A model combining frequencies and KeyBERT:
- Extract the most frequent n-grams (up to 3-grams) in the document
- Filter out unlikely keywords (containing no nouns, all stopwords, not corresponding to Wikipedia article titles)
- Remove particles from beginning of keywords
- Fuse smaller keywords into longer ones if they're frequent enough ('open' + 'source' = 'open source')
- Generate keyword embeddings and score them based on their similarity to segments of text
- Remove near duplicates using embeddings
Config parameter | Description | Default Value |
top_n | Final (maximum) number of keywords extracted | "all" |
number_of_segments | Expected number of topical segments | 10 |
top_candidates | Number of final set of potential keywords to be sorted | 20 |
sbert_model | SentenceBERT model name to use for embedding | paraphrase-multilingual-MiniLM-L12-v2 |
verbose | Whether or not to print out the extraction progress | False |
stopwords | List of words to be used to filter out stopwords | stopwords_fr |
add_stopwords | List of words to be added to the default stopword list | [] |
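The last two steps of this method (scoring keywords against text segments and removing near duplicates) can be illustrated with toy vectors. This is a minimal sketch, not the service's code: in practice the embeddings would come from the SentenceBERT model, and the choices made here (score = best cosine similarity to any segment, a duplicate threshold of 0.9, the names `cosine` and `rank_keywords`) are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_keywords(keyword_vecs, segment_vecs, top_n="all", dup_threshold=0.9):
    """Score keywords against segments, then drop near-duplicate keywords."""
    # Assumed scoring rule: best similarity to any segment.
    scored = sorted(
        ((max(cosine(vec, seg) for seg in segment_vecs), kw)
         for kw, vec in keyword_vecs.items()),
        reverse=True,
    )
    kept = []
    for score, kw in scored:
        # Keep a keyword only if it is not too close to a better-scored one.
        if all(cosine(keyword_vecs[kw], keyword_vecs[k]) < dup_threshold for k, _ in kept):
            kept.append((kw, score))
    return kept if top_n == "all" else kept[:top_n]


# Toy 2-dimensional embeddings standing in for real sentence embeddings.
keyword_vecs = {"open source": [1.0, 0.1], "open-source": [1.0, 0.0], "licence": [0.0, 1.0]}
segment_vecs = [[1.0, 0.1], [0.2, 1.0]]
ranked = rank_keywords(keyword_vecs, segment_vecs)
```

With these toy vectors, "open-source" is discarded because its embedding is almost identical to that of the better-scored "open source".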
Paper: Preprint. Repo: MaartenGr/KeyBERT
Config parameter | Description | Default Value |
model_name | SentenceBERT model name to use for embedding | paraphrase-multilingual-MiniLM-L12-v2 |
keyphrase_ngram_range | Minimum and maximum length of extracted keywords | (1, 2) |
stopwords | List of words to be used to filter out stopwords | stopwords_fr |
add_stopwords | List of words to be added to the default stopword list | [] |
Paper: EMNLP'04
Config parameter | Description | Default Value |
spacy_model | SpaCy model to use for POS tagging | fr_core_news_md |
damping | Damping parameter for the PageRank algorithm, to be kept between 0.8 and 0.9 | 0.85 |
steps | Number of iterations for PageRank | 10 |
stopwords | List of words to be used to filter out stopwords | stopwords_fr |
add_stopwords | List of words to be added to the default stopword list | [] |
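To show what the damping and steps parameters control, here is a small PageRank iteration over a word co-occurrence graph. It is a sketch under stated assumptions rather than the service's implementation: the function name `pagerank_keywords` and the `window` parameter are hypothetical, and the POS tagging and stopword filtering done with spaCy are left out.

```python
from collections import defaultdict


def pagerank_keywords(words, damping=0.85, steps=10, window=2):
    """Rank words by PageRank on an undirected co-occurrence graph."""
    # Link each word to the `window` words that follow it (assumed graph construction).
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    scores = {word: 1.0 for word in neighbours}
    for _ in range(steps):
        # Each word keeps (1 - damping) and receives a damped share of its neighbours' scores.
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[word])
            for word in neighbours
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = pagerank_keywords("graph ranking scores graph nodes graph edges".split())
```

A higher damping value gives more weight to the graph structure and less to the uniform base score; more steps bring the scores closer to convergence.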
Paper: IJCNLP'13
Config parameter | Description | Default Value |
spacy_model | SpaCy model to use for POS tagging | fr_core_news_md |
phrase_count_threshold | Minimum number of occurrences for a phrase to be counted | 0 |
stopwords | List of words to be used to filter out stopwords | stopwords_fr |
add_stopwords | List of words to be added to the default stopword list | [] |
Simply computes the words that appear with the highest frequency (with the possibility of omitting stopwords).
Config parameter | Description | Default Value |
threshold | Minimum number of occurrences of a word in the text for it to be included | 1 |
stopwords | List of words to be used to filter out stopwords | stopwords_fr |
add_stopwords | List of words to be added to the default stopword list | [] |
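A frequency-based extraction of this kind can be sketched with the standard library alone. The function name `frequent_words`, the whitespace tokenization and the lowercasing are assumptions made for the example; the service may tokenize and normalize text differently.

```python
from collections import Counter


def frequent_words(text, threshold=1, stopwords=(), add_stopwords=()):
    """Return (word, count) pairs for non-stopwords seen at least `threshold` times."""
    ignored = {w.lower() for w in stopwords} | {w.lower() for w in add_stopwords}
    # Naive whitespace tokenization (assumption for illustration).
    counts = Counter(w for w in text.lower().split() if w not in ignored)
    return [(word, count) for word, count in counts.most_common() if count >= threshold]


result = frequent_words(
    "le chat et le chien et le chat",
    threshold=2,
    stopwords=["le"],
    add_stopwords=["et"],
)
```

Here "le" and "et" are filtered out as stopwords, and "chien" is dropped because it appears only once.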
This project is developed under the AGPLv3 license (see LICENSE).