July 30: The live demo is available at a link different from the one in the paper. Please contact the authors to get the link and credentials. Contact info is in the paper.
- The purpose of this project
- The layout of this project
- The call stack of an API call
- The "best" practice when developing and debugging
- Deploy the whole thing on a single machine
- Add a new NLP tool
- Write and run tests
- Access NLP tools via HTTP
- Move some NLP tools to another machine
- Backup and restore the workbench
- Cite this project
Please refer to our paper, poster, slides or video.
- The UI has a fresh new design. A new search bar has been added for easier navigation. The admin portal is integrated with the main interface.
- New feature: run inference using any Hugging Face model.
- New documents can be imported from a URL.
- Updated model: AMRBART updated to -v2
- We revamped the admin portal (`/admin`) to support collection creation, browsing, deletion and population. Documents can be imported into a collection from Bing search.
Docker is the preferred way of deployment. It requires a recent version of Docker and the Docker Compose plugin; tested with Docker v20.10.16.
Credentials can be provided as environment variables:
export BING_KEY=AAAAAAAAAA # Bing search API key
export BEARER_TOKEN=AAAAAAAAAAA # Twitter API bearer token
export OPENAI_API_KEY=AAAAAAAAAA # OpenAI API key
export ELASTIC_PASSWORD=elastic # Elasticsearch password, default is `elastic`
export CLASSIFIER_FOLDER=/data/local/workbench-data/classifiers # where HuggingFace classifiers are stored
They can also be stored in `.env`. API keys can also be provided in the frontend.
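For example, a `.env` file next to the compose files could hold the same keys (values below are placeholders):

# .env -- placeholder values, do not commit real credentials
BING_KEY=AAAAAAAAAA
BEARER_TOKEN=AAAAAAAAAAA
OPENAI_API_KEY=AAAAAAAAAA
ELASTIC_PASSWORD=elastic
CLASSIFIER_FOLDER=/data/local/workbench-data/classifiers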
Clone the repositories and build docker images:
# build images
docker compose -f docker-compose.dev.yml --profile non-gpu --profile gpu build
# run
docker compose -f docker-compose.dev.yml --profile non-gpu --profile gpu up
The workbench should be up and running on http://localhost.
Without further configuration, some parts will not work: the entity linker will not work because the knowledge graph is empty, and classifiers other than the sentiment classifier are not available until classifier models are put into `CLASSIFIER_FOLDER`.
By default Docker creates temporary volumes to store data. In production we want to persist things, and this is done by binding locations on the host to the containers. We also need to configure Neo4j and Kibana to pair with their servers. Details on deploying in production mode are documented here.
The above commands will start all services. However, we may choose to only start the services of interest by editing their `profiles` sections in `docker-compose.yml`. For example, if only NER is required, we may add `- myprofile` under the `profiles` sections of `api`, `elasticsearch`, `redis`, and `ner`. These four services are the minimum required to perform NER.
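As a hedged illustration (the profile name `myprofile` is arbitrary, and each service keeps whatever profiles it already has), the edited entries in `docker-compose.yml` might look like:

# illustrative only: add the same custom profile under each required service
# (keep the profiles each service already carries)
api:
  profiles:
    - myprofile
elasticsearch:
  profiles:
    - myprofile
redis:
  profiles:
    - myprofile
ner:
  profiles:
    - myprofile

Running `docker compose --profile myprofile up` against the edited file would then start only these four services.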
NLPWorkbench can be accessed programmatically via its HTTP APIs, i.e. managing documents and collections or performing analyses from scripts or other programs, without using the built-in web frontend. For example, you can make HTTP requests in Python:
import requests
collection_name = "new_collection"
workbench_url = "http://localhost"
r = requests.put(f"{workbench_url}/api/collection/{collection_name}")
r.raise_for_status()
The above Python program creates a new collection called `new_collection`.
Read the HTTP API references to learn about other APIs.
The architecture of the nlpworkbench is shown in the above figure. Each NLP tool / model runs in its own independent container and communicates with the API server using Celery, or alternatively any protocol you like.
The goal of using Celery is that you can move any Python function to any physical machine and still use it as if it were running on the same machine, without worrying about (de)serialization, networking protocols, communication, etc. Celery puts all tasks into a queue and starts workers that consume tasks from the queue.
Let's start with a simple example to show how Celery works. Suppose we have a function that does tokenization:
# callee.py
def tokenize(text):
    return text.split()
On the same machine, calling this function is as simple as:
# caller.py
from callee import tokenize
tokens = tokenize("hello world")
Now, suppose we want our `tokenize()` function to run on another machine (or another container). We would create a Celery worker, register the function as a remote one, and run `python3 callee.py` to start the server.
# callee.py
from rpc import create_celery

celery = create_celery("callee")  # first arg is the filename

@celery.task
def tokenize(text):
    return text.split()

if __name__ == '__main__':
    # you can control the number of workers
    celery.start(argv=["worker", "-l", "INFO", "--concurrency=1", "-Q", "callee"])
On the caller's end, calling `delay()` on the function `tokenize` puts the task in the queue, and calling `get()` will block and wait for the result:
# caller.py
from callee import tokenize

tokens = tokenize.delay("hello world").get()
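Since `delay()` returns a standard Celery `AsyncResult`, the call does not have to block right away; a small variant of the caller, using only the stock Celery result API:

# caller.py -- non-blocking variant (standard Celery AsyncResult API)
from callee import tokenize

result = tokenize.delay("hello world")  # queued immediately; returns an AsyncResult
# ... do other work while a worker tokenizes ...
tokens = result.get(timeout=10)         # block here for at most 10 seconds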
That's it! Celery configured in `rpc.py` should work with any parameter / return data types, but it's encouraged to only use built-in types to avoid weird bugs. If you are not using this codebase, you can copy `rpc.py` to your repository. The only dependencies to add are `dill==0.3.5` and `"celery[redis]"==5.2.7`.
It's worth noting that every function in the project decorated with `@celery.task` can be called in this fashion, even if it lives in a different container.
If running individual containers, you need to have Redis running and configure the Redis address in `config.py`.
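The exact contents of `config.py` are specific to this codebase; conceptually it only needs to expose the broker / backend URLs. A minimal sketch, assuming the `RPC_BROKER` / `RPC_BACKEND` environment variables mentioned later in this README:

# config.py -- minimal sketch, not the actual file in this repository
import os

# Celery broker and result backend; both point at the same Redis by default
RPC_BROKER = os.environ.get("RPC_BROKER", "redis://localhost:6379/0")
RPC_BACKEND = os.environ.get("RPC_BACKEND", "redis://localhost:6379/0")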
Finally, we can wrap the new tokenization tool in a container. Create a new file called `Dockerfile` in the folder:
FROM python:3.7
WORKDIR /app
RUN pip install dill==0.3.5 "celery[redis]"==5.2.7
# copy the worker code and the Celery helper into the image
COPY callee.py rpc.py ./
CMD ["python3", "callee.py"]
then add

tokenizer:
  build:
    dockerfile: Dockerfile
    context: path-to-tokenizer/
to the `services` section in `docker-compose.yml`. Sample configuration files can be found under `example/`.
Services in `docker-compose.yml` are labelled with two profiles: `gpu` and `non-gpu`. Running `docker compose [--profile non-gpu] [--profile gpu] up --build` will start the selected groups of containers. In our case we run the `non-gpu` group on `caidac` and the `gpu` group on `turin4`.
Since we are using Redis as the message queue, it is relatively easy to move worker containers to other machines, as long as Redis is exposed to the public network. CAVEAT: set a long password for Redis with `--requirepass` if exposing it to the public, and configure the firewall to only allow desired hosts. Configure `RPC_BROKER` and `RPC_BACKEND` in the environment to be the public address of the Redis server.
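For example (the hostname, port and password below are placeholders, not real values used by this project):

# placeholders only -- substitute your own Redis host, port and password
export RPC_BROKER=redis://:a-long-random-password@redis-host.example.com:6379/0
export RPC_BACKEND=redis://:a-long-random-password@redis-host.example.com:6379/0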
Once we have GPU servers, we might want to move some of the NLP models there. It's possible to run containers on different physical machines with the help of Docker swarm. Read the official tutorial here: https://docs.docker.com/engine/swarm/.
In a nutshell, these things need to be done:
- On one of the machines, use `docker swarm init` to create a swarm manager.
- On other machines, use `docker swarm join --token TOKEN MANAGER_IP` to join the swarm.
- Use `docker node update --label-add foo --label-add bar=baz node-1` to add labels to the nodes.
- Combine labels and placement constraints to control which container goes to which node (see the sketch after this list).
- Run `docker stack deploy -c stack.yml nlpworkbench`.
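As a sketch of the placement-constraint step (the service name `amr` and the label `gpu` are placeholders, not necessarily the ones used in our `stack.yml`), a constraint pins a service to labelled nodes:

# illustrative fragment of a stack file -- names are placeholders
services:
  amr:
    deploy:
      placement:
        constraints:
          - node.labels.gpu == true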
Locally built images should be stored in a local registry:
docker service create --name registry --publish published=5000,target=5000 --constraint node.hostname==caidac registry:2
docker compose -f stack.yml build
docker compose -f stack.yml push
docker stack deploy --compose-file stack.yml nlpworkbench
If using explicit mapping, the following mappings are used for caching model outputs:
PUT /bloomberg-reuters-v1/_mapping
{
  "properties": {
    "raw-ner-output": {
      "type": "object",
      "enabled": false
    }
  }
}

PUT /bloomberg-reuters-v1/_mapping
{
  "properties": {
    "raw-linker-output": {
      "type": "object",
      "enabled": false
    }
  }
}

PUT /bloomberg-reuters-v1/_mapping
{
  "properties": {
    "raw-amr-output": {
      "type": "object",
      "enabled": false
    }
  }
}

PUT /bloomberg-reuters-v1/_mapping
{
  "properties": {
    "raw-person-rel-output": {
      "type": "object",
      "enabled": false
    }
  }
}
Some fields, like `raw-amr-output`, are not indexed (`"enabled": false`).
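These mapping updates can be issued from the Kibana Dev Tools console, or directly against Elasticsearch over HTTP; for example (host and credentials below are placeholders), with curl:

# apply one of the mappings above over HTTP (placeholder host and credentials)
curl -u elastic:$ELASTIC_PASSWORD -X PUT "http://localhost:9200/bloomberg-reuters-v1/_mapping" \
  -H "Content-Type: application/json" \
  -d '{"properties": {"raw-ner-output": {"type": "object", "enabled": false}}}'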
- Add `- path.repo=/repo` to `services->elasticsearch->environment` in `docker-compose.yml`. Mount the folder containing snapshots to `/repo` of the ES container. Uncompress previous snapshots there (for example, `/path/to/repo/bak` contains the snapshots and `/repo` in the container is mapped to `/path/to/repo/bak`). See the compose sketch after this list.
- In Kibana -> Management -> Stack Management -> Snapshot and Restore, register a new Shared file system repository with path `bak/` (as in the example).
- You will then see and be able to restore the snapshot in Kibana.
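A hedged sketch of the compose fragment for the first step above (the host path is a placeholder; the entries are merged into the existing `elasticsearch` service):

# illustrative fragment -- host path is a placeholder
elasticsearch:
  environment:
    - path.repo=/repo
  volumes:
    - /path/to/repo/bak:/repo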
`neo4j.dump` must be readable by user `7474:7474`.
Create a new container:
docker run -it --rm \
--volume=/home/ubuntu/workbench/docker-data/neo4j:/data \
--volume=/home/ubuntu/workbench/docker-data/neo4j.dump:/neo4j.dump \
neo4j:4.4-community \
bash -c "neo4j-admin load --from=/neo4j.dump --force --verbose"
Note: the Neo4j image version must match the dump version and DBMS version!
This diagram shows what's happening behind the scenes when an API call is made to run NER on a document.

When a REST API is called, the NGINX reverse proxy (running in the `frontend` container) decrypts the HTTPS request and passes it to the `api` container. Inside the `api` container, `gunicorn` passes the request to one of the Flask server processes. Numbers below correspond to labels in the diagram.
1. `wsgi.py` provides routing for RESTful API calls. Everything under `doc_api.route` is registered with a pre-request hook.
2. The pre-request hook verifies and loads the document from the ES collection. The document is stored in Flask's global object `g` for the lifecycle of the request.
3. Loading the document is handled by `api_impl.py`, which makes a request to Elasticsearch to retrieve the document if it is in a collection, or downloads the article from the provided URL.
4. An Elasticsearch query is just an HTTP request.
5. After the document is retrieved, `get_ner` in `api_impl.py` is called. `api_impl.py` provides the real implementation of the functions and the caching of results. Functions in `api_impl.py` are wrapped with the `@es_cache` decorator, which handles caching.
6. The `es_cache` decorator creates a cache key based on the parameters of the function call and checks whether the result is already cached. If so, no real computation is done and the cached result is returned.
7. If no NER output is cached, we use Celery to call the remote function `run_ner` running in the `ner` container.
8. The `run_ner` function prepares the input for the PURE NER model. The open source code from PURE takes JSON-lines text files as input and writes to another JSON-lines file. `run_ner` prepares the input files, calls the NER model, and parses the outputs. The `call` function is a wrapper in PURE NER's code base: PURE NER originally could only be invoked via the command line (handled in `__main__`), and this wrapper function pretends its inputs come from the command line.
9. We also have lazy-loading helper functions so that models are only loaded once.
10. The output of PURE NER is automatically stored in Elasticsearch by the `es_cache` decorator.
11. The NER output is formatted to suit the needs of the frontend and returned to the user.
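The actual `es_cache` decorator lives in `api_impl.py`; the snippet below is only a minimal sketch of the general pattern (build a cache key from the call arguments, look it up, run the model only on a miss), with a plain dict standing in for the Elasticsearch-backed store:

# minimal sketch of the caching-decorator pattern -- not the actual es_cache code
import functools
import json

def cache_by_args(cache_store):
    """Cache a function's output keyed by its name and arguments.

    In the real workbench the store is a field on the Elasticsearch document;
    here a plain dict stands in so the sketch is self-contained.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = json.dumps([func.__name__, args, kwargs], sort_keys=True, default=str)
            if key in cache_store:          # cache hit: skip the computation entirely
                return cache_store[key]
            result = func(*args, **kwargs)  # cache miss: run the real model
            cache_store[key] = result
            return result
        return wrapper
    return decorator

# usage sketch
cache = {}

@cache_by_args(cache)
def run_ner_stub(text):
    return [{"mention": text.split()[0]}]   # stand-in for the real NER call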
build/
|-- Dockerfile.api
|-- Dockerfile.service
frontend/
requirements/
|-- api.txt
|-- service.txt
workbench/
|-- __init__.py
|-- rpc.py
|-- coll/
|   |-- __init__.py
|-- thirdparty/
|   |-- amrbart/
|   |-- pure-ner/
docker-compose.yml
The `build/` folder contains all `Dockerfile`s, and the `requirements/` folder contains the `requirements.txt` for each micro-service.
`workbench/` contains all Python code. The folder and all of its subfolders (except `thirdparty/`) are Python packages. An `__init__.py` file must be present in every subfolder. Relative imports (`from . import config`, or `from ..rpc import create_celery`) are the preferred way to reference modules within the `workbench` package.
We are moving towards test-driven development. The infrastructure for unit tests is available.
We use the pytest framework to test Python code. It is a very light-weight framework: to write tests, one creates a file `tests/test_some_module.py` containing functions named `test_some_feature` that make `assert` statements.
Here's a snippet from `tests/test_sentiment.py` that tests the VADER sentiment analyzer:
from workbench import vader

def test_classify_positive_sents():
    positive_sents = [
        "I love this product.",
        "Fantastic!",
        "I am so happy.",
        "This is a great movie."
    ]
    for sent in positive_sents:
        output = vader.run_vader(sent)
        assert output["polarity_compound"] > 0.3
Running `python3 -m pytest tests/test_sentiment.py` will produce a report for this set of unit tests like:
==================== test session starts ====================
platform linux -- Python 3.9.12, pytest-7.1.1, pluggy-1.0.0
rootdir: /data/local/workbench-dev, configfile: pyproject.toml
plugins: anyio-3.5.0
collected 3 items
tests/test_sentiment.py ... [100%]
============== 3 passed, 3 warnings in 0.37s ==============
In the real world we don't run code directly; instead we use Docker. Unit tests are added to a Docker image separate from the image used to run the service, using a multi-stage build. Still using VADER as the example, the Dockerfile after adding tests becomes:
FROM python:3.7 AS base
WORKDIR /app
RUN mkdir /app/cache && mkdir /app/vader_log && mkdir /app/lightning_logs
COPY requirements/vader.txt requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install -r requirements.txt
COPY workbench/ ./workbench
ENV PYTHONUNBUFFERED=TRUE
FROM base as prod
CMD ["python3", "-m", "workbench.vader"]
FROM base as test
COPY .coveragerc ./
RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install pytest==7.2 coverage==7.0.5
COPY tests ./tests
CMD ["coverage", "run", "--data-file=cov/.coverage", "--source=workbench/", "--module", "pytest", "tests/test_sentiment.py"]
The `base` image contains all the source code and dependencies to run VADER. The `prod` image starts the process that serves VADER. The `test` image is used for testing: tests are added and the test frameworks are installed. Running a container with the `test` image will invoke the tests.
After adding the multi-stage build, `docker-compose.dev.yml` needs to be changed to specify the default build stage as `prod`:
vader:
  build:
    dockerfile: ./build/Dockerfile.vader
    target: ${COMPOSE_TARGET:-prod}
`run-test.sh` provides the scripts to run tests on your local machine using Docker. To test VADER:
./run-test.sh build vader # build the `test` stage image for vader
./run-test.sh test vader # run a container with the `test` image
# repeat the process for other services.
# `vader` can be replaced with other services defined in `docker-compose.dev.yml`
./run-test.sh coverage # combine coverage info from all tests and print coverage report
Once your commits are pushed to GitLab, a pipeline is triggered to automatically run tests. The pipeline badge indicates whether the tests passed, and the coverage badge shows the line coverage percentage.
The tests triggered by the push are defined in `.gitlab-ci.yml`. When adding new tests, a `build-something` job and a `test-something` job should be added following the structure of existing jobs in the file.
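A hedged sketch of what such a pair of jobs might look like (stage names and commands are assumptions; mirror the real jobs in `.gitlab-ci.yml` rather than copying this verbatim):

# illustrative only -- follow the structure of the existing jobs in .gitlab-ci.yml
build-vader:
  stage: build
  script:
    - ./run-test.sh build vader

test-vader:
  stage: test
  needs: ["build-vader"]
  script:
    - ./run-test.sh test vader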
The test jobs will be executed by a local runner on one of our own machines (rather than on a shared runner provided by GitLab). The local GitLab Runner is installed on our machines as a Docker container, following the official tutorial. Our local runner is then registered with the repository. The default image for the Docker executors is `docker:20.10.16`. We are using Docker socket binding so that Docker images / containers created within Docker containers will be running on the host system, instead of becoming nested containers. This is beneficial for caching and reusing layers.
Please cite the project as:
@inproceedings{yao-etal-2023-nlp,
    title = "{NLP} Workbench: Efficient and Extensible Integration of State-of-the-art Text Mining Tools",
    author = "Yao, Peiran and
      Kosmajac, Matej and
      Waheed, Abeer and
      Guzhva, Kostyantyn and
      Hervieux, Natalie and
      Barbosa, Denilson",
    booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.eacl-demo.3",
    pages = "18--26",
}