- Introduction
- A Quick Introduction to ZMQ
- Project Architecture
- Brief Project Workflow
- Understanding the Directory Structure
This is the project directory for the guided project Improving the Performance of Deep Learning based Flask App with ZMQ.
This is an extension of the previous project: https://github.com/vagdevik/Flask-Image-Classification
What is the idea? Now that we are familiar with how the project How to Deploy an Image Classification Model using Flask works, let us discuss how the current project differs from it, starting with two shortcomings of the previous one:
(1) The amount of time taken to predict the classes of an input image
If we run the command
time python Flask-Server-Folder/test_client_without_zmq.py
We can see how long the predictions take. Experimenting with different images, we observe that a prediction typically takes around 10-21 seconds per input image. That is far too slow for real-time environments: users expect the applications they use to be not only accurate but also fast.
(2) It is a monolith program, thus no flexibility
A big service built as one piece is generally called a monolith, meaning made out of a single stone.
Something made from a single stone cannot be broken into separate parts: it cannot be made modular, and it cannot be spread across multiple machines. For example, the deep-learning service is computation-intensive and may require GPUs, while CPUs may suffice for the web service. Breaking the services down by need and resource consumption therefore offers optimal resource usage (and thus lower cost), modular code (and thus modular services), flexible scaling, and isolated team responsibilities (the ML engineer need not worry about the software engineer's work, and vice versa).
- The model and the web server are coded in monolith style, so there is no flexibility.
- Also, the same heavy model executions are repeated because of this style of programming.
- The way test_client_without_zmq.py works is:
  - All the imports and the loading of the model happen every time we run the program.
  - Loading the model means constructing the huge deep learning graph by stacking the layers and associating their weights.
  - So constructing the same model again and again is costly, since it takes time to load the model each time we run the program.
- Very similarly, the way app.py works is this:
  - It is coded in a monolith programming style, with the Flask server and the model loading/inference all in the same file.
  - Each request from the user is redirected to the dedicated URL.
  - Each URL invokes the corresponding function in app.py.
  - Each time app.py is in use, all the imports and all the execution in the invoked function are done from the beginning.
  - Though we want to use the same model each time we want predictions for different input images, the model is loaded anew for each request. This is costly, and thus time-consuming.
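The cost of this pattern can be imitated with a small, self-contained sketch. Everything in it is an assumption made for illustration: `load_model` is a hypothetical stand-in for constructing the real network (the short sleep merely imitates a slow load), and `handle_request_monolith` is an invented name for a handler that rebuilds the model on every call, as the monolith style does.

```python
import time

LOAD_COUNT = 0  # how many times the (simulated) model has been built


def load_model():
    """Hypothetical stand-in for building the network; the sleep imitates a slow load."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.05)
    # The "model" is just a function returning fixed, made-up labels.
    return lambda image_bytes: ["class_a", "class_b", "class_c"]


def handle_request_monolith(image_bytes):
    """Monolith style: the model is rebuilt inside the request handler."""
    model = load_model()  # repeated for every single request
    return model(image_bytes)


start = time.perf_counter()
for _ in range(3):
    handle_request_monolith(b"fake image bytes")
elapsed = time.perf_counter() - start

print(f"model loaded {LOAD_COUNT} times for 3 requests, {elapsed:.2f}s spent")
```

Three requests trigger three loads, so the load time is paid again on each request instead of once.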
- Now that we understand that the problem comes from the monolithic programming style, and in particular from the costly step of reloading the model for every prediction request, we need a way to avoid that overhead.
- Java has the static keyword, with which variables can be declared static. Static members are loaded into memory exactly once, i.e. when the Java class is loaded, and they can be used by different threads.
- But this approach has a disadvantage: the service will not be scalable, because static variables live on one machine and are not shared with others. So complications arise once you start relying on static variables.
- Also, Python has no static keyword; module-level and class-level variables play a similar role, but they too are confined to a single process.
- Hence, we adopt an approach that partially uses the static-variable concept, but in a wiser way, by means of a server-client architecture.
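Within a single process, a Python class attribute can approximate the load-once behaviour of a Java static member. The sketch below is only an illustration of that idea: `ModelHolder` and its fake model are hypothetical names, not part of the project. It also shows the limitation noted above, since the cached object lives only in the process that created it.

```python
class ModelHolder:
    """Keeps one shared model per process, similar in spirit to a static member."""

    _model = None    # class attribute: shared by every caller in this process
    load_count = 0   # counts how often the expensive load actually ran

    @classmethod
    def get(cls):
        if cls._model is None:
            cls.load_count += 1
            # Hypothetical stand-in for constructing the real network.
            cls._model = lambda image_bytes: ["class_a", "class_b", "class_c"]
        return cls._model


results = [ModelHolder.get()(b"fake image bytes") for _ in range(3)]
print(ModelHolder.load_count)
```

A second process (for example, another web worker on another machine) would have its own `ModelHolder` and would load the model again, which is why the project moves the model behind a server instead.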
To address this pain point, we introduce asynchrony by integrating ZMQ (an asynchronous networking library) with the Flask server:
- The idea is to import the modules and construct the graph exactly once, run a server, and let the server reuse these loaded modules and variables any number of times, as long as the server is active and receiving requests.
- This can be done by breaking the code into 2 services: a web service and a model service.
- We achieve this by defining a service (let us call it the model server) that imports the modules and constructs the graph once, keeps the model ready for inference, and keeps listening for client requests. Once a client (here, the Flask server) sends a request, the server (which has already constructed the graph and kept it ready for inference) responds to the client with the predictions. The Flask server then renders the corresponding HTML page with the predictions.
- ZMQ is a library that provides the networking capabilities with which we can build a custom server-client mechanism to suit our needs.
- We achieve this as follows:
- We separate the model and app.py.
- We create a server class named Server, which invokes the RequestHandler class each time the server gets a request. We will define these classes in the resnet_model_server.py file.
  (a) As long as the Server is not stopped, it listens on a specified port (say 5576).
  (b) As soon as the server receives a request from a client, it invokes the RequestHandler class, which handles the request by (1) converting the base64-encoded image into a normal image, (2) preprocessing the image, (3) feeding the image to the model, (4) getting the predictions, and (5) returning the resulting predictions to the server.
  (c) The server responds to the client with the resulting predictions.
- When a user submits an image for prediction, the corresponding URL is invoked (say https://f.cloudxlab.com:4114/uploader).
- That URL invokes the corresponding function in app.py (here, the upload_file function).
- This invoked function acts as a client. It connects a socket to port 5576, sends the image in an encoded form (here we use base64 encoding for the input image), and keeps polling (that is, waiting) for the response from the server.
- The server receives the request, keeps track of the client through a unique id, and routes the request to a request handler. The request handler converts the payload back to a normal image, preprocesses it as required by the pre-trained model, feeds the preprocessed image to the model, and returns the predictions (as a JSON object) to the server. The server sends this response to the client that requested it.
- This JSON object is received by the same function in app.py that sent the request, and the predictions are passed to the template rendered by that function.
- In this workflow, it is ZMQ that provides the mechanism for connecting through the specified sockets and keeps the server listening on that port.
- At the beginning of the resnet_model_server.py file, we load the model. Then we start the server, which keeps listening on a port. Thus the server is always actively listening on the port and responding to clients. We do not load the model again as long as the server is active.
- The server listens as long as it is not stopped, so there is no need to load the model anew. The model is retained in memory rather than being loaded from disk for each request, so the time consumed should go down.
- ZMQ is an efficient messaging library, and we use it here to improve performance.
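The client-side behaviour described above (connect, send a base64-encoded image, poll for the reply) can be sketched with pyzmq as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `load_model`, `model_server`, and `request_predictions` are hypothetical names, the model is a fake that returns fixed labels, the host `127.0.0.1` is assumed, and the server is a single ROUTER socket run in a thread purely so the example is self-contained.

```python
import base64
import json
import threading

import zmq

ENDPOINT = "tcp://127.0.0.1:5576"  # port 5576 as in the text; the host is an assumption


def load_model():
    """Hypothetical stand-in for building the real network."""
    return lambda image_bytes: [["label_a", 0.7], ["label_b", 0.2], ["label_c", 0.1]]


def model_server(context, ready, stop):
    model = load_model()               # loaded once, before any request arrives
    sock = context.socket(zmq.ROUTER)
    sock.bind(ENDPOINT)
    ready.set()
    while not stop.is_set():
        if sock.poll(100):             # wait up to 100 ms for a request
            client_id, payload = sock.recv_multipart()
            image_bytes = base64.b64decode(payload)
            reply = json.dumps(model(image_bytes)).encode("utf-8")
            sock.send_multipart([client_id, reply])
    sock.close(linger=0)


def request_predictions(context, image_bytes, timeout_ms=5000):
    """What the Flask view function would do in its role as a client."""
    sock = context.socket(zmq.DEALER)
    sock.connect(ENDPOINT)
    try:
        sock.send(base64.b64encode(image_bytes))
        if sock.poll(timeout_ms):      # keep waiting until a reply or the timeout
            return json.loads(sock.recv())
        return None                    # no answer in time
    finally:
        sock.close(linger=0)


context = zmq.Context()
ready, stop = threading.Event(), threading.Event()
server = threading.Thread(target=model_server, args=(context, ready, stop))
server.start()
ready.wait()

predictions = request_predictions(context, b"fake image bytes")

stop.set()
server.join()
context.term()
print(predictions)
```

Polling with a timeout lets the client give up instead of blocking forever if the model server is down; the 5-second value is an arbitrary choice for the sketch.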
The official description of ZMQ is:
ZeroMQ (also known as ØMQ, 0MQ, or zmq) looks like an embeddable networking library but acts like a concurrency framework. It gives you sockets that carry atomic messages across various transports like in-process, inter-process, TCP, and multicast. You can connect sockets N-to-N with patterns like fan-out, pub-sub, task distribution, and request-reply. It's fast enough to be the fabric for clustered products. Its asynchronous I/O model gives you scalable multicore applications, built as asynchronous message-processing tasks. It has a score of language APIs and runs on most operating systems.
It supports asynchronous I/O, so multiple tasks can be processed in parallel, and it offers customizable networking options.
ZMQ is neither a client nor a server. Rather, we can build our own client and server using the networking functionality that ZMQ provides.
ZMQ provides sockets of various types, which can be used in different scenarios. For example, PUB/SUB sockets are used in publisher-subscriber messaging systems. In our scenario, we will use the mechanism involving ROUTER/DEALER. Let us briefly discuss this mechanism:
- With ROUTER/DEALER sockets, the ROUTER accepts requests from clients, routes them, receives the responses, and sends each response back to its client, while the DEALER deals with the workers who perform the task. Workers perform the task and return the results to the ROUTER via the DEALER. A ROUTER may have one or more DEALERs.
- Here, the ROUTER can be thought of as a frontend for the client to communicate with, while the workers work in the backend. Workers perform the task in the backend and return the results to the client via the frontend. The DEALER, which is created in the same context as the ROUTER, acts as the main gateway in the backend, through which the workers return results to the frontend.
- Note that the ROUTER also keeps track of each client through a unique id, which it uses to return the response to the right client.
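The ROUTER/DEALER arrangement described above can be sketched with pyzmq as follows. This is a simplified illustration rather than the project's server: the names `broker` and `worker`, the `inproc://workers` endpoint, the host `127.0.0.1`, and the echo-style "task" are all assumptions. The loop in `broker` forwards messages by hand so that it can be stopped cleanly; pyzmq also offers `zmq.proxy` for the same forwarding job.

```python
import threading

import zmq

FRONTEND_URL = "tcp://127.0.0.1:5576"  # clients connect here (port from the text)
BACKEND_URL = "inproc://workers"       # hypothetical in-process endpoint for workers


def broker(context, stop, ready):
    frontend = context.socket(zmq.ROUTER)  # talks to clients, tracks their ids
    frontend.bind(FRONTEND_URL)
    backend = context.socket(zmq.DEALER)   # deals requests out to the workers
    backend.bind(BACKEND_URL)
    ready.set()
    poller = zmq.Poller()
    poller.register(frontend, zmq.POLLIN)
    poller.register(backend, zmq.POLLIN)
    while not stop.is_set():
        events = dict(poller.poll(100))
        if frontend in events:             # request: [client_id, body]
            backend.send_multipart(frontend.recv_multipart())
        if backend in events:              # reply: [client_id, body]
            frontend.send_multipart(backend.recv_multipart())
    frontend.close(linger=0)
    backend.close(linger=0)


def worker(context, stop):
    sock = context.socket(zmq.DEALER)
    sock.connect(BACKEND_URL)
    while not stop.is_set():
        if sock.poll(100):
            client_id, request = sock.recv_multipart()
            # Stand-in for the real task; the client id is passed back unchanged
            # so the ROUTER can route the reply to the right client.
            sock.send_multipart([client_id, b"echo:" + request])
    sock.close(linger=0)


context = zmq.Context()
stop, ready = threading.Event(), threading.Event()
threads = [threading.Thread(target=broker, args=(context, stop, ready))]
threads[0].start()
ready.wait()                               # bind the endpoints before workers connect
for _ in range(2):
    t = threading.Thread(target=worker, args=(context, stop))
    t.start()
    threads.append(t)

client = context.socket(zmq.DEALER)
client.connect(FRONTEND_URL)
client.send(b"hello")
reply = client.recv() if client.poll(5000) else None
client.close(linger=0)

stop.set()
for t in threads:
    t.join()
context.term()
print(reply)
```

The client never sees the id frame: the ROUTER prepends it on the way in and strips it on the way out, which is how the reply finds its way back to the client that asked.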
In our project, the model gets loaded when we run the file where we define the server.
Then the server starts and listens for client requests. Note that model loading happens only once (that is, at import time), so we do not need to load the model for each request: the server actively listens for requests, and the loaded model is reused to answer any number of them.
- The user uploads an image in the web app.
- The Flask server acts as a client of the model server. It sends the input image (in an encoded form) to the frontend of the model server.
- The model server invokes RequestHandler for predictions. A worker connects to the backend of the model server.
- The model yields predictions, and the worker sends them to the backend of the model server.
- The frontend of the model server responds to the Flask server with the predictions.
- The Flask server renders an HTML template with the predictions displayed.
In our project:
Now let's take a deeper look at the architecture. Observe the following image (it is only meant to aid intuition):
(1) We create a Flask server to serve the web app and a model server to serve the model. An image is uploaded through the web app, and the resulting predictions are returned by the model server to the client (here app.py) in the Flask server.
(2) Upon submitting the image through the web app, the corresponding URL (here /uploader) invokes the corresponding function (here upload_file) in app.py.
(3) In this function, we create a socket and connect it to the same port (here 5576) through which the frontend of the model server communicates.
(4) In this function, the image is encoded into a format that can be transmitted over the network from the Flask server to the model server.
(5) At the server, we define the frontend (ROUTER) and the backend (DEALER). The frontend is bound to the port on which it receives client requests, and the backend is bound to the endpoint to which the workers connect through an in-process communication protocol. As discussed, the DEALER deals with the workers.
(6) The frontend of the model server (ROUTER) receives this encoded image along with the id of the client.
(7) Once the frontend receives the request (we refer to this as receiving a message or data), the request handler is invoked. This is where we define the workers, which connect to the backend through the in-process endpoint that the backend is bound to.
(8) In the request handler, the workers connect to the backend and then deliver the results to it in JSON format.
(9) The backend receives the results, which are transmitted to the client through the frontend. Remember: clients talk only to the frontend, and the work is done in the backend, hence the choice of names.
(10) The client receives the results in JSON format, and they can then be used in the template rendered by the function to display them.
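The data formats that travel between the two services in steps (4) and (8) to (10) can be sketched with the standard library alone. The helper names, the score dictionary, and the labels below are hypothetical; the real project obtains its scores from the pre-trained model, and its exact JSON layout is not shown in the text.

```python
import base64
import json


def encode_image(image_bytes):
    """Client side: turn raw image bytes into ASCII text that is safe to transmit."""
    return base64.b64encode(image_bytes).decode("ascii")


def decode_image(encoded):
    """Server side: recover the original bytes before preprocessing."""
    return base64.b64decode(encoded)


def top_predictions(scores, k=3):
    """Pick the k highest-scoring labels and serialise them as JSON.

    `scores` maps label -> probability (a made-up stand-in for model output).
    """
    best = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    return json.dumps([{"label": label, "probability": prob} for label, prob in best])


raw = bytes(range(256))                      # arbitrary binary data as a fake image
round_trip_ok = decode_image(encode_image(raw)) == raw

scores = {"tabby": 0.61, "tiger_cat": 0.22, "lynx": 0.09, "fox": 0.05}
result = json.loads(top_predictions(scores))  # what the client would parse
print(round_trip_ok, [entry["label"] for entry in result])
```

Base64 keeps arbitrary binary data intact inside a text-safe message, at the cost of roughly a third more bytes on the wire.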
Let us look at the directories we will be using for the app:
The Flask-ZMQ-App-Folder is the main project directory, in which we have:
- Model-Server-Folder: It contains the virtual environment, the model server code, and the requirements.txt file.
  - model-env: The virtual environment for the model server.
  - resnet_model_server.py: This is the file where we import the pre-trained resnet50 model, receive the encoded image, and predict the classes of the given input image by feeding it to that model. The top 3 predictions are returned to the client as a JSON object.
  - test_client.py: This acts as a temporary client for our resnet_model_server.py server, to check that the two communicate properly and that the predictions are received without any issues. Once this works, we can modify the code in app.py so that it acts as the client to resnet_model_server.py.
  - requirements.txt: The list of all the necessary packages, with their versions, needed to run resnet_model_server.py.
- Flask-Server-Folder: It contains the virtual environment, the Flask server code, and the requirements.txt file.
  - flask-env: The virtual environment for the Flask server.
  - app.py: This is the file where we initialize Flask. It acts as the client to the server defined in resnet_model_server.py.
  - requirements.txt: The list of all the necessary packages, with their versions, needed to run app.py.
  - static: This folder contains static files, like CSS and images.
  - templates: This folder contains the HTML templates for the web pages we render.