This is a research project to build a cloud-native pipeline orchestration platform that runs on a Kubernetes cluster and has the following capabilities:
- the ability to run pipelines on very weak hardware such as a Raspberry Pi.
- at-least-once guarantee of task processing when using the workload scheduler.
- horizontal scaling of each platform component to handle a large number of requests per second (target: 1,000,000) over a large number of established TCP connections with low latency (<50 ms) while consuming relatively few cluster resources (the latency target is to be met at the 99th percentile).
- extensibility through the use of Unix/Windows utilities and the ability to deploy custom applications in a cluster.
The main building block in the system is the pipeline. It is a pod with several containers, shown as a pipeline worker in the diagram. One of these containers is an agent that communicates with the task scheduler over TCP. The agent also acts as a web server for cases where the user runs commands in the pipeline directly via gRPC or HTTP, bypassing the task scheduler. The remaining containers hold the utilities involved in task execution. Work results are passed through the shared volume.
An example of a task sent to the pipeline:
{
  "pipeline": [
    {
      "executorName": "wget",
      "commands": [
        "wget -O /mnt/pipe/2rb88.png https://i.stack.imgur.com/2rb88.png",
        "wget -O /mnt/pipe/text-photographed-eng.jpg https://www.imgonline.com.ua/examples/text-photographed-eng.jpg",
        "wget -O /mnt/pipe/Cleartype-vs-Standard-Antialiasing.gif https://upload.wikimedia.org/wikipedia/commons/b/b8/Cleartype-vs-Standard-Antialiasing.gif"
      ]
    },
    {
      "executorName": "tesseract",
      "commands": [
        "for file in $(ls -v *.*) ; do tesseract $file ${file%.*}.txt; done"
      ]
    },
    {
      "executorName": "mc",
      "commands": [
        "mc mb buckets/5840e11b-2117-4036-a6e6-bcff03fbd3c9",
        "mc cp --recursive /mnt/pipe/ buckets/5840e11b-2117-4036-a6e6-bcff03fbd3c9",
        "rm -r /mnt/pipe/*"
      ]
    }
  ]
}
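A task like the one above could be decoded in Go roughly as follows. The field names come from the JSON example; the `Task` and `Stage` type names, the `ParseTask` helper, and its validation rules are assumptions made for this sketch and may differ from what the agent actually does.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Stage is one step of the pipeline: the executor (container) to run in
// and the shell commands to run there, in order.
type Stage struct {
	ExecutorName string   `json:"executorName"`
	Commands     []string `json:"commands"`
}

// Task mirrors the top-level JSON object of the example.
type Task struct {
	Pipeline []Stage `json:"pipeline"`
}

// ParseTask decodes a task and rejects stages that could not be executed.
// The validation rules are illustrative, not taken from the project.
func ParseTask(data []byte) (Task, error) {
	var task Task
	if err := json.Unmarshal(data, &task); err != nil {
		return Task{}, fmt.Errorf("decode task: %w", err)
	}
	if len(task.Pipeline) == 0 {
		return Task{}, fmt.Errorf("task has no stages")
	}
	for i, s := range task.Pipeline {
		if s.ExecutorName == "" {
			return Task{}, fmt.Errorf("stage %d: empty executorName", i)
		}
		if len(s.Commands) == 0 {
			return Task{}, fmt.Errorf("stage %d (%s): no commands", i, s.ExecutorName)
		}
	}
	return task, nil
}

func main() {
	raw := []byte(`{
	  "pipeline": [
	    {
	      "executorName": "wget",
	      "commands": ["wget -O /mnt/pipe/2rb88.png https://i.stack.imgur.com/2rb88.png"]
	    }
	  ]
	}`)
	task, err := ParseTask(raw)
	if err != nil {
		panic(err)
	}
	for _, s := range task.Pipeline {
		fmt.Printf("%s: %d command(s)\n", s.ExecutorName, len(s.Commands))
	}
}
```

Validating the task before running anything means a malformed stage is rejected up front instead of failing halfway through the pipeline.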
It's no secret that ready-made solutions, in the form of sets of utilities, already exist for a large number of tasks. All you need to do is combine these utilities into a pipeline.
In the simplest case, it is enough to run such a pipeline on the local computer. When the load increases, additional processes with pipelines are launched until all computer resources are exhausted.
As the load grows further, new components have to be added to the system, such as a database, queues, a task scheduler, and an autoscaling mechanism.
Many frameworks and services make this easier. For example, UiPath and Transloadit let you automate routine operations, such as processing video or text, in a matter of hours. But with such systems you eventually run into either insufficient performance or limited options for customizing scenarios.
This project is designed to solve only one task: executing a pipeline whose steps must all complete within a single pod. Despite this obvious limitation compared to workflow engines, the approach makes it possible to:
- scale pods with pipelines easily.
- dramatically reduce network load, because intermediate results do not need to be sent to blob storage.
- customize pipelines extensively.
In developing this solution, various principles were borrowed from Temporal, Apache Airflow, Orleans, Apache Spark, Transloadit and Dapr.
The project consists of several repositories:
- pipeline-manager - contains documentation, CI/CD and links to other repositories.
- pipeline-manager.charts - repository of helm charts used on the project.
- pipeline-manager.app-deployer - a tool for deploying applications that extend the functionality of the platform.
- pipeline-manager.worker.command-executor - the gRPC agent used by pipeline workers to execute processes in pod containers.
- pipeline-manager.pipeline.agent - sidecar container service, responsible for executing commands in other containers in the pod.
- pipeline-manager.workload.scheduler - workload task scheduler.
- pipeline-manager.application.ocr - demo with a text recognition application installed using AppDeployer.
- project.template.golang - golang project template.