Pipeline manager

This is a research project aimed at building a cloud-native pipeline orchestration platform on top of a Kubernetes cluster, with the following capabilities:

the ability to run pipelines on very weak hardware, such as a Raspberry Pi.
an at-least-once guarantee of task processing when using the workload scheduler.
horizontal scaling of each platform component to handle a large number of requests per second (target: 1,000,000) over a large number of established TCP connections, with low latency (<50 ms at the 99th percentile) and relatively low consumption of cluster resources.
extensibility through the use of Unix/Windows utilities and the ability to deploy custom applications in the cluster.
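
To make the at-least-once guarantee concrete, here is a minimal in-memory sketch of the idea: a task is removed only after the worker acknowledges it, and a failed task is queued again. The names AtLeastOnceQueue, lease, ack, nack and drain are hypothetical and are not taken from the workload scheduler; the real scheduler may work quite differently.

```python
import collections


class AtLeastOnceQueue:
    """In-memory illustration: a task is dropped only after an explicit ack."""

    def __init__(self, tasks):
        self._pending = collections.deque(tasks)
        self._in_flight = {}

    def lease(self):
        """Hand out the next task while still remembering it as in flight."""
        if not self._pending:
            return None
        task = self._pending.popleft()
        self._in_flight[task["id"]] = task
        return task

    def ack(self, task_id):
        """The worker confirms completion; only now is the task forgotten."""
        self._in_flight.pop(task_id, None)

    def nack(self, task_id):
        """The worker failed; the task goes back to the end of the queue."""
        task = self._in_flight.pop(task_id, None)
        if task is not None:
            self._pending.append(task)


def drain(queue, handler):
    """Process tasks until none are left, redelivering after each failure.

    Note: a handler that always fails would be retried forever in this sketch.
    """
    attempts = collections.Counter()
    while (task := queue.lease()) is not None:
        attempts[task["id"]] += 1
        try:
            handler(task)
        except Exception:
            queue.nack(task["id"])
        else:
            queue.ack(task["id"])
    return attempts


failed_once = set()


def flaky_handler(task):
    # Fails on the first delivery of task "b" to force one redelivery.
    if task["id"] == "b" and "b" not in failed_once:
        failed_once.add("b")
        raise RuntimeError("simulated worker crash")


attempts = drain(AtLeastOnceQueue([{"id": "a"}, {"id": "b"}]), flaky_handler)
```

Because a task can be delivered more than once under this guarantee, the commands in a pipeline should be safe to repeat.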

The main building block in the system is the pipeline: a pod with several containers, shown as a pipeline worker in the diagram. One of these containers is an agent that communicates with the task scheduler over TCP. The agent also acts as a web server for cases where the user runs commands in the pipeline directly via gRPC or HTTP, bypassing the task scheduler. The remaining containers hold the utilities involved in task execution. Results are passed through the shared volume.

An example of a task sent to the pipeline:

{
  "pipeline": [
    {
      "executorName": "wget",
      "commands": [
        "wget -O /mnt/pipe/2rb88.png https://i.stack.imgur.com/2rb88.png",
        "wget -O /mnt/pipe/text-photographed-eng.jpg https://www.imgonline.com.ua/examples/text-photographed-eng.jpg",
        "wget -O /mnt/pipe/Cleartype-vs-Standard-Antialiasing.gif https://upload.wikimedia.org/wikipedia/commons/b/b8/Cleartype-vs-Standard-Antialiasing.gif"
      ]
    },
    {
      "executorName": "tesseract",
      "commands": [
        "for file in $(ls -v *.*) ; do tesseract $file ${file%.*}.txt; done"
      ]
    },
    {
      "executorName": "mc",
      "commands": [
        "mc mb buckets/5840e11b-2117-4036-a6e6-bcff03fbd3c9",
        "mc cp --recursive /mnt/pipe/ buckets/5840e11b-2117-4036-a6e6-bcff03fbd3c9",
        "rm -r /mnt/pipe/*"
      ]
    }
  ]
}
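
As a rough illustration of how an agent could walk through such a task, the sketch below parses the JSON and runs each step's commands in order. The function run_pipeline and the run_in_container callback are hypothetical stand-ins for the agent and for the call that executes a command in another container. Stopping at the first non-zero exit code is an assumption, not documented behavior.

```python
import json


def run_pipeline(task_json, run_in_container):
    """Run every step of a task in order and stop at the first failure.

    run_in_container(executor_name, command) is a hypothetical callback that
    executes one command in the named container and returns its exit code.
    """
    task = json.loads(task_json)
    results = []
    for step in task["pipeline"]:
        for command in step["commands"]:
            exit_code = run_in_container(step["executorName"], command)
            results.append((step["executorName"], command, exit_code))
            if exit_code != 0:
                # Assumption: later steps rely on files written by earlier
                # ones, so a failed command aborts the rest of the pipeline.
                return results
    return results


# Usage with a stand-in runner that only records which executor was called.
example_task = json.dumps({
    "pipeline": [
        {"executorName": "wget",
         "commands": ["wget -O /mnt/pipe/a.png https://example.com/a.png"]},
        {"executorName": "tesseract",
         "commands": ["tesseract /mnt/pipe/a.png /mnt/pipe/a"]},
    ]
})

calls = []


def fake_runner(executor_name, command):
    calls.append(executor_name)
    return 0


results = run_pipeline(example_task, fake_runner)
```

Passing the runner in as a callback keeps the sketch independent of the transport, so the same loop works whether commands arrive from the scheduler or directly from a user.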

Motivation

It's no secret that ready-made solutions, in the form of sets of utilities, already exist for a large number of tasks. All you need to do is combine these utilities into a pipeline.

In the simplest case, it is enough to run such a pipeline on the local computer. When the load increases, additional processes with pipelines are launched until all computer resources are exhausted.

As the load grows further, new components have to be added to the system, such as a database, queues, a task scheduler and an autoscaling mechanism.

There are many frameworks and services that make this easier. For example, UiPath and Transloadit let you automate routine operations, such as processing video or text, in a matter of hours. But after using such systems for a while, you run into either insufficient performance or limited options for customizing scenarios.

This project is designed to solve only one task: executing a pipeline whose steps must all complete within a single pod. Despite this obvious limitation compared to workflow engines, the approach allows:

easy scaling of pods with pipelines.
a dramatic reduction in network load, because intermediate results do not need to be sent to blob storage.
extensive customization of pipelines.

In developing this solution, various principles were borrowed from Temporal, Apache Airflow, Orleans, Apache Spark, Transloadit and Dapr.

Project structure

The project consists of several repositories:

pipeline-manager - contains documentation, CI/CD and links to other repositories.
pipeline-manager.charts - repository of helm charts used on the project.
pipeline-manager.app-deployer - a tool for deploying applications that extend the functionality of the platform.
pipeline-manager.worker.command-executor - the gRPC agent used by Pipeline workers to execute processes in pod containers.
pipeline-manager.pipeline.agent - sidecar container service, responsible for executing commands in other containers in the pod.
pipeline-manager.workload.scheduler - workload task scheduler.
pipeline-manager.application.ocr - demo with a text recognition application installed using AppDeployer.
project.template.golang - golang project template.

