RyazanovAlexander/pipeline-manager

Pipeline manager

This is a research project aimed at building a cloud-native pipeline orchestration platform on top of a Kubernetes cluster, with the following capabilities:

  • the ability to run pipelines on very weak hardware such as a Raspberry Pi.
  • an at-least-once guarantee of task processing when using the workload scheduler.
  • horizontal scaling of each platform component to handle a large number of requests per second (target: 1,000,000) over a large number of established TCP connections with low latency (<50 ms at the 99th percentile), while consuming relatively few cluster resources.
  • extensibility through Unix/Windows utilities and the ability to deploy custom applications in the cluster.

[diagram: main]

The main building block in the system is the pipeline: a pod with several containers, shown as a pipeline worker in the diagram. One of these containers is an agent that interacts with the task scheduler over TCP. The agent also acts as a web server for cases when the user runs commands in the pipeline directly via gRPC or HTTP, without going through the task scheduler. The remaining containers hold the utilities involved in task execution. Intermediate results are passed between them through a shared volume.

[diagram: pipeline]
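The pod layout described above can be sketched as a Kubernetes manifest. Everything here is an illustrative assumption rather than taken from the project: the image names, the ports, and the /mnt/pipe mount path are placeholders.

```yaml
# Sketch only: images, ports, and paths are assumptions, not the project's actual manifest.
apiVersion: v1
kind: Pod
metadata:
  name: pipeline-worker
spec:
  containers:
    - name: agent                    # talks to the task scheduler over TCP;
      image: example/agent:latest    # also serves gRPC/HTTP for direct calls
      ports:
        - containerPort: 8080        # HTTP
        - containerPort: 9090        # gRPC
      volumeMounts:
        - name: pipe
          mountPath: /mnt/pipe
    - name: wget                     # utility containers share the same volume
      image: example/wget:latest
      volumeMounts:
        - name: pipe
          mountPath: /mnt/pipe
  volumes:
    - name: pipe
      emptyDir: {}                   # shared volume for intermediate results
```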

An example of a task sent to the pipeline:

{
  "pipeline": [
    {
      "executorName": "wget",
      "commands": [
        "wget -O /mnt/pipe/2rb88.png https://i.stack.imgur.com/2rb88.png",
        "wget -O /mnt/pipe/text-photographed-eng.jpg https://www.imgonline.com.ua/examples/text-photographed-eng.jpg",
        "wget -O /mnt/pipe/Cleartype-vs-Standard-Antialiasing.gif https://upload.wikimedia.org/wikipedia/commons/b/b8/Cleartype-vs-Standard-Antialiasing.gif"
      ]
    },
    {
      "executorName": "tesseract",
      "commands": [
        "for file in $(ls -v *.*) ; do tesseract $file ${file%.*}; done"
      ]
    },
    {
      "executorName": "mc",
      "commands": [
        "mc mb buckets/5840e11b-2117-4036-a6e6-bcff03fbd3c9",
        "mc cp --recursive /mnt/pipe/ buckets/5840e11b-2117-4036-a6e6-bcff03fbd3c9",
        "rm -r /mnt/pipe/*"
      ]
    }
  ]
}
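As a sketch of what the agent might do with such a task, the following Python snippet parses the JSON and lists the stages in execution order. The `plan` function is hypothetical; only the field names (`pipeline`, `executorName`, `commands`) come from the example above, and the command lists are shortened.

```python
import json

# Shortened version of the task above; field names match the example.
TASK = """
{
  "pipeline": [
    {"executorName": "wget",
     "commands": ["wget -O /mnt/pipe/2rb88.png https://i.stack.imgur.com/2rb88.png"]},
    {"executorName": "tesseract",
     "commands": ["for file in $(ls -v *.*) ; do tesseract $file ${file%.*}; done"]},
    {"executorName": "mc",
     "commands": ["mc cp --recursive /mnt/pipe/ buckets/5840e11b-2117-4036-a6e6-bcff03fbd3c9"]}
  ]
}
"""

def plan(task_json: str) -> list[tuple[str, list[str]]]:
    """Return (executor, commands) pairs in the order the agent would run them."""
    task = json.loads(task_json)
    return [(stage["executorName"], stage["commands"]) for stage in task["pipeline"]]

for executor, commands in plan(TASK):
    print(f"{executor}: {len(commands)} command(s)")
```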

Motivation

It's no secret that for a large number of tasks there are already ready-made solutions in the form of existing utilities. All you need to do is combine those utilities into a pipeline.

In the simplest case, it is enough to run such a pipeline on the local computer. When the load increases, additional pipeline processes are launched until the computer's resources are exhausted.

As the load grows further, new components must be added to the system: a database, queues, a task scheduler, and an autoscaling mechanism.

[diagram: program-evolution]

There are many frameworks and services that make this task easier. For example, UiPath and Transloadit let you automate routine operations, such as processing video or text, in a matter of hours. But after using such systems for a while, you run into either insufficient performance or limited options for customizing scenarios.

This project is designed to solve only one task: executing a pipeline whose steps all run entirely within a single pod. Despite this obvious limitation compared to workflow engines, the approach allows you to:

  • scale pods with pipelines exceptionally well.
  • dramatically reduce network load, since intermediate results never need to be sent to blob storage.
  • customize pipelines extensively.

In developing this solution, various principles were borrowed from Temporal, Apache Airflow, Orleans, Apache Spark, Transloadit and Dapr.

Project structure

The project consists of several repositories: