| title | authors | reviewers | creation-date | last-updated |
| ----- | ------- | --------- | ------------- | ------------ |
| Support Netqos | @lucming | @zwzhang0107, @hormes, @eahydra, @FillZpp, @jasonliu747, @ZiMengSheng, @l1b0k | 2023-12-08 | 2023-12-08 |

Support Netqos

Table of Contents

Glossary

ebpf
ebpf tc
edt
terway-qos
net_cls cgroup
tc (traffic control)
ipset

Summary

Netqos is designed to resolve the container network bandwidth contention problem in colocation (mixed deployment) scenarios. It supports limiting bandwidth per pod and by priority at the node level, with the aim of improving QoS (quality of service).

Motivation

Currently, network bandwidth is not taken into account in koordinator, which leads to several potential problems:

  1. Low network bandwidth utilisation;
  2. Uneven distribution of network bandwidth load across cluster nodes;
  3. The QoS of high-priority processes cannot be guaranteed on a single machine.

This proposal is mainly designed to solve the node-side container network bandwidth contention problem in colocation scenarios.

Goals

  • Limit the amount of network bandwidth a pod can use, in both the ingress and egress directions.
  • Allow multiple containers on the same node to share and preempt network bandwidth. Design guideline: when the network bandwidth load is low, offline containers may use all of the bandwidth; when the network bandwidth load is high, online containers get priority access to the bandwidth.
  • Define an API/config for network QoS that can work with external plugins, such as terway-qos, or built-in plugins (implemented with tc in the future).
  • Implement a netqos plugin based on tc as the built-in netqos plugin.
  • Adapt some external netqos plugins, such as terway-qos.
  • Allow other external netqos plugins to reuse this API.

Non-Goals/Future Work

  • Propose a netqos implementation based on terway-qos;
  • Scheduling/descheduling based on network bandwidth in a k8s cluster;
  • Suppression/eviction based on network bandwidth on the node;
  • Observability: add metrics for the network QoS plugin itself.

User Stories

Story 1

As a cluster manager, I hope the network bandwidth distribution across the entire cluster is well balanced, avoiding nodes with excessively high network load, which leads to resource contention and affects the QoS of containers.

Story 2

As a cluster manager, I would like to improve node resource utilisation in a k8s cluster by deploying both online and offline services on the same nodes. When network resources are idle, offline services can use as much node bandwidth as possible. When network resources are contended, network resources for online services are guaranteed first, while also making sure that offline services are not starved.

Story 3

When node containers experience severe bandwidth contention, administrators can temporarily adjust individual pod bandwidth limits without rebuilding the pod.

Story 4

As a user, I would like to have a variety of netqos implementations to choose from, to fit different nodes.

Design

Design Principles

  • The netqos implementation should be extensible: terway-qos will be adapted first, but the design must remain compatible with other netqos solutions, such as tc.
  • The netqos feature should be pluggable, and users can configure whether to enable it or not.

Implementation Details

koordlet:

api:
node level:

In colocation scenarios, we want to guarantee bandwidth for online business to avoid contention, while during idle periods offline business should also be able to utilize as much of the bandwidth as possible.
We will extend the fields of NodeSLO with new parameters related to network bandwidth, as follows:

type NodeSLOSpec struct {
	// QoS config strategy for pods of different qos-class
	ResourceQOSStrategy *ResourceQOSStrategy `json:"resourceQOSStrategy,omitempty"`
	// node global system config
	SystemStrategy *SystemStrategy `json:"systemStrategy,omitempty"`
}

type ResourceQOSStrategy struct {
	// Policies of pod QoS.
	Policies *ResourceQOSPolicies `json:"policies,omitempty"`

	// ResourceQOS for LSR pods.
	LSRClass *ResourceQOS `json:"lsrClass,omitempty"`

	// ResourceQOS for LS pods.
	LSClass *ResourceQOS `json:"lsClass,omitempty"`

	// ResourceQOS for BE pods.
	BEClass *ResourceQOS `json:"beClass,omitempty"`

	// ResourceQOS for system pods
	SystemClass *ResourceQOS `json:"systemClass,omitempty"`

	// ResourceQOS for root cgroup.
	CgroupRoot *ResourceQOS `json:"cgroupRoot,omitempty"`
}

type ResourceQOS struct {
	...
	NetworkQOS *NetworkQOSCfg `json:"networkQOS,omitempty"`
}

type NetworkQOSCfg struct {
	Enable     *bool `json:"enable,omitempty"`
	NetworkQOS `json:",inline"`
}

type NetworkQOS struct {
	// IngressRequest describes the minimum network bandwidth guaranteed in the ingress direction.
	// unit: bps (bytes per second). Two expressions are supported: int and string.
	// int: a percentage of the total bandwidth, valid in the range 0-100.
	// string: an absolute bandwidth value, e.g. 50M.
	// +kubebuilder:default=0
	IngressRequest *intstr.IntOrString `json:"ingressRequest,omitempty"`
	// IngressLimit describes the maximum network bandwidth that can be used in the ingress direction.
	// unit: bps (bytes per second). Two expressions are supported: int and string.
	// int: a percentage of the total bandwidth, valid in the range 0-100.
	// string: an absolute bandwidth value, e.g. 50M.
	// +kubebuilder:default=100
	IngressLimit *intstr.IntOrString `json:"ingressLimit,omitempty"`

	// EgressRequest describes the minimum network bandwidth guaranteed in the egress direction.
	// unit: bps (bytes per second). Two expressions are supported: int and string.
	// int: a percentage of the total bandwidth, valid in the range 0-100.
	// string: an absolute bandwidth value, e.g. 50M.
	// +kubebuilder:default=0
	EgressRequest *intstr.IntOrString `json:"egressRequest,omitempty"`
	// EgressLimit describes the maximum network bandwidth that can be used in the egress direction.
	// unit: bps (bytes per second). Two expressions are supported: int and string.
	// int: a percentage of the total bandwidth, valid in the range 0-100.
	// string: an absolute bandwidth value, e.g. 50M.
	// +kubebuilder:default=100
	EgressLimit *intstr.IntOrString `json:"egressLimit,omitempty"`
}

type SystemStrategy struct {
	...
	// TotalNetworkBandwidth indicates the overall network bandwidth. The cluster manager can set this field via the "slo-controller-config" configmap;
	// the default value is taken from /sys/class/net/${NIC_NAME}/speed, unit: Mbps
	TotalNetworkBandwidth resource.Quantity `json:"totalNetworkBandwidth,omitempty"`
}
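
As a minimal sketch (not part of the API; the helper name resolveBandwidth is illustrative), a netqos plugin could resolve these IntOrString fields into absolute values against the node's total bandwidth roughly as follows:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// resolveBandwidth resolves one NetworkQOS field: an int is treated as a
// percentage (0-100) of the total bandwidth, a string (e.g. "50M") is parsed
// as a resource.Quantity.
func resolveBandwidth(v *intstr.IntOrString, total resource.Quantity) (int64, error) {
	if v == nil {
		return 0, nil
	}
	if v.Type == intstr.Int {
		pct := int64(v.IntValue())
		if pct < 0 || pct > 100 {
			return 0, fmt.Errorf("percentage out of range: %d", pct)
		}
		return total.Value() * pct / 100, nil
	}
	q, err := resource.ParseQuantity(v.StrVal)
	if err != nil {
		return 0, err
	}
	return q.Value(), nil
}

func main() {
	total := resource.MustParse("1000M")
	limit := intstr.FromInt(40) // 40% of the total bandwidth
	abs, _ := resolveBandwidth(&limit, total)
	fmt.Println(abs) // 400000000
}
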
pod level:

This is for fine-grained network bandwidth control of the containers in a pod.
We will declare the pod-level netqos configuration via the pod annotation koordinator.sh/networkQOS, with the following API definition:

type PodNetworkQOS struct {
	NetworkQOS
	QoSClass        extension.QoSClass // BE/LS/LSR
	// todo: network bandwidth limiting & preemption based on container port
	// PortsNetworkQOS []PortNetworkQOS
}

// todo: netqos api based on container port.
type PortNetworkQOS struct {
	NetworkQOS
	Port     int
	QoSClass extension.QoSClass // BE/LS/LSR
}

The netqos plugins then implement the actual bandwidth limiting based on the API above.
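
A minimal sketch (illustrative only; the annotation key comes from the text above, while the constant and struct shown here are not the final implementation) of how this annotation could be written on a pod:

package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

const AnnotationNetworkQOS = "koordinator.sh/networkQOS"

// networkQOS mirrors only the limit fields of the NetworkQOS API above.
type networkQOS struct {
	IngressLimit *intstr.IntOrString `json:"ingressLimit,omitempty"`
	EgressLimit  *intstr.IntOrString `json:"egressLimit,omitempty"`
}

func main() {
	limit := intstr.FromString("50M")
	cfg := networkQOS{IngressLimit: &limit, EgressLimit: &limit}

	data, _ := json.Marshal(cfg)
	pod := &corev1.Pod{}
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations[AnnotationNetworkQOS] = string(data)
	fmt.Println(pod.Annotations[AnnotationNetworkQOS]) // {"ingressLimit":"50M","egressLimit":"50M"}
}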

supported plugins:
external plugins:
  • terway-qos:

    terway-qos defines three priorities on the node; they are used to limit and guarantee the network bandwidth that containers of different priorities can use.

    • for node:

      koordinator and terway-qos interact through a configuration file at /var/run/koordinator/net/node, with the following content:

      {
        "hw_tx_bps_max": 0,
        "hw_rx_bps_max": 0,
        "l1_tx_bps_min": 0,
        "l1_tx_bps_max": 0,
        "l2_tx_bps_min": 0,
        "l2_tx_bps_max": 0,
        "l1_rx_bps_min": 0,
        "l1_rx_bps_max": 0,
        "l2_rx_bps_min": 0,
        "l2_rx_bps_max": 0
      }

      and the corresponding API in koordinator looks like:

      type NetQosGlobalConfig struct {
        HwTxBpsMax uint64 `json:"hw_tx_bps_max"`
        HwRxBpsMax uint64 `json:"hw_rx_bps_max"`
        L1TxBpsMin uint64 `json:"l1_tx_bps_min"`
        L1TxBpsMax uint64 `json:"l1_tx_bps_max"`
        L2TxBpsMin uint64 `json:"l2_tx_bps_min"`
        L2TxBpsMax uint64 `json:"l2_tx_bps_max"`
        L1RxBpsMin uint64 `json:"l1_rx_bps_min"`
        L1RxBpsMax uint64 `json:"l1_rx_bps_max"`
        L2RxBpsMin uint64 `json:"l2_rx_bps_min"`
        L2RxBpsMax uint64 `json:"l2_rx_bps_max"`
      }

      In the config file above, the unit of each field is bps (bytes per second). There are three priorities, l0, l1 and l2; the higher the number, the lower the priority, and the default is l0. The maximum bandwidth of l0 is the overall network bandwidth, l0.min = total - l1.min - l2.min, and l1 and l2 cannot exceed their own bandwidth limits. When the load is high, high-priority (l0) containers are guaranteed network bandwidth first; when the load is low, low-priority (l2) containers are given as much network bandwidth as possible.
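
      As a minimal sketch (assuming the NodeSLO percentages have already been resolved into absolute byte-per-second values; the helper name is illustrative), koordlet could populate and write this file roughly as follows:

      import (
          "encoding/json"
          "os"
      )

      // writeNodeNetQosConfig reuses the NetQosGlobalConfig type above.
      // l0 stays implicit: its max is the total bandwidth and its min is
      // total - l1.min - l2.min, as described above.
      func writeNodeNetQosConfig(totalBps, l1MinBps, l1MaxBps, l2MinBps, l2MaxBps uint64) error {
          cfg := NetQosGlobalConfig{
              HwTxBpsMax: totalBps,
              HwRxBpsMax: totalBps,
              L1TxBpsMin: l1MinBps,
              L1TxBpsMax: l1MaxBps,
              L2TxBpsMin: l2MinBps,
              L2TxBpsMax: l2MaxBps,
              L1RxBpsMin: l1MinBps,
              L1RxBpsMax: l1MaxBps,
              L2RxBpsMin: l2MinBps,
              L2RxBpsMax: l2MaxBps,
          }
          data, err := json.Marshal(cfg)
          if err != nil {
              return err
          }
          return os.WriteFile("/var/run/koordinator/net/node", data, 0644)
      }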

    • for pod:
      koordinator will sync the per-pod configuration to /var/run/koordinator/net/pods, and then terway-qos or another netqos plugin (e.g. tc) will limit the network bandwidth that the container can use. The content is as follows (qos-config is a TODO for port/dscp based limiting, see below):

      {
        "cgroup":"/sys/fs/cgroup/xxxx",
        "priority":0,
        "pod": "namespacedname",
        "podUID":"xxx",
        "qos-config": {}
      }

      TODO: The qos-config is used to define the configuration for network bandwidth limitation and preemption based on port/dscp, which may look like this:

      {
          "ingress": [
              {
                  "matchs": [{
                      "type": "ip"
                  }],
                  "actions": [
                      {
                          "action": "qos-class",
                          "value": "l1"
                      }
                  ]
              }
          ],
          "egress": [
              {
                  "matchs": [{
                      "type": "port",
                      "expr": "=80"
                  }],
                  "actions": [
                      {
                          "action": "qos-class",
                          "value": "l1"
                      },
                      {
                          "action": "dscp",
                          "value": ""
                      },
                      {
                          "action": "bandwidth_min",
                          "value": "1000"
                      },
                      {
                          "action": "bandwidth_max",
                          "value": "1000"
                      }
                  ]
              }
          ]
      }
  • other external netqos plugins ...

internal plugins:
  • TC: (built-in plugin for koordinator, a netqos solution based on Linux itself.)

    • for node

      • For pods that do not use host networking, network speed limiting is implemented as follows:
        [diagram]

        When koordlet starts, it initialises a set of rules according to the netqos configuration in NodeSLO, mainly tc, iptables and ipset rules; the actual rate limiting is done by the tc qdisc, while the other rules are auxiliary.

        notes:
        The tc rules are set on the NIC that corresponds to the host's default route; the case of multiple NICs has not been handled yet.
        todo: network speed limiting across multiple NICs.

        Each tc class corresponds to an ipset rule. The ipset declares a group of pods; this group of pods has the same tc class priority and shares the network bandwidth of that tc class. By default, each tc class can use up all the network bandwidth of the node. Three classes are defined, system_class/ls_class/be_class, and each pod is matched to one tc class (a sketch of keeping the ipset membership in sync with pods follows the shell commands below).

        On the reason for using iptables:
        case: if packets are marked only by configuring the net_cls cgroup, packets from the container do not make it to the desired tc class.
        reason: when a packet travels from the container's network namespace to the host network namespace it loses its classid (the classid only acts as a marker and does not exist on the skb struct, so the mark is dropped during transmission). The packet therefore cannot enter the desired tc class, and the rate limit fails.
        way to resolve: add the mark back in some other way, e.g. via iptables.
        On the reason for using ipset:
        It is mainly used for IP grouping, so that iptables can mark packets based on ipset objects instead of creating one iptables rule per pod, which improves performance.

        The specific traffic distribution is as follows:
        [diagram]

        Logic for how the htb qdisc selects a class:

        1. The htb algorithm starts at the bottom of the class tree and works its way up, looking for classes in the CAN_SEND state.
        2. If more than one class in a layer is in the CAN_SEND state, the class with the highest priority (lowest value) is selected. After a class has sent its quantum of bytes, the next class gets its turn to send.

        Configuration of parameters for the class corresponding to each pod priority:

        | PRIO     | SYSTEM | LS   | BE   |
        | -------- | ------ | ---- | ---- |
        | net_prio | 0      | 1    | 2    |
        | net_cls  | 1:2    | 1:3  | 1:4  |
        | htb.rate | 40%    | 30%  | 30%  |
        | htb.ceil | 100%   | 100% | 100% |

        Of course, the rules can also be configured via shell commands as follows:

        # With an entire network bandwidth of 1000Mbit, the following rules are created.
        tc qdisc add dev eth0 root handle 1:0 htb default 1
        tc class add dev eth0 parent 1:0 classid 1:1 htb rate 1000Mbit
        tc class add dev eth0 parent 1:1 classid 1:2 htb rate 400Mbit ceil 1000Mbit prio 0
        tc class add dev eth0 parent 1:1 classid 1:3 htb rate 300Mbit ceil 1000Mbit prio 1
        tc class add dev eth0 parent 1:1 classid 1:4 htb rate 300Mbit ceil 1000Mbit prio 2
        ipset create system_class hash:net
        iptables -t mangle -A POSTROUTING -m set --match-set system_class src  -j CLASSIFY --set-class 1:2
        ipset create ls_class hash:net
        iptables -t mangle -A POSTROUTING -m set --match-set ls_class src  -j CLASSIFY --set-class 1:3
        ipset create be_class hash:net
        iptables -t mangle -A POSTROUTING -m set --match-set be_class src  -j CLASSIFY --set-class 1:4
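
        To keep the ipset membership consistent with the pods on the node, koordlet could add each pod IP to the ipset of its QoS class. A minimal sketch (assuming the tc/ipset/iptables rules above already exist; the helper name is illustrative):

        import (
            "fmt"
            "os/exec"
        )

        // syncPodIPToClass adds a pod IP to the ipset of its QoS class
        // (system_class, ls_class or be_class), so that the iptables rules
        // above classify its traffic into the matching tc class.
        func syncPodIPToClass(setName, podIP string) error {
            // -exist makes the call idempotent if the entry is already present.
            out, err := exec.Command("ipset", "-exist", "add", setName, podIP).CombinedOutput()
            if err != nil {
                return fmt.Errorf("ipset add %s %s failed: %v, output: %s", setName, podIP, err, out)
            }
            return nil
        }
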
      • For pods using host networking, the speed limit can be achieved directly through the net_cls cgroup (since there is no network namespace switching, the packet's classid mark is not lost, so the packet can enter the specific class and the rate limit works).

      todo:
      Limit the network bandwidth in the ingress direction.

      tc rate limiting in the ingress direction requires redirecting the traffic to an ifb device and limiting the egress rate of that ifb device. This only limits the rate at which traffic is delivered to the application; the traffic has in fact already arrived at the physical device, so whether this is necessary can be decided based on the business scenario.

      The shell commands are as follows:

      # the ifb module needs to be loaded manually.
      modprobe ifb
      
      # enable virtual device ifb0
      ip link set dev ifb0 up
      
      # configure filter rules for ifb0
      tc qdisc add dev eth0 handle ffff: ingress
      tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 action mirred egress redirect dev ifb0
      tc qdisc add dev ifb0 root handle 1: htb default 10
      tc class add dev ifb0 parent 1: classid 1:1 htb rate 10000mbit
      tc class add dev ifb0 parent 1:1 classid 1:10 htb rate 1000mbit ceil 1000mbit

koord-scheduler

A NetBandwidth scheduler plugin needs to be added to score nodes according to their network bandwidth load. The higher a node's network bandwidth load, the lower its score, so that newly created pods are scheduled to nodes with relatively idle network bandwidth.

score = (node.capacity.netbandwidth - node.netbandwidth.used) * framework.MaxNodeScore / node.capacity.netbandwidth
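
A minimal sketch of this scoring function (the function and parameter names are illustrative, not the final plugin implementation):

package netbandwidth

import "k8s.io/kubernetes/pkg/scheduler/framework"

// score implements the formula above: the more free network bandwidth a node
// has, the higher its score, up to framework.MaxNodeScore (100) for a fully
// idle node.
func score(capacityBps, usedBps int64) int64 {
	if capacityBps <= 0 {
		return 0
	}
	free := capacityBps - usedBps
	if free < 0 {
		free = 0
	}
	return free * framework.MaxNodeScore / capacityBps
}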

koord-descheduler

The LowNodeLoad descheduler plugin needs to take the actual network bandwidth load of nodes into account when balancing. We need to add netBandwidth thresholds to the arguments of the LowNodeLoad plugin in koord-descheduler-config.yaml.

apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-config
  namespace: system
data:
  koord-descheduler-config: |
    ...
      - name: LowNodeLoad
        args:
          ...
          lowThresholds:
            netBandwidth: **
          highThresholds:
            netBandwidth: **

usage:

Cluster administrators can configure cluster- or node-level network bandwidth via slo-controller-config.yaml. If the node bandwidth is not configured, the network bandwidth reported by koordlet is used. The default network bandwidth request percentages are l0:l1:l2 = 40%:30%:30%, and all limits default to 100%; administrators can adjust these values themselves.

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: kube-system
data:
  colocation-config: |
    {
      "enable": true
    }
  resource-threshold-config: |
    {
      "clusterStrategy": {
        "enable": true
      }
    }
  resource-qos-config: |
    {
      "clusterStrategy": {
        "lsrClass": {
          "networkQOS": {
            "enable": true,
            "ingressRequest": 40,
            "ingressLimit": 100,
            "egressRequest": 40,
            "egressLimit": 100
          }
        },
        "lsClass": {
          "networkQOS": {
            "enable": true,
            "ingressRequest": 40,
            "ingressLimit": 100,
            "egressRequest": 40,
            "egressLimit": 100
          }
        },
        "beClass": {
          "networkQOS": {
            "enable": true,
            "ingressRequest": 30,
            "ingressLimit": 100,
            "egressRequest": 30,
            "egressLimit": 100
          }
        }
      }
    }
  system-config: |-
    {
      "clusterStrategy": {
        "totalNetworkBandwidth": "1000M"
      }
    }

Implementation History

  • 12/08/2023: Open PR for initial draft