
resctrl

ahetheri edited this page Dec 7, 2016 · 13 revisions

1 Resource Control

1.1 resctrl

As of kernel 4.10, resctrl has been added. Resource Control works in a similar way to Intel RDT CAT: it writes to MSRs to enforce cache partitioning. When a task is scheduled onto a CPU, the kernel writes that task's partitioning configuration (its class of service) to the appropriate MSR, and the hardware then restricts cache allocations to the corresponding portion of the cache. This enforcement works on a per-core and per-task/process ID (PID) basis. If you wish to use this before the release of kernel 4.10 you can download it from kernel.googlesource x86/cache

1.1.1 Resctrl structure

Resource control is a mountable virtual file system located at /sys/fs/resctrl. To mount this file system, use the command:

# mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl

When selected, the cdp mount option allows for code/data prioritization in L3 cache allocations. Once mounted, the default resource group and an info directory become visible.

Info Directory

The info directory contains hardware-specific information on a per-cache-level basis. For example, all information on L3 cache is located in info/L3. Each subdirectory contains the following files:

“num_closids”:

  • The number of unique COS configurations available for this resource (L3/L2/CDP). The kernel uses the smallest number of COS for all enabled resources as the limit.

“cbm_mask”:

  • The max bit mask available for this resource.

“min_cbm_bits”:

  • The minimum number of consecutive bits that can be set for this resource.
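The kernel rejects a mask that is not a single contiguous run of set bits at least "min_cbm_bits" wide. As a sketch of that rule (this helper is our own, not part of resctrl), a mask can be validated before writing it to a schemata file:

```shell
# Check that a hex capacity bitmask (CBM) is one contiguous run of set
# bits of at least $2 bits, as resctrl requires. Our own helper, not a
# kernel interface.
is_valid_cbm() {
  mask=$((0x$1)) min=$2
  [ "$mask" -ne 0 ] || return 1
  # Drop trailing zero bits, then count the run of ones.
  while [ $((mask & 1)) -eq 0 ]; do mask=$((mask >> 1)); done
  bits=0
  while [ $((mask & 1)) -eq 1 ]; do mask=$((mask >> 1)); bits=$((bits + 1)); done
  # Contiguous iff nothing is left after the run, and the run is wide enough.
  [ "$mask" -eq 0 ] && [ "$bits" -ge "$min" ]
}

is_valid_cbm f8000 2 && echo "f8000: valid"
is_valid_cbm f0f 2 || echo "f0f: invalid (not contiguous)"
```

Writing a non-contiguous mask such as f0f to a schemata file fails with EINVAL on hardware that requires contiguous masks.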

Resource groups

Resource groups are represented as directories in the resctrl file system. The default group is the root directory. Other groups may be created as desired by the system administrator using the "mkdir(1)" command, and removed using "rmdir(1)". There are three files associated with each group:

"tasks":

A list of tasks that belong to this group. Tasks can be added to a group by writing the task ID (PID) to the "tasks" file (which automatically removes them from the group to which they previously belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. If a PID is not in any sub-partition, it is in the root (i.e. default) partition.

"cpus":

A bitmask of logical CPUs assigned to this group. Writing a new mask can add/remove CPUs from this group. Added CPUs are removed from their previous group. Removed ones are given to the default (root) group. You cannot remove CPUs from the default group.

"schemata":

A list of all the resources available to this group. Each resource has its own line and format. The format consists of a mask that controls access to the resource. For example, a schemata for L3 cache will have a mask representing the cache ways available.

When a task is running the following rules define which resources are available to it:

  1. If the task is a member of a non-default group, then the schemata for that group is used.

  2. Else if the task belongs to the default group, but is running on a CPU that is assigned to some specific group, then the schemata for the CPU's group is used.

  3. Otherwise the schemata for the default group is used.
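The three rules above can be sketched as a small decision function (the helper name and the "default" placeholder for the root group are our own, purely for illustration):

```shell
# Given the task's resource group and the group its current CPU belongs
# to, print the group whose schemata applies. "default" stands for the
# root group. Our own sketch of the kernel's selection rules.
effective_group() {
  task_grp=$1 cpu_grp=$2
  if [ "$task_grp" != "default" ]; then
    echo "$task_grp"      # rule 1: a non-default task group wins
  elif [ "$cpu_grp" != "default" ]; then
    echo "$cpu_grp"       # rule 2: otherwise the CPU's group applies
  else
    echo "default"        # rule 3: fall back to the default group
  fi
}

effective_group default p0   # prints p0
```

Note that rule 1 means moving a task into a group overrides any CPU-based assignment for that task.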

Figure 1 shows the structure of resctrl

resctrl Structure

1.1.2 How to use resctrl

1.1.2.1 Overview

As explained above, new resource groups are created using mkdir. Once a new resource group is created, the cpus, tasks and schemata files are automatically generated. The first free class of service (COS) is then associated with this new resource group. For example, the default resource group is associated with COS 0, the first new resource group created is given COS 1, and so on.

If a resource group is removed (rmdir), then the COS it was associated with is freed up to be used again.

CPU file

As explained earlier, the cpus file contains a mask of the CPUs associated with this resource group. To edit this association, simply echo the new mask into the cpus file. For example, if your system has four logical cores, the mask for all four will be 1111 in binary or f in hex. To assign two of the cores to COS 1, type the following commands.

# mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir COS1
# echo "3" > COS1/cpus

This will remove cores 0 and 1 from the default group (COS 0) and associate them with COS1, so the files will look like the following.

# cat cpus
	c
# cat COS1/cpus
	3
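Building these hex masks by hand gets error-prone beyond a few cores. A small helper (the name cores_to_mask is our own) can derive the value to echo into a cpus file from a list of core numbers:

```shell
# Convert a list of logical core numbers into the hex CPU mask expected
# by a resource group's "cpus" file. Our own convenience sketch.
cores_to_mask() {
  mask=0
  for c in "$@"; do
    mask=$((mask | (1 << c)))   # set the bit for each listed core
  done
  printf '%x\n' "$mask"
}

cores_to_mask 0 1        # prints 3
cores_to_mask 4 5 6 7    # prints f0
```

So `echo "$(cores_to_mask 0 1)" > COS1/cpus` is equivalent to the `echo "3"` above.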

Task file

The default tasks file contains all the PIDs of tasks running on the system. Like the cpus file above, if you move a task to a different resource group, its PID is removed from the default resource group.

Schemata file

Each line in the file describes one resource. The line starts with the name of the resource, followed by specific values to be applied in each of the instances of that resource on the system.

Cache IDs

On current generation systems there is one L3 cache per socket, and L2 caches are generally just shared by the hyperthreads on a core, but this isn't an architectural requirement: there could be multiple separate L3 caches on a socket, or multiple cores could share an L2 cache. So instead of using "socket" or "core" to define the set of logical CPUs sharing a resource, we use a "cache ID". At a given cache level this will be a unique number across the whole system (but it isn't guaranteed to be a contiguous sequence; there may be gaps). To find the ID for each logical CPU look in

/sys/devices/system/cpu/cpu*/cache/index*/id
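The per-CPU id files can be collected into one view with a short loop. This is a sketch of our own (the SYSROOT variable is ours, letting the loop be pointed at a copy of the tree; leave it empty to read the real sysfs):

```shell
# Print the L3 cache ID for each logical CPU by walking the sysfs cache
# directories. index3 is L3 on typical systems; adjust for other levels.
list_l3_ids() {
  for f in "${SYSROOT:-}"/sys/devices/system/cpu/cpu[0-9]*/cache/index3/id; do
    [ -e "$f" ] || continue           # skip if the glob matched nothing
    cpu=${f##*cpu/cpu}; cpu=${cpu%%/*}  # extract the CPU number from the path
    printf 'cpu%s: L3 cache id %s\n' "$cpu" "$(cat "$f")"
  done
}

list_l3_ids   # one line per logical CPU on a live system
```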

Some examples:

L3 details (code and data prioritization disabled)

With CDP disabled the L3 schemata format is:

L3:<cache_id0>=<mask>;<cache_id1>=<mask>;...

L3 details (CDP enabled via mount option to resctrl)

When CDP is enabled, L3 control is split into two separate resources, so you can specify independent masks for code and data like this:

L3data:<cache_id0>=<mask>;<cache_id1>=<mask>;...
L3code:<cache_id0>=<mask>;<cache_id1>=<mask>;...

L2 details

L2 cache does not support code and data prioritization, so the schemata format is always:

L2:<cache_id0>=<mask>;<cache_id1>=<mask>;...
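The masks in these schemata lines are contiguous runs of bits, each bit representing a cache way. A small helper (the name cbm is our own) can compute them from a starting way and a width instead of working them out by hand:

```shell
# Build a contiguous capacity bitmask covering $2 ways starting at way
# $1, printed in the hex form used in schemata lines. Our own sketch.
cbm() {
  start=$1 width=$2
  printf '%x\n' $(( ((1 << width) - 1) << start ))
}

cbm 0 10    # lower 10 of 20 ways: prints 3ff
cbm 15 5    # top 5 of 20 ways:   prints f8000
```

For example, `echo "L3:0=$(cbm 0 10)" > schemata` would restrict a group to the lower half of a 20-way L3 on cache ID 0.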

Figure 2 is a sequence diagram showing how to use resctrl

Sequence Diagram of resctrl

Figure 3 is a block diagram showing how to use resctrl

resctrl new directory

1.1.2.2 Examples

Example 1

On a two socket machine (one L3 cache per socket) with just four bits for cache bit masks

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts of all caches (its schemata file reads "L3:0=f;1=f"). Tasks that are under the control of group "p0" may only allocate from the "lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. Tasks in group "p1" use the "lower" 50% of cache on both sockets.
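The "50%" readings above can be confirmed by counting set bits in each 4-bit mask (the ways helper below is our own illustration, not a resctrl interface):

```shell
# Count the cache ways granted by a hex mask (i.e. its set bits), to
# confirm how much of the cache each Example 1 group can allocate.
ways() {
  mask=$((0x$1)) n=0
  while [ "$mask" -ne 0 ]; do
    n=$((n + (mask & 1)))   # add the low bit, then shift it out
    mask=$((mask >> 1))
  done
  echo "$n"
}

ways 3   # p0 on cache 0: prints 2 (2 of 4 ways, the lower 50%)
ways c   # p0 on cache 1: prints 2 (2 of 4 ways, the upper 50%)
ways f   # default group: prints 4 (all ways)
```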

Example 2

Again two sockets, but this time with a more realistic 20-bit mask. Two real-time tasks, pid=1234 running on processor 0 and pid=5678 running on processor 1 of socket 0, on a two-socket, dual-core machine. To avoid noisy neighbors, each of the two real-time tasks exclusively occupies one quarter of the L3 cache on socket 0.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper" 50% of the L3 cache on socket 0 cannot be used by ordinary tasks:

# echo "L3:0=3ff;1=fffff" > schemata

Next we make a resource group for our first real time task and give it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We also use taskset(1) to ensure the task always runs on a dedicated CPU on socket 0. Most uses of resource groups will also constrain which processors tasks run on.

# echo 1234 > p0/tasks
# taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2 5678
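A quick bitwise check shows why this carve-up gives each real-time task an exclusive slice: the three socket-0 masks never overlap, and together they cover every way.

```shell
# Verify the Example 2 socket-0 masks are pairwise disjoint and jointly
# cover the whole 20-way cache: default=3ff, p0=f8000, p1=7c00.
a=$((0x3ff)) b=$((0xf8000)) c=$((0x7c00))
[ $((a & b)) -eq 0 ] && [ $((a & c)) -eq 0 ] && [ $((b & c)) -eq 0 ] \
  && echo "masks are disjoint"
printf 'union: %x\n' $((a | b | c))   # prints "union: fffff" (every way owned)
```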

Example 3

A single-socket system which has real-time tasks running on cores 4-7 and a non-real-time workload assigned to cores 0-3. The real-time tasks share text and data, so a per-task association is not required; due to interaction with the kernel, it is desired that the kernel on these cores shares L3 with the tasks.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper" 50% of the L3 cache on socket 0 cannot be used by ordinary tasks:

# echo "L3:0=3ff" > schemata

Next we make a resource group for our real time cores and give it access to the "top" 50% of the cache on socket 0.

# mkdir p0
# echo "L3:0=ffc00;" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the kernel and the tasks running there get 50% of the cache. (The mask f0 selects cores 4-7.)

# echo f0 > p0/cpus

1.1.3 How to debug resctrl using pqos

pqos and resctrl use the same MSRs to perform CAT. This means that you can use pqos to review your resource allocation at a system level. pqos can also be used to provide useful information needed for resctrl. For example, if you wish to know the cache IDs for all cores you can use:

# pqos -s -V

pqos-s-v

We can see that the L3 cache ID for core 23 is 1 and its L2 cache ID is 13.

Using the same pqos command, we can see the current configuration of every COS:

pqos-s-v2

And finally we can also see the core-to-COS mapping using the same pqos -s -V command:

pqos -s -V

Using this data we are able to get a system wide view of what has been altered using resctrl.
