% PORTO(8)

NAME

porto - linux container management system

SYNOPSIS

portod [-h|--help] [-v|--version] [options...] <command> [arguments...]

portoctl [-h|--help] [-v|--version] [options...] <command> [arguments...]

DESCRIPTION

Porto is yet another Linux container management system.

The main goal is to provide a single entry point for several Linux subsystems such as cgroups, namespaces, mounts, and networking. Porto is intended to be a base for large infrastructure projects.

Key Features

Nested containers - containers can be put into containers
Nested virtualization - containers can use the porto service too
Flexible configuration - all container parameters are optional
Reliable service - porto upgrades without restarting containers

Container management software built on top of porto can be transparently enclosed inside a porto container.

Porto provides a protobuf interface via a Unix socket, /run/portod.socket.

The command line tool portoctl and C++, Python and Go APIs are included.

Porto requires Linux kernel 3.18 and optionally some offstream patches.

CONTAINERS

A container is a basic object which holds resources and contains some workload.

Depending on its configuration properties, a container constructs its own namespaces and cgroups or inherits them from its parent container.

A container executes a command or an OS (virt_mode=os), or works as a meta container for nested sub-containers.

Name

A container name can contain only these characters: 'a'..'z', 'A'..'Z', '0'..'9', '_', '-', '@', ':', '.'. A slash '/' separates nested containers: "parent/child".

Each container name component must not exceed 128 characters. The whole name is limited to 200 characters (220 for superuser). Porto also limits nesting to 16 levels.
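
The naming rules above can be checked with a short routine. The following is a minimal Python sketch, not porto's own validation code: the function name is_valid_container_name is hypothetical, it only handles relative names, and it ignores the special names "/", "self" and ".". Whether the total length limit applies to the relative or the absolute name is not stated here, so the sketch simply applies it to the string it is given.

```python
import re

# Characters allowed in one name component, as listed above.
_COMPONENT = re.compile(r"[A-Za-z0-9_\-@:.]+")


def is_valid_container_name(name, superuser=False):
    """Check a relative container name against the documented limits."""
    max_total = 220 if superuser else 200
    if not name or len(name) > max_total:
        return False
    parts = name.split("/")
    if len(parts) > 16:  # nesting is limited to 16 levels
        return False
    return all(
        len(part) <= 128 and _COMPONENT.fullmatch(part) is not None
        for part in parts
    )
```

For example, "parent/child" passes, while a name containing a space or a 129-character component is rejected.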

A container can be addressed using a short name relative to the current porto namespace: "name", or an absolute name "/porto/name", which stays the same regardless of porto namespace. See absolute_name, absolute_namespace, enable_porto and porto_namespace below.

Host is a pseudo-container "/".

"self" points to the container where the current task lives and can be used for relative requests such as "self/child" or "self/..".

Container "." points to the parent container of the current porto namespace; this is the common parent of all visible containers.

States

stopped - initial state
starting - start in progress
running - command execution in progress
stopping - stop in progress
paused - frozen, consumes memory but no cpu
dead - execution complete
meta - running container without command
respawning - dead and will be started again

Operations

create - creates new container in stopped state
start - stopped -> starting -> running | meta
stop - running | meta | dead -> stopping -> stopped
restart - dead -> stopping -> stopped -> starting -> running
kill - running -> dead
death - running -> dead
pause - running | meta -> paused
resume - paused -> running | meta
destroy - destroys container in any state
list - list containers
get - get container property
set - set container property
wait - wait for container death

Usual Life Cycle:

create -> (stopped) -> setup -> start -> (running) -> death -> (dead) -> get -> destroy
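
The transitions listed under Operations can be written down as a lookup table. This is an illustrative Python sketch built only from that list; the names STATES, ALLOWED_FROM and can_apply are hypothetical, and the daemon may accept transitions that the list does not mention.

```python
# States from the States list above.
STATES = {
    "stopped", "starting", "running", "stopping",
    "paused", "dead", "meta", "respawning",
}

# For each operation, the states it is documented to start from.
ALLOWED_FROM = {
    "start": {"stopped"},
    "stop": {"running", "meta", "dead"},
    "restart": {"dead"},
    "kill": {"running"},
    "pause": {"running", "meta"},
    "resume": {"paused"},
    "destroy": STATES,  # destroy works in any state
}


def can_apply(operation, state):
    """Return True if the list above shows `operation` starting from `state`."""
    return state in ALLOWED_FROM.get(operation, set())
```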

Properties

Container configuration and state are both represented in a key-value interface.
Some properties are read-only or require a particular container state.

portoctl without arguments prints the list of possible container properties.

Some properties have an internal key-value structure like <key>: <value>;... and provide access to individual values via property[key].

Values which represent a size in bytes can be floating-point and can use 1024-based suffixes: B|K|M|G|T|P|E. Porto returns these values in bytes without suffixes.
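
As a worked example of the suffix rule above, the following Python sketch converts a size string into bytes. The function name parse_size is hypothetical and this is not porto's parser; it assumes a single uppercase suffix letter at the end of the value and truncates any fractional byte.

```python
# Power of 1024 for each documented suffix.
_UNITS = {"B": 0, "K": 1, "M": 2, "G": 3, "T": 4, "P": 5, "E": 6}


def parse_size(text):
    """Convert a value such as '1.5M' into a whole number of bytes."""
    text = text.strip()
    if text and text[-1] in _UNITS:
        power = _UNITS[text[-1]]
        number = text[:-1]
    else:
        power = 0
        number = text
    return int(float(number) * 1024 ** power)
```

With this sketch "8K" becomes 8192 and "1.5M" becomes 1572864.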

Values which represent text masks work as fnmatch(3) with flag FNM_PATHNAME: '*' and '?' don't match '/'. As an extension, '***' matches everything.
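
The mask semantics can be approximated by translating a mask into a regular expression. This is a minimal Python sketch under stated assumptions: mask_to_regex and mask_matches are hypothetical names, bracket expressions such as [a-z] are not handled, and Python's own fnmatch module is not used because it has no FNM_PATHNAME mode.

```python
import re


def mask_to_regex(mask):
    """Translate a mask into a compiled regex with FNM_PATHNAME-like rules."""
    out = []
    i = 0
    while i < len(mask):
        if mask.startswith("***", i):
            out.append(".*")      # extension: matches everything, including '/'
            i += 3
        elif mask[i] == "*":
            out.append("[^/]*")   # any run of characters except '/'
            i += 1
        elif mask[i] == "?":
            out.append("[^/]")    # exactly one character except '/'
            i += 1
        else:
            out.append(re.escape(mask[i]))
            i += 1
    return re.compile("".join(out), re.DOTALL)


def mask_matches(mask, text):
    """Return True if the whole text matches the mask."""
    return mask_to_regex(mask).fullmatch(text) is not None
```

Here "a/*" matches "a/b" but not "a/b/c", while "***" matches both.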

Labels

A container can have user-defined labels with associated values.

Label and value may use only symbols allowed for container names. Spaces and '/' are not allowed.

Label must be in format PREFIX.name. Label max length is 128 symbols. PREFIX must be 2..16 UPPERCASE A-Z chars. Prefixes PORTO* are reserved.

Value max length is 256 symbols. Empty value removes label.

Each container may have up to 100 labels.
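
The label rules above can be summarized as a small check. The following Python sketch is illustrative only: check_label is a hypothetical name, the error strings are made up for the example, and requiring a non-empty name after the prefix is an assumption.

```python
import re

# PREFIX (2..16 uppercase letters), a dot, then a name built from the
# characters allowed in container names ('/' and spaces excluded).
_LABEL = re.compile(r"([A-Z]{2,16})\.([A-Za-z0-9_\-@:.]+)")
_VALUE = re.compile(r"[A-Za-z0-9_\-@:.]*")


def check_label(label, value):
    """Return None if the pair looks valid, else a short reason string."""
    if len(label) > 128:
        return "label too long"
    match = _LABEL.fullmatch(label)
    if match is None:
        return "bad format"
    if match.group(1).startswith("PORTO"):
        return "reserved prefix"
    if len(value) > 256:
        return "value too long"
    if _VALUE.fullmatch(value) is None:
        return "bad value"
    return None
```

An empty value passes the check because, as stated above, it is the way to remove a label.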

Use "Y" and "N" for boolean values and "." as placeholder.

For count, size, speed, time use bytes, bytes/second, seconds as decimal integers without suffixes in label and value. Other types must be defined by labels suffix: _ms, _ns, _cores.

Do not keep full file paths in label values: users could be in different chroots. Short file names are ok.

All labels are stored in the property labels. Access via properties labels[PREFIX.name] and PREFIX.name works as well.

Set and inherited labels could be read as labels[.PREFIX.name]

Porto provides API for label lookup, atomic compare-and-set, atomic increment and notifications.

Context

command - container command string

Environment variables $VAR are expanded using wordexp(3). Container with empty command is a meta container.
command_argv - verbatim command line

List of tab-separated arguments, overrides command. Individual arguments can be read and set via index: command_argv[index].

Set command to space separated '${ARGV/'/'\''}'.
core_command - command for receiving core dumps

Container without chroot inherits default core command from parent container.

To enable core dumps set ulimit[core] to unlimited or anything > 1.

Also see [COREDUMPS] below.
env - environment of main container process, syntax: <variable>=<value>; ...

Container with isolate=false inherits environment variables from parent.

Default environment is:

container="lxc"
PORTO_NAME=container name
PORTO_HOST=host hostname
PORTO_USER=owner_user
HOME=cwd
USER=user
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

Default environment can be configured in portod.conf:
```
container {
    extra_env {
        name: "NAME"
        value: "VALUE"
    }
}
```
env_secret - secret environment variables, syntax: <variable>=<value>; ...

Same as env but logging and reading replace value with "".
user - uid of container processes, default owner_user
group - gid of container processes, default: owner_group
task_cred - credentials of container process: uid gid groups...
umask - initial file creation mask, default: 0002, see umask(2)
ulimit - resource limits, syntax: <type>: [soft]|unlimited <hard>|unlimited; ... see getrlimit(2)

Default ulimits could be set in portod.conf:
```
container {
    default_ulimit: "type: soft hard;..."
}
```
Hardcoded default is "core: 0 unlimited; nofile: 8K 1M".

If memory_limit is set then the default memlock is max(memory_limit_total - 16M, 8M), and the hard limit is unlimited.

Configuration options in portod.conf:
```
container {
    memlock_minimal: <bytes>
    memlock_margin: <bytes>
}
```
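
The default memlock rule above is simple arithmetic. A minimal Python sketch follows; the function name default_memlock is hypothetical, and mapping the 16M and 8M constants to memlock_margin and memlock_minimal is an assumption based on the option names.

```python
MIB = 1024 * 1024


def default_memlock(memory_limit_total, margin=16 * MIB, minimal=8 * MIB):
    """Default soft memlock: the limit minus a margin, but never below the minimum."""
    return max(memory_limit_total - margin, minimal)
```

For a 1G limit this gives 1G - 16M; for a 10M limit the 8M minimum applies.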
virt_mode - virtualization mode:
- app - (default) start command as normal process
- os - start command as init process
- host - start command without security restrictions
- job - start command as process group in parent container
- docker - start command in user namespace
- fuse - start command in mount and user namespace with direct mapping
Side effects of virt_mode=os
- set default command="/sbin/init"
- set default user="root", group="root"
- set default stdout_path="/dev/null", stderr_path="/dev/null"
- set default net=none
- set default cwd="/"
- reset loginuid for container
- enable systemd cgroup if /sbin/init is systemd
- stop command will send SIGPWR rather than SIGTERM
Side effects of virt_mode=docker
- set default command="bash -c 'containerd& dockerd'"
- join user and network namespaces
- user and group must be not "root"
- parent must be with virt_mode="os"
Side effects of virt_mode=fuse
- set default devices="/dev/fuse rw"
- join mount namespaces
- join user namespaces with direct mapping
userns - create new user namespace
- false - use parent namespace (default)
- true - create new namespace
cgroupfs - mount cgroup fs in container:
- none - (default) do not mount
- ro - mount read only
- rw - allow write. Allowed by config via enable_rw_cgroupfs=true
If cgroupfs!=none then task starts in cgroup namespace

State

id - container id, 64-bit decimal
level - container level, 0 for root, 1 for first level
state - current state, see [States]
exit_code - 0: success, 1..255: error code, -64..-1: termination signal, -99: OOM
exit_status - container exit status, see wait(2)
start_error - last start error
oom_killed - true, if container has been OOM killed
oom_kills - count of tasks killed in container since start (Linux 4.13 or offstream patches)
oom_kills_total - count of tasks killed in hierarchy since creation (Linux 4.13 or offstream patches)
core_dumped - true, if main process dumped core
root_pid - main process pid in client pid namespace (could be unreachable)
stdout[[offset][:length]] - stdout text, see stdout_path
- stdout - get all available text
- stdout[:1000] - get only last 1000 bytes
- stdout[2000:] - get bytes starting from 2000 (text might be lost)
- stdout[2000:1000] - get 1000 bytes starting from 2000
stdout_offset - offset of stored stdout
stderr[[offset][:length]] - stderr text, see stderr_path

Same as stdout.
stderr_offset - offset of stored stderr
time - container running time in seconds
time[dead] - time since death in seconds
creation_time - format: YYYY-MM-DD hh:mm:ss
creation_time[raw] - seconds since the epoch
change_time - format: YYYY-MM-DD hh:mm:ss
change_time[raw] - seconds since the epoch
start_time - format: YYYY-MM-DD hh:mm:ss
start_time[raw] - seconds since the epoch
death_time - format: YYYY-MM-DD hh:mm:ss
death_time[raw] - seconds since the epoch
absolute_name - full container name including porto namespaces
absolute_namespace - full container namespace including parent namespaces

This is the prefix of absolute_name required for seeing other containers.
controllers - enabled cgroup controllers, see [CGROUPS] below
cgroups - paths to cgroups, syntax: <name>: <path>
process_count - current process count
thread_count - current thread count
thread_limit - limit for thread_count

For first level containers default is 10000.
parent - parent container absolute name
private - 4096 bytes of user-defined text
labels - user-defined labels, syntax <label>: <value>;...

See [Labels].
weak - if true container will be destroyed when client disconnects

The container must be created by a special API call; the client can clear the weak flag after that.
respawn - automatically restart container after death

If also set for nested containers, they will be scheduled to respawn after the parent respawns.
respawn_count - how many times container has been respawned

Could be reset at any time.
max_respawns - limit for automatic respawns, default: -1 (unlimited)
respawn_delay - delay before automatic respawn in nanoseconds, default 1s
aging_time - time in seconds before auto-destroying dead containers, default: 1 day
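
As an illustration of the exit_code encoding listed at the start of this section, the following Python sketch turns a value into a readable description. The function name describe_exit_code is hypothetical, and reading a negative code as the negated signal number is an assumption drawn from the documented range -64..-1.

```python
def describe_exit_code(code):
    """Map an exit_code value to a short human-readable description."""
    if code == 0:
        return "success"
    if code == -99:
        return "OOM"
    if 1 <= code <= 255:
        return f"error code {code}"
    if -64 <= code <= -1:
        # Assumption: the signal number is the absolute value of the code.
        return f"terminated by signal {-code}"
    return "unknown"
```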

Security

isolate - create new pid/ipc/uts/env namespace
- true - create new namespaces (default)
- false - use parent namespaces
capabilities - available capabilities, syntax: CAP;... see capabilities(7)

Porto restricts capabilities depending on other properties:

Requires memory_limit: IPC_LOCK.

Requires isolate: KILL, PTRACE.

Requires net: NET_ADMIN.

Always available: NET_BIND_SERVICE, NET_RAW

Requires root (chroot): SETPCAP, SETFCAP, CHOWN, FOWNER, DAC_OVERRIDE, FSETID, SETGID, SETUID, SYS_CHROOT, MKNOD, AUDIT_WRITE.

Requires no root, cannot be ambient, only for suid: LINUX_IMMUTABLE, SYS_ADMIN, SYS_NICE, SYS_RESOURCE, SYS_BOOT.

Without chroot all these capabilities are available. Capabilities which meet requirements could be set as ambient and available within chroot.

Container inherits capabilities from parent and cannot surpass them.

Containers owned by root have all these capabilities by default and ignore any restrictions.

These capabilities are not available by default: SYS_RAWIO, SYS_TIME, SYS_MODULE, IPC_OWNER, DAC_READ_SEARCH, LEASE, SYSLOG, MAC_OVERRIDE, MAC_ADMIN, AUDIT_CONTROL, AUDIT_READ, NET_BROADCAST, SYS_PACCT, SYS_TTY_CONFIG, WAKE_ALARM, BLOCK_SUSPEND.
capabilities_allowed - resulting bounding set

This read-only property shows resulting set of capabilities allowed in container.
capabilities_ambient - raise ambient capabilities, syntax: CAP;... see capabilities(7)

All tasks in container will have these capabilities.

Requires Linux 4.3
capabilities_ambient_allowed - allowed ambient capabilities

Subset of capabilities_allowed allowed to be set as ambient. In container with chroot they are equal.
devices - access to devices, syntax: <device> [r][w][m][-][?] [path] [mode] [user] [group]|preset <preset>;...

"device" - device path in host (/dev/null)
"rwm-?" - read, write, mknod, no-access, optional
"path" - device path in container, defaults same as in host
"mode", "user", "group" - device node permissions and owner, defaults are taken from host, only host root and device owner are allowed to change this

By default porto grants access to:
/dev/null
/dev/zero
/dev/full
/dev/random
/dev/urandom
/dev/tty
/dev/console (as alias for /dev/null)
/dev/ptmx
/dev/pts/*

Inside chroots porto creates device nodes in /dev only for allowed devices.

Device access control can be entirely disabled for containers without chroot by setting property controllers[devices]=false at the first-level container; in this case devices must stay unchanged.

A container inherits permissions from its parent and cannot surpass them, thus any additional permissions must be granted for the first-level container. Configuration can be changed at runtime.

Device preset should be defined in portod.conf:
```
container {
    device_preset {
        preset: "<preset>"
        device: "<device> [rwm]..."
        device: ...
    }
    ...
}
```
Access to devices for all containers could be also granted in portod.conf:
```
container {
    extra_devices: "<device> [rwm]... ;..."
}
```
Write access to related sysfs nodes could be granted in portod.conf:
```
container {
    device_sysfs {
        device: "/dev/abc"
        sysfs: "/sys/foo"
        sysfs: "/sys/bar"
    }
    ...
}
```
enable_porto - access to porto
- false | none - no access
- read-isolate - read-only access, show only sub-containers
- read-only - read-only access
- isolate - read-write access, show only sub-containers
- child-only - write-access to sub-containers
- true | full - full access (default)
Containers with a restricted view see truncated container names excluding the isolated part.

All containers with read access could examine "/", ".", "self" and all own parents.
porto_namespace - name prefix required to show containers (deprecated, use enable_porto=isolate)

Actual porto namespace concatenates prefixes from parents, see absolute_namespace.
taint - list of known problems in container configuration
owner_user - container owner user, default: creator
owner_group - container owner group, default: creator
owner_cred - credentials of container owner: uid gid groups...
owner_containers - containers that have write access to container

Also grants write access to parent containers. If not set all containers have write access.

Porto client authentication is based on task pid, uid, gid received via socket(7) SO_PEERCRED and task freezer cgroup from /proc/pid/cgroup.

Requests from host are executed on behalf of the client task uid:gid.

Requests from containers are executed on behalf of the container's owner_user:owner_group.

By default all users have read-only access to all visible containers. Visibility is controlled via enable_porto and porto_namespace.

The root user and users in group "porto" have write access.

Write access to container requires any of these conditions:

container is a sub-container
owner_user matches the client user

Filesystem

cwd - working directory

In host root the default is a temporary directory: /place/porto/container-name
In chroot: '/'
root - container root path in parent namespace, default: /

Porto creates a new mount namespace from the parent mount namespace and chroots into this directory using pivot_root(2).

Porto mounts: /dev, /dev/pts, /dev/hugepages, /proc, /run, /sys, /sys/kernel/tracing.

Porto also recreates in /run the directory structure from the underlying filesystem.

Porto creates in /dev nodes only for devices permitted by property devices.

If the container should have access to porto then /run/portod.socket is bind-mounted inside.
root_path - container root path in client namespace
root_readonly - remount everything read-only
bind - bind mounts: <source> <target> [ro|rw|rec|dev|nodev|suid|nosuid|exec|noexec|private|unbindable|noatime|nodiratime|relatime],... ;...

This option creates new mount namespace and binds directories or files from parent container mount namespace.

Resulting bind-mounts are invisible from host or parent container and cannot be used for creating volumes. For that use volume backend=bind instead.

Bind mounts are non-recursive by default; add flag "rec" to bind sub-mounts too. By default the mount is slave-shared, which implements one-way propagation.
symlink - create symlink, format: <symlink>: <target>;...

Both paths are resolved in chroot; relative paths start from cwd. Porto creates missing parent directories and makes a relative symlink which can be resolved outside chroot. Existing symlinks are replaced when needed. Symlinks can be changed at runtime. Setting an empty target removes the symlink. To change a single symlink use property "symlink[<symlink>]".
stdout_path - stdout file, default: internal rotated storage

By default stdout and stderr are redirected into files created in default cwd.

Periodically, when the size of these files exceeds stdout_limit, head bytes are removed using fallocate(2) FALLOC_FL_COLLAPSE_RANGE. The count of lost bytes is shown in stdout_offset.

Path "/dev/fd/fd" redirects the stream into file descriptor fd of the porto client task that starts the container.
stdout_limit - limits internal stdout/stderr storage, porto keeps tail bytes

Default is 8Mb; the value is limited to 1Gb.
stderr_path - stderr file, default: internal rotated storage

Same as stdout_path.
stdin_path - stdin file; default: "/dev/null"
place - places allowed for volumes and layers, syntax: [default][;[alias=]path;...]

These are paths in the host, or masks for them, which are allowed to be used as the place property of volumes for requests from this container.

Default is "/place;***". This means use /place by default, allow any other.

An alias allows addressing a place by a short keyword, for example: place="/mnt/data;slow=/mnt/hdd;fast=/mnt/ssd".

Container inherits policy from parent container and cannot surpass it.
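
To make the place syntax concrete, the following Python sketch splits a place value into its default, the list of additionally allowed paths or masks, and the alias table. The function name parse_place and the returned tuple layout are hypothetical; this is not how porto itself stores the policy.

```python
def parse_place(spec):
    """Split '[default][;[alias=]path;...]' into (default, allowed, aliases)."""
    items = [item.strip() for item in spec.split(";")]
    default = items[0] or None   # first field is the default place, may be empty
    allowed = []                 # further paths or masks, in order
    aliases = {}                 # short keyword -> path
    for item in items[1:]:
        if not item:
            continue
        alias, sep, path = item.partition("=")
        if sep:
            aliases[alias] = path
            allowed.append(path)
        else:
            allowed.append(item)
    return default, allowed, aliases
```

For the default value "/place;***" this yields "/place" as the default and the single mask "***" as the allowed list.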
place_limit - limits sum of volume space_limit owned by subtree

Format: total|default|<place>|tmpfs|lvm <group>|rbd: <bytes>;...
place_usage - current sum of volume space_limit owned by subtree

Format is same as for place_limit.
volumes_owned - list of volumes charged into place_usage
volumes_linked - list of volumes linked to this container
volumes_required - list of volumes required to start this container

Setting bind, root, stdout_path, stderr_path requires write permissions to the target or owning related volume.

Memory

memory_usage - current memory usage

This counts physical pages owned by this container:
- touched anonymous\private mmap(2), anonymous part of RSS
- tmpfs\shmem
- filesystem\disk cache
Task RSS is a count of touched pages in virtual address space: populated PTEs. It includes executable and other mapped files, and doesn't include filesystem cache and unmapped tmpfs. Also some pages could be mapped and counted multiple times.
anon_usage - current anon memory usage

This counts physical pages that have no filesystem\disk backend: anonymous RSS, tmpfs\shmem.
anon_max_usage - peak anon_usage (offstream kernel feature)

Set to empty for reset.
cache_usage - current cache usage

Physical memory pages that have filesystem\disk backend, potentially could be reclaimed by kernel.
shmem_usage - current shmem and tmpfs usage

Physical memory pages that belong to shmem or tmpfs files.
mlock_usage - current locked memory

Physical memory pages that are locked and cannot be reclaimed.
hugetlb_usage - current hugetlb memory usage in bytes
hugetlb_limit - hugetlb memory limit in bytes

For root container shows hugetlb total size.

For now only 2Mb pages are supported.
max_rss - peak anon_usage (offstream kernel feature)

Alias for anon_max_usage for historical reasons.
memory_guarantee - guarantee for memory_usage, default: 0

Memory reclaimer skips the container and its sub-containers if current usage is less than the guarantee.

Overcommit is forbidden, 2Gb are reserved for host system.

If the system is currently under overcommit then porto allows starting only containers without memory guarantee, and a non-root user can start only sub-containers.

Hugetlb pages are subtracted from host memory.

Reserve (2Gb) is set in portod.conf:
```
daemon {
    memory_guarantee_reserve: <bytes>
}
```
memory_guarantee_total - hierarchical memory guarantee

Upper bound for guaranteed memory for container including guarantees for sub-containers.

memory_guarantee_total = max(memory_guarantee, sum of memory_guarantee_total for running children)
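
The hierarchical formula above is recursive, which a small tree walk makes explicit. This is a minimal Python sketch with a hypothetical Node class; it only models the guarantee, the running flag and the children, and is not taken from porto's source.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a container in the hierarchy."""
    guarantee: int = 0
    running: bool = True
    children: list = field(default_factory=list)


def guarantee_total(node):
    """max(own guarantee, sum of totals over running children)."""
    child_sum = sum(
        guarantee_total(child) for child in node.children if child.running
    )
    return max(node.guarantee, child_sum)
```

For a parent with a 1G guarantee and running children with 512M and 768M, the total is 1280M because the children's sum exceeds the parent's own value; a stopped child does not contribute.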
memory_limit - limit for memory_usage

For first level containers default is max(Total - 2Gb, Total * 3/4), for deeper levels default is 0 (unlimited).
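
As a worked example of the first-level default above, the following Python sketch evaluates the formula for a given amount of RAM. The function name default_memory_limit is hypothetical, and treating the 2Gb constant as a tunable margin parameter is an assumption.

```python
GIB = 1024 ** 3


def default_memory_limit(total_ram, margin=2 * GIB):
    """max(Total - margin, Total * 3/4) for first-level containers."""
    return max(total_ram - margin, total_ram * 3 // 4)
```

On a 4G host the three-quarters term wins (3G); on a 64G host the margin term wins (62G). The two terms are equal at 8G.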

Margin from total ram (2Gb) is set in portod.conf:
```
container {
    memory_limit_margin: <bytes>
}
```
Allocations over the limit reclaim cache or write anon pages into swap. On failure syscalls return ENOMEM\EFAULT, and a page fault triggers OOM.
memory_limit_total - hierarchical memory limit

Effective memory limit for container including limits for parent containers.

memory_limit_total = min(memory_limit, memory_limit_total for parent)
anon_limit - limit for anon_usage (offstream kernel feature)

Default is memory_limit - min(memory_limit / 4, 16M).

Default margin from limit (16Mb) is set in portod.conf:
```
container {
    anon_limit_margin: <bytes>
}
```
anon_limit_total - hierarchical anonymous memory limit

anon_limit_total = min(anon_limit, anon_limit_total for parent)
anon_only - keep only anon pages, allocate cache in parent, default: false (offstream kernel feature)
dirty_limit - limit for dirty memory unwritten to disk, default: 0 (offstream kernel feature)
recharge_on_pgfault - if true migrate cache on minor page fault, default: false (offstream kernel feature)
pressurize_on_death - if true set tiny soft memory limit for dead and hollow meta containers, default false
oom_is_fatal - kill all affected containers on OOM, default: true
oom_score_adj - OOM score adjustment: -1000..1000, default: 0

See oom_score_adj in proc(5).
minor_faults - count minor page-faults (file cache hits)
major_faults - count major page-faults (file cache misses, reads from disk)
virtual_memory - non-recursive sum for processes in container, format: <type>: <bytes>;...

Types: count, size, max_size, used, max_used, anon, file, shmem, huge, swap, locked, data, stack, code, table.
memory_lock_policy - memory.mlock_policy value to container memory cgroup

Available values:
- disabled - 0: disabled, default
- mlockall - 1: similar to mlockall
- executable - 2: only files which may be executed
- xattr - 3: only files which has "user.yndx.mlock" xattr
memory_pressure - memory pressure stall total in us

Total amount of time (in microseconds) during which processes were waiting for memory.

CPU

cpu_usage - CPU time used in nanoseconds (1 / 1000_000_000s)
cpu_usage_system - kernel CPU time in nanoseconds
cpu_wait - total time waiting for execution in nanoseconds (offstream kernel feature)
cpu_throttled - total throttled time in nanoseconds
cpu_burst_usage - total burst usage time in nanoseconds
cpu_unconstrained_wait - total unconstrained wait time in nanoseconds
cpu_pressure - cpu pressure stall total in us

Total amount of time (in microseconds) during which processes were waiting for cpu.
cpu_weight - CPU weight, syntax: 0.01..100, default: 1

Multiplies cpu.shares; +10% of cpu_weight is equivalent to -1 nice.
cpu_guarantee - desired CPU power

Syntax: 0.0..100.0 (in %) | <cores>c (in cores), default: 0

Increases cpu.shares according to the required cpu power distribution. Offstream kernel patches provide more accurate control.
cpu_guarantee_total - effective CPU guarantee

Porto propagates CPU guarantee from children into parent containers: cpu_guarantee_total = max(cpu_guarantee, sum of cpu_guarantee_total for running children)

For root container this shows total CPU guarantee.
cpu_guarantee_bound - maximum guarantee of upper hierarchy
cpu_limit - CPU usage limit

Syntax: 0.0..100.0 (in %) | <cores>c (in cores), default: 0 (unlimited)

Porto sets up both CFS and RT cgroup limits. RT cgroup limit is strictly non-overcommitable in mainline kernel.

For root container this shows total CPU count.
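
The two spellings accepted by cpu_guarantee and cpu_limit can be normalized to a number of cores. The following Python sketch is only an illustration: cpu_value_to_cores is a hypothetical name, and it assumes that a percentage is relative to all host CPUs, so that 100 means every core. That interpretation is not stated explicitly in this page.

```python
def cpu_value_to_cores(value, total_cores):
    """Convert '<cores>c' or a percentage string into a core count."""
    value = value.strip()
    if value.endswith("c"):
        return float(value[:-1])
    # Assumption: percent is taken from the whole host.
    return float(value) / 100.0 * total_cores
```

Under that assumption, "2.5c" is 2.5 cores on any host, while "50" on a 16-core host is 8 cores.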
cpu_limit_total - total CPU limit

cpu_limit_total = sum of min(cpu_limit, cpu_limit_total) for running or meta children, plus the container's cpu_limit if it is running.

For root container this shows total CPU commitment.
cpu_limit_bound - minimum limit of upper hierarchy
cpu_period - CPU limit accounting period

Syntax: 1ms..1s, default: 100ms [nanoseconds]
cpu_policy - CPU scheduler policy, see sched(7)
- normal - SCHED_OTHER (default)
- high - SCHED_OTHER (nice = -10, increases cpu.shares by 16 times)
- rt - SCHED_RR (nice = -20, priority = 10)
- batch - SCHED_BATCH
- idle - SCHED_IDLE (also decreases cpu.shares by 16 times)
- iso - SCHED_ISO (offstream kernel feature)
cpu_set - CPU affinity
- [N|N-M,]... - set of CPUs (logical cores)
- node N - bind to NUMA node
- jail N [;node N] - evenly distribute to N CPUs and optionally bind to NUMA node
- reserve N - allocate N CPUs, use the rest too
- threads N - allocate N CPUs, use only them
- cores N - allocate N physical cores, use only one thread for each
Each container owns a set of CPUs (shown in cpu_set_affinity) and distributes them among its children.

Child could use only subset of parent cpus, by default all of them.

Allocated CPUs are removed from cpu sets of sibling containers, but these CPUs still could be in use by processes outside sub-tree.
cpu_set_affinity - resulting CPU affinity: [N,N-M,]...
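
The [N,N-M,]... notation used by cpu_set and cpu_set_affinity expands to a plain list of CPU numbers. A minimal Python sketch, with the hypothetical function name parse_cpu_list, could look like this; it covers only the list form, not the node, jail, reserve, threads or cores keywords.

```python
def parse_cpu_list(text):
    """Expand a list such as '0-3,8,10-11' into sorted CPU numbers."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        first, sep, last = part.partition("-")
        end = int(last) if sep else int(first)
        cpus.update(range(int(first), end + 1))
    return sorted(cpus)
```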

Disk IO

Disk names are single words, like: "sda" or "md0".

"fs" refers to statistics and limits at filesystem level (offstream kernel feature).

"hw" refers to total statistics for all disks.

Statistics and limits could be requested for filesystem path. Absolute paths are resolved in host, paths starting with dot in chroot: io_read[/], io_read[.].

io_read - bytes read from disk, syntax: <disk>: <bytes>;...
io_write - bytes written to disk, syntax: <disk>: <bytes>;...
io_ops - disk operations: <disk>: <count>;...
io_read_ops - disk read operations: <disk>: <count>;...
io_write_ops - disk write operations: <disk>: <count>;...
io_time - total io time: <disk>: <nanoseconds>;...
io_limit - IO bandwidth limit, syntax: fs|<path>|<disk> [r|w]: <bytes/s>;...
- fs [r|w]: <bytes> - filesystem level limit (offstream kernel feature)
- <path> [r|w]: <bytes> - setup blkio limit for disk used by this filesystem
- <disk> [r|w]: <bytes> - setup blkio limit for disk
io_guarantee - IO bandwidth guarantee, syntax: fs|<path>|<disk> [r|w]: <bytes/s>;... (see description in io_limit above)
io_ops_limit - IOPS limit: fs|<path>|<disk> [r|w]: <iops>;... (see description in io_limit above)
io_ops_guarantee - IOPS guarantee: fs|<path>|<disk> [r|w]: <iops>;... (see description in io_limit above)
io_policy - IO scheduler policy, see ioprio_set(2)
- none - set by cpu_policy, blkio.weight = 500 (default)
- rt - IOPRIO_CLASS_RT(4), blkio.weight = 1000 (highest)
- high - IOPRIO_CLASS_BE(0), blkio.weight = 1000
- normal - IOPRIO_CLASS_BE(4), blkio.weight = 500
- batch - IOPRIO_CLASS_BE(7), blkio.weight = 10
- idle - IOPRIO_CLASS_IDLE, blkio.weight = 10 (lowest)
io_weight - IO weight, syntax: 0.01..100, default: 1

Additional multiplier for blkio.weight.
io_pressure - io pressure stall total in us

Total amount of time (in microseconds) during which processes were waiting for io.

Network

Matching interfaces by name supports masks '?' and '*'. Interfaces are aggregated into groups from /etc/iproute2/group, see ip-link(8).

Possible indexes for statistics and parameters:
- default - all interfaces
- Uplink - external links, not VLANs or tunnels
- <interface> - particular interface
- group <group> - group of interfaces
- CS0|...|CS7 - DSCP class at all uplink interfaces
- <interface> CSx - DSCP class at particular interface
- Leaf CSx - leaf tc class for container
- Fallback CSx - default tc class in host
- Saved CSx - removed devices and containers

net - network namespace configuration, Syntax: <option> [args]...;...
- inherited - use parent container network namespace (default)
- none - empty namespace, no network access (default for virt_mode=os)
- L3 [extra_routes] <name> [master] - veth pair and ip routing from host
- NAT [name] - same as L3 but assign ip automatically
- tap <name> - tap interface with L3 routing
- ipip6 <name> <remote> <local> - ipv4-via-ipv6 tunnel
- MTU <name> <mtu> - set MTU for device
- MAC <name> <mac> - set MAC for device
- autoconf <name> - wait for IPv6 SLAAC configuration
- ip <cmd> <args>... - configure namespace with ip(8)
- container <name> - use namespace of this container
- netns <name> - use namespace created by ip-netns(8)
- steal <device> - steal device from parent container
- macvlan <master> <name> [bridge|private|vepa|passthru] [mtu] [hw]
- ipvlan <master> <name> [l2|l3] [mtu]
- veth <name> <bridge> [mtu] [hw]
- ECN [name] - enable ECN
ip - ip addresses, syntax: <interface> <ip>[/<prefix>];...

For L3 devices whole prefix is routed inside, see [L3].
ip_limit - ip allowed for sub-containers: none|any|<ip>[/<mask>];...

If set sub-containers could use only L3 networks and only with these ip.
default_gw - default gateway, syntax: <interface> <ip>;...
hostname - hostname inside container

Inside container root /etc/hostname must be a regular file, porto bind-mounts temporary file over it.
etc_hosts - Override /etc/hosts content
resolv_conf - DNS resolver configuration, syntax: default|keep|<resolv.conf option>;...

Default setting resolv_conf="default" loads configuration from portod.conf:
```
container {
    default_resolv_conf: "nameserver <ip>;nameserver <ip>;..."
}
```
or from host /etc/resolv.conf if option in portod.conf isn't set.

Inside container root /etc/resolv.conf must be a regular file, porto bind-mounts temporary file over it.

Setting resolv_conf="keep" keeps configuration in container as is.
sysctl - sysctl configuration, syntax: <sysctl>: <value>;...

Porto allows setting only virtualized sysctls from a hardcoded whitelist.

Default values for network and ipc sysctls are the same as in host and can be overridden in portod.conf:
```
container {
    ipc_sysctl {
        key: "sysctl"
        val: "value"
    },
    ...
    net_sysctl {
        key: "sysctl"
        val: "value"
    },
    ...
}
```
net_guarantee - required egress bandwidth: <interface>|group <group>|default: <Bps>;...

"eth0 CS0" - egress guarantee for CS0 at eth0 in host
CS0 - egress guarantee for CS0 at each host uplink
eth0 - egress guarantee for each class at eth0 in host
default - default guarantee for everything
net_limit - maximum egress bandwidth: <interface>|group <group>|default: <Bps>;...

"eth0 CS0" - egress limit for CS0 at eth0 in host
CS0 - egress limit for CS0 at each host uplink
veth - total egress limit for net="L3 veth"
default - default limit for everything
CS7: 1 - setup blackhole
net_rx_limit - maximum ingress bandwidth: <interface>|group <group>|default: <Bps>;...

veth - total ingress limit for net="L3 veth"
net_bytes - traffic class counters: <interface>|<class>: <bytes>;...
net_class_id - traffic class: <class>: major:minor (hex)
net_drops - tc drops: <interface>|<class>: <packets>;...
net_overlimits - tc overlimits: <interface>|<class>: <packets>;...
net_packets - tc packets: <interface>|<class>: <packets>;...
net_rx_bytes - device rx bytes: <interface>|group <group>: <bytes>;...
net_rx_drops - device rx drops: <interface>|group <group>: <packets>;...
net_rx_packets - device rx packets: <interface>|group <group>: <packets>;...
net_tx_bytes - device tx bytes: <interface>|group <group>: <bytes>;...
net_tx_drops - device tx drops: <interface>|group <group>: <packets>;...
net_tx_packets - device tx packets: <interface>|group <group>: <packets>;...
net_tos - default IP ToS: CS0..CS7

Without an offstream kernel patch, this property currently defines only the default TC class for containers that live in the host network namespace.

Porto sets up first-level TC classes for each CSx. The default class, weights and limits can be set in portod.conf:
```
network {
    default_tos: "CS0"
    dscp_class {
        name: "CS1"
        weight: 10
        limit: 123456
        max_percent: 16.5
    }
    dscp_class {
        name: "CS3"
        weight: 42
    }
    ...
}
```
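
The net_guarantee, net_limit and net_rx_limit properties above share the `<key>: <Bps>;...` syntax. The following is a minimal sketch of how such a value could be split into a mapping; the function name `parse_bandwidth_map` is hypothetical, and it assumes rates are plain integers without unit suffixes, which the text above does not confirm.

```python
from typing import Dict


def parse_bandwidth_map(value: str) -> Dict[str, int]:
    """Split '<key>: <Bps>;...' into a {key: bytes_per_second} mapping."""
    result: Dict[str, int] = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        # The key may contain spaces ("eth0 CS0", "group uplinks"),
        # so split on the last ':' only.
        key, sep, rate = entry.rpartition(":")
        if not sep or not key.strip():
            raise ValueError("expected '<key>: <Bps>', got %r" % entry)
        # Assumption: the rate is a plain integer in bytes per second.
        result[key.strip()] = int(rate)
    return result


limits = parse_bandwidth_map("default: 125000000; eth0 CS0: 12500000; CS7: 1")
print(limits)
```

Here "CS7: 1" corresponds to the blackhole setup listed under net_limit.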

EXTRA PROPERTIES

Option extra_properties in portod.conf sets container properties for containers whose names match a filter, if the property is not already set.

Example:

container {
    extra_properties {
        filter: "abc"
        properties {
            name: "command"
            value: "sleep 123"
        }
        properties {
            name: "max_respawns"
            value: "10"
        }
    }
    extra_properties {
        filter: "***"
        properties {
            name: "cgroupfs"
            value: "rw"
        }
    }
}

NETWORKING

L2

Mode net="macvlan eth0 eth0", or net="macvlan eth0 eth0; autoconf eth0" for SLAAC, creates an isolated netns with macvlan eth0 on device eth0. See ip-link(8).

L3

Mode net=L3 connects host and container netns with veth pair and configures routing in both directions.

L3 adds neighbour/arp proxy entries to interfaces with addresses from the same network, so the container becomes reachable from the outside. For an ip with a prefix, a proxy entry is added for each address in the range, but the range size is limited.

network {
    proxy_ndp: true                 # add proxy entries
    proxy_ndp_watchdog_ms: 60000    # reconstruction period
    proxy_ndp_max_range: 16         # limit for subnet size
}

Host sysctl configuration:

net.ipv4.conf.all.forwarding = 1
net.ipv6.conf.all.forwarding = 1
net.ipv6.conf.all.proxy_ndp = 1
net.ipv6.conf.all.accept_ra = 2

The default MTU is copied from the host interface which has addresses from the same L2 domain. The MTU for the L3 device and for the ipv4/ipv6 default routes can be set in portod.conf:

network {
   l3_default_mtu: <device-mtu>
   l3_default_ipv4_mtu: <ipv4-mtu>
   l3_default_ipv6_mtu: <ipv6-mtu>
}

For an L3 network, extra_routes can be added, each with its own mtu and advmss:

network {
    extra_routes {
        dst: "default | ipv6"
        mtu: <mtu>
        advmss: <advmss>
    }
}

Example:

network {
    extra_routes {
        dst: "default"
        mtu: 1450
        advmss: 1390
    },
    extra_routes {
        dst: "64:ff9b::/96"
        mtu: 1450
        advmss: 1390
    },
    extra_routes {
        dst: "2a02:6b8::/32"
        mtu: 8910
    },
    extra_routes {
        dst: "2620:10f:d000::/44"
        mtu: 8910
    },
}

To enable extra_routes in a container, add the extra_routes option to the net property:

net="L3 extra_routes ..."

Since porto 5.1, extra_routes is enabled by default for L3 networks.

NAT

Mode net=NAT works like L3 and automatically allocates an IP from the pool configured in portod.conf:

network {
    nat_first_ipv4: "<ip>"
    nat_first_ipv6: "<ip>"
    nat_count: <count>
}

Example:

network {
    nat_first_ipv4: "192.168.42.1"
    nat_first_ipv6: "fec0::42:1"
    nat_count: 255
}

Host iptables configuration:

iptables -t nat -A POSTROUTING -s 192.168.42.0/24 -j MASQUERADE
ip6tables -t nat -A POSTROUTING -s fec0::42:0/120 -j MASQUERADE
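
When choosing the MASQUERADE source subnet, it can help to check that the whole NAT pool falls inside it. The sketch below assumes the pool is nat_count consecutive addresses starting at nat_first_ipv4 or nat_first_ipv6; this is an interpretation of the option names, and the helper names `nat_pool` and `pool_fits` are hypothetical.

```python
import ipaddress


def nat_pool(first: str, count: int):
    """Return (first, last) address of a pool of `count` consecutive IPs."""
    start = ipaddress.ip_address(first)
    return start, start + (count - 1)


def pool_fits(first: str, count: int, subnet: str) -> bool:
    """Check that every address of the pool lies inside `subnet`."""
    start, end = nat_pool(first, count)
    net = ipaddress.ip_network(subnet)
    return start in net and end in net


# Values taken from the example configuration and iptables rules above.
print(nat_pool("192.168.42.1", 255))
print(pool_fits("192.168.42.1", 255, "192.168.42.0/24"))
print(pool_fits("fec0::42:1", 255, "fec0::42:0/120"))
```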

Address label

In new network namespaces porto sets up ip-addrlabel(8) entries from portod.conf:

network {
    addrlabel {
        prefix: "ip/mask"
        label: number
    },
    ...
}

This helps the container choose the correct source IP when it works in several networks.

Traffic scheduler

Porto sets up a tc scheduler for all host interfaces except those listed in portod.conf as unmanaged:

network {
    unmanaged_device: "name"
    unmanaged_group: "group"
}

CGROUPS

Porto enables all cgroup controllers for first-level containers, or if any related limit is set. Otherwise the container shares cgroups with its parent.

Controller "cpuacct" is enabled for all containers if it isn't bound with other controllers.

Controller "freezer" is used for management and enabled for all containers.

Cgroup tree required for systemd is configured automatically for virt_mode=os containers if /sbin/init is a symlink to systemd.

Enabled controllers are shown in property controllers and can be enabled by: controllers[name]=true. For now cgroups cannot be enabled for a running container. The resulting cgroup path is shown in property cgroups[name].

All cgroup knobs are exposed as read-only properties <name>.<knob>, for example memory.stat.

COREDUMPS

Portod can register itself as a core dump helper and forward cores into the container if it has core_command set.

Variables set in environment and substituted in core_command:

CORE_PID (pid inside container)
CORE_TID (crashed thread id)
CORE_SIG (signal)
CORE_TASK_NAME (comm for PID)
CORE_THREAD_NAME (comm for TID)
CORE_EXE_NAME (executable file)
CORE_CONTAINER
CORE_OWNER_UID
CORE_OWNER_GID
CORE_DUMPABLE
CORE_ULIMIT
CORE_DATETIME (%Y%m%dT%H%M%S)

The command is executed in a non-isolated sub-container and receives the core dump on stdin.

For example:

core_command='cp --sparse=always /dev/stdin crash-${CORE_EXE_NAME}-${CORE_PID}-S${CORE_SIG}.core'

saves core into file in container.
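
To see what the command above looks like after substitution, the expansion can be mimicked with Python's `string.Template`, which understands the same `${NAME}` form. This only illustrates the resulting command line; the values are hypothetical and the snippet does not reproduce how porto itself performs the substitution.

```python
from string import Template

core_command = ("cp --sparse=always /dev/stdin "
                "crash-${CORE_EXE_NAME}-${CORE_PID}-S${CORE_SIG}.core")

# Hypothetical values for a crashed task.
env = {"CORE_EXE_NAME": "nginx", "CORE_PID": "4242", "CORE_SIG": "11"}

# substitute() raises KeyError if a referenced variable is missing.
expanded = Template(core_command).substitute(env)
print(expanded)
```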

Container ulimit[core] should be set to unlimited, or anything > 1.

Cores from tasks with the suid bit or ambient capabilities are ignored unless suid core dumps are enabled via prctl(2) PR_SET_DUMPABLE or sysctl fs.suid_dumpable, or core_command is set and the container lives in a chroot.

The core command is executed in the same environment as the container. Information in proc about the crashed process and thread is available too.

Porto makes sure that sysctl kernel.core_pipe_limit isn't zero, otherwise crashed task could exit and dismantle pid namespace too early.

Required setup in portod.conf:

core {
    enable: true
    default_pattern: "/coredumps/%e.%p.%s"
    space_limit_mb: 102400
    slot_space_limit_mb: 10240
}

The default pattern is used for non-container cores or if the core command isn't set. It may use the kernel '%' template specifiers defined in core(5). If default_pattern ends with '.gz' or '.xz', the core is compressed.

The file owner is set according to owner_user and owner_group.

For uncompressed format porto detects zero pages and turns them into file holes and flushes written data to disk every 4Mb (set by option core.sync_size).

Porto also creates hardlink in same directory with name:

${CORE_CONTAINER//\//%}%${CORE_EXE_NAME}.${CORE_PID}.S${CORE_SIG}.$(date +%Y%m%dT%H%M%S).core
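
The shell-style pattern above replaces every '/' in the container name with '%'. A small sketch that builds the same name in Python, using a hypothetical helper `core_hardlink_name` and made-up values:

```python
from datetime import datetime


def core_hardlink_name(container: str, exe: str, pid: int, sig: int,
                       when: datetime) -> str:
    """Build the hardlink name following the pattern shown above."""
    flat = container.replace("/", "%")  # same as ${CORE_CONTAINER//\//%}
    stamp = when.strftime("%Y%m%dT%H%M%S")
    return "{}%{}.{}.S{}.{}.core".format(flat, exe, pid, sig, stamp)


name = core_hardlink_name("web/worker", "nginx", 4242, 11,
                          datetime(2024, 1, 2, 3, 4, 5))
print(name)
```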

Option space_limit_mb limits the total size of the default pattern directory; once it is exceeded, new cores are discarded.

Option slot_space_limit_mb limits total size for each first-level container.

Limits are counted only for already dumped cores and do not include the size of the core being dumped, so these limits may be exceeded.

Container option coredump_filter can be used to control which memory segments are written to the core dump. The bits in the mask are set according to core(5), the filter is written in hex format.
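
As an illustration, a coredump_filter mask can be composed from the bit positions documented in core(5). The symbolic names and the helper below are hypothetical; whether porto expects the "0x" prefix is an assumption, since the text above only says the filter is written in hex.

```python
# Bit positions as documented in core(5).
COREDUMP_BITS = {
    "anon_private": 0,
    "anon_shared": 1,
    "file_private": 2,
    "file_shared": 3,
    "elf_headers": 4,
    "private_huge": 5,
    "shared_huge": 6,
    "private_dax": 7,
    "shared_dax": 8,
}


def coredump_filter(segments) -> str:
    """Return the hex mask with one bit set per requested segment type."""
    mask = 0
    for name in segments:
        mask |= 1 << COREDUMP_BITS[name]
    return format(mask, "#x")


# The kernel default documented in core(5) is 0x33.
default_mask = coredump_filter(
    ["anon_private", "anon_shared", "elf_headers", "private_huge"])
print(default_mask)
```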

Total and dumped cores are counted in labels CORE.total, CORE.dumped at container and parents.

Porto never deletes old core dumps.

VOLUMES

Porto provides a "volumes" abstraction to manage disk space. Each volume is identified by a full path. All paths must be absolute and normalized; symlinks are not allowed. You can either declare a volume path manually or delegate this task to porto.

A volume can be linked to one or more containers, links act as reference counter: unlinked volume will be destroyed automatically. By default volume is linked to the container that created it: "self", "/" for host.

Each link might define target path for exposing volume inside container. This path also works as alias for volume path for requests from container.

Link also might restrict access to read-only.

By default volume unlink calls lazy umount(2) with flag MNT_DETACH, strict unlink calls normal umount and fails if some files are opened.

Volume Properties

As with containers, volume configuration is a set of key-value pairs.

id - volume id, 64-bit decimal
backend - backend engine, default: autodetect
- dir - directory for linking into containers
- plain - bind mount storage to volume path
- bind - bind mount storage to volume path, requires volume path
- rbind - recursive bind mount storage to volume path, requires volume path
- tmpfs - mount new tmpfs instance
- hugetmpfs - tmpfs with transparent huge pages
- quota - project quota for volume path
- native - bind mount storage to volume path and setup project quota
- overlay - mount overlayfs and optional project quota for upper layer
- squash - overlayfs and quota on top of squashfs image set in layers
- loop - create and mount ext4 image storage/loop.img, or storage itself if it is a file
- rbd - map and mount ext4 image from ceph rbd storage="id@pool/image"
- lvm - ext4 in lvm(8) storage="[group][/name][@thin][:origin]"
Depending on the chosen backend some properties become required or not supported.
storage - persistent data storage, default: internal non-persistent
- /path - path to directory to be used as storage
- name - name of internal persistent storage, see [Volume Storage]
Some backends (rbd, lvm) expect configuration in a special format.

Storage directory must be writeable for user (or readable for read-only volume). Loop image file is read-only, parent directory must be writable.
ready - is construction complete
build_time - format: YYYY-MM-DD hh:mm:ss
change_time - format: YYYY-MM-DD hh:mm:ss
state - volume state
- initial
- building
- ready
- unlinked
- to-destroy
- destroying
- destroyed
private - 4096 bytes of user-defined text
device_name - name of backend disk device (sda, md0, dm-0)
owner_container - owner container, default: creator

Used for tracking place_usage and place_limit.
owner_user - owner user, default: creator
owner_group - owner group, default: creator
target_container - define root path, default: creator

Volume will be created inside root path of this container.
user - directory user, default: creator
group - directory group, default: creator
permissions - directory permissions, default: 0775
creator - container user group
read_only - true or false, default: false
fs_type - filesystem type of image (backend=loop) or network device (backend=nbd)
containers - initial links, syntax: container [target] [ro] [!];... default: "self"

Target defines path inside container root, flag "ro" makes link read-only, "!" - adds into volumes_required.
layers - layers, syntax: top-layer;...;bottom-layer
- /path - path to layer directory
- name - name of layer in internal storage, see [Volume Layers]
Backend overlay use layers directly.

Backend squash expects path to a squashfs image as top-layer.

Some backends (plain, native, loop, lvm, rbd) copy layers into volume during construction.
place - place for layers and default storage

This is path to directory where must be sub-directories "porto_layers", "porto_storage" and "porto_volumes".

Default and possible paths are controlled by container property place:
- default - first path in client container property place
- /path - path in host
- ///path - path in client container
- alias - path in host set as alias=/path in property place
place_key - key for charging place_limit for owner_container

Key equal to place if backend keeps data in filesystem.

Key is empty if the backend doesn't limit or own data (plain, bind, quota), or if storage is provided by the user.

Some backends use their own keys: "tmpfs", "lvm <group>", "rbd".
space_limit - disk space limit, default: 0 - unlimited
inode_limit - disk inode limit, default: 0 - unlimited
space_guarantee - disk space guarantee, default: 0

Guarantees that the resource is available at the time of creation and protects it from being claimed by guarantees in later requests.
inode_guarantee - disk inode guarantee, default: 0
space_used - current disk space usage
inode_used - current disk inode usage
space_available - available disk space
inode_available - available disk inodes
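
The containers property described above uses the syntax `container [target] [ro] [!];...`. A minimal sketch of a parser for it follows; the `Link` type and the function name are hypothetical, and it assumes any token other than "ro" and "!" after the container name is the target path.

```python
from typing import List, NamedTuple, Optional


class Link(NamedTuple):
    container: str
    target: Optional[str]
    read_only: bool
    required: bool


def parse_volume_links(value: str) -> List[Link]:
    """Parse 'container [target] [ro] [!];...' into Link records."""
    links = []
    for entry in value.split(";"):
        tokens = entry.split()
        if not tokens:
            continue  # skip empty entries such as a trailing ';'
        target, read_only, required = None, False, False
        for token in tokens[1:]:
            if token == "ro":
                read_only = True
            elif token == "!":
                required = True
            else:
                target = token  # assumption: remaining token is the target
        links.append(Link(tokens[0], target, read_only, required))
    return links


print(parse_volume_links("self /data ro;web/app /vol !"))
```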

Volume Storage

Storage is a directory used by volume backend for keeping volume data. Most volume backends by default use non-persistent temporary storage: place/porto_volumes/id/backend.

If storage is specified then volume becomes persistent and could be reconstructed using same storage.

Porto provides internal persistent volume storage, data are stored in place/porto_storage/storage.

Volume Layers

Porto provides internal storage for overlayfs layers. Each layer belongs to a particular place, is identified by name, and is stored in place/porto_layers/layer. Porto remembers the owner's user:group and the time since last usage.

Layer names shouldn't start with '_', except for special prefixes.

Layers whose names start with '_weak_' are removed once their last user is gone.

Porto provides an API for importing and exporting layers in the form of compressed tarballs in overlay or aufs formats. For details see portoctl command layers.

For building layers see portoctl command build and sample scripts in layers/ in porto sources.

Meta Storage

Several volume storages and layers can be enclosed in a precreated meta-storage which enforces a space limit for them all together.

Such layers and storages have names with the prefix meta-storage/, for example meta-storage/sub-layer and meta-storage/sub-storage.

For details see portoctl command storage and API.
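
The on-disk layout listed in the FILES section suggests how these names map to directories. The sketch below is inferred from those example paths only, not from porto sources, and the helper names are hypothetical.

```python
import posixpath


def storage_path(place: str, name: str) -> str:
    """Directory of a persistent storage, plain or inside a meta-storage."""
    if "/" in name:
        meta, sub = name.split("/", 1)
        return posixpath.join(place, "porto_storage", "_meta_" + meta, sub)
    return posixpath.join(place, "porto_storage", name)


def layer_path(place: str, name: str) -> str:
    """Directory of a layer, plain or inside a meta-storage."""
    if "/" in name:
        meta, sub = name.split("/", 1)
        return posixpath.join(place, "porto_storage", "_meta_" + meta,
                              "_layer_" + sub)
    return posixpath.join(place, "porto_layers", name)


print(storage_path("/place", "meta-storage/sub-storage"))
print(layer_path("/place", "meta-storage/sub-layer"))
```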

Backend LVM

Backend lvm takes configuration from property storage in the format: "[group][/name][@thin][:origin]".

The default volume group can be set in portod.conf:

volumes {
    default_lvm_group: "group"
}

If "name" is set, the volume becomes persistent: porto keeps and reuses logical volume "group/name". The user has to remove it using lvremove(8).

If "thin" is set, then the volume is allocated from the precreated thin pool "group/thin".

If "origin" is set then volume is created as thin snapshot of "group/origin" and belongs to the same pool.
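
The storage format "[group][/name][@thin][:origin]" can be split into its optional parts with a regular expression. This is a sketch under the assumption that none of the components contains '/', '@' or ':'; the function name is hypothetical.

```python
import re
from typing import Dict, Optional

LVM_STORAGE_RE = re.compile(
    r"(?P<group>[^/@:]*)"
    r"(?:/(?P<name>[^@:]*))?"
    r"(?:@(?P<thin>[^:]*))?"
    r"(?::(?P<origin>.*))?"
)


def parse_lvm_storage(storage: str) -> Dict[str, Optional[str]]:
    """Split '[group][/name][@thin][:origin]'; absent parts become None."""
    match = LVM_STORAGE_RE.fullmatch(storage)
    if match is None:
        raise ValueError("bad lvm storage: %r" % storage)
    return {key: (val or None) for key, val in match.groupdict().items()}


print(parse_lvm_storage("vg0/data@pool:base"))
print(parse_lvm_storage("/data"))
```

An empty group here would fall back to default_lvm_group from portod.conf, as described above.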

EXAMPLES

Run command in foreground:

$ portoctl exec hello command='echo "Hello, world!"'

Run command in background:

$ portoctl run stress command='stress -c 4' memory_limit=1G cpu_limit=2.5c
$ portoctl destroy stress

Create volume and destroy:

$ mkdir volume
$ portoctl vcreate $PWD/volume space_limit=1G
$ portoctl vlist -v $PWD/volume
$ portoctl vunlink $PWD/volume
$ rmdir volume

Create volume with automatic path and destroy:

$ VOLUME=$(portoctl vcreate -A space_limit=1G)
$ portoctl vunlink $VOLUME

Run os level container and enter inside:

$ portoctl layer -I vm-layer vm-layer.tgz
$ portoctl run vm layers=vm-layer space_limit=1G virt_mode=os hostname=vm memory_limit=1G cpu_limit=2c net="L3 eth0" ip="eth0 192.168.1.42"
$ portoctl shell vm
^D
$ portoctl destroy vm
$ portoctl layer -R vm-layer

Show containers and resource usage:

$ portoctl top

See portoctl(8) for details.

FILES

/run/portod.socket

Porto API unix socket.

/run/portod
/run/portod.version

Symlink to the currently running portod binary and its version.

/run/portoloop.pid
/run/portod.pid

Pid files for the porto master and slave daemons.

/var/log/portod.log

Porto daemon log file.

/run/porto/kvs
/run/porto/pkvs

Container and volumes key-value storage.

/usr/share/doc/porto/rpc.proto.gz

Porto API protobuf.

/etc/defaults/portod.conf (deprecated, do not use)
/etc/portod.conf
/etc/portod.conf.d/*.conf (loaded in sorted order)

Porto daemon configuration in protobuf text format.
Porto merges it with hardcoded defaults and prints into log when starts.

/usr/share/doc/porto/config.proto.gz

Porto configuration file protobuf.

/place/porto/container

Default current/working directories for containers.

/place/porto_volumes/id

Default place keeping volumes and their data.

/place/porto_layers/layer

Default place for keeping overlayfs layers.

/place/porto_storage/storage

Default place for persistent volume storages.

/place/porto_storage/_meta_meta-storage
/place/porto_storage/_meta_meta-storage/sub-storage
/place/porto_storage/_meta_meta-storage/_layer_sub-layer

Meta-storage with nested layers and storages.

LINUX KERNEL FEATURES

Required

CONFIG_MEMCG
CONFIG_FAIR_GROUP_SCHED
CONFIG_CFS_BANDWIDTH
CONFIG_CGROUP_FREEZER
CONFIG_CGROUP_DEVICE
CONFIG_CGROUP_CPUACCT
CONFIG_PID_NS
CONFIG_NET_NS
CONFIG_IPC_NS
CONFIG_UTS_NS
CONFIG_IPV6

Recommended

CONFIG_CGROUP_PIDS
CONFIG_BLK_CGROUP
CONFIG_RT_GROUP_SCHED
CONFIG_CPUSETS
CONFIG_CGROUP_HUGETLB
CONFIG_HUGETLBFS
CONFIG_OVERLAY_FS
CONFIG_SQUASHFS
CONFIG_BLK_DEV_LOOP
CONFIG_BLK_DEV_DM
CONFIG_DM_THIN_PROVISIONING
CONFIG_VETH
CONFIG_TUN
CONFIG_MACVLAN
CONFIG_IPVLAN
CONFIG_NET_SCH_HTB
CONFIG_NET_SCH_HFSC
CONFIG_NET_SCH_SFQ
CONFIG_NET_SCH_FQ_CODEL
CONFIG_NET_SCH_INGRESS
CONFIG_NET_ACT_POLICE
CONFIG_IPV6_TUNNEL
CONFIG_INET_DIAG

HOMEPAGE

https://github.com/ten-nancy/porto

AUTHORS

Roman Gushchin klamm@yandex-team.ru
Stanislav Fomichev stfomichev@yandex-team.ru
Konstantin Khlebnikov khlebnikov@yandex-team.ru
Evgeniy Kilimchuk ekilimchuk@yandex-team.ru
Michael Mayorov marchael@yandex-team.ru
Stanislav Ivanichkin sivanichkin@yandex-team.ru
Vsevolod Minkov vminkov@yandex-team.ru
Vsevolod Velichko torkve@yandex-team.ru
Maxim Samoylov max7255@yandex-team.ru
Dmitry Yakunin zeil@yandex-team.ru
Alexander Kuznetsov wwfq@yandex-team.ru
Alexander Ovechkin ovov@yandex-team.ru
Lev Pantiukhin kndrvt@yandex-team.ru
