Live Migration

Checkpoint/Restore In Userspace (CRIU) is a utility for checkpointing and restoring a process tree. It is commonly used with LXD and Docker to provide live snapshot and restore, and it can go a step further to live migration of a running container by saving all necessary state to persistent storage.

Installation of CRIU

CRIU runs mostly in user space, but it needs some features from the Linux kernel and surrounding tools to be fully functional:

Linux >= 3.11 (>= 4.15 is recommended)
iproute2 >= 3.5.0 for dumping network namespaces
ptrace must be allowed
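
The kernel requirement above can be checked before installing anything. The following is a minimal Python sketch; the function names are illustrative and not part of CRIU, and only the major and minor version numbers are compared.

```python
import re

MINIMUM = (3, 11)       # oldest kernel listed above
RECOMMENDED = (4, 15)   # kernel version recommended above


def parse_kernel_version(release):
    """Extract (major, minor) from a release string such as '4.15.0-20-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def kernel_status(release):
    """Classify a kernel release as 'too old', 'supported' or 'recommended'."""
    version = parse_kernel_version(release)
    if version < MINIMUM:
        return "too old"
    if version < RECOMMENDED:
        return "supported"
    return "recommended"
```

On a live system the release string can be taken from platform.release() or uname -r and passed to kernel_status().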

The software is packaged in both Debian Sid and Ubuntu 18.04. On either distribution a single command installs it: apt update && apt install criu

After the installation finishes, check that it works: criu check

It should print "Looks good." when the check passes; warnings are shown when something needs attention.
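
The check can also be run from a script, for example before attempting a dump. This is a small sketch that assumes criu is on the PATH and treats exit status 0 as success; the function name criu_check is made up for this example.

```python
import subprocess


def criu_check():
    """Run `criu check` and return (ok, output).

    ok is False when criu is not installed or exits with a non-zero status.
    """
    try:
        proc = subprocess.run(
            ["criu", "check"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except FileNotFoundError:
        return False, "criu not found in PATH"
    return proc.returncode == 0, proc.stdout
```

The returned output contains any warnings, so it is worth logging even when ok is True.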

Checkpoint & Restore a process group

Here we take subutai-nginx.service as an example.

First find the PID of the process to dump. It must be a process group leader:

root@debian:~# systemctl status subutai-nginx
● subutai-nginx.service - nginx instance for subutai
   Loaded: loaded (/lib/systemd/system/subutai-nginx.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2018-06-02 03:53:20 CST; 37min ago
     Docs: man:nginx(8)
  Process: 8019 ExecReload=/usr/sbin/nginx -c /etc/subutai/nginx/nginx.conf -g daemon on; master_process on; -s reload (code=exited, status=0/SUCCESS)
  Process: 822 ExecStart=/usr/sbin/nginx -c /etc/subutai/nginx/nginx.conf -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
  Process: 820 ExecStartPre=/usr/sbin/nginx -c /etc/subutai/nginx/nginx.conf -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
 Main PID: 824 (nginx)
    Tasks: 3 (limit: 4915)
   Memory: 13.0M
      CPU: 144ms
   CGroup: /system.slice/subutai-nginx.service
           ├─ 824 nginx: master process /usr/sbin/nginx -c /etc/subutai/nginx/nginx.conf -g daemon on; master_process on;
           ├─8132 nginx: worker process
           └─8143 nginx: cache manager process

So the Main PID is 824, which will be our target for this dump and restore experiment.
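
Reading the PID off the status output works for a demonstration, but a script would rather ask systemd directly. The sketch below relies on systemctl show -p MainPID, which prints a line of the form MainPID=824; the helper names are illustrative.

```python
import subprocess


def parse_main_pid(show_output):
    """Parse the 'MainPID=<n>' line printed by `systemctl show -p MainPID`."""
    for line in show_output.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "MainPID":
            pid = int(value)
            if pid == 0:
                # systemd reports 0 when the unit has no main process
                raise RuntimeError("service has no main process (not running?)")
            return pid
    raise ValueError("no MainPID line in systemctl output")


def main_pid(unit):
    """Ask systemd for the main PID of a unit, e.g. 'subutai-nginx'."""
    output = subprocess.check_output(
        ["systemctl", "show", "-p", "MainPID", unit], universal_newlines=True
    )
    return parse_main_pid(output)
```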

To shorten the freeze time later, pre-dump the process first: mkdir -p /root/dump/nginx && criu pre-dump -t 824 -D /root/dump/nginx. The resulting folder contains the following files:

root@debian:~# ls /root/dump/nginx/
irmap-cache       pagemap-8143.img  pagemap-shmem-15577.img  pagemap-shmem-15581.img  pages-2.img  pages-4.img  pages-6.img
pagemap-8132.img  pagemap-824.img   pagemap-shmem-15578.img  pages-1.img              pages-3.img  pages-5.img  stats-dump

To actually checkpoint the process: criu dump -t 824 -D /root/dump/nginx/. The folder now contains:

root@debian:~# ls /root/dump/nginx/
cgroup.img     fdinfo-2.img  fs-8132.img   ids-8143.img   mm-8132.img       pagemap-8143.img         pagemap-shmem-15581.img  pages-4.img  stats-dump
core-8132.img  fdinfo-3.img  fs-8143.img   ids-824.img    mm-8143.img       pagemap-824.img          pages-1.img              pages-5.img
core-8143.img  fdinfo-4.img  fs-824.img    inventory.img  mm-824.img        pagemap-shmem-15577.img  pages-2.img              pages-6.img
core-824.img   files.img     ids-8132.img  irmap-cache    pagemap-8132.img  pagemap-shmem-15578.img  pages-3.img              pstree.img
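
A quick way to see which tasks ended up in the dump is to look at the core-<pid>.img files in the listing. The sketch below assumes the file naming visible above; it is not an official CRIU interface, and the function names are made up for this example.

```python
import os
import re

CORE_IMAGE = re.compile(r"^core-(\d+)\.img$")


def dumped_pids(filenames):
    """Return the sorted PIDs that have a core-<pid>.img file in the listing."""
    pids = []
    for name in filenames:
        match = CORE_IMAGE.match(name)
        if match:
            pids.append(int(match.group(1)))
    return sorted(pids)


def dumped_pids_in(directory):
    """Same as dumped_pids(), reading the names from an image directory."""
    return dumped_pids(os.listdir(directory))
```

For the directory above, dumped_pids_in("/root/dump/nginx") would list the nginx master and its two children. After only a pre-dump the result is empty, because no core images have been written yet.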

At this point systemctl status subutai-nginx shows the service in the failed state, because the main process has been killed by the dump.

Create a new PID namespace and a new mount namespace, and mount the /proc filesystem before the processes run: unshare -p -m --fork --mount-proc
Restore the image and detach from the process once it is restored: criu restore -d -D /root/dump/nginx/
Verify that the process group is back and was not started by systemd: systemctl status subutai-nginx still shows the failed state.
Verify that the process group has all its children with ps aux | grep nginx:

root@debian:~# ps aux | grep nginx
root       824  0.0  0.0 184448  1928 ?        Ss   04:32   0:00 nginx: master process /usr/sbin/nginx -c /etc/subutai/nginx/nginx.conf -g daemon on; master_process on;
daemon    8132  0.0  0.1 184864  2348 ?        S    04:32   0:00 nginx: worker process
daemon    8143  0.0  0.0 184648  2040 ?        S    04:32   0:00 nginx: cache manager process
root      8226  0.0  0.0  12784   940 pts/0    S+   04:32   0:00 grep nginx
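
The same verification can be scripted by probing each expected PID with signal 0. This is a minimal sketch with illustrative function names; it has to run in the same PID namespace as the restored processes, otherwise the PIDs will not match.

```python
import errno
import os


def pid_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence and permission check
    except OSError as exc:
        if exc.errno == errno.ESRCH:
            return False  # no such process
        if exc.errno == errno.EPERM:
            return True   # exists, but is owned by another user
        raise
    return True


def missing_pids(expected):
    """Return the PIDs from `expected` that are not running."""
    return [pid for pid in expected if not pid_alive(pid)]
```

For the example above, missing_pids([824, 8132, 8143]) should return an empty list after a successful restore.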

There is also a helper script called criu-ns that can assist with restoring in a pseudo-container.

Checkpoint & Restore an unprivileged container

Create a container by cloning the debian-stretch template: subutai clone debian-stretch test
Add lxc.tty = 0 to its config: echo "lxc.tty = 0" >> /var/lib/lxc/test/config
Start the container: subutai start test
Find out the PID of the container:

root@debian:~# lxc-ls --active -f -F PID test
PID
20807

Create the folder for dumping: mkdir -p /root/dumps/test
Find out the tty device numbers using Python:

root@debian:~# python
Python 2.7.13 (default, Nov 24 2017, 17:33:09)
[GCC 6.3.0 20170516] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> st = os.stat("/proc/20807/root/dev/console")
>>> print "tty[%x:%x]" % (st.st_rdev, st.st_dev)
tty[8801:11]
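
The session above uses Python 2 syntax. A Python 3 equivalent that can be reused from a script might look like the sketch below; the function names are illustrative, and the formatting mirrors the print statement above.

```python
import os


def format_tty_id(rdev, dev):
    """Format device numbers as tty[rdev:dev] in hexadecimal."""
    return "tty[%x:%x]" % (rdev, dev)


def console_tty_id(container_pid):
    """Build the tty identifier for the console of the container with this init PID."""
    st = os.stat("/proc/%d/root/dev/console" % container_pid)
    return format_tty_id(st.st_rdev, st.st_dev)
```

Calling console_tty_id(20807) on the host from the example would produce the same string as the interactive session.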

Find out the veth MAC address:

root@debian:~# grep lxc.network.veth.pair /var/lib/lxc/test/config | cut -d' ' -f3
00163ec24665
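
If the value is needed in a script, the config file can be parsed without grep and cut. The sketch below assumes the simple key = value layout seen in the LXC config above; lxc_config_value is a hypothetical helper, not an LXC API.

```python
def lxc_config_value(config_text, key):
    """Return the value of the first `key = value` line in LXC config text, or None."""
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip()
    return None
```

Usage would be along the lines of lxc_config_value(open("/var/lib/lxc/test/config").read(), "lxc.network.veth.pair").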

Dump the container:

/usr/sbin/criu dump --tcp-established --file-locks --link-remap --manage-cgroups=full \
  --ext-mount-map auto --enable-external-sharing --enable-external-masters \
  --enable-fs hugetlbfs --enable-fs tracefs \
  -D /root/dumps/test -o /root/dumps/test/dump.log \
  --cgroup-root name=systemd:/lxc/test \
  --cgroup-root devices:/lxc/test \
  --cgroup-root freezer:/lxc/test \
  --cgroup-root cpu,cpuacct:/lxc/test \
  --cgroup-root pids:/lxc/test \
  --cgroup-root blkio:/lxc/test \
  --cgroup-root cpuset:/lxc/test \
  --cgroup-root net_cls,net_prio:/lxc/test \
  --cgroup-root perf_event:/lxc/test \
  --cgroup-root memory:/lxc/test \
  --ext-mount-map /sys/fs/fuse/connections:sys/fs/fuse/connections \
  --ext-mount-map /home:home \
  --ext-mount-map /opt:opt \
  --ext-mount-map /var:var \
  -t 20807 \
  --skip-in-flight \
  --freeze-cgroup /sys/fs/cgroup/freezer///lxc/test \
  --ext-mount-map /dev/console:console --external tty[8801:11] \
  --force-irmap \
  --leave-running
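
Most of the command consists of one --cgroup-root option per controller, all pointing at the same container path. When the command is generated by a script, that part can be built from a list. The controller names below are copied from the command above and depend on the host's cgroup layout; the function name is illustrative.

```python
CONTROLLERS = [
    "name=systemd", "devices", "freezer", "cpu,cpuacct", "pids",
    "blkio", "cpuset", "net_cls,net_prio", "perf_event", "memory",
]


def cgroup_root_args(container, controllers=CONTROLLERS):
    """Return the --cgroup-root options for an LXC container name."""
    args = []
    for controller in controllers:
        args += ["--cgroup-root", "%s:/lxc/%s" % (controller, container)]
    return args
```

The returned list can be appended to the argument list handed to subprocess, both for the dump and for the restore.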

We don't need --leave-running in a real deployment; it can be dangerous because the still-running processes may modify system state after the dump. However, the agent restarts the container whenever it is not aware of a stop request, so here we leave the container running and stop it manually in the next step.

Make sure the container is stopped:

root@debian:~# subutai stop test
INFO[2018-06-04 18:53:39] test stopped
root@debian:~# subutai list -i
NAME            STATE   IP      Interface
----            -----   --      ---------
management      STOPPED         eth0
test            STOPPED         eth0

Restore the container:

/usr/sbin/criu restore --tcp-established --file-locks --link-remap --manage-cgroups=full \
  --ext-mount-map auto --enable-external-sharing --enable-external-masters \
  --enable-fs hugetlbfs --enable-fs tracefs \
  -D /root/dumps/test -o /root/dumps/test/restore.log \
  --cgroup-root name=systemd:/lxc/test \
  --cgroup-root devices:/lxc/test \
  --cgroup-root freezer:/lxc/test \
  --cgroup-root cpu,cpuacct:/lxc/test \
  --cgroup-root pids:/lxc/test \
  --cgroup-root blkio:/lxc/test \
  --cgroup-root cpuset:/lxc/test \
  --cgroup-root net_cls,net_prio:/lxc/test \
  --cgroup-root perf_event:/lxc/test \
  --cgroup-root memory:/lxc/test \
  --ext-mount-map sys/fs/fuse/connections:/sys/fs/fuse/connections \
  --ext-mount-map home:/var/lib/lxc/test/home \
  --ext-mount-map opt:/var/lib/lxc/test/opt \
  --ext-mount-map var:/var/lib/lxc/test/var \
  --root /usr/lib/x86_64-linux-gnu/lxc/rootfs \
  --restore-detached --restore-sibling --inherit-fd fd[1]:tty[8801:11] \
  --ext-mount-map console:/dev/pts/0 \
  --external veth[eth0]:00163ec24665
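
Note how the --ext-mount-map options pair up between dump and restore: the dump maps a mount point inside the container to a key, and the restore maps the same key to a path on the host. The sketch below generates both sides from one table, assuming the key is the mount point without its leading slash as in the commands above. The console mapping uses a different key and is not covered; the function name is illustrative.

```python
def ext_mount_args(mounts):
    """Build matching --ext-mount-map options for dump and restore.

    `mounts` maps a mount point inside the container to its source on the host.
    """
    dump_args, restore_args = [], []
    for mountpoint, host_path in mounts.items():
        key = mountpoint.lstrip("/")  # e.g. '/home' -> 'home'
        dump_args += ["--ext-mount-map", "%s:%s" % (mountpoint, key)]
        restore_args += ["--ext-mount-map", "%s:%s" % (key, host_path)]
    return dump_args, restore_args
```

Generating both lists from the same table keeps the keys consistent, which matters because a key that appears only on one side cannot be resolved at restore time.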

Here we use fd[1] for convenience of demonstration, but creating a new fd and giving it to criu would be better.
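
One possible way to create such an fd is to open a fresh pseudo-terminal and hand its slave side to criu. The following is only a sketch under that assumption: the helper names are made up, and it assumes the console mapping should then point at the new pty's path instead of /dev/pts/0.

```python
import os


def inherit_fd_arg(fd, tty_id):
    """Format the value of --inherit-fd for a descriptor and a tty identifier."""
    return "fd[%d]:%s" % (fd, tty_id)


def console_options(tty_id):
    """Open a new pty and return (master_fd, slave_fd, extra criu options).

    tty_id is the identifier found earlier, e.g. 'tty[8801:11]'.
    """
    master, slave = os.openpty()
    options = [
        "--inherit-fd", inherit_fd_arg(slave, tty_id),
        # assumption: the console is mapped to the slave side of the new pty
        "--ext-mount-map", "console:%s" % os.ttyname(slave),
    ]
    return master, slave, options
```

The options would replace the fd[1] and console:/dev/pts/0 arguments in the restore command. When launching criu through subprocess, passing pass_fds=(slave,) keeps the descriptor open under the same number in the child; the caller keeps master to talk to the container console.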

Check the container's running state:

root@debian:~# subutai list -i
NAME            STATE   IP              Interface
----            -----   --              ---------
management      STOPPED                 eth0
test            RUNNING 10.10.10.32     eth0
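
For an automated migration, the state check can be scripted by parsing the listing. The sketch below assumes the column layout shown above, which is not a documented format and may change between Subutai versions; container_states is an illustrative name.

```python
def container_states(listing):
    """Map container name to state from `subutai list -i` output."""
    states = {}
    for line in listing.splitlines():
        fields = line.split()
        # skip blank lines, the header row and the dashed separator row
        if len(fields) < 2 or fields[0] in ("NAME", "----"):
            continue
        states[fields[0]] = fields[1]
    return states
```

A restore would count as successful when container_states(output).get("test") is "RUNNING", and the container is safely stopped before the restore when it is "STOPPED".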