Podman checkpoint/restore on btrfs

Posted on Feb 20, 2021

Podman Checkpoint/Restore in Userspace

Recently I was migrating our entire Podman infrastructure, roughly 180 containers, to a new, bigger server. Most of our things run in pods, which are fairly easy to migrate: just run podman generate kube --filename xyz.yml xyz-pod, clean up the yml a bit, and on the new server play it with podman play kube xyz.yml. Done, pod migrated.

On the other hand, we have a few containers running outside of pods, and I was interested in how lazy I could get with those using Podman. Fairly quickly I ran into Podman's support for Checkpoint/Restore In Userspace (CRIU). With CRIU, Podman is able to checkpoint and restore containers in their current state. Migration is just one use case; another would be to restore a container after a host reboot exactly the way it was prior to the restart. This is exactly what I needed: scripts back in the box, in comes the new and shiny stuff.
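A minimal sketch of the pod migration steps, written as a helper that only prints the commands so they can be reviewed before running. The function name migrate_pod_cmds, the host root@new-server, and the use of scp/ssh to move the file are assumptions for illustration; the post itself only names the two podman commands.

```shell
#!/bin/sh
# Sketch: print the commands for moving one pod to another host.
# migrate_pod_cmds is a hypothetical helper, not part of Podman.
migrate_pod_cmds() {
    pod="$1"                # pod name on the old server, e.g. xyz-pod
    new_host="$2"           # SSH destination of the new server (assumption)
    yml="${pod%-pod}.yml"   # xyz-pod -> xyz.yml

    echo "podman generate kube --filename $yml $pod"
    echo "scp $yml $new_host:"
    echo "ssh $new_host podman play kube $yml"
}

# Review the output, clean up the generated yml by hand, then run the steps.
migrate_pod_cmds xyz-pod root@new-server
```

Printing instead of executing leaves room for the manual yml cleanup between the generate and play steps.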

First run

Podman’s checkpoint/restore feature currently only supports rootful containers. I run openSUSE MicroOS on all of my servers running any kind of compute load, and MicroOS needs btrfs. I tried to run podman container checkpoint --keep --leave-running --export=/root/test.tar.gz test on a simple nginx test container (because I will not try anything new on production containers), which gave some pretty grim output:

ERRO[0000] read unixpacket @->@: EOF                    
Error: `/usr/bin/runc checkpoint --image-path /var/lib/containers/storage/btrfs-containers/c88b0634a239ad7547c2513b644b7b6d199a137012b0d8bbc10b96a1b712c0de/userdata/checkpoint --work-path /var/lib/containers/storage/btrfs-containers/c88b0634a239ad7547c2513b644b7b6d199a137012b0d8bbc10b96a1b712c0de/userdata --leave-running c88b0634a239ad7547c2513b644b7b6d199a137012b0d8bbc10b96a1b712c0de` failed: exit status 1

I tried to search for the error, but came up with nothing. Podman is fairly new and maybe not so widely adopted yet, so it can be difficult to find common issues on the web. I tried my luck on GitHub under the project, still found nothing much, so I went ahead and opened an issue. I got a response from Adrian Reber pretty quickly and we started to work our way through it. After a debug run, Adrian pointed out that the problem was probably caused by btrfs. I tried to run the same export on a VM running openSUSE Leap on XFS. It worked fine…
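The two conditions that mattered here (a rootful container and the filesystem under the container storage) can be checked up front. The following is a small pre-flight sketch, not something Podman provides: is_root and fs_type are hypothetical helper names, and the storage path is taken from the error message above, where the btrfs-containers directory already hints at the btrfs storage driver.

```shell
#!/bin/sh
# Pre-flight sketch before attempting a checkpoint (hypothetical helpers).

# Succeeds only when running as root.
is_root() {
    [ "$(id -u)" -eq 0 ]
}

# Prints the filesystem type backing the given path (GNU coreutils stat).
fs_type() {
    stat -f -c %T "$1"
}

storage=/var/lib/containers/storage   # rootful storage path seen in the error
[ -d "$storage" ] || storage=/        # fall back so the sketch runs anywhere

if is_root; then
    echo "running as root"
else
    echo "not root: checkpoint/restore needs a rootful container"
fi
echo "filesystem under $storage: $(fs_type "$storage")"
```

On a host like the one in this post the last line would be expected to report btrfs, while the Leap VM would report xfs.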

Take two

After pushing things around for a while, and with switching to XFS not being an option, I had the idea to disable copy-on-write (COW) for the container storage folder, /var/lib/containers/, and tried again. Adrian also provided me with an excellent stateful test container that made a lot more sense to use than nginx, which is stateless, so the next round of testing began:

# chattr +C /var/lib/containers # disable COW for the container storage folder
# podman run -d quay.io/adrianreber/counter #The test container
# curl `podman inspect -l --format "{{.NetworkSettings.IPAddress}}"`:8088 # testing
counter: 0 
# curl `podman inspect -l --format "{{.NetworkSettings.IPAddress}}"`:8088
counter: 1
# curl `podman inspect -l --format "{{.NetworkSettings.IPAddress}}"`:8088
counter: 6
# curl `podman inspect -l --format "{{.NetworkSettings.IPAddress}}"`:8088
counter: 7
# curl `podman inspect -l --format "{{.NetworkSettings.IPAddress}}"`:8088
counter: 8
# curl `podman inspect -l --format "{{.NetworkSettings.IPAddress}}"`:8088
counter: 9
# curl `podman inspect -l --format "{{.NetworkSettings.IPAddress}}"`:8088
counter: 10
# curl `podman inspect -l --format "{{.NetworkSettings.IPAddress}}"`:8088
counter: 11
# podman container checkpoint --keep --export=./test11.tar.gz awesome_goldstine # checkpointing
b9027b522b432e854307ee96358b9fc8e02f24cbd1b2c4a5b53a868be4d7db90
# podman ps
CONTAINER ID  IMAGE   COMMAND  CREATED  STATUS  PORTS   NAMES
# podman ps -a
CONTAINER ID  IMAGE                               COMMAND  CREATED        STATUS                    PORTS   NAMES
b9027b522b43  quay.io/adrianreber/counter:latest           4 minutes ago  Exited (0) 4 seconds ago          awesome_goldstine
# podman rm awesome_goldstine 
b9027b522b432e854307ee96358b9fc8e02f24cbd1b2c4a5b53a868be4d7db90
# l
total 6400
drwxr-xr-x 1 root    root       578 Feb 17 15:51 ./
drwxr-xr-x 1 root    root       114 Feb 17 15:40 ../
drwxr-xr-x 1 root    root         0 May 16  2020 GeoIP/
drwx------ 1 root    root         0 Jun 25  2020 NetworkManager/
drwxr-xr-x 1 root    root       418 Feb 17 15:41 YaST2/
drwxr-xr-x 1 root    root       394 Feb 17 15:34 alternatives/
drwxr-xr-x 1 root    root        10 Feb 17 15:33 apparmor/
drwxr-xr-x 1 root    root        16 Feb 17 15:35 autoinstall/
drwxr-xr-x 1 root    root        70 Feb 17 15:31 ca-certificates/
drwxr-x--- 1 chrony  chrony       0 Jun 10  2020 chrony/
drwx------ 1 root    root        30 Feb 17 15:47 cni/
drwx------ 1 root    root        24 Feb 17 15:46 containers/
drwxr-xr-x 1 root    root        20 Feb 17 15:33 dbus/
drwxr-xr-x 1 root    root        30 Feb 17 15:34 dhcp/
drwxr-xr-x 1 root    root        32 Feb 17 15:34 dhcp6/
drwx------ 1 root    root         8 Feb 17 15:40 ebtables/
drwxr-xr-x 1 root    root         0 Mar  7  2020 empty/
drwxr-xr-x 1 root    root        36 Feb 17 15:28 hardware/
drwxr-xr-x 1 root    root         0 Sep 21  2019 lifecycle/
drwxr-xr-x 1 root    root        38 Feb 17 15:37 misc/
drwxr-xr-x 1 root    root         0 May 17  2020 net-snmp/
drwxr-xr-x 1 root    root        56 Feb 17 15:34 nfs/
drwxr-xr-x 1 nobody  root         0 Jun  9  2020 nobody/
drwxr-xr-x 1 root    root        54 Feb 17 15:40 nscd/
drwxr-xr-x 1 root    root         0 May 16  2020 os-prober/
drwxr-xr-x 1 root    root        26 Feb 17 15:35 plymouth/
drwxr-xr-x 1 root    root         0 May 17  2020 polkit/
drwx------ 1 postfix root        22 Feb 17 15:40 postfix/
lrwxrwxrwx 1 root    root        26 Feb 17 15:35 rpm -> ../../usr/lib/sysimage/rpm/
drwxr-xr-x 1 root    root        68 Feb 17 15:35 samba/
drwxr-xr-x 1 root    root        22 Feb 17 15:33 smartmontools/
drwxr-xr-x 1 root    root         0 Jun  9  2020 sshd/
drwx--x--x 1 root    root        20 Feb 17 15:41 sudo/
drwxr-xr-x 1 root    root       104 Feb 17 15:40 systemd/
-rw------- 1 root    root   2181425 Feb 17 15:49 test.tar.gz
-rw------- 1 root    root   2181134 Feb 17 15:51 test11.tar.gz
-rw------- 1 root    root   2181397 Feb 17 15:50 test2.tar.gz
drwxr-xr-x 1 root    root         0 May 17  2020 usb_modeswitch/
drwxr-xr-x 1 root    root         0 Jun  6  2020 vmware/
drwxr-x--- 1 root    root        80 Feb 17 15:50 wicked/
drwxr-xr-x 1 root    root       136 Feb 17 15:43 zypp/
# podman container restore -i test11.tar.gz # restore on another MicroOS server
b9027b522b432e854307ee96358b9fc8e02f24cbd1b2c4a5b53a868be4d7db90
# curl `podman inspect -l --format "{{.NetworkSettings.IPAddress}}"`:8088 # Final test
counter: 12
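One detail worth verifying after the chattr step: according to the chattr man page, setting +C on a directory only affects files created in it afterwards, so data that already existed keeps its copy-on-write behaviour. A minimal sketch for checking whether the flag is set, assuming lsattr from e2fsprogs is available; has_nocow is a hypothetical helper that inspects one line of lsattr output.

```shell
#!/bin/sh
# Sketch: check whether a directory carries the No_COW (C) attribute.

# Takes one line of `lsattr -d` output, e.g.
#   ---------------C------ /var/lib/containers
# and succeeds if the flag field contains an uppercase C.
has_nocow() {
    flags="${1%% *}"    # keep only the first field (the flag string)
    case "$flags" in
        *C*) return 0 ;;
        *)   return 1 ;;
    esac
}

dir=/var/lib/containers
if command -v lsattr >/dev/null 2>&1 && [ -d "$dir" ]; then
    line="$(lsattr -d "$dir" 2>/dev/null)"
    if has_nocow "$line"; then
        echo "$dir: No_COW is set (applies to newly created files)"
    else
        echo "$dir: No_COW is not set"
    fi
fi
```

Only the uppercase C is matched; a lowercase c in the flag string means compression, not No_COW.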

And there I was with a functional checkpoint and restore feature on my openSUSE servers with btrfs. It is a great feeling to get something running that didn’t work out of the box and to share the solution upstream. But then I was reminded that it only supports rootful use cases, so I went back to using my scripts -_-.
