This article is about containers, their different types (depending on the privileges we grant them), and how they build their isolation: mainly kernel namespaces and capabilities, overlay filesystems, seccomp, and SELinux. I started writing it to understand better how the combination of container isolation mechanisms and privileges increases or decreases the risk posed by kernel flaws.
I obtained much of the information from Michael Kerrisk’s videos on YouTube, and also from many sources cited throughout the document and at the end of it. If I forgot to mention any source, please let me know. Michael Kerrisk offers training; consider it if you want to know more about advanced Linux programming, including the Linux security and isolation APIs. He is a great instructor.
Introduction
A container is a group of processes with some cool kernel features sprinkled on top [Jessica Greben’s definition]. Those kernel features allow the processes to pretend that they’re running on their own separate machine. While the host machine knows that the container is actually a set of processes, the container thinks that it is a separate machine.
The kernel features that allow this are:
- namespaces (NS): Make the container look and feel like it is on a different machine. For example, if you run ps inside the container, you only see the container’s processes. Each kind of namespace isolates a different resource.
- cgroups: Group processes together and limit the resources they can consume.
- capabilities: Superuser privileges that can be enabled or disabled for the user within the container. A user within the container may perform some privileged actions, whether root or not.
- seccomp: Restricts which syscalls are available within the container.
- SELinux: Mandatory access control that helps prevent a process from escaping the container.
Additionally, filesystems take a key role in how containers are constructed.
Containers are usually used as a packaging mechanism that abstracts the code and all of its dependencies so that applications build fast and run reliably [source].
The isolation between what runs within a container and the host is not perfect. There are points of interaction. If vulnerabilities appear in those points of interaction, the isolation can be broken, and something that is supposed to be isolated within a container may reach the host in unexpected ways. This is especially a risk for SaaS and PaaS services when part of their security relies on container isolation.
Rootful vs Rootless containers
A rootful container is a container run by root in the host. Since it is run by root, it has access to all the functionality that root has. This does not mean that any process executed within the container would be run as root. As we have said, the container engines implement some functionality, segregation, and limitations that try to control what a process within the container can do. We will see how containers implement these limitations.
However, in the end, if there is a vulnerability in the functionality that creates this isolation, a user who breaks out of the container will be root on the host, and the compromise will be total.
Docker containers are usually run as root; however, Docker can also run them rootless (https://docs.docker.com/engine/security/rootless/).
To add another layer of security, someone thought it would be nice if containers could be executed as a regular user. This is a challenge because the container architecture was initially designed assuming that the user who runs containers is root. However, with a lot of engineering and some tricks, rootless containers are possible.
A rootless container is a container that can be run without root privileges on the host. Docker launches containers through the Docker daemon, which runs as root. Podman does not use a daemon and does not need root to run containers.
“Rootless containers” does not mean that the user within the container is not root. It can be root, and by default it is, when using either Docker or Podman.
Related to security, the main benefit of rootless containers is that even if the container engine, runtime, or orchestrator is compromised, the attacker won’t gain root privileges on the host.
Rootless containers have limitations. Since they are executed as non-root, they don’t have access to all the features of the operating system. Some limitations are documented here: https://github.com/containers/podman/blob/master/rootless.md. For example, by default you cannot publish a port below 1024.
Container isolation architecture
In this section, I’m going to enumerate how segregation is built for rootless containers, which kernel features are used, and what their effects on containers are.
Namespaces
A namespace is a kernel feature that provides isolation of a resource. It accomplishes this by allowing the creation of different groups of processes and different groups of resources, and then making each group of processes see only one group of resources. By default, only one namespace of each kind exists and all processes belong to it.
It could be seen as a very lightweight virtualization method.
There are 8 namespace kinds:
- Mount (mnt): isolates mount points.
- Process ID (pid): isolates process IDs.
- Network (net): isolates network stack.
- Interprocess Communication (ipc): isolates interprocess communication resources.
- UTS: isolates hostnames and domain names.
- User ID (user): isolates user and group IDs.
- Control groups (cgroup): isolates the cgroup root directory. https://www.youtube.com/watch?v=0kJPa-1FuoI
- Time: isolates time (since kernel 5.6 Mar’20)
When the system boots there is one namespace of each kind, but multiple instances of each kind may exist on a system. Each process resides in exactly one instance of each kind. So, a bash process belongs to a mnt NS, a pid NS, a net NS, an ipc NS, a UTS NS, a user NS, a cgroup NS and a time NS. Each process only sees the resources that belong to its namespaces.
You can create a process in a different namespace. When a process is created via fork(), it resides in the same namespaces as its parent. With clone() and the CLONE_NEW* flags, you can create a process in new namespaces. An existing process can also move itself with unshare() or setns().
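As a rough illustration of the flags involved, the sketch below maps each namespace kind to its CLONE_NEW* flag and combines them into the bit mask that clone() or unshare() would receive. The values are copied from <linux/sched.h>; the names CLONE_FLAGS and clone_mask are my own and only serve this example.

```python
# Flag values from <linux/sched.h>, one per namespace kind.
CLONE_FLAGS = {
    "time": 0x00000080,    # CLONE_NEWTIME
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def clone_mask(*kinds):
    """OR together the flags for the requested namespace kinds."""
    mask = 0
    for kind in kinds:
        mask |= CLONE_FLAGS[kind]  # KeyError for an unknown kind
    return mask
```

For instance, clone_mask("user", "uts") yields the mask that asks for a new user namespace and a new UTS namespace in a single call.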
$ hostname
localhost.localdomain
$ PS1='uns2# ' unshare -Ur -u bash
uns2# hostname
localhost.localdomain
uns2# hostname winterfell
uns2# hostname
winterfell
$ hostname
localhost.localdomain
In the example above, I have created a user namespace (-U), mapped the root user (-r) and created a UTS namespace (-u). I executed bash there. This uns2 bash shell is running in a different user namespace and a different UTS namespace than the host, but it shares all the other namespaces with the host. Only new user and UTS namespaces have been created; the rest are still shared.
It is possible to know which namespaces a process is in by reading these symbolic links with readlink:
/proc/<PID>/ns/mnt
/proc/<PID>/ns/pid
/proc/<PID>/ns/net
/proc/<PID>/ns/ipc
/proc/<PID>/ns/uts
/proc/<PID>/ns/user
/proc/<PID>/ns/cgroup
/proc/<PID>/ns/time
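The same experiment can be sketched programmatically. This is a minimal sketch, assuming Linux with unprivileged user namespaces enabled; the function name hostname_in_new_uts is hypothetical. Unlike `unshare -Ur`, it does not map the root user: since the child never calls execve(), it keeps the capabilities it gained when the user namespace was created. If the kernel or a sandbox policy refuses the call, the function returns None.

```python
import ctypes
import os
import socket

# Flag values from <linux/sched.h>.
CLONE_NEWUTS = 0x04000000
CLONE_NEWUSER = 0x10000000


def hostname_in_new_uts(new_name):
    """Fork a child, move it into new user and UTS namespaces, rename it there.

    Returns the hostname the child saw after renaming, or None when the
    namespaces could not be created.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        os.close(read_fd)
        status = 1
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            # Roughly `unshare -U -u`: the new user namespace grants the
            # capabilities needed to own and modify the new UTS namespace.
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0:
                socket.sethostname(new_name)
                os.write(write_fd, socket.gethostname().encode())
                status = 0
        except Exception:
            pass  # unsupported platform or denied by policy
        finally:
            os._exit(status)
    # Parent: read whatever the child reported, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        seen = pipe.read().decode()
    os.waitpid(pid, 0)
    return seen or None
```

Calling hostname_in_new_uts("winterfell") and then socket.gethostname() in the parent shows that the rename stays inside the child’s UTS namespace: the parent still reports the original hostname.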
For example:
$ readlink /proc/$$/ns/time
time:[4026531834]
$$ expands to the PID of the current process; in my case, the shell.
If two processes have the same number in the brackets, they are in the same namespace.
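The comparison can be automated. The sketch below, whose function names are my own, parses a link target such as time:[4026531834] and compares the namespace of two processes for a given kind. It assumes a Linux /proc filesystem and enough permissions to read the links of both processes.

```python
import os
import re

# A namespace link target looks like "time:[4026531834]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a link target into (kind, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match["kind"], int(match["inode"])


def namespace_id(pid, kind):
    """Return the inode number identifying the namespace of `pid`."""
    target = os.readlink(f"/proc/{pid}/ns/{kind}")
    return parse_ns_link(target)[1]


def same_namespace(pid_a, pid_b, kind):
    """True when both processes are in the same namespace of this kind."""
    return namespace_id(pid_a, kind) == namespace_id(pid_b, kind)
```

For example, same_namespace("self", 1, "uts") tells whether the current process shares its UTS namespace with PID 1, provided the caller is allowed to read /proc/1/ns/uts.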
In general, unprivileged users can create user namespaces since kernel 3.8, although some distributions enabled this later. Creating any other kind of namespace requires CAP_SYS_ADMIN.
As you may have noticed, in the hostname example above, as an unprivileged user, I was able to create a user namespace where I’m root. What’s going on? When you create a user namespace, by default, you get all capabilities within it, including CAP_SYS_ADMIN. That’s why I was able to create the UTS namespace within the new user namespace too. But what you can do with those capabilities is constrained. We say that the capabilities are namespaced: you can only perform actions that affect that namespace, not other namespaces. So, yes, you are root, but in a limited way. More on this later.
Below I explain what I have researched about the Mount, Process ID, and User ID namespaces. For information about the Network, Interprocess Communication, UTS, Control Groups and Time namespaces, please refer to Wikipedia and the references at the end of the article. I have not researched them and I don’t have anything to add to the information there. I think the Control Groups namespace may be relevant when analyzing flaws that may trigger a DoS.
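One way to observe this is the CapEff field of /proc/<PID>/status, a hexadecimal bit mask of the effective capabilities. The sketch below decodes it; the helper names are hypothetical, and the bit index 21 for CAP_SYS_ADMIN comes from <linux/capability.h>.

```python
CAP_SYS_ADMIN = 21  # bit index from <linux/capability.h>


def parse_cap_mask(status_text, field="CapEff"):
    """Extract a capability bit mask from the text of /proc/<PID>/status."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            # The value is a zero-padded hexadecimal number.
            return int(line.split(":", 1)[1].strip(), 16)
    raise KeyError(field)


def has_capability(mask, cap):
    """True when bit number `cap` is set in the mask."""
    return bool((mask >> cap) & 1)
```

Feeding it the contents of /proc/self/status from a regular shell and from a shell started with `unshare -Ur` should show the difference: the effective set is normally empty for an unprivileged user and full inside the new user namespace.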
Mount (mnt) namespace
The mnt NS isolates the list of mount points. Each process can only see the mount points in the same namespace. Therefore, each process in a different namespace sees a different filesystem (different /proc/
mount(2) and umount(2) will only affect processes in the same mnt NS.
If we try to verify this, we will see that a new mnt namespace already has mount points, and if you run ls, you can see files that were on the host. Why does this happen if the mount points are supposed to be isolated?
$ unshare -Ur -m bash
# cat /proc/$$/mounts
/dev/mapper/fedora_localhost--live-root / ext4 rw,seclabel,relatime 0 0
sysfs /sys sysfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
cgroup2 /sys/fs/cgroup cgroup2 rw,seclabel,nosuid,nodev,noexec,relatime,nsdelegate 0 0
pstore /sys/fs/pstore pstore rw,seclabel,nosuid,nodev,noexec,relatime 0 0
efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0
none /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0
By default, when a namespace is created, the mount table of the parent process is recreated in the child process. After this point, if you mount or unmount anything in the child namespace, it does not affect the parent namespace.
If the new mount namespace is created in a user namespace different from the parent’s user namespace, the new mount namespace is considered less privileged. This has some implications. One of them is that the mount flags MS_RDONLY, MS_NOSUID, MS_NOEXEC and the “atime” flags of inherited mounts cannot be changed in a less privileged mount namespace.
We can see the propagation types of the mount points with this command:
$ cat /proc/$$/mountinfo
23 65 0:22 / /sys rw,nosuid,nodev,noexec,relatime shared:2 - sysfs sysfs rw,seclabel
24 65 0:5 / /proc rw,nosuid,nodev,noexec,relatime shared:14 - proc proc rw
25 65 0:6 / /dev rw,nosuid shared:10 - devtmpfs devtmpfs rw,seclabel,size=8056916k,nr_inodes=2014229,mode=755
26 23 0:7 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:3 - securityfs securityfs rw
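To read this output programmatically, the sketch below splits one mountinfo line into its fields, following the layout described in proc(5). The function name and the dictionary keys are my own choices. The propagation type lives in the optional fields before the " - " separator; a mount with no optional field is private.

```python
def parse_mountinfo_line(line):
    """Parse one /proc/<PID>/mountinfo line into a dict (layout per proc(5))."""
    # Everything before the " - " separator describes the mount itself,
    # everything after it describes the filesystem.
    mount_part, fs_part = line.strip().split(" - ", 1)
    fields = mount_part.split()
    fs_fields = fs_part.split()
    return {
        "mount_id": int(fields[0]),
        "parent_id": int(fields[1]),
        "mount_point": fields[4],
        "options": fields[5].split(","),
        # Optional fields carry propagation tags such as shared:N or
        # master:N; a mount with none of them is private.
        "propagation": fields[6:] or ["private"],
        "fstype": fs_fields[0],
        "source": fs_fields[1],
    }
```

Applied to the first line of the output above, it reports /sys as the mount point and shared:2 as its propagation tag, meaning the mount belongs to shared peer group 2.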
Writes made from within a mount namespace go to the host filesystem. The mnt NS isolates the list of mount points, but if a mount is visible and the user has sufficient permissions, they can write to the underlying filesystem.
$ unshare -Ur -m bash
# pwd
/home/jdoe
# touch test.txt
In another console…
$ pwd
/home/jdoe
$ ls -l test.txt
-rw-r--r--. 1 jdoe jdoe 0 nov 3 10:19 test.txt
So, the mount namespace doesn’t provide filesystem isolation, only a separate list of mount points.
In order to obtain filesystem isolation, containers use overlay filesystems. Rootless Podman containers typically use fuse-overlayfs, while rootful Docker containers use OverlayFS, which is implemented in the kernel. The use of FUSE has some security implications.
Overlay filesystems are filesystems formed by different layers. When a user tries to read a file, the system tries to find it in the upper layer. If it doesn’t find it, it looks into the lower layer. If there is a file in the upper layer with the same name as a file in lower layers, the former is used. So, files in the upper layer hide files in th