What is a container?

Objective: understand what a container is

Takahiro Oda
Dec 24, 2021

All of the content comes from here.

Processes

Containers are just normal Linux processes. Let’s start a Redis container.

docker run -d --name=taka redis:alpine

The Docker container launches a process called redis-server.

We can check the process with this command.

ps aux | grep redis-server

Use the docker top command to check its PID.

docker top <name>

Check which process the parent (PPID) is with this command.

ps aux | grep <ppid>

The command pstree will list all of the sub-processes. As you can see, they are just standard processes.

pstree -c -p -A $(pgrep dockerd)

Process Directory

The configuration for each process is defined within the /proc directory. The commands below store the Redis PID for later use and list the contents of /proc.

DBPID=$(pgrep redis-server)
echo Redis is $DBPID
ls /proc

Each process has its own configuration and security settings defined within different files.

ls /proc/$DBPID

You can see the environment variable settings.

cat /proc/$DBPID/environ
docker exec -it taka env

Namespaces

Namespaces limit what a process can see and access on the system, such as other network interfaces or processes.

When a container is started, the container runtime, such as Docker, will create new namespaces to sandbox the process. By running a process in its own PID namespace, it looks like the only process on the system.
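
You can see this sandboxing without Docker. The following is a minimal sketch using the unshare utility from util-linux (requires root): it runs ps aux inside a new PID namespace with its own /proc, so ps only sees itself.

sudo unshare --fork --pid --mount-proc ps aux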

The available namespaces are:

Mount (mnt)

Process ID (pid)

Network (net)

Interprocess Communication (ipc)

UTS (hostnames)

User ID (user)

Control group (cgroup)
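
As a quick way to inspect these on a host, the lsns utility (also part of util-linux) lists the namespaces currently in use; a minimal sketch:

lsns          # list all namespaces on the host
lsns -t net   # list only network namespaces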

What happens when we share a namespace?

Namespaces are inode locations on disk.

You can see all the namespaces with this command

ls -lha /proc/<PID>/ns/

nsenter is used to attach a process to existing namespaces, which is useful for debugging.

nsenter --target <PID> --mount --uts --ipc --net --pid ps aux

Namespaces can be shared using the syntax container:<container-name>.

If you want to connect Nginx to the existing network namespace of the taka container, run:

docker run -d --name=medium-test --net=container:taka nginx:alpine

Check the result with this command.

docker ps -a 

Check the Nginx PID.

pgrep nginx | tail -n1

List the Nginx namespaces now that the network namespace has been shared.

ls -lha /proc/<PID>/ns

Both net namespaces point to the same inode, showing that the network namespace is shared.
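
A minimal sketch of that comparison, using readlink and the Redis and Nginx PIDs found above (the placeholders are yours to fill in):

readlink /proc/<redis PID>/ns/net /proc/<nginx PID>/ns/net
# both lines resolve to the same net:[...] inode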

Chroot

An important part of a container process is the ability to have different files that are independent of the host. This is how we can have different Docker Images based on different operating systems running on our system.

Chroot provides the ability for a process to start with a different root directory to the parent OS. This allows different files to appear in the root.
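
A minimal sketch of the idea, reusing the taka container from earlier (the rootfs directory name is just an example, and chroot requires root):

mkdir rootfs
docker export taka | tar -x -C rootfs    # unpack the container filesystem
sudo chroot rootfs /bin/sh -c "ls /"     # the process now sees the Alpine files as its root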

Cgroups (Control Groups)

Cgroups limit the amount of resources a process can consume. The cgroups a process belongs to are listed in a file within the /proc directory, while the limit values themselves live under /sys/fs/cgroup.

You can check cgroups with this command

 cat /proc/<PID>/cgroup

To check other cgroup directories on disk

ls /sys/fs/cgroup/

What are the CPU stats for a process?

You can check the CPU stats and usage

cat /sys/fs/cgroup/cpu,cpuacct/docker/<container ID>/cpuacct.stat

You can check your container ID with

docker ps -a

The CPU shares limit is also defined here

cat /sys/fs/cgroup/cpu,cpuacct/docker/<container ID>/cpu.shares
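
Docker sets this value from the --cpu-shares flag. A sketch, assuming the same cgroup v1 layout as above (the cpu-test container name and the value 512 are just examples):

docker run -d --name=cpu-test --cpu-shares 512 redis:alpine
cat /sys/fs/cgroup/cpu,cpuacct/docker/$(docker inspect --format '{{.Id}}' cpu-test)/cpu.shares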

All the Docker cgroups for container memory configuration are stored here.

ls /sys/fs/cgroup/memory/docker/

Each directory is named after the container ID assigned by Docker.

ls /sys/fs/cgroup/memory/docker/<container ID>

How to configure cgroups?

One of the features of Docker is the ability to control memory limits. This is done via a cgroup setting. By default, containers have no memory limit.

docker stats <name> --no-stream

The memory quota is stored in a file called memory.limit_in_bytes.

By writing to the file, we can change the memory limit of a process.

echo 800000 > /sys/fs/cgroup/memory/docker/<container ID>/memory.limit_in_bytes

You can see the limit has changed by running docker stats again.
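
Writing to the file directly is what Docker does under the hood; the same limit can be set through Docker itself. A sketch, where the limited container name and the values are just examples:

docker run -d --name=limited -m 128m redis:alpine          # set a limit at start
docker update --memory 256m --memory-swap 256m limited     # or change it on a running container
docker stats limited --no-stream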

Seccomp / AppArmor

All actions in Linux are done via syscalls. Applications use a combination of these system calls to perform the required operations.
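
One way to watch the syscalls a program makes is strace, assuming it is installed on the host. A minimal sketch that prints a summary of the calls made by ls:

strace -c ls > /dev/null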

AppArmor is an application-defined profile that describes which parts of the system a process can access.

To check the current AppArmor profile:

cat /proc/<PID>/attr/current
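
Docker can also be told which AppArmor profile to apply, or none at all, when starting a container (this only has an effect on hosts with AppArmor enabled). A sketch, where apparmor-test is just an example name:

docker run -d --name=apparmor-test --security-opt apparmor=unconfined redis:alpine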

Seccomp provides the ability to limit which system calls can be made, blocking aspects such as installing Kernel Modules or changing the file permissions.

When assigned to a process, the process is limited to a subset of the available system calls. If it attempts to call a blocked system call, it will receive the error “Operation Not Allowed”.

The status of Seccomp is also defined within a file.

cat /proc/<PID>/status |grep Seccomp

The flag meanings are: 0 = disabled, 1 = strict, 2 = filtering.
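
To see the flag change, you can start a container without a seccomp profile. A sketch, where no-seccomp is just an example name:

docker run -d --name=no-seccomp --security-opt seccomp=unconfined redis:alpine
PID=$(docker inspect --format '{{.State.Pid}}' no-seccomp)
cat /proc/$PID/status | grep Seccomp    # now reports 0 instead of 2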

Capabilities

Capabilities are groupings of permissions that describe what a process or user is allowed to do. These capabilities might cover multiple system calls or actions, such as changing the system time or hostname.

A process should drop as many capabilities as possible to keep it secure.

cat /proc/<PID>/status |grep ^Cap
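
The CapEff hex mask printed above can be decoded with capsh (from the libcap package, which may need installing), and Docker can add or drop capabilities when a container starts. A sketch, where cap-test and the dropped capability are just examples:

capsh --decode=<CapEff value>
docker run -d --name=cap-test --cap-drop NET_RAW redis:alpine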
