If you are a backend or frontend engineer, chances are you've heard of Docker. For DevOps and infrastructure engineers, Docker needs no introduction. Docker has revolutionized how software is packaged and deployed. However, as an engineer, you should always strive to understand the underlying mechanisms that make these tools work, rather than just memorizing commands.
As someone who enjoys diving deep into topics, I started exploring Docker a while ago. I wanted to understand what makes a container a container. It turns out that containers are powered by a small set of Linux kernel features. Isn't it fascinating that the software that drives cloud systems is built on top of the Linux kernel?
Core Features of Containerization
Containerization is built on three key features provided by the Linux kernel:
- Namespaces
- Cgroups
- Chroot
Let's break down these concepts in simpler terms.
Namespaces
Namespaces let you control which resources a process shares with or inherits from the host and which are isolated from it. Essentially, a process's view of the system (hostname, process IDs, mount points, and so on) is determined by the namespaces it belongs to.
There are eight types of namespaces, but for our purposes, we need to understand three of them. Here's a quick overview of the different namespaces:
- CLONE_NEWCGROUP : Isolates the cgroup root directory
- CLONE_NEWIPC : Isolates inter-process communication resources (message queues, semaphores, shared memory)
- CLONE_NEWNET : Isolates network devices, stacks, and ports
- CLONE_NEWNS : Isolates mount points
- CLONE_NEWPID : Isolates process IDs
- CLONE_NEWTIME : Isolates the system clocks (boot and monotonic)
- CLONE_NEWUSER : Isolates user and group IDs
- CLONE_NEWUTS : Isolates the hostname and NIS domain name
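If you want to see namespaces on your own machine, every process's namespace membership is exposed under /proc. For example, this lists the namespaces your current shell belongs to:
ls -l /proc/$$/ns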
Chroot
Chroot, short for "Change Root", allows you to change your root directory to a custom location. This creates an isolated environment within the filesystem, which is essential for containers.
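As a rough illustration, assuming /path/to/rootfs is a directory containing an extracted root filesystem (we'll build one later in this post), you could drop into it with:
sudo chroot /path/to/rootfs /bin/bash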
Cgroups
Cgroups, or control groups, let the kernel restrict a program's access to system resources such as memory, CPU time, and the number of processes. This keeps containers from exceeding their allocated resources and maintains system stability.
With an understanding of these three features, you have the foundational knowledge to implement a container runtime in your favorite language. For this tutorial, I will use C++. Why not Rust or Go? Because the code we'll be writing involves syscalls and low-level system interactions, where the real fun lies in handling memory segmentation faults and debugging BSODs (cough cough, CrowdStrike).
So, fasten your seat belts, because this is going to be a fun and exciting blog where we'll understand and implement our own container runtime.
Before we begin, let's set up a few things. Make sure you have a C++ compiler installed with C++17 support.
Understanding Docker Run Command
Let's take a look at a simple Docker run command and break down what happens behind the scenes.
Run the following command:
sudo docker run ubuntu echo "Hello Dhananjay"
Here's what happens step-by-step:
- Docker searches for the image 'ubuntu': If Docker cannot find the image locally, it fetches the image from a remote repository
- Docker loads the image: Once the image is fetched, Docker loads it into the system
- Docker runs the command: Docker runs the command echo "Hello Dhananjay" within the container.
- Container exits: After executing the command, the container exits
Let's implement similar behavior in our own container runtime.
Writing a C++ Program for Containerization
Let's start writing our C++ program. The first step is to extract the arguments and print them on the screen.
#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <string>
#include <algorithm>

int main(int argc, char** argv) {
    // Collect everything after the program name into a vector of strings
    std::vector<std::string> args(argv + 1, argv + argc);

    std::cout << "Got Arguments: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    return EXIT_SUCCESS;
}
Compile and run the program with:
g++ main.cpp -std=c++17 -o knocker && ./knocker arg1 arg2
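If everything compiles, you should see output along these lines:
Got Arguments: arg1 arg2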
Now, let's modify the program to execute whatever is passed as an argument. We will use execvp, which is part of the exec family of functions. execvp replaces the current process with a new one specified by its arguments.
#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <string>
#include <algorithm>
#include <unistd.h>

int main(int argc, char** argv) {
    std::vector<std::string> args(argv + 1, argv + argc);

    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;

    // execvp expects a null-terminated array of char*, so convert the C++ strings
    std::vector<char*> run_arg;
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);

    // Replace the current process image with the requested command
    execvp(run_arg[0], run_arg.data());
    return EXIT_SUCCESS;
}
This code may look a bit complex due to the use of pointers, but here's what it does:
- It stores the arguments in a vector of strings
- It converts the vector of strings to a vector of C-style strings (char*)
- It terminates the vector with a nullptr
- It calls execvp to execute the command
Now, let's start virtualizing the process using namespaces. We will use a syscall called clone, which creates a new process and takes several parameters (its glibc signature is sketched after this list):
- Function to run in the cloned process
- Pointer to stack memory allocation for the cloned process
- Flags/namespaces
- Argument pointer to pass to the function
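For reference, the glibc wrapper for clone looks roughly like this (see man 2 clone; the optional trailing arguments are omitted here):
// Simplified prototype of the glibc clone() wrapper; declared in <sched.h>
int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...);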
First, let's define the stack memory:
#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <string>
#include <algorithm>
#include <unistd.h>

int main(int argc, char** argv) {
    std::vector<std::string> args(argv + 1, argv + argc);

    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;

    std::vector<char*> run_arg;
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);

    // Allocate 1 MB for the cloned process's stack; clone() needs the top address
    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024);
    auto* stackTop = stack + sizeof(char) * 1024 * 1024;

    execvp(run_arg[0], run_arg.data());
    return EXIT_SUCCESS;
}
We allocate 1 MB of memory for the stack. Since malloc returns a pointer to the start of the allocated block and the stack grows downwards, we pass clone the address of the top of the block.
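Purely as a stylistic alternative, the same allocation can be written with a named constant so the intent is explicit:
constexpr size_t STACK_SIZE = 1024 * 1024;            // 1 MiB for the child's stack
auto* stack = static_cast<char*>(malloc(STACK_SIZE));
auto* stackTop = stack + STACK_SIZE;                  // clone() expects the highest usable address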
Next, we need to declare a function to run in the cloned process:
#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <string>
#include <algorithm>
#include <unistd.h>

// This function will run inside the cloned process; for now it is just a stub
int runtime(void* args) {
    return 1;
}

int main(int argc, char** argv) {
    std::vector<std::string> args(argv + 1, argv + argc);

    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;

    std::vector<char*> run_arg;
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);

    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024);
    auto* stackTop = stack + sizeof(char) * 1024 * 1024;

    execvp(run_arg[0], run_arg.data());
    return EXIT_SUCCESS;
}
Now, let's create process isolation. Open a terminal and run hostname to see your system's current hostname.
hostname
Set a new hostname in a different terminal session
sudo hostname new-hostname
To isolate the hostname between processes, we will use the clone syscall with the CLONE_NEWUTS flag. Modify the code as follows:
#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <string>
#include <algorithm>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>

// Entry point of the cloned process: cast the argument back and exec the command
int runtime(void* args) {
    std::vector<char*>* arg = (std::vector<char*>*)args;
    execvp((*arg)[0], (*arg).data());
    return 1; // only reached if execvp fails
}

int main(int argc, char** argv) {
    std::vector<std::string> args(argv + 1, argv + argc);

    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;

    std::vector<char*> run_arg;
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);

    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024);
    auto* stackTop = stack + sizeof(char) * 1024 * 1024;

    // Clone into a new UTS namespace; SIGCHLD makes the kernel notify the
    // parent when the child terminates
    void* arg = static_cast<void*>(&run_arg);
    pid_t child_process_id = clone(runtime, stackTop, SIGCHLD | CLONE_NEWUTS, arg);
    waitpid(child_process_id, nullptr, 0);
    free(stack);
    return EXIT_SUCCESS;
}
In this code:
- We cast the argument vector to a void* pointer and pass it to clone, along with flags that create a new UTS namespace and tell the kernel to send SIGCHLD to the parent when the child terminates.
- Inside runtime, we cast the argument back to its original type and exec the command. Back in main, we wait for the cloned process to finish with waitpid and free the stack once it exits (a variant with basic error checks is sketched below).
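The minimal version above skips error handling. clone returns -1 on failure, typically when the program isn't run with enough privileges, so a slightly more defensive sketch of the same section of main (assuming <cstdio> is included for std::perror) would be:
pid_t child_process_id = clone(runtime, stackTop, SIGCHLD | CLONE_NEWUTS, arg);
if (child_process_id == -1) {
    std::perror("clone");   // e.g. EPERM when run without sudo/CAP_SYS_ADMIN
    free(stack);
    return EXIT_FAILURE;
}
if (waitpid(child_process_id, nullptr, 0) == -1) {
    std::perror("waitpid");
}
free(stack);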
Now, let's try setting the hostname as we did earlier, but this time inside a shell spawned by our runtime, passing /bin/bash as the program argument. This should confine hostname changes to the new process:
sudo ./knocker /bin/bash
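Inside the spawned shell, set any hostname you like and check it; then, in a separate terminal on the host, check the hostname again:
hostname knocker-test   # inside the spawned /bin/bash (the name is arbitrary)
hostname                # in another terminal on the host: still the original name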
Changing the hostname inside the spawned process didn't change the host system's hostname. To isolate the process further, let's look at what the CLONE_NEWPID flag gives us.
Isolating Processes in Containers
When we run ps inside the spawned bash shell, it still shows all processes running on the system, indicating the spawned process isn't isolated yet. To isolate it, add the CLONE_NEWPID flag to the clone system call.
#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <string>
#include <algorithm>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>

int runtime(void* args) {
    std::vector<char*>* arg = (std::vector<char*>*)args;
    execvp((*arg)[0], (*arg).data());
    return 1;
}

int main(int argc, char** argv) {
    std::vector<std::string> args(argv + 1, argv + argc);

    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;

    std::vector<char*> run_arg;
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);

    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024);
    auto* stackTop = stack + sizeof(char) * 1024 * 1024;

    void* arg = static_cast<void*>(&run_arg);
    // CLONE_NEWPID gives the child its own PID namespace
    pid_t child_process_id = clone(runtime, stackTop, SIGCHLD | CLONE_NEWUTS | CLONE_NEWPID, arg);
    waitpid(child_process_id, nullptr, 0);
    free(stack);
    return EXIT_SUCCESS;
}
However, compiling and running the code with the CLONE_NEWPID flag doesn't seem to change anything. Why? In Linux, almost everything is treated as a file, including process information, which the kernel exposes in a file-like structure under /proc. When you call ps, it simply reads the contents of /proc, where the kernel publishes information about all running processes.
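You can see this on the host: each running process shows up as a numbered directory under /proc.
ls /proc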
Creating a Custom Filesystem Structure
To isolate the process environment completely, including the /proc filesystem, we need to instruct runtime to use a custom filesystem. This involves changing the root of the filesystem.
Let's examine how Docker does this and what's inside a container's filesystem:
docker run ubuntu echo "Hello World"
docker container ps --all
sudo docker export {container_id} > ubuntu_fs.tar
mkdir dock_ubuntu_tar
tar -xvf ubuntu_fs.tar -C dock_ubuntu_tar
Linux users will recognize this folder structure immediately: it's a root filesystem. So, to instruct the OS to use our own proc, we need to change the root filesystem.
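Listing the extracted directory shows the familiar layout (exact contents vary by image):
ls dock_ubuntu_tar
bin  boot  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var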
Chroot → Changing Root
First, we need to download the Ubuntu base image to use in our isolated system. Run the following commands to download and extract the root filesystem:
wget https://cdimage.ubuntu.com/ubuntu-base/releases/24.04/release/ubuntu-base-24.04-base-amd64.tar.gz
mkdir ubuntu_fs
tar -xf ubuntu-base-24.04-base-amd64.tar.gz -C ubuntu_fs/
Now let's update our code to use this new filesystem:
#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <string>
#include <algorithm>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/mount.h>

int runtime(void* args) {
    std::vector<char*>* arg = (std::vector<char*>*)args;

    // Give the container its own hostname
    std::string hostName = "Knocker-Host";
    sethostname(hostName.c_str(), hostName.length());

    // Switch to the extracted Ubuntu rootfs (adjust the path to your machine)
    // and mount a fresh proc filesystem inside it
    chroot("/home/dhananjay/ubuntu_fs");
    chdir("/");
    mount("proc", "proc", "proc", 0, "");

    execvp((*arg)[0], (*arg).data());
    return 1;
}

int main(int argc, char** argv) {
    std::vector<std::string> args(argv + 1, argv + argc);

    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;

    std::vector<char*> run_arg;
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);

    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024);
    auto* stackTop = stack + sizeof(char) * 1024 * 1024;

    void* arg = static_cast<void*>(&run_arg);
    pid_t child_process_id = clone(runtime, stackTop, SIGCHLD | CLONE_NEWUTS | CLONE_NEWPID, arg);
    waitpid(child_process_id, nullptr, 0);
    free(stack);
    return EXIT_SUCCESS;
}
In this code, we set the hostname, change the root to the new filesystem, set the current working directory to /, and mount a fresh proc filesystem at /proc inside the new root (filesystem type proc, no special flags).
Running ps in the child process now shows bash running as PID 1, indicating that the child has been successfully isolated from the parent.
However, if you look at the mount points on the host, you can see that this proc instance shows up there as well, because the mount propagates to the parent namespace. To prevent this, we give the child its own mount namespace, mark its mounts as private so they don't propagate to the host, and unmount proc once the contained process finishes, releasing the resources.
#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <string>
#include <algorithm>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/mount.h>

int runtime(void* args) {
    std::vector<char*>* arg = (std::vector<char*>*)args;

    // Detach our mount namespace from the host and stop mount events from
    // propagating back to it
    unshare(CLONE_NEWNS);
    mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);

    std::string hostName = "Knocker-Host";
    sethostname(hostName.c_str(), hostName.length());

    chroot("/home/dhananjay/ubuntu_fs");
    chdir("/");
    mount("proc", "proc", "proc", 0, "");

    // Run the command in a child so that this process can clean up afterwards
    pid_t pid = fork();
    if (pid == 0) {
        execvp((*arg)[0], (*arg).data());
        return 1; // only reached if execvp fails
    } else {
        waitpid(pid, nullptr, 0);
        std::cout << "Cleanup Running" << std::endl;
        umount("proc");
    }
    return 0;
}

int main(int argc, char** argv) {
    std::vector<std::string> args(argv + 1, argv + argc);

    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;

    std::vector<char*> run_arg;
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);

    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024);
    auto* stackTop = stack + sizeof(char) * 1024 * 1024;

    void* arg = static_cast<void*>(&run_arg);
    pid_t child_process_id = clone(runtime, stackTop, SIGCHLD | CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS, arg);
    waitpid(child_process_id, nullptr, 0);
    free(stack);
    return EXIT_SUCCESS;
}
In this code snippet, we:
- Clone the process with new namespaces.
- Unshare the namespace from the host.
- Mount the directory as private.
- Set the hostname, change root, mount proc, and fork a child to run the command while the parent stays around for cleanup.
This way, the runtime removes the proc mount it created at the end of its lifetime, and we don't see the mount point cluttering the host system.
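To confirm the cleanup, after the contained shell exits you can check the host's mount table; the proc mount under ubuntu_fs should no longer appear:
mount | grep ubuntu_fs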
Restricting Resources with Cgroups
To add resource restrictions to our minimal container runtime, we will use the cgroup filesystem, mounted at /sys/fs/cgroup, to set limits on memory and the number of processes. Here's how to integrate cgroups into the code:
Helper Function for Applying Limits Automatically
The rule_set function will create a cgroup, set the process ID of the child to the cgroup, and configure resource limits.
#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <string>
#include <algorithm>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <filesystem>
#include <fstream>

// Put the child into dedicated cgroups and apply limits.
// Note: these paths assume cgroup v1 controllers mounted under /sys/fs/cgroup.
void rule_set(pid_t child_pid) {
    std::filesystem::path pids_path{"/sys/fs/cgroup/pids/knocker"};
    std::filesystem::path memory_path{"/sys/fs/cgroup/memory/knocker"};
    std::filesystem::create_directories(pids_path);
    std::filesystem::create_directories(memory_path);

    // Add the child to the pids cgroup and cap it at 3 processes
    std::ofstream ofs(pids_path / "cgroup.procs");
    ofs << std::to_string(child_pid);
    ofs.close();
    ofs.open(pids_path / "pids.max");
    ofs << "3";
    ofs.close();

    // Add the child to the memory cgroup and cap it at 200 MB
    ofs.open(memory_path / "cgroup.procs");
    ofs << std::to_string(child_pid);
    ofs.close();
    ofs.open(memory_path / "memory.limit_in_bytes");
    ofs << "209715200";
    ofs.close();
}

int runtime(void* args) {
    std::vector<char*>* arg = (std::vector<char*>*)args;

    unshare(CLONE_NEWNS);
    mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);

    std::string hostName = "Knocker-Host";
    sethostname(hostName.c_str(), hostName.length());

    chroot("/home/dhananjay/ubuntu_fs");
    chdir("/");
    mount("proc", "proc", "proc", 0, "");

    pid_t pid = fork();
    if (pid == 0) {
        execvp((*arg)[0], (*arg).data());
        return 1; // only reached if execvp fails
    } else {
        waitpid(pid, nullptr, 0);
        std::cout << "Cleanup Running" << std::endl;
        umount("proc");
    }
    return 0;
}

int main(int argc, char** argv) {
    std::vector<std::string> args(argv + 1, argv + argc);

    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;

    std::vector<char*> run_arg;
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);

    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024);
    auto* stackTop = stack + sizeof(char) * 1024 * 1024;

    void* arg = static_cast<void*>(&run_arg);
    pid_t child_process_id = clone(runtime, stackTop, SIGCHLD | CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS, arg);
    rule_set(child_process_id);
    waitpid(child_process_id, nullptr, 0);
    free(stack);
    return EXIT_SUCCESS;
}
In rule_set, we:
- Create two new cgroup directories: /sys/fs/cgroup/pids/knocker and /sys/fs/cgroup/memory/knocker.
- Assign the child process to each cgroup by writing its PID to cgroup.procs.
- Set the limits by writing 3 to pids.max, which caps the number of processes, and 209715200 (200 MB) to memory.limit_in_bytes, which caps memory usage.
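A caveat: these paths assume the cgroup v1 controllers are mounted under /sys/fs/cgroup; on distributions that have moved to cgroup v2 only, the hierarchy and file names differ (memory.max instead of memory.limit_in_bytes, for example). On a v1 system you can verify the limits while the container is running:
cat /sys/fs/cgroup/pids/knocker/pids.max
cat /sys/fs/cgroup/memory/knocker/memory.limit_in_bytes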
And that's it! We have successfully implemented a minimal container runtime in C++ that isolates processes and hostnames and restricts resources using cgroups. By understanding the underlying mechanisms of containerization, you can build your own container runtime from scratch. This is a great way to learn about the Linux kernel, system calls, and low-level programming. I hope you enjoyed this blog and learned something new.
The blog might have a lot of content to digest. Feel free to post your questions in the comment section 🙂