Filesystem Namespaces

Traditional Unix systems have a rather static view of the filesystem. The root directory is always there and regular users never really modify anything outside their home directory and other permitted places. The root user is able to modify this base view of the filesystem through chrooting and mounting or just modifying the files, but he can’t make any big changes or the whole system breaks.

Modern Unix-like systems offer an interesting feature called filesystem namespaces. In short, a namespace is a set of mountpoints. Each process belongs to a namespace that governs what it sees, and a process usually shares the same namespace with its parent. For instance, if the system administrator mounts a directory, other users would see the mountpoint as well. A new namespace is created by cloning an existing one. A process in the new namespace can then change what is mounted at /usr as it pleases without affecting the original namespace. Linux has supported this for a while, although sadly it requires root permissions. The Plan 9 operating system introduced this feature a long time ago and built it into its design, allowing regular users to fully control their namespace.

You can do a few things with namespaces. For instance, you could bind a directory or file at another location, thus giving it a new name, much like a hardlink. This may hide whatever was at that location in the first place, but nothing on disk is overwritten. You could even remove a file or directory from the namespace. This could be useful for webservers: if one is compromised, the attacker won’t even be able to see that the /home directory exists. And you could even bind multiple directories at the same location, thus providing the union of their contents.
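To make these operations concrete, here is a minimal sketch of a namespace as a table of mountpoints. It is a toy model for illustration only, not how Sortix or Plan 9 implement it; the class, its method names, and the example paths are all hypothetical.

```python
import posixpath


class Namespace:
    """Toy model of a filesystem namespace: a table of mountpoints.

    Each mountpoint maps to an ordered list of backing directories.
    More than one entry means the mountpoint is a union; an empty list
    means the location has been removed from this namespace's view.
    """

    def __init__(self, mounts=None):
        self.mounts = {p: list(layers) for p, layers in (mounts or {}).items()}

    def clone(self):
        # A new namespace starts out as a copy of an existing one.
        return Namespace(self.mounts)

    def bind(self, source, point, union=False):
        # Bind `source` at `point`, either replacing what was visible
        # there or stacking on top of it to form a union.
        if union:
            self.mounts.setdefault(point, []).insert(0, source)
        else:
            self.mounts[point] = [source]

    def hide(self, point):
        # Remove a location from this namespace (nothing on disk changes).
        self.mounts[point] = []

    def resolve(self, path):
        # Find the longest mountpoint that is a prefix of `path` and
        # return every backing path that `path` could refer to.
        best = None
        for point in self.mounts:
            prefix = point.rstrip("/") + "/"
            if path == point or path.startswith(prefix):
                if best is None or len(point) > len(best):
                    best = point
        if best is None:
            return [path]
        rest = path[len(best):].lstrip("/")
        return [posixpath.normpath(posixpath.join(layer, rest))
                for layer in self.mounts[best]]


# Hypothetical paths, purely for illustration.
system = Namespace({"/usr": ["/disk/usr"]})
private = system.clone()
private.bind("/home/sortie/usr", "/usr")      # only `private` sees this
private.hide("/home")                         # /home no longer exists here
```

After this, `system.resolve("/usr/bin/ls")` still refers to `/disk/usr/bin/ls`, while the cloned namespace sees `/home/sortie/usr/bin/ls` and nothing at all under /home.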

Filesystem namespaces fit well into the design of Sortix. I intend to try to follow the Plan 9 model for namespaces with my own changes. After all, this area is hardly standardized and I’m free to experiment with the implementation. I have to be careful, though, as namespaces can easily cause mass confusion for users if they are used in non-obvious ways. It’s better to let the users do the non-obvious usage once they learn of the feature.

What are filesystem namespaces good for? A simple usage is providing each user with their own private /tmp directory. This avoids conflicts and if documented would surprise no one. Another interesting usage is to obsolete the PATH environment variable. The shell could then hardcode PATH as /bin and only load programs from there. The user could then make /bin the union directory of /x86_64-pc-sortix/bin, /share/bin, /data/x86_64-pc-sortix/bin, /data/share/bin, /home/$USER/Software/x86_64-pc-sortix/bin, and /home/$USER/Software/share/bin. This has the amazing consequence that all programs reside in /bin at runtime, just like the good old days of early Unix. The same is possible with /lib, /etc and /include. Users can then install software into their own directory and it will integrate perfectly with the rest of the system. Indeed, this also gets rid of shared library search paths as they are all in /lib at runtime.
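A rough sketch of what looking inside such a union /bin could amount to is shown below. The function name is hypothetical, and the rule that earlier layers win when two layers contain the same name is my assumption; the text above does not settle how conflicts are resolved.

```python
import os


def union_listing(layers):
    """Merge several directories into one name -> real path mapping.

    Earlier layers take precedence over later ones. That precedence
    rule is an assumption made for this sketch.
    """
    merged = {}
    for layer in layers:
        if not os.path.isdir(layer):
            continue  # A missing layer simply contributes nothing.
        for name in sorted(os.listdir(layer)):
            # setdefault keeps the entry from the earliest layer.
            merged.setdefault(name, os.path.join(layer, name))
    return merged
```

With the user's own directories listed first, `union_listing(layers)["ls"]` would name the user's private copy of ls if one exists and the system copy otherwise, so the shell never needs to search more than one directory.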

Even more insane constructs are possible. For instance, we can create a filesystem server that provides TCP connections through /net/tcp, UDP through /net/udp, and so on. Programs can then be routed to communicate through /net either by modifying them or using a patched libc. A user could then mount the /net directory of another computer on the local /net directory. Programs accessing /net would then be forwarded transparently to the /net directory of the other computer, thus using its Internet connection. If we use an encrypted tunnel, we have just invented a simple VPN solution.

The chroot system call could be implemented through this method: Simply create a new namespace and mount /home/$USER/my-chroot as / and we’re done.

I’m sure there are plenty more exciting and creative uses of namespaces. I look forward to implementing this and toying around with it. In my new build system, I’ve used unionfs-fuse to simulate namespaces when cross-compiling software.

Generalizing Root using Subusers

Traditionally Unix systems are managed by an all-powerful user called Root. This is a simple, powerful and straightforward scheme for system administration. Indeed, the root user bypasses all security checks and can do anything. In contrast, regular users are quite limited in what they can do. This isn’t bad as the users can’t tamper with each other or the system, but it also means that users can’t do various useful things. Often this is not a big problem or can be worked around, but it’s not pretty.

From a software design perspective, it’s not ideal to treat root as a special case. From a user’s perspective, the root-only mentality is occasionally annoying. Of course, letting any user chroot or change stuff in /bin is a critical security problem under the current design. That’s why I want to change the design of users in Sortix. Possibly we can create something better.

The concept I’ll introduce today is sub-users. The idea is that a user can create new users with fewer privileges. The original user can then act as root with respect to those new sub-users. For instance, I could create the user sortie/games, usable if I want to run a game, but not let it access my private files. Note that this forms a user hierarchy, with the root user at the top. The name root is simply an alias for / (the root of the user hierarchy) and the user sortie is simply a shorthand for /sortie. This could be interpreted as the current user hierarchy directory being / and the user sortie being resolved relative to that. Note how these naming rules are backwards compatible.

However, most software and filesystems are not ready for such changes to the user system. In particular, which user should own the file according to the filesystem, if the file is owned by a sub-user? Traditionally a user only has one uid and this surely won’t do it. Instead, I propose to allocate each user a range of uids. Since uids are 32-bit integers (at least) on Sortix, there’s plenty of room to allocate even for big organizations (how about 4096 uids per user?). When a user creates a subuser, the user simply gives it a subrange of the user’s current range of uids.

In this case, the root user would have uid=0 and the range 0 to UID_MAX. This is useful if we introduce the following rule: A user can access a resource if the user’s uid is first in his range, and the resource’s owner is in this range. If the sortie user has the range 4096-8191, and sortie has uid=4096, and sortie/games has uid=5000, then sortie can act effectively as root towards sortie/games. Naturally, sortie/games can act effectively as root towards sortie/games/quake.
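The access rule can be sketched in a few lines. The range 4096-8191 and the uid 5000 come from the example above; everything else is assumed for illustration, including the subrange 5000-5999 for sortie/games, a 32-bit UID_MAX, and all class and function names.

```python
from dataclasses import dataclass

UID_MAX = 2**32 - 1  # Assumes 32-bit uids.


@dataclass(frozen=True)
class User:
    name: str
    uid: int    # The uid this user runs as.
    first: int  # First uid of the user's allocated range.
    last: int   # Last uid of the range (inclusive).

    def subuser(self, name, first, last):
        # A subuser is handed a subrange of the parent's range. The
        # parent keeps its own first uid, so the subrange starts later.
        if not (self.first < first <= last <= self.last):
            raise ValueError("subrange must lie inside the parent's range")
        return User(f"{self.name}/{name}", first, first, last)


def acts_as_root_over(user, owner_uid):
    # The rule from the text: the user's uid must be first in its
    # range, and the resource's owner must fall inside that range.
    return user.uid == user.first and user.first <= owner_uid <= user.last


root = User("", 0, 0, UID_MAX)                 # "root" is an alias for /
sortie = root.subuser("sortie", 4096, 8191)    # named /sortie
games = sortie.subuser("games", 5000, 5999)    # hypothetical subrange
```

Here `acts_as_root_over(sortie, games.uid)` holds, while `acts_as_root_over(games, sortie.uid)` does not, since 4096 lies outside the range given to sortie/games.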

This becomes interesting if we interpret the user hierarchy through a filesystem metaphor. What would happen if we chdir’d or chroot’d to another user/directory? Well, if we did it to /sortie, then sortie would become the root user if the process asks. The process gains no new permissions, but if it had chrooted to a directory the sortie user owns, then it could do anything just like traditional root. This is useful if a user wants to start up his own little OS environment in which he acts as root. In fact, the uid space changes too. Previously it ran from 0 to UID_MAX, but now sortie’s range of 4096-8191 is mapped at 0-4095. The process won’t be able to tell the difference except the size of the uid space has inexplicably shrunk.
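The remapping itself is just an offset. A minimal sketch follows, with hypothetical function names; rejecting uids outside the visible range is my assumption, as the text does not say how such uids would appear to the process.

```python
def to_local_uid(global_uid, first, last):
    """Translate a system-wide uid into the shrunken uid space."""
    if not first <= global_uid <= last:
        raise ValueError("uid is outside the visible range")
    return global_uid - first


def to_global_uid(local_uid, first, last):
    """Translate a uid seen by the process back to the system-wide uid."""
    if not 0 <= local_uid <= last - first:
        raise ValueError("uid is outside the visible range")
    return first + local_uid
```

With sortie’s range of 4096-8191, the process sees sortie itself as uid 0, sortie/games (uid 5000) as uid 904, and the last uid of the range as 4095.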

Subusers are simply one of the many methods Sortix will use to empower the users of the system and allow them to fully control their environment in a fully isolated manner. Another interesting concept is per-process namespaces. This allows processes to rearrange the filesystem as they please: unrelated processes won’t see the changes, but child processes would inherit them. Combining per-process namespaces with subusers, you get a very powerful chroot-like feature that allows users to run their own fully-fledged systems. The only concerns for the system administrator are disk space and CPU time, but users could waste those regardless.

Towards Sortix 0.7

Development of my homemade operating system Sortix 0.7 is currently underway and many milestones are already completed. In particular, this release will add proper support for filesystems, graphics, threads, and lots of user-space improvements.

I am currently working on adding proper support for kernel threads. This offers a much more convenient programming model in the kernel (it was previously cooperative). Note that the user-space environment has always been fully preempted; now the kernel is too. So far the system boots and runs using kernel threads and it is quite stable, except for a show-stopping bug in the program loader. The bug sometimes loads a program incorrectly, which causes exotic crashes.

Building upon kernel threads is the upcoming VFS, which will allow the implementation of mount-points and user-space filesystems. User-space filesystems are a central design decision in Sortix, where the users have complete control over their view of the filesystem. Many planned Sortix features depend on user-space filesystems, for instance my ext2 driver, which currently has read-only support. Having a real filesystem driver would also remove many limitations, for instance not being able to store files permanently and the initrd slowly running out of space. User-space filesystems have other uses too. For instance, I plan to make a tcp filesystem that can be mounted at /net/tcp. This would allow shell scripts to easily access the internet, but it would also allow a special kind of VPN connection if you mount another computer’s /net directory as your own. Linux has mostly kernel filesystems, which often means that traditionally only root can mount filesystems, but it’s not uncommon for users to want to mount filesystems. The Linux FUSE API offers such functionality, but I’m not completely satisfied with the implementation.

I’ve also researched real mode and BIOS calls and it sure isn’t pretty. In fact, it is not possible for 32-bit protected mode or 64-bit long mode operating systems to call the BIOS directly (at least, not without gruesome hacks). So I coded a short routine that de-boots the CPU back to 16-bit real mode, calls the BIOS directly, and then boots Sortix back up. It works surprisingly well.

There is an old standard called VESA BIOS Extensions (VBE), which allows the operating system to easily use modern resolutions and perform 2D bitmap operations. With the BIOS support in place, I was able to code a simple graphics driver based upon VBE. Previously, I only had bitmap support for the BGA mechanism in modern virtual machines. The Sortix Video branch works on creating a simple framework for switching and detecting graphics modes as well as providing a framebuffer to the user-space applications. This API will expose the video modes and framebuffers as the /dev/video directory. However, my VBE driver is currently quite slow on real hardware as it takes too long to transfer data to the graphics card pixel for pixel, although it performs well in virtual machines. Adding support for MTRRs may just fix the transfer speed problem by bulking up writes and therefore reducing write latency, which is the bottleneck.

Here are some miscellaneous improvements: The standard library, kernel, and shell now support environment variables. I’ve added a new initrd format which supports nested directories, permissions, timestamps, and more. I’ve successfully ported libfreetype to Sortix and rendered text to an off-screen buffer (soon onto /dev/video/fb). There is some initial work done on copy-on-write support, but it’s not working yet. And lots of other small improvements and bug fixes.

I expect to finish the above features before releasing Sortix 0.7 and naturally I’ll add many other small improvements. Sortix 0.8 will likely focus on security where I add users and groups as well as making the kernel more robust against malicious processes.

Creating a new process

Modern operating systems generally use two different approaches to create a new process. The first and conceptually simplest is to call a CreateProcess function that accepts the path to the executable and some parameters to control what is inherited and other modifiers. This approach is used in the Windows operating systems, where the CreateProcess function accepts no fewer than 10 parameters.

Unix-like operating systems traditionally use another method: they simply create a perfect copy of the parent process, differing only in a return value, and the child then replaces its entire address space with that of an executable. Somehow this is profoundly simple, though it feels unnecessary to create a copy just to replace it soon thereafter. This is done using the fork system call, which accepts not even a single parameter, but returns the process id of the child process, or zero if you are the child.
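The fork-then-exec pattern described above can be sketched with Python's os module, which wraps the POSIX calls directly. This runs on Unix-like systems only, and the helper name spawn is hypothetical.

```python
import os
import sys


def spawn(argv):
    """Run argv[0] in a child process and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: fork returned zero. Replace this copy of the parent
        # with the target executable.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; leave without running any of
            # the parent's cleanup code in the child.
            os._exit(127)
    # Parent: fork returned the process id of the child.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # Child was killed by a signal.


exit_code = spawn([sys.executable, "-c", "raise SystemExit(3)"])
```

Note that the only place the two processes differ right after fork is the value of pid, which is what lets the same code take the child or the parent branch.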

Back when Unix was created, the overhead of creating a full copy of a process wasn’t very critical, as computers had little memory. That meant when the system ran a program, it often had to swap it in from disk first. In that case, creating a copy of a process was just as expensive as running a process in the first place. Modern Unix-like kernels instead use a technique called Copy-On-Write, which means that they delay creating a copy of a piece of memory until either the parent or child modifies it. Yet, even though the CreateProcess model may be more efficient today, the fork model remained in use for a few reasons.
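Copy-On-Write can be illustrated with a toy simulation. This is only a model of the idea using reference-counted pages; real kernels do this with page tables and page faults, and all names here are hypothetical.

```python
class Page:
    """A chunk of memory plus a count of address spaces that map it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1


class AddressSpace:
    def __init__(self, pages):
        self.pages = pages

    @classmethod
    def from_bytes(cls, chunks):
        return cls([Page(chunk) for chunk in chunks])

    def fork(self):
        # Copy only the list of page references; the pages themselves
        # stay shared between parent and child.
        for page in self.pages:
            page.refs += 1
        return AddressSpace(list(self.pages))

    def read(self, index, offset):
        return self.pages[index].data[offset]

    def write(self, index, offset, value):
        page = self.pages[index]
        if page.refs > 1:
            # Someone else still maps this page: make a private copy
            # now, at the moment of the first write.
            page.refs -= 1
            page = Page(page.data)
            self.pages[index] = page
        page.data[offset] = value


parent = AddressSpace.from_bytes([b"aaaa", b"bbbb"])
child = parent.fork()
child.write(0, 0, ord("z"))  # Copies page 0 only; page 1 stays shared.
```

After the write, the parent still reads the original byte from page 0, and page 1 has never been copied at all, which is why a fork followed quickly by loading a new executable ends up copying very little.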

The primary reason was that it was trivial to modify the child process: fork works by letting you run code in the child. If you let that code modify the current process, then you modified the child process. It also provided a convenient programming model in which processes could easily cooperate over shared file descriptors (such as pipes). Threads were added much later to Unix systems, and yet, the model of forked processes that cooperate remained in use because it provided a solid and decoupled framework. In contrast threads run in the same process, can directly access each other and are occasionally very hard to understand and debug.

If the kernel developers decide to add a new aspect to the process model in the system, then a process may wish to modify that aspect in its child process. CreateProcess would perhaps require yet another parameter and grow in complexity. In contrast, on a Unix-like system, you would only need to add a new system call to modify the aspect of the current process and then have the forked child call it before it loads the target executable. Note that the Unix program need only invoke the system call if it actually cares (otherwise it gets nice defaults), whereas you need to pass yet another NULL to CreateProcess.
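As a sketch of this pattern, the child below changes two aspects of itself (its working directory and its standard output) with ordinary system calls before loading the target executable, and the two processes cooperate over a shared pipe. It uses Python's os module on a Unix-like system; the helper name is hypothetical.

```python
import os
import sys


def run_in_directory(directory, argv):
    """Run argv with `directory` as its cwd and return what it printed."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_end)
            os.chdir(directory)    # Changes only the child's directory.
            os.dup2(write_end, 1)  # The child's stdout now feeds the pipe.
            os.close(write_end)
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)          # Only reached if something failed.
    # Parent: keep only the read end, then collect the child's output.
    os.close(write_end)
    with os.fdopen(read_end, "rb") as pipe:
        output = pipe.read()
    os.waitpid(pid, 0)
    return output


printed = run_in_directory(
    "/", [sys.executable, "-c", "import os; print(os.getcwd())"])
```

The parent's own working directory is untouched. A program that does not care about the child's directory would simply leave out the chdir call, with no extra argument to pass anywhere.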

However, as time passed, the developers of the Plan 9 operating system found that fork wasn’t fine-grained enough. The problem is that fork creates a complete copy of the parent process, but sometimes, you may wish to share some aspects of a process with the child. Perhaps you wish to create a child process, but share the pointer to the current directory, share the address space, share the process id, share the set of file descriptors? To do this, they added the rfork system call that accepts a single parameter, which is a bitmask of which aspects you wish to share and which you wish to get a copy of. Indeed, if you decide to share everything except the stack, then you essentially create a new thread, not a process. Later on, rfork was adopted into the BSD systems, and inspired the clone system call on Linux, which is used to create both processes and threads.

I decided to go with the rfork model in Sortix, which is my homemade operating system. However, I had a distaste for the exact semantics of rfork. In particular, the call copied some parts of the process by default. For simplicity, I’d much prefer that sharing aspects is the default, and you need to set a bit to copy the aspect. I therefore renamed rfork to sfork and changed the semantics ever so slightly. Adding a wrapper for BSD rfork compatibility shouldn’t be hard. The nice thing is that you can use sfork both to create processes and threads, and hybrids in-between. The fork function is now implemented as sfork(SFFORK), where SFFORK is a constant that expands to the bits that provide traditional fork semantics. Should you wish to do a normal fork, except share the current directory pointer, you can simply call sfork(SFFORK & ~SFCWD).
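The bitmask arithmetic can be checked with a small model. Only the names SFFORK and SFCWD come from the text; the other flag names, the bit values, and the plan function are made up for illustration and say nothing about the real Sortix headers.

```python
SFCWD = 1 << 0  # Copy the current directory pointer (bit value assumed).
SFFD = 1 << 1   # Hypothetical: copy the file descriptor table.
SFMEM = 1 << 2  # Hypothetical: copy the address space.
SFPID = 1 << 3  # Hypothetical: give the child its own process id.

# Traditional fork semantics: copy every aspect.
SFFORK = SFCWD | SFFD | SFMEM | SFPID

ASPECTS = {
    SFCWD: "current directory",
    SFFD: "file descriptors",
    SFMEM: "address space",
    SFPID: "process id",
}


def plan(flags):
    """Describe what a call with these flags would copy or share."""
    # Sharing is the default; a set bit requests a copy of that aspect.
    return {name: ("copy" if flags & bit else "share")
            for bit, name in ASPECTS.items()}


normal_fork = plan(SFFORK)
shared_cwd = plan(SFFORK & ~SFCWD)
thread_like = plan(0)
```

Clearing a bit from SFFORK turns exactly one aspect from copied to shared, and passing no bits at all shares everything, which is the thread-like end of the spectrum.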

Actually, for internal kernel reasons, I didn’t find sfork flexible enough. To solve that, I created the system call tfork (notice a pattern here?), which accepts both a bitmask and a structure that contains the registers of the child thread. This allows you to allocate a big memory chunk, and then tfork a thread that uses it as its stack by simply setting its stack register. You can even implement sfork using tfork by putting the current registers into the structure and calling tfork (and taking care not to call tfork again in the child).

In case you were wondering, I didn’t implement the Linux clone system call, as it bears resemblance to CreateProcess in being over-engineered.