An Overview of the VSTa Microkernel Abstact VSTa is an experimental kernel which attempts to blend the design of a microkernel with the system organization of Plan 9. The result is a small privileged kernel running user-mode tasks to provide system services such as device drivers, filesystems, and name registry. Like Plan 9, each service provides a filesystem-like interface. === Motivation === Two operating systems have emerged in recent years which show considerable promise. The POSIX-conforming QNX operating system (1 ) is a relatively "clean" microkernel, with a 4K executive and most traditional system functions running as user-mode processes. Meanwhile, Plan 9 (2 ) has emerged as a next-generation view of computing from the original architects of UNIX. While each system represents significant new system functionality, the frustration has been in the proprietary nature of each system--synthesis and experimentation by an outside party is difficult or impossible. === A Platform for Experimentation === ==== A New Source Base ==== With these interesting new ideas to try, and without a suitable code base on which to try them, the only option was to wait for the authors of these systems to publish more papers. When the author recently had the opportunity to take a 6 month sabbatical, the result was the VSTa operating system, written from scratch. VSTa is a non-proprietary source base permitting experimentation in many of the areas first broached by Plan 9 and QNX. While the full design is beyond the scope of this overview, several general design decisions are worth mentioning. ==== Symmetric Multiprocessing ==== Microkernels have a surprising interaction with multi-processing support. In a monolithic kernel, vast amounts of system functionality all reside in the same kernel address space. When the kernel first supports multiprocessing, all of this code must be revamped to handle multiple parallel threads of execution. (3 ) Much of this "kernel" code is moved out to individual tasks in a microkernel organization. With such an organizational change, the job of converting this code to the appropriate level of parallelism is divided. A process can be single-threaded; it can be marked serially-reentrant (multiple threads, but only one running in the address space at a time--like a uniprocessor UNIX kernel); or the threads can run fully in parallel. The decision is made per-process. While real-time attributes are discussed later, it is worth noting that the job of writing code to operate correctly in the presence of parallel CPUs has some synergy with the goal of allowing kernel preemption. In both cases the programmer must always consider the possibility of the same code paths being entered and reentered at arbitrary points. Kernel preemption imposes additional demands, especially in code paths like exit(). With the smaller kernels achieved by moving so much functionality out to processes, the remaining code is much easier to design, code, and test. Since writing parallel, preemptive code is harder to write than traditional single-threaded code,(4 ) the reduction in source size is desirable. VSTa's machine-independent layers were written for a shared memory symmetric multiprocessor. A P/V semaphore interface is used for sleep-oriented interlocks. A P/V spinlock is used for spin-oriented interlocks, and also used to interlock against interrupt service procedures. The machine-dependent code, sadly, is only written for a uniprocessor i386--the only machine available to the author. ==== Real Time Facilities ==== While VSTa is not a real-time operating system in itself, numerous features associated with real-time systems offer themselves naturally to solve microkernel design issues. Process memory locking is necessary in order to allow a disk driver task to run as a user process (as otherwise, of course, you will deadlock when your disk driver tries to demand page in a piece of itself from swap.) Non-degrading priorities are necessary to permit critical system services to respond to many users without being penalized for their apparent heavy CPU use. Low-latency process dispatch is necessary to allow interrupt service code to run in a deterministic amount of time after a device event--especially important in the case of heavy data sources like dumb serial ports and LAN interfaces. VSTa was designed with memory locking and real-time priorities. Except when a spinlock is held, a thread is preemptable even when running in kernel mode. Most spin-locks do not involve interrupt-driven code; for these, interrupts are still accepted and queued even while the spinlock is held--preemption to a real-time process is delayed until the spinlock is released. ==== Scheduling ==== VSTa uses a very conventional priority-driven scheduler for real-time processes. However, most processes in the system run under an interactive, timesharing scheduling algorithm with unusual properties. The scheduler is driven from a tree where runnable processes are leafs. The internal nodes represent the partioning of groups of processes into percentage "slices" of the CPU pool (much like a fair share scheduler), with the lowest nodes containing the threads within a process. This organization has two desirable properties. First, it allows users and groups of users to partition the CPU resources fairly among groups based on local policy. With the ratio of CPUs to users approaching 1:1, the classic departmental computing scenario may never arise. But it can be convenient to guarantee that some particular server will never consume more than half the CPU time (unless it would otherwise be idle.) Such a scheduler also provides many of the properties of a gang scheduler. When the classic UNIX scheduling algo- rithm is used to run closely cooperating processes, its global nature allows any runnable processes to compete directly with the threads. Since all threads under VSTa exist under a common scheduling node, the threads can voluntarily relinquish the CPU; the CPU time relinquished remains within the "pool" of the node, so only other related threads under the node will complete for it. (5 ) ==== Messaging ==== In a microkernel, the goal is to identify a small, cohesive set of kernel functions upon which all current and desired future system functions can be implemented as user mode tasks. Like Plan 9 and QNX, VSTa uses a messaging engine as the underlying microkernel mechanism. VSTa, like QNX, supports scatter/gather lists for messages. VSTa, however, had to use significantly different techniques than QNX--VSTa supports virtual memory, where QNX can always assume the presence of physical memory for its messages. ( 6 ) VSTa structures the exchange of messages between client and server. A would-be client requests connection; a connection indication including the would-be client's capabilities is presented to the server. The server accepts or rejects the connection. If accepted, the client can then send messages. Each element in the scatter/gather list of the message is made available in the server's address space when the server receives the message. The contents are mapped on demand as the server makes reference to the data. Alternatively, the server can merely pass the message on without touching its contents (for instance, a middle module in a protocol might add a new initial buffer without needing to examine the contents being encapsulated.) The data was never mapped or read; the passing of data in these cases is thus quite efficient. If the server returns data it is copied into the client's address space before the client's request is completed. It would be desirable to use the same "lazy" semantics for message mapping as the server, but this would make it difficult for the server to know when the client is done using the data. Techniques involving the "handing off" of pages of data are possible, but many servers would then have to copy data into new pages; any performance benefits can easily be lost. ==== Filesystem Interface ==== Plan 9 imposes a filesystem-like interface onto all objects in the system. QNX permits a server to provide a filesystem interface, but does not pursue such a thing as a goal in its own right. Plan 9 allows each process to build its own filesystem view by attaching objects at arbitrary points in its filesystem namespace. QNX provides a global view, but does not implement these mount points in the classic UNIX (and Plan 9) way. QNX instead looks at its "mount table" as a simple mapping of leading strings for mount points, and a corresponding server. VSTa borrows much from each operating system. Like Plan 9, all servers provide their services in terms of a filesystem-like interface. The format of the messages sent through the microkernel is standard and implements a filesystem protocol similar to Plan 9's 9P protocol. ( 7 ) VSTa use a string table approach like QNX, although the table is private to each process. In fact, the table and its interfaces are all entirely with the C library; neither the VSTa kernel nor its servers have any control over a process' mount table. ==== Capabilities ==== VSTa breaks from current operating systems in the way it structures capabilities. The desire was to allow a hierarchy of abilities, and to allow a given user to create sub-abilities without the intervention of a system manager. The resulting system is powerful, but does not map well onto a POSIX interface. In VSTa, an ability is represented as a dot-separated sequence of numbers, called an ID. The numbers become more specific reading from left to right. An object in VSTa (8 ) has a label with such a label, and for each position, another bitmask indicating what actions are permitted (It also has a default access, which is just OR'ed in with any other bits granted.) For instance, assuming no default: {{{ 1. 2. 1. 3 EXEC READ WRITE DELETE }}} Would indicate that someone possessing 5.3 could not access the object; someone with 1.5 could only execute it (a mismatch discontinues the accumulation). A hierarchy of "super users" is gained with the last rule: someone with 1.2 would gain read, write, excute, and delete abilities! When someone possesses an ID which matches a label to the length of the ID, the ID is said to dominate the label. The remaining bits are OR'ed in as if the match continued to the end of the label. The superuser of a VSTa system, therefore, is someone who has an ID of length 0. One can forge a new ID; it is permitted if at least one of the current IDs dominates the new ID to be forged. Thus, someone who logs in with the ability 5.7 could store all sensitive data with a label of, say, 5.7.1 with all access requiring a full match: {{{ 5. 7. 1 (0) (0) READ|WRITE|EXECUTE|DELETE }}} If this same user then wanted to run a somewhat suspect application, he could forge a new ID of 5.7.2, disable his current IDs, and run the application. Since the application does not possess an ID which allows access to 5.7.1, the user's data is protected. Because such ID manipulation can be done by any user, fine-grained protection designs are possible in a way which UNIX forbids without extensive use of super-user powers. === Current State === The VSTa kernel currently comprises 2700 lines of portable C, 750 lines of i386-specific C (using the "count the semicolons" method), and 400 lines of i386 assembly code. It compiles into roughly 50K of 32-bit i386 object code, with another 10K of data. Device servers exist for the keyboard, screen, floppy disk, ST-506 hard disk, and real-time clock. Servers also exist for a flat contiguous-allocation filesystem, a DOS filesystem, an environment server (used by getenv()/setenv()), a swap manager (to manage multiple swap partitions and permit dynamic swap partition additions and deletions), and a service registry database. The system boots and runs on top of either a flat contiguous filesystem or a DOS filesystem. Further device drivers and filesystems can be started and stopped from the command interpreter. The system does demand-paging of executables and does page-stealing when memory becomes scarce using a two-handed clock algorithm. At the application level, a C library and include files have been written to conform to the POSIX specification. GNU C, as, and ld have been ported and run natively under VSTa. Emacs and other amenities are also available. As each port identifies a missing area, it is coded up based on the POSIX standard. Thus, the system deviates from POSIX more by omission than otherwise. The exception is the area of protection, where this incompatibility was foreseen and accepted at the conception of the project. === Future Projects === It is satisfying to have a freely accessible environment on which to prototype further facilities. Several categories of experimentation are of active interest to the author. ==== Performance ==== VSTa provides very good interactive and compilation performance on a single-user 25 Mhz i386 PC. However, several areas could be streamlined. For small buffers of data the current page-mapping messaging techniques are overkill; it would be much cheaper to just copy the actual data around. A "fast path" for small buffers could shorten code paths for keystrokes, RS-232 bytes, and small writes to the screen. This must be traded off against the additional complication in the messaging machinery--but a prototype is certainly called for. Some servers, for the sake of simplicity, do not take full advantage of the scatter/gather functions. At least the disk drivers and filesystems should be carefully optimized to make best use of scatter/gather lists. ==== Clustering ==== Accessing services through messages immediately leads one to ponder the possibilities of remote access. The scatter/gather organization and lazy referencing of VSTa should make it possible to move the data from a client to the network interface without any intermediate copying of the data. Incoming connections must be mapped into a local VSTa server connection; the networking daemon must forge appropriate IDs for the client. A mapping database used by the daemon might suffice: {{{ theirbox:1.* -> mybox:99.* trustedbox:* -> mybox:* *:* -> REJECT }}} This would allow both access control (you must have 1.* on theirbox to log in; trustedbox has the same accounts as us) and dynamic translation of IDs between systems. The name server is currently a local entity. However, once remote service access is available, it is a simple matter to import other node's name servers and use them locally. Because VSTa supports Plan 9-style union directories, you could even mount each name server at the same point in your local filesystem name space, with your local name server coming first. Ultimately, a network-aware database system must be implemented, but it is interesting to ponder how far these simple and powerful Plan 9 techniques can take one. === Acknowledgements === The knowledge and vision of the Plan 9 and QNX developers is appreciated, both in their papers and on Usenet. This project would not have been possible without the outstanding tools from the Free Software Foundation--especially the GNU C compiler and its accompanying utilities. Finally, I would like to thank the Bill and Lynne Jolitz for the 386BSD source code. Its availability as a reference helped me unravel many an intricacy of the i386. ------ Footnotes: 1. Hildebrand, Architectural Overview of QNX 2. Pike et al, The Plan 9 Operating System 3. Operating systems like System V.4.2 MP try to mitigate this with serializing code "wrappers." Few people seriously present this as an appropriate solution, or even an acceptable temporary fix. 4. The author's estimate is 2-3 times harder. 5. Of course, if there are no runnable threads, the CPU time will be parceled out elsewhere in the scheduling tree. 6. QNX designers have designed, but not yet implemented, a virtual memory capability for their system. 7. The protocol would have been 9P, except that it was designed before this level of detail was available for Plan 9. 8. More precisely, the objects offered by standard VSTa servers. ------