vsta: vstafs


=== VSTa FS: an Extent Based Filsystem ===

A file is structured as:

| File header | Extents | Data...

So the stuff UNIX would put in inode is actually just the first N
bytes of a file. Each extent is merely <start,len> tuples. The sizes
are chosen carefully, so that a 100-byte file occupies a single
512-byte sector, including both file data as well as file ownership,
protection, and allocation information. A directory is just a special
file, so smaller directories will also occupy a single sector.

As a file grows, its existing allocation is extended where possible.
So if your file is currently {<3,1>} (i.e., sector three, one sector
long), and you write enough data to push out beyond the initial
sector, the filesystem will see how many blocks can be had starting at
4. Up to 64K is taken at once, in the hope that taking this much
proactively will spare you having to "piecemeal" your file together as
you write.

Thus, your file could then be {<3,129>}. The file's length is marked
to be the entire length of the whole allocation (129*512 = 66048). If
you crash, the file thus correctly represents the storage it holds,
without fsck being required before filesystem operation is resumed.

As you write the file, the filesystem keeps track of the highest
offset you have written. On last close, blocks beyond this point are
freed, and the file is "shrunk" down to the actual size written. If
you wrote a K or so, then closed the file, the example file would end
up as {<3,3>}, and 6..129 would return to the free list.

The free list uses a sorted linked list of free ranges, with
coalescing. Each range is held in a single sector; each sector has a
next pointer to another sector's worth of free list ranges. A sector
can hold 0 or more entries; if it ever overflows, a global rebalance
is done, resulting in each sector being exactly half full (except the
last in the list which might hold less). This allows contiguous ranges
of free blocks to be easily identified, and is very storage efficient,
even with small allocation units (such as sectors). It is also easy to
prove that your free list can always represent the total of free
blocks, because a block on the free list can be consumed to provide
another sector's worth of storage for the free list. Bitmaps are
popular for free block management, but their size grows as a function
of increasing filesystem size and decreasing allocation unit size.
Most implementations either take a larger allocation size--say,
8K--and waste disk space for smaller files, or add tremendous
complexity to special-case sub-allocation-size units (i.e., "frags" in
the BSD filesystem). VSTa is primarily an experimental platform, so I
decided to try something different.

The most complex part of the filesystem is the block cache. This is a
misnomer; each entry in the buffering system is variably-sized. For an
extent of up to 64K, a buffer of the exact size is kept. For larger
extents, the extent is buffered in 64K chunks.

Between the contiguous block allocation and the variable-size buffer,
I hope to provide very efficient filesystem services. Consider a file
with allocation {<10,8>} (i.e., a file starting at sector 10 for 8
sectors--4K). When this file is opened and read, a 4K buffer is
allocated. A single disk I/O fills the entire 4K. Thus, most small
files will have their entire contents buffered in a single disk I/O.
Larger files will have their contents buffered in chunks of 64K. Once
the file contents are in the filesystem's buffer, it may be parcelled
out to clients without further I/O activity.

This scheme should work well for text files, objects, and some
executables. It will not work well for, say, databases. They would do
better to talk directly to a disk partition, or through a similar, but
non-buffered version of this filesystem. My initial prototype will
also not protect itself well against multiple, competing allocation
requests. I hope to incrementally improve this as I learn more from
actual use.