Describing the VFS
To ensure that applications do not face any such obstacles (as mentioned earlier) when working with different filesystems, the Linux kernel implements a layer between end user applications and the filesystem on which data is being stored. This layer is known as the Virtual Filesystem (VFS). The VFS is not a standard filesystem, such as Ext4 or XFS. (There is no mkfs.vfs command!) For this reason, some prefer the term Virtual Filesystem Switch.
Think of the magic wardrobe from The Chronicles of Narnia. The wardrobe is actually a portal to the magical world of Narnia. Once you step through the wardrobe, you can explore the new world and interact with its inhabitants. The wardrobe facilitates accessing the magical world. In a similar way, the VFS provides a doorway to different filesystems.
The VFS defines a generic interface that allows multiple filesystems to coexist in Linux. It’s worth mentioning again that with the VFS, we’re not talking about a standard block-based filesystem. We’re talking about an abstraction layer that provides a link between the end user application and the actual block filesystems. Through the standardization implemented in the VFS, applications can perform read and write operations, without worrying about the underlying filesystem.
As shown in Figure 1.3, the VFS is interposed between the user space programs and actual filesystems:
Figure 1.3 – The VFS acts as a bridge between user space programs and filesystems
For the VFS to provide services to both parties, the following has to apply:
- All end user applications need to define their filesystem operations in terms of the standard interface provided by the VFS
- Every filesystem needs to provide an implementation of the common interface provided by the VFS
We explained that applications in user space need to generate system calls when they want to access resources in the kernel space. Through the abstraction provided by the VFS, system calls such as read() and write() function properly, regardless of the filesystem in use. These system calls work across filesystem boundaries. We don’t need a special mechanism to move data to a different or non-native filesystem. For instance, we can easily move data from an Ext4 filesystem to XFS, and vice versa. At a very high level, when a process issues the read() or write() system call to read or write a file, the VFS will search for the filesystem driver to use and forward these system calls to that driver.
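This filesystem independence is visible from user space. The following sketch (the /tmp path is only an example; any writable path behaves the same whether it lives on Ext4, XFS, or something else) round-trips a buffer through the generic write() and read() system calls:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a buffer to a file and read it back using the same generic
 * system calls, regardless of the filesystem backing the path.
 * Returns 0 on success, -1 on any failure. */
int roundtrip(const char *path, const char *msg)
{
    char buf[128];
    ssize_t n;

    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    n = write(fd, msg, strlen(msg));    /* dispatched by the VFS */
    close(fd);
    if (n != (ssize_t)strlen(msg))
        return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, buf, sizeof(buf) - 1); /* dispatched by the VFS */
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';

    return strcmp(buf, msg) == 0 ? 0 : -1;
}
```

Note that the program never mentions a filesystem; the VFS resolves the path to the correct driver at run time.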
Implementing a common filesystem interface through the VFS
The primary goal of the VFS is to represent a diverse set of filesystems in the kernel with minimum overhead. When a process requests a read or write operation on a file, the kernel substitutes the generic call with the specific function of the filesystem on which the file resides. To achieve this, every filesystem must adapt its operations to the interface defined by the VFS.
Let’s go through the following example for a better understanding.
Consider the example of the cp (copy) command in Linux. Let’s suppose we’re trying to copy a file from an Ext4 to an XFS filesystem. How does this copy operation complete? How does the cp command interact with the two filesystems? Have a look at Figure 1.4:
Figure 1.4 – The VFS ensures interoperability between different filesystems
First off, the cp command doesn’t care about the filesystems being used. We’ve defined the VFS as the layer that implements abstraction. So, the cp command doesn’t need to concern itself with the filesystem details. It will interact with the VFS layer through the standard system call interface. Specifically, it will issue the open() and read() system calls to open and read the file to be copied. An open file is represented by the file data structure in the kernel (as we’ll learn in the next chapter, Chapter 2, Explaining the Data Structures in a VFS).
When cp generates these generic system calls, the kernel will redirect these calls, through a pointer, to the appropriate function of the filesystem on which the file resides. To copy the file to the XFS filesystem, the write() system call is passed to the VFS. This will again be redirected to the particular function of the XFS filesystem that implements this feature. Through system calls issued to the VFS, the cp process can perform a copy operation using the read() method of Ext4 and the write() method of XFS. Just like a switch, the VFS will switch the common file access methods between their designated filesystem implementations.
The read, write, or any other function for that matter does not have a default definition in the kernel – hence the name virtual. The interpretation of these operations depends upon the underlying filesystem. Just like user programs that take advantage of this abstraction offered by the VFS, filesystems also reap the benefits of this approach. Common access methods for files do not need to be reimplemented by filesystems.
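This "switch" behavior can be modeled with plain function pointers. The following toy sketch is not kernel code — the structure and function names merely echo the kernel's — but it captures how one generic entry point dispatches to whichever implementation a "filesystem" registered:

```c
#include <stddef.h>
#include <string.h>

/* Toy model of the VFS switch: a table of function pointers that
 * each "filesystem" fills in with its own implementation. */
struct toy_file_operations {
    size_t (*read)(char *buf, size_t count);
};

static size_t ext4_read(char *buf, size_t count)
{
    const char *data = "data from ext4";
    size_t n = strlen(data) < count ? strlen(data) : count;
    memcpy(buf, data, n);
    return n;
}

static size_t xfs_read(char *buf, size_t count)
{
    const char *data = "data from xfs";
    size_t n = strlen(data) < count ? strlen(data) : count;
    memcpy(buf, data, n);
    return n;
}

static const struct toy_file_operations ext4_fops = { .read = ext4_read };
static const struct toy_file_operations xfs_fops  = { .read = xfs_read };

/* The generic entry point: it neither knows nor cares which
 * filesystem it is talking to. */
size_t toy_vfs_read(const struct toy_file_operations *fops,
                    char *buf, size_t count)
{
    return fops->read(buf, count);
}
```

The caller always invokes toy_vfs_read(); which code actually runs depends entirely on which operations table was registered.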
That was pretty neat, right? But what if we want to copy something from Ext4 to a non-native filesystem? Filesystems such as Ext4, XFS, and Btrfs were specifically designed for Linux. What if one of the filesystems involved in this operation is FAT or NTFS?
Admittedly, the design of the VFS is biased toward filesystems that come from the Linux tribe. To an end user, there is a clear distinction between a file and a directory. In the Linux philosophy, everything is a file, including directories. Filesystems native to Linux, such as Ext4 and XFS, were designed with these nuances in mind. Because of differences in implementation, non-native filesystems such as FAT and NTFS do not support all of the VFS operations. The VFS in Linux uses structures such as inodes, superblocks, and directory entries to represent a generic view of a filesystem. Non-native filesystems do not speak in terms of these structures. So how does Linux accommodate them? Take the FAT filesystem, for example. It comes from a different world: it doesn’t use these structures to represent files and directories, and it doesn’t treat directories as files. So, how does the VFS interact with the FAT filesystem?
All filesystem-related operations in the kernel are firmly integrated with the VFS data structures. To accommodate non-native filesystems on Linux, the kernel constructs the corresponding data structures dynamically. For instance, to satisfy the common file model for filesystems such as FAT, files corresponding to directories are created in memory on the fly. These files are virtual and exist only in memory. This is an important concept to understand. On native filesystems, structures such as inodes and superblocks are not only present in memory but also stored on the physical medium itself. Non-native filesystems, by contrast, only ever construct such structures in memory.
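A toy sketch can illustrate the idea. Suppose a FAT-like on-disk record carries no inode; an inode-like object is then fabricated in memory on demand. All structure and function names here are hypothetical, not actual kernel code:

```c
#include <stdlib.h>

/* Hypothetical on-disk record of a FAT-like filesystem: no inode,
 * no notion of a directory being a file. */
struct fat_dirent {
    char name[12];
    unsigned int size;
    int is_dir;
};

/* The common file model the VFS expects (heavily simplified). */
struct toy_inode {
    unsigned int size;
    unsigned int mode;  /* simplified: 1 = regular file, 2 = directory */
};

/* Build an in-memory inode on the fly from the on-disk record.
 * The inode exists only in memory and is never written back. */
struct toy_inode *fat_build_inode(const struct fat_dirent *de)
{
    struct toy_inode *inode = malloc(sizeof(*inode));
    if (!inode)
        return NULL;
    inode->size = de->size;
    inode->mode = de->is_dir ? 2 : 1;
    return inode;
}
```

Once such an object exists in memory, the rest of the kernel can treat the foreign filesystem through the same common file model as a native one.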
Peeking at the source code
If we take a look at the kernel source code, the different functions provided by the VFS are present in the fs directory. All source files ending in .c contain implementations of the different VFS methods. The subdirectories contain specific filesystem implementations, as shown in Figure 1.5:
Figure 1.5 – The source for kernel 5.19.9
You’ll notice source files such as open.c and read_write.c, which contain the functions invoked when a user space process generates the open(), read(), and write() system calls. These files contain a lot of code, and since we won’t create any new code here, this is merely a poking exercise. Nevertheless, there are a few important pieces of code in these files that highlight what we explained earlier. Let’s take a quick peek at the read and write functions.
The SYSCALL_DEFINE3 macro is the standard way to define a system call and takes the name of the system call as one of its parameters.
For the write system call, this definition looks as follows. Note that one of the parameters is the file descriptor:
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	return ksys_write(fd, buf, count);
}
Similarly, this is the definition for the read system call:
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf,
		size_t, count)
{
	return ksys_read(fd, buf, count);
}
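Conceptually, SYSCALL_DEFINE3 expands into a function with three typed parameters (the 3 in the macro name is the parameter count). A much-simplified toy macro — which ignores the real macro's argument-widening, metadata, and tracing machinery — illustrates the idea:

```c
/* Toy version: glues a "toy_sys_" prefix onto the name and declares
 * three typed parameters. The real kernel macro additionally handles
 * argument widening, syscall metadata, and tracing hooks. */
#define TOY_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3) \
	long toy_sys_##name(t1 a1, t2 a2, t3 a3)

/* Expands to: long toy_sys_add3(int a, int b, int c) { ... } */
TOY_SYSCALL_DEFINE3(add3, int, a, int, b, int, c)
{
	return (long)a + b + c;
}
```

The macro thus turns a short declaration into an ordinary C function with a conventional name.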
These, in turn, call the ksys_write() and ksys_read() functions, respectively. Let’s see the code for these two functions:
ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;
	******* Skipped *******
	ret = vfs_read(f.file, buf, count, ppos);
	******* Skipped *******
	return ret;
}

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;
	******* Skipped *******
	ret = vfs_write(f.file, buf, count, ppos);
	******* Skipped *******
	return ret;
}
The presence of the vfs_read() and vfs_write() functions indicates that we’re transitioning to the VFS. These functions look up the file_operations structure for the underlying filesystem and invoke the appropriate read() and write() methods:
ssize_t vfs_read(struct file *file, char __user *buf, size_t count,
		 loff_t *pos)
{
	******* Skipped *******
	if (file->f_op->read)
		ret = file->f_op->read(file, buf, count, pos);
	else if (file->f_op->read_iter)
		ret = new_sync_read(file, buf, count, pos);
	******* Skipped *******
	return ret;
}

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count,
		  loff_t *pos)
{
	******* Skipped *******
	if (file->f_op->write)
		ret = file->f_op->write(file, buf, count, pos);
	else if (file->f_op->write_iter)
		ret = new_sync_write(file, buf, count, pos);
	******* Skipped *******
	return ret;
}
Each filesystem defines its own file_operations structure of function pointers for the operations it supports. There are multiple definitions of the file_operations structure in the kernel source code, unique to each filesystem. The operations defined in this structure describe how read, write, and other functions will be performed:
root@linuxbox:/linux-5.19.9/fs# grep -R "struct file_operations" * | wc -l
453
root@linuxbox:/linux-5.19.9/fs# grep -R "struct file_operations" *
9p/vfs_dir.c:const struct file_operations v9fs_dir_operations = {
9p/vfs_dir.c:const struct file_operations v9fs_dir_operations_dotl = {
9p/v9fs_vfs.h:extern const struct file_operations v9fs_file_operations;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_file_operations_dotl;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_dir_operations;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_dir_operations_dotl;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_cached_file_operations;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_cached_file_operations_dotl;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_mmap_file_operations;
9p/v9fs_vfs.h:extern const struct file_operations v9fs_mmap_file_operations_dotl;
[The rest of the code is skipped for brevity.]
As you can see, the file_operations structure is used for a wide range of file types, including regular files, directories, device files, and network sockets. In general, any type of file that can be opened and manipulated using standard file I/O operations can be covered by this structure.