The Linux Virtual File System (2024)


8. The Linux Virtual File System

The Linux kernel implements the concept of a Virtual File System (VFS, originally Virtual Filesystem Switch), so that it is (to a large degree) possible to separate actual "low-level" filesystem code from the rest of the kernel. The API of a filesystem is described below.

This API was designed with things closely related to the ext2 filesystem in mind. For very different filesystems, like NFS, there are all kinds of problems.

Four main objects: superblock, dentries, inodes, files

The kernel keeps track of files using in-core inodes ("index nodes"), usually derived by the low-level filesystem from on-disk inodes.

A file may have several names, and there is a layer of dentries ("directory entries") that represent pathnames, speeding up the lookup operation.

Several processes may have the same file open for reading or writing, and file structures contain the required information such as the current file position.

Access to a filesystem starts by mounting it. This operation takes a filesystem type (like ext2, vfat, iso9660, nfs) and a device and produces the in-core superblock that contains the information required for operations on the filesystem; a third ingredient, the mount point, specifies what pathname refers to the root of the filesystem.

Auxiliary objects

We have filesystem types, used to connect the name of the filesystem to the routines for setting it up (at mount time) or tearing it down (at umount time).

A struct vfsmount represents a subtree in the big file hierarchy - basically a pair (device, mountpoint).

A struct nameidata represents the result of a lookup.

A struct address_space gives the mapping between the blocks in a file and blocks on disk. It is needed for I/O.

8.1 Terminology

Various objects play a role here. There are file systems, organized collections of files, usually on some disk partition. And there are filesystem types, abstract descriptions of the way data is organized in a filesystem of that type, like FAT16 or ext2. And there is code, perhaps a module, that implements the handling of file systems of a given type. Sometimes this code is called a low-level filesystem, low-level since it sits below the VFS just like low-level SCSI drivers sit below the higher SCSI layers.

8.2 Filesystem type registration

A module implementing a filesystem type must announce its presence so that it can be used. Its task is (i) to have a name, (ii) to know how it is mounted, (iii) to know how to look up files, (iv) to know how to find (read, write) file contents.

This announcing is done using the call register_filesystem(), either at kernel initialization time or when the module is inserted. There is a single argument, a struct that contains the name of the filesystem type (so that the kernel knows when to invoke it) and a routine that can produce a superblock.

The struct is of type struct file_system_type. Here the 2.2.17 version:

    struct file_system_type {
            const char *name;
            int fs_flags;
            struct super_block *(*read_super) (struct super_block *, void *, int);
            struct file_system_type *next;
    };

The call register_filesystem() hangs this struct in the chain with head file_systems, and unregister_filesystem() removes it again.

Accesses to this chain are protected by the read-write lock file_systems_lock. There are no other writers. The main reader is of course the mount() system call (via get_fs_type()). Other readers are get_filesystem_list() used for /proc/filesystems, and the sysfs system call.

The code is in fs/filesystems.c.

    static struct file_system_type tue_fs_type = {
            .owner          = THIS_MODULE,
            .name           = "tue",
            .get_sb         = tue_get_sb,
            .kill_sb        = kill_block_super,
            .fs_flags       = FS_REQUIRES_DEV,
    };

    static int __init init_tue_fs(void)
    {
            return register_filesystem(&tue_fs_type);
    }

    static void __exit exit_tue_fs(void)
    {
            unregister_filesystem(&tue_fs_type);
    }

8.3 Struct file_system_type

    struct file_system_type {
            const char *name;
            int fs_flags;
            struct super_block *(*get_sb)(struct file_system_type *,
                    int, char *, void *, struct vfsmount *);
            void (*kill_sb) (struct super_block *);
            struct module *owner;
            struct file_system_type *next;
            struct list_head fs_supers;
            struct lock_class_key s_lock_key;
            struct lock_class_key s_umount_key;
    };

(In 2.4 there was no kill_sb(), and the role of get_sb() was taken by read_super(). The final parameter of get_sb() and the lock_class_key fields are present since 2.6.18.)

Let us look at the fields of the struct file_system_type.

name

Here the filesystem type gives its name ("tue"), so that the kernel can find it when someone does mount -t tue /dev/foo /dir. (The name is the third parameter of the mount system call.) It must be non-NULL. The name string lives in module space. Access must be protected either by a reference to the module, or by the file_systems_lock.

get_sb

At mount time the kernel calls the fstype->get_sb() routine that initializes things and sets up a superblock. It must be non-NULL. Typically this is a 1-line routine that calls one of get_sb_bdev, get_sb_single, get_sb_nodev, get_sb_pseudo.

The routines get_sb_single and get_sb_nodev are almost identical. Both are for virtual filesystems. The former is used when there can be at most one instance of the filesystem. (Now an old instance is used if there is one, but its flags may be changed.)

kill_sb

At umount time the kernel calls the fstype->kill_sb() routine to clean up. It must be non-NULL. Typically one of kill_block_super, kill_anon_super, kill_litter_super.

The first is normal for filesystems backed by block devices. The second for virtual filesystems, where the information is generated on the fly. The third for in-memory filesystems without backing store - they need an additional dget() when a file is created (so that their dentries always have a nonzero reference count and are not garbage collected), and the d_genocide() that is the difference between kill_anon_super and kill_litter_super does the balancing dput().

fs_flags

The fs_flags field of a struct file_system_type is a bitmap, an OR of several possible flags with mostly obscure uses only. The flags are defined in fs.h. This field was introduced in 2.1.43. The number of flags, and their meanings, vary. In 2.6.19 there are the four flags FS_REQUIRES_DEV, FS_BINARY_MOUNTDATA, FS_REVAL_DOT, FS_RENAME_DOES_D_MOVE.

FS_REQUIRES_DEV
The FS_REQUIRES_DEV flag (since 2.1.43) says that this is not a virtual filesystem - an actual underlying block device is required. It is used in only two places: when /proc/filesystems is generated, its absence causes the filesystem type name to be prefixed by "nodev". And in fs/nfsd/export.c this flag is tested in the process of determining whether the filesystem can be exported via NFS. Earlier there were more uses.
FS_BINARY_MOUNTDATA
The FS_BINARY_MOUNTDATA flag (since 2.6.5) is set to tell the selinux code that the mount data is binary, and cannot be handled by the standard option parser. (This flag is set for afs, coda, nfs, smbfs.)
FS_REVAL_DOT
The FS_REVAL_DOT flag (since 2.6.0test4) is set to tell the VFS code (in namei.c) to revalidate the paths "/", ".", ".." since they might have gone stale. (This flag is set for NFS.)
FS_RENAME_DOES_D_MOVE
The FS_RENAME_DOES_D_MOVE flag (since 2.6.19) says that the low-level filesystem will handle d_move() during a rename(). Earlier (2.4.0test6-2.6.19) this was called FS_ODD_RENAME and was used for NFS only, but now this is also useful for ocfs2. See also the discussion of silly rename.
FS_NOMOUNT (gone)
The FS_NOMOUNT flag (2.3.99pre7-2.5.22) says that this filesystem must never be mounted from userland, but is used only kernel-internally. This was used, for example, for pipefs, the implementation of Unix pipes using a kernel-internal filesystem (see fs/pipe.c). Even though the flag has disappeared, the concept remains, and is now represented by the MS_NOUSER flag.
FS_LITTER (gone)
The FS_LITTER flag (2.4.0test3-2.5.7) says that after umount a d_genocide() is needed. This will remove one reference from all dentries in that tree, probably killing all of them, which is necessary in case at creation time the dentries already got reference count 1. (This is typically done for an in-core filesystem where dentries cannot be recreated when needed.) This flag disappeared in Linux 2.5.7 when the explicit kill_super method kill_litter_super was introduced.
FS_SINGLE (gone)
The FS_SINGLE flag (2.3.99pre7-2.5.4) says that there is only a single superblock for this filesystem type, so that only a single instance of this filesystem may exist, possibly mounted in several places.
FS_IBASKET and FS_NO_DCACHE and FS_NO_PRELIM (gone)
The FS_IBASKET was defined in 2.1.43 but never used, and the definition disappeared in 2.3.99pre4. The FS_NO_DCACHE and FS_NO_PRELIM flags were introduced in 2.1.43, but were a mistake and disappeared again in 2.1.44. However, the definitions survived until Linux 2.5.22. For the purposes of these flags, see the comment in 2.1.43:dcache.c.
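The "nodev" prefix described under FS_REQUIRES_DEV can be sketched in user space as follows. This is only an illustration: struct toy_fs_type, format_fs_line() and the flag value are assumptions made for the example, and the tab-separated layout is modeled on what /proc/filesystems typically shows rather than quoted from the kernel source.

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative value only; the real definition lives in the kernel's fs.h. */
#define FS_REQUIRES_DEV 1

/* Toy stand-in for the two fields of struct file_system_type used here. */
struct toy_fs_type {
    const char *name;
    int fs_flags;
};

/*
 * Format one /proc/filesystems-style line into buf.
 * A type without FS_REQUIRES_DEV gets the "nodev" prefix.
 * Returns the number of characters snprintf() would have written.
 */
int format_fs_line(const struct toy_fs_type *fs, char *buf, size_t size)
{
    assert(fs != NULL && buf != NULL);
    return snprintf(buf, size, "%s\t%s\n",
                    (fs->fs_flags & FS_REQUIRES_DEV) ? "" : "nodev",
                    fs->name);
}
```

With this sketch a disk-based type such as ext2 yields a line starting with a tab, while a virtual type such as proc yields a line starting with "nodev".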

owner

The owner field of a struct file_system_type points at the module that owns this struct. When doing things that might sleep, we must make sure that the module is not unloaded while we are using its data, and do this with try_inc_mod_count(owner). If this fails then the module was just unloaded. If it succeeds we have incremented a reference count so that the module will not go away before we are done.

This field is NULL for filesystems compiled into the kernel.

Example of the use of owner - sysfs

There exists a strange SYSV system call sysfs that will return (i) a sequence number given a filesystem type, and (ii) a filesystem type given a sequence number, and (iii) the total number of filesystem types registered now. This call is not supported by libc or glibc.

These sequence numbers are rather meaningless since they may change any moment. But this means that one can get a snapshot of the list of filesystem types without looking at /proc/filesystems. For example, the program

    #include <stdio.h>
    #include <stdlib.h>             /* for exit() */
    #include <linux/unistd.h>

    /* define the 3-arg version of sysfs() */
    static _syscall3(int,sysfs,int,option,unsigned int,fsindex,char *,buf);

    /* define the 1-arg version of sysfs() */
    static int sysfs1(int i) {
            return sysfs(i,0,NULL);
    }

    int main(void) {
            int i, tot;
            char buf[100];  /* how long is a filesystem type name?? */

            /* option 3: total number of registered filesystem types */
            tot = sysfs1(3);
            if (tot == -1) {
                    perror("sysfs(3)");
                    exit(1);
            }

            /* option 2: name of the filesystem type with index i */
            for (i=0; i<tot; i++) {
                    if (sysfs(2, i, buf)) {
                            perror("sysfs(2)");
                            exit(1);
                    }
                    printf("%2d: %s\n", i, buf);
            }
            return 0;
    }

might give output like

     0: ext2
     1: minix
     2: romfs
     3: msdos
     4: vfat
     5: proc
     6: nfs
     7: smbfs
     8: iso9660

The kernel code for copying the names to user space is instructive:

    static int fs_name(unsigned int index, char * buf)
    {
            struct file_system_type * tmp;
            int len, res;

            read_lock(&file_systems_lock);
            for (tmp = file_systems; tmp; tmp = tmp->next, index--)
                    if (index <= 0 && try_inc_mod_count(tmp->owner))
                            break;
            read_unlock(&file_systems_lock);
            if (!tmp)
                    return -EINVAL;

            /* OK, we got the reference, so we can safely block */
            len = strlen(tmp->name) + 1;
            res = copy_to_user(buf, tmp->name, len) ? -EFAULT : 0;
            put_filesystem(tmp);
            return res;
    }

In order to walk safely along a linked list we need the read lock. The routines that change links (like register_filesystem) need a write lock. Once the filesystem name with the desired index is found we cannot just copy this name to user space. Maybe the page we want to copy to was swapped out, and getting it back in core takes some time, and maybe the module is unloaded just at that point, and then, when we want to read the name we reference memory that is no longer present. The routine try_inc_mod_count() first gets the module unload lock, then looks whether the module still is present; if so it increases the module's refcount and returns 1 (after releasing the unload lock), otherwise it returns 0. After a successful return of try_inc_mod_count() we own a reference to the module, so that it cannot disappear while we are doing copy_to_user(). The put_filesystem() decreases the module's refcount again.

So this is how the owner field is used: it tells which module must be pinned when we do something with this struct. A module stays as long as its refcount is positive, but can disappear any moment when the refcount becomes zero.
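The pinning protocol can be sketched in a few lines of single-threaded user-space C. All names here (toy_module, toy_try_get, toy_put, toy_try_unload) are hypothetical stand-ins, and the unload lock that the real try_inc_mod_count() takes is deliberately left out, so this shows only the refcount logic, not the locking.

```c
#include <assert.h>
#include <stddef.h>

/* Toy module descriptor: a refcount plus an "unload has begun" mark. */
struct toy_module {
    int refcount;
    int unloading;
};

/*
 * Try to pin the module.  Returns 1 and takes a reference when the
 * module is still present, 0 when it is going away.  A NULL owner
 * models a filesystem compiled into the kernel: nothing to pin.
 */
int toy_try_get(struct toy_module *owner)
{
    if (owner == NULL)
        return 1;
    if (owner->unloading)
        return 0;
    owner->refcount++;
    return 1;
}

/* Drop a reference taken by toy_try_get(). */
void toy_put(struct toy_module *owner)
{
    if (owner == NULL)
        return;
    assert(owner->refcount > 0);
    owner->refcount--;
}

/* Unloading succeeds only while nobody holds a reference. */
int toy_try_unload(struct toy_module *m)
{
    if (m->refcount > 0)
        return 0;
    m->unloading = 1;
    return 1;
}
```

Between a successful toy_try_get() and the matching toy_put() the unload attempt fails, which is the property that makes a sleeping copy_to_user() safe in fs_name().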

In fs/filesystems.c there is a global variable

    static struct file_system_type *file_systems;

that is the head of the list of known filesystem types. A register_filesystem adds the filesystem to the linked list, an unregister_filesystem removes it again. The field next is the link in this singly linked list. It must be NULL when register_filesystem is called, and is reset to NULL by unregister_filesystem. The list is protected by the file_systems_lock.
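The list handling just described can be sketched as below. This is a user-space model with hypothetical names (toy_fs_type, toy_register, toy_unregister); locking is omitted, and the particular error codes are an assumption of the sketch rather than something stated in the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;   /* link in the singly linked list */
};

static struct toy_fs_type *toy_file_systems;   /* head of the list */

/*
 * Return the address of the link that points at the entry called
 * `name`, or the address of the terminating NULL link if absent.
 */
static struct toy_fs_type **toy_find(const char *name)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append fs to the list; next must be NULL and the name unused. */
int toy_register(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    assert(fs != NULL && fs->name != NULL);
    if (fs->next)
        return -EBUSY;
    p = toy_find(fs->name);
    if (*p)
        return -EBUSY;
    *p = fs;
    return 0;
}

/* Unlink fs and reset its next field to NULL. */
int toy_unregister(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    }
    return -EINVAL;
}
```

Walking with a pointer to the link, rather than a pointer to the element, means that insertion at the tail and removal from the middle need no special case for the head.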

fs_supers

The fs_supers field of a struct file_system_type is the head of a list of all superblocks of this type. In each superblock the corresponding link is called s_instances. This list is protected by the spinlock sb_lock. This list is used in sget() for filesystems like NFS where we get a filehandle and must check each superblock of the given type whether it is the right one.

s_lock_key, s_umount_key

These are fields used when CONFIG_LOCKDEP is defined, and take no space otherwise. Used for lock validation.

8.4 Mounting

The mount system call attaches a filesystem to the big file hierarchy at some indicated point. Ingredients needed: (i) a device that carries the filesystem (disk, partition, floppy, CDROM, SmartMedia card, ...), (ii) a directory where the filesystem on that device must be attached, (iii) a filesystem type.

In many cases it is possible to guess (iii) given the bits on the device, but heuristics fail in rare cases. Moreover, sometimes there is no difference on the device, as for example in the case where a FAT filesystem without long filenames must be mounted. Is it msdos? or vfat? That information is only in the user's head. If it must be used later in an environment that cannot handle long filenames it should be mounted as msdos; if files with long names are going to be copied to it, as vfat.

The kernel does not guess (except perhaps at boot time, when the root device has to be found), and requires the three ingredients. In fact the mount system call has five parameters: there are also mount flags (like "read-only") and options, like for ext2 the choice between errors=continue and errors=remount-ro and errors=panic.
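Seen from user space, the five parameters look as follows. This is a minimal sketch using the mount() wrapper declared in <sys/mount.h>; the helper name mount_ext2_readonly() and the choice of ext2 with errors=remount-ro are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Mount the ext2 filesystem on `dev` read-only at `dir`.
 * The five arguments are: device, mount point, filesystem type,
 * mount flags, and a type-specific option string.
 * Returns 0 on success, -1 on failure (errno is set by mount()).
 */
int mount_ext2_readonly(const char *dev, const char *dir)
{
    assert(dev != NULL && dir != NULL);
    if (mount(dev, dir, "ext2", MS_RDONLY, "errors=remount-ro") == -1) {
        fprintf(stderr, "mount %s on %s: %s\n", dev, dir, strerror(errno));
        return -1;
    }
    return 0;
}
```

Note that the filesystem type is the third argument, and that the call needs appropriate privileges; an unprivileged caller simply gets -1 back.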

The code for sys_mount() is found in fs/namespace.c and fs/super.c. The connection with the filesystem type name is made in do_kern_mount():

    struct file_system_type *type = get_fs_type(fstype);
    struct super_block *sb;

    if (!type)
            return ERR_PTR(-ENODEV);
    sb = type->get_sb(type, flags, name, data);

and this is the only call of the get_sb() routine.

The code for sys_umount() is found in fs/namespace.c and fs/super.c. The counterpart of the just quoted code is the cleanup in deactivate_super():

 fs->kill_sb(s);

and this is the only call of the kill_sb() routine.

8.5 The superblock

The superblock gives global information on a filesystem: the device on which it lives, its block size, its type, the dentry of the root of the filesystem, the methods it has, etc., etc.

    struct super_block {
            dev_t s_dev;
            unsigned long s_blocksize;
            struct file_system_type *s_type;
            struct super_operations *s_op;
            struct dentry *s_root;
            ...
    };

    struct super_operations {
            struct inode *(*alloc_inode)(struct super_block *sb);
            void (*destroy_inode)(struct inode *);
            void (*read_inode) (struct inode *);
            void (*dirty_inode) (struct inode *);
            void (*write_inode) (struct inode *, int);
            void (*put_inode) (struct inode *);
            void (*drop_inode) (struct inode *);
            void (*delete_inode) (struct inode *);
            void (*put_super) (struct super_block *);
            void (*write_super) (struct super_block *);
            int (*sync_fs)(struct super_block *sb, int wait);
            void (*write_super_lockfs) (struct super_block *);
            void (*unlockfs) (struct super_block *);
            int (*statfs) (struct super_block *, struct statfs *);
            int (*remount_fs) (struct super_block *, int *, char *);
            void (*clear_inode) (struct inode *);
            void (*umount_begin) (struct super_block *);
            int (*show_options)(struct seq_file *, struct vfsmount *);
    };

This is enough to get started: the dentry of the root directory tells us the inode of this root directory (and in particular its i_ino), and sb->s_op->read_inode(inode) will read this inode from disk. Now inode->i_op->lookup() allows us to find names in the root directory, etc.

Each superblock is on six lists, with links through the fields s_list, s_dirty, s_io, s_anon, s_files, s_instances, respectively.
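The call sb->s_op->read_inode(inode) is an ordinary dispatch through a table of function pointers. The following user-space sketch shows the pattern; the toy_ structures, the foofs filesystem and its pretend inode sizes are all invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;

/* Method table filled in by the low-level filesystem. */
struct toy_super_operations {
    void (*read_inode)(struct toy_inode *);
};

struct toy_super_block {
    const struct toy_super_operations *s_op;
};

struct toy_inode {
    unsigned long i_ino;
    long i_size;
    struct toy_super_block *i_sb;
};

/* Hypothetical low-level routine: pretends inode n is n*100 bytes long. */
static void foofs_read_inode(struct toy_inode *inode)
{
    inode->i_size = (long)inode->i_ino * 100;
}

const struct toy_super_operations foofs_sops = {
    .read_inode = foofs_read_inode,
};

/*
 * Generic layer: fills in the fields it knows about and leaves the
 * rest to the filesystem, which it reaches only through sb->s_op.
 */
void toy_fill_inode(struct toy_super_block *sb, struct toy_inode *inode,
                    unsigned long ino)
{
    assert(sb != NULL && sb->s_op != NULL && sb->s_op->read_inode != NULL);
    inode->i_ino = ino;
    inode->i_sb = sb;
    sb->s_op->read_inode(inode);
}
```

The generic code never mentions foofs by name; a different filesystem only has to supply a different table.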

The super_blocks list

All superblocks are collected in a list super_blocks with links in the fields s_list. This list is protected by the spinlock sb_lock. The main use is in super.c:get_super() or user_get_super() to find the superblock for a given block device. (Both routines are identical, except that one takes a bdev, the other a dev_t.) This list is also used in various places where all superblocks must be sync'ed or all dirty inodes must be written out.

The fs_supers list

All superblocks of a given type are collected in a list headed by the fs_supers field of the struct file_system_type, with links in the fields s_instances. Also this list is protected by the spinlock sb_lock. See above.

The file list

All open files belonging to a given superblock are chained in a list headed by the s_files field of the superblock, with links in the fields f_list of the files. These lists are protected by the spinlock files_lock. This list is used for example in fs_may_remount_ro() to check that there are no files currently open for writing. See also below.

The list of anonymous dentries

Normally, all dentries are connected to root. However, when NFS filehandles are used this need not be the case. Dentries that are roots of subtrees potentially unconnected to root are chained in a list headed by the s_anon field of the superblock, with links in the fields d_hash. These lists are protected by the spinlock dcache_lock. They are grown in dcache.c:d_alloc_anon() and shrunk in super.c:generic_shutdown_super(). See the discussion in Documentation/filesystems/Exporting.

The inode lists s_dirty, s_io

Lists of inodes to be written out. These lists are headed at the s_dirty (resp. s_io) field of the superblock, with links in the fields i_list. These lists are protected by the spinlock inode_lock. See fs/fs-writeback.c.

8.6 Inodes

An (in-core) inode contains the metadata of a file: its serial number, its protection (mode), its owner, its size, the dates of last access, last status change and last modification, etc. It also points to the superblock of the filesystem the file is in, the methods for this file, and the dentries (names) for this file.

    struct inode {
            unsigned long i_ino;
            umode_t i_mode;
            uid_t i_uid;
            gid_t i_gid;
            kdev_t i_rdev;
            loff_t i_size;
            struct timespec i_atime;
            struct timespec i_ctime;
            struct timespec i_mtime;
            struct super_block *i_sb;
            struct inode_operations *i_op;
            struct address_space *i_mapping;
            struct list_head i_dentry;
            ...
    };

In early times, struct inode would end with a union

    union {
            struct minix_inode_info minix_i;
            struct ext2_inode_info ext2_i;
            struct ext3_inode_info ext3_i;
            struct hpfs_inode_info hpfs_i;
            ...
    } u;

to store the filesystem-type-specific stuff. One could go from inode to e.g. struct ext3_inode_info via inode->u.ext3_i. This setup was rather unsatisfactory, since it meant that a core data structure had to know about all possible filesystem types (even possible out-of-tree ones) and reserve enough room for the largest one among the struct foofs_inode_info. It also wasted memory.

In Linux 2.5.3 this system was changed, and instead of a big struct inode having a filesystem-type-dependent part, we now have big filesystem-type-dependent inodes, with a VFS part. Thus, struct ext3_inode_info has as its last field struct inode vfs_inode;, and given the VFS inode inode one finds the ext3 information via EXT3_I(inode), defined as container_of(inode, struct ext3_inode_info, vfs_inode). See also the discussion of container_of.
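The step from the embedded VFS inode back to the enclosing filesystem-specific struct is plain pointer arithmetic. Here is a user-space sketch with a simplified macro; toy_container_of, struct toy_inode and the foofs names are made up for the example, and the kernel's own container_of contains additional type checking that is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's container_of(). */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic part, known to the VFS layer. */
struct toy_inode {
    unsigned long i_ino;
};

/* Filesystem-specific inode with the generic part embedded last. */
struct foofs_inode_info {
    unsigned int i_flags;         /* private to foofs */
    struct toy_inode vfs_inode;   /* what the VFS gets to see */
};

/* From the generic inode back to the foofs inode that contains it. */
static inline struct foofs_inode_info *FOOFS_I(struct toy_inode *inode)
{
    assert(inode != NULL);
    return toy_container_of(inode, struct foofs_inode_info, vfs_inode);
}
```

The generic layer only ever handles pointers to struct toy_inode, so it needs no knowledge of the size or layout of any filesystem's private data.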

The methods of an inode are given in the struct inode_operations.

    struct inode_operations {
            int (*create) (struct inode *, struct dentry *, int);
            struct dentry * (*lookup) (struct inode *, struct dentry *);
            int (*link) (struct dentry *, struct inode *, struct dentry *);
            int (*unlink) (struct inode *, struct dentry *);
            int (*symlink) (struct inode *, struct dentry *, const char *);
            int (*mkdir) (struct inode *, struct dentry *, int);
            int (*rmdir) (struct inode *, struct dentry *);
            int (*mknod) (struct inode *, struct dentry *, int, dev_t);
            int (*rename) (struct inode *, struct dentry *,
                           struct inode *, struct dentry *);
            int (*readlink) (struct dentry *, char *,int);
            int (*follow_link) (struct dentry *, struct nameidata *);
            void (*truncate) (struct inode *);
            int (*permission) (struct inode *, int);
            int (*setattr) (struct dentry *, struct iattr *);
            int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
            int (*setxattr) (struct dentry *, const char *, const void *, size_t, int);
            ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
            ssize_t (*listxattr) (struct dentry *, char *, size_t);
            int (*removexattr) (struct dentry *, const char *);
    };

Each inode is on four lists, with links through the fields i_hash, i_list, i_dentry, i_devices.

The dentry list

All dentries belonging to this inode (names for this file) are collected in a list headed by the inode field i_dentry with links in the dentry fields d_alias. This list is protected by the spinlock dcache_lock.

The hash list

All inodes live in a hash table, with hash collision chains through the field i_hash of the inode. These lists are protected by the spinlock inode_lock. The appropriate head is found by a hash function; it will be an element of the inode_hashtable[] array when the inode belongs to a superblock, or anon_hash_chain if not.

i_list

Inodes are collected into lists that use the i_list field as link field. The lists are protected by the spinlock inode_lock. An inode is either unused, and then on the chain with head inode_unused, or in use but not dirty, and then on the chain with head inode_in_use, or dirty, and then on one of the per-superblock lists with heads s_dirty or s_io, see above.

i_devices

Inodes belonging to a given block device are collected into a list headed by the bd_inodes field of the block device, with links in the inode i_devices fields. The list is protected by the bdev_lock spinlock. It is used to set the i_bdev field to NULL and to reset i_mapping when the block device goes away.

8.7 Dentries

The dentries encode the filesystem tree structure, the names of the files. Thus, the main parts of a dentry are the inode (if any) that belongs to it, the name (the final part of the pathname), and the parent (the name of the containing directory). There are also the superblock, the methods, a list of subdirectories, etc.

    struct dentry {
            struct inode *d_inode;
            struct dentry *d_parent;
            struct qstr d_name;
            struct super_block *d_sb;
            struct dentry_operations *d_op;
            struct list_head d_subdirs;
            ...
    };

    struct dentry_operations {
            int (*d_revalidate)(struct dentry *, int);
            int (*d_hash) (struct dentry *, struct qstr *);
            int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
            int (*d_delete)(struct dentry *);
            void (*d_release)(struct dentry *);
            void (*d_iput)(struct dentry *, struct inode *);
    };

Here the strings are given by

    struct qstr {
            const unsigned char *name;
            unsigned int len;
            unsigned int hash;
    };

Each dentry is on five lists, with links through the fields d_hash, d_lru, d_child, d_subdirs, d_alias.

Naming flaw

Some of these names were badly chosen, and lead to confusion. We should do a global replace changing d_subdirs into d_children and d_child into d_sibling.

Value of a dentry

The pathname represented by a dentry is the concatenation of the name of its parent d_parent, a slash character, and its own name d_name.

However, if the dentry is the root of a mounted filesystem (i.e., if dentry->d_covers != dentry), then its pathname is the pathname of the mount point d_covers. Finally, the pathname of the root of the filesystem (with dentry->d_parent == dentry) is "/", and this is also its d_name.

The d_mounts and d_covers fields of a dentry point back to the dentry itself, except that the d_covers field of the dentry for the root of a mounted filesystem points back to the dentry for the mount point, while the d_mounts field of the dentry for the mount point points at the dentry for the root of a mounted filesystem.

The d_parent field of a dentry points back to the dentry for the directory in which it lives. It points back to the dentry itself in case of the root of a filesystem.
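The rule for the value of a dentry can be tried out on a toy structure. The sketch below walks the d_parent pointers up to the root and builds the pathname from right to left; struct toy_dentry and toy_path() are hypothetical, and mount points (d_covers, d_mounts) are ignored, so this covers a single filesystem only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_dentry {
    const char *d_name;
    struct toy_dentry *d_parent;   /* the root points at itself */
};

/*
 * Write the pathname of d at the end of buf and return a pointer to
 * its first character, or NULL when buf is too small.
 */
char *toy_path(const struct toy_dentry *d, char *buf, size_t size)
{
    char *p;

    assert(d != NULL && buf != NULL && size >= 2);
    p = buf + size - 1;
    *p = '\0';

    if (d->d_parent == d) {        /* the root itself is "/" */
        *--p = '/';
        return p;
    }
    while (d->d_parent != d) {     /* prepend "/name" for each level */
        size_t len = strlen(d->d_name);

        if ((size_t)(p - buf) < len + 1)
            return NULL;
        p -= len;
        memcpy(p, d->d_name, len);
        *--p = '/';
        d = d->d_parent;
    }
    return p;
}
```

Building the string backwards avoids a first pass to find the depth of the dentry: each step only needs the current dentry and its parent.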

A dentry is called negative if it does not have an associated inode, i.e., if it is a name only.

We see that although a dentry represents a pathname, there may be several dentries for the same pathname, namely when overmounting has taken place. Such dentries have different inodes.

Of course the converse, an inode with several dentries, can also occur.

The above description, with d_mounts and d_covers, was for 2.4. In 2.5 these fields have disappeared, and we only have the integer d_mounted that indicates how many filesystems have been mounted at that point. In case it is nonzero (this is what d_mountpoint() tests), a hash table lookup can find the actual mounted filesystem.

d_hash

Dentries are used to speed up the lookup operation. A hash table dentry_hashtable is used, with an index that is a hash of the name and the parent. The hash collision chain has links through the dentry fields d_hash. This chain is protected by the spinlock dcache_lock.
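A lookup keyed on (parent, name) can be sketched as follows. Everything here is a toy: the hash function is invented for the example and is not the one the kernel uses, the chain is a simple singly linked list rather than the kernel's list type, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_BITS 6
#define TOY_HASH_SIZE (1u << TOY_HASH_BITS)

struct toy_dentry {
    const char *d_name;
    struct toy_dentry *d_parent;
    struct toy_dentry *d_hash;     /* next entry in the collision chain */
};

static struct toy_dentry *toy_hashtable[TOY_HASH_SIZE];

/* Hypothetical hash mixing the parent pointer with the name bytes. */
static unsigned int toy_hash(const struct toy_dentry *parent, const char *name)
{
    uintptr_t h = (uintptr_t)parent >> 4;

    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return (unsigned int)(h & (TOY_HASH_SIZE - 1));
}

/* Insert d at the front of its collision chain. */
void toy_d_add(struct toy_dentry *d)
{
    unsigned int i;

    assert(d != NULL && d->d_name != NULL);
    i = toy_hash(d->d_parent, d->d_name);
    d->d_hash = toy_hashtable[i];
    toy_hashtable[i] = d;
}

/* Find the dentry called `name` in directory `parent`, or NULL. */
struct toy_dentry *toy_d_lookup(const struct toy_dentry *parent,
                                const char *name)
{
    struct toy_dentry *d;

    for (d = toy_hashtable[toy_hash(parent, name)]; d; d = d->d_hash)
        if (d->d_parent == parent && strcmp(d->d_name, name) == 0)
            return d;
    return NULL;
}
```

Because the parent takes part in both the hash and the comparison, two entries with the same name in different directories are kept apart.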

d_lru

All unused dentries are collected in a list dentry_unused with links in the dentry fields d_lru. This list is protected by the spinlock dcache_lock.

d_child, d_subdirs

All subdirectories of a given directory are collected in a list headed by the dentry field d_subdirs with links in the dentry fields d_child. These lists are protected by the spinlock dcache_lock.
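The head-in-the-parent, link-in-the-child arrangement is perhaps easiest to see in a small model. The sketch below reimplements a minimal cyclic doubly linked list in user space; toy_list_head and the helper functions are stand-ins written for this example, not the kernel's list.h.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal cyclic doubly linked list, in the style of list_head. */
struct toy_list_head {
    struct toy_list_head *next, *prev;
};

static void toy_list_init(struct toy_list_head *h)
{
    h->next = h->prev = h;
}

/* Insert entry right after head. */
static void toy_list_add(struct toy_list_head *entry,
                         struct toy_list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

struct toy_dentry {
    const char *d_name;
    struct toy_dentry *d_parent;
    struct toy_list_head d_subdirs;   /* head: list of our children */
    struct toy_list_head d_child;     /* link: our place in parent's list */
};

/* Set up d and, unless it is a root, hang it under parent. */
void toy_dentry_init(struct toy_dentry *d, const char *name,
                     struct toy_dentry *parent)
{
    assert(d != NULL);
    d->d_name = name;
    d->d_parent = parent ? parent : d;
    toy_list_init(&d->d_subdirs);
    toy_list_init(&d->d_child);
    if (parent)
        toy_list_add(&d->d_child, &parent->d_subdirs);
}

/* Count the children by walking the cycle until we are back at the head. */
int toy_count_children(const struct toy_dentry *dir)
{
    const struct toy_list_head *p;
    int n = 0;

    for (p = dir->d_subdirs.next; p != &dir->d_subdirs; p = p->next)
        n++;
    return n;
}

/* Map the first link on the list back to the child that contains it. */
struct toy_dentry *toy_first_child(struct toy_dentry *dir)
{
    if (dir->d_subdirs.next == &dir->d_subdirs)
        return NULL;
    return (struct toy_dentry *)((char *)dir->d_subdirs.next
                                 - offsetof(struct toy_dentry, d_child));
}
```

Seen this way, d_subdirs really holds the children and d_child links the siblings, which is the point of the renaming suggested under "Naming flaw".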

d_alias

All dentries belonging to the same inode are collected in a list headed by the inode field i_dentry with links in the dentry fields d_alias. This list is protected by the spinlock dcache_lock.

8.8 Files

File structures represent open files, that is, an inode together with a current (reading/writing) offset. The offset can be set by the lseek() system call. Note that instead of a pointer to the inode we have a pointer to the dentry - that means that the name used to open a file is known. In particular system calls like getcwd() are possible.
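That the offset belongs to the open file and not to the inode can be observed from user space. In the sketch below the same file is opened twice, which gives two file structures for one inode; writing through one descriptor moves only that descriptor's offset. The function name toy_two_opens() and the temporary path are assumptions of the example.

```c
#define _POSIX_C_SOURCE 200809L   /* for mkstemp() under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Open one temporary file twice, write 5 bytes through the first
 * descriptor, and report both current offsets.
 * Returns 0 on success, -1 on any failure.
 */
int toy_two_opens(long *pos_a, long *pos_b)
{
    char path[] = "/tmp/vfs-demo-XXXXXX";
    int fd_a, fd_b, ok;

    assert(pos_a != NULL && pos_b != NULL);

    fd_a = mkstemp(path);              /* first open: read/write */
    if (fd_a == -1)
        return -1;
    fd_b = open(path, O_RDONLY);       /* second, independent open */
    unlink(path);                      /* the name is no longer needed */
    if (fd_b == -1) {
        close(fd_a);
        return -1;
    }

    ok = write(fd_a, "hello", 5) == 5;
    *pos_a = (long)lseek(fd_a, 0, SEEK_CUR);   /* advanced by the write */
    *pos_b = (long)lseek(fd_b, 0, SEEK_CUR);   /* untouched */

    close(fd_a);
    close(fd_b);
    return ok ? 0 : -1;
}
```

A descriptor obtained with dup() would behave differently, since it shares the file structure, and hence the offset, with the original.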

    struct file {
            struct dentry *f_dentry;
            struct vfsmount *f_vfsmnt;
            struct file_operations *f_op;
            mode_t f_mode;
            loff_t f_pos;
            struct fown_struct f_owner;
            unsigned int f_uid, f_gid;
            unsigned long f_version;
            ...
    };

Here the f_owner field gives the owner to use for async I/O signals.

    struct file_operations {
            struct module *owner;
            loff_t (*llseek) (struct file *, loff_t, int);
            ssize_t (*read) (struct file *, char *, size_t, loff_t *);
            ssize_t (*aio_read) (struct kiocb *, char *, size_t, loff_t);
            ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
            ssize_t (*aio_write) (struct kiocb *, const char *, size_t, loff_t);
            int (*readdir) (struct file *, void *, filldir_t);
            unsigned int (*poll) (struct file *, struct poll_table_struct *);
            int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
            int (*mmap) (struct file *, struct vm_area_struct *);
            int (*open) (struct inode *, struct file *);
            int (*flush) (struct file *);
            int (*release) (struct inode *, struct file *);
            int (*fsync) (struct file *, struct dentry *, int datasync);
            int (*aio_fsync) (struct kiocb *, int datasync);
            int (*fasync) (int, struct file *, int);
            int (*lock) (struct file *, int, struct file_lock *);
            ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
            ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
            ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
            ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
            unsigned long (*get_unmapped_area)(struct file *, unsigned long,
                    unsigned long, unsigned long, unsigned long);
    };

Each file is in two lists, with links through the fields f_list, f_ep_links.

f_list

The list with links through f_list was discussed above. It is the list of all files belonging to a given superblock. There is a second use: the tty driver collects all files that are opened instances of a tty in a list headed by tty->tty_files with links through the file field f_list. Conversely, these files point back at the tty via their field private_data.

(This field private_data is also used elsewhere. For example, the proc code uses it to attach a struct seq_file to a file.)

The event poll list

All event poll items belonging to a given file are collected in a list with head f_ep_links, protected by the file field f_ep_lock. (For event poll stuff, see epoll_ctl(2).)

8.9 struct vfsmount

A struct vfsmount describes a mount. The definition lives in mount.h:

    struct vfsmount {
            struct list_head mnt_hash;
            struct vfsmount *mnt_parent;    /* fs we are mounted on */
            struct dentry *mnt_mountpoint;  /* dentry of mountpoint */
            struct dentry *mnt_root;        /* root of the mounted tree */
            struct super_block *mnt_sb;     /* pointer to superblock */
            struct list_head mnt_mounts;    /* list of children, anchored here */
            struct list_head mnt_child;     /* and going through their mnt_child */
            atomic_t mnt_count;
            int mnt_flags;
            char *mnt_devname;              /* Name of device e.g. /dev/dsk/hda1 */
            struct list_head mnt_list;
    };

Long ago (1.3.46) it was introduced as part of the quota code. There was a linked list of struct vfsmounts that contained a device number, device name, mount point name, mount flags, superblock pointer, semaphore, file pointers to quota files and time limits for how long an over-quota situation would be allowed. Nowadays quota have independent bookkeeping, and a struct vfsmount only describes a mount.

These structs are allocated by alloc_vfsmnt() and released by free_vfsmnt() in namespace.c.

mnt_hash

Vfsmounts live in a hash headed by mount_hashtable[]. The field mnt_hash is the link in the collision chain. This list does not seem to be protected by a lock. They are put into the hash by attach_mnt(), found there by lookup_mnt(), and removed again by detach_mnt(), all from namespace.c.

mnt_parent

Vfsmount for parent.

mnt_mountpoint

Dentry for the mountpoint. The pair (mnt_mountpoint, mnt_parent) (returned by follow_up()) will be the dentry and vfsmount for the parent. Used e.g. in d_path to return the pathname of a dentry.

mnt_root

Dentry for the root of the mounted tree.

mnt_sb

Superblock of the mounted filesystem.

mnt_mounts, mnt_child

The field mnt_mounts of a struct vfsmount is the head of a cyclic list of all submounts (mounts on top of some path relative to the present mount). The remaining links of this cyclic list are stored in the mnt_child fields of its submounting vfsmounts. (And each of these points back at us with its mnt_parent field.) Used in autofs4/expire.c and namespace.c (and nowhere else).

mnt_count

Keep track of users of this structure. Incremented by mntget, decremented by mntput. Initially 1. It will be 2 for a mount that may be unmounted. (Autofs also uses this to test whether a tree is busy.)

mnt_flags

The mount flags, like MNT_NODEV, MNT_NOEXEC, MNT_NOSUID. Earlier also MS_RDONLY (that now is stored in sb->s_flags) and MNT_VISIBLE (came in 2.4.0-test5, went in 2.4.5) that told whether this entry should be visible in /proc/mounts.

mnt_devname

Name used in /proc/mounts.

mnt_list

There was a global cyclic list vfsmntlist containing all mounts, used only to create the contents of /proc/mounts. These days we have per-process namespaces, and the global vfsmntlist has been replaced by current->namespace->list. This list is ordered by the order in which the mounts were done, so that one can do the umounts in reverse order. The field mnt_list contains the pointers for this cyclic list.

8.10 fs_struct

A struct fs_struct determines the interpretation of pathnames referred to by a process (and also, somewhat illogically, contains the umask). The typical reference is current->fs. The definition lives in fs_struct.h:

    struct fs_struct {
            atomic_t count;
            rwlock_t lock;
            int umask;
            struct dentry * root, * pwd, * altroot;
            struct vfsmount * rootmnt, * pwdmnt, * altrootmnt;
    };

Semantics of root and pwd are clear. Remains to discuss altroot.

altroot

In order to support emulation of different operating systems like BSD and SunOS and Solaris, a small wart has been added to the walk_init_root code that finds the root directory for a name lookup.

The altroot field of an fs_struct is usually NULL. It is a function of the personality and the current root, and the sys_personality and sys_chroot system calls call set_fs_altroot().

The effect is determined at kernel compile time. One can define __emul_prefix() in <asm/namei.h> as some pathname, say "usr/gnemul/myOS/". The default is NULL, but some architectures have a definition depending on current->personality. If this prefix is non-NULL, and the corresponding file is found, then set_fs_altroot() will set the altroot and altrootmnt fields of current->fs to dentry and vfsmnt of that file.

A subsequent lookup of a pathname starting with '/' will now first try to use the altroot. If that fails the usual root is used.

8.11 nameidata

A struct nameidata represents the result of a lookup. The definition lives in fs.h:

    struct nameidata {
            struct dentry *dentry;
            struct vfsmount *mnt;
            struct qstr last;
            unsigned int flags;
            int last_type;
    };

The typical use is:

    struct nameidata nd;

    error = user_path_walk(filename, &nd);
    if (!error)
            path_release(&nd);

where path_release() does

    dput(nd->dentry);
    mntput(nd->mnt);

The core of the routines user_path_walk_link and user_path_walk (which call __user_walk without or with the LOOKUP_FOLLOW flag) is the fragment

    if (path_init(name, flags, nd))
            error = path_walk(name, nd);

So the basic routines handling nameidata are path_init and path_walk. The former finds the start of the walk, the latter does the walking. (However, the former returns 0 in case it did the walking itself already.)

path_init

The routine path_init initialises the four fields dentry, mnt, flags, last_type. The flags field was given as an argument, and dentry and mnt are initialised to those of the current directory or those of the root directory depending on whether name starts with a '/' or not. It will always return 1 except in a certain obscure case discussed below, where the return 0 means that the complete lookup was done already. (And this case cannot occur for sys_chroot, that is why the code there need not check the return value.)

A wart

(path_init will always return 1, except when name starts with a '/', in which case it returns whatever walk_init_root returns. walk_init_root will always return 1, except when current->fs->altroot is non-NULL and nd->flags does not contain LOOKUP_NOALT (for sys_chroot it does) and __emul_lookup_dentry succeeds, which it does when path_walk succeeds - in this case no path_walk is required anymore.)
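The return-value convention can be modelled in user space as follows. This is a sketch of the control flow only: struct toy_nameidata, toy_path_init(), the flag value and the predicate toy_alt_has_etc() (standing in for a successful __emul_lookup_dentry) are all invented for the example, and a NULL predicate models altroot being NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_LOOKUP_NOALT 1   /* illustrative value only */

struct toy_nameidata {
    const char *start;       /* "cwd", "root" or "altroot" */
    unsigned int flags;
    int walked;              /* nonzero once the lookup is complete */
};

/* Does the alternate root contain this name?  NULL means no altroot. */
typedef int (*toy_alt_lookup_fn)(const char *name);

/* Sample predicate: pretend the altroot only has an /etc subtree. */
int toy_alt_has_etc(const char *name)
{
    return strncmp(name, "/etc/", 5) == 0;
}

/*
 * Returns 1 when the caller still has to do the walk, 0 when the
 * whole lookup was already done via the alternate root.
 */
int toy_path_init(const char *name, unsigned int flags,
                  struct toy_nameidata *nd, toy_alt_lookup_fn alt)
{
    assert(name != NULL && nd != NULL);
    nd->flags = flags;
    nd->walked = 0;

    if (name[0] != '/') {              /* relative: start at cwd */
        nd->start = "cwd";
        return 1;
    }
    if (alt != NULL && !(flags & TOY_LOOKUP_NOALT) && alt(name)) {
        nd->start = "altroot";         /* found there: nothing left to do */
        nd->walked = 1;
        return 0;
    }
    nd->start = "root";                /* the usual case */
    return 1;
}
```

A caller would follow the pattern of the fragment quoted above: do the walk only when toy_path_init() returns nonzero.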
