Making Filesystems Exportable
=============================

Most filesystem operations require a dentry (or two) as a starting
point. Local applications have a reference-counted hold on suitable
dentries via open file descriptors or cwd/root. However, remote
applications that access a filesystem via a remote filesystem protocol
such as NFS may not be able to hold such a reference, and so need a
different way to refer to a particular dentry. As the alternative
form of reference needs to be stable across renames, truncates, and
server reboots (among other things, though these tend to be the most
problematic), there is no simple answer like 'filename'.

The mechanism discussed here allows each filesystem implementation to
specify how to generate an opaque (outside of the filesystem) byte
string for any dentry, and how to find an appropriate dentry for any
given opaque byte string.
This byte string will be called a "filehandle fragment" as it
corresponds to part of an NFS filehandle.

A filesystem which supports the mapping between filehandle fragments
and dentries will be termed "exportable".


Dcache Issues
-------------

The dcache normally contains a proper prefix of any given filesystem
tree. This means that if any filesystem object is in the dcache, then
all of the ancestors of that filesystem object are also in the dcache.
As normal access is by filename, this prefix is created naturally and
maintained easily (by each object maintaining a reference count on
its parent).

However, when objects are included into the dcache by interpreting a
filehandle fragment, there is no automatic creation of a path prefix
for the object. This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access.

1/ The dcache must sometimes contain objects that are not part of the
   proper prefix, i.e. that are not connected to the root.
2/ The dcache must be prepared for a newly found (via ->lookup) directory
   to already have a (non-connected) dentry, and must be able to move
   that dentry into place (based on the parent and name in the
   ->lookup). This is particularly needed for directories as
   it is a dcache invariant that directories only have one dentry.

To implement these features, the dcache has:

a/ A dentry flag DCACHE_DISCONNECTED which is set on
   any dentry that might not be part of the proper prefix.
   This is set when anonymous dentries are created, and cleared when a
   dentry is noticed to be a child of a dentry which is in the proper
   prefix.

b/ A per-superblock list "s_anon" of dentries which are the roots of
   subtrees that are not in the proper prefix. These dentries, as
   well as the proper prefix, need to be released at unmount time. As
   these dentries will not be hashed, they are linked together on the
   d_hash list_head.

c/ Helper routines to allocate anonymous dentries, and to help attach
   loose directory dentries at lookup time. They are:
    d_alloc_anon(inode) will return a dentry for the given inode.
      If the inode already has a dentry, one of those is returned.
      If it doesn't, a new anonymous (IS_ROOT and
      DCACHE_DISCONNECTED) dentry is allocated and attached.
      In the case of a directory, care is taken that only one dentry
      can ever be attached.
    d_splice_alias(inode, dentry) will make sure that there is a
      dentry with the same name and parent as the given dentry, and
      which refers to the given inode.
      If the inode is a directory and already has a dentry, then that
      dentry is d_moved over the given dentry.
      If the passed dentry gets attached, care is taken that this is
      mutually exclusive to a d_alloc_anon operation.
      If the passed dentry is used, NULL is returned, else the used
      dentry is returned. This corresponds to the calling pattern of
      ->lookup.
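
As an illustration only, the behaviour of d_alloc_anon and
d_splice_alias described in c/ can be modelled in userspace C. This is
not kernel code: the model_* names, the two structures and the
MODEL_DISCONNECTED flag are invented for the example, and clearing the
flag inside the splice step is a simplification made by this model.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace model only: hypothetical stand-ins, not the kernel's
 * struct dentry, struct inode or dcache routines. */
#define MODEL_DISCONNECTED 0x1u

struct model_dentry;

struct model_inode {
    int is_dir;
    struct model_dentry *alias;   /* a dentry already attached, or NULL */
};

struct model_dentry {
    struct model_dentry *parent;  /* points to itself when "IS_ROOT" */
    char name[32];
    unsigned flags;
    struct model_inode *inode;
};

/* Model of d_alloc_anon: reuse an existing dentry if the inode has one,
 * otherwise attach the caller-supplied "anon" as an anonymous root. */
static struct model_dentry *model_alloc_anon(struct model_inode *inode,
                                             struct model_dentry *anon)
{
    if (inode->alias)
        return inode->alias;
    anon->parent = anon;
    anon->name[0] = '\0';
    anon->flags = MODEL_DISCONNECTED;
    anon->inode = inode;
    inode->alias = anon;
    return anon;
}

/* Model of d_splice_alias: returns NULL when "dentry" itself was used,
 * otherwise the pre-existing dentry that was moved into place. */
static struct model_dentry *model_splice_alias(struct model_inode *inode,
                                               struct model_dentry *dentry)
{
    struct model_dentry *old = inode->alias;

    if (inode->is_dir && old) {
        /* Stands in for d_move(): take over the parent and the name. */
        old->parent = dentry->parent;
        strcpy(old->name, dentry->name);
        /* Simplification: treat the dentry as connected as soon as its
         * new parent is connected. */
        if (!(dentry->parent->flags & MODEL_DISCONNECTED))
            old->flags &= ~MODEL_DISCONNECTED;
        return old;
    }
    dentry->inode = inode;
    if (!inode->alias)
        inode->alias = dentry;
    return NULL;
}
```

In this model a directory first reached through a filehandle gets an
anonymous dentry from model_alloc_anon; a later lookup by name then
calls model_splice_alias, which returns that same dentry with its
parent and name filled in rather than creating a second one.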


Filesystem Issues
-----------------

For a filesystem to be exportable it must:

   1/ provide the filehandle fragment routines described below.
   2/ make sure that d_splice_alias is used rather than d_add
      when ->lookup finds an inode for a given parent and name.
      Typically the ->lookup routine will end:
                if (inode)
                        return d_splice_alias(inode, dentry);
                d_add(dentry, inode);
                return NULL;
        }


A filesystem implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations",
which could potentially be full of NULLs, though normally at
least get_parent will be set.

The primary operations are decode_fh and encode_fh.
decode_fh takes a filehandle fragment and tries to find or create a
dentry for the object referred to by the filehandle.
encode_fh takes a dentry and creates a filehandle fragment which can
later be used to find/create a dentry for the same object.

decode_fh will probably make use of "find_exported_dentry".
This function lives in the "exportfs" module which a filesystem does
not need unless it is being exported. So rather than calling
find_exported_dentry directly, each filesystem should call it through
the find_exported_dentry pointer in its export_operations table.
This field is set correctly by the exporting agent (e.g. nfsd) when a
filesystem is exported, and before any export operations are called.
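
The indirection through a pointer in the operations table can be
sketched in userspace C as follows. Everything here is a hypothetical
model: struct model_export_ops is a cut-down stand-in for
export_operations, objects are plain integers, and the NULL check is a
defensive assumption of the sketch (the text above says the pointer is
always set before any export operation is called).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct export_operations. */
struct model_export_ops {
    /* Filled in by the exporting agent, not by the filesystem. */
    int (*find_exported_dentry)(int obj, int parent);
    /* Supplied by the filesystem. */
    int (*decode_fh)(const struct model_export_ops *ops, int obj, int parent);
};

/* In this model, this function belongs to the optional "exportfs" side.
 * It just pretends that the object id is the dentry. */
static int model_find_exported_dentry(int obj, int parent)
{
    (void)parent;
    return obj;
}

/* Filesystem side: reaches the helper only through the table, so the
 * filesystem has no link-time dependency on the helper's module. */
static int model_fs_decode_fh(const struct model_export_ops *ops,
                              int obj, int parent)
{
    if (!ops->find_exported_dentry)
        return -1;                /* not exported yet (model-only check) */
    return ops->find_exported_dentry(obj, parent);
}

/* Exporting agent side: install the pointer when exporting. */
static void model_export(struct model_export_ops *ops)
{
    ops->find_exported_dentry = model_find_exported_dentry;
}
```

The design point is that only the exporting agent ever names the helper
directly; a filesystem that is never exported never references it.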

find_exported_dentry needs three support functions from the
filesystem:
  get_name. When given a parent dentry and a child dentry, this
    should find a name in the directory identified by the parent
    dentry, which leads to the object identified by the child dentry.
    If no get_name function is supplied, a default implementation is
    provided which uses vfs_readdir to find potential names, and
    matches inode numbers to find the correct match.

  get_parent. When given a dentry for a directory, this should return
    a dentry for the parent. Quite possibly the parent dentry will
    have been allocated by d_alloc_anon.
    The default get_parent function just returns an error so any
    filehandle lookup that requires finding a parent will fail.
    ->lookup("..") is *not* used as a default as it can leave ".."
    entries in the dcache which are too messy to work with.

  get_dentry. When given an opaque datum, this should find the
    implied object and create a dentry for it (possibly with
    d_alloc_anon).
    The opaque datum is whatever is passed down by the decode_fh
    function, and is often simply a fragment of the filehandle
    fragment.
    decode_fh passes two datums through find_exported_dentry: one that
    should be used to identify the target object, and one that can be
    used to identify the object's parent, should that be necessary.
    The default get_dentry function assumes that the datum contains an
    inode number and a generation number, and it attempts to get the
    inode using "iget" and check its validity by matching the
    generation number. A filesystem should only depend on the default
    if iget can safely be used this way.
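
The matching strategy of the default get_name described above (walk the
parent's entries and compare inode numbers) can be sketched in
userspace C. This is a model under stated assumptions, not the kernel
implementation: struct model_dirent and model_get_name are invented for
the example, and the directory is a plain in-memory array rather than
something read with vfs_readdir.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical directory entry for this sketch; not a kernel structure. */
struct model_dirent {
    unsigned long ino;
    const char *name;
};

/* Scan the entries of one directory and copy out the first name whose
 * inode number equals child_ino.
 * Returns 0 on success, -1 if no entry matches or the buffer is too
 * small to hold the name and its terminating nul. */
static int model_get_name(const struct model_dirent *entries, size_t count,
                          unsigned long child_ino, char *name, size_t len)
{
    for (size_t i = 0; i < count; i++) {
        if (entries[i].ino != child_ino)
            continue;
        if (strlen(entries[i].name) >= len)
            return -1;
        strcpy(name, entries[i].name);
        return 0;
    }
    return -1;
}
```

Note that the sketch returns the first matching entry; if an object has
several names in the same directory, any one of them satisfies the
requirement of "a name ... which leads to the object".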

If decode_fh and/or encode_fh are left as NULL, then default
implementations are used. These defaults are suitable for ext2 and
extremely similar filesystems (like ext3).

The default encode_fh creates a filehandle fragment from the inode
number and generation number of the target together with the inode
number and generation number of the parent (if the parent is
required).

The default decode_fh extracts the target and parent datums from the
filehandle assuming the format used by the default encode_fh and
passes them to find_exported_dentry.


A filehandle fragment consists of an array of 1 or more 4-byte words,
together with a one-byte "type".
The decode_fh routine should not depend on the stated size that is
passed to it. This size may be larger than the original filehandle
generated by encode_fh, in which case it will have been padded with
nuls. Rather, the encode_fh routine should choose a "type" which
indicates to decode_fh how much of the filehandle is valid, and how
it should be interpreted.
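
A minimal userspace sketch of this convention follows, assuming a
fragment layout of inode number and generation number for the target,
optionally followed by the same pair for the parent. The model_* names,
the type values and the 255 failure code are choices made for this
example and should not be read as the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type values: each one fixes how many words are valid. */
enum { MODEL_FH_INO_GEN = 1, MODEL_FH_INO_GEN_PARENT = 2 };

struct model_id {
    uint32_t ino;
    uint32_t gen;
};

/* Encode target (and parent, if non-NULL) into 4-byte words.
 * On entry *words is the capacity of fh; on success it is set to the
 * number of words used. Returns the type, or 255 if fh is too small. */
static int model_encode_fh(uint32_t *fh, size_t *words,
                           struct model_id target,
                           const struct model_id *parent)
{
    size_t need = parent ? 4 : 2;

    if (*words < need)
        return 255;
    fh[0] = target.ino;
    fh[1] = target.gen;
    if (parent) {
        fh[2] = parent->ino;
        fh[3] = parent->gen;
    }
    *words = need;
    return parent ? MODEL_FH_INO_GEN_PARENT : MODEL_FH_INO_GEN;
}

/* Decode using "type", not "words", to decide how much is meaningful:
 * "words" may exceed what encode produced because of nul padding, so it
 * is only used as a lower-bound sanity check. Returns 0 on success. */
static int model_decode_fh(const uint32_t *fh, size_t words, int type,
                           struct model_id *target,
                           struct model_id *parent, int *has_parent)
{
    size_t need;

    if (type == MODEL_FH_INO_GEN)
        need = 2;
    else if (type == MODEL_FH_INO_GEN_PARENT)
        need = 4;
    else
        return -1;
    if (words < need)
        return -1;
    target->ino = fh[0];
    target->gen = fh[1];
    *has_parent = (type == MODEL_FH_INO_GEN_PARENT);
    if (*has_parent) {
        parent->ino = fh[2];
        parent->gen = fh[3];
    }
    return 0;
}
```

Decoding a two-word fragment that arrives padded out to eight words
gives the same result as decoding it at its original length, because
only the type decides which words are read.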