    vfs open system call flow: overall

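    I have recently been reading the flow of the VFS open system call. It is fairly complex: it involves parsing the path of the file being opened and the four core structs (file, dentry, inode, super_block).

    The open system call also sets up many relationships. For example, if a file has not been opened before, a dentry is created for it and added to the dentry hashtable (dcache), and the file's inode is added to the inode hashtable (icache). Later opens of the same file can then use the dentry/inode structs already in the dcache/icache, which shortens the flow considerably. open also sets the f_op member of the file struct, so that read and write calls made after the open go directly through this set of functions.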

    This is a summary of the code I traced recently, split into 4 articles:

    1. vfs open system call flow: overall

    2. vfs open system call flow: link_path_walk()

    3. vfs open system call flow: do_last()

    4. vfs open system call flow: filesystem-specific lookup (ext4 fs lookup)

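    These 4 articles often refer to "the current directory/file". This means the directory/file that is being resolved right now, i.e. the one described by nameidata.last. It does not mean the directory described by nameidata.path, which has already been resolved and is the parent of "the current directory/file".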

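    nameidata.last_type takes one of the following values:

    LAST_NORM: the last component is an ordinary file name
    LAST_ROOT: the last component is "/" (i.e. the whole pathname is "/")
    LAST_DOT: the last component is "."
    LAST_DOTDOT: the last component is ".."
    LAST_BIND: the last component is a symbolic link into a special filesystem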
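
    As a rough illustration of how a component string maps to these types, here is a minimal userspace sketch. classify_component() is a hypothetical helper, not a kernel function; it only covers the three types that can be derived from the component text itself, since LAST_ROOT and LAST_BIND are set elsewhere in the lookup code.

```c
#include <assert.h>
#include <stddef.h>

/* Same names as the kernel's last_type values; used here only as labels. */
enum last_type { LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND };

/*
 * Hypothetical helper: classify one path component given as (name, len).
 * "." and ".." are special; everything else is an ordinary name.
 */
static enum last_type classify_component(const char *name, size_t len)
{
    assert(name != NULL);
    if (len == 1 && name[0] == '.')
        return LAST_DOT;
    if (len == 2 && name[0] == '.' && name[1] == '.')
        return LAST_DOTDOT;
    return LAST_NORM;
}
```

    A name such as "..a" is still an ordinary name, which is why the length is checked together with the characters.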

    The rest of this article traces the open flow for a file that already exists on the filesystem, using ext4 as the example filesystem.

    #define EMBEDDED_LEVELS 2
    struct nameidata {
        struct path    path;
        struct qstr    last;
        struct path    root;
        struct inode    *inode; /* path.dentry.d_inode */
        unsigned int    flags;
        unsigned    seq, m_seq;
        int        last_type;
        unsigned    depth;
        int        total_link_count;
        struct saved {
            struct path link;
            struct delayed_call done;
            const char *name;
            unsigned seq;
        } *stack, internal[EMBEDDED_LEVELS];
        struct filename    *name;
        struct nameidata *saved;
        struct inode    *link_inode;
        unsigned    root_seq;
        int        dfd;
    } __randomize_layout;
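    • path member: contains struct vfsmount *mnt and struct dentry *dentry. While the complete filename is being resolved, path is updated after each level, and the next level is looked up relative to the updated path. mnt points to the vfsmount of the filesystem that the current path is on, and dentry is the dentry of the current directory. For example, for the path /data/test, where an ext4 filesystem is mounted on /data and test is an ordinary directory, mnt is the vfsmount of the filesystem mounted on /data and dentry is the dentry of the /data/test directory (a directory is also a file; its contents are the names and inode numbers of its child directories/files)
    • last member: a struct qstr whose name member is the name of the component currently being resolved. Like path, it is updated each time a level is resolved. (The separate name member, a struct filename *, holds the complete pathname and is set once by set_nameidata().)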

    Overview of the vfs open system call flow

    struct file *do_filp_open(int dfd, struct filename *pathname,
            const struct open_flags *op)
    {
        struct nameidata nd;
        int flags = op->lookup_flags;
        struct file *filp;
    
        set_nameidata(&nd, dfd, pathname);
        filp = path_openat(&nd, op, flags | LOOKUP_RCU);
        if (unlikely(filp == ERR_PTR(-ECHILD)))
            filp = path_openat(&nd, op, flags);
        if (unlikely(filp == ERR_PTR(-ESTALE)))
            filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
        restore_nameidata();
        return filp;
    }

    set_nameidata() assigns struct filename *name to the name member of nameidata. This name holds the complete name of the file being opened (including its path):

    static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
    {
        struct nameidata *old = current->nameidata;
        p->stack = p->internal;
        p->dfd = dfd;
        p->name = name;
        p->total_link_count = old ? old->total_link_count : 0;
        p->saved = old;
        current->nameidata = p;
    }

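    path_openat() calls path_init(), link_path_walk() and do_last():

    1. path_init() initializes the path member of nameidata. If the path of the file being opened is absolute, i.e. it starts with the root directory /, path is initialized to the dentry and vfsmount of /. This prepares for the level-by-level resolution that follows. Its return value s points to the start of the complete pathname string of the file being opened;

    2. link_path_walk(const char *name, struct nameidata *nd): the name argument is the return value s of path_init(). link_path_walk() resolves the file path one level at a time until it reaches the last level, and finally stores the filename in the last member of nameidata so that do_last() can perform the actual open. For each level it first searches the dcache (dentry_hashtable), which is the fast lookup. If a dentry is found, it is saved in a path struct (mnt and dentry). If not, the directory has not been looked up before (or its dentry is no longer in the dcache) and a dentry has to be created, which is the slow path. The slow path allocates a dentry and then calls the filesystem-specific lookup function to find the directory's ext4_dir_entry_2 by name. That struct contains the inode number, which is used to search the inode hash list: if the inode is found there, no inode needs to be allocated; otherwise one is allocated. d_splice_alias() is then called to associate the dentry with the inode, i.e. to assign the inode to the dentry's d_inode member. At the end of both the fast path and the slow path, the dentry that was found or allocated is saved in the path struct (dentry/mnt), and step_into() copies that path struct into the path member of nameidata (path_to_nameidata). nameidata now points at the current directory, one level has been resolved, and control returns to link_path_walk() to resolve the next level.

    This stage resolves the directories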
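
    The level-by-level walk described in step 2 can be sketched in userspace as follows. This is a toy model under loud assumptions: struct toy_nameidata and toy_link_path_walk() are hypothetical names, there is no dcache, no mounts and no symlinks, and "stepping into" a directory is reduced to recording its name. It only shows how the walk consumes directories and leaves the final component in last.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 16
#define MAX_NAME  64

/* Hypothetical, heavily simplified stand-in for struct nameidata. */
struct toy_nameidata {
    char dirs[MAX_DEPTH][MAX_NAME]; /* directories stepped into, in order */
    int  ndirs;
    char last[MAX_NAME];            /* final component, left for the last step */
};

/* Returns 0 on success, -1 if the path is too deep or a name is too long. */
static int toy_link_path_walk(const char *name, struct toy_nameidata *nd)
{
    assert(name != NULL && nd != NULL);
    nd->ndirs = 0;
    nd->last[0] = '\0';

    while (*name == '/')                   /* skip leading slashes */
        name++;
    if (!*name)
        return 0;                          /* the path was just "/" */

    for (;;) {
        size_t len = strcspn(name, "/");   /* length of this component */
        if (len >= MAX_NAME)
            return -1;

        memcpy(nd->last, name, len);       /* remember the current component */
        nd->last[len] = '\0';

        name += len;
        while (*name == '/')               /* skip separators */
            name++;
        if (!*name)
            return 0;                      /* nothing follows: last is final */

        /* More components follow, so this one is treated as a directory. */
        if (nd->ndirs == MAX_DEPTH)
            return -1;
        strcpy(nd->dirs[nd->ndirs++], nd->last);
    }
}
```

    For "/data/test/test.txt" the sketch records data and test as directories and leaves test.txt in last, which mirrors the split between this stage and the final-component stage.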

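    3. do_last() works from the final result of link_path_walk(): by now the directory part of the path has been fully resolved and only the final filename remains. If the open flags do not include O_CREAT, do_last() first calls lookup_fast() to check whether the file already has a dentry; if it does, that dentry is saved in the path struct. If O_CREAT is set, or lookup_fast() finds nothing, lookup_open() is called. It again searches the dcache first and, if nothing is found, creates a dentry in the same way as in the link_path_walk() flow. Both the lookup_fast path and the lookup_open path fill in the path struct with the dentry that was found or allocated for the file (dentry/mnt) and then reach step_into(), which copies the path struct into nameidata.path (path_to_nameidata). At this point nameidata "points at" the current file, i.e. the complete file path. Finally vfs_open() is called to run the open function from the filesystem's file_operations; for ext4 this is ext4_file_open().

    This stage resolves the final file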
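
    The effect of O_CREAT on the last component is also visible from userspace. The sketch below uses only standard POSIX calls; demo_open_create() is a hypothetical helper, and the /tmp location is an assumption. It does not show which kernel function ran, only the observable result: without O_CREAT a missing final component fails with ENOENT, and with O_CREAT the file is created so that a later plain open succeeds.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 0 if open() behaved as described above, -1 otherwise. */
static int demo_open_create(void)
{
    char dir[] = "/tmp/vfs_open_demo_XXXXXX";  /* assumed writable location */
    char path[64];
    int fd, n, ok = 0;

    if (mkdtemp(dir) == NULL)                  /* fresh, empty directory */
        return -1;
    n = snprintf(path, sizeof(path), "%s/test.txt", dir);
    assert(n > 0 && n < (int)sizeof(path));

    fd = open(path, O_RDONLY);                 /* no O_CREAT: file is missing */
    if (fd < 0 && errno == ENOENT) {
        fd = open(path, O_WRONLY | O_CREAT, 0600);  /* O_CREAT: create it */
        if (fd >= 0) {
            close(fd);
            fd = open(path, O_RDONLY);         /* now a plain open succeeds */
            if (fd >= 0) {
                close(fd);
                ok = 1;
            }
        }
    } else if (fd >= 0) {
        close(fd);
    }

    unlink(path);                              /* clean up */
    rmdir(dir);
    return ok ? 0 : -1;
}
```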

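    Computing the dentry hash value

    hash_len = hash_name(nd->path.dentry, name): the dentry hash is computed by hash_name(). Its dentry argument is the parent dentry and name is the name of the current directory. The resulting hash is a 32-bit integer that depends on both the parent dentry and the name, so a change in either one gives a different hash (barring collisions). hash_name() also computes the length of the current directory name, splitting the path on the / separator. For example, if name points to "test/test.txt", the computed len is 4. How hash_name() computes hash/len can be checked with the following test function: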

    void hash_name_test(void)
    {
        int a = 0x1000;
        int b = 0x2000;
        
        int *p0 = &a;
        int *p1 = &b;
    
        u64 hashlen;
        
        hashlen = hash_name(p0, "p0 pointer");
        pr_info("p0(p0 pointer) hash: %#x, len: %d.\n", hashlen_hash(hashlen), hashlen_len(hashlen));
        hashlen = hash_name(p0, "p0 pointer");
        pr_info("p0(p0 pointer) hash: %#x, len: %d.\n", hashlen_hash(hashlen), hashlen_len(hashlen));
    
        hashlen = hash_name(p0, "p0 pointe");
        pr_info("p0(p0 pointe ) hash: %#x, len: %d.\n", hashlen_hash(hashlen), hashlen_len(hashlen));
        
        hashlen = hash_name(p1, "p1 pointer");
        pr_info("p1             hash: %#x, len: %d.\n", hashlen_hash(hashlen), hashlen_len(hashlen));
    
    }

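    The output is shown below. With the same first argument, dropping a single r from the second argument gives a completely different hash, and the computed len is the length of the second argument string (which contains no / in this example):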

    [ 1762.282727] p0(p0 pointer) hash: 0x4272c576, len: 10.
    [ 1762.287975] p0(p0 pointer) hash: 0x4272c576, len: 10.
    [ 1762.293312] p0(p0 pointe ) hash: 0x08f34377, len: 9.
    [ 1762.298429] p1             hash: 0x22eea1ab, len: 10.

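    Matching rules when looking up the current directory's dentry in the dcache

    There are two APIs for searching the dcache: __d_lookup_rcu and __d_lookup. The difference is that the former does not itself call rcu_read_lock() and does not take the dentry's d_lock spinlock (it validates candidates with the d_seq sequence count instead), while the latter does both, so the former is more efficient.

    The match is based on comparing the hash of the qstr name with that of the candidate dentry; if they are equal, the parent dentry and the name are compared as well.

    One thing looks odd at first in __d_lookup_rcu(): when the parent dentry of the current directory does not have the DCACHE_OP_COMPARE flag, no d_compare function is called. This seems to be the common case, since most filesystems do not provide d_compare; ext4, for example, does not have this flag in the code traced here. In that branch hash_len (hash and length packed together) is compared first, and if it matches, the name bytes are compared with dentry_cmp(). The parent dentry has already been checked before that, so the candidate is known to be in the same parent directory, and comparing the name is then enough to identify the current directory uniquely: two entries with the same name cannot exist in one directory.

    __d_lookup(), by contrast, compares the hash value, the parent dentry and the name string, in that order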

    struct dentry *__d_lookup_rcu(const struct dentry *parent,
                    const struct qstr *name,
                    unsigned *seqp)
    {
        u64 hashlen = name->hash_len;
        const unsigned char *str = name->name;
        struct hlist_bl_head *b = d_hash(hashlen_hash(hashlen));
        struct hlist_bl_node *node;
        struct dentry *dentry;
    
        /*
         * Note: There is significant duplication with __d_lookup_rcu which is
         * required to prevent single threaded performance regressions
         * especially on architectures where smp_rmb (in seqcounts) are costly.
         * Keep the two functions in sync.
         */
    
        /*
         * The hash list is protected using RCU.
         *
         * Carefully use d_seq when comparing a candidate dentry, to avoid
         * races with d_move().
         *
         * It is possible that concurrent renames can mess up our list
         * walk here and result in missing our dentry, resulting in the
         * false-negative result. d_lookup() protects against concurrent
         * renames using rename_lock seqlock.
         *
         * See Documentation/filesystems/path-lookup.txt for more details.
         */
        hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
            unsigned seq;
    
    seqretry:
            /*
             * The dentry sequence count protects us from concurrent
             * renames, and thus protects parent and name fields.
             *
             * The caller must perform a seqcount check in order
             * to do anything useful with the returned dentry.
             *
             * NOTE! We do a "raw" seqcount_begin here. That means that
             * we don't wait for the sequence count to stabilize if it
             * is in the middle of a sequence change. If we do the slow
             * dentry compare, we will do seqretries until it is stable,
             * and if we end up with a successful lookup, we actually
             * want to exit RCU lookup anyway.
             *
             * Note that raw_seqcount_begin still *does* smp_rmb(), so
             * we are still guaranteed NUL-termination of ->d_name.name.
             */
            seq = raw_seqcount_begin(&dentry->d_seq);
            if (dentry->d_parent != parent)
                continue;
            if (d_unhashed(dentry))
                continue;
    
            if (unlikely(parent->d_flags & DCACHE_OP_COMPARE)) {
                int tlen;
                const char *tname;
                if (dentry->d_name.hash != hashlen_hash(hashlen))
                    continue;
                tlen = dentry->d_name.len;
                tname = dentry->d_name.name;
                /* we want a consistent (name,len) pair */
                if (read_seqcount_retry(&dentry->d_seq, seq)) {
                    cpu_relax();
                    goto seqretry;
                }
                if (parent->d_op->d_compare(dentry,
                                tlen, tname, name) != 0)
                    continue;
            } else {
                if (dentry->d_name.hash_len != hashlen)
                    continue;
                if (dentry_cmp(dentry, str, hashlen_len(hashlen)) != 0)
                    continue;
            }
            *seqp = seq;
            return dentry;
        }
        return NULL;
    }
    struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
    {
        unsigned int hash = name->hash;
        struct hlist_bl_head *b = d_hash(hash);
        struct hlist_bl_node *node;
        struct dentry *found = NULL;
        struct dentry *dentry;
    
        /*
         * Note: There is significant duplication with __d_lookup_rcu which is
         * required to prevent single threaded performance regressions
         * especially on architectures where smp_rmb (in seqcounts) are costly.
         * Keep the two functions in sync.
         */
    
        /*
         * The hash list is protected using RCU.
         *
         * Take d_lock when comparing a candidate dentry, to avoid races
         * with d_move().
         *
         * It is possible that concurrent renames can mess up our list
         * walk here and result in missing our dentry, resulting in the
         * false-negative result. d_lookup() protects against concurrent
         * renames using rename_lock seqlock.
         *
         * See Documentation/filesystems/path-lookup.txt for more details.
         */
        rcu_read_lock();
        
        hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
    
            if (dentry->d_name.hash != hash)
                continue;
    
            spin_lock(&dentry->d_lock);
            if (dentry->d_parent != parent)
                goto next;
            if (d_unhashed(dentry))
                goto next;
    
            if (!d_same_name(dentry, parent, name))
                goto next;
    
            dentry->d_lockref.count++;
            found = dentry;
            spin_unlock(&dentry->d_lock);
            break;
    next:
            spin_unlock(&dentry->d_lock);
         }
         rcu_read_unlock();
    
         return found;
    }
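
    The matching order in the non-d_compare branch can be modelled in userspace as below. This is only a sketch under assumptions: toy_dentry, toy_d_add() and toy_d_lookup() are hypothetical names, the hash is a toy FNV-1a style mix rather than the kernel's, and there is no locking, RCU or sequence counting at all. It shows why checking the parent pointer plus hash, length and name bytes is enough to tell apart entries with the same name in different directories.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BUCKETS 8

struct toy_dentry {
    struct toy_dentry *d_parent;
    const char        *name;
    uint32_t           hash;
    uint32_t           len;
    struct toy_dentry *next;    /* hash chain */
};

static struct toy_dentry *toy_table[TOY_BUCKETS];

/* Toy hash salted with the parent pointer; not the kernel's function. */
static uint32_t toy_hash(const struct toy_dentry *parent,
                         const char *name, uint32_t len)
{
    uint32_t h = 2166136261u ^ (uint32_t)(uintptr_t)parent;
    for (uint32_t i = 0; i < len; i++) {
        h ^= (unsigned char)name[i];
        h *= 16777619u;
    }
    return h;
}

/* Insert d under parent with the given name. */
static void toy_d_add(struct toy_dentry *d, struct toy_dentry *parent,
                      const char *name)
{
    assert(d != NULL && name != NULL);
    d->d_parent = parent;
    d->name = name;
    d->len = (uint32_t)strlen(name);
    d->hash = toy_hash(parent, name, d->len);
    d->next = toy_table[d->hash % TOY_BUCKETS];
    toy_table[d->hash % TOY_BUCKETS] = d;
}

/* Match order: parent, then hash and len, then the name bytes. */
static struct toy_dentry *toy_d_lookup(struct toy_dentry *parent,
                                       const char *name)
{
    assert(name != NULL);
    uint32_t len = (uint32_t)strlen(name);
    uint32_t hash = toy_hash(parent, name, len);

    for (struct toy_dentry *d = toy_table[hash % TOY_BUCKETS]; d; d = d->next) {
        if (d->d_parent != parent)
            continue;
        if (d->hash != hash || d->len != len)
            continue;
        if (memcmp(d->name, name, len) != 0)
            continue;
        return d;
    }
    return NULL;
}
```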

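    path_openat returning -ECHILD

    If lookup_fast() does not find a dentry for the current directory in the dcache and the subsequent unlazy_walk() call returns -ECHILD, the current path_openat() is abandoned and path_openat() is called again, this time without LOOKUP_RCU. From then on, lookup_fast() uses the RCU lock and the dentry spinlock when it searches the dcache, which makes the lookup less efficient. The first path_openat() call has LOOKUP_RCU set, so lookup_fast() does not need to take these locks and the lookup is faster.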

    struct file *do_filp_open(int dfd, struct filename *pathname,
            const struct open_flags *op)
    {
        struct nameidata nd;
        int flags = op->lookup_flags;
        struct file *filp;
    
        set_nameidata(&nd, dfd, pathname);
        filp = path_openat(&nd, op, flags | LOOKUP_RCU);
        if (unlikely(filp == ERR_PTR(-ECHILD)))
            filp = path_openat(&nd, op, flags);
        if (unlikely(filp == ERR_PTR(-ESTALE)))
            filp = path_openat(&nd, op, flags | LOOKUP_REVAL);
        restore_nameidata();
        return filp;
    }

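    The f_op member of the file struct

    For a file on a filesystem, the f_op member of the file struct, i.e. its file_operations, comes from the i_fop member of the file's inode struct. That member is set when the filesystem initializes the inode after allocating it. For ext4 this happens in ext4_lookup/__ext4_iget(), where a regular file's i_fop is set to ext4_file_operations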

    __ext4_iget()
        if (S_ISREG(inode->i_mode)) { // regular file
            inode->i_op = &ext4_file_inode_operations;
            inode->i_fop = &ext4_file_operations;
            ext4_set_aops(inode);
        } else if (S_ISDIR(inode->i_mode)) { // directory
            inode->i_op = &ext4_dir_inode_operations;
            inode->i_fop = &ext4_dir_operations;
        } else if (S_ISLNK(inode->i_mode)) { // symbolic link

    do_dentry_open() then assigns inode->i_fop to file.f_op:

    do_dentry_open
        f->f_op = fops_get(inode->i_fop); // for a filesystem file (not a driver), fops_get() returns inode->i_fop
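
    The hand-off from i_fop to f_op, and the way later reads dispatch through it, can be modelled with plain function pointers. All toy_* names below are hypothetical and the structs are far smaller than the kernel's; the sketch only shows the pattern: the operations table is chosen when the inode is set up, copied into the file at open time, and used for every read afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

struct toy_file_operations {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_inode {
    const struct toy_file_operations *i_fop; /* chosen when the inode is set up */
    const char *data;                        /* pretend file contents */
};

struct toy_file {
    const struct toy_file_operations *f_op;  /* copied from the inode at open */
    struct toy_inode *f_inode;
    size_t pos;
};

/* read implementation for "regular files" in this toy model */
static long toy_regular_read(struct toy_file *f, char *buf, size_t len)
{
    size_t avail = strlen(f->f_inode->data) - f->pos;
    if (len > avail)
        len = avail;
    memcpy(buf, f->f_inode->data + f->pos, len);
    f->pos += len;
    return (long)len;
}

static const struct toy_file_operations toy_regular_fops = {
    .read = toy_regular_read,
};

/* Plays the role of inode initialization: pick the operations table. */
static void toy_iget(struct toy_inode *inode, const char *data)
{
    assert(inode != NULL && data != NULL);
    inode->i_fop = &toy_regular_fops;
    inode->data = data;
}

/* Plays the role of the open step: f_op is taken from the inode. */
static void toy_open(struct toy_file *f, struct toy_inode *inode)
{
    assert(f != NULL && inode != NULL && inode->i_fop != NULL);
    f->f_op = inode->i_fop;
    f->f_inode = inode;
    f->pos = 0;
}

/* After open, reads go straight through f_op. */
static long toy_read(struct toy_file *f, char *buf, size_t len)
{
    return f->f_op->read(f, buf, len);
}
```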
  • Original article: https://www.cnblogs.com/aspirs/p/15730173.html