zoukankan      html  css  js  c++  java
  • Linux Kernel文件系统写I/O流程代码分析(二)bdi_writeback

    Linux Kernel文件系统写I/O流程代码分析(二)bdi_writeback

    上一篇# Linux Kernel文件系统写I/O流程代码分析(一),我们看到Buffered IO,写操作写入到page cache后就直接返回了,本文主要分析脏页是如何刷盘的。

    概述

    由于内核page cache的作用,写操作实际被延迟写入。当page cache里的数据被用户写入但是没有刷新到磁盘时,则该page为脏页(块设备page cache机制因为以前机械磁盘以扇区为单位读写,引入了buffer_head,每个4K的page进一步划分成8个buffer,通过buffer_head管理,因此可能只设置了部分buffer head为脏)。
    脏页在以下情况下将被回写(write back)到磁盘上:

    • 脏页在内存里的时间超过了阈值。
    • 系统的内存紧张,低于某个阈值时,必须将所有脏页回写。
    • 用户强制要求刷盘,如调用sync()、fsync()、close()等系统调用。

    以前的Linux通过pbflush机制管理脏页的回写,但因为其管理了所有的磁盘的page/buffer_head,存在严重的性能瓶颈,因此从Linux 2.6.32开始,脏页回写的工作由bdi_writeback机制负责。bdi_writeback机制为每个磁盘都创建一个线程,专门负责这个磁盘的page cache或者
    buffer cache的数据刷新工作,以提高I/O性能。

    BDI系统

    BDI是backing device info的缩写,它用于描述后端存储(如磁盘)设备相关的信息。相对于内存来说,后端存储的I/O比较慢,因此写盘操作需要通过page cache进行缓存延迟写入。
    最初的BDI子系统里,模块启动的时候创建bdi-default进程,然后为每个注册的设备创建flush-x:y(x,y为主次设备号)的进程,用于脏数据的回写。在Linux 3.10.0版本之后,BDI子系统使用workqueue机制代替原来的线程创建,需要回写时,将flush任务提交给workqueue,最终由通用的[kworker]进程负责处理。BDI子系统初始化的代码如下:

    static int __init default_bdi_init(void)
    {
    	int err;
    
    	bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM | WQ_FREEZABLE |
    					      WQ_UNBOUND | WQ_SYSFS, 0);
    	if (!bdi_wq)
    		return -ENOMEM;
    
    	err = bdi_init(&default_backing_dev_info);
    	if (!err)
    		bdi_register(&default_backing_dev_info, NULL, "default");
    	err = bdi_init(&noop_backing_dev_info);
    
    	return err;
    }
    subsys_initcall(default_bdi_init);
    

    设备注册

    当执行mount流程时,底层文件系统定义自己的struct backing_dev_info结构并将其注册到BDI子系统,如下是FUSE代码示例:

    static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
    {
    	int err;
    
    	fc->bdi.name = "fuse";
    	fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
    	/* fuse does it's own writeback accounting */
    	fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB | BDI_CAP_STRICTLIMIT;
    
    	err = bdi_init(&fc->bdi);
    	if (err)
    		return err;
    
    	fc->bdi_initialized = 1;
    
    	if (sb->s_bdev) {
    		err =  bdi_register(&fc->bdi, NULL, "%u:%u-fuseblk",
    				    MAJOR(fc->dev), MINOR(fc->dev));
    	} else {
    		err = bdi_register_dev(&fc->bdi, fc->dev);
    	}
    
    	if (err)
    		return err;
    
    	/*
    	 *    /sys/class/bdi/<bdi>/max_ratio
    	 */
    	bdi_set_max_ratio(&fc->bdi, 1);
    
    	return 0;
    }
    

    该函数先通过bdi_init()初始化struct backing_dev_info,然后通过bid_register()将其注册到BDI子系统。
    其中bdi_init()会调用bdi_wb_init()初始化struct bdi_writeback

    static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
    {
    	memset(wb, 0, sizeof(*wb));
    
    	wb->bdi = bdi;
    	wb->last_old_flush = jiffies;
    	INIT_LIST_HEAD(&wb->b_dirty);
    	INIT_LIST_HEAD(&wb->b_io);
    	INIT_LIST_HEAD(&wb->b_more_io);
    	spin_lock_init(&wb->list_lock);
    	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
    }
    

    其中初始化了一个默认处理函数为bdi_writeback_workfn的work,用于回写处理。

    数据回写

    在上一篇的基础上,将图补充了bdi回写的部分,如下所示:
    数据写入缓存和回写

    bdi_queue_work

    BDI子系统使用workqueue机制进行数据回写,其回写接口为bdi_queue_work()将具体某个bdi的回写请求(wb_writeback_work)挂到bdi_wq上。代码如下:

    static void bdi_queue_work(struct backing_dev_info *bdi,
    			   struct wb_writeback_work *work)
    {
    	trace_writeback_queue(bdi, work);
    
    	spin_lock_bh(&bdi->wb_lock);
    	if (!test_bit(BDI_registered, &bdi->state)) {
    		if (work->done)
    			complete(work->done);
    		goto out_unlock;
    	}
    	list_add_tail(&work->list, &bdi->work_list);
    	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
    out_unlock:
    	spin_unlock_bh(&bdi->wb_lock);
    }
    

    调用该函数的地方包括:

    • sync_inode_sb(): 将该super block上所有的脏inode回写。
    • writeback_inodes_sb_nr():回写super block上指定个数脏inode。
    • __bdi_start_writeback():定时调用或者需要释放pages或者需要更多内存时调用。

    bdi_writeback_workfn

    bdi_queue_work()提交了work给bdi_wq上,由对应的bdi处理函数进行处理,默认的函数为bdi_writeback_workfn,其代码如下:

    void bdi_writeback_workfn(struct work_struct *work)
    {
    	struct bdi_writeback *wb = container_of(to_delayed_work(work),
    						struct bdi_writeback, dwork);
    	struct backing_dev_info *bdi = wb->bdi;
    	long pages_written;
    
    	set_worker_desc("flush-%s", dev_name(bdi->dev));
    	current->flags |= PF_SWAPWRITE;
    
    	if (likely(!current_is_workqueue_rescuer() ||
    		   !test_bit(BDI_registered, &bdi->state))) {
    		/*
    		 * The normal path.  Keep writing back @bdi until its
    		 * work_list is empty.  Note that this path is also taken
    		 * if @bdi is shutting down even when we're running off the
    		 * rescuer as work_list needs to be drained.
    		 */
    		do {
    			pages_written = wb_do_writeback(wb);
    			trace_writeback_pages_written(pages_written);
    		} while (!list_empty(&bdi->work_list));
    	} else {
    		/*
    		 * bdi_wq can't get enough workers and we're running off
    		 * the emergency worker.  Don't hog it.  Hopefully, 1024 is
    		 * enough for efficient IO.
    		 */
    		pages_written = writeback_inodes_wb(&bdi->wb, 1024,
    						    WB_REASON_FORKER_THREAD);
    		trace_writeback_pages_written(pages_written);
    	}
    
    	if (!list_empty(&bdi->work_list))
    		mod_delayed_work(bdi_wq, &wb->dwork, 0);
    	else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
    		bdi_wakeup_thread_delayed(bdi);
    
    	current->flags &= ~PF_SWAPWRITE;
    }
    

    首先判断当前workqueue能否获得足够的worker进行处理,如果能则将bdi上所有work全部提交,否则只提交一个work并限制写入1024个pages。
    正常情况下通过调用wb_do_writeback函数处理回写。

    wb_do_writeback

    该函数代码如下,遍历bdi上所有work,通过调用wb_writeback()进行数据写入。

    static long wb_do_writeback(struct bdi_writeback *wb)
    {
    	struct backing_dev_info *bdi = wb->bdi;
    	struct wb_writeback_work *work;
    	long wrote = 0;
    
    	set_bit(BDI_writeback_running, &wb->bdi->state);
    	while ((work = get_next_work_item(bdi)) != NULL) {
    
    		trace_writeback_exec(bdi, work);
    
    		wrote += wb_writeback(wb, work);
    
    		/*
    		 * Notify the caller of completion if this is a synchronous
    		 * work item, otherwise just free it.
    		 */
    		if (work->done)
    			complete(work->done);
    		else
    			kfree(work);
    	}
    
    	/*
    	 * Check for periodic writeback, kupdated() style
    	 */
    	wrote += wb_check_old_data_flush(wb);
    	wrote += wb_check_background_flush(wb);
    	clear_bit(BDI_writeback_running, &wb->bdi->state);
    
    	return wrote;
    }
    

    wb_writeback()函数最终调用__writeback_single_inode()将某个inode上脏页刷回。

    __writeback_single_inode

    __writeback_single_inode()的代码如下,最终通过调用do_writepages()函数写盘:

    static int
    __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
    {
    	struct address_space *mapping = inode->i_mapping;
    	long nr_to_write = wbc->nr_to_write;
    	unsigned dirty;
    	int ret;
    
    	WARN_ON(!(inode->i_state & I_SYNC));
    
    	trace_writeback_single_inode_start(inode, wbc, nr_to_write);
    
    	ret = do_writepages(mapping, wbc);
    
    	/*
    	 * Make sure to wait on the data before writing out the metadata.
    	 * This is important for filesystems that modify metadata on data
    	 * I/O completion. We don't do it for sync(2) writeback because it has a
    	 * separate, external IO completion path and ->sync_fs for guaranteeing
    	 * inode metadata is written back correctly.
    	 */
    	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
    		int err = filemap_fdatawait(mapping);
    		if (ret == 0)
    			ret = err;
    	}
    
    	/*
    	 * Some filesystems may redirty the inode during the writeback
    	 * due to delalloc, clear dirty metadata flags right before
    	 * write_inode()
    	 */
    	spin_lock(&inode->i_lock);
    	/* Clear I_DIRTY_PAGES if we've written out all dirty pages */
    	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
    		inode->i_state &= ~I_DIRTY_PAGES;
    	dirty = inode->i_state & I_DIRTY;
    	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
    	spin_unlock(&inode->i_lock);
    	/* Don't write the inode if only I_DIRTY_PAGES was set */
    	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
    		int err = write_inode(inode, wbc);
    		if (ret == 0)
    			ret = err;
    	}
    	trace_writeback_single_inode(inode, wbc, nr_to_write);
    	return ret;
    }
    

    do_writepages

    函数do_writepages()在上一篇已经介绍过了,它负责调用底层文件系统的a_ops->writepages将pages写入后端存储。

  • 相关阅读:
    深入Android 【一】 —— 序及开篇
    Android中ContentProvider和ContentResolver使用入门
    深入Android 【六】 —— 界面构造
    The service cannot be activated because it does not support ASP.NET compatibility. ASP.NET compatibility is enabled for this application. Turn off ASP.NET compatibility mode in the web.config or add the AspNetCompatibilityRequirements attribute to the ser
    Dynamic Business代码片段总结
    对文件的BuildAction以content,resource两种方式的读取
    paraview 3.12.0 windows下编译成功 小记
    百度网盘PanDownload使用Aria2满速下载
    netdata的安装与使用
    用PS给证件照排版教程
  • 原文地址:https://www.cnblogs.com/jimbo17/p/10491223.html
Copyright © 2011-2022 走看看