dm-bufio

Think of it as a simple buffer library implemented by the device-mapper framework itself (dm likes to wrap things in its own API; the benefit is that block-layer changes can be hidden behind it, the downside is that learning these APIs takes time, but fortunately the implementation is short and easy to read, so you can check the details whenever you actually need them). It can be viewed as dm's malloc library: dm_bufio_client is the handle of a buffer pool, and dm_buffer is the handle of a single buffer in that pool. With these handles the user can read and write data through the buffer. Unlike malloc, using dm-bufio implies an on-disk address space, so IO is performed for you and you never have to build bios yourself. In addition, the dm_bufio_cache workqueue periodically cleans up the pool's buffers.

With dm-bufio you don't have to worry about IO and memory-management details; IO is done in fixed-size blocks (the size is specified when the dm_bufio_client is created). What you do need to care about are the alloc_callback / write_callback, the block nr that the target address of a submitted IO corresponds to, and the use_bio / use_dmio submission paths.

There can be multiple dm_bufio_clients in the system, each managing its own set of buffers. Buffer memory comes from one of three sources: a slab cache, the buddy allocator, or vmalloc. IO is submitted in one of two ways: bio or dm-io, where dm-io effectively takes a large vmalloc'ed buffer and splits it into multiple bios for submission.
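A minimal usage sketch of the read-modify-write cycle described above (a hypothetical caller, not code from any real target; block size, callbacks and error handling are simplified, and the exact prototypes are listed in the API section below):

    static int example_update_block(struct block_device *bdev, sector_t block)
    {
        struct dm_bufio_client *c;
        struct dm_buffer *b;
        void *data;
        int r = 0;

        /* 4 KiB blocks, 1 reserved buffer, no aux data, no callbacks */
        c = dm_bufio_client_create(bdev, 4096, 1, 0, NULL, NULL);
        if (IS_ERR(c))
            return PTR_ERR(c);

        /* read the block (from cache or disk), modify it, mark it dirty */
        data = dm_bufio_read(c, block, &b);
        if (IS_ERR(data)) {
            r = PTR_ERR(data);
            goto out;
        }
        memset(data, 0, 4096);
        dm_bufio_mark_buffer_dirty(b);
        dm_bufio_release(b);

        /* make sure the dirty buffer reaches the disk */
        r = dm_bufio_write_dirty_buffers(c);
    out:
        dm_bufio_client_destroy(c);
        return r;
    }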

The dm-bufio interface allows you to do cached I/O on devices and acts as a cache, holding recently-read blocks in memory and performing delayed writes.

Usage can be viewed and changed in /sys/module/dm_bufio/parameters/

We don't use buffer cache or page cache already present in the kernel, because:
    - we need to handle block sizes larger than a page
    - we can't allocate memory to perform reads or we'd have deadlocks

Currently, when a cache is required, we limit its size to a fraction of available memory.

To avoid deadlocks, these conditions are observed:
    - At most one thread can hold at most "reserved_buffers" simultaneously.
    - Each other thread can hold at most one buffer.
    - Threads which call only dm_bufio_get can hold an unlimited number of buffers.

The delayed work that periodically ages out old buffers is set up at module init:

    dm_bufio_wq = alloc_workqueue("dm_bufio_cache", WQ_MEM_RECLAIM, 0);
    INIT_DELAYED_WORK(&dm_bufio_work, work_fn);
    queue_delayed_work(dm_bufio_wq, &dm_bufio_work, DM_BUFIO_WORK_TIMER_SECS * HZ);
        Every DM_BUFIO_WORK_TIMER_SECS seconds, dm_bufio_wq runs cleanup_old_buffers(): for every dm_bufio_client in the system it trims the buffer memory via __evict_old_buffers.
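The work function re-arms itself, so this aging keeps running for the lifetime of the module; roughly (a sketch of work_fn):

    static void work_fn(struct work_struct *w)
    {
        /* evict buffers older than dm_bufio_max_age in every client */
        cleanup_old_buffers();

        /* re-arm the delayed work for the next pass */
        queue_delayed_work(dm_bufio_wq, &dm_bufio_work,
                           DM_BUFIO_WORK_TIMER_SECS * HZ);
    }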

I/O on the buffer

Bio interface is faster but it has some problems:
    - the vector list is limited (increasing this limit increases memory-consumption per buffer, so it is not viable)
    - the memory must be direct-mapped, not vmalloced

If the buffer is small enough (up to DM_BUFIO_INLINE_VECS pages) and it is not vmalloced, try using the bio interface.

If the buffer is big, if it is vmalloced or if the underlying device rejects the bio because it is too large, use dm-io layer to do the I/O. The dm-io layer splits the I/O into multiple requests, avoiding the above shortcomings.
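The choice therefore boils down to a check like the following (a simplified sketch of the submit path, not the exact driver code):

    static void submit_io(struct dm_buffer *b, int rw,
                          void (*end_io)(struct dm_buffer *, blk_status_t))
    {
        /* small and direct-mapped: build an inline bio directly */
        if (b->data_mode != DATA_MODE_VMALLOC &&
            b->c->block_size <= DM_BUFIO_INLINE_VECS * PAGE_SIZE)
            use_bio(b, rw, end_io);
        else
            /* big or vmalloced: let dm-io split it into multiple bios */
            use_dmio(b, rw, end_io);
    }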

write dirty buffers

__write_dirty_buffer: initiate a write on a dirty buffer, but don't wait for it. If there is a previous write in progress, wait for it to finish (we can't have two writes on the same buffer simultaneously): wait_on_bit_lock_io(&b->state, B_WRITING, TASK_UNINTERRUPTIBLE). Then submit our write without waiting on it; B_WRITING is set to indicate that a write is in progress, so no other process can write this buffer at the same time.

write_endio clears the B_WRITING bit and wakes anyone who was waiting on it.
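A simplified sketch of this serialization (memory barriers, the partial-dirty ranges and the async write_list handling are omitted):

    static void write_endio(struct dm_buffer *b, blk_status_t status);

    static void __write_dirty_buffer(struct dm_buffer *b)
    {
        if (!test_bit(B_DIRTY, &b->state))
            return;

        /* wait for any previous write, then take ownership of B_WRITING */
        wait_on_bit_lock_io(&b->state, B_WRITING, TASK_UNINTERRUPTIBLE);
        clear_bit(B_DIRTY, &b->state);

        submit_io(b, REQ_OP_WRITE, write_endio);   /* asynchronous */
    }

    static void write_endio(struct dm_buffer *b, blk_status_t status)
    {
        b->write_error = status;
        clear_bit(B_WRITING, &b->state);
        wake_up_bit(&b->state, B_WRITING);   /* unblock waiters */
    }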

Memory management policy

Limit the number of buffers to DM_BUFIO_MEMORY_PERCENT of main memory or DM_BUFIO_VMALLOC_PERCENT of vmalloc memory (whichever is lower).

    mem = (__u64)mult_frac(totalram_pages - totalhigh_pages,
                DM_BUFIO_MEMORY_PERCENT, 100) << PAGE_SHIFT;
    if (mem > ULONG_MAX)
                mem = ULONG_MAX;
    #ifdef CONFIG_MMU
    if (mem > mult_frac(VMALLOC_TOTAL, DM_BUFIO_VMALLOC_PERCENT, 100))
                mem = mult_frac(VMALLOC_TOTAL, DM_BUFIO_VMALLOC_PERCENT, 100);
    #endif

Always allocate at least DM_BUFIO_MIN_BUFFERS buffers. Start background writeback when dirty buffers exceed DM_BUFIO_WRITEBACK_PERCENT of the per-client buffer limit.

    #define DM_BUFIO_MIN_BUFFERS        8
    #define DM_BUFIO_MEMORY_PERCENT     2
    #define DM_BUFIO_VMALLOC_PERCENT    25
    #define DM_BUFIO_WRITEBACK_PERCENT  75
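Putting the constants together: the per-client byte budget (dm_bufio_cache_size_per_client) is translated into a buffer-count limit and a writeback threshold roughly like this (a sketch following the driver's __get_memory_limit; the non-power-of-2 block-size path and the latch refresh are omitted):

    static void __get_memory_limit(struct dm_bufio_client *c,
                                   unsigned long *threshold_buffers,
                                   unsigned long *limit_buffers)
    {
        unsigned long buffers;

        /* bytes per client -> number of block-sized buffers */
        buffers = dm_bufio_cache_size_per_client >>
                  (c->sectors_per_block_bits + SECTOR_SHIFT);

        if (buffers < c->minimum_buffers)
            buffers = c->minimum_buffers;

        *limit_buffers = buffers;
        /* start background writeback once this share of the budget is dirty */
        *threshold_buffers = buffers * DM_BUFIO_WRITEBACK_PERCENT / 100;
    }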

API

    /* Create a buffered IO cache on a given device */
    struct dm_bufio_client *dm_bufio_client_create(
                struct block_device *bdev, unsigned block_size,
                unsigned reserved_buffers, unsigned aux_size,
                void (*alloc_callback)(struct dm_buffer *),
                void (*write_callback)(struct dm_buffer *));
            @block_size must be a multiple of 512, but not necessarily a power of 2
            if (is_power_of_2(block_size))
                c->sectors_per_block_bits = __ffs(block_size) - SECTOR_SHIFT;
            else
                c->sectors_per_block_bits = -1;
            c->need_reserved_buffers = @reserved_buffers
            c->minimum_buffers = DM_BUFIO_MIN_BUFFERS
            c->dm_io = dm_io_client_create()

            if (block_size <= KMALLOC_MAX_SIZE && (block_size < PAGE_SIZE || !is_power_of_2(block_size)))
                create c->slab_cache "dm_bufio_cache-<block_size>" with object size @block_size

            if (aux_size)
                create c->slab_buffer "dm_bufio_buffer-<aux_size>"
            else
                create c->slab_buffer "dm_bufio_buffer"
            object size: sizeof(struct dm_buffer) + aux_size

            while (c->need_reserved_buffers)
                struct dm_buffer *b = alloc_buffer(c, GFP_KERNEL);
                __free_buffer_wake(b);

            c->shrinker.count_objects = dm_bufio_shrink_count;
            c->shrinker.scan_objects = dm_bufio_shrink_scan;
            c->shrinker.seeks = 1;
            c->shrinker.batch = 0;
            register_shrinker(&c->shrinker)

            mutex_lock(&dm_bufio_clients_lock);
            dm_bufio_client_count++;
            list_add(&c->client_list, &dm_bufio_all_clients);
            __cache_size_refresh();
            mutex_unlock(&dm_bufio_clients_lock);

    /* Release a buffered IO cache. */
    void dm_bufio_client_destroy(struct dm_bufio_client *c);
            frees the memory and updates the global counters

    /*
     * Set the sector range.
     * When this function is called, there must be no I/O in progress
     * on the bufio client.
     */
    void dm_bufio_set_sector_offset(struct dm_bufio_client *c, sector_t start);
            sets @c->start = @start


    enum new_flag {
        NF_FRESH    = 0,
        NF_READ     = 1,
        NF_GET      = 2,
        NF_PREFETCH = 3
    };

    /*
     * Read a given block from disk. Returns pointer to data. 
     * Returns a pointer to dm_buffer that can be used to release the buffer
     * or to make it dirty.
     */
    void *dm_bufio_read(struct dm_bufio_client *c, sector_t block, struct dm_buffer **bp);
                BUG_ON(dm_bufio_in_request())

    /*
     * Like dm_bufio_read, but return buffer from cache, don't read
     * it. If the buffer is not in the cache, return NULL.
     */
    void *dm_bufio_get(struct dm_bufio_client *c, sector_t block, struct dm_buffer **bp);

    /*
     * Like dm_bufio_read, but don't read anything from the disk.  It is
     * expected that the caller initializes the buffer and marks it dirty.
     */
    void *dm_bufio_new(struct dm_bufio_client *c, sector_t block, struct dm_buffer **bp);

    /*
     * Prefetch the specified blocks to the cache.
     * The function starts to read the blocks and returns without waiting for
     * I/O to finish.
     */
    void dm_bufio_prefetch(struct dm_bufio_client *c, sector_t block, unsigned n_blocks);
                BUG_ON(dm_bufio_in_request())
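
            /*
             * Hypothetical usage sketch (not from the driver): warm up the
             * cache for a range of blocks, then read them one by one; blocks
             * already prefetched are served from the cache.
             */
            static void example_scan(struct dm_bufio_client *c,
                                     sector_t first_block, unsigned n_blocks)
            {
                struct dm_buffer *b;
                void *data;
                unsigned i;

                dm_bufio_prefetch(c, first_block, n_blocks);
                for (i = 0; i < n_blocks; i++) {
                    data = dm_bufio_read(c, first_block + i, &b);
                    if (IS_ERR(data))
                        continue;
                    /* ... inspect data ... */
                    dm_bufio_release(b);
                }
            }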

    /*
     * Release a reference obtained with dm_bufio_{read,get,new}. The data
     * pointer and dm_buffer pointer is no longer valid after this call.
     */
    void dm_bufio_release(struct dm_buffer *b);

    /*
     * Mark a buffer dirty. It should be called after the buffer is modified.
     *
     * In case of memory pressure, the buffer may be written after
     * dm_bufio_mark_buffer_dirty, but before dm_bufio_write_dirty_buffers.  So
     * dm_bufio_write_dirty_buffers guarantees that the buffer is on-disk but
     * the actual writing may occur earlier.
     */
    void dm_bufio_mark_buffer_dirty(struct dm_buffer *b);

    /*
     * Mark a part of the buffer dirty.
     *
     * The specified part of the buffer is scheduled to be written. dm-bufio may
     * write the specified part of the buffer or it may write a larger superset.
     */
    void dm_bufio_mark_partial_buffer_dirty(struct dm_buffer *b,
                                unsigned start, unsigned end);

    /* Initiate writing of dirty buffers, without waiting for completion. */
    void dm_bufio_write_dirty_buffers_async(struct dm_bufio_client *c);

    /*
     * Write all dirty buffers. Guarantees that all dirty buffers created prior
     * to this call are on disk when this call exits.
     */
    int dm_bufio_write_dirty_buffers(struct dm_bufio_client *c);

    /* Send an empty write barrier to the device to flush hardware disk cache. */
    int dm_bufio_issue_flush(struct dm_bufio_client *c);        REQ_PREFLUSH | REQ_SYNC
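
            /*
             * Hypothetical commit pattern (sketch): a metadata-style target
             * typically flushes its dirty buffers and then issues an empty
             * flush so the data is stable on media before committing.
             */
            static int example_commit(struct dm_bufio_client *c, struct dm_buffer *b)
            {
                int r;

                dm_bufio_mark_buffer_dirty(b);
                dm_bufio_release(b);

                r = dm_bufio_write_dirty_buffers(c);
                if (!r)
                    r = dm_bufio_issue_flush(c);
                return r;
            }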

    /*
     * Like dm_bufio_release but also move the buffer to the new
     * block. dm_bufio_write_dirty_buffers is needed to commit the new block.
     */
    void dm_bufio_release_move(struct dm_buffer *b, sector_t new_block);
            We first delete any other buffer that may be at that new location.
            Then, we write the buffer to the original location if it was dirty.
            Then, if we are the only one who is holding the buffer, relink the buffer 
            in the buffer tree for the new location.
            If there was someone else holding the buffer, we write it to the new 
            location but not relink it, because that other user needs to have the 
            buffer at the same place.

    /*
     * Free the given buffer.
     * This is just a hint, if the buffer is in use or dirty, this function
     * does nothing.
     */
    void dm_bufio_forget(struct dm_bufio_client *c, sector_t block);


    /* Set the minimum number of buffers before cleanup happens. */
    void dm_bufio_set_minimum_buffers(struct dm_bufio_client *c, unsigned n);
            sets struct dm_bufio_client->minimum_buffers = n

    unsigned dm_bufio_get_block_size(struct dm_bufio_client *c);
            returns struct dm_bufio_client->block_size

    sector_t dm_bufio_get_device_size(struct dm_bufio_client *c);
            the number of blocks corresponding to i_size_read(c->bdev->bd_inode)

    sector_t dm_bufio_get_block_number(struct dm_buffer *b);
            returns the block nr the dm_buffer maps to, i.e. @b->block

    void *dm_bufio_get_block_data(struct dm_buffer *b);
            returns the data pointer of the dm_buffer, i.e. @b->data

    void *dm_bufio_get_aux_data(struct dm_buffer *b);
            returns the aux data pointer of the dm_buffer; the aux area immediately follows the dm_buffer header, i.e. @b + 1

    struct dm_bufio_client *dm_bufio_get_client(struct dm_buffer *b);
            returns the client this dm_buffer belongs to, i.e. @b->c
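
The aux area is convenient for per-buffer bookkeeping; a hypothetical sketch (the struct name and field are invented for illustration):

    struct my_aux {                          /* hypothetical per-buffer metadata */
        unsigned validated;
    };

    static void example_aux(struct block_device *bdev, sector_t block)
    {
        struct dm_bufio_client *c;
        struct dm_buffer *b;
        void *data;

        /* pass sizeof(struct my_aux) as aux_size when creating the client */
        c = dm_bufio_client_create(bdev, 4096, 1, sizeof(struct my_aux), NULL, NULL);
        if (IS_ERR(c))
            return;

        data = dm_bufio_read(c, block, &b);
        if (!IS_ERR(data)) {
            struct my_aux *aux = dm_bufio_get_aux_data(b);   /* == (void *)(b + 1) */
            aux->validated = 1;
            dm_bufio_release(b);
        }

        dm_bufio_client_destroy(c);
    }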

Global counters

    /* The current number of clients. */
    static int dm_bufio_client_count;

    /* The list of all clients. */
    static LIST_HEAD(dm_bufio_all_clients);

    /* Default cache size: available memory divided by the ratio. */
    static unsigned long dm_bufio_default_cache_size;       initialized at module init to the memory limit computed above

    /* Total cache size set by the user. */
    static unsigned long dm_bufio_cache_size;

    /*
     * A copy of dm_bufio_cache_size because dm_bufio_cache_size can change
     * at any time.  If it disagrees, the user has changed cache size.
     */
    static unsigned long dm_bufio_cache_size_latch;

    /*
     * Per-client cache: dm_bufio_cache_size / dm_bufio_client_count
     */
    static unsigned long dm_bufio_cache_size_per_client;
                = dm_bufio_cache_size_latch / (dm_bufio_client_count ? : 1)

    /*
     * This mutex protects dm_bufio_cache_size_latch,
     * dm_bufio_cache_size_per_client and dm_bufio_client_count
     */
    static DEFINE_MUTEX(dm_bufio_clients_lock);

    static unsigned long dm_bufio_peak_allocated;
    static unsigned long dm_bufio_allocated_kmem_cache;
    static unsigned long dm_bufio_allocated_get_free_pages;
    static unsigned long dm_bufio_allocated_vmalloc;
    static unsigned long dm_bufio_current_allocated;

    /* Check buffer ages in this interval (seconds) */
    #define DM_BUFIO_WORK_TIMER_SECS        30
    /* Free buffers when they are older than this (seconds) */
    #define DM_BUFIO_DEFAULT_AGE_SECS       300
    /* The nr of bytes of cached data to keep around. */
    #define DM_BUFIO_DEFAULT_RETAIN_BYTES   (256 * 1024)
    /*
     * Align buffer writes to this boundary.
     * Tests show that SSDs have the highest IOPS when using 4k writes.
     */
    #define DM_BUFIO_WRITE_ALIGN            4096

    /* Buffers are freed after this timeout */
    static unsigned dm_bufio_max_age = DM_BUFIO_DEFAULT_AGE_SECS;

    static unsigned long dm_bufio_retain_bytes = DM_BUFIO_DEFAULT_RETAIN_BYTES;

static void adjust_total_allocated(unsigned char data_mode, long diff): adjusts the global counters (dm_bufio_allocated_kmem_cache, dm_bufio_allocated_get_free_pages, dm_bufio_allocated_vmalloc, dm_bufio_current_allocated, dm_bufio_peak_allocated) as memory is allocated and freed.
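Roughly (a sketch; param_spinlock stands for the driver's static spinlock protecting these counters):

    static void adjust_total_allocated(unsigned char data_mode, long diff)
    {
        /* one counter per allocation class, plus a running total and peak */
        static unsigned long * const class_ptr[DATA_MODE_LIMIT] = {
            &dm_bufio_allocated_kmem_cache,
            &dm_bufio_allocated_get_free_pages,
            &dm_bufio_allocated_vmalloc,
        };

        spin_lock(&param_spinlock);

        *class_ptr[data_mode] += diff;
        dm_bufio_current_allocated += diff;
        if (dm_bufio_current_allocated > dm_bufio_peak_allocated)
            dm_bufio_peak_allocated = dm_bufio_current_allocated;

        spin_unlock(&param_spinlock);
    }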

module param

    module_param_named(max_cache_size_bytes, dm_bufio_cache_size, ulong, S_IRUGO | S_IWUSR);
            "Size of metadata cache"

    module_param_named(max_age_seconds, dm_bufio_max_age, uint, S_IRUGO | S_IWUSR);
            "Max age of a buffer in seconds"

    module_param_named(retain_bytes, dm_bufio_retain_bytes, ulong, S_IRUGO | S_IWUSR);
            "Try to keep at least this many bytes cached in memory"

    module_param_named(peak_allocated_bytes, dm_bufio_peak_allocated, ulong, S_IRUGO | S_IWUSR);
            "Tracks the maximum allocated memory"

    module_param_named(allocated_kmem_cache_bytes, dm_bufio_allocated_kmem_cache, ulong, S_IRUGO);
            "Memory allocated with kmem_cache_alloc"

    module_param_named(allocated_get_free_pages_bytes, dm_bufio_allocated_get_free_pages, ulong, S_IRUGO);
            "Memory allocated with get_free_pages"

    module_param_named(allocated_vmalloc_bytes, dm_bufio_allocated_vmalloc, ulong, S_IRUGO);
            "Memory allocated with vmalloc"

    module_param_named(current_allocated_bytes, dm_bufio_current_allocated, ulong, S_IRUGO);
            "Memory currently used by the cache"

struct dm_bufio_client

    /* dm_buffer->list_mode */
    #define LIST_CLEAN  0
    #define LIST_DIRTY  1
    #define LIST_SIZE   2

    /* Buffer state bits. */
    #define B_READING   0
    #define B_WRITING   1
    #define B_DIRTY     2


    /*
     * Linking of buffers:
     *  All buffers are linked to buffer_tree with their node field.
     *
     *  Clean buffers that are not being written (B_WRITING not set)
     *  are linked to lru[LIST_CLEAN] with their lru_list field.
     *
     *  Dirty and clean buffers that are being written are linked to
     *  lru[LIST_DIRTY] with their lru_list field. When the write
     *  finishes, the buffer cannot be relinked immediately (because we
     *  are in an interrupt context and relinking requires process
     *  context), so some clean-not-writing buffers can be held on
     *  dirty_lru too. They are later added to lru in the process
     *  context.
     */
    struct dm_bufio_client {
        struct mutex            lock;

        struct list_head        lru[LIST_SIZE];
                                        when alloc_buffer fails, __get_unclaimed_buffer
                                        tries to take an existing dm_buffer from lru[LIST_CLEAN]
                                        or lru[LIST_DIRTY], makes sure it is clean, and reuses
                                        it to satisfy the allocation
        unsigned long           n_buffers[LIST_SIZE];   length of the corresponding lru[] list

        struct block_device     *bdev;
        unsigned                block_size;
        s8                      sectors_per_block_bits;
        void (*alloc_callback)(struct dm_buffer *);
        void (*write_callback)(struct dm_buffer *);

        struct kmem_cache       *slab_buffer;   object size: sizeof(struct dm_buffer) + aux_size
        struct kmem_cache       *slab_cache;    object size: block_size

        struct dm_io_client     *dm_io;

        struct list_head        reserved_buffers;
        unsigned                need_reserved_buffers;
                                        each time a dm_buffer is taken out of reserved_buffers,
                                        ->need_reserved_buffers is incremented;
                                        when __free_buffer_wake frees a buffer it considers
                                        refilling this list, decrementing ->need_reserved_buffers if it does
        unsigned                minimum_buffers;

        struct rb_root          buffer_tree;
                                    red/black tree acts as an index for all the buffers
        wait_queue_head_t       free_buffer_wait;

        sector_t                start;

        int                     async_write_error;

        struct list_head        client_list;    linked into dm_bufio_all_clients
        struct shrinker         shrinker;
    };

static void __cache_size_refresh(void): called when the number of clients changes to recalculate the per-client limit, updating dm_bufio_cache_size_latch and dm_bufio_cache_size_per_client.
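Roughly (a sketch; the real function also writes the default back to dm_bufio_cache_size the first time):

    static void __cache_size_refresh(void)
    {
        BUG_ON(!mutex_is_locked(&dm_bufio_clients_lock));

        dm_bufio_cache_size_latch = READ_ONCE(dm_bufio_cache_size);
        if (!dm_bufio_cache_size_latch) {
            /* the user has not set a size: fall back to the default */
            dm_bufio_cache_size_latch = dm_bufio_default_cache_size;
        }

        dm_bufio_cache_size_per_client = dm_bufio_cache_size_latch /
                                         (dm_bufio_client_count ? : 1);
    }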

static struct dm_buffer *alloc_buffer(struct dm_bufio_client *c, gfp_t gfp_mask): allocate a buffer and its ->data, then call adjust_total_allocated to update the global counters.

static void free_buffer(struct dm_buffer *b): call adjust_total_allocated to update the global counters, then free ->data and the buffer structure.

struct dm_buffer

    /*
     * Describes how the block was allocated:
     * kmem_cache_alloc(), __get_free_pages() or vmalloc().
     * See the comment at alloc_buffer_data.
     */
    enum data_mode {
        DATA_MODE_SLAB              = 0,
        DATA_MODE_GET_FREE_PAGES    = 1,
        DATA_MODE_VMALLOC           = 2,
        DATA_MODE_LIMIT             = 3
    };

    struct dm_buffer {
        struct rb_node      node;
        struct list_head    lru_list;

        sector_t            block;
        void                *data;
        unsigned char       data_mode;      /* DATA_MODE_* */

        unsigned char       list_mode;      /* LIST_* */

        blk_status_t        read_error;
        blk_status_t        write_error;

        unsigned            hold_count;     reference count; > 0 means the buffer is in use
        unsigned long       state;
                                            /* Buffer state bits. */
                                            #define B_READING   0
                                            #define B_WRITING   1
                                            #define B_DIRTY     2
        unsigned long       last_accessed;      jiffies timestamp of the last link/relink to the rbtree and the clean or dirty lru
        unsigned            dirty_start;
        unsigned            dirty_end;
        unsigned            write_start;
        unsigned            write_end;

        struct dm_bufio_client  *c;

        struct list_head    write_list;
        void (*end_io)(struct dm_buffer *, blk_status_t);
    #ifdef CONFIG_DM_DEBUG_BLOCK_STACK_TRACING
    #define MAX_STACK 10
        struct stack_trace  stack_trace;
        unsigned long       stack_entries[MAX_STACK];
    #endif
    };

static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask, unsigned char *data_mode) allocates the data area of a buffer. Small buffers are allocated with kmem_cache, to use space optimally. For large buffers, we choose between get_free_pages and vmalloc; each has advantages and disadvantages:

    - __get_free_pages can randomly fail if the memory is fragmented.
    - __vmalloc won't randomly fail, but vmalloc space is limited (it may be as low as 128M), so using it for caching is not appropriate.
    - If the allocation may fail, we use __get_free_pages. Memory fragmentation won't have a fatal effect here; it just causes flushes of some other buffers and more I/O. Don't use __get_free_pages if it always fails (i.e. order >= MAX_ORDER).
    - If the allocation shouldn't fail, we use __vmalloc. This is only for the initial reserve allocation, so there's no risk of wasting all vmalloc space. __vmalloc allocates the data pages and auxiliary structures with the gfp flags that were specified, but pagetables are always allocated with GFP_KERNEL no matter what gfp_mask says; consequently we must set the per-process flag PF_MEMALLOC_NOIO so that all allocations done by this process (including pagetables) behave as if GFP_NOIO had been specified.

In short: the slab cache is tried first; if no slab cache was created for this client, one of the other two allocators is used: if c->block_size <= KMALLOC_MAX_SIZE && gfp_mask & __GFP_NORETRY the data is allocated with __get_free_pages, otherwise with __vmalloc.
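A sketch of that decision tree (PF_MEMALLOC_NOIO handling is omitted; the three-argument __vmalloc signature matches the kernel era these notes are based on):

    static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
                                   unsigned char *data_mode)
    {
        /* small / odd-sized blocks: a per-client slab cache was created */
        if (c->slab_cache) {
            *data_mode = DATA_MODE_SLAB;
            return kmem_cache_alloc(c->slab_cache, gfp_mask);
        }

        /* allocation that is allowed to fail: try the buddy allocator */
        if (c->block_size <= KMALLOC_MAX_SIZE && (gfp_mask & __GFP_NORETRY)) {
            *data_mode = DATA_MODE_GET_FREE_PAGES;
            return (void *)__get_free_pages(gfp_mask, get_order(c->block_size));
        }

        /* allocation that must not fail (reserve buffers): use vmalloc */
        *data_mode = DATA_MODE_VMALLOC;
        return __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);
    }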

static void free_buffer_data(struct dm_bufio_client *c, void *data, unsigned char data_mode) frees the buffer's data area: DATA_MODE_SLAB via kmem_cache_free, DATA_MODE_GET_FREE_PAGES via free_pages, DATA_MODE_VMALLOC via vfree.