The multiqueue block layer
Jonathan Corbet, June 5, 2013, lwn.net
References:
- Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems
- null_blk.c
The kernel's block layer is charged with managing I/O to the system's block ("disk drive") devices. It was designed in an era when a high-performance drive could handle hundreds of I/O operations per second (IOPs); the fact that it tends to fall down with modern devices, capable of handling possibly millions of IOPs, is thus not entirely surprising. It has been known for years that significant changes would need to be made to enable Linux to perform well on fast solid-state devices, which place far higher efficiency demands on the block layer. The shape of those changes is becoming clearer as the multiqueue block layer patch set, primarily the work of Jens Axboe and Shaohua Li, gets closer to being ready for mainline merging.
The basic structure of the block layer has not changed a whole lot since it was described for 2.6.10 in Linux Device Drivers. It offers two ways for a block driver to hook into the system, one of which is the "request" interface. When run in this mode, the block layer maintains a simple request queue; new I/O requests are submitted to the tail of the queue and the driver receives requests from the head. While requests sit in the queue, the block layer can operate on them in a number of ways: they can be reordered to minimize seek operations, adjacent requests can be coalesced into larger operations, and policies for fairness and bandwidth limits can be applied, for example.
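To make the request mode concrete, here is a minimal sketch of such a driver's core, in the style of the LDD3 "sbull" example; sbd_request(), sbd_transfer(), and sbd_lock are hypothetical names, and error handling and the actual data transfer are omitted:

```c
/*
 * Hedged sketch of a request-mode block driver.  The function and lock
 * names are hypothetical; error handling and data transfer are omitted.
 */
#include <linux/blkdev.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(sbd_lock);	/* protects the single request queue */

static void sbd_transfer(struct request *rq)
{
	/* Move data to/from the device for each segment of the request. */
}

/* Called by the block layer, with sbd_lock held, to drain the queue. */
static void sbd_request(struct request_queue *q)
{
	struct request *rq;

	/* Requests come off the head of the one shared queue. */
	while ((rq = blk_fetch_request(q)) != NULL) {
		sbd_transfer(rq);
		__blk_end_request_all(rq, 0);	/* report success */
	}
}

static struct request_queue *sbd_setup_queue(void)
{
	/* blk_init_queue() provides the queue, I/O scheduler and all. */
	return blk_init_queue(sbd_request, &sbd_lock);
}
```

The lock handed to blk_init_queue() here is the single per-queue lock discussed in the next paragraph.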
This request queue turns out to be one of the biggest bottlenecks in the entire system. It is protected by a single lock which, on a large system, will bounce frequently between the processors. It is a linked list, a notably cache-unfriendly data structure, especially when modifications must be made — as they frequently are in the block layer. As a result, anybody trying to develop a driver for high-performance storage devices wants to do away with this request queue and replace it with something better.
The second block driver mode — the "make request" interface — allows a driver to do exactly that. It hooks the driver into a much higher part of the stack, shorting out the request queue and handing I/O requests directly to the driver. This interface was not originally intended for high-performance drivers; instead, it is there for stacked drivers (the MD RAID implementation, for example) that need to process requests before passing them on to the real, underlying device. Using it in other situations incurs a substantial cost: all of the other queue processing done by the block layer is lost and must be reimplemented in the driver.
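As a rough illustration (the prototypes shown match the 3.x kernels of this era and changed in later releases), a make-request driver looks something like the following; ramdev_make_request() and ramdev_handle_bio() are hypothetical names:

```c
/*
 * Hedged sketch of a make-request driver.  Prototypes match the 3.x
 * kernels of this era; the ramdev_* names are hypothetical.
 */
#include <linux/blkdev.h>
#include <linux/bio.h>

static void ramdev_handle_bio(struct bio *bio)
{
	/* Service the bio directly, e.g. by copying to or from device memory. */
}

/* Receives bios straight from the upper layers; no request queue is involved. */
static void ramdev_make_request(struct request_queue *q, struct bio *bio)
{
	ramdev_handle_bio(bio);
	bio_endio(bio, 0);		/* signal completion, no error */
}

static struct request_queue *ramdev_setup_queue(void)
{
	struct request_queue *q = blk_alloc_queue(GFP_KERNEL);

	if (q)
		blk_queue_make_request(q, ramdev_make_request);
	return q;
}
```

Everything the request queue would normally provide (merging, sorting, throttling) is simply absent here, which is the cost described above.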
The multiqueue block layer work tries to fix this problem by adding a third mode for drivers to use. In this mode, the request queue is split into a number of separate queues: multiple submission queues and one or more hardware dispatch queues.
- Submission queues are set up on a per-CPU or per-node basis. Each CPU submits I/O operations into its own queue, with no interaction with the other CPUs. Contention for the submission queue lock is thus eliminated (when per-CPU queues are used) or greatly reduced (for per-node queues).
- One or more hardware dispatch queues simply buffer I/O requests for the driver.
While requests are in the submission queue, they can be operated on by the block layer in the usual manner. Solid-state devices access data quite differently from rotating drives, though: reordering requests for locality offers little or no benefit; indeed, spreading requests out across the device might help with the parallel processing of requests. So reordering will not be done, but coalescing requests will reduce the total number of I/O operations, improving performance somewhat. Since the submission queues are per-CPU, there is no way to coalesce requests submitted to different queues. With no empirical evidence whatsoever, your editor would guess that adjacent requests are most likely to come from the same process and, thus, will automatically find their way into the same submission queue, so the lack of cross-CPU coalescing is probably not a big problem.
The block layer will move requests from the submission queues into the hardware queues up to the maximum number specified by the driver. Most current devices will have a single hardware queue, but high-end devices already support multiple queues to increase parallelism. On such a device, the entire submission and completion path should be able to run on the same CPU as the process generating the I/O, maximizing cache locality (and, thus, performance). If desired, fairness or bandwidth-cap policies can be applied as requests move to the hardware queues, but there will be an associated performance cost. Given the speed of high-end devices, it may not be worthwhile to try to ensure fairness between users; everybody should be able to get all the I/O bandwidth they can use.
This structure makes the writing of a high-performance block driver (relatively) simple. The driver provides a queue_rq() function to handle incoming requests and calls back to the block layer when requests complete. Those wanting to look at an example of how such a driver would work can see null_blk.c in the new-queue branch of Jens's block repository:
`git://git.kernel.dk/linux-block.git`
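For a rough sense of the shape of such a driver, the per-request path reduces to a queue_rq() callback registered through a blk_mq_ops structure. The sketch below follows the API roughly as it eventually appeared in mainline; the structure names and return codes in the pre-merge branch differed and changed again in later kernels, so treat the details as illustrative:

```c
/*
 * Hedged sketch of a multiqueue driver's core, loosely modeled on
 * null_blk.  Registration details (number of hardware queues, queue
 * depth) are omitted, and the exact API differed in the pre-merge
 * branch and in later kernels.
 */
#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/* Invoked for each request handed to a hardware dispatch queue. */
static int demo_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
	/*
	 * Start the I/O described by rq.  A real driver would normally
	 * complete the request asynchronously from its interrupt (or
	 * completion-queue) handler; completing inline keeps the sketch
	 * short.
	 */
	blk_mq_end_io(rq, 0);
	return BLK_MQ_RQ_QUEUE_OK;
}

static struct blk_mq_ops demo_mq_ops = {
	.queue_rq	= demo_queue_rq,
	.map_queue	= blk_mq_map_queue,	/* default CPU-to-queue mapping */
};
```

The number of hardware queues and their depth are declared at registration time, which is where a multiqueue-capable device would expose its parallelism to the block layer.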
In the current patch set, the multiqueue mode is offered in addition to the existing two modes, so current drivers will continue to work without change. According to this paper on the multiqueue block layer design [PDF], the hope is that drivers will migrate over to the multiqueue API, allowing the eventual removal of the request-based mode.
This patch set has been significantly reworked in the last month or so; it has gone from a relatively messy series into something rather cleaner. Merging into the mainline would thus appear to be on the agenda for the near future. Since use of this API is optional, existing drivers should continue to work and this merge could conceivably happen as early as 3.11. But, given that the patch set has not yet been publicly posted to any mailing list and does not appear in linux-next, 3.12 seems like a more likely target. Either way, Linux seems likely to have a much better block layer by the end of the year or so.
Comments
Hypothetical programs might desire cross-CPU coalescing of IO
Posted Jun 6, 2013 5:38 UTC (Thu) by faramir
Our worthy editor suggests that not coalescing IO requests across CPUs is probably not a big problem. If we restrict ourselves to the original model of UNIX computation (single process/private memory space), I would agree.
If we consider multiple processes with synchronization (perhaps via shared memory), or multi-threaded programs, I'm not so sure. Imagine some kind of processing of file data which can be done a block at a time (say, a block-based cipher or non-adaptive compression). A multi-threaded/multi-process version of such a program may in fact be running code on multiple CPUs but reading from/writing to the same files (and therefore generating coalescible IO requests). Reads from the input file could come from any of the threads/processes engaged in the task.
In the case of compression, due to the variable-length output chunks, the writer side will have to be coalesced into a single stream in the program itself in order to put the output in the correct order. That might be done, though, by having a management thread simply inform each compression thread when to write, so the actual write calls might still come from different CPUs.
A block-based cipher program could probably use lseek() on multiple fds opened to the same output file to maintain correct ordering from each thread.
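A minimal, hypothetical sketch of that approach (pwrite() is used here; per-thread file descriptors with lseek() and write() would have the same effect):

```c
/*
 * Each worker thread writes its finished block into that block's own
 * slot in the shared output file, so output ordering needs no
 * coordination between threads.  Hypothetical example code.
 */
#include <unistd.h>

#define BLOCK_SIZE 4096

static int write_block(int out_fd, const void *buf, unsigned long block_no)
{
	off_t off = (off_t)block_no * BLOCK_SIZE;

	return pwrite(out_fd, buf, BLOCK_SIZE, off) == BLOCK_SIZE ? 0 : -1;
}
```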
In either case, it would appear that coalescing across CPUs would be useful, at least if the actual processing required was negligible relative to IO time. It may be that CPUs aren't fast enough to do this for anything beyond ROT13 encryption or simple RLE compression, so it might not matter for now. But it would seem to at least be a theoretical issue.
Posted Jun 6, 2013 7:08 UTC (Thu) by dlang
but how many people are running such parallel processing of single files?
And of those who are doing so, how much do they care whether their files get processed one at a time using every CPU for that one file, or many files at a time with each CPU processing a different file (and therefore not needing the combined I/O logic)?
Yes, there are going to be some, but is it really worth crippling the more common cases to help this rare case?
Posted Jun 6, 2013 7:28 UTC (Thu) by dlang
a comment I posted elsewhere is also relevant to this discussion. I'll post a link rather than reposting the longer comment
in https://lwn.net/Articles/553086/ I talk about how I/O coalescing should only be done when the I/O is busy (among other things)
Posted Jun 6, 2013 12:06 UTC (Thu) by ejr
See uses of the HDF5 and NetCDF file formats. There are many software systems that store everything in a single, large file, emulating a filesystem more aligned with the higher-level semantics. Also, think of databases. Whether a case counts as rare or common depends on the application area.
But... Parallel HDF5, etc. interfaces handle some coalescing in the "middleware" layer. They've found that relying on the OS leads to, um, sufficient performance variation across different systems and configurations. Yet another standard parallel I/O layer in user-space could help more than trying to jam the solution into the OS.
But relying on user-space won't help the case multiqueue is attacking.
Posted Jun 7, 2013 2:27 UTC (Fri) by dlang
when you have things like databases where many applications are accessing one file, how common is it for the different applications to be making adjacent writes at the same time? It may happen, but it's not going to be the common case.
Posted Jun 7, 2013 10:06 UTC (Fri) by ejr
In my world (HPC), having multiple coordinated processes accessing slices of a range is the common case. We have to fight for anything else to have even reasonable performance support. See Lustre.
But this case often is better handled outside the OS. There could be a different interface where clients post their read buffers and the OS fills them in some hardware-optimal order, but that has been considered for more than 20 years and has no solution yet.
Posted Jun 7, 2013 18:18 UTC (Fri) by dlang
You are processing slices of a range, but are you really writing the adjacent slices at almost exactly the same time from multiple processes? That's what it would take to give the OS the chance to combine the output from the different processes into one write to disk.
Also, in HPC aren't you dealing with many different systems accessing your data over the network rather than multiple processes on one machine?
What we are talking about here is the chance that things running on different CPUs in a single system are generating disk writes that the OS could combine into a single write before sending it to the drives.
For reads this isn't as big a deal; readahead should go a long way towards making the issue moot.
Posted Jun 6, 2013 14:22 UTC (Thu) by axboe
If you write programs like that and expect IO performance to be stellar, you have a problem. It's already the case that Linux does not merge IO from independent threads, unless it just happens to detect this or is explicitly asked to by sharing an IO context.
So in general it's not a huge problem. For "legacy" devices that benefit a lot from merging, we can help them out a little bit. They will typically be single queue anyway, and merging at dispatch time on that queue would be trivial to do.
Posted Jun 6, 2013 14:38 UTC (Thu) by alankila
The algorithms used for data processing on modern disks are quite fast; for instance, lzo compression has bandwidth approaching 1 GB/s per core.
Posted Jun 14, 2013 6:39 UTC (Fri) by kleptog
Actually, I'd consider write combining something to be done by the disk itself. These days, with storage arrays having gigabytes of battery-backed storage, there's no real reason to have the kernel worry about things like write combining. Maybe if it affected the communication overhead, but that can be dealt with in other ways (like parallel submission).