explicit on-stack per-task plugging

The Discussion at the end is well worth reading: Neil Brown answers the questions raised in the comments.
A block layer introduction part 1: the bio layer, Neil Brown, October 25, 2017, lwn.net
Explicit block device plugging, Jens Axboe, April 13, 2011, lwn.net
No more global unplugging, corbet, March 10, 2004, lwn.net

Evolution of plugging

global plugging/unplugging : performance and scalability problem
- It has a single, global lock which keeps multiple processors from trying to restart the queues at the same time; this lock has become a bit of a contention point on some systems.
- A call to blk_run_queues() also restarts all block devices on the system, even though there is typically only one queue that truly needs to be unplugged.
per-queue plugging
- no way to unplug all devices when going to sleep waiting for page I/O. This meant that the virtual memory subsystem had to be able to unplug the specific device that would be servicing page I/O. A special hack was added for this: sync_page() in struct address_space_operations
  - the sync_page() hook was always hated by the memory management people
- MD/DM stack: these layers would in turn unplug any lower-level device, so the unplug event would percolate down the stack. The device was automatically plugged but had to be explicitly unplugged. Auto-unplugging was introduced to cope with this: some heuristics were added to auto-unplug the device if a certain depth of requests had been added, or if some period of time had passed before the unplug event was seen. While crude, this at least ensured that I/O would chug along if someone missed an unplug call after I/O submission.
  - The asymmetric nature of the API was always ugly and a source of bugs
- For fast devices such as SSDs, plugging had become a scalability problem: hacks were again added to avoid this.
  - Essentially we disabled plugging on solid-state devices that were able to do queueing, while plugging originally was a good win.
explicit on-stack per-task plugging
- commit 73c101011926 block: initial patch for on-stack per-task plugging
- Instead of maintaining these I/O fragments as shared state in the device, a new on-stack structure was created to contain this I/O for a short period, allowing the submitter to build up a small queue of related requests.
  - Tracking the plug inside the task structure of the current process is what makes auto-unplug possible: the queued I/O can be flushed automatically should the task end up blocking between the call to blk_start_plug() and blk_finish_plug(). If that happens, we want to ensure that pending I/O is sent off to the devices immediately. This is important from a performance perspective, but also to ensure that we don't deadlock. If the task is blocking for a memory allocation, memory management reclaim could end up wanting to free a page belonging to a request that is currently residing on our private plug. Similarly, the caller may itself end up waiting for some of the plugged I/O to finish. By flushing this list when the process goes to sleep, we avoid these types of deadlocks.
  - Sorting: Since the plug state is now device agnostic, we may end up in a situation where multiple devices have pending I/O on this plug list. These may end up on the plug list in an interleaved fashion, potentially causing blk_finish_plug() to grab and release the related queue locks multiple times. To avoid this problem, a should_sort flag in the blk_plug structure is used to keep track of whether we have I/O belonging to more than one distinct queue pending. If we do, the list is sorted to group identical queues together. This scales better than grabbing and releasing the same locks multiple times.
  - blk_delay_queue(queue, delay_in_msecs): unplug_fn() was removed, but some drivers used plugging to delay I/O operations in response to resource shortages. One example of that was the SCSI midlayer; if we failed to map a new SCSI request due to a memory shortage, the queue was plugged to ensure that we would call back into the dispatch functions later on. blk_delay_queue() covers that use case, but it must only be used for conditions where the caller doesn't necessarily know when that condition will change states. If resources internal to the driver cause it to need to halt operations for a while, it is more efficient to use blk_stop_queue() and blk_start_queue() to manage those directly.

Details given by Neil Brown
- https://lwn.net/Articles/438974
  - There is no unplugging timer. There was a timer in the previous code that would unplug after 3ms, but that was mainly a 'just in case' measure. Normally the unplug would happen much earlier. If it was only the timer that triggered an unplug then you are right, performance would be terrible.
  - plugging was only relevant when the device was idle. If the queue was not empty it would not get plugged. So with long, nearly continuous writes, the queue would only be plugged once at the very beginning.
    - I think it is very hard to imagine plugging slowing down even a very fast low latency device. The page cache has a bunch of pages that it wants to perform IO on and it assembles them into a big chunk and sends them all to the device. It is true that the first request won't get there quite as soon, but unless the device is as fast as memory, then it will still get there fast enough that you probably cannot measure the difference.
    - And if a device is as fast as memory, then it probably shouldn't be under the request_queue code at all - a driver more like the umem.c driver might be appropriate. It just takes bios directly and turns them into DMA descriptors. But even that uses plugging so it can start a chain of DMAs at once rather than just one.
  - The purpose of plugging is primarily about latency (note: on a traditional disk, moving the head is expensive, and plugging reduces the time spent on head movement), not bandwidth. If bandwidth is an issue, your queue will not be empty (note: plugging only takes effect while the disk is idle; a busy disk is never plugged, so bandwidth is unaffected), so the time it takes for a request to get to the front of the queue is long enough for any other related requests to be seen and merged so there is no point in plugging (at least the old style - the new style still brings benefits).
  - The new plugging code is a bit different. Rather than only plugging idle devices it doesn't really plug devices at all. It plugs request submitters instead. (So the new code is a lot closer to the page cache than the old code).
    - So when a process submits a request, it gets queued in the process, not in the device. Once a process has submitted all that it wants to submit (or whenever it schedules - to avoid deadlocks), the requests queued in the process are released.
    - So you get a similar effect on the starting transient - a large number of pre-sorted requests gets handled all at once. However there are other advantages as well in terms of lock contention.
- https://lwn.net/Articles/440619/
  - In the linux kernel, plugging is not timer based. (There was a timer in the previous implementation, but it was only a last-ditch unplug in case there were bugs: slow is better than frozen). I agree that having a timed unplug event doesn't make much sense
  - In the old code a device would plug whenever it wanted to which was typically when a new request arrived for an empty queue. It would then unplug as soon as some thread started waiting for a request on that device to complete.
  - The new plugging code is quite different. The unplug happens when the thread submitting requests has finished submitting a bunch of requests. It is explicit rather than using the heuristic of 'unplug when someone waits'. This means it happens a little bit sooner - there is never any timer delay at all.
  - Rather than thinking of it as 'plugging' it is probably best to think of it as early-aggregation
- https://lwn.net/Articles/440788/
  - a partial answer to "how can the kernel know when the application has finished submitting a bunch of requests" is "the application calls 'fsync' - if it cares".
  - the kernel does break things into a bunch of requests which then need to be sorted. If a file is not contiguous on disk, then you need at least one request for each separate chunk. Plugging allows this sorting to happen before the first request is started.
  - There is a good reason why the page cache submits lots of individual requests rather than a single list with lots of requests. Every request requires an allocation. When memory gets tight (which affects writes more than reads) it could be that I cannot allocate memory for another request until the previous ones have been submitted and completed. So we submit the requests individually, but combine them at a level a little lower down, and 'unplug' that queue either when all have been submitted or when the thread 'schedules' - which it will typically only do if it blocks on a memory allocation.
  - The page cache and plugging
    - Firstly there is the page cache which deliberately delays writes and expedites reads to allow large requests independent of the request size used by the application.
    - Then there is the fact that the page cache sends smallish requests to the device, but tends to send a lot in quick succession. These need to be combined when possible, but also flushed as soon as there is any sign of any complication. This last is what "plugging" does.

Device queue plugging

Storage devices often have significant per-request overheads, so it can be more efficient to gather a batch of requests together and submit them as a unit. When the device is relatively slow it will often have a large queue of pending requests and that queue provides plenty of opportunity for identifying suitable batches. When a device is quite fast, or when a slow device is idle, there is less opportunity to find batches naturally. To address this challenge, the Linux block layer has a concept called "plugging".

Originally, plugging applied only to an empty queue. Before submitting a request to an empty queue, the queue would be plugged so that no requests could flow through to the underlying device for a while. Bios submitted by the filesystem could then queue up and allow batches to be identified. The queue would be unplugged explicitly by the filesystem requesting it, or implicitly after a short timeout. It is hoped that by this time some suitable batches would have been found and that the small delay in starting work is more than compensated for by the larger batches that are ultimately submitted. Since Linux 2.6.39 a new plugging mechanism has been in place that works on a per-process basis rather than per-device. This scales better on multi-CPU machines.

When a filesystem or other client of a block device submits requests it will normally bracket a collection of generic_make_request() calls with blk_start_plug() and blk_finish_plug(). This sets up current->plug to point to a data structure that can contain a list of struct blk_plug_cb (and also a list of struct request that we find out more about in the next article). As these lists are per-process, entries can be added without any locking. The make_request_fn that is given individual bios can choose to add the bio to a list in the plug if that might allow it to work more efficiently.

When blk_finish_plug() is called, or whenever the process calls schedule() (such as when waiting for a mutex, or when waiting for memory allocation), each entry stored in current->plug is processed. This processing will complete everything that the driver would have done if it had not decided to add the bio to the plug list, or if no plug has been enabled.

The fact that the plug is processed from schedule() calls means that bios are only delayed while new bios are being produced. If the process blocks to wait for anything, the list assembled so far is processed immediately. This protects against the possibility that the process might be waiting for a bio that has already been submitted, but is currently queued behind the plug.

Performing the plugging at the process level like this maintains the benefit that batches of related bios are easy to detect and keep together (benefit one), and adds the benefit that locking can be reduced (benefit two). Without this per-process plugging a spinlock, or at least an atomic memory operation, would be needed to handle every individual bio. With per-process plugging, it is often possible to create a per-process list of bios, and then take the spinlock just once to merge them all into the common queue.
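The locking benefit can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct model_queue, submit_each() and submit_plugged() are hypothetical names invented for the illustration, not kernel APIs, and the "lock" is just a counter of how often it would have been taken.

```c
#include <assert.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_queue {
    int lock_acquisitions;   /* how many times the queue lock was taken */
    int queued;              /* bios merged into the shared queue */
};

/* Without plugging: one lock round trip for every individual bio. */
static void submit_each(struct model_queue *q, int nr_bios)
{
    assert(nr_bios >= 0);
    for (int i = 0; i < nr_bios; i++) {
        q->lock_acquisitions++;   /* lock the shared queue */
        q->queued++;              /* add one bio */
                                  /* unlock */
    }
}

/* With per-process plugging: collect privately, merge under one lock. */
static void submit_plugged(struct model_queue *q, int nr_bios)
{
    int private_list = 0;

    assert(nr_bios >= 0);
    for (int i = 0; i < nr_bios; i++)
        private_list++;           /* no locking: the list is task-private */

    if (private_list > 0) {
        q->lock_acquisitions++;   /* lock once */
        q->queued += private_list;
                                  /* unlock */
    }
}
```

For eight bios the first path counts eight lock acquisitions and the second counts one, while both leave the same eight bios in the shared queue.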

Explicit block device plugging

Since the dawn of time, or for at least as long as I have been involved, the Linux kernel has deployed a concept called "plugging" on block devices. (Why plugging exists:) When I/O is queued to an empty device, that device enters a plugged state. This means that I/O isn't immediately dispatched to the low level device driver, instead it is held back by this plug. When a process is going to wait on the I/O to finish, the device is unplugged and request dispatching to the device driver is started (that is, as soon as some process waits for I/O completion the device is unplugged, so that the process does not wait too long). The idea behind plugging is to allow a buildup of requests to better utilize the hardware and to allow merging of sequential requests into one single larger request. The latter is an especially big win on most hardware; writing or reading bigger chunks of data at the time usually yields good improvements in bandwidth. With the release of the 2.6.39-rc1 kernel, block device plugging was drastically changed. Before we go into that, let's take a historic look at how plugging has evolved.

Back in the early days, plugging a device involved global state. This was before SMP scalability was an issue, and having global state made it easier to handle the unplugging. If a process was about to block for I/O, any plugged device was simply unplugged. This scheme persisted in pretty much the same form until the early versions of the 2.6 kernel, where it began to severely impact SMP scalability on I/O-heavy workloads.

In response to this problem, the plug state was turned into a per-device entity in 2004. This scaled well, but now you suddenly had no way to unplug all devices when going to sleep waiting for page I/O. This meant that the virtual memory subsystem had to be able to unplug the specific device that would be servicing page I/O. A special hack was added for this: sync_page() in struct address_space_operations; this hook would unplug the device of interest.

If you have a more complicated I/O setup with device mapper or RAID components, those layers would in turn unplug any lower-level device. The unplug event would thus percolate down the stack. Some heuristics were also added to auto-unplug the device if a certain depth of requests had been added, or if some period of time had passed before the unplug event was seen. With the asymmetric nature of plugging where the device was automatically plugged but had to be explicitly unplugged, we've had our fair share of I/O stall bugs in the kernel. While crude, the auto-unplug would at least ensure that we would chug along if someone missed an unplug call after I/O submission.

With really fast devices hitting the market, once again plugging had become a scalability problem and hacks were again added to avoid this. Essentially we disabled plugging on solid-state devices that were able to do queueing. While plugging originally was a good win, it was time to reevaluate things. The asymmetric nature of the API was always ugly and a source of bugs, and the sync_page() hook was always hated by the memory management people. The time had come to rewrite the whole thing.

The primary use of plugging was to allow an I/O submitter to send down multiple pieces of I/O before handing it to the device. Instead of maintaining these I/O fragments as shared state in the device, a new on-stack structure was created to contain this I/O for a short period, allowing the submitter to build up a small queue of related requests. The state is now tracked in struct blk_plug, which is little more than a linked list and a should_sort flag informing blk_finish_plug() whether or not to sort this list before flushing the I/O. We'll come back to that later.

    struct blk_plug {
        unsigned long magic;
        struct list_head list;
        unsigned int should_sort;
    };

The magic member is a temporary addition to detect uninitialized use cases; it will eventually be removed. The new API to do this is straightforward and simple to use:

    struct blk_plug plug;

    blk_start_plug(&plug);
    submit_batch_of_io();
    blk_finish_plug(&plug);

blk_start_plug() takes care of initializing the structure and tracking it inside the task structure of the current process. The latter is important to be able to automatically flush the queued I/O should the task end up blocking between the call to blk_start_plug() and blk_finish_plug(). If that happens, we want to ensure that pending I/O is sent off to the devices immediately. This is important from a performance perspective, but also to ensure that we don't deadlock. If the task is blocking for a memory allocation, memory management reclaim could end up wanting to free a page belonging to a request that is currently residing on our private plug. Similarly, the caller may itself end up waiting for some of the plugged I/O to finish. By flushing this list when the process goes to sleep, we avoid these types of deadlocks.
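The flush-on-sleep behaviour can be modelled in user space. The sketch below is not kernel code: struct model_task stands in for the task structure, model_schedule() stands in for the point where the task is about to block, and every model_* name is a hypothetical invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PLUG_MAX 16

/* Hypothetical user-space model of the on-stack plug; not kernel code. */
struct model_plug {
    int pending[MODEL_PLUG_MAX];  /* request ids held back by the plug */
    int nr_pending;
};

struct model_task {
    struct model_plug *plug;      /* plays the role of current->plug */
    int dispatched;               /* requests that reached the "device" */
};

/* Send everything held on the plug to the device. */
static void model_flush(struct model_task *t)
{
    if (t->plug) {
        t->dispatched += t->plug->nr_pending;
        t->plug->nr_pending = 0;
    }
}

static void model_start_plug(struct model_task *t, struct model_plug *p)
{
    p->nr_pending = 0;
    t->plug = p;                  /* track the plug in the task */
}

static void model_submit(struct model_task *t, int id)
{
    if (!t->plug) {
        t->dispatched++;          /* no plug: dispatch immediately */
        return;
    }
    if (t->plug->nr_pending == MODEL_PLUG_MAX)
        model_flush(t);           /* plug full: make room first */
    t->plug->pending[t->plug->nr_pending++] = id;
}

/* The task is about to sleep: pending I/O must go out first. */
static void model_schedule(struct model_task *t)
{
    model_flush(t);
}

static void model_finish_plug(struct model_task *t, struct model_plug *p)
{
    assert(t->plug == p);
    model_flush(t);
    t->plug = NULL;
}
```

In this model, two requests submitted after model_start_plug() stay on the private list until model_schedule() is called, at which point both are dispatched; nothing can remain stuck behind the plug while the task sleeps.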

If blk_start_plug() is called and the task already has a plug structure registered, it is simply ignored. (This is the nested-plug case.) This can happen in cases where the upper layers plug for submitting a series of I/O, and further down in the call chain someone else does the same. I/O submitted without the knowledge of the original plugger will thus end up on the originally assigned plug, and be flushed whenever the original caller ends the plug by calling blk_finish_plug(), or if some part of the call path goes to sleep or is scheduled out.
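A minimal user-space sketch of the nested case, assuming a simplified model in which a plug only counts its pending requests. All model_* names and lower_layer_io() are hypothetical; the real kernel functions differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; not the kernel's data structures. */
struct model_plug { int nr_pending; };
struct model_task { struct model_plug *plug; int dispatched; };

static void model_start_plug(struct model_task *t, struct model_plug *p)
{
    assert(p != NULL);
    p->nr_pending = 0;
    if (t->plug)
        return;        /* a plug is already registered: ignore the nested one */
    t->plug = p;
}

static void model_submit(struct model_task *t)
{
    if (t->plug)
        t->plug->nr_pending++;   /* lands on the registered (outer) plug */
    else
        t->dispatched++;
}

static void model_finish_plug(struct model_task *t, struct model_plug *p)
{
    if (t->plug != p)
        return;        /* nested plug: it never held any I/O */
    t->dispatched += p->nr_pending;
    p->nr_pending = 0;
    t->plug = NULL;
}

/* A lower layer that plugs on its own, unaware of any caller's plug. */
static void lower_layer_io(struct model_task *t)
{
    struct model_plug inner;

    model_start_plug(t, &inner);
    model_submit(t);
    model_submit(t);
    model_finish_plug(t, &inner);
}
```

If an outer caller plugs, submits one request and then calls lower_layer_io(), all three requests sit on the outer plug and are dispatched only when the outer caller finishes its plug.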

Since the plug state is now device agnostic, we may end up in a situation where multiple devices have pending I/O on this plug list (this is why sorting is needed: requests for several devices share one list). These may end up on the plug list in an interleaved fashion, potentially causing blk_finish_plug() to grab and release the related queue locks multiple times. To avoid this problem, a should_sort flag in the blk_plug structure is used to keep track of whether we have I/O belonging to more than one distinct queue pending. If we do, the list is sorted to group identical queues together. This scales better than grabbing and releasing the same locks multiple times.
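The effect of grouping by queue can be sketched in user space. The model below makes several assumptions: requests live in an array rather than a linked list, the standard qsort() stands in for the kernel's list sort, the should_sort decision is computed at flush time rather than tracked on insertion, and struct model_req with its helpers is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: a plugged request only remembers its queue. */
struct model_req {
    int queue_id;
    int seq;          /* submission order, used to keep the sort stable */
};

static int cmp_by_queue(const void *a, const void *b)
{
    const struct model_req *ra = a, *rb = b;

    if (ra->queue_id != rb->queue_id)
        return (ra->queue_id > rb->queue_id) - (ra->queue_id < rb->queue_id);
    return (ra->seq > rb->seq) - (ra->seq < rb->seq);
}

/* While flushing, the queue lock is retaken whenever the queue changes. */
static int count_lock_acquisitions(const struct model_req *reqs, size_t n)
{
    int locks = 0;

    for (size_t i = 0; i < n; i++)
        if (i == 0 || reqs[i].queue_id != reqs[i - 1].queue_id)
            locks++;
    return locks;
}

/* Sort only when more than one distinct queue is present on the plug. */
static int flush_plug(struct model_req *reqs, size_t n)
{
    int should_sort = 0;

    for (size_t i = 1; i < n; i++)
        if (reqs[i].queue_id != reqs[0].queue_id)
            should_sort = 1;

    if (should_sort)
        qsort(reqs, n, sizeof(*reqs), cmp_by_queue);
    return count_lock_acquisitions(reqs, n);
}
```

For six requests alternating between two queues, flushing in submission order would take a lock six times, while the sorted flush takes one lock per queue.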

With this new scheme in place, the device need no longer be notified of unplug events. The queue unplug_fn() used to exist for this purpose alone; it has now been removed. For most drivers it is safe to just remove this hook and the related code. However, some drivers used plugging to delay I/O operations in response to resource shortages. One example of that was the SCSI midlayer; if we failed to map a new SCSI request due to a memory shortage, the queue was plugged to ensure that we would call back into the dispatch functions later on. Since this mechanism no longer exists, a similar API has been provided for such use cases. Drivers may now use blk_delay_queue() for this:

    blk_delay_queue(queue, delay_in_msecs);

The block layer will re-invoke request queueing after the specified number of milliseconds have passed. It will be invoked from process context, just as it would have been with the unplug event. blk_delay_queue() honors the queue stopped state, so if blk_stop_queue() was called before blk_delay_queue(), or if it is called after the fact but before the delay has passed, the request handler will not be invoked. Note the intended scope: blk_delay_queue() must only be used for conditions where the caller doesn't necessarily know when that condition will change states. If resources internal to the driver cause it to need to halt operations for a while, it is more efficient to use blk_stop_queue() and blk_start_queue() to manage those directly.
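The interaction between a delayed run and the stopped state can be sketched with a virtual clock. This is a user-space model only: the model_* functions are hypothetical stand-ins for blk_delay_queue(), blk_stop_queue() and blk_start_queue(), and model_tick() replaces the kernel's delayed work with an explicit "time is now N ms" call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model with a virtual clock; not the kernel implementation. */
struct model_queue {
    bool stopped;          /* set by the stop call, cleared by the start call */
    bool delay_armed;      /* a delayed run is pending */
    unsigned long run_at;  /* virtual time (ms) at which to re-run the queue */
    int handler_runs;      /* how often the request handler was invoked */
};

static void model_delay_queue(struct model_queue *q, unsigned long now,
                              unsigned long msecs)
{
    assert(q != NULL);
    q->delay_armed = true;
    q->run_at = now + msecs;
}

static void model_stop_queue(struct model_queue *q)
{
    q->stopped = true;
}

static void model_start_queue(struct model_queue *q)
{
    q->stopped = false;
    q->handler_runs++;     /* restarting the queue runs the handler directly */
}

/* Advance the virtual clock; fire the delayed run if it is due. */
static void model_tick(struct model_queue *q, unsigned long now)
{
    if (!q->delay_armed || now < q->run_at)
        return;
    q->delay_armed = false;
    if (!q->stopped)       /* the delayed run honors the stopped state */
        q->handler_runs++;
}
```

A delay armed on a running queue invokes the handler once the time has passed. If the queue is stopped before the delay expires, the expiry does nothing and the handler runs only when the driver restarts the queue, which is why a driver that knows when its internal resource frees up is better served by the stop/start pair.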

These changes have been merged for the 2.6.39 kernel. While a few problems have been found (and fixed), it would appear that the plugging changes have been integrated without greatly disturbing Linus's calm development cycle.

No more global unplugging (introduces per-queue plugging and removes global unplugging)

The block layer supports the notion of "plugging" a request queue for a block device. (What a plug is:) A plugged queue passes no requests to the underlying device; it allows them to accumulate, instead, so that the I/O scheduler has a chance to reorder them and optimize performance. There comes a time, however, when the plug has to be pulled and the device restarted. Often, code within the filesystem or virtual memory layers decides that, for whatever reason, it's time to get block I/O moving again. In the current 2.6 kernel, there is a function (blk_run_queues()) which performs this task.

The problem is that blk_run_queues() has turned out to be a bit of a performance and scalability problem. (Shortcoming one:) It has a single, global lock which keeps multiple processors from trying to restart the queues at the same time; this lock has become a bit of a contention point on some systems. (Shortcoming two:) A call to blk_run_queues() also restarts all block devices on the system, even though there is typically only one queue that truly needs to be unplugged.

To address these problems, Jens Axboe has posted a patch which does away with blk_run_queues() altogether. This change is a result of a fundamental realization: there is always one specific queue which needs to be kickstarted. So blk_run_queues() has been replaced with blk_run_queue() (which takes the specific queue to start as a parameter) and blk_run_address_space() (which takes a pointer to an address_space structure). With these functions, higher-level code can fire up the request queue which belongs to a specific device or which ultimately underlies a particular non-anonymous mapping.

This patch is going straight into the -mm tree; Andrew Morton commented "This is such an improvement over what we have now it isn't funny." He also noted that "...the next -mm is starting to look like linux-3.1.0..." The 2.6 kernel looks to be interesting for a while.

COMMENTS

Posted Apr 15, 2011 23:31 UTC (Fri) by giraffedata

The idea behind plugging is to allow a buildup of requests to better utilize the hardware and to allow merging of sequential requests into one single larger request. The latter is an especially big win on most hardware; writing or reading bigger chunks of data at the time usually yields good improvements in bandwidth.

(Annotation: this comment rests on a misunderstanding.) I have never understood this analysis.

First of all, plugging doesn't improve utilization -- it decreases it. It causes the device to be idle more than it otherwise would for a given workload. Improved utilization would be eliminating waste of hardware so you can have less hardware. Plug-free I/O doesn't waste hardware; it uses only time that in a plugged scenario would be unused.

"Bandwidth" here means the amount you can shove down the pipe to the disk, and is meaningful only in a situation where you drive the disk as fast as it will go -- it is never idle. In that case, plugging plays no role.

I saw plugging help once, when the device was poorly designed so that it sucked up all the I/Os at interface speed into a large buffer, then proceeded to do the mechanical processing without reordering or coalescing. In that case, defeating that in-device queue via Linux plugging was a win. The device should have simply pushed back as soon as it had one turnaround time's worth of I/O in its queue, then the block layer would have coalesced and reordered without any need for plugging.

I've also imagined some carefully constructed burst patterns where response time (not bandwidth or utilization) improves with plugging. But in more common cases, plugging hurts response time for the obvious reason that it lets the device be idle for a brief period while the user is waiting for I/O.

Posted Apr 17, 2011 7:05 UTC (Sun) by walex

(Annotation: Neil Brown rebuts this comment in the reply that follows.)

This comment about plugging is amazingly misguided because it attacks the one good point about plugging: that it does improve device utilization.

The reason is that "giraffedata" seems to be entirely unaware that there are devices called "disks" that have extremely high and variable latencies, and given these it is possible that bunching IO requests allows the elevator to minimize latencies in such a way that the pauses imposed by plugging may be worthwhile.

There are two very big problems with plugging, or rather its current implementation, which makes it a tremendously stupid thing as a result:

1. Putting it in the block layer is extremely bad, because there are storage devices that don't have high and variable latencies and for which plugging is counterproductive. If plugging makes any sense it should be in the device drivers.
2. Plugging quantizes the flow of IO requests making them essentially synchronous with the periodic expiry of the plugging timer, and the resulting bunching, which is indeed the intended effect as to scheduling, can have bad consequences on page cache usage, and limits the bandwidth usable for the device in common cases (long nearly contiguous writes).

Plugging was introduced IIRC as a way to cheat on some common benchmark.

Plugging is a gross mistake that should be entirely removed from the page cache, and perhaps turned into some kind of scheduling library available to device drivers for use when the relevant device latency profiles might conceivably benefit (almost never actually).

Posted Apr 17, 2011 9:44 UTC (Sun) by neilbrown

Your comments make it sound to me like you don't understand how Linux plugging works - though maybe I misunderstand you...

There is no unplugging timer. There was a timer in the previous code that would unplug after 3ms, but that was mainly a 'just in case' measure. Normally the unplug would happen much earlier. If it was only the timer that triggered an unplug then you are right, performance would be terrible.

Also plugging was only relevant when the device was idle. If the queue was not empty it would not get plugged. So with long, nearly continuous writes, the queue would only be plugged once at the very beginning.

The purpose of plugging is primarily about latency, not bandwidth. If bandwidth is an issue, your queue will not be empty, so the time it takes for a request to get to the front of the queue is long enough for any other related requests to be seen and merged so there is no point in plugging (at least the old style - the new style still brings benefits).

The new plugging code is a bit different. Rather than only plugging idle devices it doesn't really plug devices at all. It plugs request submitters instead. (So the new code is a lot closer to the page cache than the old code). So when a process submits a request, it gets queued in the process, not in the device. Once a process has submitted all that it wants to submit (or whenever it schedules - to avoid deadlocks), the requests queued in the process are released.

So you get a similar effect on the starting transient - a large number of pre-sorted requests gets handled all at once. However there are other advantages as well in terms of lock contention.

I think it is very hard to imagine plugging slowing down even a very fast low latency device. The page cache has a bunch of pages that it wants to perform IO on and it assembles them into a big chunk and sends them all to the device. It is true that the first request won't get there quite as soon, but unless the device is as fast as memory, then it will still get there fast enough that you probably cannot measure the difference. And if a device is as fast as memory, then it probably shouldn't be under the request_queue code at all - a driver more like the umem.c driver might be appropriate. It just takes bios directly and turns them into DMA descriptors. But even that uses plugging so it can start a chain of DMAs at once rather than just one.

I've probably rambled a bit there, but I really think you aren't being fair to plugging. It may not be perfect but I would need a lot more evidence before I could see any justification for it being a gross mistake.

Posted Apr 27, 2011 5:26 UTC (Wed) by dlang

I'm also puzzled about this issue.

we had a similar discussion in rsyslog when we introduced the capability to write log entries to databases in batches rather than individually. my initial proposal was to try and queue a set number of entries to write at once (with a timer to make sure they get written reasonably soon in any case), but it was pointed out that if you just write what's ready, and let everything else queue up in the meantime, the size of the writes auto-tunes itself. i.e. by always writing whatever's pending in the queue (up to a max) when you have the ability to write, you achieve both low-latency for the initial writes (and low load), and high efficiency under heavy load (because the queue backs up while you are doing the 'inefficient' small writes). This auto-tunes for the lowest available latency. (Annotation: the picture changes once the cost of disk head seeks is taken into account; rsyslog does not have to consider that.)

I see two costs in doing this.
1. more device actions than an optimally batched mode
2. more CPU cycles used to process the small batches

if the system is idle enough, these don't matter, I could see them becoming an issue if they either cost more power, or use a resource that could otherwise be used by another process (cpu cycles, or bus bandwidth)

has anyone tried just doing away with plugging and see what the results are? (especially on anything that measures more than how short the device utilized time can be)

Posted Apr 29, 2011 7:04 UTC (Fri) by andresfreund

we had a similar discussion in rsyslog when we introduced the capability to write log entries to databases in batches rather than individually. my initial proposal was to try and queue a set number of entries to write at once (with a timer to make sure they get written reasonably soon in any case), but it was pointed out that if you just write what's ready, and let everything else queue up in the meantime, the size of the writes auto-tunes itself.

If your piece of code does two writes to a normal rotating media disk without plugging - as far as I understand it - the first write will cause a disk activity which will take up to 15ms for an idle disk. Some microseconds later your code will submit a second page. Unfortunately that will wait in some queue until the disk is finished writing the first block. Which means you will need ~30ms in the worst case.

On the other hand, if you plugged the device before doing those writes, it will sort those writes to be in disk order and the disk will be able to do it in one rotation. Which means it's ~15ms.

Posted Apr 29, 2011 7:27 UTC (Fri) by dlang

almost correct if you have just a couple of writes.

if the second write arrives 15ms after the first and the device has been plugged, then with plugging it takes 30ms to get both writes on disk (15ms of delay + 15 ms of activity), without plugging it takes 15ms to get both writes on disk (15ms of activity for the first one and 15ms of activity for the second one)

if the data is submitted in a shorter period of time, then the two writes may finish faster with plugging, but if they arrive further apart (and the device remains plugged) all you are doing is delaying how long it takes for the first one to get to disk.

so you should never plug longer than it would take to write the first block to disk, but figuring out how long that will be is hard, so the timer to unplug is set long

but if you are writing a larger amount, during the 15ms while the disk is processing the first write, multiple additional requests will queue up and be able to be combined.

and if they don't, then the disk is active for 30ms instead of for 15ms, but nothing else had a need for it so why do you care? If anything else had a need for the disk it would have generated additional requests that would be combined with the second request instead of the first and second being combined and finished with the third being processed independently (possibly with plugging of its own)

yes, some mobile users may care in an attempt to save power, but in that case they really want to have much higher latency, on the order of seconds or tens of seconds to avoid spinning up the drive, so that's not the relevant use case.

if this was being done under application control (because after all, the application is the only thing that knows what's going to happen in the future) I could see it. but you are trying to have the kernel guess if there is going to be more activity in the near future.

case 1. if there is no activity in the near future the plug just delayed the write. if there is a tiny bit of activity in the near future, each one can be treated independently as if there is no activity in the future.

case 2. if there is a lot of activity, the second request will get delayed by the time it takes to process the first request instead of by the time the plug is in place.

is the prediction of future activity really good enough that the kernel correctly guessing that case 2 will happen is worth the complexity, the time spent managing the plugs, and the increased latency for the first activity?

Posted Apr 29, 2011 7:40 UTC (Fri) by dlang

or possibly a better way of putting it

assume that each disk action takes 15ms and data arrives every 10ms

with plugging of up to 15ms or a second item you write

2 blocks starting at 10ms finishing at 25ms
2 blocks starting at 30ms finishing at 45ms
2 blocks starting at 50ms finishing at 65ms
2 blocks starting at 70ms finishing at 85ms
etc

without plugging you write

1 block starting at 0ms finishing at 15ms
1 block starting at 15ms finishing at 30ms
2 blocks starting at 30ms finishing at 45ms
1 block starting at 45ms finishing at 60ms
2 blocks starting at 60ms finishing at 75ms
1 block starting at 75ms finishing at 90ms
etc

does this really make a difference? yes, in the second case the disk is busy continuously rather than having a 5ms pause between activity but does that matter?

say the data arrives twice as fast (every 5ms)

with plugging

2 blocks starting at 5ms finishing at 20ms
3 blocks starting at 20ms finishing at 35ms
3 blocks starting at 35ms finishing at 50ms

without plugging

1 block starting at 0ms finishing at 15ms
3 blocks starting at 15ms finishing at 30ms
3 blocks starting at 30ms finishing at 45ms

where is the gain?
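The timelines above can be reproduced with a small simulation. This is only a sketch under the example's own assumptions: one disk action takes a fixed 15ms however many queued blocks it covers, and a plugged queue is released by a second item or after 15ms. The name `simulate` and its parameters are made up for illustration and do not correspond to anything in the kernel.

```python
def simulate(interval, n_items, service=15, plug=False, max_plug=15):
    """Return a list of (blocks, start_ms, finish_ms) disk actions.

    Assumptions taken from the example, not from the kernel: items arrive
    every `interval` ms starting at 0, each disk action takes `service` ms
    regardless of batch size, and a plug holds a lone item until a second
    item arrives or `max_plug` ms have passed since the lone item arrived.
    """
    arrivals = [i * interval for i in range(n_items)]
    batches = []
    disk_free = 0          # time at which the disk finishes its current action
    i = 0                  # index of the oldest item not yet written
    while i < n_items:
        # Earliest possible start: disk is idle and at least one item exists.
        start = max(disk_free, arrivals[i])
        if plug:
            queued = sum(1 for t in arrivals[i:] if t <= start)
            if queued < 2:
                # Only one item is waiting: keep the queue plugged.
                deadline = arrivals[i] + max_plug
                nxt = arrivals[i + 1] if i + 1 < n_items else deadline
                start = max(start, min(nxt, deadline))
        # Everything that has arrived by `start` goes out in one action.
        j = i
        while j < n_items and arrivals[j] <= start:
            j += 1
        batches.append((j - i, start, start + service))
        disk_free = start + service
        i = j
    return batches


for blocks, start, finish in simulate(10, 8, plug=True):
    print(f"{blocks} blocks starting at {start}ms finishing at {finish}ms")
```

Calling `simulate(10, 8)` and `simulate(5, 7)` gives the two unplugged timelines. In both cases the plugged run never finishes a given item earlier than the unplugged run, which is the point of the question.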

Posted Apr 29, 2011 9:29 UTC (Fri) by neilbrown

While I don't disagree with your logic, I do disagree with its relevance.

In the linux kernel, plugging is not timer based. (There was a timer in the previous implementation, but it was only a last-ditch unplug in case there were bugs: slow is better than frozen).

In the old code a device would plug whenever it wanted to which was typically when a new request arrived for an empty queue. It would then unplug as soon as some thread started waiting for a request on that device to complete. I think it also would unplug explicitly in some cases after submitting lots of writes that were expected to be synchronous, but I'm not 100% certain.

So in the read case for example a read syscall would submit a request to read a page, then another request to read the next page (because it was an 8K read), then maybe a few more requests to read-ahead some more pages, then wait for that first read to complete. Waiting for the read-ahead requests maybe isn't critical, but waiting for that second page would reduce latency. Now to be fair, if the two pages were adjacent on the disk they would probably have been combined into a single request before being submitted, and if they aren't then maybe keeping them together isn't so important. But as soon as you get 3 non-adjacent pages in the read, there is a real possible gain from sorting before starting IO.

The new plugging code is quite different. The unplug happens when the thread submitting requests has finished submitting a bunch of requests. It is explicit rather than using the heuristic of 'unplug when someone waits' (hence the title of the article). This means it happens a little bit sooner - there is never any timer delay at all.

Rather than thinking of it as 'plugging' it is probably best to think of it as early-aggregation. hch has suggested that this be even more explicit. i.e. the thread generates a collection of related requests (quite possibly several files full of writes in the write-back case) and submits them all to the device at once. Not only does this clearly give a good opportunity to sort requests - more importantly it means we only take the per-device lock once for a large number of requests. If multiple threads are writing to a device concurrently, this will reduce lock contention making it useful even when the device queue is fairly full (when normal plugging would not apply at all).
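The early-aggregation idea can be modelled in user space as follows. This is a toy sketch, not kernel code: `Device`, `Plug` and `lock_acquisitions` are hypothetical names, and the real interface is the `blk_start_plug()` / `blk_finish_plug()` pair introduced by the on-stack plugging patch. The sketch only shows the two effects described above: requests are sorted before the device sees them, and the per-device lock is taken once for the whole batch.

```python
import threading


class Device:
    """Toy device queue protected by a single per-device lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.queue = []
        self.lock_acquisitions = 0   # counted only to make the effect visible

    def submit_batch(self, sectors):
        with self.lock:
            self.lock_acquisitions += 1
            self.queue.extend(sectors)


class Plug:
    """Per-task list of pending requests (hypothetical model of a plug)."""

    def __init__(self, device):
        self.device = device
        self.pending = []

    def __enter__(self):
        return self

    def submit(self, sector):
        # Private to the submitting task, so no device lock is needed here.
        self.pending.append(sector)

    def flush(self):
        if self.pending:
            # Sort once, then hand everything over under a single lock hold.
            self.device.submit_batch(sorted(self.pending))
            self.pending = []

    def __exit__(self, *exc):
        self.flush()                 # the explicit "unplug"
        return False


dev = Device()
with Plug(dev) as plug:
    for sector in (900, 8, 512, 16):
        plug.submit(sector)
```

After the `with` block, `dev.queue` holds the four sectors in ascending order and the lock has been taken once instead of four times.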

The equivalent logic in a 'syslogd' style program would be to simply always service read requests before write requests. So when a log message comes in, it is queued to be written. Before you actually write it though you check if another log message is ready to be read from some incoming socket. If it is you read it and queue it. You only write when there is nothing to be read, or your queue is full (this is the benefit of sorting).
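The syslogd-style loop can be sketched as below. This is a minimal illustration under stated assumptions: a `deque` stands in for the incoming socket, a returned list of batches stands in for the actual writes, and `drain` and `max_queue` are hypothetical names.

```python
from collections import deque


def drain(incoming, max_queue=4):
    """Service reads before writes; return the batches that would be written.

    `incoming` is a deque standing in for messages readable from a socket.
    A batch is written only when nothing is left to read or the pending
    queue has reached `max_queue` entries.
    """
    if max_queue < 1:
        raise ValueError("max_queue must be at least 1")
    pending, writes = [], []
    while incoming or pending:
        if incoming and len(pending) < max_queue:
            # Something is readable and there is room: read it first.
            pending.append(incoming.popleft())
        else:
            # Nothing to read, or the queue is full: write the whole batch.
            writes.append(pending)
            pending = []
    return writes


batches = drain(deque(["a", "b", "c", "d", "e", "f"]), max_queue=4)
```

Here six messages that arrive together are written as two batches rather than six separate writes.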

I agree that having a timed unplug event doesn't make much sense.

Posted Apr 29, 2011 14:57 UTC (Fri) by dlang

note that in my example, the timer was only used to indicate the max amount of time to wait for the next item to be submitted.

how can the kernel know when the application has finished submitting a bunch of requests?

or is it that the application submits one request, but something in the kernel is breaking it into a bunch of requests that all get submitted at once, and plugging is an attempt to allow the kernel to recombine them? (but that doesn't match your comment about sorting 3 non-adjacent requests being a win, how can one action by an application generate 3 non-adjacent requests?)

I'm obviously missing something here.

if the application is doing multiple read/write commands, I don't see how the kernel can possibly know how soon the next activity will be submitted after each command is run.

if the application is doing something with a single command, it seems like the problem is that it shouldn't be broken up to begin with, so there would be no need to plug to try and combine them

Posted Apr 29, 2011 22:58 UTC (Fri) by neilbrown

The actual times between plug and unplug are typically microseconds (I suspect). The old timeout was set at 3 milliseconds and that was very slow. It is almost nothing compared to device IO times.

Actions of the application and requests to devices are fairly well disconnected thanks to the page cache. An app writes to the page cache and the page cache doesn't even think about writing to the device for typically 30 seconds. Of course if the app calls fsync, that expedites things. So a partial answer to "how can the kernel know when the application has finished submitting a bunch of requests" is "the application calls 'fsync' - if it cares".

On the read side, the page cache performs read-ahead so that hopefully every read request can be served from cache - and certainly the device gets large read requests even if the app is making lots of small read requests.

Also the kernel does break things into a bunch of requests which then need to be sorted. If a file is not contiguous on disk, then you need at least one request for each separate chunk. Plugging allows this sorting to happen before the first request is started.
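As a rough illustration of why a fragmented file needs one request per chunk, the sketch below sorts a list of block numbers and merges adjacent ones. The function name `build_requests` and the `(start, count)` representation are assumptions made for this example, not the kernel's data structures.

```python
def build_requests(blocks):
    """Sort block numbers and merge adjacent ones into (start, count) requests."""
    requests = []
    for block in sorted(blocks):
        if requests and requests[-1][0] + requests[-1][1] == block:
            # This block directly follows the previous request: extend it.
            start, count = requests[-1]
            requests[-1] = (start, count + 1)
        else:
            # A gap on disk: a new request is needed for this chunk.
            requests.append((block, 1))
    return requests


# Six blocks of one file stored in three separate chunks on disk.
reqs = build_requests([100, 101, 102, 40, 41, 7])
```

The six blocks end up as three requests, issued in ascending disk order, which is only possible if all of them are visible before the first one is started.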

There is a good reason why the page cache submits lots of individual requests rather than a single list with lots of requests. Every request requires an allocation. When memory gets tight (which affects writes more than reads) it could be that I cannot allocate memory for another request until the previous ones have been submitted and completed. So we submit the requests individually, but combine them at a level a little lower down, and 'unplug' that queue either when all have been submitted or when the thread 'schedules' - which it will typically only do if it blocks on a memory allocation.

So there are two distinct things here that could get confused.

Firstly there is the page cache which deliberately delays writes and expedites reads to allow large requests independent of the request size used by the application.

Then there is the fact that the page cache sends smallish requests to the device, but tends to send a lot in quick succession. These need to be combined when possible, but also flushed as soon as there is any sign of any complication. This last is what "plugging" does.