List: linux-aio
Subject: aio design notes - very early work in progress
From: Suparna Bhattacharya <suparna () in ! ibm ! com>
Date: 2001-12-28 16:01:47
Hello,
I just started out on something for this, but find progress a little slow
as I guess I've been running out of steam a bit. Figuring out things is the
fun part - writing it up, well, is harder ...
So I thought I'd share this early on this mailing list, so that questions /
discussions can start and people can join in and contribute.
Besides I know Ben and Dan are also working on some parts too, so would
like to avoid redundant work early.
All that's in here now is just a generic overview of goals and alternatives
- haven't even got down to the design details as yet.
But the idea was to try to answer some of the bigger/key questions on
people's minds in terms of the direction, so early feedback could help keep
this more focussed rather than turn into a lengthy doc that no one has
time to read :)
Some key questions as I saw it:
Why aio vs various alternatives ?
Are the interfaces sufficient, simple and efficient ?
Are we duplicating functionality ?
Why choose Ben's aio implementation vs other alternative patches
or possible design approaches ?
Is the basic internal design principle good or is it error prone ?
A few more, which I think are important, but haven't started talking about
yet:
What needs to be improved ?
If we do this incrementally which decisions need to be taken
early ?
How much work is needed to get this complete (even incremental
pieces) ?
So here goes ...
Regards
Suparna
--------------------------------------------------
Notes on Asynchronous I/O (aio) for Linux (Work in Progress - draft)
1. Motivation
Asynchronous i/o overlaps application processing with i/o operations
for improved utilization of CPU and devices, and improved application
performance, in a dynamic/adaptive manner, especially under high loads
involving large numbers of i/o operations.
1.1 Where aio could be used:
Application performance and scalable connection management:
(a) Communications aio:
Web Servers, Proxy servers, LDAP servers, X-server
(b) Disk/File aio:
Databases, I/O intensive applications
(c) Combination
Web Servers serving data files direct from disk to n/w
Note:
The POSIX spec has examples of using aio in a journalization model, a data
acquisition model and in supercomputing applications. It mentions that
supercomputing and database architectures may often have specialized h/w
that can provide true asynchrony underlying the logical aio interface.
Aio enables an application to keep a device busy (e.g. raw i/o), potentially
improving throughput. While maximum gains are likely to be for unbuffered
i/o case, aio should be supported by all types of files and devices in
the same standard manner.
Besides, being able to do things like a hands-off zero-copy async sendfile
can be quite useful for web-servers.
1.2 Things that aio helps with:
- Ability for a thread to initiate operations or trigger actions
without having to wait for them to complete.
- Ability to queue up batches of operations and later issue a single wait
for completion of any of the operations, or of at least a certain number of
operations (could be at least one).
- Multiplexing large numbers of connections or input sources in a scalable
manner, typically into an event driven service model.
- Flexible/dynamic concurrency control tuning and load balancing.
- Performance implications
(a) Application thread gets to utilize its CPU time better
(b) Avoids overhead of extra threads (8KB per kernel thread in Linux)
(c) System throughput is helped by reducing context switches (since waiting
tends to cause runs shorter than a full time-slice)
1.2.1 Other expected features (aka POSIX):
- Support for synchronous polling as well as asynchronous notification
(signals/callbacks) of completion status, with the ability to correlate
event(s) with the i/o request(s).
- Allow multiple outstanding aio's to the same open instance and to multiple
open instances (sequence might be affected by synchronized data integrity
requirements or priorities)
- Option to wait for notification of aio and non-aio events through a single
interface
- Support for cancellation of outstanding i/o requests
- Specification of relative priorities of aio requests (optional)
1.2.2 Also Desirable:
- Ability to drive certain sequences of related async operations/transfers
in one shot from an application e.g. zero-copy async transfers across
devices (zero-copy sendfile aio)
1.3 Alternatives to aio
1.Using more threads (has its costs)
- static committed resource overhead per thread
- potentially more context switches
2.Communications aio alternatives - /dev/*poll
- specialized device node based interface for registration and
notifications of events
- suitable for readiness notification on sockets, but not for
driving i/o.
[Todo: Mention BSD keventd : level and edge trigger concepts]
3.Real-time signals
- only a notification mechanism
- requires fcntl (F_SETSIG) for edge triggered readiness notification
enablement or aio interfaces (aio_sigevent settings: SIGEV_SIGNAL)
for completion notification enablement through RT signals.
- the mechanism has potential overflow issues (when signal queue
limits are hit) where signals could get lost, especially with
fasync route (which tends to generate a signal for every event
rather than aggregate for an fd) and needs to be supplemented with
some other form of polling
- relatively heavy when it comes to large numbers of events
[Question to Ponder: More efficient implementation and extensions to RT signal
interfaces, or have a different interface altogether ? ]
Please refer to www.kegel.com/c10k.html for a far more detailed coverage of
these mechanisms, and how they can be used by applications.
Reasons for preferring aio:
- Desirable to have a unified approach, rather than multiple isolated
mechanisms if it can be done efficiently
- Multiplexing across different kinds of operations and sources
- Clear cut well-known system call interface preferable to more indirect
interfaces
- Driving optimizations from low level/core primitives can be more efficient
and beneficial across multiple subsystems
- Separate the event completion queue and notification mechanisms for
flexibility and efficiency. (Can have tunable wakeup semantics, tunable
queue lengths, more efficient event ring buffer implementation)
2. Design Philosophy and Interface Design
2.1 System and Interface design philosophy:
Alternatives:
a. Entire system built on an asynchronous model, all the way through
(e.g. NT i/o subsystem). So most operations can be invoked in sync or async
mode (sub-options of the same operation specific interface).
Internally, the sync mode = async mode + wait for completion.
b. Async operations are initiated through a separate interface, and could
follow a separate path from the synchronous operations, to a degree
(they use common code, and low down things may be truly async and common
to both, but at the higher level the paths could be different)
The POSIX aio interface is aligned with (b). This is the approach that the
Linux implementation takes. Submission of all async i/o ops happens
through a single call with different command options, and data used for
representing different operations.
Advantages:
- No change in existing sync interfaces (can't afford to do that anyway)
- Less impact on existing sync i/o path. This code does not have the overhead
of maintaining async state (it can use the stack), and can stay simple.
Disadvantages:
- Need to introduce interfaces or cmd structures for each operation
that can be async. (A little akin to an ioctl style approach)
- Different code paths implies some amount of duplication/maintenance
concerns. Can be minimized by using as much common code as possible.
2.2 Approaches for implementing aio
2.2.1 Alternative ways of driving the operation to completion
1. Using threads to make things _look_ async to the application
a. User level threads
- glibc approach (one user thread per operation ?)
poor scalability, performance
b. Pool of threads
- have a pool of threads servicing an aio request queue for the
task - a tradeoff between degree of concurrency/utilization and
resource consumption.
2. Hybrid approach (SGI kaio uses this)
- If the underlying operation is async in nature, initiate it right away
(better utilization of underlying device), and just handle waiting for
completion via thread pool (could become a serialization point depending
on load and number of threads) unless operation completes in a
non-blocking manner.
- If underlying operation is sync, then initiate it via the thread pool
Note:
- SGI kaio has internal async i/o initiation interfaces for raw i/o and
generic read.
- SGI kaio has these slave threads in the context of the aio task => at
least one per task
3. Implement a true async state machine for each type of aio operation.
(i.e. a sequence of non-blocking steps, continuation driven by IRQ and event
threads, based on low level primitives designed for this purpose)
- Relatively harder to get right, and harder to debug, but provides
more flexibility, and greater asynchrony
[Question for Ben: Is posix synchronized data integrity handled
in the user space posix implementation ? ]
Ben's aio implementation takes approach 3 (with some caveats, as we shall see
later).
[Todo: Make a note of Andi Kleen's approach - raw and comm aio only ? ]
2.2.1.1 Optimization/Fast-path for non-blocking case
In case an operation can complete in a non-blocking manner via the normal
path, the additional async state path can be avoided. An F_ATOMIC flag check
has been introduced down the sync i/o path to check for this, thus providing
a fast path for aio.
2.2.2 Handling User Space Data Transfer
With asynchronous i/o, the steps of the operation aren't guaranteed to execute
in the caller's context. Hence transfers/copies to/from user space need to be
handled carefully. Most of this discussion is relevant for buffered i/o,
since direct i/o avoids user/kernel space data copies.
In a thread pool approach, if a per-task thread pool is used, then such
transfers can happen in the context of one of these threads. Typically
the copy_to_user operations required to read transferred data into user
space buffers after i/o completion would be handled by these aio threads.
It may be possible to pass down all the user space data for the operation
when initiating i/o while in the caller's context without blocking, though
this is inherently likely to use extra kernel space memory. The same is
true on the way up on i/o completion, where it may be possible to continue
holding on to the in-kernel buffers until the caller actually gathers
completion data, so that copy into user space can happen in the caller's
context. However this again holds up additional memory resources which may
not be suitable especially for large data transfers.
[BTW, on windows NT, iirc this sort of stuff happens through APCs or
asynchronous procedure calls, in a very crude sense somewhat like softirqs
running in the context of a specified task]
Instead, an approach similar to that taken with direct i/o has been adopted,
where the user space buffers are represented in terms of physical memory
descriptors (a list of tuples of the form <page, offset, len>), called kvecs,
rather than by virtual address, so that they are uniformly accessible in any
process context. This required new in-kernel *kvec* interfaces which operate on
this form of i/o currency or memory descriptors. Each entry/tuple in the kvec is
called a kveclet, and represents a contiguous area of physical memory. A
virtual address range or iovec (in the case of readv/writev) would map to a set
of such tuples which makes up a kvec.
[Note: This fits in very nicely with the current multi-page bio implementation
which also uses a similar vector representation, and also with the zero-copy
network code implementation. Ben has submitted some patches to make this
all a common data structure. TBD: I think some changes are needed in the
multi-page bio code to get this to work properly without requiring a copy
of the descriptors.
There is a discussion on various alternative representations that have been
considered in the past in sec 1.2.2 in
http://lse.sourceforge.net/io/bionotes.txt
]
2.3 Extent of true async behaviour - Queue depth/Throttle points
There has been some discussion about the extent to which asynchronous
behaviour should be supported in case the operation has to wait for some
resource to become available (typically memory, or request queue slots).
There obviously has to be some kind of throttling of requests by the system
beyond which it cannot take in any more asynchronous io for processing.
In such cases, it should return an error (as it does for non-blocking i/o)
indicating temporary resource unavailability (-EAGAIN), rather than block
waiting for the resource (or could there be value in the latter option ?).
It seems appropriate for these bounds to be determined by the aio queue depth
and associated resource limits, rather than by other system resources (though
the allowable queue depth could be related to general resource availability).
This would mean that ideally, when one initiates an async i/o
operation, the operation gets queued without blocking anywhere, or returns
an error in the event it hits the aio resource limits.
[Note/TBD: I think this is the intended direction, but the current
implementation isn't exactly there yet. Async raw aio would probably block if
it needs to wait for request queue slots. Async file i/o attempts to
avoid blocking the app due to sub i/os for bmap kind of operations but it
currently could block waiting for the inode semaphore. The long term direction
is to convert this wait to an async state driven mechanism. Also the
wait for bmap operations seems to have been just pushed out of the app's
context to that of the event thread that drives the next step of the state
machine (which means that it could block keventd temporarily)]
2.4 Sequencing of aio operations
Specifying serialization restrictions or relative priorities:
- posix_synchronized_io (for multiple requests to the same fd)
says that reads should see data written by requests preceding it - enforces
ordering to that extent, if specified.
- aio_req_prio (not supported in the current implementation)
the app can indicate that some requests are lower priority than others, so
the system can optimize overall throughput and the latency of other requests
at the cost of the latency of such requests.
[Todo: Discuss more later. Relative to process priority, so can't starve
other process's requests etc. Priority as a hint vs barriers or sync i/o
reqmts for strict ordering]
Beyond these restrictions and hints, sequencing is up to the system:
- Maximize throughput (global decision)
- Ideally minimize latency (local, for a request)
Inherent tradeoffs, though improving system throughput could help with
average latency, provided pipeline startup time isn't significant. The goal
could be to maximize throughput within reasonable latency bounds.
Since each operation could involve several steps which could potentially
run into temporary resource contention or availability delay points, the
sequence in which operations complete, or even reach the target device are
affected by system scheduling decisions in terms of resource acquisition
at each of these stages.
Note/TBD: Since the current implementation uses event threads to drive stages of
the async state machine, in situations where a sub-step isn't completely
non-blocking (as desired), then the implementation ends up causing some
degree of serialization, or rather further accentuating the order in which
the requests reached the sub-step. This may be quite reasonable and possibly
even beneficial for operations that are likely to contend for the same
resources (e.g. requests to the same device), but not optimal for requests
that can proceed in a relatively independent fashion.
2.5 Completion/Readiness notification:
Note: Readiness notification can be treated as the completion of an
asynchronous operation to await readiness.
2.5.1 Requirements:
1. Efficient for large numbers of events and connections
- The interface to register events to wait for must be
separate from the interface used to actually poll/wait for
the registered events to complete (unlike traditional
poll/select), so that registrations can hold across multiple
poll waits with minimum user-kernel transfers.
(it's better to handle this at the interface definition level than
through some kind of an internal poll cache)
- Ability to reap many events together (unlike current sigtimedwait
and sigwaitinfo interfaces)
- Scalable/tunable queue lengths
- More flexible/tunable wakeup semantics for better concurrency
control
2. Flexible grouping of operations
- Ability to wait for at least a specified number of operations from
a group to complete (at least N vs at least 1 helps with batching
on the way up, so that the application can perform its post
processing activities in a batch, without redundant context switches)
- Support dynamic additions to the group rather than static or one time
list passed through a single call (e.g. aggregation across multiple
submissions)
[Question: Is the option of the completion group being different from the
submission batch/group (i.e. per iocb grouping field) useful to have ?
This is possible in POSIX with sigevent specifications if signals are
used for completion notification but not otherwise today]
3. Flexible distribution of responsibility across multiple
threads/components
- Different threads can handle submission for different operations,
and another pool of threads could wait on completion
- Degree of concurrency can be improved simply by increasing threads
in the pool that wait for and process completion of operations for
that group
4. Should also be able to wait for a specific operation to complete (without
being very inefficient about it)
- Either have low overhead group setup/teardown (setup/teardown is
needed in order to reserve resources in advance), so this can be
a single operation group (costs can be amortized across multiple
such operations by reusing the same group if they happen in a
serialized fashion)
- Or have an interface to wait for a specific operation to complete
5. Work reasonably well for light i/o loads as well
- Tunability for low application loads, so system impact is minimum
(tunable wakeup semantics could help)
2.5.2 I/O completion port/queue concept
2.6 Other Goals
- POSIX as well as completion port style interfaces/wrappers
- Low overhead for user (mmaped ring buffer, vsyscalls) and kernel
- Extensible to newer operations
-
-
3. Interfaces
- general overview
- pointer to man pages ?
- about aio wrappers
- what is missing or nice-to-have
4. Design Internals
4.1 Low Level Primitives :
4.1.1 wait_queue func
async and sync waiters both use wait queue (transparent to the caller
of wakeup)
Callback-chaining
4.1.2 wtd for async state machine
- use of keventd or dedicated system worker threads
4.1.3 Synchronization
4.1.4 Issues/Comments:
-
4.2 Generic async event handling pieces
- data structures, ioctx setup/destroy, submit, getevents
- scaling
- handling process fork/exec/exit
- queue limits
4.3 In-kernel interfaces
- kvecs
- file system ops
- alternatives
- fastpath
4.4 Async poll
4.5 Raw-disk aio
4.6 File-system/buffered aio
Explain the state machine in each of these cases
4.7 Network aio
4.8 Extending aio to other operations (e.g sendfile)
5. Performance Characteristics
?
6. Todo Items and General Comments and Issues
- i/o cancellation implementation
- direct aio path
- mmaped ring buffer
- kernel memory pinning issues
- any races in current filesystem implementation
- implementations for other filesystems
- network aio improvements or things to complete
- drivers which aren't totally async (use down)
-
-
7. References/Related patches:
1. SGI's kaio implementation
2. /dev/epoll patch, RT signals and various links from Dan Kegel's c10k page
3. NT I/O completion port, Solaris, AIX aio implementations
--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org. For more info on Linux AIO,
see: http://www.kvack.org/aio/