
List:       linux-aio
Subject:    aio design notes - very early work in progress
From:       Suparna Bhattacharya <suparna () in ! ibm ! com>
Date:       2001-12-28 16:01:47

Hello,

I just started out on something for this, but progress has been a little slow
as I guess I've been running out of steam a bit. Figuring things out is the
fun part - writing it up, well, is harder ... 

So I thought I'd share this early on this mailing list at least, so that some 
questions / discussions can start and people can join in and contribute. 
Besides, I know Ben and Dan are also working on some parts too, so I'd 
like to avoid redundant work early on.

All that's in here now is just a generic overview of goals and alternatives
- haven't even got down to the design details as yet.
But the idea was to try to answer some of the bigger/key questions on
people's minds in terms of the direction, so early feedback could help keep
this more focussed rather than turn into a lengthy doc that no one has
time to read :)

Some key questions as I saw it:
	Why aio vs various alternatives ?
	Are the interfaces sufficient, simple and efficient ?
	Are we duplicating functionality ?
	Why choose Ben's aio implementation vs other alternative patches 
	or possible design approaches ?
	Is the basic internal design principle good or is it error prone ?
	
A few more, which I think are important, but haven't started talking about
yet:
	What needs to be improved ?
	If we do this incrementally which decisions need to be taken
	early ?
	How much work is needed to get this complete (even incremental
	pieces) ?

So here goes ... 

Regards
Suparna

--------------------------------------------------


Notes on Asynchronous I/O (aio) for Linux (Work in Progress - draft)

1. Motivation

Asynchronous i/o overlaps application processing with i/o operations,
improving utilization of the CPU and devices and improving application 
performance in a dynamic/adaptive manner, especially under high loads 
involving large numbers of i/o operations.

1.1 Where aio could be used:

Application performance and scalable connection management:
(a) Communications aio:
  Web Servers, Proxy servers, LDAP servers, X-server
(b) Disk/File aio:
  Databases, I/O intensive applications
(c) Combination
  Web Servers serving data files direct from disk to n/w

Note:
The POSIX spec has examples of using aio in a journalization model, a data 
acquisition model and in supercomputing applications. It mentions that
supercomputing and database architectures may often have specialized h/w
that can provide true asynchrony underlying the logical aio interface. 
Aio enables an application to keep a device busy (e.g. raw i/o), potentially 
improving throughput. While the maximum gains are likely to be for the 
unbuffered i/o case, aio should be supported by all types of files and devices in 
the same standard manner.

Besides, being able to do things like hands-off zero-copy async sendfile 
can be quite useful for web servers.

1.2 Things that aio helps with:

- Ability for a thread to initiate operations or trigger actions  
  without having to wait for them to complete. 
- Ability to queue up batches of operations and later issue a single wait for
  completion of any of the operations, or of at least a certain number of them
  (see the sketch after this list). 
- Multiplexing large numbers of connections or input sources in a scalable manner,
  typically into an event driven service model.
- Flexible/dynamic concurrency control tuning and load balancing.
- Performance implications
  (a) Application thread gets to utilize its CPU time better 
  (b) Avoids overhead of extra threads (8KB per kernel thread in linux)
  (c) Helps system throughput by reducing context switches (blocking waits
  cause threads to run for less than a full time-slice)
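
To make the batching point above concrete, here is a minimal user-space
sketch of the pattern using the existing POSIX aio calls (lio_listio to
queue a batch without waiting, aio_suspend to wait for at least one
completion). It only illustrates the usage model, not the proposed kernel
interface, and error handling is trimmed:

/* Queue several reads in one lio_listio() call, overlap other work,
 * then block until at least one of them completes. */
#include <aio.h>
#include <stdio.h>
#include <string.h>

#define NREQ 4

int submit_batch(int fd, char bufs[NREQ][4096])
{
	/* static: the aiocbs must outlive the call while i/o is in flight */
	static struct aiocb cbs[NREQ];
	struct aiocb *list[NREQ];
	const struct aiocb *wait_list[NREQ];
	int i;

	memset(cbs, 0, sizeof(cbs));
	for (i = 0; i < NREQ; i++) {
		cbs[i].aio_fildes = fd;
		cbs[i].aio_buf    = bufs[i];
		cbs[i].aio_nbytes = 4096;
		cbs[i].aio_offset = (off_t)i * 4096;
		cbs[i].aio_lio_opcode = LIO_READ;
		list[i] = &cbs[i];
		wait_list[i] = &cbs[i];
	}

	/* Queue the whole batch without waiting for any of it. */
	if (lio_listio(LIO_NOWAIT, list, NREQ, NULL) < 0)
		return -1;

	/* ... overlap application processing here ... */

	/* Block until at least one request in the batch has completed. */
	aio_suspend(wait_list, NREQ, NULL);

	for (i = 0; i < NREQ; i++)
		if (aio_error(&cbs[i]) == 0)
			printf("request %d done, %zd bytes\n",
			       i, aio_return(&cbs[i]));
	return 0;
}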

1.2.1 Other expected features (as in POSIX):
- Support for synchronous polling as well as asynchronous notification 
  (signals/callbacks) of completion status, with the ability to correlate 
  event(s) with the i/o request(s) (see the sketch below).
- Allow multiple outstanding aio's to the same open instance and to multiple 
  open instances (sequence might be affected by synchronized data integrity
  requirements or priorities) 
- Option to wait for notification of aio and non-aio events through a single
  interface
- Support for cancellation of outstanding i/o requests
- Specification of relative priorities of aio requests (optional)
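
The correlation point above is worth a small sketch: with the POSIX
interfaces, the request's sigev_value can carry a pointer back to the
control block, so the reaped event identifies the i/o it belongs to.
(Note that sigwaitinfo() dequeues only one event per call, which is one
of the scalability concerns picked up again in section 2.5.1.)

/* Completion notification via a real-time signal; sigev_value points
 * back at the aiocb so the event can be matched to its request. */
#include <aio.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static char buf[4096];
static struct aiocb cb;

int read_with_rt_notify(int fd)
{
	sigset_t set;
	siginfo_t info;
	struct aiocb *done;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf    = buf;
	cb.aio_nbytes = sizeof(buf);
	cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
	cb.aio_sigevent.sigev_signo  = SIGRTMIN;
	cb.aio_sigevent.sigev_value.sival_ptr = &cb;

	sigemptyset(&set);
	sigaddset(&set, SIGRTMIN);
	sigprocmask(SIG_BLOCK, &set, NULL);

	if (aio_read(&cb) < 0)
		return -1;

	/* ... do other work, then reap one completion ... */
	sigwaitinfo(&set, &info);
	done = info.si_value.sival_ptr;      /* which request completed */
	printf("read %zd bytes\n", aio_return(done));
	return 0;
}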

1.2.2 Also Desirable:
- Ability to drive certain sequences of related async operations/transfers 
  in one shot from an application e.g. zero-copy async transfers across 
  devices (zero-copy sendfile aio)

1.3 Alternatives to aio

1. Using more threads (has its costs) 
	- static committed resource overhead per thread
	- potentially more context switches
2. Communications aio alternatives - /dev/*poll 
	- specialized device node based interface for registration and
	  notifications of events
  	- suitable for readiness notification on sockets, but not for 
	  driving i/o.
 [Todo: Mention BSD kqueue/kevent : level and edge trigger concepts]
3. Real-time signals 
	- only a notification mechanism 
	- requires fcntl(F_SETSIG) to enable edge-triggered readiness 
	  notification, or the aio interfaces (aio_sigevent settings: 
	  SIGEV_SIGNAL) to enable completion notification, via RT signals 
	  (see the sketch after this list).
	- the mechanism has potential overflow issues (when signal queue 
	  limits are hit) where signals could get lost, especially with the 
	  fasync route (which tends to generate a signal for every event 
	  rather than aggregating per fd), and needs to be supplemented with 
	  some other form of polling
	- relatively heavy when it comes to large numbers of events
 [Question to Ponder: More efficient implementation and extensions to RT signal
 interfaces, or have a different interface altogether ? ]
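
For reference, a minimal sketch of the F_SETSIG route mentioned above
(readiness notification on a socket via an RT signal, with SIGIO as the
overflow fallback); it shows both the per-fd setup and the
one-event-per-call reaping:

/* Edge-triggered readiness notification via RT signals.  SIGIO is
 * delivered if the RT signal queue overflows, at which point the
 * application must fall back to polling all of its descriptors. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

int arm_rt_readiness(int sock)
{
	sigset_t set;

	sigemptyset(&set);
	sigaddset(&set, SIGRTMIN);
	sigaddset(&set, SIGIO);              /* overflow fallback */
	sigprocmask(SIG_BLOCK, &set, NULL);

	if (fcntl(sock, F_SETOWN, getpid()) < 0)
		return -1;
	if (fcntl(sock, F_SETSIG, SIGRTMIN) < 0)
		return -1;
	/* O_ASYNC enables signal-driven i/o; readiness transitions on the
	 * socket now queue SIGRTMIN with si_fd identifying the fd. */
	return fcntl(sock, F_SETFL,
		     fcntl(sock, F_GETFL) | O_ASYNC | O_NONBLOCK);
}

int next_ready_fd(void)
{
	sigset_t set;
	siginfo_t info;

	sigemptyset(&set);
	sigaddset(&set, SIGRTMIN);
	if (sigwaitinfo(&set, &info) < 0)    /* one event per call */
		return -1;
	return info.si_fd;
}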

 Please refer to www.kegel.com/c10k.html for a far more detailed coverage of
 these mechanisms, and how they can be used by applications.

Reasons for preferring aio:
- Desirable to have a unified approach, rather than multiple isolated 
  mechanisms if it can be done efficiently
- Multiplexing across different kinds of operations and sources
- Clear cut well-known system call interface preferable to more indirect
  interfaces
- Driving optimizations from low level/core primitives can be more efficient 
  and beneficial across multiple subsystems 
- Separate the event completion queue and notification mechanisms for
  flexibility and efficiency. (Can have tunable wakeup semantics, tunable 
  queue lengths, more efficient event ring buffer implementation)

2. Design Philosophy and Interface Design

2.1 System and Interface design philosophy:
Alternatives:
a. Entire system built on an asynchronous model, all the way through
  (e.g. the NT i/o subsystem). So most operations can be invoked in sync or async
  mode (sub-options of the same operation-specific interface). 
  Internally, the sync mode = async mode + wait for completion.
b. Async operations are initiated through a separate interface, and could
  follow a separate path from the synchronous operations, to a degree
  (they use common code, and lower down things may be truly async and common
  to both, but at the higher level the paths could be different)
 
The POSIX aio interface is aligned with (b). This is the approach that the
Linux implementation takes. Submission of all async i/o ops happens 
through a single call, with different command options and data structures 
used to represent the different operations (a rough sketch of this style of 
interface follows the trade-offs below).
Advantages:
- No change in existing sync interfaces (can't afford to do that anyway)
- Less impact on the existing sync i/o path. This code does not have the overhead
  of maintaining async state (it can use the stack), and can stay simple.
Disadvantages:
- Need to introduce interfaces or cmd structures for each operation 
  that can be async. (A little akin to an ioctl style approach)
- Different code paths implies some amount of duplication/maintenance 
  concerns. Can be minimized by using as much common code as possible.
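
As a rough illustration of the command-structure style (field and opcode
names below are purely illustrative, not the actual ABI of the patches):
all async operations funnel through one submission call, with the
operation selected by an opcode inside the request descriptor.

/* Illustrative only - not the exact ABI of the aio patches. */
struct my_iocb {
	void      *data;      /* user cookie handed back with the completion  */
	short      opcode;    /* e.g. MY_CMD_PREAD, MY_CMD_PWRITE, MY_CMD_POLL */
	short      reqprio;   /* optional relative priority (see section 2.4) */
	int        fildes;    /* target file descriptor                       */
	void      *buf;       /* user buffer                                  */
	size_t     nbytes;
	long long  offset;
};

/* A single entry point covers every operation type, as opposed to
 * adding an async variant of each existing synchronous syscall. */
long my_io_submit(long ctx, long nr, struct my_iocb **iocbs);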

2.2 Approaches for implementing aio

2.2.1 Alternative ways of driving the operation to completion 

1. Using threads to make things _look_ async to the application
  a. User level threads 
	- glibc approach (one user thread per operation ?)
	  poor scalability, performance
  b. Pool of threads 
	- have a pool of threads servicing an aio request queue for the
	  task - tradeoff between degree of concurrency/utilization and 
	  resource consumption.
2. Hybrid approach (SGI kaio uses this)
  - If the underlying operation is async in nature, initiate it right away
    (better utilization of underlying device), and just handle waiting for 
    completion via thread pool (could become a serialization point depending
    on load and number of threads) unless operation completes in a 
    non-blocking manner.
  - If underlying operation is sync, then initiate it via the thread pool
  Note:
  - SGI kaio has internal async i/o initiation interfaces for raw i/o and
    generic read. 
  - SGI kaio has these slave threads in the context of the aio task => at
    least one per task 
3. Implement a true async state machine for each type of aio operation.
   (i.e. a sequence of non-blocking steps, with continuation driven by IRQ and event 
   threads, based on low level primitives designed for this purpose)
  - Relatively harder to get right, and harder to debug, but provides 
    more flexibility, and greater asynchrony
    [Question for Ben: Is posix synchronized data integrity handled
    in the user space posix implementation ? ]

Ben's aio implementation takes approach 3 (with some caveats, as we shall see
later).
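
Very roughly, approach 3 means each request carries a continuation that
is re-armed after every non-blocking step.  The structure and helper
names below are hypothetical stand-ins (the actual patches use the
wtd/worktodo primitives touched on in section 4):

/* Hypothetical sketch of the continuation style behind approach 3.
 * Every step does only non-blocking work and then either completes the
 * request or re-arms the continuation to be run later (from irq context
 * or a worker/event thread) when the awaited event fires.
 * submit_nonblocking_io(), copy_kvec_to_user() and
 * post_completion_event() are placeholders, not real kernel calls. */
struct aio_cont {
	void (*next_step)(struct aio_cont *);   /* continuation to run next */
	void  *req;                             /* per-request aio state    */
};

static void step_copy_out(struct aio_cont *c);

/* Step 1: runs in the caller's context and issues the device i/o
 * without blocking; the completion interrupt later queues next_step. */
static void step_issue_io(struct aio_cont *c)
{
	c->next_step = step_copy_out;
	/* submit_nonblocking_io(c->req, c); */
}

/* Step 2: runs from a worker thread once the data is available; copies
 * into the user's buffers and posts the completion event. */
static void step_copy_out(struct aio_cont *c)
{
	/* copy_kvec_to_user(c->req); */
	/* post_completion_event(c->req); */
}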

[Todo: Make a note of Andi Kleen's approach - raw and comm aio only ? ]

2.2.1.1 Optimization/Fast-path for non-blocking case

In case an operation can complete in a non-blocking manner via the normal 
path, the additional async state path can be avoided. An F_ATOMIC flag check 
has been introduced down the sync i/o path to check for this, thus providing 
a fast path for aio.
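
The idea, in purely illustrative form (F_ATOMIC is the flag name from the
patch; the surrounding helpers below are placeholders, not real kernel
functions):

/* A would-block point inside the normal sync path honouring an
 * F_ATOMIC-style flag.  resource_available(), wait_for_resource() and
 * do_transfer() are placeholders. */
static long sync_path_step(struct file *filp, int flags)
{
	if (!resource_available(filp)) {
		if (flags & F_ATOMIC)
			return -EAGAIN;      /* bail out; the async state
						machine takes over */
		wait_for_resource(filp);     /* ordinary synchronous behaviour */
	}
	/* Nothing blocks past this point, so the aio fast path can complete
	 * the request here without entering the async state machine at all. */
	return do_transfer(filp);
}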

2.2.2 Handling User Space Data Transfer

With asynchronous i/o, the steps of the operation aren't guaranteed to execute
in the caller's context. Hence transfers/copies to/from user space need to be
handled carefully. Most of this discussion is relevant for buffered i/o,
since direct i/o avoids user/kernel space data copies.

In a thread pool approach, if a per-task thread pool is used, then such 
transfers can happen in the context of one of these threads. Typically 
the copy_to_user operations required to copy transferred data into user 
space buffers after i/o completion would be handled by these aio threads.

It may be possible to pass down all the user space data for the operation
when initiating i/o while in the caller's context without blocking, though
this is inherently likely to use extra kernel space memory. The same is
true on the way up on i/o completion, where it may be possible to continue
holding on to the in-kernel buffers until the caller actually gathers 
completion data, so that copy into user space can happen in the caller's
context. However this again holds up additional memory resources, which may
not be suitable, especially for large data transfers.
[BTW, on windows NT, iirc this sort of stuff happens through APCs or 
asynchronous procedure calls, in a very crude sense somewhat like softirqs 
running in the context of a specified task]

Instead, an approach similar to that taken with direct i/o has been adopted,
where the user space buffers are represented in terms of physical memory
descriptors (a list of tuples of the form <page, offset, len>), called kvecs,
rather than by virtual address, so that they are uniformly accessible in any 
process context. This required new in-kernel *kvec* interfaces which operate on 
this form of i/o currency or memory descriptors. Each entry/tuple in the kvec is
called a kveclet, and represents a contiguous area of physical memory. A
virtual address range or iovec (in the case of readv/writev) would map to a set
of such tuples which make up a kvec. 
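
The descriptors look roughly as follows (a simplified sketch; treat the
field names as indicative of the kvec interfaces rather than their exact
definition):

struct kveclet {
	struct page   *page;      /* physical page                       */
	unsigned       offset;    /* offset of the data within the page  */
	unsigned       length;    /* length of this contiguous piece     */
};

struct kvec {
	unsigned       max_nr;    /* entries allocated                   */
	unsigned       nr;        /* entries in use                      */
	struct kveclet veclet[0]; /* the <page, offset, len> tuples      */
};

Once a user buffer (or an iovec) has been pinned and mapped into one of
these, any context - a worker thread, an interrupt handler, or the caller
itself - can drive the transfer against it.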

[Note: This fits in very nicely with the current multi-page bio implementation
which also uses a similar vector representation, and also with the zero-copy
network code implementation. Ben has submitted some patches to make this
all a common data structure. TBD: I think some changes are needed in the 
multi-page bio code to get this to work properly without requiring a copy
of the descriptors.
There is a discussion on various alternative representations that have been 
considered in the past in sec 1.2.2 in 
http://lse.sourceforge.net/io/bionotes.txt
]


2.3 Extent of true async behaviour - Queue depth/Throttle points

There has been some discussion about the extent to which asynchronous 
behaviour should be supported in case the operation has to wait for some 
resource to become available (typically memory, or request queue slots). 
There obviously has to be some kind of throttling of requests by the system
beyond which it cannot take in any more asynchronous io for processing.
In such cases, it should return an error (as it does for non-blocking i/o)
indicating temporary resource unavailability (-EAGAIN), rather than block
waiting for the resource (or could there be value in the latter option ?).
It seems appropriate for these bounds to be determined by the aio queue depth 
and associated resource limits, rather than by other system resources (though
the allowable queue depth could be related to general resource availability).
This would mean that ideally, when one initiates an async i/o
operation, the operation gets queued without blocking anywhere, or returns
an error in the event it hits the aio resource limits.
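
From the application's side, the proposed -EAGAIN behaviour would be
handled roughly as below.  This assumes an io_submit()/io_getevents()
style interface as in Ben's patches (the names and signatures here follow
the libaio-style wrappers and may not match the patch exactly):

/* Treat -EAGAIN from submission as "aio queue full": reap at least one
 * completion to free queue slots, then resubmit the remainder. */
#include <errno.h>
#include <libaio.h>

int submit_all(io_context_t ctx, struct iocb **iocbs, int nr)
{
	struct io_event events[64];
	int done = 0, ret;

	while (done < nr) {
		ret = io_submit(ctx, nr - done, iocbs + done);
		if (ret > 0) {
			done += ret;          /* partial submission is fine */
			continue;
		}
		if (ret != -EAGAIN)
			return ret;           /* a real error               */
		/* Queue full: wait for at least one completion so that
		 * slots are released, then retry the rest. */
		io_getevents(ctx, 1, 64, events, NULL);
	}
	return done;
}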

[Note/TBD: I think this is the intended direction, but the current
implementation isn't exactly there yet. Async raw aio would probably block if
it needs to wait for request queue slots.  Async file i/o attempts to
avoid blocking the app due to sub i/os for bmap kind of operations but it
currently could block waiting for the inode semaphore. The long term direction
is to convert this wait to an async state driven mechanism. Also the 
wait for bmap operations seems to have been just pushed out of the app's 
context to that of the event thread that drives the next step of the state 
machine (which means that it could block keventd temporarily)]

2.4 Sequencing of aio operations

Specifying serialization restrictions or relative priorities:
- posix_synchronized_io (for multiple requests to the same fd)
  says that reads should see data written by requests preceding it - enforces
  ordering to that extent, if specified.
- aio_req_prio (not supported in the current implementation)
  app can indicate some requests are lower priority than others, so the system
  can optimize system throughput and the latency of other requests at the cost
  of the latency of such requests.
 [Todo: Discuss more later. Relative to process priority, so can't starve
  other process's requests etc. Priority as a hint vs barriers or sync i/o
  reqmts for strict ordering]
   
Beyond these restrictions and hints, sequencing is up to the system:
- Maximize throughput (global decision)
- Ideally minimize latency (local, for a request) 
Inherent tradeoffs, though improving system throughput could help with 
average latency, provided pipeline startup time isn't significant. The goal
could be to maximize throughput within reasonable latency bounds.

Since each operation could involve several steps which could potentially
run into temporary resource contention or availability delay points, the
sequence in which operations complete, or even reach the target device, is
affected by system scheduling decisions in terms of resource acquisition 
at each of these stages.

Note/TBD: Since the current implementation uses event threads to drive stages of
the async state machine, in situations where a sub-step isn't completely
non-blocking (as desired), the implementation ends up causing some
degree of serialization, or rather further accentuating the order in which
the requests reached the sub-step. This may be quite reasonable and possibly 
even beneficial for operations that are likely to contend for the same 
resources (e.g. requests to the same device), but not optimal for requests
that can proceed in a relatively independent fashion.

2.5 Completion/Readiness notification:

Note: Readiness notification can be treated as the completion of an asynchronous
operation to await readiness.

2.5.1 Requirements:

1. Efficient for large numbers of events and connections
- The interface to register events to wait for must be 
  separate from the interface used to actually poll/wait for 
  the registered events to complete (unlike traditional 
  poll/select), so that registrations can hold across multiple 
  poll waits with minimum user-kernel transfers.
  (it's better to handle this at the interface definition level than 
   through some kind of an internal poll cache)
- Ability to reap many events together (unlike current sigtimedwait
  and sigwaitinfo interfaces)
- Scalable/tunable queue lengths
- More flexible/tunable wakeup semantics for better concurrency
  control

2. Flexible grouping of operations 	
- Ability to wait for at least a specified number of operations from 
  a group to complete (at least N vs. at least 1 helps with batching
  on the way up, so that the application can perform its post-processing
  activities in a batch, without redundant context switches); see the
  sketch after this list.
- Support dynamic additions to the group rather than static or one time
  list passed through a single call (e.g. aggregation across multiple
  submissions)
  [Question: Is the option of the completion group being different from the 
  submission batch/group (i.e. per iocb grouping field) useful to have ?
  This is possible in POSIX with sigevent specifications if signals are 
  used for completion notification but not otherwise today]

3. Flexible distribution of responsibility across multiple 
   threads/components
- Different threads can handle submission for different operations,
  and another pool of threads could wait on completion
- Degree of concurrency can be improved simply by increasing threads
  in the pool that wait for and process completion of operations for 
  that group

4. Should also be able to wait for a specific operation to complete (without
   being very inefficient about it)
- Either have low overhead group setup/teardown (setup/teardown is
  needed in order to reserve resources in advance), so this can be
  a single operation group (costs can be amortized across multiple
  such operations by reusing the same group if they happen in a 
  serialized fashion)
- Or have an interface to wait for a specific operation to complete

5. Work reasonably well for light i/o loads as well
- Tunability for low application loads, so system impact is minimum
  (tunable wakeup semantics could help) 
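
As a usage-level sketch of requirements 2 and 3 above (batched reaping,
and spreading completion handling across a pool of threads), assuming an
io_getevents()-style call that takes a minimum event count;
handle_completion() is a placeholder for the application's
post-processing:

/* A completion worker that blocks until at least MIN_BATCH operations
 * from the group have completed, then post-processes them together.
 * Several such workers can run against the same context to raise the
 * degree of concurrency. */
#include <libaio.h>

#define MIN_BATCH  8
#define MAX_BATCH  64

extern void handle_completion(struct io_event *ev);   /* app-supplied */

void completion_worker(io_context_t ctx)
{
	struct io_event events[MAX_BATCH];
	long i, n;

	for (;;) {
		n = io_getevents(ctx, MIN_BATCH, MAX_BATCH, events, NULL);
		for (i = 0; i < n; i++)
			handle_completion(&events[i]);
	}
}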

2.5.2 I/O completion port/queue concept



2.6 Other Goals
- POSIX as well as completion port style interfaces/wrappers
- Low overhead for user (mmaped ring buffer, vsyscalls) and kernel 
- Extensible to newer operations  
-
-



3. Interfaces 

- general overview
- pointer to man pages ?
- about aio wrappers
- what is missing or nice-to-have

4. Design Internals

4.1 Low Level Primitives:
4.1.1 wait_queue func
    async and sync waiters both use wait queue (transparent to the caller
    of wakeup)
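
A purely illustrative sketch of the idea (names below are made up, not
the patch's definitions): the wait queue entry carries a function
pointer, so a wakeup can either wake a sleeping synchronous task or kick
the next step of an async state machine, without the waker knowing or
caring which kind of waiter it hit.

/* Illustrative only.  A sync waiter installs a "wake the task" callback;
 * an async waiter installs one that queues the request's next state
 * machine step.  The wakeup path simply invokes w->func for each entry. */
struct waiter {
	struct list_head   list;                    /* linked on the wait queue */
	void             (*func)(struct waiter *w); /* called at wakeup         */
	void              *data;                    /* task or aio request      */
};

static void wake_sync_task(struct waiter *w)
{
	/* wake_up_process(w->data); */
}

static void wake_async_step(struct waiter *w)
{
	/* queue the request's continuation to keventd / a worker thread */
}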

Callback-chaining

4.1.2 wtd for async state machine
- use of keventd or dedicated system worker threads 
4.1.3 Synchronization

4.1.4 Issues/Comments:
-

4.2 Generic async event handling pieces
- data structures, ioctx setup/destroy, submit, getevents
- scaling
- handling process fork/exec/exit
- queue limits

4.3 In-kernel interfaces
- kvecs
- file system ops
- alternatives
- fastpath

4.4 Async poll

4.5 Raw-disk aio

4.6 File-system/buffered aio

Explain the state machine in each of these cases

4.7 Network aio

4.8 Extending aio to other operations (e.g. sendfile)


5. Performance Characteristics
?

6. Todo Items and General Comments and Issues

- i/o cancellation implementation
- direct aio path
- mmaped ring buffer
- kernel memory pinning issues
- any races in current filesystem implementation
- implementations for other filesystems 
- network aio improvements or things to complete
- drivers which aren't totally async (use down)
-
-

7. References/Related patches:
1. SGI's kaio implementation
2. /dev/epoll patch, RT signals and various links from Dan Kegel's c10k page
3. NT I/O completion port, Solaris, AIX aio implementations

