
List:       lse-tech
Subject:    [Lse-tech] Status of read-copy update for Linux
From:       "Paul McKenney" <Paul.McKenney@us.ibm.com>
Date:       2001-01-12 4:54:25

I am making good progress working this through IBM's open-source process,
going for GPL.  I hope to get this taken care of sometime next month.  In
the meantime, Dipankar is getting the PTX version ported over.  I have
attached PTX's analog of a HowTo document, which includes PTX's API in the
hope that this may be of use.  I will be working to put together a Linux
version of this API.

                                   Thanx, Paul

(See attached file: rclock_Linux.html)
["rclock_Linux.html" (text/html)]

<HEAD>
<TITLE>Usage of Read-Copy Mutual-Exclusion Mechanism</TITLE>
</HEAD>
<BODY>
<H1>Read-Copy Mutual-Exclusion</H1>
<H1>Table of Contents</H1>
<OL>
<LI>	<A HREF="#OVERVIEW">OVERVIEW</A>
<LI>	<A HREF="#MOTIVATION">MOTIVATION</A>
<LI>	<A HREF="#DESCRIPTION">DESCRIPTION</A>
<LI>	<A HREF="#BENEFITS OF READ-COPY MECHANISM">
	BENEFITS OF READ-COPY MECHANISM</A>
<LI>	<A HREF="#RESTRICTIONS FOR USE OF READ-COPY MECHANISM">
	RESTRICTIONS FOR USE OF READ-COPY MECHANISM</A>
<LI>	<A HREF="#APPLICATIONS OF THIS MECHANISM">
	APPLICATIONS OF THIS MECHANISM</A>
<LI>	<A HREF="#OVERVIEW OF INTERFACE">OVERVIEW OF INTERFACE</A>
<LI>	<A HREF="#DESIGN RULES FOR kmem_deferred_alloc_embedded()">
	DESIGN RULES FOR kmem_deferred_alloc_embedded()</A>
<LI>	<A HREF="#ADDITIONAL DESIGN RULES TO PROTECT LIST OF ELEMENTS">
	ADDITIONAL DESIGN RULES TO PROTECT LIST OF ELEMENTS</A>
<LI>	<A HREF="#EXAMPLE OF ELIMINATING TABLE-LOOKUP MUTEX">
	EXAMPLE OF ELIMINATING TABLE-LOOKUP MUTEX</A>
<LI>	<A HREF="#EXAMPLE OF rc_callback_wrapper()">
	EXAMPLE OF rc_callback_wrapper()</A>
<LI>	<A HREF="#EXAMPLE OF kmem_deferred_free_embedded()">
	EXAMPLE OF kmem_deferred_free_embedded()</A>
<LI>	<A HREF="#DOCUMENTATION OF INTERFACES">DOCUMENTATION OF INTERFACES</A>
<LI>	<A HREF="#CPUS WITH WEAK MEMORY BARRIERS">
	CPUS WITH WEAK MEMORY BARRIERS</A>
</OL>
<P>
<H1>AVAILABILITY</H1>
<P>
These primitives are available in ptx4.x.
Additional interfaces will be available in ptx4.3.
<P>
<H1><A NAME="OVERVIEW">OVERVIEW</A></H1>
<P>
Straightforward, well-understood mutual-exclusion mechanisms
have served quite well for over a decade.  However, times
are changing, and computer architectures are changing with
them.  These changes have caused tried-and-true locking
primitives to become increasingly expensive when compared
to the cost of ordinary instructions, and provide motivation
for development and use of alternative locking mechanisms.
<P>
As usual, there is no free lunch.  Higher-performance locking
mechanisms tend to be more specialized than the more familiar
general-purpose mechanisms.  One useful specialization stems
from the observation that many data structures are accessed
much more frequently than they are modified.  These data
structures could benefit from locking mechanisms that reduce
the overhead for accessing the structure, while possibly
increasing the cost of modifying the structure.
<P>
One such locking mechanism is the reader-writer spinlock.
This mechanism gives readers better cache locality (and
writers worse cache locality), thereby speeding up readers
(and slowing down writers).  This mechanism is described
in detail in a separate document.
<P>
Another specialized locking mechanism is the read-copy
``lock'' (actually an update discipline).  This mechanism
reduces the readers' locking overheads to zero.  Since the
readers are doing absolutely nothing to protect their
accesses, the writers must take careful steps to ensure
that their updates will allow the readers to safely access
the data at any time.  The read-copy update discipline
mechanism provides primitives that allow writers to easily
take such careful steps.
<P>
The overhead of these primitives is relatively low, but is
considerably greater than that of a simple spinlock.
Therefore, like the reader-writer spinlock, this mechanism
improves performance only when the protected data structure
is read much more frequently than written.  In addition,
the readers must be able to tolerate using stale data.
Although this restriction sounds severe, it is satisfied
quite often.  One example is a routing table in a network
protocol, which is updated at most every minute or so, but
may be accessed several thousand times per second, and
which can tolerate stale routing information.  Even in the
unforgiving realm of numerical analysis, there are algorithms
that can tolerate stale data -- so-called ``chaotic
relaxation'' is one example.  Other example applications
of this mechanism are listed in the
<A HREF="#APPLICATIONS OF THIS MECHANISM">
APPLICATIONS OF THIS MECHANISM</A> section.
<P>
The remainder of this document describes the motivation
for the read-copy update mechanism (e.g., why it is so
important to eliminate lock operations), the PTX4.0
implementation, its benefits and drawbacks, applications
of this mechanism, and some example usages of the mechanism,
followed by a detailed description of the interface that
is present in PTX4.0.  The examples and interface are
specialized for the PTX kernel, but the mechanism can be
adapted to a variety of other environments, including user
processes and other kernels.
<P>
<H1><A NAME="MOTIVATION">MOTIVATION</A></H1>
<P>
CPU speeds have been increasing by a factor of 1.5x to
2x per year for more than a decade.  However, interconnect
speeds (printed circuit boards and busses) have only been
increasing by roughly 15 percent per year over the same
period.
<P>
To illustrate this effect, consider the number of simple
instructions that can be executed in the time required to
acquire and release a simple in-kernel spinlock for Sequent's
Symmetry Model-B and the NUMA-Q "Scorpion" system containing Xeon processors.
The Model B shipped in
mid-1988, gets about 5 MIPS, and requires about 10 microseconds
to acquire and release a spinlock.  The Scorpion shipped in
late 1998, gets about 540 MIPS, and requires about 4
microseconds to acquire and release a spinlock.
<P>
In the 10 years between the Model B and the Scorpion, the
CPU speed has increased more than 100-fold (1.6x increase per year), while
the lock speed has increased by only 2.5x (1.1x increase
per year).  The number of instructions that can be executed
in the time required to acquire and release a lock has risen
from 50 for the Model B to over 2,000 for the Scorpion.  If a lock
is acquired every 200 instructions, then the locking overhead
has increased from 20% to over 90% over this 10-year period.
This vividly illustrates why reducing the number of lock
operations can be just as important as reducing lock contention.
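<P>
To make the arithmetic explicit (a back-of-the-envelope check
using the figures above, not a new measurement):
<LISTING>
	Model B:   10 us/lock *   5 MIPS =   50 instructions per lock
	Scorpion:   4 us/lock * 540 MIPS = 2160 instructions per lock

	Overhead, given one lock acquisition per 200 instructions:
	Model B:     50 / (  50 + 200) = 20 percent
	Scorpion:  2160 / (2160 + 200) = over 90 percent
</LISTING>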
<P>
This increase in overhead provides the motivation for
constructing mutual-exclusion schemes that do not rely so
heavily on communication over the relatively slow interconnects
between the relatively fast CPUs.
<P>
Note that the MIPS ratings given above assume a low miss rate
from the on-chip cache.  An analysis of the effect of cache
misses would uncover a similar increasing penalty for data
movement between memory and caches.  This is a subject for
another document.  Note however that low-end machines that
trade scalability for low latency have lower miss penalties
than do higher-end machines such as ours.  Larger miss
penalties result in locks having higher overhead on high-end
machines than on machines with the low latencies made possible
by limited scaling.
<P>
The read-copy update mechanism allows locking operations to be
eliminated on read accesses to the data it protects.
<P>
<H1><A NAME="DESCRIPTION">DESCRIPTION</A></H1>
<P>
This section describes the read-copy mechanism as implemented
on PTX4.0.  The principles underlying the mechanism can be
applied to other environments (such as other kernels, to
real-time systems, and even to application programs).
These underlying principles are documented in a
<A HREF="http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf">
PDCS'98 paper</A>.
The rest of this section gives an overview of these principles.
This description assumes a non-preemptive kernel; however, a
preemptive kernel may be handled easily.
<P>
The basis of the read-copy mechanism is the use of the
execution history of each CPU to demonstrate that a given
destructive operation can be safely performed at a particular
time.  The mechanism is currently used in four general ways:
(1) as an update discipline that allows readers to access
a changing data structure without performing any synchronization
operations, (2) to guarantee existence of data structures referenced
by global pointers, thereby avoiding complex race conditions that
would otherwise occur when freeing up these data structures,
(3) to reduce deadlock-avoidance complexity,
and (4) as a barrier-like synchronization
mechanism that allows one process to exclude other processes
without requiring these other processes to execute any
expensive locking primitives or atomic instructions.
<P>
The key point is that all kernel data structures are
manipulated only by the kernel context of a user process
or by an interrupt routine.  Neither the idle loop nor a
user process running in user mode may manipulate kernel
data structures.  The idle loop, user-mode execution, and
(by special dispensation) a context switch are therefore
``quiescent states'':  a CPU in a quiescent state cannot
interfere with manipulation of in-kernel data or be interfered
with by manipulation of in-kernel data.  Of course, a given
CPU could switch out of a quiescent state at any time, but
such a CPU would only be able to access or modify shared
data that was globally accessible at or after that time.
Therefore, if a given process removes all globally-accessible
references to a particular data structure, then blocks
until all CPUs pass through a quiescent state (referring to
a summary of CPU execution history to determine when this
occurs), it can manipulate that data structure secure in
the knowledge that no other CPU can interfere.  Likewise,
if a given process makes a change to a global variable,
then blocks until all CPUs have passed through a quiescent
state, it can be sure that all the other CPUs are now aware
of the change, even if that variable is not protected by
any sort of mutex.
<P>
The core of the read-copy mechanism is a primitive named
<TT>rc_callback()</TT> that allows a callback to be scheduled
once all CPUs have passed through a quiescent state.
<P>
An example use of this mechanism is the deletion of a
data structure that is still being used by processes and
interrupt routines running on other CPUs.  To delete the
data structure, simply do the following:
<UL>
<LI>	Copy a pointer to it.
<P>
<LI>	Store <TT>NULL</TT> into the global pointer so that any
	subsequent processes or interrupts will be unable
	to access the data structure.
<P>
<LI>	Use <TT>rc_callback()</TT> to schedule a callback that
	frees up the data structure.
</UL>
This procedure will allow all kernel threads of execution
that might still be using the data structure to terminate
before actually deleting the structure.  Note that spinlocks
or reference counters could be used to get the same effect,
but that these techniques would impose overhead on everything
accessing the data structure.
<P>
The <TT>rc_callback()</TT> primitive takes a pointer to an
<TT>rc_callback_t</TT> control block, which may be allocated
via the <TT>rc_alloc_callback()</TT> primitive and freed via
the <TT>rc_free_callback()</TT> primitive.  The <TT>rc_alloc_callback()</TT>
primitive fills in the <TT>rc_callback_t</TT> with the address of
the function to be called back and the values of two arguments.
<P>
When the function is invoked, it will be passed a pointer to
the <TT>rc_callback_t</TT> and the two arguments.  The function
may reuse the <TT>rc_callback_t</TT> by rescheduling itself; otherwise,
it must free it up.  The <TT>rc_set_func()</TT> primitive can be
used as a shorthand for freeing up one <TT>rc_callback_t</TT> and
immediately allocating another.  The reason that the two
arguments are passed as separate parameters even though they
are present in the <TT>rc_callback_t</TT> is to protect layered
products from any changes in the <TT>rc_callback_t</TT> format.
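<P>
The following minimal sketch shows the deletion procedure in code.
The calling sequences of <TT>rc_alloc_callback()</TT>,
<TT>rc_callback()</TT>, and the callback function are assumed from
the descriptions above; the <TT>foo_t</TT> structure and its global
pointer are hypothetical:
<LISTING>
	/* Invoked once all CPUs have passed through a quiescent state. */
	void
	foo_free_callback(rc_callback_t *rcp, void *arg1, void *arg2)
	{
		kmem_free(arg1, sizeof(foo_t));
		rc_free_callback(rcp);	/* not rescheduling, so free it. */
	}

	void
	foo_delete(void)
	{
		foo_t *p = foo_global;	/* copy the pointer. */
		rc_callback_t *rcp;

		foo_global = NULL;	/* cut off any new references. */
		rcp = rc_alloc_callback(foo_free_callback,
					(void *)p,
					NULL);
		rc_callback(rcp);	/* schedule the deferred free. */
	}
</LISTING>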
<P>
An <TT>RC_MEMSYNC()</TT> primitive is used to maintain memory
consistency on CPUs that can reorder memory reads and writes
(note however that the existing spinlock primitives and the
<TT>rc_callback()</TT> primitive act as implicit <TT>RC_MEMSYNC()</TT>s).
This primitive must be used between the act of filling in
the fields internal to a given structure and the act of
creating a globally-accessible pointer to that structure.
Otherwise, an aggressive super-scalar CPU might reorder the
memory write to the global pointer before some
of the writes that initialized the structure itself.  This
reordering could allow other CPUs to access the structure
before it was fully initialized, which could significantly reduce
the longevity of the kernel.
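<P>
For example, the pattern is as follows (structure and pointer
names are hypothetical):
<LISTING>
	newp->foo_address = addr;	/* initialize internal fields... */
	newp->foo_data = data;
	RC_MEMSYNC();			/* ...commit the writes to memory... */
	foo_global = newp;		/* ...then publish the global pointer. */
</LISTING>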
<P>
<TT>RC_RDPROTECT()</TT> and <TT>RC_RDUNPROTECT()</TT> primitives are
used to indicate sections of code that are relying on the
read-copy mechanism.  These generate no code in .std kernels,
but generate debug code in .mfg kernels that checks for proper
nesting.
<P>
The preceding primitives form the interface to the core read-copy
mechanism, and are described in detail later in this document.
There are two facilities layered on top of this core mechanism:
the deferred-free facility and the callback-wrapper facility.
<P>
The deferred-free primitives provide an efficient and
easy-to-use interface to the <TT>rc_callback()</TT> core primitive
that allows the caller to specify that a given data structure
be freed after all CPUs (which might still be referencing
the structure) have passed through a quiescent state.  The
workhorse of this interface is <TT>kmem_deferred_free()</TT>.
The example data structure deletion given above would use
<TT>kmem_deferred_free()</TT> as follows:
<UL>
<LI>	Copy a pointer to the data structure to be deleted.
<P>
<LI>	Store <TT>NULL</TT> into all global pointers to it so that
	any subsequent processes or interrupts will be unable
	to access the data structure.
<P>
<LI>	Use <TT>kmem_deferred_free()</TT> to delete the data structure
	once it becomes safe to do so.
</UL>
The <TT>kmem_deferred_free()</TT> primitive requires some
additional memory to keep track of all data structures
waiting for a deferred free.  This memory may be unioned
with fields in the data structure that are no longer needed
once the deferred free is in progress, or may be allocated
separately.  The <TT>kma_alloc_defer_item()</TT> and
<TT>kma_defer_free_item()</TT> primitives allow this memory to
be allocated and freed independently of the data structure,
while the <TT>kmem_deferred_alloc_embedded()</TT> and
<TT>kmem_deferred_free_embedded()</TT> primitives allow it to be
allocated along with the structure itself.  If the additional
memory is to be unioned with fields in the data structure,
<TT>kma_alloc_defer_item_size()</TT> should be used to verify
that the fields are large enough.
<P>
The callback-wrapper facility consists of the
<TT>rc_callback_wrapper()</TT> function, which may be specified
as a callback to the <TT>rc_callback()</TT> primitive.  The
<TT>rc_callback_wrapper()</TT> function takes another function
as its first parameter and that function's single parameter
as its second parameter.  This allows functions such as
<TT>v_sema()</TT> to be conveniently invoked using the read-copy
callback mechanism.
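<P>
For example, a process could block until all CPUs are guaranteed
to have observed a state change by passing <TT>v_sema()</TT> through
the callback wrapper.  This sketch assumes the
<TT>rc_alloc_callback()</TT> calling sequence described above and a
conventional <TT>p_sema()</TT>/<TT>v_sema()</TT> semaphore pair; the
semaphore itself is hypothetical:
<LISTING>
	rcp = rc_alloc_callback(rc_callback_wrapper,
				(void *)v_sema,
				(void *)&my_sema);
	rc_callback(rcp);
	p_sema(&my_sema, PZERO);  /* block until all CPUs quiesce. */
</LISTING>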
<P>
So where did the name ``read-copy'' come from?  It turns out
that this is how the mechanism is used to modify (rather
than simply delete) a data structure while allowing readers
full concurrent access to that data structure.  The procedure is
as follows:
<OL>
<LI>	Allocate new memory for the modified data structure.
<P>
<LI>	Copy the contents of the old data structure to the new.
<P>
<LI>	Modify the new data structure as needed.
<P>
<LI>	Update all pointers that currently point to the old
	copy to instead point to the new copy.
<P>
<LI>	Do a deferred free of the old copy.
</OL>
In other words, readers can continue reading while updaters make
new copies, hence ``read-copy''.  Use of the mechanism in this
manner is sometimes called a ``read-copy lock''; this nomenclature
was stolen from the ``reader-writer lock''.  This analogy with
spinlock primitives can be useful, but it is better to think
of the read-copy mechanism as enabling the read-copy update
procedure shown above than to risk confusing the mechanism with
explicit locking.
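<P>
As a compact sketch of these five steps (structure, field, and
pointer names are hypothetical; the
<TT>kmem_zalloc_deferred_embedded()</TT> and
<TT>kmem_deferred_free_embedded()</TT> primitives are covered in
detail below):
<LISTING>
	newp = kmem_zalloc_deferred_embedded(sizeof(*newp),
					     KM_SLEEP);	/* 1: allocate. */
	*newp = *oldp;				/* 2: copy old contents. */
	newp->xxx_data = new_data;		/* 3: modify as needed. */
	RC_MEMSYNC();		/* commit writes before publishing. */
	xxx_global = newp;			/* 4: update pointers. */
	kmem_deferred_free_embedded(oldp,
				    sizeof(*oldp),
				    NULL);	/* 5: deferred free. */
</LISTING>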
<P>
<H1><A NAME="BENEFITS OF READ-COPY MECHANISM">BENEFITS OF READ-COPY \
MECHANISM</A></H1> <P>
<OL>
<LI>	Requires execution of fewer explicit mutex operations
	to protect accesses to shared data
	structures.  In many cases <EM>no</EM> explicit lock
	operations are required.  This results in less
	contention and lower CPU overhead.  Of course,
	<EM>modification</EM> of shared data structures still
	requires locking operations.  However, all the
	well-known techniques (such as hashing) may be used
	to reduce the lock contention incurred when modifying
	shared data.
<P>
<LI>	Requires execution of fewer special instructions that
	enforce sequential consistency (e.g., instructions
	taking the ``lock'' prefix on the Intel 80x86
	instruction-set architecture).  These instructions
	result in CPU pipeline flushes, resulting in degraded
	performance, especially on the newer super-scalar
	microprocessors containing multiple instruction
	pipelines.
<P>
<LI>	In code not subject to blocking preemption, requires fewer
	interrupt-level manipulation operations when accessing
	shared data structures.  These operations often
	require interaction with relatively slow external
	interrupt-distribution hardware, and are frequently
	required even on uniprocessors (in which case the
	read-copy update mechanism can be said to have <EM>negative</EM>
	overhead with respect to traditional uniprocessor
	algorithms).  In PTX, interrupt-level manipulation
	operations are usually combined with the locking
	operations and are still usually required when <EM>modifying</EM>
	shared data structures.
<P>
	An example of this occurs when a data structure is
	modified at interrupt level.  If the interrupt code
	can use the read-copy mechanism, then any process-level
	code that reads the data structure no longer needs
	to exclude the update code in the interrupt routine.
	Therefore, the process-level code no longer needs
	to block interrupts for the duration of its access.
<P>
	In environments allowing preemption, interrupt-level
	manipulations or operations that disable preemption
	may still be required.
<P>
<LI>	This mechanism is not as susceptible to deadlock as
	are explicit lock operations.  The reduced
	susceptibility to deadlock results in reduction in
	or elimination of code that would otherwise be
	required to detect and repair or to avoid deadlocks,
	which in turn reduces software development and
	maintenance costs.
</OL>
<P>
<H1><A NAME="RESTRICTIONS FOR USE OF READ-COPY MECHANISM">RESTRICTIONS FOR USE OF \
READ-COPY MECHANISM</A></H1> <P>
As noted in the overview, high-performance locking mechanisms
like rclock are specialized, and can therefore be used only
in certain situations.  This section lists these restrictions
along with metrics that can help to determine when it is
beneficial to use the rclock mechanism.
<OL>
<LI>	Writers cannot exclude readers.  This can be an
	advantage in cases where there may be bursts of
	very heavy write activity, causing conventional
	locking primitives to slow down or entirely lock
	out read activity.  However, readers must be excluded
	in those cases where writers cannot make their
	modifications in such a way that each intermediate
	state appears consistent to a reader.  This read-copy
	mechanism is unable to provide this form of exclusion.
<P>
<LI>	Readers cannot exclude writers.  Readers may therefore
	see stale data.  Therefore, algorithms employing the
	rclock mechanism must tolerate stale data, or must
	explicitly flag stale data (see Pugh's skip-list
	papers for ways of accomplishing this).
<P>
<LI>	There is a significant latency imposed by the primitives
	making up the write side of the read-copy mechanism, even
	in the absence of contending readers.  This latency averages
	about 25 milliseconds on lightly-loaded systems.  In other
	words, there will be about a 25-millisecond delay between
	the time that <TT>rc_callout()</TT> is invoked and the time that
	the rclock subsystem invokes the corresponding callout.
<P>
	This figure assumes a 10-millisecond clock interrupt as well
	as short in-kernel code paths.
	Varying the clock frequency will change the average delay.
	Long in-kernel code paths will increase the average delay.
	Note that speeding up the clock (thus decreasing the average
	delay) will in turn decrease the number of requests satisfied
	by a single quiescent-state detection, which in turn can
	degrade performance.
<P>
	Note that the <TT>kmem_deferred_free()</TT> family of
	primitives shields you from the <TT>rc_callout()</TT>
	latency as long as you have enough memory to handle
	25 milliseconds worth of deletions from your data
	structure.
<P>
<LI>	The write side of the read-copy update mechanism can be
	stalled by bugs in the kernel that result in long-running
	loops.  For example, if a buggy kernel driver spins for
	one second, there can be up to a one-second latency
	imposed on the write-side primitives.  Note that this
	latency does <EM>not</EM> consume CPU time.  PTX 4.0.1 and
	4.1 (as well as later OS versions) have a version of the rclock
	mechanism that issues warnings to the console when such stalls
	occur.  In even later versions of the OS, these warnings
	include a short-form stack trace of the overly long code
	segment.
<P>
<LI>	In cases where it is permissible to exclude readers,
	the read-to-write ratio must be rather high in
	order for the read-copy mechanism to have a
	performance benefit (in cases where it is not
	permissible to exclude readers, read-copy is the
	only game in town).  The exact breakeven ratio will
	vary depending on the details of the situation,
	but, as a rule of thumb, the read-copy update mechanism
	will almost certainly provide a performance benefit
	when the read-to-write ratio is N-to-1, where N is the
	number of CPUs.  If there are a large number of
	read-copy updates per unit time, the overhead of
	each update is amortized, so that read-copy update
	provides a performance benefit with read-to-write
	ratios as low as 1.10.
<P>
	The reason for this wide range is that the primitives
	making up the read-copy mechanism make heavy use of
	batching techniques that allow overhead to be spread
	over multiple concurrent requests.  
<P>
<LI>	The use of the read-copy mechanism for operations that
	modify existing data structures sometimes requires
	that the modification process be restructured into
	read-copy form.  This restructuring is necessary
	in cases where the modification cannot be made
	without the data structure passing through inconsistent
	intermediate states.
<P>
	The read-copy form requires that a new data structure
	be allocated and copied from the old one.  The new
	data structure is then modified as required.  The old
	data structure is then replaced with the newly modified
	one.  Finally, the old data structure is deleted when
	it is safe to do so.
<P>
	Such restructuring will likely require the addition of 
	memory allocation, which can fail.  These failures must
	be correctly handled, either by blocking until memory
	is available, aborting the modification with an error,
	preallocating the memory,
	or using some mechanism to retry the modification at
	a later time.
<P>
	In some cases, the complexity added by this restructuring
	precludes use of the read-copy mechanism.
<P>
	Note that modifications that do not leave the data
	structure in inconsistent intermediate states do not
	require this sort of restructuring.  Examples include
	"blind" writes to single aligned bytes, words, or longs,
	or atomic updates of single aligned bytes, words, longs,
	or (on Pentiums and better) 8-byte quantities.
<P>
<LI>	The read-copy mechanism may not be used to protect
	access to a data structure across a blocking operation.
	This same restriction also applies to spinlocks.
<P>
	Therefore, data structures containing semaphores
	are not candidates for the read-copy mechanism
	unless the semaphores can be guaranteed to have no
	processes sleeping on them at the time of deletion.
	(It is possible to create a mechanism analogous to
	the read-copy mechanism where CPUs are replaced by
	processes and where a different set of quiescent
	states is selected so that this new mechanism can
	tolerate blocking.  Such a mechanism was created
	as part of ptx4.5, and is used to protect per-process state
	such as credentials, current directory, and root directory.)
<P>
<LI>	A read-copy update requires the updater to locate
	all globally accessible pointers to the structure being updated.
	Local temporary variables need <I>not</I> be located, as
	the read-copy callback will delay until all such temporaries
	are no longer in use.
<P>
<LI>	Readers cannot block while accessing a data structure
	protected by the rclock mechanism.  If a reader blocks,
	it must restart the access from the beginning, as the
	data structure may well have undergone arbitrary changes
	in the meantime.  In principle, it is possible to
	recast the rclock mechanism into a per-process form
	that would tolerate reader blocking, such as was
	done to protect per-process state in ptx4.5.
</OL>
<P>
Given all of these restrictions, it might seem that the
read-copy mechanism is useful only in a few esoteric
situations.  In fact, these situations are surprisingly common.
The following section lists some areas where the read-copy mechanism
is useful and lists some conditions that must be satisfied for it to
be used.
<P>
<H1><A NAME="APPLICATIONS OF THIS MECHANISM">APPLICATIONS OF THIS MECHANISM</A></H1>
<P>
The following are a few specific examples of applications
of this mechanism.
<OL>
<LI>	Routing tables in computer communications protocol
	implementations.  Here, the readers look up routes in
	the table for the packets leaving the system.  The
	writers apply routing updates to the table.  With the
	advent of fibrechannel, disk I/O must often be routed
	just like traditional comms traffic.  Read-copy update
	may therefore be used to route disk-I/O traffic through
	fibrechannel networks.
<P>
<LI>	Address-resolution tables in computer communications
	protocol implementations.  Readers look up link-level
	addresses for the packets leaving the system and writers
	apply address-resolution updates to the table.
<P>
<LI>	Device state tables.  Readers access and possibly modify
	the individual entries, while writers may add and delete
	entries.  If readers can modify the individual entries,
	then there will need to be some sort of mutex on the
	individual entries.
<P>
<LI>	Tables whose entries can be more efficiently deallocated
	in batches than one at a time (for example, by spreading
	locking overhead over more than one entry).
<P>
<LI>	Phased state change.  A data structure with an intermediate
	state between each existing state can switch to the
	intermediate state, then switch to the destination
	state once there are no more threads that have reason to
	assume that the data structure was still in the initial
	state.
<P>
	This is used in ptx/CLUSTERS when starting the recovery
	phase.  When a lock-manager instance determines that
	recovery is necessary, it sets a flag and registers a
	callback with the read-copy mechanism.  Once the callback
	is invoked, the lock-manager can be sure that all
	subsequent processing will be aware that a recovery is
	in progress.  The alternative is to protect all code
	checking or setting the flag with an explicit (and
	expensive) mutex.
<P>
<LI>	Existence guarantees.  This technique simplifies the handling
	of races between use of a service and teardown of that service.
	This approach is used in TCP/IP to handle
	the case where an IP protocol (such as TCP, UDP, or,
	of more practical interest, ICMP) is shut down while
	packets are still being received and processed.  Each
	such protocol has a pointer to its private data
	structures, and these data structures must be freed up
	as part of the shutdown process.  Clearly, it would be
	disastrous if a packet, received just before shutdown,
	was still being processed when the relevant data
	structures were freed.  This disaster could be prevented
	using global locks or reference counts on each and every
	packet reception, but this would result in cache thrashing.
<P>
	Instead, we first <TT>NULL</TT> out the pointer to the private
	data structures so that any subsequent packets will be
	properly rejected.  Then, the read-copy mechanism is
	used to post a callback when all current kernel threads
	of execution -- including those currently processing
	recently-received packets -- have terminated.  At this
	point, it is safe to delete the private data structures.
<P>
	Existence guarantees are also used to
	ensure that all CPUs are finished using a data structure
	before deleting it.  This case can occur during processing
	of a Unix <TT>close()</TT> system call.  The kernel context of the
	process doing the close can set a flag indicating that the
	data structure is no longer available.  The mechanism may
	then be used to wait until all currently-executing kernel
	threads of control (which may not have seen the new value
	of the flag) have terminated, at which point it is safe
	to delete the data structure.
<P>
<LI>	Poor-man's garbage collection.  A function that returns a
	pointer to a dynamically-allocated structure could relieve
	the caller of the responsibility for freeing the structure
	by doing a deferred deletion before returning.  It is not
	yet clear whether this is a useful tool or a shameless
	kludge.
<P>
<LI>	Simplify locking design when freeing a structure.  It is
	sometimes convenient to hold a structure's lock throughout
	the entire deletion process.  However, one problem with
	this approach is that if some other CPU attempts to acquire
	the lock just after the deleting CPU acquires it, the second
	CPU can end up holding the lock just after the element is
	returned to the freelist, where it might be reused for some
	unrelated purpose arbitrarily quickly.  Doing a deferred
	rather than an immediate deletion is one way to resolve this
	problem -- the element could not be returned to the freelist
	until the second CPU was finished looking at it.  This would allow
	the deleting CPU to set a flag in the structure indicating
	that it was no longer valid.
<P>
	This technique is used in TCP 4.2.0 and later to avoid deadlocks
	in STREAMS accept/t_accept processing.  It is also used in
	TCP 4.2.0-4.4.x to avoid socket_peer deadlocks in STREAMS
	open processing.
</OL>
<P>
Cases 1, 2, and 3 will have read-side code with the following
structure when coded using Dynix/PTX spinlock primitives:
<LISTING>
	s = p_lock(&my_lock, MY_SPL);
	/* look something up. */
	v_lock(&my_lock, s);
	/* use the item looked up. */
</LISTING>
This situation offers the most natural opportunity for use
of these primitives.  The information is used outside of
the lock, and therefore may be stale.  In contrast, code
with the following structure:
<LISTING>
	s = p_lock(&my_lock, MY_SPL);
	/* look something up. */
	/* use the item looked up. */
	v_lock(&my_lock, s);
</LISTING>
is probably <EM>not</EM> a candidate for use of these primitives, as this
code guarantees that the information is up-to-date while it is
being used.  As noted in the previous section, the locking design
would have to be restructured before read-copy update could be used.
<P>
<H1><A NAME="OVERVIEW OF INTERFACE">OVERVIEW OF INTERFACE</A></H1>
<P>
The easiest-to-use members of the read-copy update interface
are the <TT>kmem_deferred_free_embedded()</TT> family of functions.
These functions take a pointer to a structure that is to
be freed, but which may still be in use by other CPUs.
The structure must have been removed from any global lists
so that any CPU not currently referencing it will have no way
of gaining a fresh reference to it in the future.  The
<TT>kmem_deferred_free_embedded()</TT>
functions place the structure on a list where it is kept
until all CPUs have passed through the idle loop, user
code, a context switch, the beginning of a system call, or
a trap from user space.  At this point, all threads of
kernel execution that were in progress at the time of the
<TT>kmem_deferred_free_embedded()</TT> call have terminated.  Since
any new threads are unable to form a new reference to the
structure, it may now be safely deleted.
<P>
The other commonly-used member is the <TT>rc_callback_wrapper()</TT>
function that allows a process to wake itself up after all
CPUs can be guaranteed to see a global state change.
<P>
Example uses of both types of interface appear in the following
sections.
<P>
<H1><A NAME="DESIGN RULES FOR kmem_deferred_alloc_embedded()">DESIGN RULES FOR \
<TT>kmem_deferred_alloc_embedded()</TT></A></H1> <P>
Applying the <TT>kmem_deferred_free_embedded()</TT> function to a
data structure that is linked together with many other data
structures can be an exercise in frustration if done in an
ad-hoc manner.  An effective way to manage the complexity
is to restrict the pointer-manipulation to a few functions
as described below.  This approach can also do much to
simplify existing code, independent of the use of the
read-copy update mechanism.
<P>
The first thing to note is that use of <TT>kmem_deferred_free_embedded()</TT>
assumes that updates are much less frequent than accesses.
Therefore, the code that performs the updates usually need
not be highly optimized, so the function-call overhead
incurred by grouping the needed pointer manipulations
into a few functions is not a concern.
<P>
The second thing to note is that a common mistake in the
use of <TT>kmem_deferred_free_embedded()</TT> is to treat a structure
slated for deletion as if it were the most up-to-date copy.
The best way to guard against this is to maintain a flag in
the structure that indicates that it is slated for deletion
and to keep a pointer to its replacement, as shown in the
following examples.
<P>
An ``xxx'' structure that supports the read-copy mechanism
might therefore be defined as follows:
<LISTING>
	typedef struct xxx xxx_t;
	
	struct xxx {
		xxx_t	*xxx_next;
		xxx_t	*xxx_prev;
		long	xxx_flags;
		xxx_t	*xxx_replacement;
		long	*xxx_commonctr;	 /* example field to be */
					 /* shared among all versions */
					 /* of an object. */
		/* other fields as needed, e.g... */
		long	xxx_address;	 /* used to identify xxx. */
		lock_t	xxx_lock;	 /* used to lock individual */
					 /* xxx_t. */
	};

	#define XXX_DEFDEL 0x4000 /* Deletion in progress. */
	#define XXX_DEFDUP 0x8000 /* Deletion of duplicate. */

	#define SPLXXX SPLHI	/* For example... */

	xxx_t	xxx_head;	/* Global list header. */
	lock_t	xxx_mutex;	/* Global update mutex. */
</LISTING>
Implementations in languages such as C++ or Smalltalk that support
a more natural expression of object properties should of course
use a more object-oriented approach.
In either case, some algorithms may be structured so as to allow
the <TT>xxx_next</TT> pointer to be used in place of the
<TT>xxx_replacement</TT> pointer.
<P>
Allocation of an ``xxx'' structure should be performed by
a function of the form:
<LISTING>
	xxx_t *
	xxx_alloc(xxx_t *oldp, int flag)
	{
		xxx_t *newp;
		long *commonctr = NULL;

		if (oldp == NULL) {

			/*
			 * There is no old version, so allocate
			 * all fields that are common to all
			 * versions.
			 */

			commonctr = kmem_zalloc(sizeof(*commonctr),
						flag);
			if (commonctr == NULL) {
				return (NULL);
			}
		}

		newp = kmem_zalloc_deferred_embedded(sizeof(xxx_t),
						     flag);
		if (newp == NULL) {
			if (commonctr != NULL) {
				/* Avoid leaking the common fields. */
				kmem_free(commonctr,
					  sizeof(*commonctr));
			}
			return (NULL);
		}

		if (oldp != NULL) {
			INSIST((oldp->xxx_flags & XXX_DEFDEL) == 0,
			       "xxx_alloc: update of deleted item!");
			/*
			 *+ Internal software error.  Caller attempted
			 *+ to modify a structure already slated for
			 *+ deletion.  Corrective action:  change
			 *+ code to modify the current version.
			 *+ This may involve eliminating a race
			 *+ condition.
			 */

			/* Also, caller must hold xxx_mutex. */

			*newp = *oldp;
			oldp->xxx_flags |= XXX_DEFDEL | XXX_DEFDUP;
			oldp->xxx_replacement = newp;
		} else {

			newp->xxx_commonctr = commonctr;

			/* initialize nonzero fields of newp. */
		}

		return (newp);
	}
</LISTING>
Once a structure has been allocated, the calling code will likely
modify some of the fields as needed.  If the structure is to replace
an existing one, the following function would be used:
<LISTING>
	void
	xxx_replace(xxx_t *newp, xxx_t *oldp)
	{
		/* Caller must hold xxx_mutex. */

		/* Commit to memory before pointer update. */

		RC_MEMSYNC();

		/*
		 * Update all pointers to old version to now
		 * point to new version.
		 */

		if (newp->xxx_next->xxx_prev == oldp) {
			newp->xxx_next->xxx_prev = newp;
		}
		if (newp->xxx_prev->xxx_next == oldp) {
			newp->xxx_prev->xxx_next = newp;
		}

		kmem_deferred_free_embedded(oldp,
					    sizeof(*oldp),
					    xxx_cleanup);
	}
</LISTING>
Another function named <TT>xxx_insert()</TT> would be needed to insert
an entirely new structure into the lists.  A fourth function named
<TT>xxx_delete()</TT> would be used to remove a structure without replacing
it with a new copy.  Its form would be similar to that of
<TT>xxx_replace()</TT>.
<LISTING>
	void
	xxx_insert(xxx_t *newp)
	{
		/* Caller must hold xxx_mutex. */

		/* Commit to memory before pointer update. */

		RC_MEMSYNC();

		/* Link into all relevant lists.  */

		newp->xxx_prev = &xxx_head;
		newp->xxx_next = xxx_head.xxx_next;
		xxx_head.xxx_next->xxx_prev = newp;
		xxx_head.xxx_next = newp;
	}

	void
	xxx_delete(xxx_t *oldp)
	{
		/* Caller must hold xxx_mutex. */

		/* Link out of all relevant lists.  */

		oldp->xxx_prev->xxx_next = oldp->xxx_next;
		oldp->xxx_next->xxx_prev = oldp->xxx_prev;

		/* Mark as deleted. */

		INSIST((oldp->xxx_flags & XXX_DEFDEL) == 0,
		       "xxx_delete: deletion of deleted item!");
		/*
		 *+ Internal software error.  Caller attempted to
		 *+ delete a structure already slated for deletion.
		 *+ Corrective action:  change code to modify the
		 *+ current version.  This may involve eliminating
		 *+ a race condition.
		 */

		oldp->xxx_flags |= XXX_DEFDEL;

		/*
		 * Delete the structure when safe to do so.
		 * Also does implicit RC_MEMSYNC().
		 */

		kmem_deferred_free_embedded(oldp,
					    sizeof(*oldp),
					    xxx_cleanup);
	}
</LISTING>
If the structure has pointers to other structures which themselves
may need to be freed, then an <TT>xxx_cleanup()</TT> function may be
used to handle this.  This example has a field <TT>xxx_commonctr</TT>
that is common to all versions of a given structure.  Therefore,
this field must be freed only when the object is deleted -- not
when a version is deleted after being replaced by a newer version.
If the structure did not have any such fields, then the <TT>xxx_cleanup</TT>
passed to the call to <TT>kmem_deferred_free_embedded()</TT> above would
be replaced with <TT>NULL</TT> in order to omit any cleanup processing.
<LISTING>
	int
	xxx_cleanup(void *ptr, size_t nbytes)
	{
		xxx_t *xp = (xxx_t *)ptr;

		if (!(xp->xxx_flags & XXX_DEFDUP)) {

			/*
			 * This is not an old duplicate, so we
			 * must free up all fields common to
			 * all versions.
			 */

			kmem_free(xp->xxx_commonctr,
				  sizeof(*xp->xxx_commonctr));

			/*
			 * Insert code here to free up other
			 * fields common to all versions.
			 */
		}

		/*
		 * Insert code here to free up any other structures
		 * pointed to by this one that are not referenced
		 * by other structures.
		 */

		/*
		 * Since we did not free up the structure itself,
		 * return 0 to tell the caller to do it.
		 */

		return (0);
	}
</LISTING>
To illustrate, the structure pointed to by <TT>xp</TT> would be updated
from within a process context as follows:
<LISTING>
	s = p_lock(&xxx_mutex, SPLXXX);

	/*
	 * If the initial search for xp was done under the lock, then
	 * this "while" loop would not be necessary, see below.
	 */

	while (xp->xxx_replacement != NULL) {
		xp = xp->xxx_replacement;
	}
	newp = xxx_alloc(xp, KM_SLEEP);
	/* insert code here to modify newp's fields as appropriate. */
	xxx_replace(newp, xp);
	v_lock(&xxx_mutex, s);
</LISTING>
An entirely new structure would be added as follows:
<LISTING>
	s = p_lock(&xxx_mutex, SPLXXX);
	newp = xxx_alloc(NULL, KM_SLEEP);
	/* insert code here to initialize newp's fields. */
	xxx_insert(newp);
	v_lock(&xxx_mutex, s);
</LISTING>
An existing structure would be deleted (without replacement) as
follows:
<LISTING>
	s = p_lock(&xxx_mutex, SPLXXX);

	/*
	 * If the initial search for xp was done under the lock, then
	 * this "while" loop would not be necessary, see below.
	 */

	while (xp->xxx_replacement != NULL) {
		xp = xp->xxx_replacement;
	}
	xxx_delete(xp);
	v_lock(&xxx_mutex, s);
</LISTING>
The <TT>while</TT> loop in both the replacement and deletion cases is
not needed if the list traversal that results in the xp pointer
is conducted under the lock.  For example, if the lookup and
deletion code were combined, the following would result:
<LISTING>
	s = p_lock(&xxx_mutex, SPLXXX);
	xp = xxx_lookup(xxx_address);
	if (xp == NULL) {
		v_lock(&xxx_mutex, s);

		/*
		 * Do whatever is necessary to report failure and
		 * return control to the caller.
		 */
	}
	xxx_delete(xp);
	v_lock(&xxx_mutex, s);
</LISTING>
Since the lock is held while looking up the pointer, it is not
possible for the pointer to become obsolete.
<P>
The code for <TT>xxx_lookup()</TT> would be as follows:
<LISTING>
	xxx_t *
	xxx_lookup(long xxx_address)
	{
		xxx_t *xp;

		for (xp = xxx_head.xxx_next;
		     xp != &xxx_head;
		     xp = xp->xxx_next) {
			if (xp->xxx_address == xxx_address) {
				return (xp);
			}
		}
		return (NULL);
	}
</LISTING>
Note that the caller is responsible for locking.  As noted
earlier, updaters would use <TT>xxx_mutex</TT> to lock.  Readers would
just use <TT>RC_RDPROTECT()</TT> and <TT>RC_RDUNPROTECT()</TT> as follows:
<LISTING>
	RC_RDPROTECT();
	xp = xxx_lookup(xxx_address);
	if (xp != NULL) {
		/* Do something with xp. */
	}
	RC_RDUNPROTECT();
</LISTING>
<P>
<H1><A NAME="ADDITIONAL DESIGN RULES TO PROTECT LIST OF ELEMENTS">ADDITIONAL DESIGN \
RULES TO PROTECT LIST OF ELEMENTS</A></H1> <P>
The rclock discipline is often used to protect a list of elements,
but each element itself may need to be protected by a normal
spinlock.  In this case, the following code would be used to
find the element and lock it:
<LISTING>
	RC_RDPROTECT();
	xp = xxx_lookup(xxx_address);
	if (xp != NULL) {
		s = p_lock(&(xp->xxx_lock), SPLXXX);
		if (xp->xxx_flags & XXX_DEFDEL) {
			/*
			 * handle case where item is deleted,
			 * possibly by retrying the search.
			 */
		} else {
			/* Do something with xp. */
		}
		v_lock(&(xp->xxx_lock), s);
	}
	RC_RDUNPROTECT();
</LISTING>
In particular, the following code would delete an item:
<LISTING>
	s = p_lock(&xxx_mutex, SPLXXX);
	xp = xxx_lookup(xxx_address);
	if (xp != NULL) {
		(void)p_lock(&(xp->xxx_lock), SPLXXX);
		ASSERT_DEBUG(!(xp->xxx_flags & XXX_DEFDEL),
			     "item deleted without xxx_mutex");
		xxx_delete(xp);
		v_lock(&(xp->xxx_lock), SPLXXX);
	}
	v_lock(&xxx_mutex, s);
</LISTING>
A new structure would be added as follows:
<LISTING>
	s = p_lock(&xxx_mutex, SPLXXX);
	newp = xxx_alloc(NULL, KM_SLEEP);
	/* insert code here to modify newp's fields as appropriate. */
	(void)p_lock(&(newp->xxx_lock), SPLXXX);
	xxx_insert(newp);
	v_lock(&(newp->xxx_lock), SPLXXX);
	v_lock(&xxx_mutex, s);
</LISTING>
Note that it is not necessary to use the read-copy primitives in order to
update a single element in place, as the <TT>xp->xxx_lock</TT> may be used
to guard such updates.
<P>
<H1><A NAME="EXAMPLE OF ELIMINATING TABLE-LOOKUP MUTEX">EXAMPLE OF ELIMINATING \
TABLE-LOOKUP MUTEX</A></H1> <P>
It is fairly common to have a list structure protected by a global
mutex whose individual elements are protected by a per-element
mutex.  In most cases, searches of the list are <EM>much</EM> more common
than are insertions or deletions.
<P>
For example, the following code gives (simplistic) search and
delete functions for a circular doubly-linked list of elements:
<LISTING>
	lock_t hash_lock;

	struct element {
		struct element *next;
		struct element *prev;
		lock_t element_lock;
		int address;
		int data;
		int deleted;  /* used later... */
	};

	struct element head[NHASH];  /* waste a bit of space... */

	#define hashit(x) ((x) % NHASH)

	/*
	 * Find element with specified address, lock it, and return
	 * a pointer to it.  Return NULL if no such address.
	 * Drop the global lock if rlsgbl is set.
	 * If element not found, drop all locks.
	 */

	struct element *
	search(int addr, spl_t *s, int rlsgbl)

	{
		struct element *p;
		struct element *p0;

		p0 = &head[hashit(addr)];
		*s = p_lock(&hash_lock, SPLXXX);
		p = p0->next;
		while (p != p0) {
			if (p->address == addr) {
				(void)p_lock(&(p->element_lock), SPLXXX);
				if (rlsgbl) {
					v_lock(&hash_lock, SPLXXX);
				}
				return (p);
			}
			p = p->next;
		}
		v_lock(&hash_lock, *s);
		return ((struct element *)NULL);
	}

	/*
	 * Delete the specified element from the table.  Element
	 * must be locked, global lock must -not- be held.  Drops
	 * spl to that specified when releasing per-element lock.
	 */

	void
	delete(struct element *p, spl_t s)

	{
		int addr;
		spl_t junkspl;

		ASSERT_DEBUG(!p->deleted, "Element already deleted");

		/* Try to get global lock. */

		if (cp_lock(&hash_lock, SPLXXX) == CPLOCKFAIL) {

			/* Someone else has the global lock.
			 * To avoid deadlock, we must release the
			 * element lock and acquire the locks in
			 * the proper order.  In the meantime, someone
			 * may have deleted the element, so we must
			 * check.  We must do an explicit search, as
			 * we cannot trust our pointer to survive
			 * unless we have one of the locks held.
			 * Don't bother coding inline, as this
			 * should be a rather rare case.  In fact,
			 * if efficiency is a concern, the body
			 * of this if-statement should be coded
			 * out-of-line.
			 */

			addr = p->address;
			v_lock(&(p->element_lock), s);
			if ((p = search(addr, &junkspl, 0)) == NULL) {

				/* Someone else deleted us. */

				return;
			}

			/*
			 * The element is still there and we have
			 * all the locks we need.
			 */
		}

		/*
		 * We now have all needed locks, so
		 * delete the element from the list, release
		 * locks, free up the element, and return.
		 */

		p->next->prev = p->prev;
		p->prev->next = p->next;
		v_lock(&(p->element_lock), SPLXXX);
		v_lock(&hash_lock, s);
		kmem_free(p, sizeof(*p));
		return;
	}
</LISTING>
This code has lots of lock round-trips and has some rather ugly
deadlock-avoidance code.  Compare to the following code which
uses <TT>kmem_deferred_free_embedded</TT>:
<LISTING>
	/*
	 * Find element with specified address, lock it, and return
	 * a pointer to it.  Return NULL if no such address.
	 */

	struct element *
	search(int addr, spl_t *s)

	{
		struct element *p;
		struct element *p0;

		p0 = &head[hashit(addr)];
		RC_RDPROTECT();
		p = p0->next;
		while (p != p0) {
			if ((p->address == addr) &&
			    !p->deleted) {
				*s = p_lock(&(p->element_lock), SPLXXX);
				RC_RDUNPROTECT();
				return (p);
			}
			p = p->next;
		}
		RC_RDUNPROTECT();
		return ((struct element *)NULL);
	}

	/*
	 * Delete the specified element from the table.  Element
	 * must be locked.  Drops spl to that specified when
	 * releasing per-element lock.
	 */

	void
	delete(struct element *p, spl_t s)

	{
		int addr;

		/*
		 * Get global lock.  There is no possible deadlock,
		 * since the global lock is now always acquired
		 * after the per-element lock.
		 */

		(void)p_lock(&hash_lock, SPLXXX);

		/*
		 * Delete the element from the list, release
		 * locks, free up the element, and return.
		 */

		p->deleted = 1;
		p->next->prev = p->prev;
		p->prev->next = p->next;
		v_lock(&(p->element_lock), SPLXXX);
		v_lock(&hash_lock, s);
		kmem_deferred_free_embedded(p,
					    sizeof(*p),
					    NULL);
		return;
	}
</LISTING>
There is one less lock round-trip in <TT>search()</TT>, and the code for
<TT>delete()</TT> is much simpler because the deadlock-avoidance code
is no longer needed: the global lock is now always acquired after
the per-element lock, so there is no possible deadlock.


