
List:       freebsd-bugs
Subject:    Re: misc/164130: broken netisr initialization
From:       "Robert N. M. Watson" <rwatson@FreeBSD.org>
Date:       2012-01-30 10:28:24
Message-ID: A8A57BF5-3EF7-43A3-8106-ED93A82C71F1@FreeBSD.org


On 17 Jan 2012, at 17:41, Коньков Евгений wrote:

> Only netisr3 is loaded.
> And a question: IP works over ethernet. How can you distinguish IP and ether?

netstat -Q is showing you per-protocol (per-layer) processing statistics. An IP
packet arriving via ethernet will typically be counted twice: once for ethernet
input/decapsulation, and once for IP-layer processing. Netisr dispatch serves a
number of purposes, not least preventing excessive stack depth/recursion and
load balancing.
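
To make the double counting concrete, here is a minimal sketch of the hand-off
point (illustrative only, not the actual sys/net/if_ethersubr.c code):

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <net/netisr.h>

    /*
     * After the ethernet layer has parsed and stripped the link-layer
     * header (counted as ethernet work in netstat -Q), it hands the
     * payload to the IP protocol via netisr, where the same packet is
     * counted a second time as IP work.
     */
    static void
    example_ether_demux(struct mbuf *m)
    {

        /* ...ethernet header already stripped... */
        (void)netisr_dispatch(NETISR_IP, m);
    }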

There has been a historic tension between deferred (queued) dispatch to a
separate worker and direct dispatch ("process to completion"). The former
offers more opportunities for parallelism and reduces latency in
interrupt-layer processing. The latter, however, reduces overhead and overall
packet latency by avoiding queueing/scheduling costs, as well as avoiding
packet migration between CPU caches, which reduces cache coherency traffic. Our
general experience is that many common configurations, especially lower-end
systems *and* systems with multi-queue 10gbps cards, prefer direct dispatch.
However, in forwarding scenarios, or ones in which the CPU count significantly
outnumbers the NIC input queue count, queueing to additional workers can
markedly improve performance.
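
If you want to experiment with this on 9.0, the global dispatch policy is also
tunable at run time; assuming the stock sysctl names (check "sysctl -d
net.isr" on your system):

    sysctl net.isr.dispatch=deferred   # always queue to a netisr worker
    sysctl net.isr.dispatch=hybrid     # direct dispatch, queueing when needed
    sysctl net.isr.dispatch=direct     # process to completion in the caller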

In FreeBSD 9.0 we've attempted to improve the vocabulary of expressible
policies in netisr so that we can explore which work best in various scenarios,
giving users more flexibility but also attempting to determine a better
longer-term model. Ideally, as with the VM system, these features would be to
some extent self-tuning, but we don't have enough information and experience to
decide how best to do that yet.
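
To make that concrete: each protocol declares its policy when it registers with
netisr. This sketch paraphrases the IP registration in sys/netinet/ip_input.c
(field set abbreviated):

    #include <net/netisr.h>
    #include <netinet/ip_var.h>      /* ip_input() */

    static struct netisr_handler ip_nh = {
        .nh_name = "ip",
        .nh_handler = ip_input,          /* void ip_input(struct mbuf *) */
        .nh_proto = NETISR_IP,
        .nh_policy = NETISR_POLICY_FLOW, /* preserve per-flow ordering */
    };

    static void
    example_ip_init(void)
    {

        netisr_register(&ip_nh);
    }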

> NETISR_POLICY_FLOW    netisr should maintain flow ordering as defined by
> the mbuf header flow ID field.  If the protocol
> implements nh_m2flow, then netisr will query the
> protocol in the event that the mbuf doesn't have a
> flow ID, falling back on source ordering.
> 
> NETISR_POLICY_CPU     netisr will entirely delegate all work placement
> decisions to the protocol, querying nh_m2cpuid for
> each packet.
> 
> _FLOW: the description says that the cpuid is discovered from the flow.
> _CPU: here the decision to choose a CPU is delegated to the protocol. Maybe
> it would be clearer to name it NETISR_POLICY_PROTO?

The name has to do with the nature of the information returned by the netisr
protocol handler -- in the former case, the protocol returns a flow identifier,
which is used by netisr to calculate an affinity. In the latter case, the
protocol returns a CPU affinity directly.
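
In code terms, with the callback typedefs recalled from sys/net/netisr.h (check
the header for the exact signatures), the two shapes look roughly like this;
the example_* helpers are hypothetical:

    #include <sys/mbuf.h>
    #include <net/netisr.h>

    static u_int example_hash(struct mbuf *m);     /* hypothetical */
    static u_int example_pick_cpu(struct mbuf *m); /* hypothetical */

    /* NETISR_POLICY_FLOW: return the mbuf with a flow ID attached; netisr
     * itself maps the flow ID to a CPU. */
    static struct mbuf *
    example_m2flow(struct mbuf *m, uintptr_t source)
    {

        m->m_pkthdr.flowid = example_hash(m);
        m->m_flags |= M_FLOWID;
        return (m);
    }

    /* NETISR_POLICY_CPU: the protocol chooses the CPU affinity directly. */
    static struct mbuf *
    example_m2cpuid(struct mbuf *m, uintptr_t source, u_int *cpuid)
    {

        *cpuid = example_pick_cpu(m);
        return (m);
    }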

> and BIG QUESTION: why do you allow somebody (flow, proto) to make any
> decisions??? That is wrong: a bad implementation/decision on their part may
> cause packets to be scheduled to only some CPUs. So one CPU will be
> overloaded (0% idle) while the others are free (100% idle).

I think you're confusing policy and mechanism. The above KPIs are about
providing the mechanism to implement a variety of policies. Many of the
policies we are interested in are not yet implemented, or are available only as
patches. Keep in mind that workloads and systems are highly variable, with
variable costs for work dispatch, etc. We run on high-end Intel servers, where
individual CPUs tend to be very powerful but not all that plentiful, but also
on embedded multi-threaded MIPS devices with many threads, each individually
quite weak. Deferred dispatch is a better choice for the latter, where there
are optimised handoff primitives to help avoid queueing overhead, whereas in
the former case you really want NIC-backed work dispatch, which will generally
mean you want direct dispatch with multiple ithreads (one per queue) rather
than multiple netisr threads. Using deferred dispatch in Intel-style
environments is generally unproductive, since high-end configurations will
support multi-queue input already, and the CPUs are quite powerful.


> > * Enforcing ordering limits the opportunity for concurrency, but maintains
> > * the strong ordering requirements found in some protocols, such as TCP.
> TCP does not have strong ordering requirements!!! Maybe you mean UDP?

I think most people would disagree with this. Reordering TCP segments leads to
extremely poor TCP behaviour -- there is an extensive research literature on
this, and maintaining ordering for TCP flows is a critical network stack design
goal.

> To get full concurrency you must assign a new flowid to a free CPU and
> remember the cpuid for that flow.

Stateful assignment of flows to CPUs is of significant interest to us, although
currently we only support hash-based assignment without state. In large part,
that decision is a good one, as multi-queue network cards are highly variable
in the size of the state tables they provide for offloading flow-specific
affinity policies. For example, lower-end 10gbps cards may support state tables
with 32 entries; high-end cards may support state tables with tens of thousands
of entries.
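
For reference, the stateless mapping is essentially one line (paraphrasing
sys/net/netisr.c from memory):

    /* nws_array[] holds the CPU IDs of the started worker threads and
     * nws_count the number of workers; no per-flow state is kept. */
    static u_int
    netisr_default_flow2cpu(u_int flowid)
    {

        return (nws_array[flowid % nws_count]);
    }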

> Just hash the packet flow to the number of threads: net.isr.numthreads
> nws_array[flowid] = hash( flowid, sourceid, ifp->if_index, source )
> if( cpuload( nws_array[flowid] ) > 99 )
>     nws_array[flowid]++;  // queue packet to another CPU
> 
> that would be just ten lines of code instead of 50 in your case.

We support a more complex KPI because we need to support future policies that
are more complex. For example, there are out-of-tree changes that align
TCP-level and netisr-level per-CPU data structures and affinity with NIC RSS
support. The algorithm you've suggested above explicitly introduces reordering
-- once a busy flow is bumped to another CPU, a later packet in that flow can
be processed before an earlier one still queued on the original CPU -- which
would significantly damage network performance, even though it appears to
balance CPU load better.

> Also notice you have:
> /*
> * Utility routines for protocols that implement their own mapping of flows
> * to CPUs.
> */
> u_int
> netisr_get_cpucount(void)
> {
> 
> return (nws_count);
> }
> 
> but you do not use it! That breaks encapsulation.

This is a public symbol for use outside of the netisr framework -- for example,
in the uncommitted RSS code.
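
A hypothetical out-of-tree consumer might use it along these lines (all of the
example_* names are invented for illustration):

    #include <sys/param.h>
    #include <net/netisr.h>

    static void *example_pcpu_state[MAXCPU];    /* hypothetical */
    static void *example_alloc_state(void);     /* hypothetical */

    static void
    example_rss_init(void)
    {
        u_int i, n;

        /* Size per-worker state using the public KPI rather than poking
         * at netisr's internal nws_count directly. */
        n = netisr_get_cpucount();
        for (i = 0; i < n; i++)
            example_pcpu_state[i] = example_alloc_state();
    }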

> Also I want to ask you: please help me, where can I find documentation
> about netisr scheduling and the full packet flow through the kernel:
> packet input -> kernel -> packet output
> but more a description of what is going on with a packet while it is
> passing through a router.

Unfortunately, this code is currently largely self-documenting. The Stevens
books are getting quite outdated, as are McKusick/Neville-Neil -- however, they
at least offer structural guides which may be of use to you. Refreshes of these
books would be extremely helpful.

Robert


